Data‐driven analysis and predictive modeling on COVID‐19

Sonam Sharma; Izzat Alsmadi; Rami S Alkhawaldeh; Bilal Al‐Ahmad

doi:10.1002/cpe.7390

2022 Nov 11;34(28):e7390. doi: 10.1002/cpe.7390

PMCID: PMC9877906 PMID: 36718458

Summary

The coronavirus disease (COVID‐19), which emerged in China in 2019, has spread rapidly across the world and caused millions of cases. This paper presents an approach for identifying the relative impact of COVID‐19 on a specific gender, estimating the mortality rate in specific age groups, and investigating the safety measures adopted by each country and their impact on the virus growth rate. Our study proposes data‐driven analysis and predictive modeling by investigating three aspects of the pandemic (gender of patients, global growth rate, and social distancing). Several machine learning and ensemble models are used and compared to obtain the best accuracy. Experiments are conducted on three large public datasets. The motivation of this study is to propose an analytical machine learning based model to explore three significant aspects of the COVID‐19 pandemic: gender, global growth rate, and social distancing. The proposed analytical model includes classic classifiers and distinctive ensemble methods such as bagging, feature‐based ensembles, voting, and stacking. The results show superior prediction performance compared with related approaches.

Keywords: COVID‐19, gender of patients, global growth rate, predictive modeling, social distancing

1. INTRODUCTION

Coronaviruses ¹ are a large family of viruses known to cause illnesses ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). A novel coronavirus (CoV) is a new species of coronavirus that has not been previously identified in humans; the one discussed here was discovered in 2019. The 2019 novel coronavirus (COVID‐19) pandemic emerged in Wuhan, China, in December 2019 and has become a serious public health problem worldwide. The virus has been linked to a large seafood and animal market and is thought to have spread from animals to humans. It causes an outbreak of respiratory illness, with symptoms that may not appear until at least 15 days later. It is still spreading widely, and new variants or strains are discovered occasionally. It can spread easily from person to person, although exactly how easily or sustainably remains unclear, as studies periodically observe different behaviors of the virus. Previous investigations have shown that the SARS‐CoV virus was passed from musk cats to humans and the MERS‐CoV virus from dromedary camels to humans. The COVID‐19 virus is presumed to have been transmitted from bats to humans.

Rapid respiratory transmission is one of the major reasons for the spread of this pandemic. Signs of infection include respiratory symptoms, fever, cough, and dyspnea. In more serious cases, the infection can cause pneumonia, severe acute respiratory syndrome, septic shock, multi‐organ failure, and death. It has also been reported that men are infected more often than women and that the disease most severely affects adults with known chronic diseases. People have been advised to take preventive measures such as frequently cleaning their hands with soap or sanitizer, keeping a safe distance from anyone who is sick, covering the nose and mouth when coughing or sneezing, and avoiding unnecessary visits to medical facilities.

Several studies ² , ³ , ⁴ , ⁵ investigated how gender and sex could affect the distribution of the disease and the death ratio across different countries. Their results indicated that males have a higher death rate than females. In addition, many studies ⁶ , ⁷ , ⁸ investigated the growth rate of the COVID‐19 pandemic at different geographical levels, such as states, cities, countries, and continents, which helps to predict recoveries and deaths. Moreover, social distancing is a critical aspect of the COVID‐19 pandemic that affects the global growth rate. Consequently, other studies ⁹ , ¹⁰ , ¹¹ investigated the relationship between social distancing and the COVID‐19 death rate to capture the influence of social distancing on the spread of COVID‐19 in the ten most infected countries: USA, Spain, Italy, UK, France, Germany, Russia, Turkey, Iran, and China. The main challenge in the aforementioned studies is to explore all the critical aspects of the COVID‐19 pandemic in order to understand and predict the growth rate and the death ratios. Another study ¹² introduced multiple real‐time measurements of the uncertain epidemiological characteristics of COVID‐19 infections.

In addition, other studies ¹³ , ¹⁴ , ¹⁵ , ¹⁶ used different machine learning and deep learning techniques to propose various prediction models. One study ¹³ introduced neutrosophic soft set decision making for stock trend analysis, and another ¹⁴ constructed a novel geo‐demographic prediction model. A review of intrusion detection using machine learning techniques was also presented. ¹⁵ Moreover, a deep learning approach was suggested ¹⁶ to improve image‐based detection of automobile accidents.

In this study, different aspects of COVID‐19 are studied through the analysis of relevant public datasets. The research involves identifying the relative impact of COVID‐19 on a specific gender, the mortality rate in specific age groups, the safety measures adopted by each country and their impact on the virus growth rate, and the global impact of social distancing on mobility.

This paper has three hypotheses:

(A) Hypothesis 1. Investigate whether gender has a relative impact on COVID‐19 reported cases.
(B) Hypothesis 2. Explore the impact of the global growth rate of COVID‐19 on the death ratio.
(C) Hypothesis 3. Study the impact of social distancing measures on the growth of COVID‐19 and on the death ratio.

The motivation of this study is to propose an analytical machine learning based model to explore three significant aspects of the COVID‐19 pandemic: gender, global growth rate, and social distancing. The proposed analytical model includes classic classifiers and distinctive ensemble methods such as bagging, feature‐based ensembles, voting, and stacking. It is implemented with different Python libraries, Rattle, RStudio, Anaconda, and Jupyter Notebook. The study shows superior prediction performance compared with related approaches and classical machine learning approaches.

2. LITERATURE REVIEW

Studying various aspects of the COVID‐19 pandemic is important for predicting the death ratio and growth rate, as well as for exploring the impact of temporal and geographical factors on the statistics of this pandemic. First, several studies investigated how gender could affect the distribution of the disease and the death ratio. The study ² aimed to compare severity and mortality between male and female patients with COVID‐19 and SARS. The data were extracted from 43 hospital patients, a public data set of 37 patients who died of COVID‐19 and 1019 patients who survived in China, and 524 patients with SARS, including 139 deaths, from Beijing in early 2003. Also, a study ³ developed a global rapid gender analysis of COVID‐19 using a gender analysis toolkit. Likewise, the study ⁴ applied a binomial proportion test, with the null hypothesis that the proportion is 50% for males and females in both infection rate and mortality, to investigate sex and gender bias in COVID‐19 infections and deaths across 75 countries, including the USA. Another study ⁵ reconstructed COVID‐19 infection rates by age and sex from officially reported data for ten European countries (Belgium, Czechia, Denmark, Germany, Italy, Norway, Portugal, Spain, Switzerland, and the United Kingdom). The analysis reveals that the overall gender equality in total infections results from a gender pattern that varies by age in each country: in some age groups, women diagnosed with COVID‐19 substantially outnumber infected men, while in others there are more confirmed cases among men than among women.

The preliminary study ² compares severity and mortality between male and female patients with COVID‐19. It used a data set of 43 patients who were treated at Wuhan Union Hospital by the medical team of Beijing Tongren Hospital from January 29, 2020 to February 15, 2020.
In addition, the authors used another public data set covering the first 37 patients who died of COVID‐19 and 1019 patients who survived. They compared the two groups using the t‐test, the Mann‐Whitney $U$‐test, and the chi‐square test. Kaplan‐Meier survival curves and the log‐rank test were used to compare survival rates between males and females. In the first data set, which comprised 22 male and 21 female patients, fever (95.3%) and cough (65.1%) were the most common symptoms. The chi‐square test for trend indicated that men's cases of COVID‐19 tended to be more serious than women's (P = 0.035) according to the clinical classification of severity. Similarly, in the second data set of 37 patients, fever (86.5%) and cough (67.6%) were the most common symptoms. Regarding age, the results of the first data set revealed that older age was associated with higher severity and mortality in patients with COVID‐19. Age was comparable between men and women in all data sets. In the second data set, the number of men who died from COVID‐19 was 2.4 times that of women (70.3% vs. 29.7%, P = 0.016). Hence, it was concluded that although men and women have the same prevalence of COVID‐19, men are at higher risk of worse outcomes and death regardless of age.

Another study ¹⁷ summarized deaths by sex, along with cases and deaths disaggregated by both age and sex (per 100,000 people), over 69 entries. The number of deaths for each gender varies among countries, but in all countries most people dying from COVID‐19 are men, indicating that men have a higher death rate than women.
Moreover, the study ⁴ investigated the statistical significance of gender bias in COVID‐19 infections and deaths across 75 selected countries, with a particular focus on the USA. Its findings show that the differential effect of gender on death counts in the US is statistically significant, with reported $p$‐values < 0.05. In the oldest US population (85+ years), the death rate due to the virus is higher for females. Monthly deaths in the US peaked during March–April 2020 for both males and females. Additionally, Table 1 shows the COVID‐19 fatality rate by age for all cases. The study ³ also clarified how gender affects COVID‐19 patients across different states; for patients over 80 years of age, the case fatality rate was as high as 21.9%. COVID‐19 infects people of all ages, although the statistics showed greater risks for people over 60 years of age and for those with underlying medical conditions. The gender‐disaggregated data also showed that men are slightly more at risk than women with regard to morbidity and that, at 51%, men made up a slight majority of the infected cases. The study further reports the COVID‐19 fatality rate by gender for confirmed cases and for all cases.
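
The binomial proportion test against a 50% share mentioned above can be illustrated with the death counts of the second data set. The sketch below is a standard‐library illustration, not the cited authors' procedure: the counts of 26 men and 11 women are inferred here from the quoted 70.3% and 29.7% of 37 deaths, and the function name is hypothetical. Because the cited study reports its P value from a different test, the number printed here need not match it.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial test p-value.

    Sums the probabilities of every outcome that is no more likely than the
    observed one under the null hypothesis (share of successes equals p0).
    """
    def pmf(k: int) -> float:
        return comb(trials, k) * p0**k * (1 - p0) ** (trials - k)

    observed = pmf(successes)
    # A small relative tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Assumed counts, derived from 70.3% and 29.7% of the 37 reported deaths.
male_deaths, female_deaths = 26, 11
total_deaths = male_deaths + female_deaths

p_value = binomial_two_sided_p(male_deaths, total_deaths)
print(f"male share = {male_deaths / total_deaths:.3f}, two-sided p = {p_value:.4f}")
```

A p‐value below 0.05 would reject the hypothesis that deaths are split evenly between men and women, which is the sense in which the studies above call the gender difference statistically significant.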

TABLE 1.

COVID‐19 fatality rate by age for all cases

Age	Death rate (all cases)
80+	14.8%
70–79	8%
60–69	3.6%
50–59	1.3%
40–49	0.4%
30–39	0.2%
20–29	0.2%
10–19	0.2%
0–9	0%

Feature	Correlation ratio
Age	0.198
Geoposition	0.104
City	0.064
TravelHistoryLocation	0.031
Symptoms	0.027
livesinWuhan	0.016

Feature	Correlation
Target_Growth	0.420
TARGETED_POP_GROUP_NO	0.081
CATEGORY_social distancing	0.079
MEASURE_school closure	0.071
CATEGORY_movement restrictions	0.067
MEASURE_international flight suspension	0.059
MEASURE_limit public gatherings	0.058
MEASURE_border closure	0.048
TARGETED_POP_GROUP_YES	0.046

Classifier	Accuracy
RF	78.6%
KNN	55.9%
LR	76.7%
GB	77.3%
SVC	57%
GNB	75.1%
XGB	79.9%
DT	76.7%

Max_Sample	Ensemble classifier	Accuracy
0.1	XGB	78.84%
0.2	DT_Gini	79.37%
0.3	DT_Entropy	79.60%
0.4	AB	80.27%
0.5	DT_Entropy	80.13%
0.6	DT_Entropy	80.37%
0.7	AB	80.41%
0.8	AB	80.37%
0.9	AB	81.46%
1.0	AB	80.61%
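
The Max_Sample column in the table above appears to be the fraction of training rows drawn for each bagged learner; that reading is an assumption, borrowed from the `max_samples` parameter of scikit‐learn's `BaggingClassifier`. The following is a minimal standard‐library sketch of the idea, using hypothetical one‐feature threshold stumps and synthetic data rather than the paper's classifiers or datasets.

```python
import random
from collections import Counter


def fit_stump(values, labels):
    """Fit a one-feature threshold rule that minimises training errors."""
    best = None
    for threshold in sorted(set(values)):
        for above_label in (0, 1):
            preds = [above_label if v >= threshold else 1 - above_label for v in values]
            errors = sum(p != y for p, y in zip(preds, labels))
            if best is None or errors < best[0]:
                best = (errors, threshold, above_label)
    _, threshold, above_label = best
    return lambda v: above_label if v >= threshold else 1 - above_label


def fit_bagging(values, labels, n_estimators=25, max_samples=0.5, seed=0):
    """Train stumps on bootstrap samples and combine them by majority vote."""
    rng = random.Random(seed)
    n_draw = max(1, int(max_samples * len(values)))
    models = []
    for _ in range(n_estimators):
        # Bootstrap: draw row indices with replacement.
        idx = [rng.randrange(len(values)) for _ in range(n_draw)]
        models.append(fit_stump([values[i] for i in idx], [labels[i] for i in idx]))
    return lambda v: Counter(m(v) for m in models).most_common(1)[0][0]


# Synthetic data: label 1 when the feature is at least 60, with 10% label noise.
data_rng = random.Random(42)
ages = [data_rng.uniform(0, 90) for _ in range(200)]
labels = [int(a >= 60) if data_rng.random() > 0.1 else int(a < 60) for a in ages]

for frac in (0.1, 0.5, 1.0):
    model = fit_bagging(ages, labels, max_samples=frac)
    accuracy = sum(model(a) == y for a, y in zip(ages, labels)) / len(ages)
    print(f"max_samples={frac:.1f}  training accuracy={accuracy:.3f}")
```

Smaller fractions make the individual learners more diverse but less well fitted, which is one plausible reason the accuracy in the table changes with Max_Sample.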

Classifier	ML	Bagged	Ensemble
RF	80.06%	80.49%	80.37%
KNN	56.41%	62.98%	58.74%
GB	78.08%	78.85%	78.66%
LR	77.27%	77.13%	77.56%
SVC	57.88%	57.88%	57.89%
XGB	80.42%	80.56%	81.6%
NB	74.61%	74.94%	76.66%
DT	78.51%	80.61%	79.28%
DT_Gini	79.37%	81.47%	79.71%
AB	80.99%	82.04%	90%

Classifier	ML	Bagged	Ensemble
RF	89%	90%	90%
KNN	57%	66%	61%
GB	88%	87%	84%
LR	84%	84%	88%
SVC	71%	50%	90%
XGB	91%	90%	75%
NB	80%	90%	83%
DT_Entropy	80%	90%	81%
DT_Gini	89%	90%	81%
AB	89%	90%	90%

Classifier	Precision (ML)	Precision (Ensemble)	Recall (ML)	Recall (Ensemble)
RF	79.86%	80.55%	79.75%	80.37%
KNN	57.47%	59.63%	55.60%	58.74%
GB	80.96%	81.54%	78.13%	78.66%
LR	80.43%	83.14%	77.08%	77.56%
XGB	81.74%	83.44%	80.47%	81.61%
NB	79.94%	82.91%	74.61%	76.66%
DT_Entropy	78.92%	79.22%	78.94%	79.28%
DT_Gini	78.42%	79.65%	78.47%	79.70%

Criteria	Previous approaches	Proposed approach
Models	Statistical analysis (Reference 2)	ML classifiers, bagged and feature‐related ensembles
Training Data	Initial 1056 patients in Wuhan; day‐to‐day global gender‐disaggregated data	Data of 17,777 patients (global)
Features base	Symptoms, chronic disease, and age	Age, symptoms, chronic disease, travel history, and Geo position
Evaluation base	Growth rate in past 4 days	Growth rate every day
Target	Death/Survival rate in specific gender and different age groups	COVID‐19 affected rate in specific gender
Result	More cases in men than women and maximum death rate in (age > 80)	More cases in men than women with underlying parameters

Classifier	Accuracy	AUC
RF	72.96%	91%
RF1	72.29%	92%
KNN	78.64%	93%
KNN1	76.12%	92%
GB	78.19%	93%
GB1	88%	97%
LR	61.18%	85%
SVC	71.35%	87%
XGB	87.95%	86%
XGB1	79%	86%
DT_Entropy	84.40%	88%
DT_Gini	84.29%	88%

Classifier	Accuracy
Voting_Ensemble 1 (GB, DT, RF)	84.23%
Voting_Ensemble 2 (GB1, DT_Gini, RF1)	62.98%
Voting_Ensemble 3 (XGB, GB, RF)	83.34%
Voting_Ensemble 4 (XGB1, GB1, RF1)	84.92%
Voting_Ensemble 5 (DT, RF, KNN)	82.95%
Voting_Ensemble 6 (DT_Gini, RF1, KNN1)	83.29%
Voting_Ensemble 7 (XGB, GB1, RF1)	87.17%
Voting_Ensemble 8 (XGB, GB, DT)	84.84%
Voting_Ensemble 9 (XGB1, GB1, DT_Gini)	88.08%
Voting_Ensemble 10 (XGB, GB1, hard)	82.03%
Voting_Ensemble 11 (XGB, GB1, soft)	87.83%
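
The last two rows above differ only in the voting rule, hard versus soft. As a minimal sketch of that difference (the function names and the probabilities are hypothetical, not taken from the paper's models), hard voting counts each classifier's predicted label, whereas soft voting averages the predicted class probabilities:

```python
from collections import Counter


def hard_vote(probabilities):
    """Majority vote over each classifier's most probable class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probabilities):
    """Class with the highest average predicted probability."""
    n_classes = len(probabilities[0])
    means = [sum(p[c] for p in probabilities) / len(probabilities) for c in range(n_classes)]
    return max(range(n_classes), key=means.__getitem__)


# Hypothetical [P(class 0), P(class 1)] from three classifiers for one sample.
probs = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]

print("hard vote:", hard_vote(probs))  # two of three classifiers prefer class 0
print("soft vote:", soft_vote(probs))  # mean P(class 1) = 0.60, so class 1 wins
```

Soft voting lets one confident classifier outweigh two lukewarm ones, so the two rules can disagree on the same inputs. With only two base classifiers, as in the rows above, hard voting can also tie, and the tie‐breaking rule then matters.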

Source	Model	Method
Auquan Data Science	Auquan	SEIR
GIT	GT_CHHS	Agent‐based
Iowa State University	ISU	spatiotemporal
North Eastern University	MOBS	SLIR

Model	Accuracy	AUC
Stacked_Ensemble1 (GB, DT, RF)	84.12%	95%
Stacked_Ensemble2 (XB1, GB1, RF)	83.12%	95%
Stacked_Ensemble3 (XB, GB1, RF)	83.73%	95%
Stacked_Ensemble4 (XB, GB1, DT)	83.43%	95%
Stacked_Ensemble5 (DT, RF, KNN)	84.76%	95%
Stacked_Ensemble6 (RF, XGB, GB, DT, KNN)	84.09%	91%
Stacked_Ensemble7 (RF1, XGB1, GB1, DT1, KNN)	84.06%	91%
Stacked_Ensemble8 (GB1, DT[LR])	83.65%	95%
Stacked_Ensemble9 (GB1, DT[XGB])	83.70%	91%
Stacked_Ensemble10 (XGB, GB1[LR])	88.05%	91%
Stacked_Ensemble11 (XGB, GB1[XGB])	87.97%	91%
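
The bracketed names in the rows above (for example `[LR]`) presumably denote the meta‐learner that combines the base classifiers; this reading is an assumption. The sketch below shows a hold‐out variant of stacking with the standard library only. The base learners (one nearest‐centroid rule per feature), the lookup‐table meta‐learner, and the synthetic data are all hypothetical stand‐ins for the paper's models and datasets.

```python
import random
from collections import Counter, defaultdict


def fit_centroid(xs, ys, feature):
    """Base learner: predict the class whose mean on one feature is closest."""
    means = {}
    for c in (0, 1):
        vals = [x[feature] for x, y in zip(xs, ys) if y == c]
        means[c] = sum(vals) / len(vals)
    return lambda x: min(means, key=lambda c: abs(x[feature] - means[c]))


def fit_meta(base_outputs, ys):
    """Meta-learner: majority label seen for each combination of base predictions."""
    table = defaultdict(Counter)
    for outputs, y in zip(base_outputs, ys):
        table[outputs][y] += 1
    default = Counter(ys).most_common(1)[0][0]
    return lambda outputs: table[outputs].most_common(1)[0][0] if outputs in table else default


def fit_stacking(xs, ys, n_features, seed=0):
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    base_idx, meta_idx = idx[:half], idx[half:]
    # Level 0: one base learner per feature, fitted on the first half only.
    base_xs = [xs[i] for i in base_idx]
    base_ys = [ys[i] for i in base_idx]
    bases = [fit_centroid(base_xs, base_ys, f) for f in range(n_features)]
    # Level 1: the meta-learner is fitted on base predictions for unseen rows.
    meta_inputs = [tuple(b(xs[i]) for b in bases) for i in meta_idx]
    meta = fit_meta(meta_inputs, [ys[i] for i in meta_idx])
    return lambda x: meta(tuple(b(x) for b in bases))


# Synthetic two-feature data: label 1 when the features sum to more than 1.
data_rng = random.Random(1)
xs = [(data_rng.random(), data_rng.random()) for _ in range(300)]
ys = [int(a + b > 1) for a, b in xs]

stacked = fit_stacking(xs, ys, n_features=2)
accuracy = sum(stacked(x) == y for x, y in zip(xs, ys)) / len(xs)
print(f"stacked accuracy on the synthetic data = {accuracy:.3f}")
```

Fitting the meta‐learner on rows the base learners never saw keeps it from simply trusting overfitted base predictions; cross‐validated stacking applies the same idea fold by fold.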

Classifier	Accuracy
Adaboost	82.04%
XGBoost	81.6%
Decision Tree	81.47%

Classifier	Accuracy
Voting_Ensemble10	88.09%
Gradient Boosting	88%
XGBoost	87.95%

Criteria	Previous approaches	Proposed approach
Models	Report ^a profit model forecasts	ML classifiers and voting ensemble method
Data Trained	Till 5th April and made prediction for next month	Used data till 31 May and made predictions on the past data
Evaluation Base	Growth rate in past 4 days	Growth rate every day
Focused Area	Global, Country, State	Country
Target	Mortality Rate/Growth Rate‐mobility, social distancing	Growth rate
Prediction Accuracy	Based on predictions growth rate increased constantly as analyzed	88.09%

Classifier	Accuracy	AUC
RF	83.65%	96%
KNN	81.26%	94%
LR	65.75%	95%
GB	82.15%	85%
SVC	64.75%	87%
XGB	87.83%	91%
DT_Entropy	83.56%	88%
DT_Gini	83.76%	88%

Criteria	Previous approaches	Proposed approach
Model	Reference 11: SIRNET, SEIR, SLIR	ML classifiers, stacking ensemble
Data Trained	Current data to train and 4 weeks future prediction	Used data till 31 May and made predictions on the past data
Analyzed data	Global, Country, State	Country
Focused Area	Growth rate with social distancing, Death rate/Growth mobility, social distancing	Impact of social distancing on growth rate
Accuracy	75%	88.6%

Classifier	Accuracy
Stacked_Ensemble10	88.06%
Stacked_Ensemble11	87.97%
XGBoost	87.84%
