Summary
The coronavirus disease (COVID‐19), which emerged in China in 2019, has spread rapidly to nearly every country and has resulted in millions of cases worldwide. This paper proposes an approach that identifies the relative impact of COVID‐19 on a specific gender, examines the mortality rate in specific age groups, and investigates the safety measures adopted by each country and their impact on the virus growth rate. Our study proposes data‐driven analysis and prediction modeling by investigating three aspects of the pandemic (gender of patients, global growth rate, and social distancing). Several machine learning and ensemble models have been used and compared to obtain the best accuracy. Experiments have been conducted on three large public datasets. The proposed analytical model includes classic classifiers and distinctive ensemble methods such as bagging, feature‐based ensemble, voting, and stacking. The results show superior prediction performance compared with related approaches.
Keywords: COVID‐19, gender of patients, global growth rate, predictive modeling, social distancing
1. INTRODUCTION
Coronaviruses 1 are a large family of viruses that are known to cause illness ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). A novel coronavirus (CoV) is a new strain of coronavirus that has not been previously identified in humans; the one considered here was discovered in 2019. The 2019 novel coronavirus (COVID‐19) pandemic appeared in Wuhan, China in December 2019 and has become a serious public health problem worldwide. The virus has been linked to a large seafood and animal market and is thought to have spread from animals to humans. It causes an outbreak of respiratory illness, with symptoms that may not appear until about two weeks after infection. It is still spreading widely, and new variants or strains are discovered occasionally. It spreads from person to person, although how easily and sustainably it does so remains unclear, as studies continue to observe different behaviors of the virus. Previous investigations have shown that the SARS‐CoV virus was passed on from civet cats to humans, and the MERS‐CoV virus from dromedary camels to humans. The COVID‐19 virus is presumed to have been transmitted from bats to humans.
Rapid respiratory transmission of the disease is one of the major reasons for the spread of this pandemic. Signs of infection include respiratory symptoms, fever, cough, and dyspnea. In more serious cases, the infection can cause pneumonia, severe acute respiratory syndrome, septic shock, multi‐organ failure, and death. It has also been reported that men are infected more often than women and that the disease severely affects adults with known chronic diseases. People have been advised to take preventive measures such as frequently cleaning their hands with soap or sanitizer, maintaining a safe distance from anyone who is sick, covering the nose and mouth while coughing or sneezing, and avoiding unneeded visits to medical facilities.
Several studies 2 , 3 , 4 , 5 investigated how gender and sex could impact disease distribution and the death ratio across different countries. Results indicated that males have a higher death rate than females. In addition, many studies 6 , 7 , 8 investigated the growth rate of the COVID‐19 pandemic at different geographical levels, such as states, cities, countries, and continents, which helps to predict recoveries and deaths. Moreover, social distancing is a critical aspect of the COVID‐19 pandemic that affects the global growth rate. Consequently, other research studies 9 , 10 , 11 investigated the impact of social distancing on the growth rate of the COVID‐19 death ratio. These studies investigated the relationship between social distancing and the COVID‐19 death rate to capture the influence of social distancing on the spread of COVID‐19 in the ten most infected countries: USA, Spain, Italy, UK, France, Germany, Russia, Turkey, Iran, and China. The main challenge in the aforementioned studies is to explore all the critical aspects of the COVID‐19 pandemic in order to understand and predict the growth rate and the death ratios. Another study 12 introduced multiple real‐time measurements of the uncertain epidemiological characteristics of COVID‐19 infections.
In addition, other studies 13 , 14 , 15 , 16 used different machine learning and deep learning techniques to propose various prediction models. The study 13 introduced neutrosophic soft set decision making for stock trend analysis. Also, the study 14 constructed a novel geo‐demographic prediction model. A review of intrusion detection using machine learning techniques was presented in the work. 15 Moreover, a deep learning approach was suggested by the study 16 to improve image‐based detection of automobile accidents.
In this study, different aspects of COVID‐19 are studied through the analysis of relevant public datasets. The research involves identifying the relative impact of COVID‐19 on a specific gender, the mortality rate in specific age groups, the safety measures adopted by each country and their impact on the virus growth rate, and the global impact of social distancing on mobility.
This paper has three hypotheses:
- (A) Hypothesis 1. Investigate whether gender has a relative impact on COVID‐19 reported cases.
- (B) Hypothesis 2. Explore the impact of the global growth rate of COVID‐19 on the death ratio.
- (C) Hypothesis 3. Study the impact of social distancing measures on the growth of COVID‐19 and on the death ratio.
The motivation of this study is to propose an analytical machine learning based model to explore three significant aspects of the COVID‐19 pandemic: gender, global growth rate, and social distancing. The proposed analytical model includes classic classifiers and distinctive ensemble methods such as bagging, feature‐based ensemble, voting, and stacking. It also uses different Python libraries, Rattle, RStudio, Anaconda, and Jupyter Notebook. This study shows superior prediction performance compared with related approaches and classical machine learning approaches.
2. LITERATURE REVIEW
Studying various aspects of the COVID‐19 pandemic is very important in order to predict the death ratio and growth rate, as well as to explore the impact of temporal and geographical factors on the statistics of this critical pandemic. Firstly, several studies investigated how gender could impact disease distribution and the death ratio. The study 2 aimed to compare the severity and mortality between male and female patients with COVID‐19 and SARS. The data were drawn from 43 hospital patients, a public data set of 37 patients who died of COVID‐19 and 1019 patients who survived in China, and 524 patients with SARS, including 139 deaths, from Beijing in early 2003. Also, a study 3 developed a global rapid gender analysis of COVID‐19 using a gender analysis toolkit. Likewise, the study 4 used a binomial proportion test, with a null proportion of 50% for males and females in infection rate and mortality, to investigate the impact of sex and gender bias in COVID‐19 infections and deaths across 75 countries, including the USA. Another study 5 reconstructed COVID‐19 infection rates by age and sex from officially reported data for ten European countries (Belgium, Czechia, Denmark, Germany, Italy, Norway, Portugal, Spain, Switzerland, and the United Kingdom). The analysis reveals that the overall gender equality in total infections results from a gender pattern that varies by age in each country. Based on their outcomes, in some age groups women diagnosed with COVID‐19 substantially outnumber infected men, whereas in others there are more confirmed cases among men than among women. In more detail, the preliminary study 2 compares the severity and mortality between male and female patients with COVID‐19. It used a data set of 43 patients who were treated at Wuhan Union Hospital by the medical team of Beijing Tongren Hospital from January 29, 2020 to February 15, 2020.
In addition, they used another public data set covering the first 37 patients who died of COVID‐19 and 1019 patients who survived. They compared the two groups using the t‐test, the Mann‐Whitney U test, and the chi‐square test. In addition, Kaplan‐Meier survival curves and the log‐rank test were used to compare survival rates between males and females. In the first dataset, which comprised 22 male and 21 female patients, fever (95.3%) and cough (65.1%) were the most common symptoms. The chi‐square test for trend indicated that men's cases of COVID‐19 tended to be more serious than women's (P = 0.035) according to the clinical classification of severity. Similarly, in the second dataset of 37 patients, fever (86.5%) and cough (67.6%) were the most common symptoms of COVID‐19. Considering the age factor, the results of the first dataset revealed that older age was associated with higher severity and mortality in patients with COVID‐19. Age was comparable between men and women in all data sets. In the second data set, the number of men who died from COVID‐19 was 2.4 times that of women (70.3% vs. 29.7%, P = 0.016). Hence, it was concluded that even though men and women have the same prevalence of COVID‐19, men are more at risk of worse outcomes and death regardless of their age. Another study 17 summarized the deaths by sex, along with the cases and deaths disaggregated by both age and sex (per 100,000 people), over 69 entries. The number of deaths for each gender varies among countries. It was concluded that in all countries most people dying from COVID‐19 are men, that is, men have a higher death rate than women.
Moreover, the study 4 aimed to investigate the statistical significance of gender bias in COVID‐19 infections and deaths across 75 selected countries, with a specific focus on the USA. The findings show that the differential effect of gender on death counts in the US is statistically significant, with reported p‐values < 0.05. In the oldest US population (85+ years), the female death rate due to the virus is higher. Monthly deaths in the US peaked during March‐April 2020 for both males and females. Additionally, Table 1 shows the COVID‐19 fatality rate by age for all cases. The study 3 clarified how gender affects COVID‐19 patients; for patients over 80 years of age across different states, the case fatality rate was as high as 21.9%. COVID‐19 infects people of all ages, although the statistics showed greater risks for people over 60 years of age, as well as for those with underlying medical conditions. Also, the gender‐disaggregated data showed that men are slightly more at risk of morbidity than women: at 51%, men made up a slight majority of the infected cases.
TABLE 1.
COVID‐19 fatality rate by age for all cases
| Age | Death rate (all cases) |
|---|---|
| 80+ | 14.8% |
| 70–79 | 8% |
| 60–69 | 3.6% |
| 50–59 | 1.3% |
| 40–49 | 0.4% |
| 30–39 | 0.2% |
| 20–29 | 0.2% |
| 10–19 | 0.2% |
| 0–9 | 0% |
Secondly, many studies investigated the growth rate of the COVID‐19 pandemic at different geographical levels. The research work 6 clarifies how the growth rates of COVID‐19 vary among geographical areas such as states, cities, countries, and continents. Additionally, the study 7 investigated the weekly rate of increase in the number of COVID‐19 infections in relation to variables such as weather and time, building on experimental research on the spread of the SARS virus and COVID‐19; it shows how human population structure and the inorganic environment affect the growth rate of infections. Also, the study 8 introduced a simple iterative forecasting approach that helps to predict recoveries and deaths. The forecast shows the daily growth rates of COVID‐19 and indicates what an acceptable growth rate should be (e.g., <5%). Considering different aspects of the COVID‐19 pandemic, the study 6 used statistical techniques to analyze the various shapes of the local growth rate of COVID‐19 and clustered them into categories based on their shape values. The study applied this methodology to the daily occurrence of the COVID‐19 pandemic at two geographical scales (state level in the USA and country level within Europe) from February to May 2020.
Thirdly, different studies investigated the impact of social distancing on the growth rate of the COVID‐19 death ratio. Social distancing is considered one of the essential factors that critically affect the rate of increase of the COVID‐19 virus, as stated by the study. 9 Moreover, the study 10 investigated the relationship between social distancing and the COVID‐19 death rate to capture the influence of social distancing on the spread of COVID‐19 in the ten most infected countries (i.e., USA, Spain, Italy, UK, France, Germany, Russia, Turkey, Iran, and China). Based on their findings, the daily growth rates in the UK and USA were higher than in other countries, as they have the highest social distancing. Moreover, the research study 11 explored the relationship of social distancing and mobility with COVID‐19. The study proposed a new hybrid machine learning model called SIRNET, which aims to predict the spread of the COVID‐19 pandemic. SIRNET is an integrated approach combining disease modeling, physical science, and machine learning. The research work employed mobility data metrics selected using four key population modeling criteria: temporal coverage, geographical coverage, contemporaneousness, and representativeness.
3. COVID‐19 KNOWLEDGE PREDICTION MODELS: DESIGN AND APPROACHES
Most previous studies used statistical approaches, whereas this study introduces an analytical approach that explores certain significant aspects of COVID‐19 by applying different classical and ensemble machine learning models. This paper attempts to evaluate three hypotheses. The proposed approach has six phases, which are aligned with the phases of typical data analytics projects, as illustrated in Figure 1.
First, the datasets have been collected.
Second, the collected datasets have been preprocessed using different techniques.
Third, several feature reduction techniques are used to obtain the most relevant features.
Fourth, nine different ML classifiers and some ensemble classifiers have been employed to build prediction models.
Fifth, to evaluate the prediction performance, various evaluation measures are used such as AUC, 18 accuracy, 19 precision and recall. 20
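As a minimal illustration of the threshold-based measures named in the fifth phase, the following dependency-free sketch computes accuracy, precision, and recall from confusion-matrix counts for a binary problem. The function names and the toy labels are assumptions made for illustration; they are not taken from the study's code or datasets.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all samples that were predicted correctly."""
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Share of true positives that were found."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels (illustrative only, not from the study's datasets).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```

In practice the same quantities are available from standard libraries; the sketch only makes the definitions explicit.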
FIGURE 1.
Data analytics phases
3.1. Data preprocessing
Data preprocessing is an essential task that prepares the data for the classification step. It improves the quality of the data and extracts useful information to support model training. The preprocessing steps include cleaning the dataset by removing or replacing “nan”, “inf”, and other unwanted values; scaling features; using a label encoder to encode the categorical columns; splitting features; combining CSV files; adding a multi‐label target column; and binarizing the dataset based on certain important features.
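Three of the listed steps (cleaning “nan”/“inf” values, label encoding, and feature scaling) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the fill value of 0 and min-max scaling are choices made for the example, and the paper does not specify its exact implementation.

```python
import math


def clean_numeric(values, fill=0.0):
    """Replace None, NaN and +/-inf entries with a fill value (assumed: 0)."""
    cleaned = []
    for v in values:
        if v is None or math.isnan(v) or math.isinf(v):
            cleaned.append(fill)
        else:
            cleaned.append(float(v))
    return cleaned


def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy columns (illustrative only).
ages = clean_numeric([34.0, float("nan"), 61.0, float("inf"), 20.0])
codes, mapping = label_encode(["female", "male", "male", "female"])
scaled = min_max_scale(ages)
print(ages, codes, scaled)
```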
3.2. Features selection and reduction
Feature selection 21 is an important task used to obtain the most relevant features. In this step, the data are preprocessed by formatting the respective dataset/file and then identifying the attributes of the dataset that can be useful in analyzing the target column. Feature selection and feature reduction both reduce the number of features in a dataset, but they differ. Feature selection keeps or excludes given features without changing them, whereas feature reduction transforms the features into a lower‐dimensional space. The set of features produced by feature selection must be a subset of the original set of features, which does not apply to feature reduction. This study uses the following feature selection tasks: removing features with missing values; replacing missing values with Boolean “False” or binary “0” (for Boolean columns, “Missing‐Not‐Available” was converted to “False”, and for binary columns to “0”); and removing highly correlated columns. The dataset is then reduced and formatted, and the attributes that are most relevant for analyzing the target column (label) are identified.
3.3. Machine learning classification
This paper uses nine classical machine learning models as well as three popular ensemble learning methods (bagging, voting, and stacking).
A classifier 22 in machine learning is an algorithm that automatically orders or categorizes data into one or more of a set of classes. This paper evaluates several machine learning classifiers in order to obtain the best prediction performance, namely, random forest (RF), K‐nearest neighbors (KNN), logistic regression (LR), gradient boosting (GB), support vector classifier (SVC), Gaussian naive Bayes (GNB), XGBoost (XGB), decision tree (DT), and AdaBoost (AB).
Random forest: This model is made up of many decision trees and relies on two major concepts: 23 (i) random sampling of training data points when building the trees, and (ii) random subsets of features considered when splitting the nodes. The “Gini” criterion is used to build the model, along with the other settings.
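To make the “Gini” criterion concrete, the sketch below computes the Gini impurity of a set of labels and the size-weighted impurity of a candidate split, which is the quantity a tree compares when choosing where to split a node. The function names and toy labels are illustrative assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a candidate two-way split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# A pure node has impurity 0; a balanced two-class node has impurity 0.5.
pure = gini_impurity(["M", "M", "M", "M"])
mixed = gini_impurity(["M", "F", "M", "F"])
print(pure, mixed, split_gini(["M", "M"], ["F", "F"]))
```

A split that separates the classes perfectly drives the weighted impurity to zero, which is why lower values are preferred.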
K‐nearest neighbors: The K‐nearest neighbors (KNN) algorithm 24 is a non‐parametric method used for classification and regression. In both cases, the input consists of the K closest training examples in the feature space. The output depends on whether KNN is used for classification or regression. Here, it uses the “auto” algorithm setting with the “minkowski” distance metric.
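The following brute-force sketch shows KNN classification with the Minkowski distance. It is not the study's implementation: the function names and toy points are assumptions, and the “auto” and “minkowski” settings quoted above appear to match the defaults of scikit-learn's `KNeighborsClassifier`, although the paper does not name the implementation.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_X, train_y, query, k=3, p=2):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: minkowski(pair[0], query, p))
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy two-cluster data (illustrative only).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]
pred = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
print(pred)
```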
Logistic regression: This is a statistical method 25 for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a variable in which there are only two possible outcomes.
Gradient boosting: Gradient boosting 26 is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
Support vector classifier: SVC 27 is a supervised classification algorithm that separates the classes with a maximum‐margin decision boundary, optionally in a kernel‐induced feature space. If the data is high‐dimensional, a preprocessing step, for example, using principal component analysis, can be applied first.
Gaussian Naive Bayes model: The Gaussian Naive Bayes 28 algorithm is a special type of NB 29 algorithm. It is used specifically when the features have continuous values, and it assumes that all the features follow a Gaussian (i.e., normal) distribution.
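For reference, the Gaussian assumption can be written out as follows. The notation is introduced here for illustration and is not taken from the paper: $x_i$ is the $i$-th feature, $c$ a class, and $\mu_{c,i}$, $\sigma_{c,i}^2$ the per-class mean and variance of feature $i$ estimated from the training data.

```latex
P(x_i \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
  \exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right),
\qquad
\hat{y} = \arg\max_{c}\; P(y = c)\prod_{i} P(x_i \mid y = c)
```

The product over features reflects the "naive" conditional independence assumption shared by all naive Bayes variants.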
XGBoost model: XGBoost 30 is a decision‐tree‐based ensemble machine learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (e.g., images, text), artificial neural networks tend to outperform other algorithms or frameworks, whereas tree‐based methods such as XGBoost are well suited to structured, tabular data. It is an implementation of gradient boosted decision trees designed for speed and performance. The goal of this library is to push the computational limits of machines to provide a scalable, portable, and accurate library.
Decision tree: This is 31 a predictive model; trees whose target variable takes a discrete set of values are called classification trees. The leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
Ada Boosting model: The main idea of the classifier ensemble is to combine the outputs obtained from a number of weak learners. One such famous classifier is Ada‐boost Classifier. 32 , 33 For every dimensionally reduced value, the optimal threshold classification function is determined by the weak learner such that the minimum numbers of samples are mis‐classified.
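The threshold-selection step described above can be sketched with a one-dimensional decision stump, followed by a single AdaBoost reweighting round. This is a simplified illustration under stated assumptions (labels coded as +1/-1, one feature, hypothetical function names and toy data), not the configuration used in the study.

```python
import math


def best_stump(xs, ys, weights):
    """Find the threshold/polarity on a 1-D feature minimizing weighted error.

    ys are +1/-1 labels. The stump predicts `polarity` when x >= threshold
    and -polarity otherwise. Returns (error, threshold, polarity).
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def reweight(weights, preds, ys, err):
    """One AdaBoost round: compute the learner weight alpha, then renormalize."""
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    new = [w * math.exp(-alpha * y * p) for w, p, y in zip(weights, preds, ys)]
    total = sum(new)
    return alpha, [w / total for w in new]


# Toy data (illustrative only): one sample (x = 4.0) cannot be fit by the stump.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [-1, -1, 1, -1, 1, 1]
weights = [1 / len(xs)] * len(xs)
err, threshold, polarity = best_stump(xs, ys, weights)
preds = [polarity if x >= threshold else -polarity for x in xs]
alpha, weights = reweight(weights, preds, ys, err)
print(threshold, polarity, alpha, weights)
```

After the update, the misclassified sample carries more weight, so the next weak learner is pushed to classify it correctly.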
3.3.1. Ensemble learning methods
Ensemble methods are machine learning techniques that combine several base models to produce one optimal predictive model. This research work evaluates the following ensemble models:
Stacking classifier: Stacking classifier is used to stack different classifiers in one with different settings or classifier properties. The ensemble learning technique combines multiple classification models via a meta‐classifier. The meta‐classifier can either be trained on the predicted class labels or probabilities from the ensemble method. 34
Voting ensemble method: 35 This technique combines the predictions of several base models, for example by majority vote, to produce one final prediction. This paper also used feature ensembling approaches, such as bagging, to combine different classifiers with different settings or classifier properties.
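A minimal sketch of the two common voting schemes, hard (majority) voting on predicted labels and soft voting on averaged probabilities, is given below. The function names and the base-model outputs are hypothetical; the paper does not state which scheme it used.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote across models for each sample.

    `predictions_per_model` holds one equally long prediction list per model.
    """
    voted = []
    for sample_preds in zip(*predictions_per_model):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted


def soft_vote(probas_per_model):
    """Average positive-class probabilities and threshold at 0.5."""
    n_models = len(probas_per_model)
    averaged = [sum(ps) / n_models for ps in zip(*probas_per_model)]
    return [1 if p >= 0.5 else 0 for p in averaged]


# Hypothetical outputs of three base models on four samples.
hard = hard_vote([[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]])
soft = soft_vote([[0.9, 0.4, 0.6, 0.8], [0.6, 0.7, 0.3, 0.9], [0.3, 0.1, 0.8, 0.6]])
print(hard, soft)
```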
Bagging: Bagging, 36 also known as bootstrap aggregating, is the aggregation of multiple versions of a predictive model. Each model is trained individually, and the results are combined using an averaging process. The primary focus of bagging is to achieve lower variance than any individual model.
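The mechanics of bagging can be sketched as follows: each base model is trained on its own bootstrap sample, whose size is a fraction `max_samples` of the training set, and the predictions are combined by majority vote (the classification analogue of averaging). The base learner here is a deliberately trivial stand-in, and all names and defaults are assumptions for illustration; a library implementation such as scikit-learn's `BaggingClassifier` would normally be used instead.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, max_samples=1.0, rng=None):
    """Draw round(max_samples * n) rows with replacement."""
    rng = rng or random.Random()
    n_draw = max(1, round(max_samples * len(X)))
    idx = [rng.randrange(len(X)) for _ in range(n_draw)]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_majority(X, y):
    """Trivial stand-in base learner: always predicts the majority class."""
    majority = Counter(y).most_common(1)[0][0]
    return lambda row: majority


def bagging_predict(X, y, row, fit=fit_majority, n_estimators=11, max_samples=0.9, seed=0):
    """Train each base model on its own bootstrap sample, then majority-vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        Xb, yb = bootstrap_sample(X, y, max_samples, rng)
        votes.append(fit(Xb, yb)(row))
    return Counter(votes).most_common(1)[0][0]
```

Any function with the same `fit(X, y) -> predictor` shape can be passed as `fit`, so the stand-in can be swapped for a real base learner.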
This study uses three ensemble classifiers 37 on the datasets to find their five most important features and then combines those selected features into a new feature set to train the models. The classifiers are then applied to that new feature set, and their accuracy with the existing features is compared with the accuracy of the same classifiers on the new feature set. This process includes the following tasks:
Evaluating the accuracy and total runtime of each of the four classifiers.
Finding the five most important features for each model, then collecting all of these features and combining them in a new dataframe.
Selecting the five best features supporting each of the four base models and combining them into a feature data set that holds the five best features of each base model, and
Loading the new dataset for analysis by creating new training and test data.
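The feature-combination tasks above can be sketched as taking each model's top-k features by importance and forming their union. The importance scores, model names, and function names below are hypothetical, and k is set to 2 only to keep the example small (the study uses five).

```python
def top_k_features(importances, k=5):
    """Return the k feature names with the largest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def combine_top_features(importances_per_model, k=5):
    """Union of each model's top-k features, preserving first-seen order."""
    combined = []
    for importances in importances_per_model.values():
        for name in top_k_features(importances, k):
            if name not in combined:
                combined.append(name)
    return combined


# Hypothetical importance scores for two base models.
importances_per_model = {
    "RF": {"age": 0.40, "city": 0.25, "symptoms": 0.20, "country": 0.15},
    "GB": {"age": 0.35, "symptoms": 0.30, "travel_history": 0.25, "city": 0.10},
}
selected = combine_top_features(importances_per_model, k=2)
print(selected)
```

The resulting list of column names would then be used to build the new training and test sets.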
4. EXPERIMENTAL SETTINGS AND RESULTS
This research work studies the impact of gender on other factors of the COVID‐19 pandemic, explores the impact of the global growth rate of COVID‐19 on the death ratio, and examines the impact of social distancing measures on the death ratio resulting from COVID‐19. It uses three public datasets to validate the three stated hypotheses.
4.1. Datasets description
Three well‐known datasets are used to validate the three hypotheses of this analytical study.
Patient medical data for novel coronavirus COVID‐19: The Wolfram patient dataset; Patient Medical Data for Novel Coronavirus COVID‐19 * . The dataset includes patient reports shared publicly and has 17,734 records with columns like administrative division, country, geographical position, gender, age, travel history, symptoms, any relation to Wuhan, chronic disease, death quotient, and discharging information. The sex/gender feature is the target label. This dataset is used to demonstrate the proposed approach for Hypothesis 1.
Johns Hopkins University CSSE Team COVID‐19: Johns Hopkins University has built a dashboard from the affected‐cases data, namely, the Johns Hopkins University CSSE Team COVID‐19 Dataset † . This dataset has daily‐level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus. It is a time series, starting from Jan 22, 2020, that holds this information with respect to the different provinces of each country in separate CSV files. The affected numbers are still growing. The columns are: (1) Sno: serial number, (2) Observation Date: date of the observation in MM/DD/YYYY, (3) Province/State: province or state of the observation (may be empty when missing), (4) Country/Region: country of observation, (5) Last Update: time in UTC at which the row was updated for the given province or country, (6) Confirmed: cumulative number of confirmed cases up to that date, (7) Deaths: cumulative number of deaths up to that date, and (8) Recovered: cumulative number of recovered cases up to that date. The label is 'Growth_Rate', which measures the increase in confirmed cases each day, either globally or for the U.S. This dataset is used in Hypothesis 2.
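The paper does not give the exact formula for the growth-rate label. One plausible reading, shown below purely as an assumption, is the day-over-day relative increase in cumulative confirmed cases, with undefined ratios (division by a zero count, which would otherwise yield "inf") replaced by 0.

```python
def daily_growth_rate(cumulative_confirmed):
    """Day-over-day relative increase in cumulative confirmed cases.

    Assumed definition: (today - yesterday) / yesterday. When yesterday's
    count is zero the ratio is undefined, so 0.0 is used instead.
    """
    rates = [0.0]  # no previous day for the first observation
    for prev, curr in zip(cumulative_confirmed, cumulative_confirmed[1:]):
        rates.append((curr - prev) / prev if prev > 0 else 0.0)
    return rates


# Toy cumulative series (illustrative values, not from the JHU dataset).
confirmed = [0, 10, 15, 30, 33]
rates = daily_growth_rate(confirmed)
print(rates)
```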
ACAPS COVID‐19 Government Measures: This dataset is created by combining the global time series data from the 'Johns Hopkins University CSSE Team COVID‐19 Dataset' with the 'ACAPS COVID‐19 Government Measures Dataset' ‡ . The latter contains the list of measures taken by each country, with the corresponding dates and other relevant information. The dataset has 4275 rows, with major columns such as category, definitive measures, and compliance parameters, along with the implementation date of each measure in each country. The COVID‐19 Government Measures Dataset puts together all the measures implemented by governments worldwide in response to the coronavirus pandemic. Data collection includes secondary data review. The available information falls into five categories: social distancing, movement restrictions, public health measures, social and economic measures, and lockdowns. Each category is broken down into several types of measures. ACAPS consulted governments, media, the United Nations, and other organizations as sources. As for the second hypothesis of this study, 'Growth_Rate' is used as the label. This dataset is used in Hypothesis 3.
4.2. Experimental results and discussion
This section examines the three hypotheses using the datasets and evaluates machine learning models as validation techniques. These hypotheses represent the most crucial factors in the impact of COVID‐19. Physiological characteristics of the human body influence how it responds to a specific disease. Consequently, it is crucial to investigate gender bias in COVID‐19 infection, which motivates Hypothesis 1. Hypothesis 2 focuses on COVID‐19's global growth rate and how that pace affects the number of deaths. As the disease spreads, the number of infections rises. This calls for in‐depth examination of such concerns in a given region, which is why this study focuses on the main insights related to this concept. Furthermore, one of the key reasons for the disease spreading quickly is close proximity between infected persons and others. As a result, Hypothesis 3 focuses on determining how close contact between people affects the behavior of the disease.
To ensure a fair comparison, we validate the models using their default parameter settings, so that differences in tuning do not bias the comparison across datasets; the behavior of ML techniques, and how reliably they predict future records, depends on how their parameters are tuned. The three hypotheses are validated using the same ML models, their default parameters, and the datasets.
Hypothesis 1: Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we drop the features whose correlation is greater than 80% (w.r.t. the target column). Based on our findings, the top 6 features correlated with the sex target feature are, in order: Age, Geo‐position, City, Travel‐History‐Location, Symptoms, and lives‐in‐Wuhan, as shown in Table 2. Other steps include dropping columns with low variance. Univariate feature selection used the 'SelectKBest' method to remove all but the specified number of highest‐scoring features. For the employed dataset, the highest accuracy was obtained with 14 features. The plotted death quotient (DeathQ) percentages for males and females showed that 19.23% of affected females died, compared with 43.34% of affected males. Figures 2 and 3 show the percentages of confirmed cases and death cases for each gender, respectively.
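A correlation-based filter of this kind can be sketched as below. The sketch assumes one particular reading of the step, namely that a feature is dropped when its absolute Pearson correlation with an already-kept feature exceeds the threshold; the column names and values are hypothetical and the function names are introduced only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.8):
    """Keep a column only if |r| with every already-kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical columns: "age_months" duplicates the information in "age".
columns = {
    "age": [20, 35, 50, 65, 80],
    "age_months": [240, 420, 600, 780, 960],
    "symptom_count": [2, 1, 3, 1, 2],
}
kept = drop_correlated(columns, threshold=0.8)
print(list(kept))
```

The first column of each highly correlated pair is retained, so the order of the columns determines which duplicate survives.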
TABLE 2.
Top 6 correlated features for target gender feature
| Feature | Correlation ratio |
|---|---|
| Age | 0.198 |
| Geoposition | 0.104 |
| City | 0.064 |
| TravelHistoryLocation | 0.031 |
| Symptoms | 0.027 |
| livesinWuhan | 0.016 |
FIGURE 2.
Percentage of confirmed cases in each gender
FIGURE 3.
Percentage of death cases for each gender
This study employed principal component analysis (PCA) as a feature reduction technique. The dataset used does not suffer from high dimensionality, so PCA‐based dimensionality reduction is not crucial; the steps were nevertheless performed to exercise the implementation.
For the second hypothesis, the data were preprocessed by formatting the respective dataset/file and then identifying the attributes of the dataset that can be useful in analyzing the target column. This work used several preprocessing and feature selection techniques: (1) dropping features with null values; (2) filling null columns with 0 (if any): populating “Growth_Rate” left some “inf” values in the data frame, so the cleanup ensures that any such entries are filled with 0; (3) removing highly correlated columns: correlation is a statistical term that in common usage refers to how close two variables are to having a linear relationship with each other, and features with a correlation greater than 70% w.r.t. the target column were dropped. The top 6 correlated features for the target feature 'Growth Rate' are observation_date, country_code, id, confirmed, deaths, and recovered; as per the correlation analysis, the columns deaths, recovered, and confirmed, with correlation higher than 70%, were dropped; (4) dropping numeric columns with low variance; and (5) producing trend plots between features, that is, a graphical analysis of the top 5 affected countries based on confirmed cases, deaths, active cases, and the respective growth rate. Figures 4 and 5 show the trend plots for the USA.
FIGURE 4.
Active and new COVID‐19 cases count
FIGURE 5.
Recovered and death counts for COVID‐19
As per the correlation analysis, the columns deaths, recovered, and confirmed, with correlation higher than 80%, were dropped. The proposed approach drops the highly correlated features whose correlation is greater than 85% w.r.t. the target column. Table 3 shows the top 8 correlated features with “Growth_Rate”.
TABLE 3.
Top 8 correlated features for “Growth_Rate” with respect to social distancing
| Feature | Correlation |
|---|---|
| Target_Growth | 0.420 |
| TARGETED_POP_GROUP_NO | 0.081 |
| CATEGORY_social distancing | 0.079 |
| MEASURE_school closure | 0.071 |
| CATEGORY_movement restrictions | 0.067 |
| MEASURE_international flight suspension | 0.059 |
| MEASURE_limit public gatherings | 0.058 |
| MEASURE_border closure | 0.048 |
| TARGETED_POP_GROUP_YES | 0.046 |
According to the obtained results, the XGB model outperforms the other ML models in terms of accuracy, as shown in Table 4. Accuracy reflects how often a model correctly predicts the actual values from the input features. This measure indicates the robustness of the ML models in predicting future records.
TABLE 4.
Accuracy achieved for each ML classifier
| Classifier | Accuracy |
|---|---|
| RF | 78.6% |
| KNN | 55.9% |
| LR | 76.7% |
| GB | 77.3% |
| SVC | 57% |
| GNB | 75.1% |
| XGB | 79.9% |
| DT | 76.7% |
This approach applies bagging to each classifier with different max_sample values and then selects, for each classifier, the bagged variant with the highest accuracy. Table 5 shows the ensemble classifier with the highest accuracy at each sample ratio. Among all bagged classifiers, AB achieved the highest accuracy (81.46%), with a sample ratio of 0.9, as shown in Table 5.
TABLE 5.
Best accuracy achieved by the ensemble learning models at each sample ratio
| Max_Sample | Ensemble classifier | Accuracy |
|---|---|---|
| 0.1 | XGB | 78.84% |
| 0.2 | DT_Gini | 79.37% |
| 0.3 | DT_Entropy | 79.60% |
| 0.4 | AB | 80.27% |
| 0.5 | DT_Entropy | 80.13% |
| 0.6 | DT_Entropy | 80.37% |
| 0.7 | AB | 80.41% |
| 0.8 | AB | 80.37% |
| 0.9 | AB | 81.46% |
| 1.0 | AB | 80.61% |
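The sweep over sample ratios summarized in Table 5 can be sketched with scikit-learn's `BaggingClassifier`. This is an illustrative sketch only: the synthetic dataset, the helper name `bagging_sweep`, the number of estimators, and the random seeds are assumptions, so the scores it prints are unrelated to the values reported in the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient dataset (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)


def bagging_sweep(base_estimator, ratios):
    """Fit one bagged ensemble per max_samples ratio and record test accuracy."""
    scores = {}
    for ratio in ratios:
        bagged = BaggingClassifier(base_estimator, n_estimators=20,
                                   max_samples=ratio, random_state=0)
        bagged.fit(X_train, y_train)
        scores[ratio] = bagged.score(X_test, y_test)
    return scores


ratios = [round(0.1 * k, 1) for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0
scores = bagging_sweep(
    DecisionTreeClassifier(criterion="entropy", random_state=0), ratios)
best_ratio = max(scores, key=scores.get)
print(best_ratio, scores[best_ratio])
```

Repeating the sweep for each base classifier and keeping the best entry per ratio would produce a table with the same layout as Table 5.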
To evaluate each model, different measures can be used. This study focuses on accuracy and the ROC curve on the test data. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold levels, and therefore represents the model's ability to separate the classes across different decision thresholds.
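To make the TPR/FPR construction concrete, the following sketch computes ROC points by hand for a four-sample toy problem and integrates them with the trapezoidal rule. The labels, scores, and thresholds are invented for illustration, and `roc_points` is a hypothetical helper rather than anything from the study.

```python
import numpy as np


def roc_points(y_true, scores, thresholds):
    """Return (FPR, TPR) pairs, one per decision threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    positives = (y_true == 1).sum()
    negatives = (y_true == 0).sum()
    points = []
    for t in thresholds:
        predicted = scores >= t  # predict positive at or above the threshold
        tpr = (predicted & (y_true == 1)).sum() / positives
        fpr = (predicted & (y_true == 0)).sum() / negatives
        points.append((float(fpr), float(tpr)))
    return points


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
thresholds = [1.1, 0.8, 0.4, 0.35, 0.1]  # from strictest to most permissive

points = roc_points(y_true, scores, thresholds)
# Area under the piecewise-linear ROC curve (trapezoidal rule).
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc)     # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the AUC tables in this section should be read.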
Predicting the gender of patients from certain features helps to track the influence of the disease on distinct populations in specific areas. Knowing the gender distribution of patients in a given area can support containment efforts so that reliable services can be delivered. As a result, it is crucial to study the performance of ML techniques for such purposes. In Tables 6 and 7, the AB model obtains the highest accuracy and among the highest ROC AUC values when compared to other models using ML, Bagged, and Ensemble techniques. The reason behind such results is that the AB model builds its classifier by aggregating weak models into a global model through a sequence of related trees. Each successor weak model improves on the errors of the prior weak model. Separating the features across different trees and aggregating the results in this way compensates for weak correlations between the features and the target label. The XGBoost model also achieves reasonable results, within approximately 1% of the AB model. XGBoost is likewise a tree‐based ML technique that follows the same principle. To ensure a fair comparison amongst models, the study employs the default parameter settings of the ML models. Under these identical settings, the AB and XGBoost models produce reasonable results, indicating that they are a promising strategy for future application.
TABLE 6.
Accuracy for ML, bagged, and ensemble classifiers
| Classifier | ML | Bagged | Ensemble |
|---|---|---|---|
| RF | 80.06% | 80.49% | 80.37% |
| KNN | 56.41% | 62.98% | 58.74% |
| GB | 78.08% | 78.85% | 78.66% |
| LR | 77.27% | 77.13% | 77.56% |
| SVC | 57.88% | 57.88% | 57.89% |
| XGB | 80.42% | 80.56% | 81.6% |
| NB | 74.61% | 74.94% | 76.66% |
| DT | 78.51% | 80.61% | 79.28% |
| DT_Gini | 79.37% | 81.47% | 79.71% |
| AB | 80.99% | 82.04% | 90% |
TABLE 7.
AUC for ML, bagged, and ensemble classifiers
| Classifier | ML | Bagged | Ensemble |
|---|---|---|---|
| RF | 89% | 90% | 90% |
| KNN | 57% | 66% | 61% |
| GB | 88% | 87% | 84% |
| LR | 84% | 84% | 88% |
| SVC | 71% | 50% | 90% |
| XGB | 91% | 90% | 75% |
| NB | 80% | 90% | 83% |
| DT_Entropy | 80% | 90% | 81% |
| DT_Gini | 89% | 90% | 81% |
| AB | 89% | 90% | 90% |
The precision and recall metrics highlight the two error types in ML techniques, namely False Positive (FP) and False Negative (FN) values. As shown in Table 8, the XGB model obtains high precision and recall values with both ML and Ensemble techniques, with low error rates compared to other models. Hence, boosting techniques provide strong results in recognizing the gender of COVID‐19 patients and can be regarded as promising models.
TABLE 8.
Precision and recall: regular versus feature related ensemble classifiers
| Classifier | Precision (ML) | Precision (Ensemble) | Recall (ML) | Recall (Ensemble) |
|---|---|---|---|---|
| RF | 79.86% | 80.55% | 79.75% | 80.37% |
| KNN | 57.47% | 59.63% | 55.60% | 58.74% |
| GB | 80.96% | 81.54% | 78.13% | 78.66% |
| LR | 80.43% | 83.14% | 77.08% | 77.56% |
| XGB | 81.74% | 83.44% | 80.47% | 81.61% |
| NB | 79.94% | 82.91% | 74.61% | 76.66% |
| DT_Entropy | 78.92% | 79.22% | 78.94% | 79.28% |
| DT_Gini | 78.42% | 79.65% | 78.47% | 79.70% |
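How FP and FN counts turn into precision and recall can be shown with a small worked example. The sketch below assumes support-weighted averaging over the two gender classes; the text does not state which averaging was used, although the Ensemble recall values in Table 8 closely track the Ensemble accuracies in Table 6, which is what weighted averaging would produce. The labels and the helper name are invented for illustration.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall built from TP, FP, and FN counts."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)  # false alarms
        fn = sum(t == label and p != label for t, p in pairs)  # misses
        class_precision = tp / (tp + fp) if tp + fp else 0.0
        class_recall = tp / (tp + fn) if tp + fn else 0.0
        weight = support[label] / total
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall


y_true = ["M", "M", "M", "M", "F", "F", "F", "F", "F", "F"]
y_pred = ["M", "M", "M", "F", "F", "F", "F", "F", "M", "M"]

precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(precision, 2), round(recall, 2), accuracy)  # 0.72 0.7 0.7
```

With this averaging, weighted recall always equals accuracy, so precision is the column that adds new information about false positives.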
We compare our results with the previous study. 2 That study used different statistical approaches and distinct tests such as Chi‐Square, log‐rank, Mann–Whitney U test, and Kaplan–Meier survival curves, whereas this study used a number of traditional classifiers along with bagged and feature related ensemble methods with different settings. This study used the data till 31st May, 2020 and split it in a 0.85:0.15 ratio to train and make predictions. However, the related studies made future predictions for the coming four weeks using the data available to date for training. Table 9 summarizes the different aspects of these forecasts against the analytical machine learning‐based model.
TABLE 9.
Comparing the proposed approach and other related approaches with respect to gender
| Criteria | Previous approaches | Proposed approach |
|---|---|---|
| Models | Reference 2: statistical analysis | ML classifiers, bagged and feature related ensemble |
| Training Data | Initial 1056 patients in Wuhan, day to day global disaggregated gender‐based data | Data from 17,777 patients (global) |
| Features base | Symptoms, chronic disease, and age | Age, symptoms, chronic disease, travel history, and Geo position |
| Evaluation base | Growth rate in past 4 days | Growth rate every day |
| Target | Death/Survival rate in specific gender and different age groups | COVID‐19 affected rate in specific gender |
| Result | More cases in men than women and maximum death rate in (age > 80) | More cases in men than women with underlying parameters |
Hypothesis 2: A research study § was conducted on the Novel Corona Virus 2019 dataset to predict where the major Asian and European countries would stand in the coming months using profit based forecasting. The study uses the time series data till May 15th, 2020. The report shows the growth in confirmed, death, and active cases in major Asian and European countries, and the profit model forecasts rapid growth in some major Asian countries. This study ran the code through the Kaggle editor and found that different factors, such as growth rate, death rate, and recovery rate, are calculated for the respective countries. The code separates the data for each country into different data frames and then analyzes the growth rate and death rate in the respective countries. It also shows trend plots for the recovered, active, and death cases for each country.
The study concluded that, due to extremely high population density and widespread poverty in Asia, the growth in confirmed cases is around 40% in Asian countries, predicting that the number of sick will reach unmanageable levels before the end of April.
Another comparative visual analysis ¶ was done on COVID‐19 time series data till 5th April 2020 in order to study the mortality rate per 1000 in the respective countries. The analysis in that study is similar to the analysis for growth rate. The research shows the impacted population worldwide, then narrows it to continents, and then further to countries and states. This work includes executing the code in Kaggle, which shows some important comparative aspects of COVID‐19.
The worldwide mortality rate was analyzed as 6.63, considering the time period till May 18th 2020. The analysis also reports the COVID‐19 mortality index for different continents such as Europe, Asia, and South America. The model's forecast for the next 10 days was that global cases would reach 42.61 and deaths would reach 425.9 k by May 19th, 2020. With that date now passed, this study compared the 10‐day prediction with the actual growth during this period and found the predictions to be somewhat accurate.
Different data models have been demonstrated to process the data, eliminate noise, and choose the model with the best performance metrics. Similarly, for the second hypothesis, the proposed approach applies the same ML classifiers and feature‐based ensemble techniques as in the first hypothesis. Voting ensemble classifiers 38 are used to build predictive models.
This work evaluates a set of ML models and calculates the accuracy of each model and the ROC curve, summarized by the Area Under the Curve (AUC), for a multi‐label target. Figures 6 and 7 show the ROC curves for the Voting_Ensemble 1 and Voting_Ensemble 2 models, respectively. For the regular classifiers, Table 10 shows the accuracy and AUC results, which indicate high performance for the GB and XGB models in predicting the global growth rate of COVID‐19. Table 11 shows the average accuracy of the voting ensemble models, each built on base classifiers such as XGB, DT, and RF, using information gain and Gini index as feature selectors. Voting ensemble models 7 and 9 obtain high accuracy of 87.17% and 88.08%, respectively. The base classifier behind these results is the XGB model, a boosting ensemble method that turns weak learners into a strong sequential model by reducing their error rates. The results indicate the possibility of predicting the global growth rate using six features. This could help reduce the death rate by enabling the necessary precautions to be taken.
FIGURE 6.

ROC for Voting_Ensemble 1 (GB,DT_Entropy,RF)
FIGURE 7.

ROC for Voting_Ensemble 2 (GB1,DT_Gini,RF1)
TABLE 10.
Accuracy and AUC for regular classifiers
| Classifier | Accuracy | AUC |
|---|---|---|
| RF | 72.96% | 91% |
| RF1 | 72.29% | 92% |
| KNN | 78.64% | 93% |
| KNN1 | 76.12% | 92% |
| GB | 78.19% | 93% |
| GB1 | 88% | 97% |
| LR | 61.18% | 85% |
| SVC | 71.35% | 87% |
| XGB | 87.95% | 86% |
| XGB1 | 79% | 86% |
| DT_Entropy | 84.40% | 88% |
| DT_Gini | 84.29% | 88% |
TABLE 11.
Accuracy for voting ensemble models
| Classifier | Accuracy |
|---|---|
| Voting_Ensemble 1 (GB, DT, RF) | 84.23% |
| Voting_Ensemble 2 (GB1,DT_Gini, RF1) | 62.98% |
| Voting_Ensemble 3 (XB, GB, RF) | 83.34% |
| Voting_Ensemble 4 (XB1,GB1, RF1) | 84.92% |
| Voting_Ensemble 5 (DT, RF, KNN) | 82.95% |
| Voting_Ensemble 6 (DT_Gini, RF1, KNN1) | 83.29% |
| Voting_Ensemble 7 (XB, GB1, RF1) | 87.17% |
| Voting_Ensemble 8 (XB, GB, DT) | 84.84% |
| Voting_Ensemble 9 (XB1, GB1, DT_Gini) | 88.08% |
| Voting_Ensemble 10 (XB, GB1, hard) | 82.03% |
| Voting_Ensemble 11 (XB, GB1, soft) | 87.83% |
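The difference between hard and soft voting, as in the last two rows of Table 11, can be sketched with scikit-learn's `VotingClassifier`. This is an illustration under stated assumptions: the data are synthetic with three classes and six features, the base models are GB, DT (entropy), and RF with arbitrary seeds, and XGBoost is left out to avoid an extra dependency, so the printed accuracies are not comparable to the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in for the growth-rate classes (assumption).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

estimators = [
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(criterion="entropy", random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

results = {}
for mode in ("hard", "soft"):
    # "hard" takes a majority vote over predicted labels;
    # "soft" averages the predicted class probabilities.
    ensemble = VotingClassifier(estimators=estimators, voting=mode)
    ensemble.fit(X_train, y_train)
    results[mode] = ensemble.score(X_test, y_test)

print(results)
```

Soft voting requires every base model to expose class probabilities, which is why it can behave differently from hard voting even with the same members.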
In this step, the model is chosen for the data based on the accuracy of each model. Compared with the previous/related studies, which used distinct models such as the profit model or their own assembled models, this study has used a number of classic classifiers along with the voting ensemble method with different settings. This study used the data till 31st May 2020. Table 12 summarizes the different aspects of these forecasts against the analytical machine learning‐based model.
TABLE 12.
Comparing the proposed approach and related approaches with respect to global growth rate
| Criteria | Previous approaches | Proposed approach |
|---|---|---|
| Models | Report a profit model forecasts | ML classifiers and voting ensemble method |
| Data Trained | Till 5th April and made prediction for next month | Used data till 31 May and made predictions on the past data |
| Evaluation Base | Growth rate in past 4 days | Growth rate every day |
| Focused Area | Global, Country, State | Country |
| Target | Mortality Rate/Growth Rate‐mobility, social distancing | Growth rate |
| Prediction Accuracy | Based on predictions growth rate increased constantly as analyzed | 88.09% |
Hypothesis 3: The research study 11 examined the impact of social distancing and mobility on COVID‐19. In that study, the authors proposed a new hybrid machine learning model, SIRNET, for forecasting the spread of the COVID‐19 pandemic, which couples machine learning with epidemiological models. The study uses categorized, spatio‐temporally explicit cellphone mobility data as surrogate markers for physical distancing, along with population weighted density and other local data points. The authors incorporated mobility data metrics into the model, which succeeds on four key population modeling criteria that directly and indirectly impact the ability to model the effective reproductive number: temporally and spatially explicit coverage, representativeness, and contemporaneousness. They collected the metrics in a manner common to all regions, which allows for rapid wide‐scale data interpolation and extrapolation and accurately reflects the unique geopolitical profile of regions, each with variable laws, customs, socioeconomic profiles, health care resources, and susceptibility rates. The approach focuses on learning and forecasting the trends in time series via a hybrid model of neural networks and epidemiological models. The forecasting network, referred to as SIRNET, learns from a sequence of prior trends that carry long‐term contextual information (global time‐series) and from more recent raw data inputs (local time‐series) that inform the forecasting of any abrupt changes. SIRNET is a hybrid between epidemic modeling, physical science, and machine learning.
The high‐level visualization of the SIRNET architecture consists of the RNN, which is a linear network with input layer ∈ ℝ⁶, hidden layer ∈ ℝ⁴, and output layer ∈ ℝ¹, with ReLU activation. Bi is the intractable contact rate. The RNN is a deep LSTM whose internal state is fed as input to the SEIR cell.
They fed the historical time series and local raw data input to different types of recurrent neural networks (RNNs) or a Linear cell. RNNs can learn patterns from arbitrarily long spatio‐temporal data, through cyclic connection of nodes in the network. The SIRNET consists of a recurrent neural network to implement the temporal and population dynamics of an SEIR cell, and its framing as such allows introducing complex functions with learnable parameters, enabling mapping from salient input data to the underlying properties of the epidemiological model.
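To give a feel for what an SEIR cell computes at each time step, the sketch below implements a plain discrete-time SEIR update in which the contact rate is scaled linearly by a mobility value. This is not the SIRNET implementation: in SIRNET the mapping from input data to the contact rate is learned by a network, whereas the linear scaling, the parameter values (`beta_max`, `sigma`, `gamma`), and the function names here are assumptions chosen only for illustration.

```python
def seir_step(state, beta, sigma=0.2, gamma=0.1):
    """Advance (S, E, I, R) by one day with contact rate beta."""
    s, e, i, r = state
    n = s + e + i + r
    new_exposed = beta * s * i / n  # susceptible people who get exposed
    new_infectious = sigma * e      # exposed people who become infectious
    new_recovered = gamma * i       # infectious people who recover
    return (s - new_exposed,
            e + new_exposed - new_infectious,
            i + new_infectious - new_recovered,
            r + new_recovered)


def simulate(mobility, days=120, beta_max=0.5,
             population=1_000_000, seed_cases=10):
    """Run the recurrence with a contact rate proportional to mobility."""
    state = (float(population - seed_cases), 0.0, float(seed_cases), 0.0)
    infectious = []
    for _ in range(days):
        state = seir_step(state, beta=beta_max * mobility)
        infectious.append(state[2])
    return state, infectious


final_high, infectious_high = simulate(mobility=0.9)
final_low, infectious_low = simulate(mobility=0.1)
print(max(infectious_high) > max(infectious_low))  # higher mobility, larger peak
```

Because each step only needs the previous state and the current input, the update has the same shape as a recurrent cell, which is what allows it to be embedded in an RNN and trained end to end.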
Their model shows that the epidemic is extremely sensitive to changes in mobility rates, so implementing social distancing measures is crucial. In the mobility scenarios tested across countries or counties, mobility >0.7 leads to an uncontained outbreak, mobility <0.5 results in a local elimination of the virus, and values in between lead to slower peaks. The model also makes it possible to discover relationships between real‐world trends and their impact on the spread of COVID‐19, and to model scenarios such as relaxing social distancing policies.
Another report # presents the day‐to‐day forecasts compiled by the CDC from the research of different partners using different models on COVID‐19 data. It shows the cumulative reported COVID‐19 deaths since February and the forecasted deaths for the next four weeks in the United States, using models with various assumptions about the levels of social distancing and other interventions, which may not reflect recent changes in behavior. The forecasts are based on statistical or mathematical models that aim to predict changes in national‐ and state‐level cumulative reported COVID‐19 deaths for the next four weeks. Forecasting teams predict numbers of deaths using different types of data (e.g., COVID‐19 data, demographic data, mobility data), methods, and estimates of the impacts of interventions (e.g., social distancing, use of face coverings).
There are also state‐level forecasts, which show observed and forecasted state‐level cumulative COVID‐19 deaths in the US. Each state forecast uses a different scale, due to differences in the numbers of COVID‐19 deaths occurring in each state. Forecasts fall into one of two categories:
The Auquan, CAN, ERDC, GA_Tech, Geneva, Imperial, ISU, LANL, MIT, MOBS, PSI, SWC, UA, UCLA, UMass‐MB, and UT forecasts assume that existing control measures will remain in place during the prediction period.
The Columbia, COVIDSim, GT_CHHS, JHU, and YYG forecasts make different assumptions about how levels of social distancing will change in the future. CDC is working with partners to bring together weekly forecasts for COVID‐19 deaths in one place. Table 13 lists some of the CDC partners for COVID‐19 death forecasts. These forecasts have been developed independently and shared publicly. Bringing them together helps to understand how they compare with each other and how much uncertainty there is about what may happen in the upcoming four weeks.
TABLE 13.
CDC partners for COVID‐19 deaths
| Source | Model | Method |
|---|---|---|
| Auquan Data Science | Auquan | SEIR |
| GIT | GT_CHHS | Agent‐based |
| Iowa State University | ISU | spatiotemporal |
| Northeastern University | MOBS | SLIR |
Another report ‖ used a raw video file, in which people can be seen walking in a street of Oxford University, to analyze social distancing measures. The resulting algorithm gives insight into how to track whether people are following social distancing measures. A YOLO model is used to detect the presence of persons and to calculate the distance between their bounding boxes.
To evaluate each model, this study calculated the accuracy of each model and the ROC curve, summarized by the area under the curve (AUC), for a multilabel target. Figures 8 and 9 show the two highest ROC/AUC values among all 11 stacking ensemble classifiers.
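The distance check between detected boxes can be sketched without the detector itself. The example below assumes that person detections are already available as `(x1, y1, x2, y2)` pixel boxes, for instance from a YOLO model, and flags pairs whose centroids are closer than a pixel threshold. The box coordinates, the threshold, and the helper names are hypothetical; a real system would also need to convert pixels to physical distance.

```python
from itertools import combinations
from math import hypot


def centroid(box):
    """Center point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def close_pairs(boxes, min_distance):
    """Index pairs of boxes whose centroids are closer than min_distance."""
    centers = [centroid(box) for box in boxes]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(centers), 2):
        if hypot(a[0] - b[0], a[1] - b[1]) < min_distance:
            pairs.append((i, j))
    return pairs


# Three hypothetical detections in pixel coordinates.
boxes = [(0, 0, 20, 60), (30, 0, 50, 60), (200, 0, 220, 60)]
violations = close_pairs(boxes, min_distance=50)
print(violations)  # [(0, 1)]
```

Only the first two detections are within 50 pixels of each other, so they form the single flagged pair.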
FIGURE 8.

ROC for stacked ensemble 10
FIGURE 9.

ROC for stacked ensemble 11
The target growth rate has a high correlation with social distancing and is less affected by the other features. Thus, a model to predict the growth rate of disease spreading is needed, and the experiments are conducted for this purpose using effective ML techniques. The results for the individual classifiers are shown in Table 14, and the improved results obtained with stacking are given in Table 15. These results show that XGBoost gives a high accuracy of 87.83%. Combining it with the Gradient Boost technique using the stacking ensemble method yields the highest accuracy of 88.05%, where the time taken to train the model is 72.53 s. Hence, the best model is Stacked_Ensemble10 (XGB, GB1[LR]) with an accuracy value of 88.06%, which uses XGB and Gradient Boost as the base‐level models for training and Logistic Regression as the meta‐model/classifier that is trained on the outputs of the base‐level models as features. In addition, Figures 8 and 9 show the ROC AUC values for the three classes of growth rate, with the micro‐average ROC AUC for stacked ensembles 10 and 11 being 91%. As a result, the GB model family handles the prediction errors of the inherited weak models to produce a global strong model. This gives a promising model that could be used to help reduce the rate at which the disease spreads.
TABLE 14.
Different models with accuracy and AUC value
| Classifier | Accuracy | AUC |
|---|---|---|
| RF | 83.65% | 96% |
| KNN | 81.26% | 94% |
| LR | 65.75% | 95% |
| GB | 82.15% | 85% |
| SVC | 64.75% | 87% |
| XGB | 87.83% | 91% |
| DT_Entropy | 83.56% | 88% |
| DT_Gini | 83.76% | 88% |
TABLE 15.
Accuracy for stacked ensemble classifiers
| Model | Accuracy | AUC |
|---|---|---|
| Stacked(GB, DT, RF) | 84.12% | 95% |
| Stacked(XB1, GB1, RF) | 83.12% | 95% |
| Stacked(XB, GB1, RF) | 83.73% | 95% |
| Stacked(XB, GB1, DT) | 83.43% | 95% |
| Stacked(DT, RF, KNN) | 84.76% | 95% |
| Stacked(RF, XGB, GB, DT, KNN) | 84.09% | 91% |
| Stacked(RF1, XGB1, GB1, DT1,KNN) | 84.06% | 91% |
| Stacked(GB1, DT[LR]) | 83.65% | 95% |
| Stacked(GB1, DT[XGB]) | 83.70% | 91% |
| Stacked(XGB, GB1[LR]) | 88.05% | 91% |
| Stacked(XGB, GB1[XGB]) | 87.97% | 91% |
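The stacking setup with a Logistic Regression meta-model can be sketched with scikit-learn's `StackingClassifier`. In this illustration the base learners are Gradient Boosting and Random Forest; Random Forest is swapped in for XGBoost only to keep the example free of extra dependencies, and the synthetic three-class data, seeds, and estimator counts are assumptions. The printed values therefore say nothing about the accuracies in Table 15.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic three-class stand-in for the growth-rate classes (assumption).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    # The meta-model is trained on cross-validated base-model predictions.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
proba = stack.predict_proba(X_test)
# Micro-average AUC: pool all (sample, class) decisions into one ROC curve.
y_bin = label_binarize(y_test, classes=[0, 1, 2])
micro_auc = roc_auc_score(y_bin, proba, average="micro")
print(accuracy, micro_auc)
```

Using cross-validated predictions as meta-features keeps the meta-model from simply memorizing base models that overfit the training set.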
The previous studies used distinct models like SIRNET, SEIR, and age structured SLIR models. On the other hand, our study uses various traditional classifiers along with stacking ensemble methods to build a predictive modeling approach for COVID‐19 aspects. This study uses the data till 31 May, 2020 and splits the dataset in a 0.85:0.15 ratio to train and make predictions. Table 16 shows the comparison between this approach and the related approaches.
TABLE 16.
Comparing the proposed approach and related approaches with respect to social distancing
| Criteria | Previous approaches | Proposed approach |
|---|---|---|
| Model | Reference 11 SIRNET, SEIR, SLIR | ML classifiers, stacking ensemble |
| Data Trained | Current data to train and 4 weeks future prediction | Used data till 31 May and made predictions on the past data |
| Analyzed data | Global, Country, State | Country |
| Focused Area | Growth rate with social distancing, Death rate/Growth mobility, social distancing | Impact of social distancing on growth rate |
| Accuracy | 75% | 88.6% |
4.3. Discussion
The proposed approach involves identifying the relative impact of COVID‐19 on a specific gender, the mortality rate in specific age groups, and the different safety measures adopted by each country together with their impact on the virus growth rate and the global impact of social distancing on mobility. This study proposes data‐driven analysis and prediction modeling on different aspects of COVID‐19 to lead to better future predictions. Several machine learning and ensemble models have been used and compared to obtain the best accuracy. Experiments have been demonstrated on large public datasets. The proposed approach outperformed the other approaches in terms of accuracy and shows better prediction performance. Tables 17, 18, and 19 show the top results for the three respective goals of this study.
TABLE 17.
The top 3 classifiers for gender specific impact
| Classifier | Accuracy |
|---|---|
| Adaboost | 82.04% |
| XGBoost | 81.6% |
| Decision Tree | 81.47% |
TABLE 18.
The top 3 classifiers for COVID‐19—growth rate
| Classifier | Accuracy |
|---|---|
| Voting_Ensemble10 | 88.09% |
| Gradient Boosting | 88% |
| XGBoost | 87.95% |
TABLE 19.
The top 3 classifiers for impact of social distancing on global growth rate
| Classifier | Accuracy |
|---|---|
| Stacked_Ensemble10 | 88.06% |
| Stacked_Ensemble11 | 87.97% |
| XGBoost | 87.84% |
Classical machine learning classifiers provide effective prediction models. Notably, ensemble machine learning classifiers yield better prediction models than the classical classifiers. This study evaluates the global impact of social distancing on mobility. Our study proposes data‐driven analysis and prediction modeling by investigating three aspects of the COVID‐19 pandemic (gender of patients, global growth rate, and social distancing). It attempts to support the three hypotheses by introducing an analytical machine learning‐based approach and measuring its prediction performance.
The findings of this study reveal superior prediction performance compared with the related approaches and the classical machine learning approaches. Also, feature selection has a strong impact on the performance of the various machine learning classifiers, as shown in the given results.
5. CONCLUSION
This study proposed an analytical machine learning based model to explore three significant aspects of the COVID‐19 pandemic. The proposed approach focuses on determining the relative impact of COVID‐19 on a specific gender, the mortality rate in specific age groups, and the safety measures adopted by each country together with their impact on the virus growth rate. Our study presents data‐driven analysis and prediction models by investigating three aspects of the COVID‐19 pandemic: gender of patients, global growth rate, and social distancing. The proposed analytical model includes classic classifiers and distinctive ensemble methods such as bagging, feature based ensemble, voting, and stacking. The obtained results show superior prediction performance compared with the related approaches.
FUNDING INFORMATION
The authors declare there is no funding for this research.
CONFLICT OF INTEREST
The authors declare there is no potential conflict of interest.
Sharma S, Alsmadi I, Alkhawaldeh RS, Al‐Ahmad B. Data‐driven analysis and predictive modeling on COVID‐19. Concurrency Computat Pract Exper. 2022;34(28):e7390. doi: 10.1002/cpe.7390
ENDNOTES
DATA AVAILABILITY STATEMENT
The data used to support the findings of this study was publicly accessible by the following links:
1. The data that support the findings of this study for the first dataset are openly available in the Wolfram Data Repository at https://datarepository.wolframcloud.com/resources/Patient‐Medical‐Data‐for‐Novel‐Coronavirus‐COVID‐19
2. The data that support the findings of this study for the second dataset are openly available in the COVID‐19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University at https://github.com/CSSEGISandData/COVID‐19
3. The data that support the findings of this study for the third dataset are openly available from acaps.org at https://www.acaps.org/covid‐19‐government‐measures‐dataset?acaps_mode=slow&show_mode=1
REFERENCES
- 1. Organization WH, et al. Coronavirus disease (COVID‐19): weekly epidemiological update. 2020.
- 2. Jin JM, Bai P, He W, et al. Gender differences in patients with COVID‐19: focus on severity and mortality. Front Public Health. 2020;8:152.
- 3. Huong NTT. CARE rapid gender analysis for COVID‐19. 2020.
- 4. Srivastava N, Bhattacharyya A, Seth A, et al. Does nature have a systematic sex bias: prevalence, mortality, and trend of COVID‐19. Ann Public Health Reports. 2020;4(1):129‐135.
- 5. Sobotka T, Brzozowska Z, Muttarak R, Zeman K, Di Lego V. Age, gender and COVID‐19 infections. MedRxiv; 2020.
- 6. Srivastava A, Chowell G. Understanding spatial heterogeneity of COVID‐19 pandemic using shape analysis of growth rate curves. MedRxiv; 2020.
- 7. Merow C, Urban MC. Seasonality and uncertainty in global COVID‐19 growth rates. Proc Natl Acad Sci. 2020;117(44):27456‐27464.
- 8. Utsunomiya YT, Utsunomiya ATH, Torrecilha RBP, Paulan SC, Milanesi M, Garcia JF. Growth rate and acceleration analysis of the COVID‐19 pandemic reveals the effect of public health measures in real time. Front Med. 2020;7:247.
- 9. Qian M, Jiang J. COVID‐19 and social distancing. J Public Health. 2020;259‐261.
- 10. Thu TPB, Ngoc PNH, Hai NM, et al. Effect of the social distancing measures on the spread of COVID‐19 in 10 highly infected countries. Sci Total Environ. 2020;742:140430.
- 11. Soures N, Chambers D, Carmichael Z, et al. SIRNet: understanding social distancing measures with hybrid neural network model for COVID‐19 infectious spread. ArXiv preprint arXiv:2004.10376; 2020.
- 12. Gupta M, Jain R, Taneja S, Chaudhary G, Khari M, Verdú E. Real‐time measurement of the uncertain epidemiological appearances of COVID‐19 infections. Appl Soft Comput. 2021;101:107039.
- 13. Jha S, Kumar R, Son LH, et al. Neutrosophic soft set decision making for stock trending analysis. Evol Syst. 2019;10(4):621‐627.
- 14. Long HV, Son LH, Khari M, et al. A new approach for construction of geodemographic segmentation model and prediction analysis. Comput Intell Neurosci. 2019;2019:1‐10.
- 15. Khari M, Karar A. Analysis on intrusion detection by machine learning techniques: a review. Int J Adv Res Comput Sci Softw Eng. 2013;3(4):1‐4.
- 16. Pillai MS, Chaudhary G, Khari M, Crespo RG. Real‐time image enhancement for an automatic automobile accident detection through CCTV using deep learning. Soft Comput. 2021;1‐12:11929‐11940.
- 17. Organization WH, et al. Gender and COVID‐19: advocacy brief, 2020. Tech Rep. World Health Organization; 2020.
- 18. Halimu C, Kasem A, Newaz SS. Empirical comparison of area under ROC curve (AUC) and Mathew correlation coefficient (MCC) for evaluating machine learning algorithms on imbalanced datasets for binary classification. Proceedings of the 3rd International Conference on Machine Learning and Soft Computing; 2019;1‐6.
- 19. Gunawardana A, Shani G. A survey of accuracy evaluation metrics of recommendation tasks. J Mach Learn Res. 2009;10(12):2935‐2962.
- 20. Sajjadi MS, Bachem O, Lucic M, Bousquet O, Gelly S. Assessing generative models via precision and recall. ArXiv preprint arXiv:1806.00035; 2018.
- 21. Shardlow M. An analysis of feature selection techniques. Univ Manch. 2016;1(2016):1‐7.
- 22. Mello RF, Ponti MA. Machine Learning: a Practical Approach on the Statistical Learning Theory. Springer; 2018.
- 23. Chaudhary A, Kolhe S, Kamal R. An improved random forest classifier for multi‐class classification. Inf Process Agric. 2016;3(4):215‐222.
- 24. Gou J, Ma H, Ou W, Zeng S, Rao Y, Yang H. A generalized mean distance‐based k‐nearest neighbor classifier. Exp Syst Appl. 2019;115:356‐372.
- 25. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied Logistic Regression; 398. John Wiley & Sons; 2013.
- 26. Bentéjac C, Csörgő A, Martínez‐Muñoz G. A comparative analysis of gradient boosting algorithms. Artif Intell Rev. 2021;54(3):1937‐1967.
- 27. Tharwat A. Parameter investigation of support vector machine classifier with kernel functions. Knowl Inf Syst. 2019;61(3):1269‐1302.
- 28. Ara A, Louzada F. Alpha skew Gaussian Naïve Bayes classifier. Int J Inf Technol Decis Making. 2021;21(1):441‐462.
- 29. Taheri S, Mammadov M. Learning the naive Bayes classifier with optimization models. Int J Appl Math Comput Sci. 2013;23(4):787‐795.
- 30. Chen T, He T, Benesty M, et al. Xgboost: Extreme Gradient Boosting. R package version 0.4‐2. 2015;1‐4.
- 31. Yu Y, Zhong‐liang F, Xiang‐hui Z, Wen‐fang C. Combining classifier based on decision tree. Paper presented at: 2009 WASE International Conference on Information Engineering; 2, IEEE; 2009;37‐40.
- 32. Subasi A, Dammas DH, Alghamdi RD, et al. Sensor based human activity recognition using adaboost ensemble classifier. Proced Comput Sci. 2018;140:104‐111.
- 33. Rajaguru H, Prabhakar SK. Analysis of adaboost classifier from compressed EEG features for epilepsy detection. Paper presented at: 2017 International Conference on Computing Methodologies and Communication (ICCMC), IEEE; 2017;981‐984.
- 34. Verma A, Mehta S. A comparative study of ensemble learning methods for classification in bioinformatics. Paper presented at: 2017 7th International Conference on Cloud Computing, Data Science & Engineering‐Confluence, IEEE; 2017;155‐158.
- 35. Zhou ZH. Ensemble Learning in Machine Learning. Springer; 2021:181‐210.
- 36. Tuysuzoglu G, Birant D. Enhanced bagging (eBagging): a novel approach for ensemble learning. Int Arab J Inf Technol. 2020;17(4):515‐528.
- 37. Carreira‐Perpiñán MÁ, Zharmagambetov A. Ensembles of bagged TAO trees consistently improve over random forests, AdaBoost and Gradient Boosting in FODS; 2020;35‐46.
- 38. Gandhi I, Pandey M. Hybrid ensemble of classifiers using voting. Paper presented at: 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), IEEE; 2015;399‐404.
