On the prediction of isolation, release, and decease states for COVID-19 patients: A case study in South Korea

Tarik Alafif; Reem Alotaibi; Ayman Albassam; Abdulelah Almudhayyani

doi:10.1016/j.isatra.2020.12.053

. 2021 Jan 5;124:191–196. doi: 10.1016/j.isatra.2020.12.053


PMCID: PMC7785285 PMID: 33451801

Abstract

The pandemic of the respiratory syndrome COVID-19 has become a serious public health issue. The COVID-19 virus has affected tens of millions of people worldwide. Some of them have recovered and been released, others have been isolated, and a few have unfortunately died. In this paper, we apply and compare different machine learning approaches, namely decision tree models, random forest, and multinomial logistic regression, to predict isolation, release, and decease states for COVID-19 patients in South Korea. The prediction can help health providers and decision makers distinguish the states of infected patients from their features early on, so that they can act by either releasing or isolating a patient after infection. The proposed approaches are evaluated using the Data Science for COVID-19 (DS4C) dataset. An analysis of the DS4C dataset is also provided. Experimental results and evaluation show that multinomial logistic regression outperforms the other approaches, with a state prediction accuracy of 95% and a weighted average F1-score of 95%.

Keywords: COVID-19, Prediction, Isolation, Release, Decease, Classification, Decision tree, Random forest, Multinomial logistic regression

1. Introduction

The COVID-19 pandemic first appeared in Wuhan, Hubei province, China, in December 2019. It has become a serious public health issue worldwide because it spreads quickly. COVID-19 is an infectious disease that directly affects people’s lungs. The virus that causes it belongs to the coronavirus family. It is believed to have been transmitted from animals to humans. The virus also spreads from person to person through the respiratory route, as has been observed worldwide. It may cause mild or severe symptoms in infected people as it replicates in their bodies. Infected people may develop symptoms such as dyspnea, fever, sore throat, cough, fatigue, and pneumonia. The disease is serious because it may cause death after the symptoms develop. Fortunately, many people have recovered and been released, while others are still isolated. The number of deaths is small compared with the number of recoveries.

Machine learning (ML) algorithms play an important role in today’s research because of their ability to automate complex tasks. These algorithms can learn from previous experience to predict future outcomes. Many ML classifiers have been developed and used. Some of the well-known ones are Decision Trees (DTs), Random Forest (RF), and Multinomial Logistic Regression (MLR).

The DT, also known as the classification and regression tree (CART), is a popular supervised machine learning approach for top-down predictive modeling [1]. It uses a simple tree structure to classify negative and positive examples recursively based on their input features. Various decision tree approaches have been proposed over the past years. ID3 [2], C4.5 [3] and CART [4] are the most common ones [5].

RF is an ensemble learning approach that generates a large number of decision trees, each trained on a different subset of the training data [6], [7]. These subsets are chosen by random sampling of the original training data. The final predictions are made by taking the majority vote of all individual classification trees. RF is a powerful algorithm in machine learning. One of its advantages is the ability to handle large datasets with a high-dimensional feature space. It can handle thousands of input variables and identify the most significant ones.

In addition to DT and RF, regression approaches are also commonly used to predict future events in many research fields. In the machine learning context, logistic regression is a well-known technique that predicts binary outcomes. On the other hand, MLR is an extended version of a binary logistic regression approach that allows performing multi-class prediction where prediction outcome can be either dichotomous or continuous [8].

Our work is motivated by the success of CART, RF, and MLR approaches in many predictive applications, such as vegetation distributions [9], microRNA precursors [10], and response variables [11]. In this paper, we propose to use them to predict isolation, release, and decease states for COVID-19 patients in South Korea. The proposed approaches are evaluated using the Data Science for COVID-19 (DS4C) dataset. An analysis of the DS4C dataset is provided. Experimental results and evaluation show that MLR outperforms the other approaches, with a state prediction accuracy of 95% and a weighted average F1-score of 95%.

The remainder of this paper is organized as follows. In Section 2, related work is reviewed. Section 3 introduces basic notations and mathematical models used throughout the paper. In Section 4, we briefly describe and analyze DS4C dataset. In Section 5, we present our work. In Section 6, experimental results and evaluation are provided using public DS4C dataset. We provide discussion details in Section 7. Finally, our conclusion and future work are provided in Section 8.

2. Related work

Many research works have been published recently to analyze and tackle the problem of COVID-19 virus in many fields.

Several methods were explored in [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24] for COVID-19 analysis and prediction. Fanelli and Piazza [12] analyzed and forecasted the spread of COVID-19 in China, Italy, and France. Nadia and Hazem [13] studied the effects of sex, region, infection reason, birth year, release date, and deceased date on the recovered and deceased cases of COVID-19 patients. Hu et al. [14] applied a stacked auto-encoder and K-means to model the transmission dynamics of the epidemic based on provinces and cities. The model was built to forecast the confirmed cases of COVID-19. Remuzzi and Remuzzi [15] predicted the number of infected patients in Italy using an exponential curve. Tátrai and Várallyay [16] attempted to predict the outcome of the COVID-19 epidemic in a region using a logistic model. Pedersen and Meneghini [17] analyzed and predicted the spread of the COVID-19 virus under the restrictions imposed to reduce the epidemic in Italy. Peng et al. [18] proposed a SEIR model to analyze the epidemic in China. Petropoulos and Makridakis [19] forecasted the confirmed cases, death cases, and recovery cases of COVID-19 using time series forecasting approaches. Chatterjee et al. [20] and Jia et al. [21] developed mathematical models to study the impact of the COVID-19 epidemic. Weissman et al. [22] proposed a non-learning predictive mathematical model, called SIR, to predict hospital capacity needs during the COVID-19 pandemic. However, the SIR model predicts only the epidemic spread over time for the whole population within a specific area. Yang et al. [23] used a long short-term memory neural network to predict the COVID-19 epidemic in China. Similar to [23], Vattay [24] attempted to predict the epidemic in Italy.

Unlike the aforementioned works, which tackle the COVID-19 problem from other perspectives, we apply and compare different machine learning approaches, namely CARTs, RF, and MLR, to predict isolation, release, and decease states for COVID-19 patients in South Korea. The proposed approaches are evaluated using the DS4C dataset to measure the performance of the predictions.

3. Preliminaries

Let $D$ be a set of patient records such that $D = \{X_{1}, X_{2}, \dots, X_{n}\}$, where $n$ is the number of patients. Each patient record $X_{i}$ belongs to one class $Y_{i}$ used for the classification task, such that $D = \{(X_{1}, Y_{1}), (X_{2}, Y_{2}), \dots, (X_{n}, Y_{n})\}$. Each patient record $X$ is composed of several input features such that $X = \{x_{1}, x_{2}, \dots, x_{m}\}$, where $m$ is the total number of patient features. Each feature then becomes a predictor variable of the class, so that a labeled record is $(x_{1}, x_{2}, \dots, x_{m}, Y)$, where $Y$ is the class variable taking values in the set of classes $C = \{1, 2, \dots, k\}$, $k$ is the total number of classes, and $j \in C$ denotes a single class.

Next, we show the basic mathematical equations used in the following approaches:

3.1. CARTs

In CARTs, the Gini impurity measures how well a split on an input feature separates the classes at each node [25], [26]. It is computed from the training examples of each class $j$ as shown in Eq. (1):

$$\mathrm{Gini}(D \mid x) = 1 - \sum_{j=1}^{k} P_{j}^{2} \tag{1}$$

where $x$, $k$ and $P_{j}$ are the examined feature, the number of classes and the probability of the $j$th class respectively.
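As a minimal sketch of Eq. (1), the impurity of a node can be computed from its class counts. The function name `gini_impurity` is an illustrative choice and not part of the original work; the counts used for the root node are the class frequencies reported for the DS4C dataset.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum_j P_j^2 for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # an empty node is treated as pure in this sketch
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Class counts reported for DS4C: isolated, released, deceased.
root_impurity = gini_impurity([2158, 2929, 78])
# A node that holds a single class has zero impurity.
pure_impurity = gini_impurity([0, 100, 0])
print(round(root_impurity, 4), pure_impurity)
```

A lower value indicates a purer node, which is why splits that reduce this quantity are preferred.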

The Gini Index is computed for each partition induced by an input feature, and the weighted average Gini Index for the examined feature $x$ is defined as:

$$\mathrm{Gini}(D \mid x) = \sum_{i=1}^{\mathrm{vals}(x)} \frac{|D_{i}|}{|D|} \, \mathrm{Gini}(D_{i} \mid x) \tag{2}$$

where $x$, $D_{i}$ and $\mathrm{vals}(x)$ are the examined feature, a data partition and the number of discrete values of feature $x$ respectively. The feature with the lowest Gini Index is chosen as the splitting feature.
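The selection rule in Eq. (2) can be sketched as follows. Each candidate feature is represented by the class counts of the partitions it induces; the feature names and counts below are hypothetical and serve only to illustrate the computation.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum_j P_j^2 for one partition."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def split_gini(partitions):
    """Weighted Gini Index: sum_i |D_i| / |D| * Gini(D_i)."""
    total = sum(sum(counts) for counts in partitions)
    return sum(sum(counts) / total * gini_impurity(counts) for counts in partitions)


def best_feature(candidates):
    """Return the feature whose split has the lowest weighted Gini Index."""
    return min(candidates, key=lambda name: split_gini(candidates[name]))


# Hypothetical two-class counts for the partitions induced by two features.
candidates = {
    "feature_a": [[40, 10], [10, 40]],  # fairly clean split
    "feature_b": [[25, 25], [25, 25]],  # uninformative split
}
chosen = best_feature(candidates)
print(chosen, round(split_gini(candidates[chosen]), 2))
```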

3.2. RF

An RF consists of a collection of decision trees $T = \{1, \dots, ntree\}$. Each decision tree is trained on a different subset of the data drawn from $D$ by bootstrap sampling (bagging). Final predictions are made by majority vote. The RF employs random feature selection at each node of every tree in the forest, meaning that the splitting feature $x$ is chosen from a random subset of the features.
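The three ingredients described above (bootstrap sampling, random feature selection per node, and majority voting) can be sketched in isolation. This is not the randomForest implementation used in the paper; the helper names, the record ids, and the feature count are assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) items with replacement (one bagging sample)."""
    return [rng.choice(records) for _ in records]


def random_feature_subset(num_features, mtry, rng):
    """Pick mtry distinct feature indices to be tested at one node."""
    return sorted(rng.sample(range(num_features), mtry))


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(1234)
records = list(range(20))  # hypothetical record ids
sample = bootstrap_sample(records, rng)
subset = random_feature_subset(num_features=11, mtry=3, rng=rng)
vote = majority_vote(["released", "isolated", "released"])
print(len(sample), subset, vote)
```

In a full forest, each tree would be grown on its own bootstrap sample, and a fresh feature subset would be drawn at every node before the best split among those features is selected.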

3.3. MLR

Logistic regression is a generalized linear model that is usually used to classify binary outcomes [8]. A ridge estimator has been proposed to improve its predictions [27]. Logistic regression is slightly modified to handle multi-class classification, which yields the MLR. When predicting $k$ outcome classes for $n$ instances with $m$ features, the parameter matrix of regression coefficients $B$ to be calculated is an $m \times (k - 1)$ matrix, and $e$ denotes the exponential function. The probability of class $j$, for every class except the last, is shown in Eq. (3):

$$P_{j}(X_{i}) = \frac{e^{X_{i} B_{j}}}{\sum_{l=1}^{k-1} e^{X_{i} B_{l}} + 1} \tag{3}$$

The last class probability is shown in Eq. (4):

$$1 - \sum_{j=1}^{k-1} P_{j}(X_{i}) = \frac{1}{\sum_{l=1}^{k-1} e^{X_{i} B_{l}} + 1} \tag{4}$$
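A small sketch of Eqs. (3) and (4) follows. It assumes that the coefficient matrix $B$ is stored as a list of $k - 1$ coefficient vectors, one per non-reference class, and that the last class acts as the reference; the function name and the numeric values are illustrative only.

```python
import math


def mlr_probabilities(x, B):
    """Class probabilities for one record x under Eqs. (3) and (4).

    x: list of m feature values.
    B: list of k - 1 coefficient vectors, each of length m.
    Returns k probabilities; the last entry belongs to the reference class.
    """
    scores = [math.exp(sum(xi * b for xi, b in zip(x, coeffs))) for coeffs in B]
    denominator = sum(scores) + 1.0
    probabilities = [score / denominator for score in scores]  # Eq. (3)
    probabilities.append(1.0 / denominator)                    # Eq. (4)
    return probabilities


# Hypothetical record with m = 2 features and k = 3 classes.
probs = mlr_probabilities([1.0, 2.0], [[0.5, -0.25], [0.1, 0.2]])
uniform = mlr_probabilities([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]])
print([round(p, 3) for p in probs])
```

With all coefficients set to zero, every class receives the same probability, which is a quick sanity check on the normalization.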

The (negative) multinomial log-likelihood is shown in Eq. (5):

$$L = -\sum_{i=1}^{n} \left\{ \sum_{j=1}^{k-1} Y_{ij} \ln\left(P_{j}(X_{i})\right) + \left(1 - \sum_{j=1}^{k-1} Y_{ij}\right) \ln\left(1 - \sum_{j=1}^{k-1} P_{j}(X_{i})\right) \right\} + \mathit{ridge} \cdot \left(B^{2}\right) \tag{5}$$
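The objective in Eq. (5) can be sketched under two assumptions that the text does not state explicitly: $Y_{ij}$ is taken to be a 0/1 indicator that record $i$ belongs to class $j$, and $B^{2}$ is taken to be the sum of squared coefficients. With one-hot indicators, the inner sum reduces to the log-probability of the true class. The function names and values below are illustrative.

```python
import math


def mlr_probabilities(x, B):
    """Class probabilities with the last class as the reference."""
    scores = [math.exp(sum(xi * b for xi, b in zip(x, coeffs))) for coeffs in B]
    denominator = sum(scores) + 1.0
    return [score / denominator for score in scores] + [1.0 / denominator]


def negative_log_likelihood(X, y, B, ridge):
    """Ridge-penalized negative multinomial log-likelihood.

    X: list of records, y: class index per record (k - 1 is the reference class),
    B: list of k - 1 coefficient vectors, ridge: penalty weight.
    """
    log_likelihood = 0.0
    for record, label in zip(X, y):
        log_likelihood += math.log(mlr_probabilities(record, B)[label])
    penalty = ridge * sum(b * b for coeffs in B for b in coeffs)
    return -log_likelihood + penalty


# Hypothetical data: two records, m = 2 features, k = 3 classes.
X = [[1.0, 2.0], [0.5, -1.0]]
y = [0, 2]
B = [[0.5, -0.25], [0.1, 0.2]]
loss_plain = negative_log_likelihood(X, y, B, ridge=0.0)
loss_ridge = negative_log_likelihood(X, y, B, ridge=1.0)
print(round(loss_plain, 4), round(loss_ridge, 4))
```

An optimizer such as a quasi-Newton method would minimize this quantity with respect to `B`; the sketch only evaluates it.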

3.4. Evaluation metrics

The accuracy for multi-class classification is defined as the proportion of instances correctly identified, as either true positives or true negatives, out of the total number of instances, as shown in Eq. (6). The error rate is the complement of accuracy.

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{6}$$

where $TP$, $TN$, $FP$ and $FN$ represent true positive, true negative, false positive and false negative respectively.

The F1-score is the harmonic mean of precision and recall, as shown in Eq. (7):

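$$\begin{aligned}
\text{precision} &= \frac{TP}{TP + FP}, \\
\text{recall} &= \frac{TP}{TP + FN}, \\
\text{F1-score} &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{aligned} \tag{7}$$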

where precision is the fraction of predicted positives that are correct and recall is the fraction of actual positives that are correctly identified. The weighted average F1-score is computed by averaging the F1-score class by class, with weights based on the frequency of each target class, as shown in Eq. (8). This evaluation metric accounts for the imbalanced distribution of classes when computing the F1-score.

$$\text{Weighted average F1-score} = \frac{\sum_{j=1}^{k} w_{j} \cdot \text{F1-score}_{j}}{\sum_{j=1}^{k} w_{j}} \tag{8}$$

where $k$ is the total number of target classes, $w_{j}$ is the weighting parameter of the $j$th class and $\text{F1-score}_{j}$ is the computed F1-score of the $j$th class.
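Eqs. (7) and (8) can be sketched from a confusion matrix as follows. The counts are the MLR matrix reported in Table 2, read under the assumption that rows are actual states and columns are predicted states (the row sums match the class frequencies of the dataset). Class frequencies are used as the weights $w_{j}$; the function names are illustrative.

```python
def per_class_f1(matrix):
    """F1-score of each class; matrix[a][p] counts actual class a predicted as p."""
    k = len(matrix)
    scores = []
    for j in range(k):
        tp = matrix[j][j]
        fn = sum(matrix[j]) - tp
        fp = sum(matrix[a][j] for a in range(k)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        scores.append(2 * precision * recall / denominator if denominator else 0.0)
    return scores


def weighted_f1(matrix):
    """Average of per-class F1-scores weighted by class frequency (Eq. (8))."""
    weights = [sum(row) for row in matrix]
    scores = per_class_f1(matrix)
    return sum(w * f for w, f in zip(weights, scores)) / sum(weights)


# MLR confusion matrix from Table 2 (order: deceased, isolated, released).
mlr_matrix = [
    [44, 14, 20],
    [15, 2042, 101],
    [19, 89, 2821],
]
mlr_weighted_f1 = weighted_f1(mlr_matrix)
print(f"{100 * mlr_weighted_f1:.2f}%")
```

Under these assumptions the result agrees with the weighted average F1-score reported for MLR in Table 1.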

4. DS4C dataset

DS4C [28] was made public on Kaggle on February 24th, 2020 in South Korea.¹ The dataset consists of different data in four sheets (case data, patient data, time-series data, and additional data). The case data describes COVID-19 infection cases. The patient data describes the epidemiological and route data of COVID-19 patients. The time-series data describes the status of COVID-19 patients over time. Additional data is reported for regions, weather, and population. Each sheet is provided as a CSV file. In our work, we use only the patient data file, since it contains the epidemiological data for COVID-19 patients in South Korea.

The patient data consists of 5,165 labeled patient records. Each patient record consists of several features such as sex, age, country, province, city, infection_case, infected_by, contact_number, symptom_onset_date, confirmed_date, released_date, deceased_date, and state. The state feature represents the label of each patient record. The dataset has three states: each record is labeled isolated, released, or deceased.

Fig. 1 shows the states versus the number of infection cases in the dataset. The figure also clearly shows the unbalanced distribution of samples in this dataset, which consists of 2,158 isolated, 2,929 released, and 78 deceased states. The sample distributions for the sex and age features are shown in Fig. 2 and Fig. 3 respectively. We notice that the number of infected males is close to the number of infected females. We also notice that the most infected age group is 20 to 29 years old.

Fig. 2 — The samples distribution for sex feature in DS4C dataset.

Fig. 3 — The samples distribution for age feature in DS4C dataset.

We have investigated the causes of COVID-19 infection among South Korean patients in this dataset. We have found that most patients were infected through the most frequent cause, “Contact patient”, followed by “Overseas inflow”. Fig. 4 clearly depicts the most common causes of COVID-19 infection among South Korean patients. Therefore, social distancing is needed to halt or at least slow the spread of the COVID-19 pandemic.

Fig. 4 — The causes for COVID-19 infection in South Korea according to the infection cases in DS4C dataset.

5. Our work

We propose to use DT models, RF, and MLR to classify the states (isolated, released, and deceased) of COVID-19 patients in South Korea. The DT models learn discriminative features from patient records to perform the prediction. The models predict the patient’s state based on several feature variables. We train our DT models using features such as sex, age, country, province, city, infection_case, contact_number, symptom_onset_date, confirmed_date, released_date, and deceased_date, with state as the class label. The patients’ case_id values are excluded. Before training the models, we pre-process the patients’ records to handle missing data and non-integer values. Missing data are replaced by $-1$.
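The pre-processing step can be sketched as below. Only the replacement of missing data by $-1$ is stated in the text; mapping each distinct non-integer value to an integer code in order of first appearance is an assumption made for illustration, as are the function name and the sample values.

```python
MISSING = -1  # placeholder for missing data, as stated in the text


def encode_column(values):
    """Encode one raw column of strings as integers.

    Missing entries (None or blank) become -1, integer strings are parsed,
    and any other value receives an integer code in order of first appearance
    (an assumed encoding scheme).
    """
    codes = {}
    encoded = []
    for value in values:
        if value is None or value.strip() == "":
            encoded.append(MISSING)
            continue
        try:
            encoded.append(int(value))
        except ValueError:
            encoded.append(codes.setdefault(value, len(codes)))
    return encoded, codes


# Hypothetical raw values for two columns.
sex_encoded, sex_codes = encode_column(["male", "", "female", None, "male"])
contact_encoded, _ = encode_column(["3", "", "17"])
print(sex_encoded, sex_codes, contact_encoded)
```

A column that mixes integers with text would need a separate code range to avoid collisions; the sketch assumes each column is either numeric or categorical.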

We apply different maximum DT depths to build our predictive models. The models are trained using the patients’ records. For the DT models, we use maximum depths of 3, 5, and 10. Fig. 5 shows the DT generated after training with a maximum depth of 3. Maximum depths of 5 and 10 generate very large trees with many branches, leaf nodes, and non-leaf nodes. Generalization capability is lost if the depth exceeds 10. The RF model is trained with the default parameters of the randomForest package in R [6]. These defaults are ntree, the number of classification trees (default value 500), and mtry, the number of features tested at each node (default value $\sqrt{m}$, where $m$ is the total number of features).

Fig. 5 — The generated DT architecture of our approach using the maximum DT depth of 3.

For the MLR model, the model is built using the Weka tool with a ridge estimator, which is considered a stable implementation. The predictive model is fed the full set of features. A Quasi-Newton method is used to perform the optimization. Ridge values of $1 \times 10^{8}$ are used in the log-likelihood calculation.

6. Experimental results and evaluation

Tree-based approaches are implemented in R using the rpart and randomForest packages [6], [29], while the MLR approach is built using version 3.8.4 of the Weka machine learning tool [30]. A single laptop with a 1.6 GHz dual-core Intel Core i5 CPU and 8 GB of RAM is used to train and test the models. 10-fold cross-validation is applied for training and testing the models on the DS4C dataset. The random seed is set to 1234.
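The evaluation protocol can be sketched as a seeded 10-fold split over record indices. This is a plain-Python illustration and will not reproduce the exact partitions produced by R or Weka, even with the same seed value; the function name is an illustrative choice.

```python
import random


def k_fold_indices(n, k=10, seed=1234):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 5,165 records, as in the DS4C patient data.
splits = list(k_fold_indices(5165))
fold_sizes = [len(test) for _, test in splits]
print(len(splits), min(fold_sizes), max(fold_sizes))
```

Each record appears in exactly one test fold, so every record contributes one prediction to the pooled confusion matrix.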

Table 1 shows the accuracy, error rate, and weighted average F1-score of the tested models. For the DT models, we choose three different tree depths: 3, 5, and 10. To examine the effect of the maximum tree depth on the experimental results, we plot a performance curve. Fig. 6 shows the model accuracy versus different tree depths. It clearly shows that the accuracy of the DT models stops improving at a maximum depth of 12.

Table 1.

Prediction accuracy, error rates and the weighted average F1-scores for the applied algorithms using DS4C dataset.

Algorithm	Accuracy	Error rate	Weighted average F1-score
DT (Depth = 3)	82.92%	17.08%	81.74%
DT (Depth = 5)	85.63%	14.37%	84.67%
DT (Depth = 10)	88.21%	11.79%	87.47%
RF	92.55%	7.45%	92.28%
MLR	95.00%	5.00%	95.00%


Fig. 6 — Maximum tree depth tuning based DT.

Fig. 7 shows the effect of the number of generated trees in the RF on the classification error rate. We can see that the error rate stabilizes as the number of trees increases. The black curve represents the out-of-bag (OOB) error rate and the other colors represent the misclassification error rate curves. The out-of-bag error is estimated internally while the DTs are built.

The confusion matrices obtained with 10-fold cross-validation are provided in Table 2. The table shows the true positive, true negative, false positive, and false negative test examples after applying each model. We notice that the DT models with different depths are unable to predict any true positives in the deceased state. This may be due to the small number of examples in the deceased state. The RF approach is able to classify 22 true positive examples from the deceased state. Lastly, the MLR model correctly classifies 44 true positive examples from the deceased state. Besides, the small number of false negatives is an encouraging result. From Table 1, the MLR approach outperforms the other approaches with a state prediction accuracy of 95%, a margin of 2.45% over the RF approach. To the best of our knowledge, there is no similar existing work to compare with ours, since the COVID-19 research area and the DS4C dataset are new.

Table 2.

Confusion matrices for actual versus predicted patients’ states.

DT (Depth = 3)

	Deceased	Isolated	Released
Deceased	0	0	0
Isolated	21	1,444	90
Released	57	714	2,839

DT (Depth = 5)

	Deceased	Isolated	Released
Deceased	0	0	0
Isolated	23	1,586	92
Released	55	572	2,837

DT (Depth = 10)

	Deceased	Isolated	Released
Deceased	0	0	0
Isolated	25	1,808	181
Released	53	350	2,748

RF

	Deceased	Isolated	Released
Deceased	22	18	38
Isolated	0	1,992	166
Released	0	163	2,766

MLR

	Deceased	Isolated	Released
Deceased	44	14	20
Isolated	15	2,042	101
Released	19	89	2,821
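As a consistency check, the accuracy of each model can be recomputed from its confusion matrix as the sum of the diagonal divided by the total count, which does not depend on whether rows hold actual or predicted states. The sketch below uses the counts from Table 2; the function and variable names are illustrative.

```python
def accuracy(matrix):
    """Fraction of records on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrices from Table 2 (order: deceased, isolated, released).
matrices = {
    "DT (Depth = 3)": [[0, 0, 0], [21, 1444, 90], [57, 714, 2839]],
    "DT (Depth = 5)": [[0, 0, 0], [23, 1586, 92], [55, 572, 2837]],
    "DT (Depth = 10)": [[0, 0, 0], [25, 1808, 181], [53, 350, 2748]],
    "RF": [[22, 18, 38], [0, 1992, 166], [0, 163, 2766]],
    "MLR": [[44, 14, 20], [15, 2042, 101], [19, 89, 2821]],
}

for name, matrix in matrices.items():
    print(f"{name}: {100 * accuracy(matrix):.2f}%")
```

The recomputed values agree with the accuracy column of Table 1 to two decimal places.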


7. Discussion

The experimental results presented above illustrate the ability of the proposed approaches to produce reliable predictions for COVID-19 patients who are isolated, released, or deceased. The predictive models were built using a small set of features, which gives rise to type I and type II errors, since the features cover only limited aspects of patients’ personal and infection factors. Nevertheless, these features allow quality predictions of a patient’s status even in the absence of medical-related aspects.

The proposed MLR model correctly predicts 96.3% and 94.6% of the released and isolated cases respectively, while it predicts only 56.4% of the decease instances, because such patients belong to the minority state in the dataset. The DS4C dataset has a significantly unbalanced class distribution, in which the decease state accounts for only 1.5% of the instances. Overall, a logistic regression model tends to produce more reliable classification outcomes when it deals with a small number of features and a large number of training examples, which is the case for the utilized dataset. Therefore, MLR provides the best performance in prediction accuracy and weighted average F1-score.

One can notice that the DT models with different depths fail to predict the instances of the decease state, as shown in Table 2. This is due to the small number of deceased state examples in the dataset, which lowers the performance of the DT models at every depth. However, the DT model with a depth of 3 achieves the best true positive rate of all models in the release state only.

On the other hand, the RF model is relatively competitive with the MLR model, trailing it by a margin of 2.45% in state prediction accuracy. However, the RF model has lower true positive rates than the MLR model in all states, as shown in Table 2. In general, based on the characteristics of the features and the corresponding experimental results on the DS4C dataset, we can conclude that the MLR model is more effective than the other classification models for predicting the states of COVID-19 patients.
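The per-state comparison between RF and MLR can be reproduced from Table 2 with a short sketch. It assumes that the rows of the RF and MLR matrices hold the actual states, which is consistent with their row sums matching the class frequencies (78, 2,158, and 2,929); the function name is illustrative.

```python
LABELS = ["deceased", "isolated", "released"]


def true_positive_rates(matrix, labels):
    """Per-class recall, with rows taken as actual states."""
    return {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}


# RF and MLR confusion matrices from Table 2.
rf_matrix = [[22, 18, 38], [0, 1992, 166], [0, 163, 2766]]
mlr_matrix = [[44, 14, 20], [15, 2042, 101], [19, 89, 2821]]

rf_rates = true_positive_rates(rf_matrix, LABELS)
mlr_rates = true_positive_rates(mlr_matrix, LABELS)
for label in LABELS:
    print(f"{label}: RF {100 * rf_rates[label]:.1f}%, MLR {100 * mlr_rates[label]:.1f}%")
```

Under this reading, the MLR rates match the 56.4%, 94.6%, and 96.3% quoted in the discussion, and the RF rate is lower in each state.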

Unfortunately, none of the other existing public COVID-19 datasets provides adequate features and labeling for predicting the states of isolation, recovery, and decease. Most current public COVID-19 datasets focus on statistical data about the number of deaths, the number of infections, infected areas, patients’ tweets, X-ray scans, and CT scans of chests and brains. Also, the current public DS4C dataset lacks some important features associated with the impact of the COVID-19 virus, such as patients’ vital signs, medication usage, and chronic diseases. The lack of COVID-19 data and of mechanisms for its collection and sharing is a limitation and a challenge in the COVID-19 research area.

8. Conclusion

This research work applies and compares different machine learning approaches, namely DT models, RF, and MLR, to predict isolation, release, and decease states for COVID-19 patients in South Korea. The proposed approaches are evaluated using the DS4C dataset. An analysis of the DS4C dataset is also provided. This study finds that contact with a patient is the most common cause of COVID-19 infection. Experimental results and evaluation show that MLR outperforms the other approaches, with a state prediction accuracy of 95% and a weighted average F1-score of 95%. The proposed machine learning models can help health providers and decision makers intervene early by either releasing or isolating a patient after infection.

Our observations of the South Korean public COVID-19 statistical dashboard of daily confirmed cases reveal that South Korea had its highest number of infections in February 2020, with 909 confirmed cases. Fortunately, the number of infections per day has dropped below 50 since the beginning of March 2020. Still, South Korea has a lower infection rate than other countries, which may be due to social distancing.

In the future, we plan to collect clinical data from local hospitals to analyze more important features and explore their patterns. These patterns may give a better understanding of the relations between the states. They may be beneficial in reducing the impact of the decease states. We also plan to use deep learning based methods such as deep neural networks to predict the state more accurately.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is funded by Research and Development Grants Program for National Research Institutions and Centers (GRANTS), Saudi Arabia, Target Research Program, Saudi Arabia, Infectious Disease Research Grant Program, Saudi Arabia, King Abdulaziz City for Science and Technology (KACST), Kingdom of Saudi Arabia, grant number (5-20-01-007-0001).

Footnotes

https://www.kaggle.com/kimjihoo/coronavirusdataset.

References

1. Rokach L., Maimon O.Z. World scientific; 2008. Data mining with decision trees: theory and applications, Vol. 69.
2. Quinlan R. Induction of decision trees. Mach Learn. 1986;1(1):81–106. doi: 10.1023/A:1022643204877.
3. Quinlan R. Morgan Kaufmann Publishers Inc.; San Francisco, CA, USA: 1993. C4.5: Programs for machine learning.
4. Breiman L., H. Friedman J., A. Olshen R., J. Stone C. Chapman and Hall; New York: 1984. Classification and regression trees.
5. Priyam A., Abhijeeta G., Rathee A., Srivastava S. Comparative analysis of decision tree classification algorithms. Int J Cur Eng Technol. 2013;3(2):334–337.
6. Liaw A., Wiener M. Classification and regression by randomforest. R News. 2002;2(3):18–22. URL https://CRAN.R-project.org/doc/Rnews/
7. Ho T.K. Third International Conference on Document Analysis and Recognition, ICDAR 1995, August 14 - 15, 1995, Montreal, Canada. Volume I. IEEE Computer Society; 1995. Random decision forests; pp. 278–282.
8. Hosmer D.W. 3rd ed. Wiley; Hoboken, N.J: 2013. Applied logistic regression. (Wiley series in probability and statistics).
9. Moore D., Lees B., Davey S. A new method for predicting vegetation distributions using decision tree analysis in a geographic information system. Environ Manag. 1991;15(1):59–71.
10. Jiang P., Wu H., Wang W., Ma W., Sun X., Lu Z. Mipred: classification of real and pseudo microrna precursors using random forest prediction model with combined features. Nucl Acids Res. 2007;35(suppl_2):W339–W344. doi: 10.1093/nar/gkm368.
11. Breiman L., Friedman J.H. Predicting multivariate responses in multiple linear regression. J R Stat Soc Ser B Stat Methodol. 1997;59(1):3–54.
12. Fanelli D., Piazza F. Analysis and forecast of COVID-19 spreading in China, Italy and France. Chaos Solitons Fractals. 2020;134 doi: 10.1016/j.chaos.2020.109761.
13. Nadia A.-R., Hazem A.-N. Data analysis of coronavirus CoVID-19 epidemic in South Korea based on recovered and death cases. J Med Virol. 2020 doi: 10.1002/jmv.25850.
14. Hu Z., Ge Q., Jin L., Xiong M. 2020. Artificial intelligence forecasting of covid-19 in china. arXiv preprint arXiv:2002.07112.
15. Remuzzi A., Remuzzi G. COVID-19 and Italy: what next? Lancet. 2020 doi: 10.1016/S0140-6736(20)30627-9.
16. Tátrai D., Várallyay Z. 2020. COVID-19 epidemic outcome predictions based on logistic fitting and estimation of its reliability. arXiv preprint arXiv:2003.14160.
17. Pedersen M.G., Meneghini M. 2020. Quantifying undetected COVID-19 cases and effects of containment measures in Italy. ResearchGate Preprint (online 21 March 2020) DOI 10.
18. Peng L., Yang W., Zhang D., Zhuge C., Hong L. 2020. Epidemic analysis of COVID-19 in China by dynamical modeling. arXiv preprint arXiv:2002.06563.
19. Petropoulos F., Makridakis S. Forecasting the novel coronavirus COVID-19. PLoS One. 2020;15(3) doi: 10.1371/journal.pone.0231236.
20. Chatterjee K., Chatterjee K., Kumar A., Shankar S. Healthcare impact of COVID-19 epidemic in India: A stochastic mathematical model. Med J Armed Forces India. 2020 doi: 10.1016/j.mjafi.2020.03.022.
21. Jia L., Li K., Jiang Y., Guo X., et al. 2020. Prediction and analysis of coronavirus disease 2019. arXiv preprint arXiv:2003.05447.
22. Weissman G.E., Crane-Droesch A., Chivers C., Luong T., Hanish A., Levy M.Z., Lubken J., Becker M., Draugelis M.E., Anesi G.L., et al. Locally informed simulation to predict hospital capacity needs during the COVID-19 pandemic. Ann Intern Med. 2020 doi: 10.7326/M20-1260.
23. Yang Z., Zeng Z., Wang K., Wong S.-S., Liang W., Zanin M., Liu P., Cao X., Gao Z., Mai Z., et al. Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions. J Thorac Dis. 2020;12(3):165. doi: 10.21037/jtd.2020.02.64.
24. Vattay G. 2020. Predicting the ultimate outcome of the COVID-19 outbreak in Italy. arXiv preprint arXiv:2003.07912.
25. Rokach L., Maimon O. World Scientific Publishing Co., Inc.; River Edge, NJ, USA: 2008. Data Mining with Decision Trees: Theory and Applications.
26. Raileanu L.E., Stoffel K. Theoretical comparison between the gini index and information gain criteria. Ann Math Artif Intell. 2004;41(1):77–93. doi: 10.1023/B:AMAI.0000018580.96245.c6.
27. le Cessie S., van Houwelingen J. Ridge estimators in logistic regression. Appl Stat. 1992;41(1):191–201.
28. Datartist. 2020. Data science for COVID-19 (DS4c) [Online; accessed 25-July-2020], https://www.kaggle.com/kimjihoo/coronavirusdataset.
29. Therneau T., Atkinson B., Ripley B. Rpart: Recursive partitioning and regression trees. r package version 4.1-8. R News. 2014 URL http://CRAN.R-project.org/package=rpart.
30. Frank E., Hall M.A., Witten I.H. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”. 4th ed. Morgan Kaufmann; 2016. The WEKA workbench.

