Abstract
A local outbreak of pneumonia of unknown cause was detected in Wuhan (Hubei, China) in December 2019. It was determined to be caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and the disease was named COVID-19. The outbreak has since spread all over the world, with a total of 120,815,512 cases and 2,673,308 deaths as of 16 March 2021. Health systems collapsed in many countries due to the pandemic, and social life was negatively affected in many countries. In such situations, it is very important to predict the load that will occur in the health system of a country. In this study, the COVID-19 prevalence of Turkey is inspected. The infected cases, the number of deaths, and the recovered cases in Turkey are predicted with Autoregressive Integrated Moving Average (ARIMA) and Artificial Neural Networks (ANN). The techniques are compared in terms of correlation coefficient and mean square error (MSE). The results showed that the techniques used are very successful in the estimation of prevalence in Turkey.
Keywords: ARIMA, ANN, COVID-19, Turkey, Forecasting
Introduction
The COVID-19 pandemic began as a virus outbreak in Wuhan, the capital of China’s Hubei province, on December 1, 2019. A newly observed coronavirus called SARS-CoV-2, which causes a pneumonia-like illness, was identified. The infectiousness of COVID-19, that is, the transmission of the virus from person to person, grew in mid-January 2020. In a very short time, cases started to be reported in various countries in Europe, North America, and Asia-Pacific. A global pandemic was declared by the World Health Organization (WHO) on March 11, 2020. On March 13, 2020, WHO reported that Europe had become the new epicenter of the coronavirus pandemic [1].
After the COVID-19 outbreak had spread all over the world, the first detected COVID-19 cases in Turkey were announced by the Ministry of Health on March 10, 2020. The first virus-related death in Turkey occurred on March 15, 2020. As of March 16, 2021, the total number of patients infected with the coronavirus in Turkey was announced as 2,911,642, while 29,623 people had died and 2,734,862 patients had recovered and gained immunity.
As the COVID-19 pandemic has spread over the world, the literature has grown rapidly with the studies of scientists, and COVID-19 research has received much attention. There are many studies in the literature about the medical aspects of COVID-19 [2–4]. In addition, sociological [5], economic [6], and statistical [7] investigations of COVID-19 have been carried out in many studies. Statistical studies generally focus on country-based forecasting of the pandemic. Besides many other countries [8–11], Turkey has also been investigated in some studies [12–14]. Statistical and forecasting studies have used different techniques such as time-series analyses, data mining techniques, growth models, nonlinear regression analysis, epidemiological models, and artificial intelligence (AI) techniques. ARIMA is one of the most effective time-series-based methods in COVID-19 forecasting studies, and many country-based applications in the COVID-19 literature have used it [8,9,11,15–19]. Furthermore, ANN has also been used for predicting the prevalence of COVID-19 in many studies and has been reported as a successful tool for prevalence prediction [10,20–23].
In this paper, we inspect the dynamics of COVID-19 prevalence in Turkey. Infected cases, number of deaths, and recovered cases are handled, and prediction models are built by using ARIMA and ANN. This paper is organized as follows: The first section gives a brief review of COVID-19 and of the literature on forecasting the prevalence of the COVID-19 pandemic. The second section examines the materials and methods used for the prediction of the prevalence of the pandemic. Results and discussions are given in the third section. Our conclusions are drawn in the final section.
Materials and methods
This section presents the two approaches, ARIMA and ANN, used in this study for forecasting the COVID-19 prevalence in Turkey.
ARIMA method
The ARIMA model, also known as the Box–Jenkins method, is one of the statistical methods used for future prediction. The Box–Jenkins method is used for the future prediction of univariate time series. It provides a systematic approach to building prediction models for discrete and stationary time series consisting of observations obtained at equal time intervals. That the series is discrete, stationary, and observed at equal time intervals are important assumptions of the Box–Jenkins method [24]. It is also a very effective tool for estimating time-series data. The ARIMA method combines AR (autoregressive) and MA (moving average) components to analyze data. ARIMA models are used for stationary time series; a non-stationary series is made stationary by differencing in the I (integrated, d) step. If the degree of the autoregressive parameter is p, the degree of differencing is d, and the degree of the moving average parameter is q, the model is called an Autoregressive Integrated Moving Average model of degrees (p, d, q) and is written as ARIMA (p,d,q) [24].
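The general expression of an ARIMA (p,d,q) model, written here in standard Box–Jenkins notation [24], is as follows:
$$\phi(B)\,\nabla^{d} z_t = \theta(B)\,a_t \tag{1}$$
where $z_t$ is the observed series, $a_t$ is the white-noise error term, $B$ is the backshift operator ($B z_t = z_{t-1}$), $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p}$ is the autoregressive polynomial, and $\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^{q}$ is the moving average polynomial.
If differencing makes the series stationary, the difference operator is applied as follows:
$$w_t = \nabla^{d} z_t = (1 - B)^{d} z_t \tag{2}$$
$\nabla$: difference operator,
$d$: degree of difference,
$w_t$: differenced series (for $d = 1$, $w_t = z_t - z_{t-1}$).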
The number of parameters to be estimated in the general ARIMA (p,d,q) model, which is used for forecasting series that do not show seasonal fluctuations, is the same as in the ARMA (p,q) model. In the ARIMA model, p or q can be zero. In this case, the model is reduced to either the MA or the AR model type, respectively.
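The differencing step and the reduction to a pure AR model can be illustrated with a minimal sketch. It is not the procedure used in this study (the models here were fitted in Minitab); it assumes a noise-free synthetic series, an AR(1) model without intercept on the first differences, and a plain least-squares estimate. The function names `difference`, `fit_ar1`, and `forecast_next` are hypothetical.

```python
# Minimal ARIMA (1,1,0) sketch using only the standard library.
# All data are synthetic and purely illustrative (assumption).

def difference(series, d=1):
    """Apply the difference operator d times: w_t = z_t - z_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def fit_ar1(w):
    """Least-squares estimate of phi in w_t = phi * w_{t-1} + a_t (no intercept)."""
    num = sum(prev * cur for prev, cur in zip(w, w[1:]))
    den = sum(prev * prev for prev in w[:-1])
    return num / den

def forecast_next(z, phi):
    """One-step-ahead forecast on the original (undifferenced) scale."""
    last_diff = z[-1] - z[-2]
    return z[-1] + phi * last_diff

# Non-stationary series whose first differences halve at every step.
z = [100.0, 116.0, 124.0, 128.0, 130.0, 131.0]
w = difference(z, d=1)   # stationary candidate series
phi = fit_ar1(w)         # autoregressive coefficient on the differences
print("differences:", w)
print("phi:", phi)
print("next value:", forecast_next(z, phi))
```

With q = 0 the moving average part vanishes, so only the autoregressive coefficient has to be estimated once the series has been differenced.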
Artificial Neural Networks
ANN is one of the highly effective and successful data mining techniques in the literature. ANN is an information processing method inspired by the human brain. The brain learns from experience, and ANN mimics the brain while processing data. It is classified as supervised or unsupervised learning according to whether the values of the output variable are known. Generally, ANN consists of some basic elements: input, hidden, and output layers. The input layer provides information to the network. The hidden layer constructs the nonlinear relations between input(s) and output(s) by adjusting weights, and this step is called learning. Layers consist of different numbers of neurons, and these neurons process the data via activation functions. Finally, the output layer gives the forecast.
It is proper to use ANN if there is no theoretical information about the functional form or the nonlinear structure of the model. In other words, ANN is a data-based technique rather than a model-based technique. The general architecture of ANN is given in Fig. 1. Data are generally split into subsets for training, testing, and validation purposes. In the training step, a network learns from the data. The validation step is used to decide when to stop the training process, while the prediction ability of a trained ANN is judged in the testing step [25].
A special type of feedforward neural network called multilayer perceptron (MLP) is used in this study. The MLP is the most widely used ANN model and generally contains one input layer, one output layer, and one or more hidden layers (Basheer and Hajmeer, 2000). Different training algorithms are used for MLP neural networks. The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm, which is usually used for nonlinear least squares problems, is one of these training algorithms, and a modified backpropagation algorithm combined with the BFGS algorithm is presented in [26]. Therefore, BFGS is preferred in this study.
Results and discussion
The daily confirmed COVID-19 data from March 31, 2020 to March 16, 2021 are retrieved from the website of the Turkish Ministry of Health (days for which data were not announced by the government are omitted). Population-related data are gathered from the website of the Turkish Statistical Institute. Only patients confirmed as positive by laboratory tests are considered infected cases by the Turkish government, and we use these data for the analyses. In this study, no primary data collection is undertaken, and no patients or members of the public were involved. Therefore, no formal ethical assessment or informed consent is needed. All anonymized data are collected from official websites.
Results of the ARIMA
In our study, the ARIMA method is used for the prediction of the daily number of infected cases, the daily number of deaths, and the daily number of recovered cases. A single ARIMA model cannot be generated for multiple outcomes. Therefore, we build three different ARIMA models by using Minitab 17.3.1 software. As the first step of the ARIMA procedure, the stationarity of the time series is checked. It is seen in Fig. 2 that our data are not stationary.
The daily number of infected cases is checked on the autocorrelation function (ACF) graph in Fig. 3. In the ACF graph, it is seen that the daily number of infected cases has a serial trend. Therefore, the data are preprocessed by differencing. Stationarity is achieved by the differenced time series, as seen in Fig. 4.
Deciding on the values of p and q is the crucial point in the ARIMA model, and these parameters affect the performance of the ARIMA model. In this study, ARIMA models are created with different combinations of p and q values, and their performances are compared. The ACF and Partial Autocorrelation Coefficient Function (PACF) graphs are plotted in Figs. 5 and 6 to choose the best performing ARIMA model. According to the ACF and PACF graphs, high autocorrelation is observed among the data.
The ARIMA (1,1,0) model gives the best forecasting results for the daily number of infected cases. The p-value obtained for the daily number of infected cases is 0.000, which is statistically significant. The same procedures are applied to forecasting the number of daily deaths and the number of daily recovered cases. Table 1 shows the Pearson correlation (R) value, sum of square error (SSE), MSE, and p-values obtained for all estimation parameters.
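To illustrate what the ACF check reveals, the following sketch computes sample autocorrelation coefficients for a synthetic trending series before and after first differencing. The data are a seeded random walk with drift generated only for this illustration (an assumption, not the Turkish case data), and `acf` is a hypothetical helper implementing the usual sample autocorrelation formula.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelation coefficients r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]

# Synthetic, illustrative data: a random walk with drift (non-stationary).
rng = random.Random(0)
level, series = 0.0, []
for _ in range(200):
    level += 1.0 + rng.gauss(0.0, 1.0)
    series.append(level)

diffed = [b - a for a, b in zip(series, series[1:])]  # first difference

raw_acf = acf(series, 5)
diff_acf = acf(diffed, 5)
print("raw series  :", [round(r, 2) for r in raw_acf])
print("differenced :", [round(r, 2) for r in diff_acf])
```

A slowly decaying ACF on the raw series is the typical signature of a trend, while coefficients close to zero after differencing suggest that the differenced series can be treated as stationary.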
Table 1.
Output | ARIMA (p,d,q) | R | SSE | MSE | p-Value |
---|---|---|---|---|---|
Daily number of infected cases | ARIMA (1,1,0) | 0.987 | 15,606,683 | 55,343 | 0.000 |
Daily number of deaths | ARIMA (0,1,1) | 0.996 | 13,600.8 | 48.2 | 0.000 |
Daily number of recovered cases | ARIMA (1,1,1) | 0.998 | 815,523,606 | 2,912,584 | 0.000 |
As can be seen from Table 1, the highest correlation value is obtained from the daily number of recovered cases. According to SSE and MSE values, it is seen that the minimum errors are observed in the estimation of the daily number of deaths.
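The three indicators reported in Table 1 can be computed as in the sketch below. It assumes that MSE is defined as SSE divided by the number of observations and that R is the Pearson correlation between actual and predicted values; the small data vectors are invented for illustration only.

```python
import math

def sse(actual, predicted):
    """Sum of squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def mse(actual, predicted):
    """Mean squared error, assuming MSE = SSE / n."""
    return sse(actual, predicted) / len(actual)

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Illustrative values only (not data from the study).
actual = [10.0, 12.0, 15.0, 11.0]
predicted = [11.0, 12.0, 13.0, 12.0]
print("SSE:", sse(actual, predicted))
print("MSE:", mse(actual, predicted))
print("R  :", round(pearson_r(actual, predicted), 4))
```

Because SSE and MSE depend on the scale of the series, they are not directly comparable between outputs with very different magnitudes, whereas R is scale-free.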
Results of ANN
We have a sample size of 285 days. The sample size is directly related to the generalization ability of ANN. With small sample sizes, ANN generally converges to local minima and yields poor generalization [27]. To overcome this issue, the data set is divided into three samples in this study: 70% training, 15% testing, and 15% validation. With the validation process, the generalization ability of the constructed networks is tested. A total of 500 different network architectures are run; each network is trained for up to 200 cycles, and training is stopped when the change in the error reaches 0.0000001. The ANN structures in our study are derived from regression-based time series analysis models.
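The 70/15/15 partition of the 285 daily records can be sketched as follows. The text does not state how days were assigned to the subsets, so the random assignment, the seed, and the rounding rule (any remainder goes to the validation set) are assumptions; `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n, train_pct=70, test_pct=15, seed=0):
    """Randomly partition range(n) into training, testing, and validation indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # assumption: random assignment of days
    n_train = n * train_pct // 100     # integer arithmetic avoids float rounding
    n_test = n * test_pct // 100
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    validation = idx[n_train + n_test:]  # remainder goes to validation
    return train, test, validation

train, test, validation = split_indices(285)
print(len(train), len(test), len(validation))
```

Under this rounding rule, 285 days yield 199 training, 42 testing, and 44 validation records.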
Inputs and outputs of the model are defined as below:
Inputs:
- Susceptible cases
- Days
- Curfews
- Laboratory tests
Outputs:
- Daily number of infected cases
- Daily number of recovered cases
- Daily number of deaths
To determine whether the input parameters are statistically significant for the ANN model, we check the p-values of the input variables. The p-values of the input parameters are found to be 0.048, 0.000, 0.000 and 0.000 for susceptible cases, days, curfews and laboratory tests, respectively. Our findings indicate that all input parameters are statistically significant for the ANN model, with p-values smaller than 0.05.
We calculate the susceptible case number for each day after the first recovered case was reported in Turkey. It is updated by the formulation . Non-pharmaceutical interventions such as school closures, travel restrictions, curfews, and quarantines were applied in Turkey during COVID-19. Travel restrictions, school closures, and quarantine policies were applied regularly after the first case was reported in Turkey. However, curfews varied during the pandemic. This is an important parameter that caused fluctuations in the number of infected people. Curfews are coded as 1 for the days on which they were applied and 0 for the other days. One of the most important issues for forecasting accuracy is that cases are confirmed and reported only after laboratory tests give positive results. Therefore, the number of laboratory tests is another critical parameter for our model.
On the other hand, the absence of multicollinearity among independent variables is an important assumption of regression-based approaches. To check this assumption, we analyze the independent variables to detect multicollinearity. The most common approach to detecting multicollinearity is the variance inflation factor (VIF). According to the rules of thumb, a VIF above 4 or above 10 indicates that multicollinearity among independent variables is a possible or a serious problem, respectively [28]. The VIF values of our independent variables range between 1 and 2; therefore, multicollinearity is not a problem among our independent variables. ANN analyses are carried out using the data mining module of the STATISTICA 10.0 software package.
The best network architecture is given in Table 2, and the SSE of the training, testing, and validation steps are given in Table 3.
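For readers who want to reproduce a VIF check, the sketch below uses the identity that the VIF of each variable equals the corresponding diagonal element of the inverse of the correlation matrix of the independent variables. It is a standard-library illustration with invented data, not the computation performed in the study; the helper names are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * u for v, u in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def vif(columns):
    """VIF_j = j-th diagonal element of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Illustrative columns only: these two are correlated with r = 0.8.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 3.0, 2.0, 4.0]
print([round(v, 3) for v in vif([x1, x2])])
```

For two variables this reduces to 1 / (1 - r²), so a correlation of 0.8 gives a VIF of about 2.78, still below the rule-of-thumb threshold of 4.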
Table 2.
Network | Training perf. | Testing perf. | Validation perf. | Training algorithm | Error function | Hidden activation | Output activation |
---|---|---|---|---|---|---|---|
MLP 5-10-3 | 0.98 | 0.98 | 0.98 | BFGS | SSE | Hyperbolic tangent | Logistic |
Table 3.
Error measure | Training error | Testing error | Validation error |
---|---|---|---|
SSE | 1,061,922 | 1,160,705 | 1,628,951 |
MSE | 3726.04 | 4072.65 | 5715.62 |
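As a quick arithmetic check on Table 3, the sketch below divides each reported SSE by the stated sample size of 285 days and compares the result with the reported MSE. That the MSE values correspond to SSE divided by the full sample size is an observation drawn from the reported figures, not a statement of the procedure used by the software.

```python
n_days = 285  # sample size stated in the text

# (SSE, MSE) pairs as reported in Table 3.
reported = {
    "training": (1_061_922, 3726.04),
    "testing": (1_160_705, 4072.65),
    "validation": (1_628_951, 5715.62),
}

for name, (sse_value, mse_reported) in reported.items():
    mse_from_sse = round(sse_value / n_days, 2)
    print(f"{name:<10} SSE/n = {mse_from_sse}  reported MSE = {mse_reported}")
```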
Pearson correlation coefficients are given in Table 4. As seen in Table 4, the daily number of deaths, the daily number of recovered cases, and the daily number of infected cases have high correlation coefficient values, which indicates that the developed model has acceptable generalization capability and accuracy for predicting the prevalence of the COVID-19 pandemic in Turkey.
Table 4.
R values of MLP 5-10-3 | Train | Test | Validation |
---|---|---|---|
Daily number of infected cases | 0.98 | 0.97 | 0.98 |
Daily number of deaths | 0.99 | 0.99 | 0.99 |
Daily number of recovered cases | 0.97 | 0.97 | 0.98 |
As depicted in Table 2, the best network is a multilayer perceptron network consisting of five input neurons (curfews are considered as 2 different neurons because of the categorical structure of the curfew data), a hidden layer with ten neurons, and three output neurons. The training algorithm is selected as BFGS. The activation functions are selected as hyperbolic tangent and logistic functions for the hidden layer and output layer, respectively. Furthermore, the selected network accurately predicts the daily number of infected cases, daily number of deaths, and daily number of recovered cases, as seen in Tables 3 and 4.
High correlation coefficients may raise the suspicion of a linear relation in the data or of poor generalization ability of the developed network. However, as can be seen from Figs. 7 to 9, the developed model is very successful in nonlinear estimation because the predicted and actual output curves overlap in the graphs. Figs. 7, 8, and 9 give the time series predictions for the three outputs.
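The MLP 5-10-3 structure described above can be sketched as a single forward pass. The weights below are random placeholders rather than the trained BFGS weights, and the ordering and scaling of the inputs are assumptions made only for this illustration; `one_hot_curfew` and `forward` are hypothetical names.

```python
import math
import random

def one_hot_curfew(flag):
    """Encode the binary curfew indicator as two input neurons."""
    return [1.0, 0.0] if flag == 1 else [0.0, 1.0]

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: tanh hidden layer followed by a logistic output layer."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
        for row, b in zip(w_out, b_out)
    ]

# Placeholder weights (assumption): random values, not the trained network.
rng = random.Random(0)
n_in, n_hidden, n_out = 5, 10, 3
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b_out = [rng.uniform(-1, 1) for _ in range(n_out)]

# Assumed input order and scaling to [0, 1]:
# susceptible cases, day, laboratory tests, then the two curfew neurons.
x = [0.9, 0.5, 0.3] + one_hot_curfew(1)
y = forward(x, w_hidden, b_hidden, w_out, b_out)
print("scaled outputs (infected, recovered, deaths):", [round(v, 3) for v in y])
```

Because the logistic output function only produces values between 0 and 1, the three targets have to be rescaled to that range before training and transformed back afterwards.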
Conclusion
The effect of the COVID-19 outbreak is growing steadily throughout the world. Forecasting the prevalence of the pandemic has become very important for the health systems of countries. Accurate forecasting provides insight for strengthening health systems and reallocating resources. In this manner, reliable prediction of the COVID-19 pandemic enables rapid responses, event-based political decisions, and projections of the future course of the pandemic. Deaths and health-system-related failures can thereby be minimized.
Time-series-based models such as ARIMA and ANN are very efficient tools for outbreak analysis. This study predicts the daily number of infected cases, daily number of deaths, and daily number of recovered cases for Turkey between 31 March 2020 and 16 March 2021 by ANN and ARIMA models. We compared the two techniques with statistical indicators such as MSE and SSE. According to the results, ARIMA and ANN have almost the same forecasting performance. ARIMA has high prediction accuracy in this study, and the R values are very high for predicting prevalence. Three different ARIMA models are developed by using different p, d, and q values. On the other hand, ARIMA cannot estimate multiple outputs simultaneously, while ANN can construct models that estimate three variables at the same time at an acceptable prediction level. Additionally, this study has highlighted the success of using artificial intelligence techniques in the estimation of pandemics.
For more precise estimation, data should be updated in real time, and new parameters that will affect the prevalence of the pandemic should be taken into account. Vaccination started on 14 January 2021 in Turkey, and it is expected that the prevalence of the pandemic will be affected by vaccination. Therefore, including the vaccination data announced by the authorities will be very effective for predicting the prevalence of the pandemic in Turkey in future work.
Funding
No funding sources.
Competing interests
None declared.
Ethical approval
Not required.
References
- 1. Vashist S.K. In vitro diagnostic assays for COVID-19: recent advances and emerging trends. Diagnostics. 2020;10(4):202. doi: 10.3390/diagnostics10040202.
- 2. Bai Y., Yao L., Wei T., Tian F., Jin D.-Y., Chen L., et al. Presumed asymptomatic carrier transmission of COVID-19. JAMA. 2020;323(14):1406–1407. doi: 10.1001/jama.2020.2565.
- 3. Mehta P., McAuley D.F., Brown M., Sanchez E., Tattersall R.S., Manson J.J. COVID-19: consider cytokine storm syndromes and immunosuppression. Lancet (London, England) 2020;395(10229):1033–1034. doi: 10.1016/S0140-6736(20)30628-0.
- 4. Rothan H.A., Byrareddy S.N. The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak. J Autoimmun. 2020;109. doi: 10.1016/j.jaut.2020.102433.
- 5. Lancker W.V., Parolin Z. COVID-19, school closures, and child poverty: a social crisis in the making. Lancet Public Health. 2020;5(5):e243–e244. doi: 10.1016/S2468-2667(20)30084-0.
- 6. Fernandes N. Economic effects of coronavirus outbreak (COVID-19) on the world economy (SSRN Scholarly Paper ID 3557504). Soc Sci Res Netw. 2020. doi: 10.2139/ssrn.3557504.
- 7. Roser M., Ritchie H., Ortiz-Ospina E. 2020. Coronavirus disease (COVID-19) - statistics and research. 45.
- 8. Al-qaness M.A.A., Ewees A.A., Fan H., Abd El Aziz M. Optimization method for forecasting confirmed cases of COVID-19 in China. J Clin Med. 2020;9(3):674. doi: 10.3390/jcm9030674.
- 9. Ceylan Z. Estimation of COVID-19 prevalence in Italy, Spain, and France. Sci Total Environ. 2020;729. doi: 10.1016/j.scitotenv.2020.138817.
- 10. Moftakhar L., Seif M., Safe M.S. Exponentially increasing trend of infected patients with COVID-19 in Iran: a comparison of neural network and ARIMA forecasting models. Iran J Public Health. 2020;49(Supple 1):92–100. doi: 10.18502/ijph.v49iS1.3675.
- 11. Perone G. 2020. An ARIMA model to forecast the spread and the final size of COVID-2019 epidemic in Italy. http://arxiv.org/abs/2004.00382 ArXiv:2004.00382 [q-Bio, Stat]
- 12. Arslan S., Ozdemir M.Y., Ucar A. Nowcasting and forecasting the spread of COVID-19 and healthcare demand in Turkey, a modelling study [Preprint]. Public Global Health. 2020. doi: 10.1101/2020.04.13.20063305.
- 13. Aslan Ibrahim H., Demir M., Wise M.M., Lenhart S. MedRxiv; 2020. Modeling COVID-19: forecasting and analyzing the dynamics of the outbreak in Hubei and Turkey. 2020.04.11.20061952.
- 14. Özdinç M., Şenel K., Öztürkcan S., Akgül A. Predicting the progress of COVID-19: the case for Turkey. Turk Klin J Med Sci. 2020;40(2):117–119. doi: 10.5336/medsci.2020-75741.
- 15. Benvenuto D., Giovanetti M., Vassallo L., Angeletti S., Ciccozzi M. Application of the ARIMA model on the COVID-2019 epidemic dataset. Data Brief. 2020;29. doi: 10.1016/j.dib.2020.105340.
- 16. Chakraborty T., Ghosh I. Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: a data-driven analysis. Chaos Solitons Fractals. 2020;135. doi: 10.1016/j.chaos.2020.109850.
- 17. Dehesh T., Mardani-Fard H.A., Dehesh P. MedRxiv; 2020. Forecasting of COVID-19 confirmed cases in different countries with ARIMA models. 2020.03.13.20035345.
- 18. Ding G., Li X., Shen Y., Fan J. MedRxiv; 2020. Brief analysis of the ARIMA model on the COVID-19 in Italy. 2020.04.08.20058636.
- 19. Gupta R., Pal S.K. MedRxiv; 2020. Trend analysis and forecasting of COVID-19 outbreak in India. 2020.03.26.20044511.
- 20. Distante C., Pereira I.G., Goncalves L.M.G., Piscitelli P., Miani A. MedRxiv; 2020. Forecasting Covid-19 outbreak progression in Italian regions: a model based on neural network training from Chinese data. 2020.04.09.20059055.
- 21. Ghazaly N.M., Abdel-Fattah M.A., El-Aziz A.A.A. Novel coronavirus forecasting model using nonlinear autoregressive artificial neural network. Int J Adv Sci Technol. 2020;29(5):19.
- 22. Hasan N. A methodological approach for predicting COVID-19 epidemic using EEMD-ANN hybrid model. Internet Things. 2020;11. doi: 10.1016/j.iot.2020.100228.
- 23. Tamang S.K., Singh P.D., Datta B. Forecasting of Covid-19 cases based on prediction using artificial neural network curve fitting technique. Global J Environ Sci Manag. 2020;6(Special Issue (Covid-19)). doi: 10.22034/GJESM.2019.06.SI.06.
- 24. Box G.E.P., Jenkins G.M., Reinsel G.C., Ljung G.M. John Wiley & Sons; 2015. Time series analysis: forecasting and control.
- 25. Yaghini M., Khoshraftar M.M., Fallahi M. A hybrid algorithm for artificial neural network training. Eng Appl Artif Intell. 2013;26(1):293–301. doi: 10.1016/j.engappai.2012.01.023.
- 26. Nawi N.M., Ransing M.R., Ransing R.S. vol. 1. 2006. An improved learning algorithm based on the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method for back propagation neural networks; pp. 152–157. (Sixth International Conference on Intelligent Systems Design and Applications).
- 27. Mao R., Zhu H., Zhang L., Chen A. vol. 1. 2006. A new method to assist small data set neural network learning; pp. 17–22. (Sixth International Conference on Intelligent Systems Design and Applications).
- 28. O’Brien R.M. A caution regarding rules of thumb for variance inflation factors. Qual Quant. 2007;41(5):673–690.