Abstract
Coronavirus disease 2019 (COVID-2019) has been recognized as a global threat, and several studies are being conducted using various mathematical models to predict the probable evolution of this epidemic. These mathematical models, which are based on various factors and analyses, are subject to potential bias. Here, we propose a simple econometric model that could be useful to predict the spread of COVID-2019. We applied an Autoregressive Integrated Moving Average (ARIMA) model to the Johns Hopkins epidemiological data to predict the epidemiological trend of the prevalence and incidence of COVID-2019. For further comparison or for future perspective, case definition and data collection have to be maintained in real time.
Keywords: COVID-2019 epidemic, ARIMA model, Forecast, Infection control
Specifications Table
| Subject | Infectious Diseases |
| Specific subject area | Econometric models applied to infectious diseases epidemiological data to forecast the prevalence and incidence of COVID-2019 |
| Type of data | Chart, graph, figure |
| How data were acquired | Gretl 2019d http://gretl.sourceforge.net/win32/index_it.html |
| Data format | Data are in raw format and have been analyzed. An Excel file with data has been uploaded. |
| Parameters for data collection | The ARIMA models used were ARIMA (1,2,0) and ARIMA (1,0,4) |
| Description of data collection | The daily prevalence data of COVID-2019 from January 20, 2020 to February 10, 2020 were collected from the official website of Johns Hopkins University (https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html), and Excel 2019 was used to build a time-series database. Descriptive analysis of the data was performed. To evaluate the incidence of new confirmed cases of COVID-2019 and to prevent possible bias, the difference between the cases confirmed on a given day and the cases confirmed on the previous day was calculated, $\Delta(X_n - X_{n-1})$. |
| Data source location | University Campus Bio-Medico of Rome |
| Data accessibility | Raw data can be retrieved from the Github repository https://github.com/CSSEGISandData/COVID-19 |
Value of the Data
1. Data description
The daily prevalence data of COVID-2019 from January 20, 2020 to February 10, 2020 were collected from the official website of Johns Hopkins University (https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html), and Excel 2019 was used to build a time-series database [1]. The ARIMA model was applied to a dataset of 22 daily determinations. Fig. 1 shows that the overall prevalence of COVID-2019 followed an increasing trend that appears to be approaching the epidemic plateau. Descriptive analysis of the data was performed to evaluate the incidence of new confirmed cases of COVID-2019 and to prevent possible bias. The difference between the cases of one day and the cases of the previous day, $\Delta(X_n - X_{n-1})$, showed a nonconstant increase in the number of confirmed cases.
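The day-over-day difference described above can be sketched in a few lines of Python. The counts used here are made-up placeholder values, not the Johns Hopkins figures, and the function name `daily_increments` is hypothetical.

```python
from typing import List


def daily_increments(cumulative: List[float]) -> List[float]:
    """Return X_n - X_(n-1) for each consecutive pair of cumulative counts."""
    return [curr - prev for prev, curr in zip(cumulative, cumulative[1:])]


# Placeholder cumulative confirmed cases (illustrative only, not real data).
cumulative_cases = [100, 150, 230, 330, 420]
new_cases = daily_increments(cumulative_cases)
print(new_cases)  # [50, 80, 100, 90]
```

A series of N cumulative counts yields N-1 increments, so the incidence series is always one observation shorter than the prevalence series it is derived from.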
Fig. 1.
Correlogram and ARIMA forecast graph for the 2019-nCoV prevalence.
2. Experimental design, materials, and methods
The ARIMA family includes the autoregressive (AR) model, the moving average (MA) model, and the seasonal autoregressive integrated moving average (SARIMA) model [2]. The Augmented Dickey-Fuller (ADF) unit-root test [3] helps to assess whether the time series is stationary. Log transformation and differencing are the preferred approaches to stabilize the time series [4]. Seasonal and nonseasonal differences were used to stabilize the trend and periodicity.
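For reference, the textbook form of a nonseasonal ARIMA(p,d,q) model can be written with the backshift operator $B$, defined by $B X_t = X_{t-1}$. The symbols below are standard notation and are not taken from the article; $c$ is a constant and $\varepsilon_t$ is a white-noise error term.

```latex
\phi(B)\,(1 - B)^{d} X_t = c + \theta(B)\,\varepsilon_t,
\qquad
\phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^{i},
\qquad
\theta(B) = 1 + \sum_{j=1}^{q} \theta_j B^{j}
```

With $d = 1$, the factor $(1 - B) X_t$ reduces to $X_t - X_{t-1}$, which is the day-over-day difference used here to obtain the incidence from the prevalence.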
The parameters of the ARIMA model were estimated from the autocorrelation function (ACF) and partial autocorrelation function (PACF) correlograms. ARIMA (1,0,4) was selected as the best model for the prevalence of COVID-2019, while ARIMA (1,0,3) was selected as the best model for the incidence of COVID-2019. Gretl 2019d statistical software [5] was used to perform the statistical analysis on the prevalence and incidence datasets, and the statistical significance level was set at 0.05. A previous study was used as a reference for the methodology of the analysis [6].
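As an illustration of what a correlogram reports, the following sketch computes the sample ACF and derives the PACF from it with the Durbin-Levinson recursion. It is a plain-Python approximation for teaching purposes and is not the routine Gretl uses; the function names and the input series are hypothetical.

```python
import math
from typing import List


def sample_acf(x: List[float], max_lag: int) -> List[float]:
    """Sample autocorrelations r_0..r_max_lag (r_0 is always 1)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def sample_pacf(acf: List[float]) -> List[float]:
    """Partial autocorrelations from the ACF via Durbin-Levinson."""
    pacf = [1.0]
    prev: List[float] = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, len(acf)):
        num = acf[k] - sum(prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        curr = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        curr.append(phi_kk)
        pacf.append(phi_kk)
        prev = curr
    return pacf


# Placeholder series (illustrative only, not the study data).
series = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 5.0, 7.0]
acf = sample_acf(series, max_lag=3)
pacf = sample_pacf(acf)
bound = 1.96 / math.sqrt(len(series))  # approximate 95% significance band
for lag in range(1, 4):
    print(lag, round(acf[lag], 3), round(pacf[lag], 3), abs(acf[lag]) > bound)
```

In the usual reading of a correlogram, a PACF that cuts off after lag p suggests an AR(p) component, while an ACF that cuts off after lag q suggests an MA(q) component.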
Logarithmic transformation was performed to evaluate the influence of seasonality on the forecast. The correlograms reporting the ACF and PACF showed that neither the prevalence nor the incidence of COVID-2019 is influenced by seasonality. The forecasts of prevalence and incidence, with their 95% confidence intervals, are reported in Table 1.
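To make the forecasting step concrete, here is a deliberately simplified sketch of an ARIMA(1,d,0)-style point forecast: the series is differenced d times, an AR(1) coefficient is fitted by conditional least squares without a constant, and the forecast is integrated back to the original scale. This is an assumption-laden illustration, not the estimation performed in Gretl: it has no moving-average terms, no constant, and no confidence intervals, and the function name and data are hypothetical.

```python
from typing import List, Tuple


def forecast_ar1_differenced(
    x: List[float], d: int, steps: int
) -> Tuple[float, List[float]]:
    """Fit AR(1) to the d-times differenced series and forecast `steps` ahead."""
    # levels[i] holds the series differenced i times.
    levels = [list(x)]
    for _ in range(d):
        prev = levels[-1]
        levels.append([b - a for a, b in zip(prev, prev[1:])])
    w = levels[-1]

    # Conditional least squares for w_t = phi * w_(t-1) + e_t (no constant).
    phi = sum(a * b for a, b in zip(w, w[1:])) / sum(a * a for a in w[:-1])

    # Last observed value at each differencing level, updated step by step.
    last = [lvl[-1] for lvl in levels]
    forecasts = []
    for _ in range(steps):
        last[d] = phi * last[d]
        for i in range(d - 1, -1, -1):  # undo the differencing
            last[i] = last[i] + last[i + 1]
        forecasts.append(last[0])
    return phi, forecasts


# Placeholder cumulative counts (illustrative only, not the study data).
counts = [0.0, 1.0, 3.0, 7.0, 15.0]
phi, preds = forecast_ar1_differenced(counts, d=1, steps=2)
print(phi, preds)  # 2.0 [31.0, 63.0]
```

A full ARIMA fit would also estimate the moving-average terms and the forecast error variance, which is what produces interval estimates such as those in Table 1.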
Table 1.
Forecast values of the prevalence and incidence of COVID-2019 for the 2 days after the analysis.
| Series | Date | Forecast | 95% Confidence Interval |
|---|---|---|---|
| Prevalence | 11/02/2020 | 43599.71 | 42347.53–44851.9 |
| Prevalence | 12/02/2020 | 45151.45 | 42084.88–48218.02 |
| Incidence | 11/02/2020 | 2070.66 | 1305.23–2836.09 |
| Incidence | 12/02/2020 | 2418.47 | 1534.43–3302.51 |
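The intervals in Table 1 can be sanity-checked with a little arithmetic. The sketch below copies the table values, confirms that each forecast sits at the midpoint of its interval, and backs out the standard error each interval would imply. The 1.96 multiplier is an assumption (a symmetric normal-theory 95% interval); the article does not state how the intervals were constructed.

```python
Z_95 = 1.96  # assumed two-sided 95% normal quantile

# (series, date, forecast, lower bound, upper bound), copied from Table 1.
rows = [
    ("Prevalence", "11/02/2020", 43599.71, 42347.53, 44851.90),
    ("Prevalence", "12/02/2020", 45151.45, 42084.88, 48218.02),
    ("Incidence", "11/02/2020", 2070.66, 1305.23, 2836.09),
    ("Incidence", "12/02/2020", 2418.47, 1534.43, 3302.51),
]


def implied_standard_error(lower: float, upper: float, z: float = Z_95) -> float:
    """Standard error implied by a symmetric interval forecast +/- z * SE."""
    return (upper - lower) / (2.0 * z)


for series, date, forecast, lower, upper in rows:
    midpoint = (lower + upper) / 2.0
    se = implied_standard_error(lower, upper)
    print(f"{series} {date}: midpoint={midpoint:.2f}, implied SE={se:.2f}")
```

The interval for each series is wider on the second day than on the first, which reflects how forecast uncertainty accumulates with the horizon.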
Although more data are needed for a more detailed forecast, the spread of the virus seems to be slightly decreasing. Although the number of confirmed cases is still increasing, the incidence is slightly decreasing. If the virus does not develop new mutations, the number of cases should reach a plateau (Fig. 1, Fig. 2). The forecasts and estimates obtained are influenced by the “case” definition and by the modality of data collection. For further comparison or for future perspective, case definition and data collection must be maintained in real time.
Fig. 2.
Correlogram and ARIMA forecast graph for the 2019-nCoV incidence.
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.dib.2020.105340.
Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- 1. Johns Hopkins University Center for Systems Science and Engineering. 2019. https://github.com/CSSEGISandData/COVID-19
- 2. Fattah J., Ezzine L., Aman Z., El Moussami H., Lachhab A. Forecasting of demand using ARIMA model. Int. J. Eng. Bus. Manag. 2018;10.
- 3. Cao S., Wang F., Tam W., Tse L.A., Kim J.H., Liu J., Lu Z. A hybrid seasonal prediction model for tuberculosis incidence in China. BMC Med. Inf. Decis. Making. 2013;13:56. doi: 10.1186/1472-6947-13-56.
- 4. Cheung Y.-W., Lai K.S. Lag order and critical values of the augmented Dickey–Fuller test. J. Bus. Econ. Stat. 1995;13:277–280.
- 5. Baiocchi G., Distaso W. GRETL: econometric software for the GNU generation. J. Appl. Econom. 2003;18:105–110.
- 6. Wang Y.W., Shen Z.Z., Jiang Y. Comparison of ARIMA and GM(1,1) models for prediction of hepatitis B in China. PloS One. 2018;13:e0201987. doi: 10.1371/journal.pone.0201987.


