Abstract
This paper is devoted to modeling and predicting COVID-19 confirmed cases through a multiple linear regression. Especially, prediction intervals of the COVID-19 cases are extensively studied. Due to long-memory feature of the COVID-19 data, a heterogeneous autoregression (HAR) is adopted with Growth rates and Vaccination rates; it is called HAR-G-V model. Top eight affected countries are taken with their daily confirmed cases and vaccination rates. Model criteria results such as root mean square error (RMSE), mean absolute error (MAE), , AIC and BIC are reported in the HAR models with/without the two rates. The HAR-G-V model performs better than other HAR models. Out-of-sample forecasting by the HAR-G-V model is conducted. Forecast accuracy measures such as RMSE, MAE, mean absolute percentage error and root relative square error are computed. Furthermore, three types of prediction intervals are constructed by approximating residuals to normal and Laplace distributions, as well as by employing bootstrap procedure. Empirical coverage probability, average length and mean interval score are evaluated for the three prediction intervals. This work contributes three folds: a novel trial to combine both growth rates and vaccination rates in modeling COVID-19; construction and comparison of three types of prediction intervals; and an attempt to improve coverage probability and mean interval score of prediction intervals via bootstrap technique.
Keywords: COVID-19, HAR model, Prediction interval, Mean interval score, Bootstrap procedure
1. Introduction
The COVID-19 pandemic has lasted for more than a year although vaccinations against the COVID-19 has been recently distributed around the world. The severe virus is still threatening human health, social activity and global economy, despite struggles and efforts of experts to overcome the crisis. Many researchers have attempted to help with health systems of COVID-19. Statisticians and econometricians contribute to preventing the spread of COVID-19 through a variety of time series models, by modeling and forecasting the COVID-19 confirmed and death cases. Notable time series studies have emerged to model the COVID-19 data. From the beginning of Benvenuto et al. [1], who used an ARIMA (autoregressive integrated moving average) model, there have been so many ARIMA time series analyses for the COVID-19 datasets [2], [3], [4], [5], [6], [7], [8], [9], [10]. Adding sophisticated theories of probability distributions and differential equations, mathematical models have been developed by [12], [13], [14], [15], [16], [17], [18], [19], [11] for efficient predictions of the COVID-19 pandemic. In the area of artificial intelligence and machine learning, stochastic modelings for the COVID-19 epidemic disease have been conducted to support the forecast analyses [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31].
Among recent remarkable works of the COVID-19 modeling, we refer to Chowell and Luo [32], Bracher et al. [33], Ekinci [34], Hwang and Yu [35]: Chowell and Luo [32] introduced ensemble methodology for forecasting growth process of epidemic outbreaks using differential equations. They achieved the improved coverage rate and mean interval score of prediction intervals. Bracher et al. [33] combined multiple models to improve reliability of forecast outputs. Ekinci [34] adopted an ARMA-GARCH (generalized autoregressive conditional heteroscedasticity) process for modeling the growth rate of daily COVID-19 confirmed cases, considering conditional variance and asymmetric effect, while [35] employed a HAR (heterogeneous AR) model to capture the long memory embedded in the COVID-19 data. The GARCH and HAR models in Ekinci [34], Hwang and Yu [35], which are popular and robust time series models for financial volatility, still show good predictive performance on the COVID-19 time series data.
Vaccinations against the COVID-19 have been widely distributed in the world. According to Tregoning et al. [36] it is reported that the development and deployment of vaccines for the COVID-19 are a remarkable science success story: within 16 months of the first vaccine trials, 2.8 billion vaccine doses have been administered. Thanks to Mathieu et al. [37] a global database of COVID-19 vaccinations are available. Shim [38] evaluated the impact of COVID-19 vaccination program through a mathematical model based on age-structured classification, while [39] studied forecasting of the number of fully vaccinated people and examined future vaccination rate by the ARIMA model. Tregoning et al. [36] and Higdon [40] conducted systematic reviews of COVID-19 vaccines efficacy and effectiveness against the SARS-CoV-2 infection and disease. Vaccines and booster shots affect the new COVID-19 spread in the local region. Vaccination rates are one of the primary components that influence strongly the COVID-19 confirmed cases.
This work concerns itself with a time series analysis for the COVID-19 in the world. Vaccination rates are considered as a regressor variable in a multiple time series regression for the purpose of the COVID-19 forecast analysis. In addition, the growth rates of daily COVID-19 confirmed cases are adopted as another regression factor because the growth rates directly determine new numbers in the near future. The COVID-19 confirmed cases have long-memory property with algebraically decaying autocorrelation coefficients. As in Hwang and Yu [35], we use the HAR model that is an additive cascade regression model with a main feature of the long memory. Overall, to forecast the COVID-19 confirmed cases, this paper suggests a HAR model with Growth rates and Vaccination rates, called the HAR-G-V model, which is a form of multiple linear regression. The multiple linear regression and its statistical theories are well-known (for example, see Montgomery and Peck [41]). Their coefficient parameter estimation and related analysis are frequently used, as a standard-level study, to achieve statistically significant results.
The HAR model is a powerful time series model especially for realized volatility of financial asset, (see Corsi [42], Hong et al. [43], Hwang and Hong [44], Wilms et al. [45]). Its background is deferred to a recent work [35] of author, which is also a HAR model-application to the COVID-19 data. The HAR model has moving-averages as regressor variables: for example, in a HAR(3) model with lags (1,7,14), the previous value, 7-day moving average and 14-day moving average are used as explanatory variables. The conventional HAR model is a kind of a linear AR model: for example, the HAR(3) model with lags (1,7,14) is theoretically a linear AR(14) model. However, adding the growth rates to the HAR model results in a nonlinear AR model. Not only the averages but also the growth rates are important factors that determine the COVID-19 spread. In the notable work [34], the growth rate itself of the daily COVID-19 confirmed cases is used for time series modeling. In contrast, the current work uses both averages and growth rates, in a form of nonlinear AR model, and at the same time, in a multiple linear regression model. The modified (nonlinear) HAR() model is employed with one-day, one-week, two-week moving growth rates as well as one-day, one-week, two-week moving averages, with lags (1,7,14), if . In addition, two vaccination rates are imposed in our time series model: 1st and 2nd vaccination rates, which are different from country to country. The following three main issues: (i) reflecting long memory of COVID-19 data, (ii) adopting a nonlinear autoregressive effect with both moving averages and moving rates, and (iii) taking vaccination rates that play a decisive role in the COVID-19 situation, motivate us to choose the multiple regression of the HAR-G-V model in this study.
For the empirical analysis of the COVID-19 time series data in the world, top eight affected countries are taken: Brazil, France, Germany, India, Italy, Russia, UK and USA. Their daily COVID-19 confirmed cases and vaccination rates are analyzed through HAR-G-V models of order 2, 3 with lags (1,7), (1,7,14), respectively. In the HAR models with/without growth rates and vaccination rates, model criteria measures such as root mean square error (RMSE), mean absolute error (MAE), , AIC and BIC are evaluated for the comparison. In addition, these are compared to ARIMA models, which are popular in time series literature including COVID-19. The proposed HAR-G-V model outperforms other HAR models and ARIMA models with respect to RMSE and MAE. It is because the both rates have strong influences on the spread of COVID-19. The fitted HAR-G-V models are well-matched to real daily COVID-19 confirmed cases.
To investigate forecast performance of the HAR-G-V model, out-of-sample forecasting is conducted to evaluate RMSE, MAE, mean absolute percentage error (MAPE) and root relative square error (RRSE) of forecasts. As a main goal of this work, prediction intervals are established. Their empirical coverage probability, average length and mean interval score are reported. The mean interval score was assessed in Chowell and Luo [32], to account the forecasting performances across a diversity of epidemic datasets. The interval score was suggested by Gneiting and Raftery [46]. It is a scoring rule for prediction interval, which rewards for a narrow prediction interval and incurs a penalty if the observation misses the interval. See [46] for its details, who dealt with the interval score and other scoring rules of prediction intervals.
Based on the residuals of the HAR-G-V model, prediction intervals, (), are constructed by approximating the residual distribution. The following three approaches are attempted: (i) Approximation by normal distribution; it does not provide the best performance because residuals do not follow the normal distribution. (ii) Approximation by Laplace distribution; it seems to fit well the residual distribution and to improve the normal approximation by enhancing coverage probability in some countries. However, the Laplace distribution makes longer prediction interval and thus it has the worst mean interval score in the comparison of 95% prediction intervals. (iii) In order to improve the previous two approximations, the bootstrap technique is applied to the residuals to construct the bootstrap prediction interval.
The validity of the bootstrap technique in a regression model was initially established by Freedman [47]. Also, the bootstrap procedure was used by [48] to estimate the mean squared error of forecast for the AR() model. As a useful application of the bootstrap, Thombs and Schucany [49] dealt with the bootstrap prediction intervals for the AR models. For weak dependence structure in a network vector autoregression, Hong and Hwang [50] implemented the bootstrap prediction intervals using the stationary bootstrap and the classical residual bootstrap. In the current work, in order to construct a better prediction interval, the classical residual-bootstrap method is used. This residual resampling-technique not only improves empirical coverage but also shortens average length of prediction interval, and thus the proposed bootstrap prediction interval yields the best mean interval score. So as to find how much the bootstrap method can improve the normal and Laplace approaches, bootstrap improvement ratio (BIR) of the mean interval score is suggested. In cases of India, Italy and Russia, which have so long prediction intervals by the normal and Laplace approximations, their BIRs are evaluated as more than 0.7, which means that mean interval scores of the bootstrap prediction intervals can be enhanced by more than 70% from those of other two prediction intervals.
This research has contributions in three aspects: (i) This present work combines both growth rates and vaccination rates, for modeling and forecasting the COVID-19 pandemic, through a nonlinear type of HAR models. It provides results of well-fitting to actual daily COVID-19 confirmed cases and good criteria measures such as small values of RMSE and MAE. (ii) Forecasting accuracy performances of RMSE, MAE, MAPE and RRSE are evaluated with small values and, moreover, in behalf of more informative interval estimation, three types of prediction intervals are constructed and compared. (iii) Bootstrap method can improve prediction intervals not only by yielding the empirical coverage, close to the nominal coverage, but also by shortening average length and thus resulting in better mean interval score. Consequently, the combination of the HAR-G-V model and the bootstrap technique achieves an accurate prediction interval with the best mean interval score. An outline of this study is drawn in the flow chart of Fig. 1 .
Fig. 1.
Flow chart for this research of HAR-G-V model.
The rest of the paper is organized as follows: In Section 2, model and data are presented. In Section 3 empirical results of estimation and prediction are explored. Conclusion and discussion are stated in Section 4.
2. Model and data
2.1. HAR-G-V model for a triangular array
We first introduce the HAR-G-V model, which is used for a triangular array of observations. Suppose that a sample of size is observed, and the dataset is a triangular array with mean zero and finite variance, but it is not necessarily stationary. In the expression , we omit subscript and use in the model just for notational simplicity.
The HAR-G-V model of order , , is defined as follows:
| (1) |
where the HAR components are given by
with positive integers , and the growth rate components are given by log difference of shifted values:
| (2) |
with some constant , called a shift threshold, such that for all . The last two regressors and are used as the 1st and 2nd vaccination rates at time , respectively. Coefficients are parameters to be estimated for model-fitting and forecasting, and is an i.i.d. sequence of random variables with mean zero and finite variance.
The HAR-G-V model of order in (1) can be regarded as a nonlinear AR model of order with additional regression variables, which make the model a multiple regression form. As mentioned above, as explanatory variables, the model has moving averages and moving rates, both of which are functions of the observations . In other words, we may write
for some real-valued function defined on -dimensional real space , i.e., is a composite of algebraic additions, divisions and logarithmic function, along with parameters and . The shift threshold is estimated as any positive number only with condition for all in the logarithm function. Instead of a general function , we consider a specific semiparametric AR model with the form of (1).
In the HAR-G-V model (1), if , then model (1) is denoted by HAR-V model, with only vaccination rates, and if , then it is HAR-G model, with only growth rates. If all , then model (1) is the HAR() model, which is used in the analysis of [35].
In the next section of an empirical study, the four models; HAR, HAR-V, HAR-G and HAR-G-V models of order , are compared together, by applying eight datasets of COVID-19 cases. Order is chosen as , and lags ( are (1,7) for and (1,7,14) for . According to Linton et al. [51] the incubation period of the COVID-19 is 14 to 21 days and so the recent work [35] of author dealt with (1,7,14) for and (1,7,14,21) for . However, since this current work handles additionally growth rates and vaccination rates, (if we use , then there are 10 coefficients in the multiple regression), and also since the results of order are expected to be similar to those of order , we omit an empirical study for .
2.2. COVID-19 data
This work considers the new daily COVID-19 confirmed cases in top eight affected countries with their growth rates and vaccination rates. The daily confirmed cases (7-day smoothed) and vaccination rates are taken from Our World in Data in https://ourworldindata.org/covid-vaccinations, (see Mathieu et al. [37]). The confirmed cases are applied by 7-day moving-averaging as in Ekinci [34] so as to smoothen the daily data. Eight countries: Brazil, France, Germany, India, Italy, Russia, UK and USA, are chosen to apply the HAR models and to construct the prediction intervals as a main goal. Statistics summaries of daily confirmed cases (7-day smoothed) are displayed in Table 1 along with those of their one-day, one-week and two-week growth rates.
Table 1.
Statistics of daily confirmed cases (7-day smoothed) and their growth rates; size is the period from the starting date in 2021 to 08/24/2021; SD = standard deviation.
| Brazil | France | Germany | India | Italy | Russia | UK | USA | |||
|---|---|---|---|---|---|---|---|---|---|---|
| Starting | 02/09 | 01/26 | 01/16 | 02/15 | 01/18 | 03/02 | 01/10 | 01/19 | ||
| Size | 190 | 204 | 214 | 184 | 212 | 169 | 220 | 211 | ||
| Daily confirmed cases | Mean | 56917.9 | 18765.2 | 8429.5 | 116687.6 | 9691.6 | 14028.4 | 14912.9 | 60591.1 | |
| SD | 13699.6 | 10707.5 | 6303.9 | 117743.3 | 6890.7 | 6435.6 | 13467.5 | 40198.2 | ||
| Min | 28337.6 | 1840.6 | 538.0 | 12751.6 | 639.3 | 8016.4 | 1517.7 | 11299.3 | ||
| Median | 60438.8 | 20287.4 | 7864.8 | 50359.8 | 9430.5 | 9488.6 | 8662.2 | 57301.1 | ||
| Max | 77327.6 | 45641.6 | 21595.7 | 391232.0 | 23211.9 | 24490.6 | 48675.4 | 171103.9 | ||
| Skewness | 0.5867 | 0.0849 | 0.4389 | 1.2000 | 0.3819 | 0.5397 | 0.8334 | 0.8423 | ||
| Kurtosis | 0.7192 | 0.6760 | 1.0002 | 0.0274 | 1.0363 | 1.5331 | 0.5727 | 0.0264 | ||
| One-day growth rate | Mean | 0.0027 | 0.0003 | 0.0033 | 0.0050 | 0.0031 | 0.0038 | 0.0018 | 0.0006 | |
| SD | 0.0386 | 0.0815 | 0.0683 | 0.0428 | 0.0451 | 0.0163 | 0.0737 | 0.0460 | ||
| Min | 0.2419 | 0.3246 | 0.1898 | 0.1225 | 0.1636 | 0.0295 | 0.5146 | 0.1538 | ||
| Median | 0.0002 | 0.0017 | 0.0032 | 0.0013 | 0.0047 | 0.0010 | 0.0030 | 0.0086 | ||
| Max | 0.2165 | 0.2546 | 0.2958 | 0.1337 | 0.1284 | 0.0706 | 0.5197 | 0.1960 | ||
| Skewness | 0.5975 | 0.3307 | 0.02783 | 0.2650 | 0.4591 | 1.5086 | 0.1375 | 1.0606 | ||
| Kurtosis | 13.4556 | 2.4453 | 1.6424 | 0.4308 | 1.1372 | 2.6062 | 0.5964 | 2.9285 | ||
| One-week growth rate | Mean | 0.0173 | 0.0057 | 0.0265 | 0.0353 | 0.0232 | 0.0309 | 0.0109 | 0.0035 | |
| SD | 0.1298 | 0.3277 | 0.3176 | 0.2838 | 0.2847 | 0.1006 | 0.3320 | 0.2420 | ||
| Min | 0.3572 | 0.7531 | 0.7623 | 0.4066 | 0.5087 | 0.1171 | 0.9072 | 0.4321 | ||
| Median | 0.0070 | 0.0234 | 0.0027 | 0.0531 | 0.0147 | 0.0072 | 0.0481 | 0.0379 | ||
| Max | 0.2275 | 1.1725 | 0.5079 | 0.5356 | 0.7688 | 0.3528 | 0.6126 | 0.6937 | ||
| Skewness | 0.6097 | 0.8542 | 0.2229 | 0.2691 | 0.6975 | 1.5848 | 0.3378 | 0.7216 | ||
| Kurtosis | 0.2087 | 1.2571 | 0.7937 | 1.0847 | 0.3750 | 1.6817 | 0.9006 | 0.2771 | ||
| Two-week growth rate | Mean | 0.0391 | 0.0141 | 0.0652 | 0.0722 | 0.0471 | 0.0693 | 0.0154 | 0.0005 | |
| SD | 0.1889 | 0.6056 | 0.5886 | 0.5626 | 0.5492 | 0.1876 | 0.5574 | 0.4625 | ||
| Min | 0.4629 | 1.1577 | 1.2154 | 0.7770 | 0.9424 | 0.1296 | 1.2253 | 0.6608 | ||
| Median | 0.0260 | 0.0170 | 0.0161 | 0.0896 | 0.0907 | 0.0020 | 0.1433 | 0.1470 | ||
| Max | 0.3476 | 1.6405 | 0.8989 | 1.0268 | 1.3463 | 0.5938 | 1.0605 | 1.0772 | ||
| Skewness | 0.2535 | 0.7179 | 0.2469 | 0.2241 | 0.6688 | 1.4191 | 0.4948 | 0.7940 | ||
| Kurtosis | 0.4413 | 0.4999 | 1.0432 | 1.1951 | 0.0393 | 0.7863 | 1.0508 | 0.4554 |
Because the vaccine supply differs from country to country, the sample size is different for each country. More specifically, the start date in the sample period is set as the first day of the 2nd vaccination, while the end date is all 08.24.2021. The size in each country, shown in Table 1, is the length between the start date and 08.24.2021.
As introduced above, model (1) is assumed to have mean zero, and so we adopt as standardized values of COVID-19 confirmed cases for each country. In other words, , for , where is daily COVID-19 confirmed case at time , and are its sample mean and sample standard deviation, respectively. Note that generates a triangular array with mean zero and variance one, (but not stationary).
With lags (1,7,14) for , the growth rates in (2) are computed as one-day, one-week and two-week growth rates of daily numbers shifted by . The shift threshold may be chosen as some positive number so that for all . Either , for any , or are possible. Here we select threshold because it implies that
which is log difference of original COVID-19 confirmed cases . These growth rates are depicted in Fig. 2 , where we see one-day, one-week, two-week growth rates of the eight countries. Positive values during a time period imply that the numbers of COVID-19 confirmed cases increase during the period; for example, in India, during the period before 60 on time axis, growth rates are positive, which means new daily COVID-19 confirmed numbers increase, while during the period between 75 and 140, growth rates are negative, which means new daily COVID-19 numbers decrease. These facts are also shown on the plot of India in Fig. 5. In Russia, between 80 and 110 on time axis, growth rates are positive, which indicates increasing COVID-19 during the period, as seen on the plot of Russia in Fig. 5. In Fig. 3 , the 1st and 2nd vaccination rates of the eight countries are shown, until date 08.24.2021. India and Russia have low vaccination rates, whereas Italy, UK and USA have high rates. Data in Figs. 2 and 3 are used as regressor variables in the HAR-G-V model of a triangular array.
Fig. 2.
Growth rates of daily COVID-19 confirmed cases (7-day smoothed).
Fig. 5.
Daily confirmed cases (7-day smoothed) and HAR-G-V model fitting with its residuals.
Fig. 3.
Vaccination rates.
A main characteristic of the HAR model is the long memory, which means that historical observation has a persistent impact on future value. The long memory can be identified by autocorrelation coefficient functions (ACFs) with algebraically decay rates. In order to see whether the COVID-19 data are suitable for the HAR model, their sample ACFs are illustrated in Fig. 4 . The COVID-19 confirmed cases have the long-memory effects as seen in Fig. 4, where the sample ACFs of the confirmed cases are depicted for the eight countries. The sample ACFs are decreasing very slowly. This feature leads us to adopt the HAR models for modeling the COVID-19 pandemic time series. In the next section, we discuss empirical results of the COVID-19 confirmed cases with their estimation and forecasting through the HAR models with/without the growth rates and vaccination rates. Because the proposed models are a type of multiple linear regression model, which is popular and standard in statistical analysis, and also because earlier works [43], [44] dealt with a simulation study in the HAR model, in this work, the simulation is omitted and real data analysis only is discussed.
Fig. 4.
Sample autocorrelation coefficient function (ACF) of daily COVID-19 confirmed cases (7-day smoothed).
3. Empirical study
3.1. Parameter estimation
This section presents the empirical results of the proposed models with daily confirmed cases and vaccination rates. Four models; HAR, HAR-V, HAR-G and HAR-G-V models, are compared by reporting some criteria measures of the parameter estimation. Additionally, these results are compared with those by popular ARIMA models.
Suppose the observations of size are relabelled by , where is chosen so that , with , 14, for , 3, respectively. We use the least squares method that minimizes the sum of squared errors (SSE):
The regression model in (1) is written as where and . The ordinary least squared (OLS) estimator is obtained by minimizing the SSE and given as
To see the performance of the OLS regression, we compute criteria measures such as root mean square error (RMSE), mean absolute error (MAE), , AIC and BIC.
Let . Using the residuals , RMSE and MAE are given by
Also, , AIC and BIC are given by
where is the log-likelihood function and is the number of parameters.
For the HAR(-G-V) models with order , these performance measures in the eight countries are reported in Tables 2 and 3 . In order to compare the results with existing model, Table 4 gives the performances by ARIMA models. In the columns of RMSE and MAE of Table 2, Table 3, Table 4, the values in the parentheses are the errors multiplied by the standard deviations of the daily confirmed cases in each country, where the standard deviations are given in Table 1. Because HAR and HAR-V models have fewer parameters, AIC and BIC values in these models are somewhat smaller than those in the HAR-G and HAR-G-V models.
Table 2.
Criteria of HAR() models with/without growth rates and vaccination rates; Brazil, France, Germany and India.
| RMSE | MAE | AIC | BIC | |||||
|---|---|---|---|---|---|---|---|---|
| Brazil | HAR | 0.14876(2038.06) | 0.09671(1324.91) | 0.979 | 167.2 | |||
| HAR-V | 0.14634(2004.85) | 0.09508(1302.64) | 0.980 | 156.3 | ||||
| HAR-G | 0.14782(2025.06) | 0.09731(1333.16) | 0.979 | 165.5 | 152.8 | |||
| HAR-G-V | 0.14493(1985.55) | 0.09363(1282.75) | 0.980 | 168.4 | 149.4 | |||
| HAR | 0.14864(2036.32) | 0.09726(1332.36) | 0.979 | 165.5 | ||||
| HAR-V | 0.14582(1997.74) | 0.09575(1311.75) | 0.980 | 152.4 | ||||
| HAR-G | 0.14698(2013.58) | 0.09684(1326.60) | 0.979 | 163.5 | 144.5 | |||
| HAR-G-V | 0.14473(1982.75) | 0.09437(1292.96) | 0.980 | 164.9 | 139.5 | |||
| France | HAR | 0.18192(1948.01) | 0.09762(1045.33) | 0.969 | 104.4 | 97.88 | ||
| HAR-V | 0.17893(1915.97) | 0.09195(984.65) | 0.970 | 106.7 | 93.69 | |||
| HAR-G | 0.17065(1827.26) | 0.09503(1017.59) | 0.973 | |||||
| HAR-G-V | 0.17042(1824.78) | 0.09471(1014.13) | 0.973 | 121.2 | 101.7 | |||
| HAR | 0.17907(1917.40) | 0.09266(992.16) | 0.970 | 108.4 | 98.65 | |||
| HAR-V | 0.17779(1903.79) | 0.08936(956.84) | 0.971 | 107.1 | 90.87 | |||
| HAR-G | 0.17061(1826.83) | 0.09438(1010.62) | 0.973 | |||||
| HAR-G-V | 0.17040(1824.58) | 0.09428(1009.51) | 0.973 | 117.3 | 91.27 | |||
| Germany | HAR | 0.10697(674.34) | 0.05891(371.62) | 0.989 | 322.5 | 315.9 | ||
| HAR-V | 0.10508(662.44) | 0.05684(358.35) | 0.989 | 325.6 | 312.4 | |||
| HAR-G | 0.10205(643.37) | 0.05991(377.65) | 0.990 | |||||
| HAR-G-V | 0.10146(639.62) | 0.05940(374.49) | 0.990 | 335.6 | 315.9 | |||
| HAR | 0.10563(665.93) | 0.05684(358.29) | 0.989 | 325.5 | ||||
| HAR-V | 0.10475(660.37) | 0.05563(350.72) | 0.989 | 324.9 | 308.4 | |||
| HAR-G | 0.10194(642.59) | 0.05952(375.18) | 0.990 | 314.0 | ||||
| HAR-G-V | 0.10124(638.23) | 0.05919(373.15) | 0.990 | 332.5 | 306.1 | |||
| India | HAR | 0.01323(1558.30) | 0.00919(1082.32) | 1.000 | 984.0 | |||
| HAR-V | 0.01303(1533.87) | 0.00896(1055.56) | 1.000 | 972.9 | ||||
| HAR-G | 0.01318(1552.68) | 0.00922(1085.87) | 1.000 | 981.3 | 968.7 | |||
| HAR-G-V | 0.01299(1530.08) | 0.00903(1062.89) | 1.000 | 982.2 | 963.4 | |||
| HAR | 0.01186(1396.15) | 0.00783(921.56) | 1.000 | 1019 | ||||
| HAR-V | 0.01181(1390.10) | 0.00776(913.84) | 1.000 | 1017 | 1001 | |||
| HAR-G | 0.01139(1340.87) | 0.00744(875.97) | 1.000 | 1008 | ||||
| HAR-G-V | 0.01125(1324.41) | 0.00731(860.22) | 1.000 | 1002 |
Table 3.
Criteria of HAR() models with/without growth rates and vaccination rates; Italy, Russia, UK and USA.
| RMSE | MAE | AIC | BIC | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Italy | HAR | 0.03619(249.42) | 0.01897(130.77) | 0.999 | 748.3 | ||||
| HAR-V | 0.03608(248.61) | 0.01882(129.71) | 0.999 | 745.6 | 732.5 | ||||
| HAR-G | 0.03565(245.71) | 0.01891(130.32) | 0.999 | 737.1 | |||||
| HAR-G-V | 0.03553(244.85) | 0.01881(129.64) | 0.999 | 747.7 | 727.9 | ||||
| HAR | 0.03613(248.97) | 0.01888(130.15) | 0.999 | 747.0 | |||||
| HAR-V | 0.03603(248.27) | 0.01866(128.57) | 0.999 | 744.2 | 727.7 | ||||
| HAR-G | 0.03541(244.01) | 0.01887(130.06) | 0.999 | 729.3 | |||||
| HAR-G-V | 0.03530(243.28) | 0.01871(128.96) | 0.999 | 746.2 | 719.9 | ||||
| Russia | HAR | 0.01644(105.83) | 0.01193(76.79) | 1.000 | 829.5 | 823.4 | |||
| HAR-V | 0.01628(104.77) | 0.01203(77.43) | 1.000 | 828.6 | 816.5 | ||||
| HAR-G | 0.01585(101.98) | 0.01218(78.38) | 1.000 | 827.0 | |||||
| HAR-G-V | 0.01548(101.35) | 0.01224(78.79) | 1.000 | 816.7 | |||||
| HAR | 0.01629(104.84) | 0.01198(77.07) | 1.000 | 803.5 | 821.3 | ||||
| HAR-V | 0.01615(103.98) | 0.01207(77.68) | 1.000 | 829.0 | 813.8 | ||||
| HAR-G | 0.01534(98.74) | 0.01184(76.21) | 1.000 | ||||||
| HAR-G-V | 0.01530(98.50) | 0.01188(76.49) | 1.000 | 839.8 | 815.4 | ||||
| UK | HAR | 0.03639(490.06) | 0.02126(286.43) | 0.998 | 776.6 | 769.9 | |||
| HAR-V | 0.03526(474.89) | 0.02144(288.81) | 0.999 | 785.5 | 772.2 | ||||
| HAR-G | 0.03612(486.43) | 0.02111(284.39) | 0.998 | 775.6 | 762.3 | ||||
| HAR-G-V | 0.03435(462.64) | 0.02109(284.11) | 0.999 | ||||||
| HAR | 0.03288(442.83) | 0.01987(267.61) | 0.999 | 816.3 | 806.3 | ||||
| HAR-V | 0.03144(423.47) | 0.01976(266.07) | 0.999 | 830.7 | |||||
| HAR-G | 0.03165(426.26) | 0.01932(260.16) | 0.999 | 826.0 | 806.1 | ||||
| HAR-G-V | 0.03081(414.95) | 0.01927(259.54) | 0.999 | 806.5 | |||||
| USA | HAR | 0.03952(1588.63) | 0.02708(1088.51) | 0.998 | 709.9 | ||||
| HAR-V | 0.03914(1573.53) | 0.02707(1088.02) | 0.998 | 709.7 | 696.6 | ||||
| HAR-G | 0.03897(1566.44) | 0.02712(1090.36) | 0.998 | 698.3 | |||||
| HAR-G-V | 0.03862(1552.69) | 0.02700(1085.47) | 0.998 | 710.9 | 691.2 | ||||
| HAR | 0.03912(1572.87) | 0.02690(1081.42) | 0.998 | 711.9 | |||||
| HAR-V | 0.03886(1562.19) | 0.02692(1082.19) | 0.998 | 710.5 | 694.1 | ||||
| HAR-G | 0.03843(1544.88) | 0.02624(1054.95) | 0.998 | 693.2 | |||||
| HAR-G-V | 0.03821(1536.29) | 0.02633(1058.66) | 0.998 | 711.1 | 684.9 |
Table 4.
Comparison with ARIMA models.
| RMSE | MAE | AIC | BIC | |||||
|---|---|---|---|---|---|---|---|---|
| Brazil | ARIMA(1,1,1) | 0.14527(1990.21) | 0.09655(1322.72) | 0.979 | 186.9 | 177.23 | ||
| France | ARIMA(1,1,1) | 0.17485(1872.25) | 0.10083(1079.63) | 0.969 | 125.8 | 115.9 | ||
| Germany | ARIMA(1,2,1) | 0.10476(660.43) | 0.05947(374.92) | 0.989 | 347.0 | 336.9 | ||
| India | ARIMA(1,1,1) | 0.011297(1330.12) | 0.00671(789.64) | 1.000 | 1112.7 | 1103.1 | ||
| Italy | ARIMA(1,2,1) | 0.03474(239.36) | 0.01829(126.06) | 0.999 | 808.6 | 798.6 | ||
| Russia | ARIMA(1,2,1) | 0.01615(103.95) | 0.01252(80.57) | 1.000 | 897.9 | 888.6 | ||
| UK | ARIMA(1,2,0) | 0.02763(370.39) | 0.01674(224.51) | 0.999 | 942.2 | 935.4 | ||
| USA | ARIMA(1,2,1) | 0.03854(1549.26) | 0.02661(1069.80) | 0.998 | 760.9 | 750.9 |
As for RMSE in Tables 2 and 3, HAR-V models have smaller errors than HAR models, and also HAR-G-V models have smaller errors than HAR-G models. It implies that vaccination rates affect significantly the model identification. In Tables 2 and 3, bold numbers are the smallest errors in the given order in each country, whereas mark indicates the best one among the two smallest errors in both orders in each country. All eight countries have the smallest values of RMSE in HAR-G-V model of order 3. Consequently, HAR-G-V model of order is the most suitable model in the sense of RMSE. As for MAE in Tables 2 and 3, Brazil, India and UK have the smallest errors in HAR-G-V model of order 3. In other countries, we see that the models with vaccination rates, HAR-V or HAR-G-V, are better than the models without them, HAR or HAR-G, respectively, with respect to MAE. Table 4 presents empirical results of ARIMA() models for the COVID-19 data. RMSE, MAE, , AIC and BIC by our proposed HAR(-G-V) models are compared with those in Table 4. Orders of ARIMA() models are selected through a unit-root test, i.e., ADF test, as well as ACF and PACF plots, and through consideration of smallest RMSE and MAE. Bold numbers in Table 4 of ARIMA models indicate smaller errors than HAR(-G-V) model errors in each country. Most cases demonstrate that HAR(-G-V) models fit with the COVID-19 data better than the ARIMA models. To see the model-fitting of HAR-G-V models, Fig. 5 depicts the actual daily COVID-19 confirmed cases and HAR-G-V models along with their residuals. For all the eight countries, HAR-G-V models are well-matched with the actual data. It is concluded that HAR-G-V model of order 3 is the best one with reduced errors, compared to other models. Therefore, in the next subsection of prediction, we adopt the HAR-G-V model of order 3 and discuss prediction intervals by approximating the residual distribution.
3.2. Prediction analysis
Now we investigate the forecasting performance by the HAR-G-V model of order 3, applied to the daily COVID-19 confirmed cases. For the first prediction result, one-, two-, three-step ahead forecasts are computed and their forecasting performance accuracy is reported. In order to provide the forecasting analysis, rolling windows are used. Out of the given observations of size , we take two subsamples; in-sample and out-of-sample. To assess the -step ahead forecast, (), let , and let with relabelled observations, (). is the in-sample used to estimate parameters. The -step ahead predicted value is computed from , (), and accuracy measures of RMSE, MAE, MAPE and RRSE are evaluated along with observation in the out-of-sample as follows:
Root Mean Square Error (RMSE):
Mean Absolute Error (MAE):
Mean Absolute Percentage Error (MAPE):
Root Relative Square Error (RRSE):
where .
In Table 5 , RMSE, MAE, MAPE and RRSE for the -step ahead forecasts, (), are evaluated with size . The values in the parentheses on the columns of RMSE and MAE indicate the errors multiplied by the standard deviations of in-sample observations. As expected, it is shown in Table 5 that all the errors are reasonably small, and the farther step ahead, the larger errors.
Table 5.
Forecasting performance of one-, two-, three-step ahead ( forecasts in HAR-G-V model with size 30.
| RMSE | MAE | MAPE | RRSE | ||||
|---|---|---|---|---|---|---|---|
| Brazil | 1 | 0.20620(2824.94) | 0.11714(1604.88) | 4.35666 | 0.45181 | ||
| 2 | 0.31809(4357.75) | 0.22947(3143.68) | 8.57290 | 0.68883 | |||
| 3 | 0.37590(5149.75) | 0.29034(3977.58) | 11.25078 | 0.85273 | |||
| France | 1 | 0.05194(556.22) | 0.04302(460.71) | 2.19886 | 0.20366 | ||
| 2 | 0.07343(786.35) | 0.06043(647.13) | 2.99611 | 0.33470 | |||
| 3 | 0.09759(1045.04) | 0.07635(817.59) | 3.71851 | 0.52219 | |||
| Germany | 1 | 0.02556(161.13) | 0.02103(132.58) | 5.51255 | 0.12122 | ||
| 2 | 0.02581(162.73) | 0.02124(133.93) | 4.82855 | 0.11271 | |||
| 3 | 0.05738(361.72) | 0.04367(275.31) | 11.58729 | 0.23259 | |||
| India | 1 | 0.01886(2220.95) | 0.01117(1315.23) | 3.22558 | 0.71244 | ||
| 2 | 0.03623(4265.88) | 0.02432(2864.33) | 7.00961 | 1.28780 | |||
| 3 | 0.05698(6709.39) | 0.03789(4462.14) | 10.89532 | 1.88765 | |||
| Italy | 1 | 0.01446(99.64) | 0.01134(78.14) | 1.5562 | 0.12081 | ||
| 2 | 0.03627(249.95) | 0.02876(198.19) | 3.92301 | 0.33632 | |||
| 3 | 0.06402(441.16) | 0.05005(344.91) | 6.74034 | 0.65573 | |||
| Russia | 1 | 0.01622(104.42) | 0.01343(86.45) | 0.39060 | 0.11079 | ||
| 2 | 0.02954(190.16) | 0.02418(155.64) | 0.71247 | 0.20343 | |||
| 3 | 0.04626(297.73) | 0.03795(244.27) | 1.13091 | 0.31719 | |||
| UK | 1 | 0.06500(871.42) | 0.04350(583.18) | 1.78073 | 0.16298 | ||
| 2 | 0.15677(2101.66) | 0.10299(1380.77) | 4.32968 | 0.47361 | |||
| 3 | 0.27381(3670.67) | 0.18412(2468.32) | 8.04029 | 1.02424 | |||
| USA | 1 | 0.08251(3316.91) | 0.06854(2755.29) | 3.23027 | 0.10518 | ||
| 2 | 0.11920(4791.86) | 0.09111(3662.80) | 4.45301 | 0.15395 | |||
| 3 | 0.14519(5836.68) | 0.11731(4715.98) | 5.42386 | 0.19048 |
For the second prediction result, we construct % prediction intervals using the residual distributions. In the HAR-G-V model with estimated parameter vector , the residuals are approximated by three approaches. The distributions of the residuals are plotted with histograms in Fig. 6 , where the normal and Laplace distributions by estimated mean and standard deviation of the residuals are shown as well. The % Box–Jenkins prediction interval by the normal approximation is constructed as , with for , respectively, where and are the estimated mean and standard deviation of the residuals. However, as shown in Fig. 6, the normal density is not well fitted with distribution of the residuals. Hence we adopt the Laplace distribution for better fitting of the density. The Laplace distribution, which is well-known as two-sided exponential distribution, has the following probability density function with location parameter and scale parameter : . Its mean and variance are and , respectively. In Fig. 6, red curves indicate Laplace density functions with estimated parameters from the residuals. Using the Laplace distribution, % prediction intervals are constructed as with the estimated mean and scale parameter, and . By rescaling these intervals, the one-step prediction intervals of the daily COVID-19 confirmed cases are constructed.
Fig. 6.
Distribution of residuals by HAR-G-V models.
In Table 6 , for the last days, , empirical coverage probability, average length and mean interval score of 80% prediction intervals are displayed. Moreover, those of 95% prediction intervals are given in Table 7 . The empirical coverage probability is the ratio of observations that lie in the prediction interval. The interval score is a scoring rule for prediction interval, which gives a penalty if observation misses the interval and rewards for a narrow interval, (see Chowell and Luo [32], Gneiting and Raftery [46]). The mean interval score (MIS) is defined by
| (3) |
where and are lower and upper bounds of the % prediction interval for level , respectively, is the indicator function and is observation. If coverage probability is one, then the mean interval score is the same as the average length. Lower coverage probability yields more penalty as much as the last two terms in (3). For the case of Brazil in Tables 6 and 7, the normal and Laplace approximations address small coverage probabilities, not close to, but lower than the nominal coverage, and thus they make much larger (or worse) mean interval scores with high penalty, compared to the average lengths.
Table 6.
Coverage probability (CP), average length (AL) and mean interval score (MIS) of 80% prediction intervals by normal, laplace distributions and Bootstrap procedure in HAR-G-V residuals for the last days.
| Normal |
Laplace |
Bootstrap |
|||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CP | AL | MIS | CP | AL | MIS | CP | AL | MIS | |||||
| Brazil | 20 | 0.7 | 2447 | 3790 | 0.65 | 2176 | 3924 | 0.85 | 3526 | 3749 | |||
| 50 | 0.6 | 2176 | 8631 | 0.54 | 1935 | 8887 | 0.76 | 3509 | 7780 | ||||
| 100 | 0.53 | 2096 | 9145 | 0.46 | 1863 | 9475 | 0.72 | 3422 | 7926 | ||||
| France | 20 | 1.0 | 2089 | 2089 | 1.0 | 1858 | 1858 | 1.0 | 2636 | 2636 | |||
| 50 | 0.9 | 2250 | 2588 | 0.88 | 2000 | 2472 | 0.9 | 2806 | 2985 | ||||
| 100 | 0.66 | 2231 | 4910 | 0.62 | 1984 | 5105 | 0.81 | 3429 | 4244 | ||||
| Germany | 20 | 1.0 | 1192 | 1192 | 1.0 | 1060 | 1060 | 1.0 | 1218 | 1218 | |||
| 50 | 1.0 | 1234 | 1234 | 0.94 | 1097 | 1126 | 1.0 | 1251 | 1251 | ||||
| 100 | 0.77 | 1203 | 2022 | 0.66 | 1070 | 2073 | 0.84 | 1413 | 1861 | ||||
| India | 20 | 1.0 | 24234 | 24234 | 1.0 | 21547 | 21547 | 0.95 | 2703 | 2706 | |||
| 50 | 1.0 | 26141 | 26141 | 1.0 | 23242 | 23242 | 0.78 | 2792 | 6798 | ||||
| 100 | 1.0 | 31015 | 31015 | 1.0 | 27575 | 27575 | 0.76 | 3063 | 6000 | ||||
| Italy | 20 | 1.0 | 1313 | 1313 | 1.0 | 1167 | 1167 | 1.0 | 359 | 359 | |||
| 50 | 1.0 | 1389 | 1389 | 1.0 | 1235 | 1235 | 0.82 | 368 | 486 | ||||
| 100 | 1.0 | 1361 | 1361 | 1.0 | 1210 | 1210 | 0.82 | 390 | 599 | ||||
| Russia | 20 | 1.0 | 1357 | 1357 | 1.0 | 1206 | 1206 | 0.7 | 253 | 335 | |||
| 50 | 1.0 | 1272 | 1272 | 1.0 | 1131 | 1131 | 0.74 | 251 | 366 | ||||
| 100 | 0.78 | 787 | 1006 | 0.75 | 699 | 948 | 0.6 | 213 | 573 | ||||
| UK | 20 | 1.0 | 2434 | 2434 | 1.0 | 2164 | 2164 | 0.8 | 688 | 979 | |||
| 50 | 0.86 | 2373 | 3235 | 0.82 | 2110 | 3190 | 0.56 | 569 | 4030 | ||||
| 100 | 0.93 | 2413 | 2844 | 0.91 | 2146 | 2686 | 0.62 | 441 | 2495 | ||||
| USA | 20 | 0.75 | 6952 | 9895 | 0.65 | 6181 | 10339 | 0.52 | 3245 | 14417 | |||
| 50 | 0.82 | 7186 | 9892 | 0.78 | 6389 | 9894 | 0.56 | 2871 | 13218 | ||||
| 100 | 0.91 | 7597 | 8950 | 0.89 | 6755 | 8507 | 0.7 | 2663 | 8044 | ||||
Table 7.
Coverage probability (CP), average length (AL) and mean interval score (MIS) of 95% prediction intervals by normal, laplace distributions and Bootstrap procedure in HAR-G-V residuals for the last days.
| Normal |
Laplace |
Bootstrap |
|||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CP | AL | MIS | CP | AL | MIS | CP | AL | MIS | |||||
| Brazil | 20 | 0.9 | 3747 | 4584 | 0.95 | 4050 | 4465 | 1.0 | 8011 | 8011 | |||
| 50 | 0.74 | 3332 | 21773 | 0.78 | 3602 | 20856 | 0.96 | 8095 | 17968 | ||||
| 100 | 0.69 | 3209 | 208 | 0.74 | 3469 | 22065 | 0.95 | 8137 | 16182 | ||||
| France | 20 | 1.0 | 3200 | 3200 | 1.0 | 3458 | 3458 | 1.0 | 9449 | 9449 | |||
| 50 | 1.0 | 3445 | 3445 | 1.0 | 3723 | 3723 | 1.0 | 034 | 10034 | ||||
| 100 | 0.82 | 3416 | 7901 | 0.87 | 3692 | 7342 | 1.0 | 10361 | 10361 | ||||
| Germany | 20 | 1.0 | 1825 | 1825 | 1.0 | 1973 | 1973 | 1.0 | 3227 | 3227 | |||
| 50 | 1.0 | 1890 | 1890 | 1.0 | 2042 | 2042 | 1.0 | 3439 | 3439 | ||||
| 100 | 0.89 | 1842 | 3190 | 0.9 | 1991 | 3054 | 0.99 | 3563 | 3602 | ||||
| India | 20 | 1.0 | 37109 | 37109 | 1.0 | 40107 | 40107 | 1.0 | 6340 | 6340 | |||
| 50 | 1.0 | 40028 | 40028 | 1.0 | 43261 | 43261 | 0.94 | 5867 | 13861 | ||||
| 100 | 1.0 | 47492 | 47492 | 1.0 | 51328 | 51328 | 0.94 | 5490 | 11140 | ||||
| Italy | 20 | 1.0 | 2011 | 2011 | 1.0 | 2173 | 2173 | 1.0 | 869 | 869 | |||
| 50 | 1.0 | 2127 | 2127 | 1.0 | 2299 | 2299 | 1.0 | 981 | 981 | ||||
| 100 | 1.0 | 2084 | 2084 | 1.0 | 2253 | 2253 | 0.98 | 1093 | 1156 | ||||
| Russia | 20 | 1.0 | 2078 | 2078 | 1.0 | 2254 | 2254 | 0.95 | 437 | 478 | |||
| 50 | 1.0 | 1948 | 1948 | 1.0 | 2105 | 2105 | 0.92 | 433 | 525 | ||||
| 100 | 0.88 | 1205 | 1716 | 0.91 | 1302 | 1771 | 0.80 | 355 | 1154 | ||||
| UK | 20 | 1.0 | 3745 | 3745 | 1.0 | 4047 | 4047 | 0.95 | 1996 | 2007 | |||
| 50 | 0.94 | 3653 | 4344 | 0.96 | 3948 | 4393 | 0.78 | 1703 | 9220 | ||||
| 100 | 0.97 | 3716 | 4061 | 0.98 | 4016 | 4239 | 0.83 | 1366 | 5315 | ||||
| USA | 20 | 0.9 | 10645 | 11671 | 0.95 | 11505 | 11596 | 0.7 | 6284 | 25698 | |||
| 50 | 0.94 | 11004 | 14384 | 0.96 | 11892 | 14556 | 0.72 | 5264 | 24559 | ||||
| 100 | 0.97 | 11634 | 13324 | 0.98 | 12573 | 13905 | 0.85 | 4835 | 14012 | ||||
The Laplace distribution seems to fit the residual distributions better and to improve coverage probabilities of the 95% prediction intervals in Brazil, France, Germany and Russia (Table 7). Also, the Laplace distribution makes the mean interval scores better in most countries for the 80% prediction intervals (Table 6), whereas this is not true for the 95% prediction intervals (Table 7). However, average lengths of prediction intervals by both normal and Laplace distributions are large in some countries and so coverage probability is always equal to one. Too long intervals make always coverage one. Thus we need to improve prediction intervals with proper coverage probability close to the nominal coverage, as well as with shorter length. For this purpose, we consider the bootstrap procedure in the next subsection.
3.3. Bootstrap prediction interval
In order to assess the bootstrap forecast, we first generate a bootstrap sample of residuals. Using the residuals , we define the empirical distribution by
We draw an i.i.d. sequence from the empirical distribution , which is a bootstrap sample to be used in constructing a bootstrap prediction interval. One-step ahead bootstrap forecast is given by
An % bootstrap prediction interval is constructed by the - and -bootstrap quantiles, and , respectively, of the bootstrap sample , where is the number of bootstrap replications. The % bootstrap prediction interval is given by
In Figs. 7 and 8 , the right columns depict the 95% bootstrap prediction intervals for the COVID-19 confirmed cases, where is used to get the bootstrap quantiles, while the left columns illustrate the Box–Jenkins prediction intervals by the normal approximation. Because all the lengths by the Laplace approximation in Table 7 are larger than those by the normal approximation, plots of Laplace prediction intervals are omitted. The interval plots in Figs. 7 and 8 describe the numerical results in Table 7 with 95% one-step prediction intervals for the last 15 days. The left and right columns use the same scales on the vertical axis in each country so that the Box–Jenkins and bootstrap prediction intervals are compared with each other at sight. In most countries the prediction intervals include the actual values of the COVID-19 confirmed cases. On the left columns of Figs. 7 and 8, the prediction intervals by the Box–Jenkins are so large with coverage probability one in some countries. The lengths of the bootstrap prediction intervals on the right columns are not too large or too small because bootstrap distribution mimics the distribution of the residuals.
Fig. 7.
One-day ahead predicted values and 95% prediction intervals by HAR-G-V models: Box–Jenkins prediction intervals (left column) and Bootstrap prediction intervals (right column); Brazil, France, Germany and India.
Fig. 8.
One-day ahead predicted values and 95% prediction intervals by HAR-G-V models: Box–Jenkins prediction intervals (left column) and Bootstrap prediction intervals (right column); Italy, Russia, UK and USA.
To add some discussion of the bootstrap prediction intervals, we go back to Tables 6 and 7, whose last columns report results of the 80%, 95% bootstrap prediction intervals. In Tables 6 and 7 we see that the bootstrap procedure not only improves coverage probability but also makes better mean interval score of prediction intervals. In Brazil, France and Germany, the coverage probabilities are improved to the values close to the norminal coverage by enlarging the average length of the intervals. In India, Italy and Russia, the bootstrap prediction intervals are shortened with the improved coverage probability as well, which are in cases of so large average length by the normal and Laplace distributions. Consequently, the bootstrap technique produces the best mean interval scores: bold numbers in columns of MIS in Tables 6 and 7 indicate the best one among three scores of the normal, Laplace and bootstrap approaches.
Finally, in order to show how much the bootstrap method improves the normal and Laplace approximations, in the countries with the best mean interval scores in all three cases of Table 7, i.e., in Brazil, India, Italy and Russia, the coverage probability and mean interval score of 80%, 95% prediction intervals are depicted in Fig. 9, Fig. 10, Fig. 11, Fig. 12 . Instead of in Tables 6 and 7, are set on the horizontal axis in Fig. 9, Fig. 10, Fig. 11, Fig. 12. Using the last days, the empirical coverage probability and the mean interval score are computed and then the two measures are illustrated on the -axis of the horizon. Red graphs in the figures indicate the results by the bootstrap while green and blue ones are by the normal and Laplace, respectively. Red graphs in the coverage probability (left columns of Fig. 9, Fig. 10, Fig. 11, Fig. 12) are closer to the nominal coverage in both 80% and 95% intervals. In the plots of coverage probability, the grey horizontal lines in the middle lie on the nominal coverage. Coverage probabilities of the 80% bootstrap prediction intervals average to 0.8470, 0.8708, 0.9191 and 0.7717 in Brazil, India, Italy and Russia, respectively, whereas those of the normal and Laplace approximations are all one except for Brazil, which has the averages 0.7073 and 0.6564. Also, the averages of coverage probabilities of the 95% bootstrap prediction intervals are 0.9761, 0.9624, 1.0 and 0.9691 in the four countries, respectively, whereas those of others are all one, but for Brazil with 0.8543 and 8894.
Fig. 9.
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Brazil.
Fig. 10.
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for India.
Fig. 11.
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Italy.
Fig. 12.
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Russia.
As for the mean interval score (right columns of Fig. 9, Fig. 10, Fig. 11, Fig. 12), red graphs of the bootstrap are much smaller values than others except for a few values of Brazil. As shown in the figures, the Laplace approximation does not significantly improve the normal one. However, the bootstrap method considerably lowers and enhances the mean interval scores from both normal and Laplace approximations. To understand the bootstrap improvement with an evaluated measure, we suggest a bootstrap improvement ratio (BIR) of the mean interval score. The BIR is defined as
where is average of the mean interval scores by the underlying approach for :
with in (3), and stands for the bootstrap. The BIR values are displayed in Table 8 , whose results are computed from the mean interval scores (right columns) of Fig. 9, Fig. 10, Fig. 11, Fig. 12. In the case of Brazil, it is reported that the BIR values of the 95% prediction intervals are not as good as those in others because the mean interval scores of the bootstrap are higher (or worse) for , as seen in Fig. 9, than other two approaches. In India and Russia, BIR values are between 0.7 and 0.8, which means that even up to 70–80% of the scores are improved either from the normal to the bootstrap or from the Laplace to the bootstrap. In conclusion, we claim that the bootstrap technique can improve prediction intervals with having more proper empirical coverages (closer to nominal one) as well as shorter average lengths and thus better mean interval scores.
Table 8.
Bootstrap improvement ratio (BIR) of prediction interval (PI) on the mean interval scores in Fig. 9, Fig. 10, Fig. 11, Fig. 12.
| Normal to Bootstrap |
Laplace to Bootstrap |
|||||||
|---|---|---|---|---|---|---|---|---|
| 80% PI | 95% PI | 80% PI | 95% PI | |||||
| Brazil | 0.0246 | 0.0101 | 0.0434 | 0.0172 | ||||
| India | 0.8031 | 0.7132 | 0.7786 | 0.7346 | ||||
| Italy | 0.7103 | 0.5754 | 0.6741 | 0.6071 | ||||
| Russia | 0.7683 | 0.7781 | 0.7394 | 0.7947 | ||||
4. Conclusion and discussion
The COVID-19 pandemic might continue over the next few years in spite of the vaccination supply all around the world. Health scientists and researchers will make more efforts to reverse the virus-contaminated Earth to the uncontaminated one again. For this purpose, more efficient models as well as more accurate predictions are needed to be reflected in social policies such as lockdown and social distancing. In the statistical aspect, rather than point-estimation of predicted values, interval-estimation such as prediction intervals can give more information with high probability. To this end, this study suggests a multiple regression with a form of nonlinear HAR along with growth rates and vaccination rates as its additional regressors, and aims to provide more accurate prediction intervals for the daily COVID-19 confirmed cases.
Model selections of the HAR models with/without the growth rates and vaccination rates are given by the criteria measures such as RMSE, MAE, R, AIC and BIC. Our proposed HAR-G-V model addresses the smallest values of RMSE and MAE in top eight affected countries. We conclude that the HAR-G-V model is a suitable and useful tool for obtaining accurate prediction. For the forecasting, point-estimation of predicted values as well as interval-estimation through three kinds of prediction intervals are extensively provided. RMSE, MAE, MAPE and RRSE are evaluated for the forecast performance accuracy. In order to assess accurate prediction intervals, three approaches are carried out with the residuals of the HAR-G-V model. Normal and Laplace approximations, and bootstrap method are tried to construct prediction intervals. The last one improves prediction intervals with maximum 70–80% of mean interval score, because bootstrap procedure mimics the residual distribution.
This work has the following contributions: (i) A time series model collaborated with both growth rates and vaccination rates is proposed in order to forecast the daily COVID-19 confirmed cases. As a result, the proposed model yields smaller errors in RMSE and MAE. (ii) This work mainly focuses on the prediction intervals to be more informative, by establishing the 80%, 95% prediction intervals through the normal, Laplace and bootstrap approaches. All three prediction intervals produce good empirical coverage probabilities. (iii) The bootstrap technique can improve the mean interval score of prediction intervals. Thus we conclude that the HAR-G-V model combined with the bootstrap method provides accurate prediction intervals of the COVID-19 pandemic.
As a next step of the forecast analysis of the COVID-19, author plans to propose and study a HAR(-G-V) model with partially periodicity as mentioned as in the last section of Hwang and Yu [35]. This current work has handled the data of 7-day smoothed COVID-19 cases as in Ekinci [34]. Empirical results in Ekinci [34] and in this work can provide useful predictive information in deciding the COVID-19 social policies such as social distancing or pandemic relief money. It is because such a social policy is executed for a rather long period, specifically, much longer than 7 days. Due to irregular PCR tests from day to day, for example, rare on Sunday or frequent on Wednesday, the 7-day smoothed data analysis is reasonable and sufficient to understand the overall COVID-19 situation. However, as an interesting and challenging topic, author is interested in modeling the partially periodic COVID-19 cases. As seen in the patterns of COVID-19 data (unsmoothed) of USA, Germany, Brazil, India etc., there is a phenomenon such as strong oscillation with 7-day periodicity, which makes 7-day periodic ACFs. The oscillation seems to appear as the magnitude of the COVID-19 data is larger. In other words, if the magnitude becomes larger, then the oscillation becomes stronger; together with no oscillation on small magnitude. In order to reflect such a feature, partial oscillations not only having some truncation but also proportional to the magnitude might be more desirable than full oscillations. Such a time series modeling for the partially periodic COVID-19 cases will be a promising and challenging study. It will need a mathematical refinement or probability distributional complement (like [18]) as well as an experimental design along with the basis of the statistical inference.
CRediT authorship contribution statement
Eunju Hwang: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization, Investigation, Supervision, Validation, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by National Research Foundation of Korea (NRF-2018R1D1A1B07048745).
References
- 1.Benvenuto D., Giovanetti M., Vassallo L., Angeletti S., Ciccozzi M. Application of the ARIMA model on the COVID-2019 epidemic dataset. Date Brief. 2020;29:105340. doi: 10.1016/j.dib.2020.105340. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Hernandez-Matamoros A., Fujita F., Hayashi T., Perez-Meana H. Forecasting of COVID19 per regions using ARIMA models and polynomial functions. Appl Soft Comput J. 2020;96:106610. doi: 10.1016/j.asoc.2020.106610. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Sahai A.K., Rath N., Sood V., Singh M. ARIMA modelling and forecasting of COVID-19 in top five affected countries. Diabetes Metab Syndr. 2020;14(5):1419–1427. doi: 10.1016/j.dsx.2020.07.042. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Ceylan Z. Estimation of COVID-19 prevalence in Italy, Spain, and France. Sci Total Environ. 2020;729:138817. doi: 10.1016/j.scitotenv.2020.138817. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Kirbas I., Sozen A., Tuncer A.D., Kazancioglu F.S. Comparative analysis and forecasting of COVID-19 cases in various european countries with ARIMA, NARNN and LSTM approaches. Chaos, Solitons Fractals. 2020;138:110015. doi: 10.1016/j.chaos.2020.110015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Ribeiro M.H.D.M., Silva R.G.D., Mariani V.C., Coelho L.D.S. Short-term forecasting COVID-19 cumulative confirmed cases: perspectives for Brazil. Chaos, Solitons Fractals. 2020;135:109853. doi: 10.1016/j.chaos.2020.109853. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Mustafa H.I., Fareed N.Y. 2nd Al-noor international conference for science and technology (NICST) 2020. COVID-19 cases in Iraq; forecasting incidents using Box–Jenkins ARIMA model; pp. 22–26. [Google Scholar]
- 8.Soukhovolsky V., Kovalev A., Pitt A., Kessel B. A new modelling of the COVID 19 pandemic. Chaos, Solitons Fractals. 2020;139:110039. doi: 10.1016/j.chaos.2020.110039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Atangana A., Araz I.S. Modeling and forecasting the spread of COVID-19 with stochastic and deterministic approaches: Africa and Europe. Adv Differ Equ. 2021;57:2021. doi: 10.1186/s13662-021-03213-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Katoch R., Sidhu A. An application of ARIMA model to forecast the dynamics of COVID-19 epidemic in India. Global Bus Rev. 2021 doi: 10.1177/0972150920988653. [DOI] [Google Scholar]
- 11.Chatterjee S., Asad A., Shayak B., Bhattacharya S., Alam S., Verma M.K. Evolution of COVID-19 pandemic: power-law growth and saturation. Int license. 2020 doi: 10.1101/2020.05.05.20091389. [DOI] [Google Scholar]
- 12.Sanchez-Cabalero S., Selles M.A., Peydro M.A., Perez-Bernabeu E. An efficient COVID-19 prediction model validated with the cases of China, Italy and Spain: total or partial lockdowns? J Clin Med. 2020;9:1547. doi: 10.3390/jcm9051547. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Reis R.F., Quintela B.D.M., Campos J.D.O., Gomes J.M., Rocha B.M., Lobosco M., et al. Characterization of the COVID-19 pandemic and the impact of uncertainties, mitigation strategies, and underreporting of cases in South Korea, Italy, and Brazil. Chaos, Solitons Fractals. 2020;136:109888. doi: 10.1016/j.chaos.2020.109888. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Sarkar K., Khajanchi S., Nieto J.J. Modeling and forecasting the COVID-19 pandemic in India. Chaos, Solitons Fractals. 2020;139:110049. doi: 10.1016/j.chaos.2020.110049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Mohammed A.A., Ewees A.A., Fan H., Abd El Aziz M. Optimization method for forecasting confirmed cases of COVID-19 in China. J Clin Med. 2020;9:674. doi: 10.3390/jcm9030674. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Maleki M., Mahmoudi M.R., Heydari M.H. Modeling and forecasting the spread and death rate of coronavirus (COVID-19) in the world using time series models. Chaos, Solitons Fractals. 2020;140:110151. doi: 10.1016/j.chaos.2020.110151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Petropoulos F., Makridakis S., Stylianou N. COVID-19: forecasting confirmed cases and deaths with a simple time series model. Int J Forecast. 2021 doi: 10.1016/j.ijforecast.2020.11.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Khan N.M., Bakouch H.S., Soobhug A.D., Scotto M.G. Insights on the trend of the novel coronavirus 2019 series in some small island developing states: a thinning-based modelling approarch. Alexandria Eng J. 2021;60:2535–2550. [Google Scholar]
- 19.Pitchaimani M., Brasanna Devi M. Stochastic probical strategies in a delay virus infection model to combat COVID-19. Chaos, Solitons Fractals. 2021;152:111325. doi: 10.1016/j.chaos.2021.111325. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Mehra A.H.A., Shafieirad M., Abbasi Z., Zamani I. Parameter estimation and predicition of COVID-19 epidemic teruning point and ending time of a case study on SIR/SQAIR epidemic models. Comput Math Methods Med. 2020 doi: 10.1155/2020/1465923. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Chaurasia V., Pal S. Application of machine learning time series analysis for prediction COVID-19 pandemic. Res Biomed Eng. 2020 doi: 10.1007/s42600-020-00105-4. [DOI] [Google Scholar]
- 22.Salgotra R., Gandomi M., Gandomi A.H. Time series analysis and forecast of the COVID-19 pandemic in India using genetic programming. Chaos, Solitons Fractals. 2020;138:109945. doi: 10.1016/j.chaos.2020.109945. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Feroze N. Forecasting the patterns of COVID-19 and causal impacts of lockdown in top five affected countries using Bayesian structural time series models. Chaos, Solitons Fractals. 2020;140:110196. doi: 10.1016/j.chaos.2020.110196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Prasanth S., Singh U., Kumar A., Tikkiwal V.A., Chong P.H.J. Forecasting spread of COVID-19 using google trends: a hybrid GWO-deep learning approach. Chaos, Solitons Fractals. 2021;142:110336. doi: 10.1016/j.chaos.2020.110336. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Balli S. Data analysis of COVID-19 pandemic and short-term cumulative case forecasting using machine learning time series methods. Chaos, Solitons Fractals. 2021;142:110512. doi: 10.1016/j.chaos.2020.110512. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Kalantari M. Forecasting COVID-19 pandemic using optimal singular spectrum analysis. Chaos, Solitons Fractals. 2021;142:110547. doi: 10.1016/j.chaos.2020.110547. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Katris C. A time series-based statistical approach for outbreak spread forecating: application of COVID-19 in Greece. Expert Syst Appl. 2021;166:114077. doi: 10.1016/j.eswa.2020.114077. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Marmarelis V.Z. Predictive modeling of COVID-19 data in the US: adaptive phase-space approach. J Eng Med Biol. 2020:3008313. doi: 10.1109/OJEMB.2020.3008313. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Nikolopoulos K., Punia S., Schäfers A., Tsinopoulos C., Vasilakis C. Forecasting and planning during a pandemic: COVID-19 growth rates, supply chain disruptions, and governmental decisions. Eur J Oper Res. 2021;290:99–115. doi: 10.1016/j.ejor.2020.08.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Liu L., Moon H.R., Schorfheide F. Panel forecasts of country-level COVID-19 infections. J Econom. 2021;220:2–22. doi: 10.1016/j.jeconom.2020.08.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Luo X., Duan H., Xu K. A novel grey model based on traditional richards model and its application in COVID-19. Chaos, Solitons Fractals. 2021;142:110480. doi: 10.1016/j.chaos.2020.110480. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Chowell G., Luo R. Ensenble bootstrap methodology for forecasting dynamic growth processes using differential equations: application to epidemic outbreaks. BMC Med Res Methodol. 2021;21:34. doi: 10.1186/s12874-021-01226-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Bracher J., Wolffram D., Deuschel J., et al. A pre-registered short-term forecasting study of COVID-19 in Germany and Poland during the second wave. Nat Commun. 2021;12:5173. doi: 10.1038/s41467-021-25207-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Ekinci A. Modeling and forecasting of growth rate of new COVID-19 cases in top nine affected countries: considering conditional variance and asymmetric effect. Chaos, Solitons Fractals. 2021;151:111227. doi: 10.1016/j.chaos.2021.111227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Hwang E., Yu S. Modeling and forecasting the COVID-19 pandemic with heterogeneous autoregression approaches: South Korea. Results Phys. 2021;29:104631. doi: 10.1016/j.rinp.2021.104631. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Tregoning J.S., Flight K.E., Higham S.L., et al. Progress of the COVID-19 vaccine effort: viruses, vaccines and variants versus efficacy, effectiveness and escape. Nat Rev Immunol. 2021;21:626–636. doi: 10.1038/s41577-021-00592-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Mathieu E., Ritchie H., Ortiz-Ospina E., et al. A global database of COVID-19 vaccinations. Nat Hum Behav. 2021;5:947–953. doi: 10.1038/s41562-021-01122-8. [DOI] [PubMed] [Google Scholar]
- 38.Shim E. Projecting the impact of SARS-CoV-2 variants and the vaccination program on the fourth wave of the COVID-19 pandemic in South Korea. Int J Environ Res Public Health. 2021;18:7578. doi: 10.3390/ijerph18147578. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Cihan P. Forecasting fully vaccinated people against COVID-19 and examining future vaccination rate for herd immunity in the US, Asia, Europe, Africa, South America, and the world. Appl Soft Comput. 2021;111:107708. doi: 10.1016/j.asoc.2021.107708. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Higdon M.M., et al. A systematic review of COVID-19 vaccines efficacy and effectiveness against the SARS-CoV-2 infection and disease. medRxiv2021. Preprint. doi: 10.1101/2021.09.17.21263549. [DOI]
- 41.Montgomery D.C., Peck E.A. 2nd ed. Wiley Series in Probability and Mathematical Statistics; 1991. Introduction to linear regression analysis. [Google Scholar]
- 42.Corsi F. A simple approximate long-memory model of realized volatility. J Financ Econom. 2009;7:174–196. [Google Scholar]
- 43.Hong W.T., Lee J., Hwang E. A note on the asymptotic normality theory of the least squares estimates in multivariate HAR-RV models. Mathematics. 2020;8:2083. [Google Scholar]
- 44.Hwang E., Hong W.T. A multivariate HAR-RV model with heteroscedastic errors and its WLS estimation. Econ Lett. 2021;203:109855. [Google Scholar]
- 45.Wilms I., Rombouts J., Croux C. Multivariate volatility forecasts for stock market indices. Int J Forecast. 2021;37:484–499. [Google Scholar]
- 46.Gneiting T., Raftery A.E. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;477:359–378. [Google Scholar]
- 47.Freedman D.A. Bootstrapping regression models. Ann Stat. 1981;9:1218–1228. [Google Scholar]
- 48.Stine R.A. Estimating properties of autoregressive forecasts. J Am Stat Assoc. 1987;82:1072–1078. [Google Scholar]
- 49.Thombs L.A., Schucany W.R. Bootstrap prediction intervals for autoregression. J Am Stat Assoc. 1990;85:486–492. [Google Scholar]
- 50.Hong M., Hwang E. Bootstrap inference for network vector autoregression in large-scale social network. J Korean Stat Soc. 2021;50:1238–1258. doi: 10.1007/s42952-021-00115-7. [DOI] [Google Scholar]
- 51.Linton N.M., Kobayashi T., Yang Y., Hayashi K., Akhmetzhanov A.R., Jung S., et al. Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: astatistical analysis of publicly available case data. J Clin Med. 2020;9:538. doi: 10.3390/jcm9020538. [DOI] [PMC free article] [PubMed] [Google Scholar]












