PLOS One. 2020 Mar 31;15(3):e0230405. doi: 10.1371/journal.pone.0230405

Data-based analysis, modelling and forecasting of the COVID-19 outbreak

Cleo Anastassopoulou 1,*, Lucia Russo 2, Athanasios Tsakris 1, Constantinos Siettos 3,*
Editor: Sreekumar Othumpangat 4
PMCID: PMC7108749  PMID: 32231374

Abstract

Since the first suspected case of coronavirus disease-2019 (COVID-19) on December 1st, 2019, in Wuhan, Hubei Province, China, a total of 40,235 confirmed cases and 909 deaths have been reported in China up to February 10, 2020, evoking fear locally and internationally. Here, based on the publicly available epidemiological data for Hubei, China from January 11 to February 10, 2020, we provide estimates of the main epidemiological parameters. In particular, we provide an estimation of the case fatality and case recovery ratios, along with their 90% confidence intervals as the outbreak evolves. On the basis of a Susceptible-Infectious-Recovered-Dead (SIRD) model, we provide estimations of the basic reproduction number (R0), and the per day infection mortality and recovery rates. By calibrating the parameters of the SIRD model to the reported data, we also attempt to forecast the evolution of the outbreak at the epicenter three weeks ahead, i.e. until February 29. As the number of infected individuals, especially of those with asymptomatic or mild courses, is suspected to be much higher than the official numbers, which can be considered only as a subset of the actual numbers of infected and recovered cases in the total population, we have repeated the calculations under a second scenario that considers twenty times the number of confirmed infected cases and forty times the number of recovered, leaving the number of deaths unchanged. Based on the reported data, the expected value of R0 as computed considering the period from the 11th of January until the 18th of January, using the official counts of confirmed cases was found to be ∼4.6, while the one computed under the second scenario was found to be ∼3.2. Thus, based on the SIRD simulations, the estimated average value of R0 was found to be ∼2.6 based on confirmed cases and ∼2 based on the second scenario. Our forecasting flashes a note of caution for the presently unfolding outbreak in China.
Based on the official counts for confirmed cases, the simulations suggest that the cumulative number of infected could reach 180,000 (with a lower bound of 45,000) by February 29. Regarding the number of deaths, on the basis of the data reported up to the 10th of February, simulations forecast that the death toll might exceed 2,700 (as a lower bound) by February 29. Our analysis further reveals a significant decline of the case fatality ratio from January 26, to which various factors may have contributed, such as the severe control measures taken in Hubei, China (e.g. quarantine and hospitalization of infected individuals), but mainly the fact that the actual cumulative numbers of infected and recovered cases in the population most likely are much higher than the reported ones. Thus, in a scenario where we have taken twenty times the confirmed number of infected and forty times the confirmed number of recovered cases, the case fatality ratio is around ∼0.15% in the total population. Importantly, based on this scenario, simulations suggest a slowdown of the outbreak in Hubei at the end of February.

Introduction

An outbreak of “pneumonia of unknown etiology” in Wuhan, Hubei Province, China in early December 2019 has spiraled into an epidemic that is ravaging China and threatening to reach a pandemic state [1]. The causative agent soon proved to be a new betacoronavirus related to the Middle East Respiratory Syndrome virus (MERS-CoV) and the Severe Acute Respiratory Syndrome virus (SARS-CoV). The novel coronavirus SARS-CoV-2 disease has been named “COVID-19” by the World Health Organization (WHO) and on January 30, the COVID-19 outbreak was declared to constitute a Public Health Emergency of International Concern by the WHO Director-General [2]. Despite the lockdown of Wuhan and the suspension of all public transport, flights and trains on January 23, a total of 40,235 confirmed cases, including 6,484 (16.1%) with severe illness, and 909 deaths (2.2%) had been reported in China by the National Health Commission up to February 10, 2020; meanwhile, 319 cases and one death were reported outside of China, in 24 countries [3].

The origin of COVID-19 has not yet been determined although preliminary investigations are suggestive of a zoonotic, possibly of bat, origin [4, 5]. Similarly to SARS-CoV and MERS-CoV, the novel virus is transmitted from person to person principally by respiratory droplets, causing such symptoms as fever, cough, and shortness of breath after a period believed to range from 2 to 14 days following infection, according to the Centers for Disease Control and Prevention (CDC) [1, 6, 7]. Preliminary data suggest that older males with comorbidities may be at higher risk for severe illness from COVID-19 [6, 8, 9]. However, the precise virologic and epidemiologic characteristics, including transmissibility and mortality, of this third zoonotic human coronavirus are still unknown.

Using the serial intervals (SI) of the two other well-known coronavirus diseases, MERS and SARS, as approximations for the true unknown SI, Zhao et al. estimated the mean basic reproduction number (R0) of SARS-CoV-2 to range between 2.24 (95% CI: 1.96-2.55) and 3.58 (95% CI: 2.89-4.39) in the early phase of the outbreak [10]. Very similar estimates were obtained for R0 at the early stages of the epidemic by Imai et al., 2.6 (95% CI: 1.5-3.5) [11], as well as by Li et al., 2.2 (95% CI: 1.4-3.9), who also reported a doubling in size every 7.4 days [1]. Wu et al. estimated the R0 at 2.68 (95% CI: 2.47–2.86) with a doubling time every 6.4 days (95% CI: 5.8–7.1) and the epidemic growing exponentially in multiple major Chinese cities with a lag time behind the Wuhan outbreak of about 1–2 weeks [12].

Amidst such an important ongoing public health crisis that also has severe economic repercussions, we resorted to mathematical modelling that can shed light on essential epidemiologic parameters that determine the fate of the epidemic [13]. Here, we present the results of the analysis of time series of epidemiological data available in the public domain [14–16] (WHO, CDC, ECDC, NHC and DXY) from January 11 to February 10, 2020, and attempt a three-week forecast of the spreading dynamics of the emerged coronavirus epidemic in the epicenter in mainland China.

Methodology

Our analysis was based on the publicly available data of the new confirmed daily cases reported for the Hubei province from the 11th of January until the 10th of February [14–16]. Based on the released data, we attempted to estimate the mean values of the main epidemiological parameters, i.e. the basic reproduction number R0, the case fatality (γ^) and case recovery (β^) ratios, along with their 90% confidence intervals. However, as suggested [17], the number of infectious, and consequently the number of recovered, people is likely to be much higher. Thus, in a second scenario, we have also derived results by taking twenty times the number of reported cases for the infectious and forty times the number for the recovered cases, while keeping constant the number of deaths that is more likely to be closer to the real number. Furthermore, by calibrating the parameters of the SIRD model to fit the reported data, we also provide tentative forecasts until the 29th of February.

The basic reproduction number (R0) is one of the key values that can predict whether the infectious disease will spread into a population or die out. R0 represents the average number of secondary cases that result from the introduction of a single infectious case in a totally susceptible population during the infectiousness period. Based on the reported data of confirmed cases, we provide estimations of the R0 from the 16th up to the 20th of January in order to satisfy as much as possible the hypothesis S ≈ N, which is a necessary condition for the computation of R0.

We also provide estimations of the case fatality (γ^) and case recovery (β^) ratios over the entire period using a rolling window with a one-day step, with the data from the 11th of January to the 16th of January providing the very first estimations.

Furthermore, we calibrated the parameters of the SIRD model to fit the reported data. We first provide a coarse estimation of the recovery (β) and mortality rates (γ) of the SIRD model using the first period of the outbreak. Then, an estimation of the infection rate α is accomplished by “wrapping” around the SIRD simulator an optimization algorithm to fit the reported data from the 11th of January to the 10th of February. We started our simulations with one infected person on the 16th of November, which has been suggested as a starting date of the epidemic, and ran the SIRD model until the 10th of February. Below, we describe our approach analytically.

Let us start by denoting with S(t), I(t), R(t), D(t), the number of susceptible, infected, recovered and dead persons respectively at time t in the population of size N. For our analysis, we assume that the total number of the population remains constant. Based on the demographic data for the province of Hubei N = 59m. Thus, the discrete SIRD model reads:

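$$S(t) = S(t-1) - \frac{\alpha}{N} S(t-1) I(t-1) \tag{1}$$
$$I(t) = I(t-1) + \frac{\alpha}{N} S(t-1) I(t-1) - \beta I(t-1) - \gamma I(t-1) \tag{2}$$
$$R(t) = R(t-1) + \beta I(t-1) \tag{3}$$
$$D(t) = D(t-1) + \gamma I(t-1) \tag{4}$$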

The above system is defined in discrete time points t = 1, 2, …, with the corresponding initial condition at the very start of the epidemic: S(0) = N − 1, I(0) = 1, R(0) = D(0) = 0. Here, β and γ denote the “effective/apparent” per day recovery and fatality rates. Note that these parameters do not correspond to the actual per day recovery and mortality rates as the new cases of recovered and deaths come from infected cases several days back in time. However, one can attempt to provide some coarse estimations of the “effective/apparent” values of these epidemiological parameters based on the reported confirmed cases using an assumption and approach described in the next section.
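The iteration in Eqs (1)-(4) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `simulate_sird` and its arguments are hypothetical, and the parameter values in the usage lines are simply the Scenario I estimates reported in the Results.

```python
def simulate_sird(alpha, beta, gamma, n_pop, n_days, i0=1.0):
    """Iterate the discrete SIRD model of Eqs (1)-(4) for n_days steps."""
    s, i, r, d = [n_pop - i0], [i0], [0.0], [0.0]
    for _ in range(n_days):
        new_inf = alpha / n_pop * s[-1] * i[-1]  # (alpha/N) S(t-1) I(t-1)
        new_rec = beta * i[-1]                   # beta I(t-1)
        new_dead = gamma * i[-1]                 # gamma I(t-1)
        s.append(s[-1] - new_inf)
        i.append(i[-1] + new_inf - new_rec - new_dead)
        r.append(r[-1] + new_rec)
        d.append(d[-1] + new_dead)
    return s, i, r, d

N = 59_000_000  # population of Hubei used in the text
# 16 Nov 2019 -> 10 Feb 2020 is 14 + 31 + 31 + 10 = 86 daily steps
S, I, R, D = simulate_sird(alpha=0.191, beta=0.064, gamma=0.01, n_pop=N, n_days=86)
```

Because every term removed from one compartment is added to another, S(t) + I(t) + R(t) + D(t) stays equal to N at every step, which is a convenient sanity check on any implementation.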

Estimation of the basic reproduction number from the SIRD model

Let us first start with the estimation of R0. Initially, when the spread of the epidemic starts, all the population is considered to be susceptible, i.e. S ≈ N. Based on this assumption, by Eqs (2), (3) and (4), the basic reproduction number can be estimated by the parameters of the SIRD model as:

$$R_0 = \frac{\alpha}{\beta + \gamma} \tag{5}$$

Let us denote with ΔI(t) = I(t) − I(t − 1), ΔR(t) = R(t) − R(t − 1), ΔD(t) = D(t) − D(t − 1), the reported new cases of infectious, recovered and dead at time t, with CΔI(t), CΔR(t), CΔD(t) the cumulative numbers of confirmed cases at time t. Thus:

$$C\Delta X(t) = \sum_{i=1}^{t} \Delta X(i), \tag{6}$$

where, X = I, R, D.

Let us also denote by $\mathbf{\Delta X}(t) = [\Delta X(1), \Delta X(2), \cdots, \Delta X(t)]^T$ the t × 1 column vector containing all the reported new cases up to time t and by $\mathbf{C\Delta X}(t) = [C\Delta X(1), C\Delta X(2), \cdots, C\Delta X(t)]^T$ the t × 1 column vector containing the corresponding cumulative numbers up to time t. On the basis of Eqs (2), (3) and (4), one can provide a coarse estimation of the parameters R0, β and γ as follows.

Starting with the estimation of R0, we note that as the province of Hubei has a population of 59m, one can reasonably assume that for all practical purposes, at least at the beginning of the outbreak, S ≈ N. By making this assumption, one can then provide an approximation of the expected value of R0 using Eqs (5), (2), (3) and (4). In particular, substituting in Eq (2) the terms βI(t − 1) and γI(t − 1) with ΔR(t) = R(t) − R(t − 1) from Eq (3) and ΔD(t) = D(t) − D(t − 1) from Eq (4), and bringing them into the left-hand side of Eq (2), we get:

$$I(t) - I(t-1) + R(t) - R(t-1) + D(t) - D(t-1) = \frac{\alpha}{N} S(t-1) I(t-1) \tag{7}$$

Adding Eqs (3) and (4), we get:

$$R(t) - R(t-1) + D(t) - D(t-1) = \beta I(t-1) + \gamma I(t-1) \tag{8}$$

Finally, assuming that for all practical purposes S(t − 1) ≈ N at the beginning of the spread, and dividing Eq (7) by Eq (8), we get:

$$\frac{I(t) - I(t-1) + R(t) - R(t-1) + D(t) - D(t-1)}{R(t) - R(t-1) + D(t) - D(t-1)} = \frac{\alpha}{\beta + \gamma} = R_0 \tag{9}$$

Note that one can use directly Eq (9) to compute R0 with regression, without the need to compute first the other parameters, i.e. β, γ and α.

At this point, the regression can be done either by using the differences per se, or by using the corresponding cumulative functions (instead of the differences for the calculation of R0 using Eq (9)). Indeed, it is easy to prove that by summing up both sides of Eqs (7) and (8) over time and then dividing them, we get the following equivalent expression for the calculation of R0.

$$\frac{C\Delta I(t) + C\Delta R(t) + C\Delta D(t)}{C\Delta R(t) + C\Delta D(t)} = \frac{\alpha}{\beta + \gamma} = R_0 \tag{10}$$

Here, we used Eq (10) to estimate R0 in order to reduce the noise included in the differences. Note that the above expression is a valid approximation only at the beginning of the spread of the disease.

Thus, based on the above, a coarse estimation of R0 and its corresponding confidence intervals can be provided by solving a linear regression problem with least squares as:

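$$\hat{R}_0 = \left( [\mathbf{C\Delta R}(t) + \mathbf{C\Delta D}(t)]^T [\mathbf{C\Delta R}(t) + \mathbf{C\Delta D}(t)] \right)^{-1} [\mathbf{C\Delta R}(t) + \mathbf{C\Delta D}(t)]^T [\mathbf{C\Delta I}(t) + \mathbf{C\Delta R}(t) + \mathbf{C\Delta D}(t)]. \tag{11}$$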
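Since the regressor in Eq (11) is a single column, the normal-equations solution reduces to a ratio of two sums. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the input series are synthetic numbers chosen for easy checking (not the reported Hubei data), and the confidence intervals reported in the paper are not computed here.

```python
def slope_through_origin(x, y):
    """Least-squares slope b of y ~ b*x without intercept: b = (x^T x)^(-1) x^T y."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx

def estimate_r0(cum_infected, cum_recovered, cum_dead):
    """Eq (11): regress C_I + C_R + C_D on C_R + C_D."""
    x = [r + d for r, d in zip(cum_recovered, cum_dead)]
    y = [i + r + d for i, r, d in zip(cum_infected, cum_recovered, cum_dead)]
    return slope_through_origin(x, y)

# Synthetic cumulative series for illustration only
cum_i = [30, 60, 90, 120, 150, 180]
cum_r = [6, 12, 18, 24, 30, 36]
cum_d = [4, 8, 12, 16, 20, 24]
r0_hat = estimate_r0(cum_i, cum_r, cum_d)  # y = 4x exactly, so the slope is 4
```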

Estimation of the case fatality and case recovery ratios for the period January 11-February 10

Here, we denote by γ^ the case fatality and by β^ the case recovery ratios. Several approaches have been proposed for the calculation of the case fatality ratio (see for example the formula used by the National Health Commission (NHC) of the People’s Republic of China [18] for estimating the mortality ratio for the COVID-19 and also the discussion in [19]). Here, we adopt the one also used by the NHC, which defines the case fatality ratio as the proportion of the total number of infected cases that die from the disease.

Thus, a coarse estimation of the case fatality and recovery ratios for the period under study can be calculated using the reported cumulative infected, recovered and dead cases, by solving a linear regression problem, which for the case fatality ratio reads:

$$\hat{\gamma} = [\mathbf{C\Delta I}(t)^T \mathbf{C\Delta I}(t)]^{-1} \mathbf{C\Delta I}(t)^T \mathbf{C\Delta D}(t). \tag{12}$$

Accordingly, in an analogy to the above, the case recovery ratio reads:

$$\hat{\beta} = [\mathbf{C\Delta I}(t)^T \mathbf{C\Delta I}(t)]^{-1} \mathbf{C\Delta I}(t)^T \mathbf{C\Delta R}(t). \tag{13}$$

As the reported data are just a subset of the actual number of infected and recovered cases including the asymptomatic and/or mild ones, we have repeated the above calculations considering twenty times the reported number of infected and forty times the reported number of recovered in the total population, while leaving the reported number of dead the same given that their cataloguing is close to the actual number of deaths due to COVID-19.
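The effect of the second scenario on the two ratios can be seen directly from the regression formulas: scaling the infected series by 20 divides the fatality ratio by 20, while scaling the recovered series by 40 as well doubles the recovery ratio. The following sketch illustrates this with hypothetical function names and synthetic series (not the reported data).

```python
def ratio_through_origin(cum_infected, cum_other):
    """Regression through the origin of a cumulative series on cumulative infected."""
    sxx = sum(x * x for x in cum_infected)
    sxy = sum(x * y for x, y in zip(cum_infected, cum_other))
    return sxy / sxx

def scale_scenario_two(cum_infected, cum_recovered, cum_dead, k_inf=20, k_rec=40):
    """Scale infected and recovered counts; deaths are left unchanged."""
    return ([k_inf * x for x in cum_infected],
            [k_rec * x for x in cum_recovered],
            list(cum_dead))

# Synthetic cumulative series for illustration only
cum_i = [100, 200, 400, 800]
cum_r = [5, 10, 20, 40]
cum_d = [3, 6, 12, 24]

cfr_1 = ratio_through_origin(cum_i, cum_d)  # case fatality ratio, Scenario I
crr_1 = ratio_through_origin(cum_i, cum_r)  # case recovery ratio, Scenario I

i2, r2, d2 = scale_scenario_two(cum_i, cum_r, cum_d)
cfr_2 = ratio_through_origin(i2, d2)        # equals cfr_1 / 20
crr_2 = ratio_through_origin(i2, r2)        # equals crr_1 * 40 / 20
```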

Estimation of the “effective” SIRD model parameters

Here we note that the new cases of recovered and deaths at each time t appear with a time delay with respect to the actual number of infected cases. This time delay is generally unknown but an estimate can be given by clinical studies. However, one could also attempt to provide a coarse estimation of these parameters based only on the reported data by considering the first period of the outbreak and in particular the period from the 11th of January to the 16th of January, where the number of infected cases appears to be constant. Thus, based on Eqs (3) and (4), and the above assumption, the “effective” per day recovery rate β and the “effective” per day mortality rate γ were computed by solving the least squares problems:

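$$\gamma = \left[ \left( \mathbf{C\Delta I}(t-1) - \mathbf{C\Delta D}(t-1) - \mathbf{C\Delta R}(t-1) \right)^T \left( \mathbf{C\Delta I}(t-1) - \mathbf{C\Delta D}(t-1) - \mathbf{C\Delta R}(t-1) \right) \right]^{-1} \left( \mathbf{C\Delta I}(t-1) - \mathbf{C\Delta D}(t-1) - \mathbf{C\Delta R}(t-1) \right)^T \mathbf{\Delta D}(t), \tag{14}$$

and

$$\beta = \left[ \left( \mathbf{C\Delta I}(t-1) - \mathbf{C\Delta D}(t-1) - \mathbf{C\Delta R}(t-1) \right)^T \left( \mathbf{C\Delta I}(t-1) - \mathbf{C\Delta D}(t-1) - \mathbf{C\Delta R}(t-1) \right) \right]^{-1} \left( \mathbf{C\Delta I}(t-1) - \mathbf{C\Delta D}(t-1) - \mathbf{C\Delta R}(t-1) \right)^T \mathbf{\Delta R}(t). \tag{15}$$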
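In Eqs (14) and (15) the regressor is the number of currently infected persons at time t − 1, obtained as cumulative infected minus cumulative dead minus cumulative recovered, and the responses are the new deaths and new recoveries at time t. A minimal sketch follows, with a hypothetical function name and synthetic data constructed so that the active infected count stays constant, as assumed for the early period.

```python
def effective_rates(cum_infected, cum_recovered, cum_dead):
    """Least-squares estimates of the per-day rates (beta, gamma) as in Eqs (14)-(15)."""
    # Currently infected at each day: C_I - C_D - C_R
    active = [ci - cd - cr
              for ci, cr, cd in zip(cum_infected, cum_recovered, cum_dead)]
    x = active[:-1]  # regressor evaluated at t-1
    new_rec = [cum_recovered[t] - cum_recovered[t - 1]
               for t in range(1, len(cum_recovered))]
    new_dead = [cum_dead[t] - cum_dead[t - 1]
                for t in range(1, len(cum_dead))]
    sxx = sum(v * v for v in x)
    beta = sum(v * w for v, w in zip(x, new_rec)) / sxx
    gamma = sum(v * w for v, w in zip(x, new_dead)) / sxx
    return beta, gamma

# Synthetic series: 100 active cases each day, 10 recoveries and 2 deaths per day
cum_i = [100, 112, 124, 136]
cum_r = [0, 10, 20, 30]
cum_d = [0, 2, 4, 6]
beta_eff, gamma_eff = effective_rates(cum_i, cum_r, cum_d)  # 0.1 and 0.02
```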

As noted, these values do not correspond to the actual per day mortality and recovery rates as these would demand the exact knowledge of the corresponding time delays. Having provided an estimation of the above “effective” approximate values of the parameters β and γ, an approximation of the “effective” infection rate α that is not biased by the assumption S ≈ N can be obtained by using the SIRD simulator. In particular, in the SIRD model, the values of the β and γ parameters were set equal to the ones found using the reported data by solving the corresponding least squares problems given by Eqs (14) and (15). As initial conditions we set one infected person on the 16th of November and ran the simulator until the last date for which there are available data (here up to the 10th of February). Then, the optimal value of the infection rate α that fits the reported data was found by “wrapping” around the SIRD simulator an optimization algorithm (such as a nonlinear least-squares solver) to solve the problem:

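$$\operatorname*{argmin}_{\alpha} \left\{ \sum_{t=1}^{M} \left( w_1 f_t(\alpha; \beta, \gamma)^2 + w_2 g_t(\alpha; \beta, \gamma)^2 + w_3 h_t(\alpha; \beta, \gamma)^2 \right) \right\}, \tag{16}$$

where

$$f_t(\alpha; \beta, \gamma) = C\Delta I^{SIRD}(t) - C\Delta I(t), \quad g_t(\alpha; \beta, \gamma) = C\Delta R^{SIRD}(t) - C\Delta R(t), \quad h_t(\alpha; \beta, \gamma) = C\Delta D^{SIRD}(t) - C\Delta D(t)$$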

Here, CΔX^SIRD(t) (X = I, R, D) are the cumulative cases resulting from the SIRD simulator at time t; w1, w2, w3 correspond to scalars serving in the general case as weights to the relevant functions. For the solution of the above optimization problem we used the function “lsqnonlin” of MATLAB [20] using the Levenberg-Marquardt algorithm.
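The idea of wrapping an optimizer around the simulator can be sketched as follows. This is not the MATLAB procedure described above: since α is a single scalar, the sketch swaps in a plain golden-section search in place of Levenberg-Marquardt, and all function names are hypothetical. It also assumes that the simulated cumulative number of infected is N − S(t), which the text does not state explicitly. The "observations" are synthetic, generated by the simulator itself with a known α, so that the recovery of that value can be checked.

```python
def simulate_cumulative(alpha, beta, gamma, n_pop, n_days):
    """Run the discrete SIRD model and return cumulative infected, recovered, dead."""
    s, i, r, d = n_pop - 1.0, 1.0, 0.0, 0.0
    cum_i, cum_r, cum_d = [n_pop - s], [r], [d]
    for _ in range(n_days):
        new_inf = alpha / n_pop * s * i
        new_rec, new_dead = beta * i, gamma * i
        s, i = s - new_inf, i + new_inf - new_rec - new_dead
        r, d = r + new_rec, d + new_dead
        cum_i.append(n_pop - s)  # assumption: cumulative infected = N - S(t)
        cum_r.append(r)
        cum_d.append(d)
    return cum_i, cum_r, cum_d

def objective(alpha, data, beta, gamma, n_pop, weights, first_day):
    """Weighted sum of squared residuals of problem (16) over the observed days."""
    n_days = first_day + len(data[0]) - 1
    sim = simulate_cumulative(alpha, beta, gamma, n_pop, n_days)
    return sum(w * sum((sx - ox) ** 2 for sx, ox in zip(sim_x[first_day:], obs_x))
               for w, sim_x, obs_x in zip(weights, sim, data))

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal scalar function on [lo, hi]."""
    inv_phi = (5 ** 0.5 - 1) / 2
    c, d = hi - inv_phi * (hi - lo), lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2

# Synthetic observations from a known alpha, observed only from day 30 onwards
N, BETA, GAMMA, ALPHA_TRUE = 59_000_000, 0.1, 0.01, 0.25
truth = simulate_cumulative(ALPHA_TRUE, BETA, GAMMA, N, 60)
data = tuple(series[30:] for series in truth)

alpha_hat = golden_section_min(
    lambda a: objective(a, data, BETA, GAMMA, N, (1, 2, 2), 30), 0.1, 0.5)
```

A derivative-free one-dimensional search is enough here because β and γ are held fixed; a general nonlinear least-squares solver becomes useful when several parameters are fitted at once.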

Results

As discussed, we have derived results using two different scenarios (see in Methodology). For each scenario, we first present the results for the basic reproduction number as well as the case fatality and case recovery ratios as obtained by solving the least squares problem using a rolling window with a one-day step. For their computation, we used the first six days, i.e. from the 11th up to the 16th of January, to provide the very first estimations. We then proceeded with the calculations by adding one day in the rolling window as described in the methodology until the 10th of February. We also report the corresponding 90% confidence intervals instead of the more standard 95% because of the small size of the data. For each window, we also report the corresponding coefficients of determination (R2) representing the proportion of the variance in the dependent variable that is predictable from the independent variables, and the root mean square of error (RMSE). The estimation of R0 was based on the data until January 20, in order to satisfy as much as possible the hypothesis underlying its calculation by Eq (9).
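The windowing procedure described above (start with the first six days, then add one day at a time) can be sketched as below. The function names are hypothetical and the series are synthetic. Two conventions are assumptions of this sketch rather than statements from the text: R2 is computed as one minus the ratio of the residual sum of squares to the centered total sum of squares, and RMSE divides by the number of points in the window.

```python
def fit_quality(x, y):
    """Slope through the origin with R^2 and RMSE for one window."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    residuals = [b - slope * a for a, b in zip(x, y)]
    ss_res = sum(e * e for e in residuals)
    mean_y = sum(y) / len(y)
    ss_tot = sum((b - mean_y) ** 2 for b in y)  # centered total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = (ss_res / len(y)) ** 0.5
    return slope, r2, rmse

def expanding_window(x, y, first=6):
    """Refit on days 1..first, then 1..first+1, and so on up to the full series."""
    return [fit_quality(x[:end], y[:end]) for end in range(first, len(x) + 1)]

# Synthetic cumulative infected and dead counts (eight days, for illustration)
cum_i = [100, 200, 300, 400, 500, 600, 700, 800]
cum_d = [3, 6, 9, 12, 15, 18, 21, 24]
results = expanding_window(cum_i, cum_d)  # one (slope, R^2, RMSE) tuple per window
```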

Then, as described above, we provide coarse estimations of the “effective” per day recovery and mortality rates of the SIRD model based on the reported data by solving the corresponding least squares problems. Then, an estimation of the infection rate α was obtained by “wrapping” around the SIRD simulator an optimization algorithm as described in the previous section. Finally, we provide tentative forecasts for the evolution of the outbreak based on both scenarios until the end of February.

Scenario I: Results obtained using the exact numbers of the reported confirmed cases

Fig 1 depicts an estimation of R0 for the period January 16-January 20. Using the first six days from the 11th of January, R0^ results in ∼ 4.80 (90% CI: 3.36-6.67); using the data until January 17, R0^ results in ∼ 4.60 (90% CI: 3.56-5.65); using the data until January 18, R0^ results in ∼ 5.14 (90% CI: 4.25-6.03); using the data until January 19, R0^ results in ∼ 6.09 (90% CI: 5.02-7.16); and using the data until January 20, R0^ results in ∼ 7.09 (90% CI: 5.84-8.35).

Fig 1. Scenario I. Estimated values of the basic reproduction number (R0) as computed by least squares using a rolling window with initial date the 11th of January.

The solid line corresponds to the mean value and dashed lines to lower and upper 90% confidence intervals.

Fig 2 depicts the estimated values of the case fatality (γ^) and case recovery (β^) ratios for the period January 16 to February 10. The confidence intervals are also depicted with dashed lines. Note that the large variation in the estimated values of β^ and γ^ should be attributed to the small size of the data and data uncertainty. This is also reflected in the corresponding confidence intervals. As more data are taken into account, this variation is significantly reduced. Thus, using all the available data from the 11th of January until the 10th of February, the estimated value of the case fatality ratio γ^ is ∼ 2.94% (90% CI: 2.9%-3%) and that of the case recovery ratio β^ is ∼ 0.05 (90% CI: 0.046-0.055). It is interesting to note that as more data become available, the estimated case recovery ratio increases significantly from the 31st of January (see Fig 2).

Fig 2. Scenario I. Estimated values of the case fatality (γ^) and case recovery ratios (β^) as computed by least squares using a rolling window.

Solid lines correspond to the mean values and dashed lines to lower and upper 90% confidence intervals.

In Figs 3, 4 and 5, we show the coefficients of determination (R2) and the root of mean squared errors (RMSE) for R0^, β^ and γ^, respectively.

Fig 3. Scenario I. Coefficient of determination (R2) and root mean square error (RMSE) resulting from the solution of the linear regression problem with least-squares for the basic reproduction number (R0).

Fig 4. Scenario I. Coefficient of determination (R2) and root mean square error (RMSE) resulting from the solution of the linear regression problem with least-squares for the case recovery ratio (β^).

Fig 5. Scenario I. Coefficient of determination (R2) and root mean square error (RMSE) resulting from the solution of the linear regression problem with least-squares for the case fatality ratio (γ^).

The computed approximate values of the “effective” per day mortality and recovery rates of the SIRD model were γ ∼ 0.01 and β ∼ 0.064 (corresponding to a recovery period of ∼ 15 d). Note that because of the extremely small number of the data used, the confidence intervals have been disregarded. Instead, for our calculations, we have considered intervals of 20% around the expected least squares solutions. Hence, for γ, we have taken the interval (0.008, 0.012) and for β, we have taken the interval (0.05, 0.077), corresponding to recovery periods from 13 to 20 days. As described in the methodology, we have also used the SIRD simulator to provide an estimation of the “effective” infection rate α by optimization with w1 = 1, w2 = 2, w3 = 2. Thus, we performed the simulations by setting β = 0.064 and γ = 0.01, and as initial conditions one infected, zero recovered and zero dead on November 16th 2019, and ran until the 10th of February. The optimal value of the infection rate (α), with respect to the reported confirmed cases from the 11th of January to the 10th of February, was ∼ 0.191 (90% CI: 0.19-0.192). This corresponds to a mean value of the basic reproduction number R0^ ≈ 2.6. Note that this value is lower compared to the value that was estimated using solely the reported data.
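As a quick check of the arithmetic, inserting the Scenario I estimates into Eq (5) and inverting β to obtain recovery periods gives:

$$R_0 = \frac{\alpha}{\beta + \gamma} = \frac{0.191}{0.064 + 0.01} = \frac{0.191}{0.074} \approx 2.58, \qquad \frac{1}{\beta} = \frac{1}{0.064} \approx 15.6 \text{ d}$$

$$0.8 \times 0.064 \approx 0.05, \quad 1.2 \times 0.064 \approx 0.077, \qquad \frac{1}{0.077} \approx 13 \text{ d}, \quad \frac{1}{0.05} = 20 \text{ d}$$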

Finally, using the derived values of the parameters α, β, γ, we performed simulations until the end of February. The results of the simulations are given in Figs 6, 7 and 8. Solid lines depict the evolution, when using the expected (mean) estimations and dashed lines illustrate the corresponding lower and upper bounds as computed at the limits of the confidence intervals of the estimated parameters.
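One way to obtain such bounds is to run the simulator at every combination of the interval limits and take the smallest and largest outcome. The sketch below does this for the cumulative number of infected on February 29. It is an illustration under assumptions: the text does not say which combinations of limits were simulated, so all eight corners are used here, the cumulative number of infected is taken as N − S(t), and the function name is hypothetical.

```python
from itertools import product

def final_cumulative_infected(alpha, beta, gamma, n_pop, n_days):
    """Cumulative infected N - S after n_days steps of the discrete SIRD model."""
    s, i = n_pop - 1.0, 1.0
    for _ in range(n_days):
        new_inf = alpha / n_pop * s * i
        s, i = s - new_inf, i + new_inf - (beta + gamma) * i
    return n_pop - s

N = 59_000_000
N_DAYS = 105  # 16 Nov 2019 -> 29 Feb 2020: 14 + 31 + 31 + 29 daily steps

# Scenario I point estimates and interval limits quoted in the text
alpha_lim, beta_lim, gamma_lim = (0.19, 0.192), (0.05, 0.077), (0.008, 0.012)

expected = final_cumulative_infected(0.191, 0.064, 0.01, N, N_DAYS)
# Assumption: bounds are the extremes over all 8 corner combinations
corners = [final_cumulative_infected(a, b, g, N, N_DAYS)
           for a, b, g in product(alpha_lim, beta_lim, gamma_lim)]
lower, upper = min(corners), max(corners)
```

Because growth is close to exponential over this horizon, small changes in β and γ are amplified strongly, which is consistent with the wide spread between the lower and upper bounds discussed in the text.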

Fig 6. Scenario I. Simulations until the 29th of February of the cumulative number of infected as obtained using the SIRD model.

Dots correspond to the number of confirmed cases from the 16th of January to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters α = 0.191, β = 0.064d−1, γ = 0.01; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters.

Fig 7. Scenario I. Simulations until the 29th of February of the cumulative number of recovered as obtained using the SIRD model.

Dots correspond to the number of confirmed cases from the 16th of January to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters α = 0.191, β = 0.064d−1, γ = 0.01; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters.

Fig 8. Scenario I. Simulations until the 29th of February of the cumulative number of deaths as obtained using the SIRD model.

Dots correspond to the number of confirmed cases from the 16th of January to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters α = 0.191, β = 0.064d−1, γ = 0.01; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters.

As Figs 6 and 7 suggest, the forecast of the outbreak at the end of February, through the SIRD model is characterized by high uncertainty. In particular, simulations result in an expected number of ∼ 180,000 infected cases but with a high variation: the lower bound is at ∼ 45,000 infected cases while the upper bound is at ∼ 760,000 cases. Similarly for the recovered population, simulations result in an expected number of ∼ 60,000, while the lower and upper bounds are at ∼ 22,000 and ∼ 170,000, respectively. Finally, regarding the deaths, simulations result in an average number of ∼ 9,000, with lower and upper bounds, ∼ 2,700 and ∼ 34,000, respectively.

Thus, the expected trends of the simulations suggest that the mortality rate is lower than the one estimated with the current data, and thus the death toll is expected to be significantly lower than the expected trends of the predictions.

As this paper was revised, the reported number of deaths on the 22nd of February was 2,344, while the expected number of the forecast was ∼4,300 with a lower bound of ∼1,300. Regarding the number of infected and recovered cases by February 20, the cumulative numbers of confirmed reported cases were 64,084 infected and 15,299 recovered, while the expected trends of the forecasts were ∼83,000 for the infected and ∼28,000 for the recovered cases. Hence, based on this estimation, the evolution of the epidemic was well within the bounds of our forecasting.

Scenario II. Results obtained by taking twenty times the number of infected and forty times the number of recovered people with respect to the confirmed cases

For our illustrations, we assumed that the number of infected is twenty times the number of the confirmed infected and that the number of recovered is forty times the number of the confirmed recovered people. Based on this scenario, Fig 9 depicts an estimation of R0 for the period January 16-January 20. Using the first six days from the 11th of January to the 16th of January, R0^ results in 3.2 (90% CI: 2.4-4.0); using the data until January 17, R0^ results in 3.1 (90% CI: 2.5-3.7); using the data until January 18, R0^ results in 3.4 (90% CI: 2.9-3.9); using the data until January 19, R0^ results in 3.9 (90% CI: 3.3-4.5); and using the data until January 20, R0^ results in 4.5 (90% CI: 3.8-5.3).

Fig 9. Scenario II. Estimated values of the basic reproduction number (R0) as computed by least squares using a rolling window with initial date the 11th of January.

The solid line corresponds to the mean value and dashed lines to lower and upper 90% confidence intervals.

It is interesting to note that the above estimation of R0 is close enough to the one reported in other studies (see in the Introduction for a review).

Fig 10 depicts the estimated values of the case fatality (γ^) and case recovery (β^) ratios for the period January 16 to February 10. The confidence intervals are also depicted with dashed lines. Note that the large variation in the estimated values of β^ and γ^ should be attributed to the small size of the data and data uncertainty. This is also reflected in the corresponding confidence intervals. As more data are taken into account, this variation is significantly reduced. Thus, using all the (scaled) data from the 11th of January until the 10th of February, the estimated value of the case fatality ratio γ^ now drops to ∼ 0.147% (90% CI: 0.144%-0.15%) while that of the case recovery ratio is ∼ 0.1 (90% CI: 0.091-0.11). It is also interesting to note that as more data become available, the estimated case recovery ratio increases slightly (see Fig 10), while the case fatality ratio (in the total population) seems to stabilize at ∼ 0.15%.

Fig 10. Scenario II. Estimated values of case fatality (γ^) and case recovery (β^) ratios, as computed by least squares using a rolling window (see in Methodology).

Solid lines correspond to the mean values and dashed lines to lower and upper 90% confidence intervals.

In Figs 11, 12 and 13, we show the coefficients of determination (R2) and the root of mean squared errors (RMSE), for R0^, β^ and γ^, respectively.

Fig 11. Scenario II. Coefficient of determination (R2) and root mean square error (RMSE) resulting from the solution of the linear regression problem with least-squares for the basic reproduction number (R0).

Fig 12. Scenario II. Coefficient of determination (R2) and root mean square error (RMSE) resulting from the solution of the linear regression problem with least-squares for the case recovery ratio (β^).

Fig 13. Scenario II. Coefficient of determination (R2) and root mean square error (RMSE) resulting from the solution of the linear regression problem with least-squares for the case fatality ratio (γ^).

The computed values of the “effective” per day mortality and recovery rates of the SIRD model were γ ∼ 0.0005 and β ∼0.16d−1 (corresponding to a recovery period of ∼ 6 d). Note that because of the extremely small number of the data used, the confidence intervals have been disregarded. Instead, for calculating the corresponding lower and upper bounds in our simulations, we have taken intervals of 20% around the expected least squares solutions. Hence, for γ we have taken the interval (0.0004 and 0.0006) and for β, we have taken the interval between (0.13 and 0.19) corresponding to an interval of recovery periods from 5 to 8 days.

Again, we used the SIRD simulator to provide estimation of the infection rate by optimization setting w1 = 1, w2 = 400, w3 = 1 to balance the residuals of deaths with the scaled numbers of the infected and recovered cases. Thus, to find the optimal infection transmission rate, we used the SIRD simulations with β = 0.16d−1, and γ = 0.0005 and as initial conditions one infected, zero recovered, zero deaths on November 16th 2019, and ran until the 10th of February.

The optimal value of the infection rate (α), with respect to the reported confirmed cases from the 11th of January to the 10th of February, was found to be ∼ 0.319 (90% CI: 0.318-0.32). This corresponds to a mean value of the basic reproduction number R0^ ≈ 2.
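As a quick check of the arithmetic, inserting the Scenario II estimates into Eq (5) gives:

$$R_0 = \frac{\alpha}{\beta + \gamma} = \frac{0.319}{0.16 + 0.0005} = \frac{0.319}{0.1605} \approx 1.99$$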

Finally, using the derived values of the parameters α, β, γ, we have run the SIRD simulator until the end of February. The simulation results are given in Figs 14, 15 and 16. Solid lines depict the evolution, when using the expected (mean) estimations and dashed lines illustrate the corresponding lower and upper bounds as computed at the limits of the confidence intervals of the estimated parameters.

Fig 14. Scenario II. Simulations until the 29th of February of the cumulative number of infected as obtained using the SIRD model.

Dots correspond to the number of confirmed cases from the 16th of January to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters α = 0.319, β = 0.16d−1, γ = 0.0005; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters.

Fig 15. Scenario II. Simulations until the 29th of February of the cumulative number of recovered as obtained using the SIRD model.

Dots correspond to the number of confirmed cases from 16th of January to the 10th of February. The initial date of the simulations was the 16th of November, with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters α = 0.319, β = 0.16d−1, γ = 0.0005; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters.

Fig 16. Scenario II. Simulations until the 29th of February of the cumulative number of deaths as obtained using the SIRD model.

Dots correspond to the number of confirmed cases from the 16th of January to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters α = 0.319, β = 0.16d−1, γ = 0.0005; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters.

Again, as Figs 15 and 16 suggest, the forecast of the outbreak at the end of February through the SIRD model is characterized by high uncertainty. In particular, in Scenario II, by February 29, simulations result in an expected actual number of ∼8m infected cases (corresponding to ∼13% of the total population), with a lower bound at ∼720,000 and an upper bound at ∼37m cases. Similarly, for the recovered population, simulations result in an expected actual number of ∼4.5m (corresponding to ∼8% of the total population), while the lower and upper bounds are at ∼430,000 and ∼23m, respectively. Finally, regarding the deaths, simulations under this scenario result in an average number of ∼14,000, with lower and upper bounds at ∼900 and ∼100,000, respectively.

Importantly, under this scenario, the simulations shown in Fig 14 suggest a decline of the outbreak at the end of February. Table 1 summarizes the above results for both scenarios.

Table 1. Model parameters, their computed values and forecasts for the Hubei province under two scenarios: (I) using the exact values of confirmed cases or (II) using estimations for infected and recovered (twenty and forty times the number of confirmed cases, respectively).

| Estimations | Symbol | Parameter | Computed values | 90% CI |
|---|---|---|---|---|
| Scenario I: Exact numbers for confirmed cases | | | | |
| Based on linear regression of the data | R0 | Basic reproduction number, 11-16 Jan | 4.80 | 3.36-6.67 |
| | R0 | Basic reproduction number, 11-17 Jan | 4.60 | 3.56-5.65 |
| | R0 | Basic reproduction number, 11-18 Jan | 5.14 | 4.25-6.03 |
| | β^ | case recovery ratio | 0.05 | 0.046-0.055 |
| | γ^ | case fatality ratio | 2.94% | 2.9%-3% |
| Based on the SIRD simulator (Nov 16-Feb 10) | R0 | Basic reproduction number | 2.6 | - |
| | α | infection rate | 0.191 | 0.19-0.192 |
| Forecast to Feb 29 (Cumulative) | | infected | 180,000 | 45,000-760,000 |
| | | recovered | 60,000 | 22,000-170,000 |
| | | deaths | 9,000 | 2,700-34,000 |
| Scenario II: x20 Infected, x40 recovered of confirmed cases | | | | |
| Based on linear regression of the data | R0 | Basic reproduction number, 11-16 Jan | 3.2 | 2.4-4.0 |
| | R0 | Basic reproduction number, 11-17 Jan | 3.1 | 2.5-3.7 |
| | R0 | Basic reproduction number, 11-18 Jan | 3.4 | 2.9-3.9 |
| | β^ | case recovery ratio | 0.1 | 0.091-0.11 |
| | γ^ | case fatality ratio | 0.147% | 0.144%-0.15% |
| Based on the SIRD simulator (Nov 16-Feb 10) | R0 | Basic reproduction number | 2 | - |
| | α | infection rate | 0.319 | 0.318-0.32 |
| Forecast to Feb 29 (Cumulative) | | infected | 8m | 720,000-37m |
| | | recovered | 4.5m | 430,000-23m |
| | | deaths | 14,000 | 900-100,000 |

We note that the results derived under Scenario II seem to predict a slowdown of the outbreak in Hubei after the end of February.

Discussion

We have proposed a methodology for the estimation of the key epidemiological parameters as well as the modelling and forecasting of the spread of the COVID-19 epidemic in Hubei, China by considering publicly available data from the 11th of January 2020 to the 10th of February 2020.

By the time of the acceptance of our paper, according to the official data released on the 29th of February, the cumulative number of confirmed infected cases in Hubei was ∼67,000, that of recovered was ∼31,300 and the death toll was ∼2,800. These numbers are within the lower bounds and expected trends of our forecasts from the 10th of February that are based on Scenario I. Importantly, by assuming a 20-fold scaling of the confirmed cumulative number of the infected cases and a 40-fold scaling of the confirmed number of the recovered cases in the total population, forecasts show a decline of the outbreak in Hubei at the end of February. Based on this scenario the case fatality rate in the total population is of the order of ∼0.15%.
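The comparison in the paragraph above can be written out as a small check. This is a sketch using only numbers quoted in this article (the Scenario I forecasts from Table 1 and the approximate official counts for February 29); the variable names are illustrative.

```python
# Scenario I forecasts for Feb 29 from Table 1: (expected, lower, upper).
forecast_scenario_1 = {
    "infected": (180_000, 45_000, 760_000),
    "recovered": (60_000, 22_000, 170_000),
    "deaths": (9_000, 2_700, 34_000),
}
# Approximate official cumulative counts for Hubei on Feb 29, as quoted above.
observed = {"infected": 67_000, "recovered": 31_300, "deaths": 2_800}

inside_bounds = {
    key: lower <= observed[key] <= upper
    for key, (_, lower, upper) in forecast_scenario_1.items()
}
below_expected = {
    key: observed[key] < expected
    for key, (expected, _, _) in forecast_scenario_1.items()
}

# Scenario II keeps deaths unchanged and scales infected cases by 20,
# so the case fatality ratio shrinks by the same factor.
cfr_scenario_1 = 0.0294
cfr_scenario_2 = cfr_scenario_1 / 20

print(inside_bounds)
print(below_expected)
print(f"Scenario II CFR: {cfr_scenario_2:.3%}")
```

All three observed counts fall between the lower bound and the expected value of the Scenario I forecast, and the scaled case fatality ratio is 0.147%, matching the Scenario II entry in Table 1.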

At this point we should note that our SIRD modelling approach did not take into account many factors that play an important role in the dynamics of the disease, such as the effect of the incubation period on the transmission dynamics, the heterogeneous contact transmission network, the effect of the measures already taken to combat the epidemic, and the characteristics of the population (e.g. the effect of age and of pre-existing health problems). Also, the estimation of the model parameters is based on an assumption, considering just the first period in which the first cases were confirmed and reported. Of note, COVID-19, which is thought to be principally transmitted from person to person by respiratory droplets and fomites, without excluding the possibility of the fecal-oral route [21], had been spreading for at least a month and a half before the imposed lockdown and quarantine of Wuhan on January 23, having thus infected unknown numbers of people. The number of asymptomatic and mild cases with subclinical manifestations that probably did not present to hospitals for treatment may be substantial; these cases, which possibly represent the bulk of the COVID-19 infections, remain unrecognized, especially during the influenza season [22]. This highly likely gross under-detection and underreporting of mild or asymptomatic cases inevitably throws calculations of severe disease courses and death rates out of context, distorting epidemiologic reality.

Another important factor that should be taken into consideration pertains to the diagnostic criteria used to determine infection status and confirm cases. A positive PCR test was required to be considered a confirmed case by China’s Novel Coronavirus Pneumonia Diagnosis and Treatment program in the early phase of the outbreak [14]. However, the sensitivity of nucleic acid testing for this novel viral pathogen may only be 30-50%, thereby often resulting in false negatives, particularly early in the course of illness. To complicate matters further, the guidance changed in the recently-released fourth edition of the program on February 6 to allow for diagnosis based on clinical presentation, but only in Hubei province [14].

The swiftly growing epidemic seems to be overwhelming even for the highly efficient Chinese logistics that did manage to build two new hospitals in record time to treat infected patients. Supportive care with extracorporeal membrane oxygenation (ECMO) in intensive care units (ICUs) is critical for severe respiratory disease. Large-scale capacities for such level of medical care in Hubei province, or elsewhere in the world for that matter, amidst this public health emergency may prove particularly challenging. We hope that the results of our analysis contribute to the elucidation of critical aspects of this outbreak so as to contain the novel coronavirus as soon as possible and mitigate its effects regionally, in mainland China, and internationally.

Conclusion

In the digital and globalized world of today, new data and information on the novel coronavirus and the evolution of the outbreak become available at an unprecedented pace. Still, crucial questions remain unanswered and accurate answers for predicting the dynamics of the outbreak simply cannot be obtained at this stage. We emphatically underline the uncertainty of available official data, particularly pertaining to the true baseline number of infected (cases), that may lead to ambiguous results and inaccurate forecasts by orders of magnitude, as also pointed out by other investigators [1, 17, 22].

Supporting information

S1 Table. Reported cumulative numbers of cases for the Hubei region, China for the period January 11-February 10.

(PDF)

Data Availability

The data used in this paper were acquired from https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6. In S1 Table we provide the data that we have used for this study, i.e. the cumulative confirmed numbers of infected cases, recovered cases and deaths from January 11 to February 10.

Funding Statement

The authors received no specific funding for this work.

References

Decision Letter 0

Artur Arikainen

20 Feb 2020

PONE-D-20-04084

Data-Based Analysis, Modelling and Forecasting of the novel Coronavirus (2019-nCoV) outbreak

PLOS ONE

Dear Professor Siettos,

Thank you again for submitting your manuscript to PLOS ONE!

Your manuscript has been reviewed by 2 reviewers, and we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please find the reviews copied below.

We would appreciate receiving your revised manuscript by Apr 05 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

*** Also, as part of an initiative between PLOS (and other publishers) and the World Health Organisation to ensure that all relevant clinical information about this outbreak is shared quickly, we would like to ask your permission to notify the WHO directly that about your study/manuscript. If you have an existing preprint of your manuscript posted, we can simply notify the WHO of the preprint identifier. Alternatively, we can send them a copy of your manuscript file. More information on this initiative can be found here: https://wellcome.ac.uk/press-release/sharing-research-data-and-findings-relevant-novel-coronavirus-covid-19-outbreak. ***

We look forward to receiving your revised manuscript.

Kind regards,

Artur Arikainen

Associate Editor

PLOS ONE

-------------

-If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

-To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

-Please note while forming your response to reviewers that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Journal Requirements:

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I think that this topic is very current and important and that these estimates can contribute to the effort to face the current epidemic in China and the potential pandemic of this new coronavirus. I suggest adapting the title to the new nomenclature of the virus (SARS-CoV-2) or that of the disease (COVID-19).

Reviewer #2: 1. On p. 2, in the Introduction, the new official name of “COVID-19” has been approved by the WHO for this virus since this article was submitted. In the second paragraph of Methodology, R0 should be R_0

2. p. 3, in equation (2), it is unclear why the second term only has $\alpha S(t-1)I(t-1)$, but does not have N in the denominator, similarly to an identical term in equation (1). Of course, it is very likely just a typo, since an expression for R_0 in equation (5) is correct.

3. When introducing the system (1)-(4), it is worth mentioning that it is defined over an interval $t=1,2,\ldots$, with the corresponding initial condition $S(0)=N$, $I(0)=1$, $R(0)=D(0)=0$.

4. p. 3, in the first line of subsection 2.1 I suggest to explicitly write $\Delta I(t)=I(t)-I(t-1)$ etc to make it clear from the start, over which time interval the changes are measured (this is now stated later on the same page).

5. p. 3, in the right-hand side of equation (7), there is an extra multiplier $(t-1)$, and in the same right-hand side, it is worth to explicitly write $N$ in the denominator. Then, after equation (8), it is worth writing $S(t-1)\approx N$, which would then give expression (9).

6. Bearing in mind that according to (9), $R0$ is related to changes in the infected/recovered/dead individuals over a single time interval, it is not clear how this then translates into expression (10), which contains cumulative time changes over the course of an epidemic. The authors should provide an explanation of how this expression was produced, and also mention in the text what is denoted by a prime in this expression. A similar question applies to expression (11), for which an explanation should also be provided.

7. p. 4, some brief explanation should be provided to explain why the weights in (12) are also square if one is minimising a sum of squares that are already non-negative.

8. p. 4, it is worth mentioning why for statistical analysis a 90% confidence interval was chosen instead of a more standard 95%.

9. In Scenario I, it is not clear why the value of R0 was only estimated using the time interval from 16 January to 20 January, while the recovery and mortality rates were obtained over a much longer interval up to 10 February. Similarly, with very high variability in the values of $\widehat{\beta}$ and $\widehat{\gamma}$ as shown in Figure 2, some comment should be provided as to how the specific values of $\beta$ and $\gamma$ were chosen on p. 2 to obtain an estimate of the mean reproduction number. Also, the authors should comment on how this value corresponds with the observed values as depicted in Fig. 1, or its continuation over a longer time period.

The values of $\widehat{\beta}$ and $\widehat{\gamma}$ shown in Fig. 2 appear to vary by between 4 and almost 8 times between their minima and maxima. The authors should comment on what such large variation could be attributed to, and what it means for being able to discern their `actual’ values for the purposes of computing the basic reproduction number.

All of these questions also apply to Scenario II.

10. When revising this paper, it is worth having a look at https://www.worldometers.info/coronavirus/

to see how the observed numbers of new cases identified in Hubei province since this paper was submitted, fit with model predictions. This should give another week worth of data points, and the authors could briefly comment on which of their scenarios appears to be closer to the current observed state of an epidemic in Hubei province.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Andre Ricardo Ribas Freitas

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Mar 31;15(3):e0230405. doi: 10.1371/journal.pone.0230405.r002

Author response to Decision Letter 0


23 Feb 2020

Response Letter

We would like to thank both reviewers for the time and effort they put in reviewing our manuscript. We appreciate their rapid responses and their positive and constructive comments.

We would also like to thank the handling Editor for all his efforts to oversee the review process.

Below, we tried to respond point-by-point to the comments of the reviewers and revised our manuscript accordingly. We believe that the manuscript is now ready for publication.

Reviewer #1: I think that this topic is very current and important and that these estimates can contribute to the effort to face the current epidemic in China and the potential pandemic of this new coronavirus. I suggest adapting the title to the new nomenclature of the virus (SARS-CoV-2) or that of the disease (COVID-19).

Response

We thank the reviewer for his/her positive evaluation of our work. We have now updated the title and the text with the new nomenclature of COVID-19.

Reviewer #2:

1. On p. 2, in the Introduction, the new official name of “COVID-19” has been approved by the WHO for this virus since this article was submitted. In the second paragraph of Methodology, R0 should be R_0

Response

That was also the comment of the first reviewer. We have updated the title and the text with the new name of the novel coronavirus and the disease. We have also corrected the typo.

2. p. 3, in equation (2), it is unclear why the second term only has $\alpha S(t-1)I(t-1)$, but does not have N in the denominator, similarly to an identical term in equation (1). Of course, it is very likely just a typo, since an expression for R_0 in equation (5) is correct.

Response

Thank you. Indeed, it was just a typo, which has been corrected in the revised manuscript.

3. When introducing the system (1)-(4), it is worth mentioning that it is defined over an interval $t=1,2,\ldots$, with the corresponding initial condition $S(0)=N$, $I(0)=1$, $R(0)=D(0)=0$.

Response

We have now included the clarification.

4. p. 3, in the first line of subsection 2.1 I suggest to explicitly write $\Delta I(t)=I(t)-I(t-1)$ etc to make it clear from the start, over which time interval the changes are measured (this is now stated later on the same page).

Response

We have updated the corresponding equalities, as suggested.

5. p. 3, in the right-hand side of equation (7), there is an extra multiplier $(t-1)$, and in the same right-hand side, it is worth to explicitly write $N$ in the denominator. Then, after equation (8), it is worth writing $S(t-1)\approx N$, which would then give expression (9).

Response

There was a typo there, which we have corrected. Thank you for noticing. We have also written $S(t-1)\approx N$.

6. Bearing in mind that according to (9), $R0$ is related to changes in the infected/recovered/dead individuals over a single time interval, it is not clear how this then translates into expression (10), which contains cumulative time changes over the course of an epidemic. The authors should provide an explanation of how this expression was produced, and also mention in the text what is denoted by a prime in this expression. A similar question applies to expression (11), for which an explanation should also be provided.

Response

The reviewer is correct to raise this point. One has two options here: either to do the regression using the differences as they appear in (9), or using their cumulative sums. If one sums up both sides of (7) and (8) over time, then one gets their cumulative sums instead of the differences; this way one also reduces the noise in the regression.

Below Eq. 9, we have now added the following paragraphs and a new Equation (Eq. 10) to explain this:

“At this point, the regression can be done either by using the differences per se, or by using the corresponding cumulative functions (instead of the differences for the calculation of $R_0$ using Eq.(\ref{eq9})). Indeed, it is easy to prove that by summing up both sides of Eq.(\ref{eq7}) and Eq.(\ref{eq8}) over time and then dividing them we get the following equivalent expression for the calculation of $R_0$.”

Here, we have used least squares with Eq. (\ref{eq9b}) to estimate $R_0$ in order to reduce the noise included in the differences. Note that the above expression is a valid approximation only at the beginning of the spread of the disease.

Thus, based on the above, a coarse estimation of $R_0$ and its corresponding confidence intervals can be provided by solving a linear least-squares regression problem as:

Eq. 10.

We also explain that in the above the prime $'$ is for the transpose operation.
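To make the two regression options concrete, the sketch below fits the slope of a line through the origin, once on daily differences and once on their cumulative sums. It is an illustration under stated assumptions rather than the authors' procedure: the data are synthetic, generated from a discrete SIRD model with hypothetical parameter values, the through-origin estimator (x'x)^-1 x'y is an assumed form of the least-squares solution, and all names are illustrative.

```python
import random
from itertools import accumulate

def slope_through_origin(x, y):
    """Least-squares slope of y = R0 * x with no intercept: (x'x)^-1 x'y."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / sxx

# Synthetic early-phase data; all parameter values here are hypothetical.
alpha, beta, gamma, n_pop = 0.3, 0.1, 0.01, 1e9
s, i = n_pop - 1.0, 1.0
d_new, d_removed = [], []
for _ in range(20):
    new_infected = alpha * s * i / n_pop
    removed = (beta + gamma) * i
    d_new.append(new_infected)   # plays the role of dI + dR + dD
    d_removed.append(removed)    # plays the role of dR + dD
    s, i = s - new_infected, i + new_infected - removed

r0_true = alpha / (beta + gamma)

# Noise-free cumulative regression recovers R0 while S is close to N.
r0_clean = slope_through_origin(list(accumulate(d_removed)),
                                list(accumulate(d_new)))

# Add multiplicative reporting noise and compare the two options.
rng = random.Random(0)
noisy_new = [v * (1 + 0.2 * rng.gauss(0.0, 1.0)) for v in d_new]
noisy_removed = [v * (1 + 0.2 * rng.gauss(0.0, 1.0)) for v in d_removed]

r0_from_differences = slope_through_origin(noisy_removed, noisy_new)
r0_from_cumulative = slope_through_origin(list(accumulate(noisy_removed)),
                                          list(accumulate(noisy_new)))

print(r0_true, r0_clean, r0_from_differences, r0_from_cumulative)
```

Summation averages out day-to-day fluctuations, which is the motivation given in the response; how much it helps in a particular data set depends on the noise structure.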

7. p. 4, some brief explanation should be provided to explain why the weights in (12) are also square if one is minimising a sum of squares that are already non-negative.

Response

Indeed, the weights should be out of the parenthesis.

8. p. 4, it is worth mentioning why for statistical analysis a 90% confidence interval was chosen instead of a more standard 95%.

Response

We now mention explicitly why the 90% CI was used. In the Results section, we have now added the following sentence:

“We also report the corresponding 90% confidence intervals instead of the more standard 95% because of the small size of the data”

9. In Scenario I, it is not clear why the value of R0 was only estimated using the time interval from 16 January to 20 January, while the recovery and mortality rates were obtained over a much longer interval up to 10 February.

Response

We have taken just the interval from 16th to 20th of January to compute R0 in order to be as close as possible to the hypothesis of S~N. As the epidemic evolved with more cases, this hypothesis is violated.

On the other hand, the computations of the recovery and mortality rates are getting more robust as more data are introduced. In the Results section, we have now added the following sentence to better explain this point.

“The estimation of $R_0$ was based on the data until January 20 in order to satisfy as much as possible the hypothesis underlying its calculation by Eq.(\ref{eq9})”

10. Similarly, with very high variability in the values of $\widehat{\beta}$ and $\widehat{\gamma}$ as shown in Figure 2, some comment should be provided as to how the specific values of $\beta$ and $\gamma$ were chosen on p. 2 to obtain an estimate of the mean reproduction number. Also, the authors should comment on how this value corresponds with the observed values as depicted in Fig. 1, or its continuation over a longer time period.

Response

Exactly. Due to this variability, we do not compute R0 simply by taking the ratio of the values of $\beta$, $\gamma$ and $\alpha$ computed by regression. Actually, we do not compute $\alpha$ by regression. We compute R0 explicitly through Eq. 9 (actually through the corresponding cumulative functions).

We have now added the following sentence below Eq.9 to make this clear:

“Note that one can use directly Eq.(\ref{eq9}) to compute $R_0$ with regression, without the need to compute first the other parameters, i.e. $\beta$, $\gamma$ and $\alpha$.”

11. The values of $\widehat{\beta}$ and $\widehat{\gamma}$ shown in Fig. 2 appear to vary by between 4 and almost 8 times between their minima and maxima. The authors should comment on what such large variation could be attributed to, and what it means for being able to discern their `actual’ values for the purposes of computing the basic reproduction number.

Response

We have now added the following sentence to comment on this in the revised manuscript.

“Note that the large variation in the estimated values of $\beta$ and $\gamma$ may be accounted to the small size of the data and data uncertainty. This is also reflected in the corresponding confidence intervals. As more data are taken into account, this variation is significantly reduced.”

12. All of these questions also apply to Scenario II.

Response

We have now updated the text accordingly also for the second Scenario.

13. When revising this paper, it is worth having a look at

https://www.worldometers.info/coronavirus/

to see how the observed numbers of new cases identified in Hubei province since this paper was submitted, fit with model predictions. This should give another week worth of data points, and the authors could briefly comment on which of their scenarios appears to be closer to the current observed state of an epidemic in Hubei province.

Response

We thank the reviewer for this comment. We have re-done the computations considering that the actual number of infected in the population is 20 times the reported confirmed cases of infected and 40 times the number of confirmed cases of recovered, while keeping the number of deaths unchanged.

It seems that the forecasts from the simulations based on this Scenario are closer to the scaled current observed data. Thus, we have revised our manuscript as follows:

A. At the end of the Methodology section (before the beginning of subsection 2.2) we have added the following paragraph:

“As the reported data are just a sample of the actual number of infected and recovered cases including the asymptomatic and/or mild ones, we have repeated the above calculations considering twenty times the reported number of infected and forty times the reported number of recovered in the population, while leaving the reported number of deaths the same, given that their cataloguing is close to the actual number of deaths due to COVID-19.”

B. At the end the subsection with the results of Scenario I we have added the following paragraph:

“Furthermore, simulations reveal that the confirmed cumulative number of deaths is significantly smaller than the lower bound of the simulations. This suggests that the mortality rate is considerably lower than the estimated one based on the officially reported data. Thus, it is expected that the actual numbers of the infected, and consequently of the recovered ones too, are considerably larger than reported. Hence, we assessed the dynamics of the epidemic considering a different scenario that we present in the following subsection.”

C. Before the end of the subsection with the results of Scenario II, we have added the following paragraph:

“We note that the results derived under Scenario II seem to better reflect the actual situation, as the reported number of deaths is within the average and lower limits of the SIRD simulations. In particular, as this paper was revised, the reported number of deaths on the 22nd of February was ~2,346, while the lower bound of the forecast is ~2,900. This indicates an even lower mortality rate than that of ~0.147%, and thus an even larger actual number of infected (and recovered) cases in the population. Regarding the number of infected and recovered cases by February 20, the cumulative numbers of confirmed reported cases were 64,084 infected and 15,299 recovered. Thus, the corresponding scaled numbers are 1,281,680 infected and 611,960 recovered. Based on Scenario II, for the 22nd of February, our simulations give a total number of ~758,000 infected with ~1.8m as an upper bound, and a total of ~520,000 recovered with a total of 1.1m as an upper bound.

Hence, based on this estimation of the actual numbers, the evolution of the epidemic is within the bounds of our forecasting.”
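The scaling used in the quoted paragraph can be reproduced directly. This is only an arithmetic sketch with the figures given in the response letter; the variable names are illustrative.

```python
# Scenario II scaling factors described in the response letter.
SCALE_INFECTED = 20
SCALE_RECOVERED = 40

# Confirmed cumulative counts quoted for February 20.
confirmed = {"infected": 64_084, "recovered": 15_299}

scaled = {
    "infected": confirmed["infected"] * SCALE_INFECTED,
    "recovered": confirmed["recovered"] * SCALE_RECOVERED,
}

# Upper bounds of the Scenario II simulations quoted for February 22.
upper_bound = {"infected": 1_800_000, "recovered": 1_100_000}
within_upper = {key: scaled[key] <= upper_bound[key] for key in scaled}

print(scaled)
print(within_upper)
```

Both scaled counts lie above the quoted expected values (~758,000 and ~520,000) and below the quoted upper bounds, which is the sense in which the letter describes the epidemic as within the bounds of the forecast.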

Attachment

Submitted filename: Response Letter.pdf

Decision Letter 1

Sreekumar Othumpangat

2 Mar 2020

Data-Based Analysis, Modelling and Forecasting of the COVID-19 outbreak

PONE-D-20-04084R1

Dear Dr. Siettos,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Sreekumar Othumpangat, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

This manuscript is well organized and has improved after the incorporation of the reviewers' suggestions. This manuscript is of utmost importance due to the rapid increase in COVID-19 cases throughout the world, with the mortality rate reaching 2%, which is 20 times higher than that of influenza cases.

Reviewers' comments:

Acceptance letter

Sreekumar Othumpangat

17 Mar 2020

PONE-D-20-04084R1

Data-Based Analysis, Modelling and Forecasting of the COVID-19 outbreak

Dear Dr. Siettos:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sreekumar Othumpangat

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Reported cumulative numbers of cases for the Hubei region, China for the period January 11-February 10.

    (PDF)

    Attachment

    Submitted filename: Response Letter.pdf

    Data Availability Statement

    The data used in this paper were acquired from https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6. In S1 Table we provide the data that we have used for this study, i.e. the cumulative numbers of confirmed infected cases, recovered cases, and deaths from January 11 to February 10.


    Articles from PLoS ONE are provided here courtesy of PLOS
