Infectious Disease Modelling. 2022 Dec 16;8(1):72–83. doi: 10.1016/j.idm.2022.12.002

Revisiting classical SIR modelling in light of the COVID-19 pandemic

Leonid Kalachev a,b, Erin L Landguth b, Jon Graham a,b
PMCID: PMC9755423  PMID: 36540893

Abstract

Background

Classical infectious disease models during epidemics have widespread usage, from predicting the probability of new infections to developing vaccination plans for informing policy decisions and public health responses. However, it is important to correctly classify reported data and understand how this impacts estimation of model parameters. The COVID-19 pandemic has provided an abundant amount of data that allow for thorough testing of disease modelling assumptions, as well as how we think about classical infectious disease modelling paradigms.

Objective

We aim to assess the appropriateness of model parameter estimates and prediction results in classical infectious disease compartmental modelling frameworks given available data types (infected, active, quarantined, and recovered cases) for situations where just one data type is available to fit the model. Our main focus is on how model prediction results are dependent on data being assigned to the right model compartment.

Methods

We first use simulated data to explore parameter reliability and prediction capability with three formulations of the classical Susceptible-Infected-Removed (SIR) modelling framework. We then explore two applications with reported data to assess which data and models are sufficient for reliable model parameter estimation and prediction accuracy: a classical influenza outbreak in a boarding school in England and COVID-19 data from the fall of 2020 in Missoula County, Montana, USA.

Results

Using simulated data, we demonstrated the magnitude of parameter estimation errors and subsequent prediction errors that result from misclassifying data to model compartments. We showed that prediction accuracy in each formulation of the classical disease modelling framework was largely determined by whether the data were correctly classified. Using the classical example of an influenza epidemic in an English boarding school, we argue that the Susceptible-Infected-Quarantined-Recovered (SIQR) model is more appropriate than the commonly employed SIR model given the data collected (number of active cases). Similarly, we show in the COVID-19 disease model example that reported active cases could be used inappropriately in the SIR modelling framework if treated as infected.

Conclusions

We demonstrate the role of misclassification of disease data and thus the importance of correctly classifying reported data to the proper compartment using both simulated and real data. For both a classical influenza data set and a COVID-19 case data set, we demonstrate the implications of using the “right” data in the “wrong” model. Correctly classifying reported data has downstream impacts on predictions of the number of infections, as well as on minimal vaccination requirements.

Keywords: Basic disease reproduction number, Communicable disease control, Coronavirus, COVID-19, Disease transmission, Epidemics, Epidemiology, Influenza data, Mathematical models, Montana, SIR models

Highlights

  • The magnitude of parameter estimation and prediction errors from data misclassification in SIR compartments was determined.

  • Recovered, quarantined or active cases may produce the best parameter estimates and predictions when used in a SIQR model.

  • Under robust testing, tracing and case isolation practices, active cases should be classified as isolated cases.

1. Introduction

Mathematical modelling of epidemic spread, and with it our classical understanding of infectious disease modelling, began in the early 20th century with the development of the Susceptible-Infected-Recovered (SIR) model (Kermack & McKendrick, 1927). This model, and expansions thereof, are widely used during disease outbreaks for predicting the number of infections, identifying the basic disease reproduction number, and determining the percentage of the population that must be vaccinated to achieve herd immunity (Bjørnstad et al., 2020). Historically, infectious disease modelling has relied on estimation of parameters through SIR models fitted to opportunistically collected data. In the majority of disease outbreaks, the number of ‘officially’ reported infections can grossly underestimate the true number (Gibbons et al., 2014). For example, only the most severe cases of influenza, those that result in a hospital visit and subsequent testing, are reported (Reed et al., 2015). The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic causing the COVID-19 disease has drastically changed the way disease data are collected; while the problem of underreporting remains unsolved (Wu et al., 2020), more types of comprehensive data are now available (e.g., number of cases for new infections, active cases, recovered, vaccinated, etc.). This development provides the opportunity to better identify data misclassification in the classical SIR modelling frameworks. The extent to which misclassification affects the parameter estimates and existing models' predictions is still unclear.

Traditionally, reported new or active cases on a certain date are compartmentalized to the Infected (I) modelling population. One can assume that the appearance of any reported new cases leads to a corresponding decrease in the Susceptible population (S) or, equivalently, that the cumulative sum of reported new cases (defined herein as total cases) should equal the total population (N) minus the Susceptible population (S) (Fig. 1). During most epidemics (and especially during the SARS-CoV-2 pandemic), the Infected spreaders are ideally isolated as soon as they exhibit symptoms. It is important to point out that the classical SIR model represents the situation where the Infected spreaders are not isolated, and thus continue interacting with the Susceptibles and spreading the disease until they become Recovered (REC). Thus, under robust testing, tracing, and case isolation practices, we assert that many of these reported cases should actually be compartmentalized as Quarantined (Q) or Removed (Quarantined + Recovered; see Fig. 1).

Fig. 1.

SIR and SIQR modelling compartments and their respective paradigms. (A) Classical SIR (Susceptible – Infected – Recovered), (B) SIR (Susceptible – Infected – Removed), (C) SIQR (Susceptible – Infected – Quarantined – Recovered). In all diagrams, α is the rate constant of the process describing the conversion of Susceptible, S, to Infected, I; β is the rate constant of transfer of Infected, I, into the Recovered, R, compartment for model (A), into the Removed, R, compartment for model (B), and into the Quarantined, Q, compartment for model (C); and γ is the rate constant of the recovery process, describing the conversion of Quarantined, Q, into Recovered, R (only for model (C)). The total population for (A), (B), and (C) is the sum of populations in each compartment; time-dependent Total cases are given by N − S, Active cases correspond to I + Q, and Removed cases are given by Q + R (here R represents Recovered).
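
For reference, the compartments and rate constants named in the caption can be written as a system of differential equations. The block below is a sketch that assumes mass-action transmission (an αSI term), which is the form consistent with the formula R0 = Nα/β quoted in the Table 1 caption; the authors' own formulation is in their Appendix and may be presented differently.

```latex
% SIQR model (panel C), assuming mass-action transmission
\begin{aligned}
\frac{dS}{dt} &= -\alpha S I, \\
\frac{dI}{dt} &= \alpha S I - \beta I, \\
\frac{dQ}{dt} &= \beta I - \gamma Q, \\
\frac{dR}{dt} &= \gamma Q,
\end{aligned}
\qquad S + I + Q + R = N.

% Reported quantities expressed through the compartments
\text{Total cases} = N - S, \qquad
\text{Active cases} = I + Q, \qquad
\text{Removed cases} = Q + R.
```

Under the same assumption, panels (A) and (B) keep the first two equations and replace the last two with dR/dt = βI, where R is read as Recovered in (A) and as Removed in (B).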

Here, we investigate this assertion with a very specific set of questions related to the Susceptible-Infected-Removed model (SIR, where R corresponds to REM or Removed; see also (Prodanov, 2021)) and Susceptible-Infected-Quarantined-Recovered model (SIQR, where R corresponds to REC or Recovered) parameters’ estimation (α, β, and in some cases γ, see Fig. 1 for definitions) and resulting model predictions assuming models are fit using a single “type” of data. It is clear that the more data types that are collected, the more reliable the model parameter estimates will be, thus producing more accurate model predictions. However, if the data are limited to only one data type (e.g., total or active cases), then which of these datasets, if available, results in more reliable model parameter estimates, while still producing accurate model predictions and under what modelling framework? Furthermore, what will happen to parameter estimates and model predictions when a data type is misclassified?

To address these questions, we first use simulated data (Fig. A.1; Data A.1) to explore the parameter reliability and prediction capability with each model (SIR vs. SIQR; Fig. 1) under a variety of data scenarios. The objectives of the simulation study are twofold: (i) to investigate which parameters in the corresponding SIR/SIQR model could be reliably estimated, and (ii) to compare the predictions of the populations/compartments for which data are assumed to not be available with the corresponding known artificial data. We then utilize the findings from our simulation results in two applications. The first applies the SIR and SIQR models to a classical influenza outbreak in a boarding school in England (Influenza in a boarding school, 1978) that has been an example dataset in many mathematical biology textbooks for decades (Keshet, 1988; Murray, 1989). Here we argue for an alternative treatment of this classical example. The second application uses COVID-19 total and active cases data from Missoula County, Montana, USA to show how misclassification could be a pitfall when using the classical approach routinely applied to influenza outbreak data. We note that for each discussed model, together with parameter estimates, we also report the estimated values of the basic disease reproduction number, R0, which can be used to estimate the lower bound of the fraction of the population that needs to be vaccinated to avoid an epidemic, p = 1 − 1/R0.

2. Materials and methods

For consistency of the presentation, some classical formulations and results on SIR and SIQR models are included in the Appendix.

2.1. Creating simulated data using the SIQR model

We first generated simulated data to explore the parameter reliability and prediction capability for each model (SIR vs. SIQR). We used the SIQR model and the parameter values for α, β, and γ given in Table 1, with 5% multiplicative errors introduced to produce noise in the data. The errors were then adjusted so that the data remained monotonic through time (since the measurements are cumulative). For the chosen parameter values, the basic disease reproduction number equals R0 = 2.13. These data are found in Data A.1.

Table 1.

Estimated parameter values for different data sets, models and cases discussed in the body of the text. Notations for various population compartments: S = Susceptible; I = Infected (infection spreaders); Q = Quarantined; R = Removed (REM) or R = Recovered (REC). The total population for various models and cases: N = 800 for simulated data examples; N = 763 for Boarding School example; N = 121,630 for Missoula County example. The Total cases are given by N − S, the Active cases are represented by I + Q (in the classical SIR model Active cases are labeled as Infected). The Scenario “Actual” in the table specifies the SIQR model parameter values for which the simulated data were generated. For the simulated data examples, in the cases column the letters in the parentheses indicate the data sets used for model fits. For the Boarding School examples, in Scenario A the “confined to bed” data are interpreted as Infected (which corresponds to the classical approach); in Scenario B the combined “confined to bed” and “convalescent” data are treated as Active cases; in Scenario C the combined “confined to bed” and “convalescent” data are interpreted as Quarantined. For the Missoula County cases, in Scenario A the available total cases data were used; in Scenario B the same data were interpreted as Quarantined; in Scenario C the Active cases data were treated as Infected (which corresponds to the routinely applied classical approach; see Boarding School example, Scenario A). Model parameters α are measured in persons/day (persons/week for Missoula County analysis); β and γ are measured in 1/day (1/week for Missoula County analysis). For SIR models the parameter γ is not defined (which is indicated by a “dash” in the γ column). For SIQR models γ is defined but cannot be estimated for some cases (which is indicated by a “no estimate” in the γ column). The basic disease reproduction number R0 is estimated for each model case using the formula R0 = Nα/β.
For the simulated data set the quality of models’ fits to data is determined qualitatively: (a) by how well the model predictions describe the data which were not used for parameter estimation (for SIR), (b) by how close the estimated parameter values were to those for which the simulated data were generated (for SIQR). The Quality of fit indicator is not meaningful for the Boarding School and Missoula County models because the predictions for various compartments cannot be compared to the corresponding data which were not reported (not available) for these compartments.

| Models/Applications | Scenario (data) | α (SE) | β (SE) | γ (SE) | R0 | Quality of model fit to data |
| --- | --- | --- | --- | --- | --- | --- |
| SIQR model (data) | Actual | 0.002 (0) | 0.75 (0) | 0.25 (0) | 2.13 | |
| SIR model | A (N − S) | 0.002 (2.3 × 10^−5) | 0.74 (0.014) | – | 2.16 | moderate |
| | B (I + Q) | 0.00143 (8.5 × 10^−6) | 0.21 (0.003) | – | 5.34 | not satisfactory |
| | C (REC) | 0.00144 (2.0 × 10^−5) | 0.56 (0.016) | – | 2.04 | not satisfactory |
| | D (REM) | 0.00197 (1.3 × 10^−5) | 0.69 (0.0098) | – | 2.27 | satisfactory |
| SIQR model | A (N − S) | 0.00204 (2.6 × 10^−5) | 0.75 (0.016) | no estimate | 2.17 | moderate |
| | B (I + Q) | 0.00195 (2.3 × 10^−5) | 0.70 (0.017) | 0.25 (0.006) | 2.24 | satisfactory |
| | C (REC) | 0.00199 (4.0 × 10^−5) | 0.70 (0.024) | 0.24 (0.025) | 2.28 | satisfactory |
| | D (I) | 0.00203 (5.4 × 10^−6) | 0.74 (0.003) | no estimate | 2.20 | moderate |
| | E (Q) | 0.00195 (7.5 × 10^−5) | 0.67 (0.088) | 0.25 (0.012) | 2.31 | satisfactory |
| Boarding School SIR/SIQR model | A | 0.00219 (3.4 × 10^−5) | 0.44 (0.016) | – | 3.77 | |
| | B | 0.00209 (4.1 × 10^−4) | 0.49 (0.329) | 0.60 (0.598) | 3.23 | |
| | C | 0.00297 (0.002) | 0.79 (2.272) | 0.24 (0.165) | 2.87 | |
| Missoula County SIR model | A | 7.8 × 10^−5 (1.3 × 10^−6) | 9.17 (0.155) | – | 1.03 | |
| | B | 7.9 × 10^−5 (1.2 × 10^−6) | 9.33 (0.144) | – | 1.03 | |
| | C | 3.3 × 10^−5 (6.7 × 10^−7) | 3.45 (0.079) | – | 1.15 | |

The total population for which the propagation of infection was constructed was taken to be N = 800. The initial number of Infected cases was taken to be I(0) = 2. The simulated data are shown in Fig. A.1 for Susceptible individuals S(t), Infected spreaders I(t), Quarantined (isolated) individuals Q(t), and Recovered individuals R(t), over a 20-day period. The model fits, calculation of standard errors for parameter estimates using the Delta Method and calculation of 95% prediction bands were performed using the MATLAB optimization toolbox (MATLAB, 2020).

As illustrated in Fig. 1, the sum of Infected spreaders and Quarantined (I(t) + Q(t)) corresponds to the time-dependent Active cases, i.e., those individuals who currently have an infection. The sum of Quarantined and Recovered (Q(t) + R(t)) may be referred to as Removed (they are not involved in further spread of the infection). For the originally generated data (without errors) we assume S(t) + I(t) + Q(t) + R(t) = N, where N is the total population under consideration. The Total cases are given by the expression N − S(t). The use of Q(t) is based on the assumption that infected individuals self-report and self-isolate soon (almost immediately) after symptoms appear. For simulated data the characteristic time of removal of infected individuals from the infection spreaders pool is given by 1/β and is approximately 1.33 days, which is much shorter compared to the usual characteristic recovery time (e.g., for flu) given by 1/γ, which may range from 4 to 14 days (the meaning and values of β and γ are discussed below; see also Table 1 caption).
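
The data-generation step described in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' MATLAB code: the mass-action SIQR equations, the fixed-step Runge-Kutta integrator, uniformly distributed ±5% multiplicative errors, and the running-maximum fix for monotonicity are all choices made here, since the text does not specify the noise distribution or the adjustment method. The function names are hypothetical.

```python
import random

def siqr_rhs(y, alpha, beta, gamma):
    """Right-hand side of the assumed mass-action SIQR system."""
    s, i, q, r = y
    return (-alpha * s * i,
            alpha * s * i - beta * i,
            beta * i - gamma * q,
            gamma * q)

def rk4_step(y, h, *params):
    """One classical Runge-Kutta step of size h (in days)."""
    k1 = siqr_rhs(y, *params)
    k2 = siqr_rhs([a + 0.5 * h * b for a, b in zip(y, k1)], *params)
    k3 = siqr_rhs([a + 0.5 * h * b for a, b in zip(y, k2)], *params)
    k4 = siqr_rhs([a + h * b for a, b in zip(y, k3)], *params)
    return [a + h / 6.0 * (b + 2 * c + 2 * d + e)
            for a, b, c, d, e in zip(y, k1, k2, k3, k4)]

def simulate_siqr(n=800, i0=2, alpha=0.002, beta=0.75, gamma=0.25,
                  days=20, substeps=100):
    """Return (S, I, Q, R) at days 0, 1, ..., days (Table 1 'Actual' values)."""
    y = [n - i0, i0, 0.0, 0.0]
    out = [tuple(y)]
    for _ in range(days):
        for _ in range(substeps):
            y = rk4_step(y, 1.0 / substeps, alpha, beta, gamma)
        out.append(tuple(y))
    return out

def noisy(series, rel_err=0.05, monotone=False, rng=random):
    """Multiply each value by a uniform factor in [1 - rel_err, 1 + rel_err].

    If monotone is True, a running maximum keeps cumulative series
    non-decreasing (an assumed stand-in for the authors' adjustment).
    """
    data = [x * (1.0 + rng.uniform(-rel_err, rel_err)) for x in series]
    if monotone:
        for k in range(1, len(data)):
            data[k] = max(data[k], data[k - 1])
    return data

random.seed(0)  # fixed seed so the sketch is reproducible
N = 800
states = simulate_siqr(n=N)
total = [N - s for s, _, _, _ in states]      # Total cases, N - S(t)
active = [i + q for _, i, q, _ in states]     # Active cases, I(t) + Q(t)
removed = [q + r for _, _, q, r in states]    # Removed cases, Q(t) + R(t)
total_obs = noisy(total, monotone=True)       # cumulative, so kept monotone
active_obs = noisy(active)
```

The three derived series correspond to the data types used in the fitting scenarios; only the cumulative one is passed through the monotonicity fix.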

2.2. Fitting the simulated data with an SIR model

We first fit the SIQR model-produced data shown in Fig. A.1 (Supplementary materials) using an SIR model. We assume only one of the data types is known (e.g., we only have the data for Total cases, or Active cases, or Recovered cases, etc.). The objective of this simulation study is to investigate which parameters in the corresponding SIR model could be reliably estimated, and how the predictions for the populations/compartments for which data were not available compare with the corresponding known artificial data. Without loss of generality, we assume that the initial number of Infected cases is known (I(0) = 2) and, thus, it does not need to be estimated as part of the model fitting process.

We use four scenarios to fit the simulated data as follows. Scenario A used Total cases or N − S(t) for fitting. Scenario B used Active cases or I(t) + Q(t) for fitting. Scenario C used Recovered cases for fitting. Scenario D used Removed cases or Q(t) + R(t) for fitting. Fig. 2 and Table 1 show the compartment prediction and parameter estimation accuracy for each Scenario, respectively. The quality of fit mentioned in Table 1 is determined qualitatively by the number of data sets not used in the model fitting process which were successfully predicted/recovered by the model with fitted parameter values, as well as by closeness of the estimated parameters to their true values. Since the simulated data sets are known, we simply compare these known data sets with the model predictions produced for them. Thus, the notion of quality of model fit to data here is different from quantitative measures of fit quality commonly used in statistics.
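
The four scenarios can be mimicked with a small least-squares experiment. The sketch below makes several assumptions: it uses mass-action SIR and SIQR equations, noise-free SIQR data, and a coarse grid search in place of the MATLAB optimization routines the authors used, and it matches each data type to an SIR output as described above (Active cases are matched to I). All function and variable names are hypothetical.

```python
def rk4_step(f, y, h):
    """Advance state y by one classical Runge-Kutta step of size h."""
    k1 = f(y)
    k2 = f([a + 0.5 * h * b for a, b in zip(y, k1)])
    k3 = f([a + 0.5 * h * b for a, b in zip(y, k2)])
    k4 = f([a + h * b for a, b in zip(y, k3)])
    return [a + h / 6.0 * (b + 2 * c + 2 * d + e)
            for a, b, c, d, e in zip(y, k1, k2, k3, k4)]

def daily_states(f, y0, days=20, substeps=10):
    """Integrate dy/dt = f(y); return the state at days 0, 1, ..., days."""
    y, out = list(y0), [list(y0)]
    for _ in range(days):
        for _ in range(substeps):
            y = rk4_step(f, y, 1.0 / substeps)
        out.append(y)
    return out

def siqr_rhs(alpha, beta, gamma):
    """Assumed mass-action SIQR right-hand side, state (S, I, Q, R)."""
    return lambda y: [-alpha * y[0] * y[1],
                      alpha * y[0] * y[1] - beta * y[1],
                      beta * y[1] - gamma * y[2],
                      gamma * y[2]]

def sir_rhs(alpha, beta):
    """Assumed mass-action SIR right-hand side, state (S, I, R)."""
    return lambda y: [-alpha * y[0] * y[1],
                      alpha * y[0] * y[1] - beta * y[1],
                      beta * y[1]]

N, I0 = 800.0, 2.0
truth = daily_states(siqr_rhs(0.002, 0.75, 0.25), [N - I0, I0, 0.0, 0.0])

# Scenario -> (data taken from the SIQR run, SIR output it is matched to)
SCENARIOS = {
    "A": ([N - s for s, i, q, r in truth], lambda s, i, r: N - s),  # Total
    "B": ([i + q for s, i, q, r in truth], lambda s, i, r: i),      # Active as I
    "C": ([r for s, i, q, r in truth], lambda s, i, r: r),          # Recovered
    "D": ([q + r for s, i, q, r in truth], lambda s, i, r: r),      # Removed
}

def sse(scenario, alpha, beta):
    """Sum of squared errors between SIR output and the scenario's data."""
    data, observe = SCENARIOS[scenario]
    model = daily_states(sir_rhs(alpha, beta), [N - I0, I0, 0.0])
    return sum((observe(*m) - d) ** 2 for m, d in zip(model, data))

def grid_fit(scenario):
    """Return the (alpha, beta) grid point with the smallest SSE."""
    alphas = [round(0.0010 + 0.0001 * k, 4) for k in range(21)]  # 0.0010..0.0030
    betas = [round(0.10 + 0.05 * k, 2) for k in range(19)]       # 0.10..1.00
    return min(((a, b) for a in alphas for b in betas),
               key=lambda p: sse(scenario, p[0], p[1]))

fits = {name: grid_fit(name) for name in SCENARIOS}
for name, (a, b) in sorted(fits.items()):
    print(name, a, b, round(N * a / b, 2))  # scenario, alpha, beta, R0
```

Because the data here are noise-free and the generating values lie on the grid, Scenarios A and D return α = 0.002 and β = 0.75. The grid estimates for Scenarios B and C can be compared with the corresponding SIR rows of Table 1, keeping in mind that the grid is coarse and the authors' data included noise.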

Fig. 2.

SIR model of infection spread in a population fit to SIQR simulated data. The first row for each Scenario shows the dataset used from the SIQR simulated dataset in the SIR model fit, which is assumed to be known. The second and third rows show the SIR model predictions for the respective compartments in the SIR model, which are assumed to be unknown and, thus, not used during the model parameter estimation procedure. The data used from the SIQR simulated model for each Scenario include (note, N = S + I + Q + R): (A) Total cases or N − S, (B) Active cases or I + Q, (C) Recovered cases or R, and (D) Removed cases or Q + R. Circles represent the simulated data, solid lines represent the model fit and predictions from the SIR model, and dashed lines represent the 95% prediction bands (too small to be viewed at this scale). Parameter estimates for each model Scenario can be found in Table 1. SIQR model fit/predictions for corresponding Scenarios can be found in Supplementary material.

2.3. Fitting the simulated data with an SIQR model

We next fit the SIQR model-produced simulated data shown in Fig. A.1 to an SIQR model to assess which types of data allow the original model parameter values to be restored. We use five scenarios to fit the simulated data assuming only one of the data types is known. Scenario A used Total cases or N − S(t) for fitting. Scenario B used Active cases or I(t) + Q(t) for fitting. Scenario C used Recovered cases for fitting. Scenario D used Infected cases or I(t) for fitting. Scenario E used Quarantined cases or Q(t) for fitting. Figures A.2-A.6 and Table 1 show the compartment prediction and parameter estimation accuracy for each Scenario, respectively. All compartment predictions for data types assumed to be unknown were very good, as expected, since an SIQR model was used to generate the simulated data. Once again, as we are interested in illustrating the concept, without loss of generality, we set the initial number of Infected cases to I(0) = 2 to reduce the number of model parameters to be estimated.

2.4. 1978 England influenza boarding school data

We used a well-known and frequently studied and modeled influenza outbreak in an English boarding school in 1978 (Influenza in a boarding school, 1978). These data contained a total of 763 school boys between the ages of 10 and 18 years; all but 30 were full boarders; 113 boys stayed in the junior house and the rest in 10 houses of about 60 boys each. The boys returned to school after the Christmas holiday break and classes began on January 10. One boy who returned from Hong Kong got sick. This began an outbreak of H1N1 flu virus at the school. The records cover a 14 day period from January 22 to February 4. They include data on the numbers of boys ‘confined to bed’ and ‘convalescent’. The actual numbers of students (numerical values commonly used in various studies) can be found in (de Vries et al., 2006). Most of the school boys spent 3–7 days away from class due to illness. The data are shown in Fig. 3.

Fig. 3.

SIR and SIQR model fit to 1978 influenza outbreak in a boarding school (6). Original data plotted in the top left panel (starred-line; defined in original paper as ‘confined to bed’, circle-dashed line; defined as ‘convalescent’ or recovering). Combining ‘confined to bed’ and ‘convalescent’ produced ‘Quarantined’ in the bottom left panel (triangle-line). The raw data (open circles) used for model fitting in all Scenarios are shown in the first row with predictions shown in second and third rows. Scenarios are as follows: (A) classical application of the SIR model using the data fit to compartment I, (B) SIQR model fit as if data were actually Active cases (I + Q), and (C) SIQR model fit as if data were Quarantined cases (Q).

2.5. Fitting the influenza data with SIR and SIQR models

We assume the number of Infected cases on day “zero”, before the first student ‘confined to bed’ appeared, is I(0) = 1. The initial number of Infected, in principle, could be estimated as a part of the optimization procedure. To simplify the analysis, and without loss of generality, we fix this value at 1 to reduce the number of parameters in the model.

We considered three Scenarios for these data. Scenario A is the usual approach taken to date in all mathematical biology textbooks and related courses, where we treat the ‘confined to bed’ data as Infected cases (and dismiss the ‘convalescent data’) and fit an SIR model to those Infected data. Scenario B used an SIQR model and summed the ‘confined to bed’ and ‘convalescent’ boys to represent Active cases (fitting I(t) + Q(t)), assuming that the boys counted as ‘confined to bed’ were able to infect others over some short time interval on their way to isolation. Scenario C also used the SIQR model and summed the ‘confined to bed’ and ‘convalescent’ boys to represent Quarantined cases (fitting Q(t)), assuming all cases were not part of the infection spreading process. Fig. 3 and Table 1 show the compartment prediction and parameter estimation accuracy for each Scenario, respectively.

2.6. Missoula County, Montana, USA COVID-19 data

In this example dataset, we are presented with two types of COVID-19 data from Missoula County, Montana, USA, courtesy of Missoula City-County Health Department, from August 17, 2020–December 20, 2020 (Data A.2), which corresponds to 18 full weeks. We also note that this time period corresponds to the fall semester at the University of Montana and pre-dates the availability of a vaccine. Figure A.7 shows the two data types reported for COVID-19: new and active cases, the latter of which we are redefining here as quarantined cases, since the reported active population was assumed to not play a part in further spreading of the infection. We note that the original Missoula County data included daily counts, which are sometimes not actually reported on a daily basis. To reduce the resulting noise in the data we used weekly data to fit the models. The weekly counts of new cases were obtained by summing the new daily counts for 7 consecutive days; the total cases starting from week one (corresponding to August 17, 2020–August 23, 2020) were obtained by adding the new cases observed in a given week to the previously observed new cases (the sum of new cases for all the previous weeks); finally, the weekly active cases, i.e., the active cases observed at a given instant of time during a particular week, were obtained by averaging the active cases for that week (we emphasize that these averages may not necessarily be integers).
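
The weekly aggregation described above is simple enough to state as code. The following sketch uses made-up numbers, not the Missoula County data, and a hypothetical function name; dropping an incomplete trailing week is an assumption made here.

```python
def weekly_series(daily_new, daily_active):
    """Aggregate daily counts into weekly new, cumulative total, and mean active cases."""
    if len(daily_new) != len(daily_active):
        raise ValueError("daily series must have the same length")
    n_weeks = len(daily_new) // 7  # an incomplete trailing week is dropped (assumption)
    weekly_new, total, weekly_active = [], [], []
    running = 0
    for w in range(n_weeks):
        week = slice(7 * w, 7 * (w + 1))
        new = sum(daily_new[week])            # new cases summed over 7 days
        running += new                        # total = all new cases so far
        weekly_new.append(new)
        total.append(running)
        weekly_active.append(sum(daily_active[week]) / 7.0)  # weekly average
    return weekly_new, total, weekly_active

# Made-up two-week illustration (not real data)
new = [1, 0, 2, 3, 0, 0, 4, 5, 5, 0, 0, 6, 2, 2]
active = [3, 3, 4, 6, 6, 5, 8, 10, 12, 11, 9, 13, 14, 15]
weekly_new, total, weekly_active = weekly_series(new, active)
print(weekly_new, total, weekly_active)  # [10, 20] [10, 30] [5.0, 12.0]
```

The averaged active counts are floats by construction, which reflects the remark that these averages need not be integers.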

2.7. Fitting the COVID-19 data with the SIR model

The population size, N, was estimated to be 121,630 for Missoula County. We considered three Scenarios for these data, all under an SIR model. We only demonstrate SIR model fits (and omit SIQR fits) to the COVID-19 data to focus on the impact of incorrectly matching the data type modeled to the model compartment. Scenario A used the Total cases (i.e., cumulative sum of reported new cases) to fit N − S(t). Scenario B used the same data as in Scenario A, but fit to the Removed population. Scenario C used the active (correctly labeled quarantined) cases data fit to the Infected population. We emphasize once again that SIR (where R stands for Recovered) models do not contain a Quarantined compartment; thus, active cases are often mislabeled as infected (actively spreading the infection), leading to erroneous model predictions (as illustrated in the case of Scenario C). Here we fit the SIR (where R stands for Removed) model instead of a more appropriate SIQR model because the available data do not allow one to estimate the parameter γ (compare with the SIQR model fits for simulated data shown in Table 1). Fig. 4 and Table 1 show the compartment prediction and parameter estimation accuracy for each Scenario, respectively. For all Scenarios we assumed the number of Infected cases in week “zero”, i.e., during the first week of observations (which was August 17, 2020–August 23, 2020), to be I(0) = 5. This number was originally assumed to be unknown; it was estimated from the data together with other parameters, but then fixed at an integer value close to the estimate to make the presentation of this example simpler and more consistent with the other examples discussed in the paper.

Fig. 4.

SIR model fits for Missoula County, Montana, USA COVID-19 data reported during the fall of 2020. First original dataset plotted in top left panel (circle-line; reported New cases). Second original dataset plotted in bottom left panel (starred-line; reported Active cases or redefined here as Quarantined). The population, N, was estimated to be 121,630. The Missoula County data (open circles) used for model fitting in all Scenarios are shown in the first row with predictions shown in the second and third rows. Scenarios are as follows: (A) The reported total cases data were used for parameter estimation (top row); Infected and Removed cases predicted by the model (bottom two rows). (B) The same data as in (A) were treated as Removed for model fitting purposes (top row); Infected and Total cases predicted by the model (bottom two rows). (C) The Quarantined data were treated as Infected for parameter estimation purposes (top row; same as in the case of Boarding School example, Fig. 3A); Total and Removed cases predicted by the model (bottom 2 rows). We note a large discrepancy between the dynamics of Removed cases predictions obtained for Scenario A and Scenario C; we also note a large discrepancy between the dynamics of Total cases predictions obtained for Scenario B and Scenario C.

3. Results and discussion

3.1. SIR and SIQR paradigms

Combining the Q (Quarantined) and REC (Recovered) compartments in the SIQR system leads us to an SIR model (where R = REM corresponds to Removed), and from a mathematical standpoint, we can reduce one system to the other (Eqn. S10; Fig. 1). However, Fig. 1 illustrates how the meaning of some variables between systems (SIR vs. SIQR; Fig. 1B vs. Fig. 1C) and even within systems (SIR; Fig. 1A vs. Fig. 1B) changes. For example, the meaning of ‘Recovered’ in the classical SIR model defines those who no longer have the disease (Fig. 1A), while ‘Removed’ (Fig. 1B) could be a combination of those who are ‘Quarantined’ and ‘Recovered’, i.e., not participating in spreading the infection. Likewise, the two systems can exhibit different behavior and predictions when parameter values are estimated from real (available) data, and depending on how the data are classified or what compartment is used for model fit.
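
The reduction can be made explicit in one line. As a sketch, assume the mass-action form of the SIQR model implied by the rate constants in Fig. 1, with dQ/dt = βI − γQ and dR/dt = γQ (the authors' own derivation is in their supplementary material). Adding the two equations gives

```latex
\frac{d}{dt}\,(Q + R) \;=\; (\beta I - \gamma Q) + \gamma Q \;=\; \beta I ,
```

so the combined variable REM = Q + R obeys the removal equation of the SIR model, while the equations for S and I are unchanged. The parameter γ cancels, which is why the SIR (Removed) model has only α and β to estimate.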

3.2. Simulated data parameter estimations and prediction comparisons

Fig. 2 and Table 1 illustrate how the meaning of variables can change between modelling systems and how the two modelling systems can produce different predictions when parameter values are estimated from given simulated data (Fig. A.1). Let us describe the organization of Fig. 2, which makes it easier to understand and compare the results. Four different scenarios, A through D, are presented in four different columns. The top row in Fig. 2 indicates which of the four simulated datasets (generated from the SIQR model) was used for the SIR model fit. The other two respective SIR compartments and their model predicted behaviors are plotted in the bottom two rows. Parameter estimates for each of the four Scenarios are shown in Table 1. Scenarios A and D (model fits with total cases (N – S) and Removed cases (Q + REC), respectively) produced the best parameter estimates matching the true parameter values. Both Scenario B (Active cases (I + Q)) and C (Recovered cases) underestimated the α and β parameters (Table 1) and provided poor predictions for the two remaining compartment totals (Fig. 2, rows 2, 3). Overall, Scenario D, which used Removed cases, produced both the best parameter estimates and predictions across all compartments matching the simulated datasets.

The corresponding SIQR model parameter fits are shown in Table 1 and prediction graphs are shown in Fig. A.2-A.6. Here, we show how fitting the correct model (SIQR) with the generated simulated data produces, unsurprisingly, better parameter estimates and predictions matching the simulated data. For all Scenarios, parameter estimates for α and β are reliable. Scenarios B, C, and E (model fits with Active cases (I + Q), Recovered cases (REC), Quarantined cases (Q), respectively) are able to produce γ estimates, but Scenarios A and D (model fits with Total cases (N – S) and Infected cases (I), respectively) are unable to estimate γ. The inability to estimate γ in these two scenarios results from structural non-identifiability where the data types used in the model do not provide information regarding the transition from the Quarantined compartment to Recovered (Guillaume et al., 2019). Thus, with real data adhering to an SIQR model, if only the total new cases or only the number of Infected were reported, the γ-parameter cannot be estimated. Additional data, e.g., active cases, would need to be collected to allow the SIQR model to estimate the value of γ (see additional discussion of Missoula COVID-19 data modelling below).
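
The non-identifiability of γ from Total or Infected data can be checked numerically. The sketch below assumes the mass-action SIQR equations and a fixed-step Runge-Kutta integrator (both choices made here, not taken from the paper), and runs the model twice with the same α and β but different γ. The second γ value, 0.10, is arbitrary.

```python
def siqr_daily(alpha, beta, gamma, n=800.0, i0=2.0, days=20, substeps=20):
    """Daily (S, I, Q, R) states of the assumed mass-action SIQR model via RK4."""
    def rhs(y):
        s, i, q, r = y
        return [-alpha * s * i,
                alpha * s * i - beta * i,
                beta * i - gamma * q,
                gamma * q]

    def step(y, h):
        k1 = rhs(y)
        k2 = rhs([a + 0.5 * h * b for a, b in zip(y, k1)])
        k3 = rhs([a + 0.5 * h * b for a, b in zip(y, k2)])
        k4 = rhs([a + h * b for a, b in zip(y, k3)])
        return [a + h / 6.0 * (b + 2 * c + 2 * d + e)
                for a, b, c, d, e in zip(y, k1, k2, k3, k4)]

    y = [n - i0, i0, 0.0, 0.0]
    out = [y]
    for _ in range(days):
        for _ in range(substeps):
            y = step(y, 1.0 / substeps)
        out.append(y)
    return out

run_a = siqr_daily(0.002, 0.75, 0.25)  # gamma from Table 1 ("Actual")
run_b = siqr_daily(0.002, 0.75, 0.10)  # arbitrary alternative gamma

def max_gap(index):
    """Largest absolute difference between the two runs in one compartment."""
    return max(abs(a[index] - b[index]) for a, b in zip(run_a, run_b))

gap_s, gap_i, gap_q, gap_r = (max_gap(k) for k in range(4))
print(gap_s, gap_i)  # S (hence N - S) and I do not respond to gamma
print(gap_q, gap_r)  # Q and R do
```

Since S and I are the same for any γ, data on N − S or I alone cannot distinguish one γ from another, whereas data involving Q or R can.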

3.3. 1978 England influenza boarding school application

Next, we examine a real world and classical application to demonstrate the role of misclassification of reported disease data in classical SIR modelling. We considered three Scenarios used for fitting the classical 1978 England influenza boarding school data (Fig. 3). The data are shown in the first column of Fig. 3. Columns two, three and four represent three different scenarios, A, B, and C, respectively. Scenario A is the standard SIR model fitted with the Infected compartment, assuming those ‘confined to bed’ are infected spreaders in the system (Fig. 3A). Scenario B is the SIQR model fitted assuming the combined data of ‘confined to bed’ and ‘convalescent’ were treated as Active cases (fitted with I + Q) (Fig. 3B). In Scenario B, it is assumed that the data represent some combination of infected spreaders and isolated cases. Scenario C is the SIQR model fitted with only the Q compartment assuming the combined data mentioned in B correspond to Quarantined cases (Fig. 3C). In this last case, we assumed that the ‘confined to bed’ school boys are not active spreaders of the infection but indeed all isolated. Scenarios B and C produce more variation in model prediction bands than Scenario A (top row of Fig. 3); however, this is expected as both Scenarios B and C have an additional compartment compared to Scenario A. Table 1 shows the parameter estimates for the three Scenarios with Scenarios B and C allowing us to estimate the additional parameter γ. If the Removed cases or even Total cases had instead been collected for these data, as seen from the simulated data study (Fig. 2D, A, respectively), it is likely that parameter estimates and corresponding compartment predictions would have been more accurate, even if the SIR (Removed) model were used instead of SIQR (Recovered). 
Using the estimates for the basic disease reproduction number R0 obtained for the boarding school model Scenarios, the lower bounds for the proportions needing to be vaccinated to avoid an epidemic for Scenarios A, B, and C are p = 0.73, 0.69, and 0.65, respectively. Thus, the difference between the minimal vaccination requirements obtained for different Scenarios is comparatively large; in situations where estimates are made for large cities or even countries, the implications of this analysis producing such different vaccination requirements may be significant (in terms of the number of vaccine doses that have to be available, vaccine education campaigns, etc.). It is important to emphasize again that the classical approach to fitting the boarding school data corresponds to Scenario A. Examples include classical texts (de Vries et al., 2006; Keshet, 1988; Murray, 1989), where the numerical values of estimated model parameters are consistent with those presented in Table 1 for Scenario A. As far as we know, the very natural interpretations leading to Scenarios B and C (the most realistic for the available data) have not been considered in the scientific literature or in mathematical biology and epidemiology textbooks.

3.4. Missoula County, Montana, USA COVID-19 application

Finally, in the early outbreaks of the SARS-CoV-2 pandemic (when vaccination was not yet available), a person who experienced symptoms or tested positive for the disease ideally isolated immediately. This presents a special feature for this disease, as well as for infectious disease models, where total or active cases reported should be considered as Quarantined (in the case of the SIQR model) or even Removed (in the case of the SIR model) rather than Infected, since their direct contact with the Susceptible population becomes minimal. Furthermore, the SARS-CoV-2 pandemic has presented a situation in which we have more data types that we can now use to fit our infectious disease models, as well as assess the impact of data compartment misclassification. Fig. 4, column one, shows two data types reported for COVID-19 in Missoula County, Montana, USA (see also Fig. A.7 and Data A.2 in the Appendix): Total cases and Active cases (which we redefined as quarantined, since the reported active population was assumed to not play a part in further spreading of the infection). We again consider three Scenarios used for fitting these data to the SIR models. These different scenarios are presented in columns two, three, and four of Fig. 4, respectively. In Scenario A, the reported Total cases (N − S) data were used for parameter estimation (Fig. 4A). In Scenario B, the same data as in (A) were treated as Removed for the model fit (Fig. 4B). The earlier simulation study supports the use of Total cases (Fig. 2A) and/or Removed cases (Fig. 2D) to obtain accurate model parameter estimates and predictions. Finally, Scenario C corresponds to a situation similar to that of the boarding school example, Fig. 3A; in particular, the Quarantined population data were treated as Infected for analysis (Fig. 4C). Scenario C overpredicts the number of new cases by a factor of five (second row; Fig. 4C).
This Scenario mimics Scenario A of the boarding school influenza application showing how known quarantined (reported as active) cases could be used inappropriately in the SIR modelling framework (when treated as infected according to the classical boarding school model approach). Corresponding SIR parameter estimates and R0 values are shown in Table 1. We note that using one data type, reported cases (with appropriate interpretation) allowed us to estimate two parameters, α and β, in the SIR model (where R stands for Removed), as was the case with the simulated data for Scenarios A and D. If we additionally use the number of Active cases, this allows us to also estimate the value of the γ-parameter in SIQR model fits for scenarios A and D. Here we do not show or discuss model fits with multiple data types to preserve the consistency of presentation, but it is important to mention that, as would be expected, using multiple data types results in more reliable estimation of model parameters.

4. Conclusion

The COVID-19 disease data have shed new light on many aspects of infectious disease modelling. Classifying reported ‘active cases’ is not as straightforward as compartmentalizing these individuals into the Infected population, as shown here with simulated data, the classical England influenza boarding school data, and a COVID-19 data example. One primary outcome of this work is that available collected data, assumed to be correct, and the model describing the disease propagation process, assumed or known to be correct, may lead to erroneous results if the data types are misaligned with the model compartments, i.e., the data are misinterpreted with respect to the model and assumed to represent the wrong variable. The correct interpretation of data and assignment of data to the right compartment will lead to correct model predictions. It is important to mention that the SIR model may not always be the most appropriate framework in situations when immediate isolation occurs. Although a large number of alternative approaches to modelling infectious disease propagation have been proposed and used, for instance to predict the spread of COVID-19 (e.g., IHME COVID-19 Forecasting Team, 2020), SIR-based models still play an important role in describing various types of epidemics, including COVID-19, and in producing predictions for future disease cases to support policy decisions and public health actions (e.g., Giordano et al., 2020; Pei et al., 2020). The fear effect studied recently by Maji (2021) may, in principle, influence population behavior and affect the values of model parameters estimated from corresponding data.
In the examples of real data studied in this paper, the fear effect was not present in the boarding school influenza case, but it was present (or could be interpreted as being present) in the COVID-19 case. There, people diligently followed pre-infection isolation guidelines and immediately reported their infection status before quarantining, so the estimated model parameters are characteristic of fear-affected communities. We recognize the limitations of our analysis in including only simple SIR/SIQR models that do not fully capture the complexities of SARS-CoV-2 dynamics (Exposed, Presymptomatic, Asymptomatic, etc.); however, data for these additional compartments are even harder to come by. The short time periods for the studied data sets were chosen intentionally: during comparatively short time periods, the values of model parameters related to behavior patterns of the population during the epidemic (before, during, and after infection transmission) may be assumed to be constant. The proposed approach works well for longer time periods involving several waves of a disease, but the model parameters must be re-estimated for each wave (and, in general, the values of the parameters will change). Future work could consider simulated data and exploration of the model parameter space with more compartments when additional data types are known.

Ethics approval

The research design and methodology used only anonymized data sets. This means that no ethical approval is required.

Funding

This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health, United States (Award Numbers P20GM130418 and U54GM104944).

Data availability

All data are available in the article and its online Supplementary Materials, available as Supplementary data at IDM online.

Author contributions

Conceptualization: LK, ELL, and JG designed the study and participated in all drafts. LK conducted the investigations and created the visualizations. Funding was acquired through ELL and JG.

Declaration of competing interest

The authors declare that they have no competing interests.

Acknowledgments

The authors thank Missoula City County Health Department for collecting the data used in this study and their dedication and commitment in responding to the COVID-19 pandemic. The authors also thank Curtis Noonan, Kristin Laidre, Isaiah Reed, Jeffrey Shaman, and the anonymous reviewers for offering feedback on the manuscript.

Handling Editor: Dr Lou Yijun

Footnotes

Peer review under responsibility of KeAi Communications Co., Ltd.

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.idm.2022.12.002.

Appendix.

Here we present the SIR and SIQR model formulations used in the manuscript and discuss the relationships between different forms of classical SIR model formulations.

SIR model formulation

First, we consider the classical, so-called Susceptible-Infected-Recovered (SIR) disease propagation model (Kermack & McKendrick, 1927). The original SIR model is formulated for three time-dependent populations, Susceptible, S(t), Infected, I(t), and Recovered, R(t), living in a community (a region such as a city, county, state, etc.) with a constant total population, N, so that the following conservation relationship holds at any time (births and deaths can easily be taken into account by straightforward modifications of the model):

S(t)+I(t)+R(t)=N. (1)

For the basic model formulation, it is assumed that no one dies. The corresponding system of ordinary differential equations describing the behavior of S(t), I(t) and R(t), can be written as follows (the derivation is based on the Law of Mass Action taken from chemical and biological kinetics; here A is a constant representing the characteristic area of the region for which the model is being derived, so that s(t) = S(t)/A, i(t) = I(t)/A and r(t) = R(t)/A are the corresponding population densities):

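$$\frac{ds}{dt} = -\alpha s i, \qquad \frac{di}{dt} = +\alpha s i - \beta i, \qquad \frac{dr}{dt} = +\beta i; \tag{2}$$
$$s(0) = s_0, \qquad i(0) = i_0, \qquad r(0) = 0, \tag{3}$$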

where s0, i0, and r0=0 are the initial population densities for each compartment; α is the rate constant of the process describing the conversion of S(t) to I(t), β is the rate constant of the process which moves the Infected to the Recovered. Let us re-write (2) and (3) using the original notation for different portions of the population, S(t), I(t) and R(t), and area, A:

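$$\frac{d(S/A)}{dt} = -\alpha \frac{S}{A}\frac{I}{A}, \qquad \frac{d(I/A)}{dt} = +\alpha \frac{S}{A}\frac{I}{A} - \beta \frac{I}{A}, \qquad \frac{d(R/A)}{dt} = +\beta \frac{I}{A}; \tag{4}$$
$$\frac{S(0)}{A} = \frac{S_0}{A}, \qquad \frac{I(0)}{A} = \frac{I_0}{A}, \qquad \frac{R(0)}{A} = \frac{R_0}{A}. \tag{5}$$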

Equations (4), (5) may be further simplified (multiplying through by A) to elucidate the relationship between the coefficients in the SIR model formulated in terms of population densities vs. actual populations:

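$$\frac{dS}{dt} = -\left(\frac{\alpha}{A}\right) S I, \qquad \frac{dI}{dt} = +\left(\frac{\alpha}{A}\right) S I - \beta I, \qquad \frac{dR}{dt} = +\beta I; \tag{6}$$
$$S(0) = S_0, \qquad I(0) = I_0, \qquad R(0) = 0.$$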

Assuming that the population density ρ = N/A for a certain area (village, town, county, etc.) is approximately constant, we can use the expression A = N/ρ to re-write (6) in another (popular) form:

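$$\frac{dS}{dt} = -\left(\frac{\hat{\alpha}}{N}\right) S I, \qquad \frac{dI}{dt} = +\left(\frac{\hat{\alpha}}{N}\right) S I - \beta I, \qquad \frac{dR}{dt} = +\beta I; \tag{7}$$
$$S(0) = S_0, \qquad I(0) = I_0, \qquad R(0) = 0,$$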

where α̂ = α · ρ. We note that the forms of the SIR model represented by (6), (7) are needed only if the model coefficients estimated for one location are intended to be compared to or used for infection propagation predictions in another location. In other words, if the mobility and behavior of the population in two different areas are approximately the same, but the total populations/population densities for these locations are different, then the coefficient α estimated for one location may, in principle, be re-calculated to obtain the estimate for that coefficient in another location. If the modelling is intended for the analysis and for running “what if” scenarios at the same location, the constant parameters A and N may be absorbed into the model coefficient α (or α̂); however, in such a case the corresponding re-scaled coefficient will be unique for that particular location and the parameter value will be tied to it. In the main text, to simplify notation, we assumed that we were interested in analyzing the infection propagation in just one location. So, the SIR model was studied in the form (here, without loss of generality, we use the same notation for α as in (2), (4), (6), although the numerical value of this parameter is different compared to those in (2), (4), (6)):

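$$\frac{dS}{dt} = -\alpha S I, \qquad \frac{dI}{dt} = +\alpha S I - \beta I, \qquad \frac{dR}{dt} = +\beta I, \tag{8}$$
$$S(0) = S_0, \qquad I(0) = I_0, \qquad R(0) = R_0. \tag{9}$$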
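As a hedged illustration of the re-calculation between locations mentioned in connection with (6) and (7), suppose (this is an assumption, not a result from the paper) that two locations share the same density-form coefficient α from (2) but have different areas A_1 and A_2. Writing a_k for the location-specific coefficient that appears in (8) for location k (notation introduced here only for this example), the relationships read:

$$a_k = \frac{\alpha}{A_k} = \frac{\hat{\alpha}_k}{N_k}, \qquad \hat{\alpha}_k = \alpha \rho_k, \qquad \rho_k = \frac{N_k}{A_k}, \qquad k = 1, 2,$$

$$a_2 = a_1 \frac{A_1}{A_2}, \qquad \text{and, if additionally } \rho_1 = \rho_2, \qquad a_2 = a_1 \frac{N_1}{N_2}.$$

Under these assumptions, a coefficient fitted in a smaller community would be scaled down for a larger community of the same density, while β would be carried over unchanged only if post-infection behavior is also comparable.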

The same model formulation, (8) and (9), is used for the Susceptible–Infected–Removed situation. The difference is that in the Susceptible–Infected–Recovered case the Infected are actively spreading the infection until they recover, while in the Susceptible–Infected–Removed case the Infected only spread the infection until they are quarantined/self-isolated (or removed from the infection propagation system). The parameter α for both cases is the same (and it is affected by the population behavior before the infection occurs, i.e., its numerical value will change depending on whether people wear masks or not, how they communicate, if they are locked down, etc.). Parameter β has a different meaning in the two cases. In the cases where R stands for Recovered, this parameter just specifies the characteristic time (1/β) needed to get healthy after the infection; it is disease specific and it is not affected by human behavior. In the cases where R stands for Removed, this parameter is affected by human behavior after a person gets an infection. In particular, it specifies the characteristic time (1/β) during which a person gets isolated (quarantined) or self-isolated after the first infection symptoms appear.

Taking into account (1), the equivalent model formulation for (8) and (9) is

$$\frac{dS}{dt} = -\alpha S I, \qquad \frac{dI}{dt} = +\alpha S I - \beta I, \tag{10}$$
$$S(0) = S_0, \qquad I(0) = I_0. \tag{11}$$
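The reduced system (10), (11) can be integrated numerically in a few lines. The following is a minimal sketch rather than the implementation used for the paper (which cites MATLAB): it uses a hand-written fourth-order Runge-Kutta step, hypothetical values for `ALPHA`, `BETA`, and the initial populations, and recovers the third compartment from the conservation relationship (1).

```python
# Sketch: integrate the reduced SIR system (10)-(11) and recover R from (1).
# Parameter values and initial conditions are hypothetical.

def sir_reduced_rhs(state, alpha, beta):
    """dS/dt = -alpha*S*I, dI/dt = +alpha*S*I - beta*I."""
    s, i = state
    return (-alpha * s * i, alpha * s * i - beta * i)


def rk4_step(state, dt, alpha, beta):
    """One classical fourth-order Runge-Kutta step for the reduced system."""
    def shifted(y, k, h):
        return tuple(a + h * b for a, b in zip(y, k))

    k1 = sir_reduced_rhs(state, alpha, beta)
    k2 = sir_reduced_rhs(shifted(state, k1, dt / 2), alpha, beta)
    k3 = sir_reduced_rhs(shifted(state, k2, dt / 2), alpha, beta)
    k4 = sir_reduced_rhs(shifted(state, k3, dt), alpha, beta)
    return tuple(a + dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6
                 for a, b1, b2, b3, b4 in zip(state, k1, k2, k3, k4))


N_TOTAL = 1000.0            # hypothetical constant total population N
ALPHA, BETA = 0.0005, 0.2   # hypothetical rate constants
DT, DAYS = 0.01, 100

trajectory = [(999.0, 1.0)]  # (S0, I0); R(0) = N - S0 - I0 = 0
for _ in range(round(DAYS / DT)):
    trajectory.append(rk4_step(trajectory[-1], DT, ALPHA, BETA))

# R(t) follows from the conservation relationship S + I + R = N.
removed = [N_TOTAL - s - i for s, i in trajectory]
peak_infected = max(i for _, i in trajectory)

print("peak I(t):", round(peak_infected, 1))
print("final S, I, R:", trajectory[-1][0], trajectory[-1][1], removed[-1])
```

Working with two equations instead of three changes nothing in the dynamics; it only avoids carrying a variable that the conservation relationship already determines.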

SIQR model formulation

In addition to the compartments present in the SIR model, this modification also contains the Quarantined (Q) population compartment. For the case where we are only interested in modelling and making predictions for one particular location, the SIQR model can be formulated as follows (here we just included the area parameter A into α):

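$$\frac{dS}{dt} = -\alpha S I, \qquad \frac{dI}{dt} = +\alpha S I - \beta I, \qquad \frac{dQ}{dt} = +\beta I - \gamma Q, \qquad \frac{dR}{dt} = +\gamma Q; \tag{12}$$
$$S(0) = S_0, \qquad I(0) = I_0, \qquad Q(0) = Q_0, \qquad R(0) = 0. \tag{13}$$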

Now α is still the rate constant of the process describing the conversion of S(t) to I(t), but β is the rate constant of the process which moves the Infected (infection spreaders) to the Quarantined (isolated infected who are not spreading the disease) pool, and γ is the rate constant of the process describing the recovery of the Quarantined and their conversion to Recovered. Let us emphasize that not only has the meaning of parameter β changed compared to how it was used in the classical SIR model, but the meaning of the Infected portion, I(t), has changed as well. I(t) is now the portion of the population which is not only infected (carries the disease) but is also able to spread the infection, while the Quarantined portion, Q(t), is infected as well (carries the disease) but is not spreading the infection due to isolation. Once again, in the absence of deaths and births, the following conservation relationship holds:

S(t)+I(t)+Q(t)+R(t)=N (14)

Thus, the equivalent reduced system of three equations can be considered instead of (12) with corresponding initial conditions:

$$\frac{dS}{dt} = -\alpha S I, \qquad \frac{dI}{dt} = +\alpha S I - \beta I, \qquad \frac{dQ}{dt} = +\beta I - \gamma Q, \tag{15}$$

and Recovered can be obtained from (14): R(t) = N − S(t) − I(t) − Q(t). We note that the total active cases correspond to the sum of the Infected spreaders and Quarantined (still sick, but not spreading the infection).

It can be easily checked that by adding the differential equations for Q and R in (12) we arrive at the equation for the ‘Removed’ in the SIR model, which is just the sum of Quarantined and Recovered, Q + R, representing the persons, infected or immune, who no longer contribute to the spread of the disease:

$$\frac{d(Q+R)}{dt} = +\beta I. \tag{16}$$

Thus, we convert the SIQR system (12) for S, I, Q, and R into an SIR system for S, I, and Q + R, where instead of the original “Recovered” (R(t)) we now have “Removed” (Q(t) + R(t)). From a mathematical standpoint we have just reduced one system to another, but from a practical perspective, which involves estimating model parameters from real data, the two models turn out to be quite different.
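The reduction expressed by (16) can be checked numerically. The sketch below, written under assumed parameter values (`ALPHA`, `BETA`, `GAMMA` are hypothetical) and with a generic Runge-Kutta helper that is not part of the paper, integrates the SIQR system (12) and the SIR system (8) side by side and compares Q + R from the former with the Removed compartment of the latter.

```python
# Sketch: Q + R in the SIQR model behaves as "Removed" in the SIR model.
# All numerical values are hypothetical.

def sir_rhs(state, alpha, beta):
    s, i, r = state
    return (-alpha * s * i, alpha * s * i - beta * i, beta * i)


def siqr_rhs(state, alpha, beta, gamma):
    s, i, q, r = state
    return (-alpha * s * i, alpha * s * i - beta * i,
            beta * i - gamma * q, gamma * q)


def rk4_daily(rhs, state0, days, dt=0.01):
    """Classical RK4 for any right-hand side; returns the state once per day."""
    def shifted(y, k, h):
        return [a + h * b for a, b in zip(y, k)]

    y = list(state0)
    daily = [tuple(y)]
    for _ in range(days):
        for _ in range(round(1 / dt)):
            k1 = rhs(y)
            k2 = rhs(shifted(y, k1, dt / 2))
            k3 = rhs(shifted(y, k2, dt / 2))
            k4 = rhs(shifted(y, k3, dt))
            y = [a + dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6
                 for a, b1, b2, b3, b4 in zip(y, k1, k2, k3, k4)]
        daily.append(tuple(y))
    return daily


ALPHA, BETA, GAMMA = 0.0005, 0.2, 0.1   # hypothetical rate constants

siqr = rk4_daily(lambda y: siqr_rhs(y, ALPHA, BETA, GAMMA),
                 (999.0, 1.0, 0.0, 0.0), days=60)
sir = rk4_daily(lambda y: sir_rhs(y, ALPHA, BETA),
                (999.0, 1.0, 0.0), days=60)

# Compare Q + R (SIQR) with Removed (SIR), and the Infected curves.
max_gap_removed = max(abs((a[2] + a[3]) - b[2]) for a, b in zip(siqr, sir))
max_gap_infected = max(abs(a[1] - b[1]) for a, b in zip(siqr, sir))

print("max |(Q + R) - Removed|:", max_gap_removed)
print("max |I_SIQR - I_SIR|:   ", max_gap_infected)
```

Both gaps are at the level of floating-point rounding in this setup. The practical difference between the two models therefore lies not in the dynamics of S and I but in which observed quantity (I, Q, I + Q, or Q + R) the reported data are taken to represent.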

Basic Disease Reproduction Number, R0

The Basic Disease Reproduction number R0 is estimated for each model case using the formula (Murray, 1989): R0 = Nα/β. Here it is assumed that S(0) = S0 ≈ N. The lower bound of the fraction of the population that needs to be vaccinated to avoid an epidemic is given by (Murray, 1989): p = 1 − 1/R0.
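The two formulas above are simple enough to compute directly. The helper names and the numerical inputs in this sketch are hypothetical, and clipping p at zero when R0 ≤ 1 is a convention added here (no vaccination is needed when no epidemic can start), not something stated in the text.

```python
# Sketch: R0 = N * alpha / beta and the vaccination lower bound p = 1 - 1/R0.
# Function names and example inputs are hypothetical.

def basic_reproduction_number(alpha, beta, n_total):
    """R0 = N * alpha / beta, assuming S(0) is approximately N."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return n_total * alpha / beta


def min_vaccination_fraction(r0):
    """Lower bound p = 1 - 1/R0, clipped at 0 for R0 <= 1 (added convention)."""
    if r0 <= 0:
        raise ValueError("R0 must be positive")
    return max(0.0, 1.0 - 1.0 / r0)


def implied_r0(p):
    """Invert p = 1 - 1/R0 to get the R0 implied by a fraction 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return 1.0 / (1.0 - p)


# Hypothetical single-location parameters.
r0_example = basic_reproduction_number(alpha=0.002, beta=0.5, n_total=1000)
p_example = min_vaccination_fraction(r0_example)

# R0 values implied by the rounded fractions reported for the boarding
# school Scenarios A, B, and C (p = 0.73, 0.69, 0.65).
implied = [round(implied_r0(p), 2) for p in (0.73, 0.69, 0.65)]

print("R0:", r0_example, "p:", p_example)
print("implied R0 for Scenarios A, B, C:", implied)
```

The implied values are back-calculated from the rounded fractions quoted in Section 3.3, so they may differ slightly from the R0 estimates listed in Table 1.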

Appendix A. Supplementary data

The following is the Supplementary data to this article:

Multimedia component 1
mmc1.docx (444.1KB, docx)

References

  1. Bjørnstad O.N., Shea K., Krzywinski M., Altman N. Modelling infectious epidemics. Nature Methods. 2020;17:455–456. doi: 10.1038/s41592-020-0822-z.
  2. Gibbons C.L., et al. Measuring underreporting and under-ascertainment in infectious disease datasets: A comparison of methods. BMC Public Health. 2014;14:147. doi: 10.1186/1471-2458-14-147.
  3. Giordano G., et al. Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy. Nature Medicine. 2020;26:855–860. doi: 10.1038/s41591-020-0883-7.
  4. Guillaume J.H.A., et al. Introductory overview of identifiability analysis: A guide to evaluating whether you have the right type of data for your modeling purpose. Environmental Modelling & Software. 2019;199:418–432.
  5. IHME COVID-19 Forecasting Team. Modeling COVID-19 scenarios for the United States. Nature Medicine. 2020;27:94–105. doi: 10.1038/s41591-020-1132-9.
  6. Influenza in a boarding school. British Medical Journal. 1978;4:587. 4 March 1978.
  7. Kermack W.O., McKendrick A.G. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society. Series A. 1927;115:700–721.
  8. Keshet L. Random House; New York: 1988. Mathematical models in biology.
  9. Maji C. Impact of media-induced fear on the control of COVID-19 outbreak: A mathematical study. International Journal of Differential Equations. 2021:11. doi: 10.1155/2021/2129490. Article ID 2129490.
  10. MATLAB (R2020b). Natick, Massachusetts: The MathWorks Inc., 2020.
  11. Murray J.D. Springer-Verlag; Berlin: 1989. Mathematical biology.
  12. Pei S., Kandula S., Shaman J. Differential effects of intervention timing on COVID-19 spread in the United States. Science Advances. 2020;6 doi: 10.1126/sciadv.abd6370.
  13. Prodanov D. Analytical parameter estimation of the SIR epidemic model. Applications to the COVID-19 pandemic. Entropy. 2021;23:59. doi: 10.3390/e23010059.
  14. Reed C., et al. Estimating influenza disease burden from population-based surveillance data in the United States. PLoS One. 2015;10 doi: 10.1371/journal.pone.0118369.
  15. de Vries G., Hillen T., Lewis M., Schönfisch B., Muller J. SIAM; Philadelphia: 2006. A course in mathematical biology: Quantitative modelling with mathematical and computational methods.
  16. Wu, et al. Substantial underestimation of SARS-CoV-2 infection in the United States. Nature Communications. 2020;11:4507. doi: 10.1038/s41467-020-18272-4.


Articles from Infectious Disease Modelling are provided here courtesy of KeAi Publishing
