Infectious Disease Modelling. 2024 Oct 28;10(1):268–286. doi: 10.1016/j.idm.2024.10.008

Conditional logistic individual-level models of spatial infectious disease dynamics

Tahmina Akter a,b, Rob Deardon a,c
PMCID: PMC11609356  PMID: 39624232

Abstract

Here, we introduce a novel framework for modelling the spatiotemporal dynamics of disease spread known as conditional logistic individual-level models (CL-ILMs). This framework alleviates much of the computational burden associated with traditional spatiotemporal individual-level models for epidemics, and facilitates the use of standard software for fitting logistic models when analysing spatiotemporal disease patterns. The models can be fitted in either a frequentist or Bayesian framework. Here, we apply the new spatial CL-ILM to simulated data, semi-real data from the UK 2001 foot-and-mouth disease epidemic, and real data from a greenhouse experiment on the spread of tomato spotted wilt virus.

Keywords: Disease transmission model, ILMs, Logistic ILM, Conditional logistic ILM, Posterior predictive distribution

1. Introduction

Infectious disease outbreaks can have devastating effects on human lives, agriculture, and economic growth. For example, the ongoing coronavirus disease outbreak wreaked havoc on public health and caused substantial losses in economic activity (Barro et al., 2020). High-quality mathematical models can provide powerful insights into how complex infectious disease systems behave, which in turn can enable outbreaks to be better controlled through efficient public health strategies and resource allocation, such as intervention or vaccination (Tildesley et al., 2006). To this end, Deardon et al. (2010) introduced a class of individual-level models that focus on describing and predicting the behavior of disease at the individual level of interest (e.g., infection between people, households, or farms).

Individual-level models (ILMs) are notable because they incorporate individual-specific covariate information on susceptible and infectious individuals to better describe the dynamics of infectious disease outbreaks. For example, we can account for population heterogeneity in space by including information on separation distance. However, fitting such models to data can be difficult due to the computational cost of calculating the likelihood. This situation arises when we deal with a large population. Utilizing ILMs is also challenging because it generally requires specialized software such as the EpiILM and EpiILMCT R packages (Warriyar et al., 2020; Almutiry et al., 2021) or coding in fast languages such as Fortran, Julia, or C.

Inference for such models is usually facilitated via Markov chain Monte Carlo (MCMC) within a Bayesian framework. This is a powerful tool because it can deal with high-dimensional and complex models and offers great flexibility in the choice of model. Bayesian MCMC is also powerful in that it provides a principled way for imputing missing data, as well as enabling the incorporation of prior knowledge, allowing multiple sources of data to be combined to improve parameter identifiability. However, in terms of practical outcomes, repeating the calculation of likelihood for an ILM as required by the MCMC method can be computationally very expensive, especially when dealing with large population sizes or complex models (Deardon, 2010).

A logistic regression model is a powerful statistical tool for modelling the relationship between predictor variables and a binary response. It can be used to predict the probability of an event occurring, such as disease status (yes/no), based on the associated predictors. Moreover, the logistic model can be used as a valuable tool in epidemiology for understanding the dynamics of disease transmission within a population (Jin et al., 2015). There is also, of course, a wide range of statistical software for fitting these models. The key features of these models are simplicity, interpretability, and applicability to a wide range of scenarios.

In this study, we propose a framework for logistic ILMs, specifically in the context of spatial individual-level models. The logistic ILM is used to model the probability of infection (or non-infection) of disease at each point in time based on risk factors (e.g., environmental, demographical, or behavioral) that are associated with individuals in the population. This is done in a similar way to an ILM, but the two models have a different underlying functional form. From these models, we can understand the spatial pattern of the disease, identify associated risk factors, and make predictions or forecasts just as we can with a standard ILM.

Spatial logistic ILMs are typically non-linear in terms of their covariates due to the spatial distance function typically used. However, we can condition on the spatial parameter of the logistic ILM so that the covariates in the model are linear predictors of the log odds of infection at each time point. This enables us to use standard statistical software to fit the logistic ILM and facilitates faster inference. We do this in two stages. In the first stage, we fix the spatial parameter by choosing an appropriate value from a finite set of plausible values. This leads to a conditional logistic ILM (CL-ILM). In the second stage, the model can be fitted in either a Bayesian or frequentist framework. Here, we focus on Bayesian CL-ILMs. We can check the performance of this model relative to, say, a standard ILM by using a posterior predictive approach (Gardner et al., 2011) or a model-based information criterion.

The subsequent sections of this paper are organized as follows. In Section 2, we introduce the general framework of ILMs, the logistic ILM, the spatial logistic ILM, the CL-ILM, methods for converting from epidemic data to that suitable for fitting the CL-ILM via standard statistical software, and the posterior predictive approach. In Section 3, we discuss our simulation process. In Section 4, we present our findings and compare the ILM and CL-ILM methods based on simulation studies under SI and SIR frameworks. In Section 5, we apply the CL-ILM to semi-real data based on the UK foot and mouth disease (FMD) outbreak of 2001. In Section 6, we assess the performance of the CL-ILM using tomato spotted wilt virus (TSWV) data. Finally, in Section 7, we conclude and propose plans for future research.

2. Methodology

2.1. Individual-level model

A class of disease transmission models defined as individual-level models was introduced by Deardon et al. (2010). These models provide a tool for modelling infectious disease spread through space and time at the individual level (e.g., individual people, households, or geographical regions). The goal of these models is to mimic the dynamics of infectious disease. The models are placed within a so-called compartmental framework. We begin by considering the SI, or susceptible (S), infectious (I), framework and then the SIR, or susceptible (S), infectious (I), removed (R), framework. The data include spatial location coordinates (X, Y), infection times in the SI framework, and infection and removal times in the SIR framework. These compartmental frameworks can easily be extended to SEIR or SEIRS, which allow for a latent period and/or reinfections.

In the SI framework, individuals are initially in the susceptible state (S), and when infection occurs, the individual becomes infectious immediately and moves to the infectious state (I). In the SIR framework, the same process occurs, but after some time the individual moves to the removed state (R), for example if death or recovery occurs. In our discrete-time scenarios, the epidemic starts at time $t = 1$, when the first individual is infected, and ends at time $t = t_{\mathrm{end}}$; that is, $t = 1, 2, \ldots, t_{\mathrm{end}}$. Note that an infection occurring within the continuous time interval $(t, t + 1]$ is considered to happen at time $t$ in the discrete-time model. Thus, when we refer to $(t_{\max} - 1)$ in discrete time, we are actually accounting for infections that occur within the interval $(t_{\max} - 1, t_{\max}]$. The functional form of the ILM infection probability as defined in Deardon et al. (2010) is given as,

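$$P_{it} = 1 - \exp\left[\left\{-\Omega_S(i) \sum_{j \in I(t)} \Omega_T(j)\, k(i,j)\right\} - \varepsilon(i,t)\right], \qquad \Omega_S(i),\ \Omega_T(j),\ \varepsilon(i,t) > 0 \tag{1}$$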

where: $P_{it}$ is the probability that susceptible individual $i$ is infected at time $t$; $I(t)$ is the set of individuals who are infectious at time $t$; $\Omega_S(i)$ is a susceptibility function representing potential risk factors associated with the $i$th susceptible individual contracting the disease; $\Omega_T(j)$ is a transmissibility function representing potential risk factors associated with the $j$th infectious individual passing on the disease; $k(i, j)$ is an infection kernel that involves potential risk factors associated with both the infectious and susceptible individuals (e.g., a function of spatial distance); and $\varepsilon(i, t)$ describes random behavior due to some otherwise unexplained infection process.

The likelihood function for the model of (1) is given as

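$$L(D \mid \theta) = \prod_{t=1}^{t_{\max}-1} f_t(D \mid \theta),$$

where

$$f_t(D \mid \theta) = \left[\prod_{i \in I(t+1) \setminus I(t)} P_{it}\right]\left[\prod_{i \in S(t+1)} (1 - P_{it})\right],$$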

and where $\theta$ is the vector of unknown parameters, $D$ is the epidemic data set, $S(t + 1)$ is the set of individuals susceptible at time $(t + 1)$, $I(t + 1) \setminus I(t)$ is the set of individuals newly infected at time $t + 1$, and $t_{\max} \le t_{\mathrm{end}}$ is the last time point observed in the data. Infectious periods (removal) can be modeled in various ways. For simplicity, here, we assume that the infection times and infectious periods are known.

We will focus on a simple spatial ILM with no covariates aside from spatial distance with the form,

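$$P_{it} = 1 - \exp\left\{-\alpha \sum_{j \in I(t)} d_{ij}^{-\beta}\right\}, \qquad \alpha, \beta > 0, \quad t = 1, \ldots, t_{\max}, \tag{2}$$

where $\Omega_S(i) = \alpha$, $\Omega_T(j) = 1$, $k(i,j) = d_{ij}^{-\beta}$, and $\varepsilon(i, t) = 0$ in equation (1), and where $d_{ij}$ is the Euclidean distance between the $i$th susceptible and $j$th infectious individual, $\alpha$ is the baseline susceptibility, and $\beta$ is the spatial parameter.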
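To make equation (2) and the likelihood above concrete, the following is a minimal Python sketch, not the authors' implementation. The function names, the dictionary layout, and the choice to start the product at the first infectious time (so that the initial case is conditioned on) are assumptions made for illustration; the coordinates and infectious times are those of the toy example in Table 1.

```python
import math

def ilm_prob(i, infectious, coords, alpha, beta):
    """P_it = 1 - exp(-alpha * sum_{j in I(t)} d_ij^(-beta)) for susceptible i."""
    pressure = sum(math.dist(coords[i], coords[j]) ** (-beta) for j in infectious)
    return 1.0 - math.exp(-alpha * pressure)

def ilm_loglik(inf_time, coords, alpha, beta, t_max):
    """Log-likelihood of an SI epidemic.

    inf_time[i] is the first time at which i belongs to I(t), or None if i is
    never infected. The product starts at the first infectious time, i.e. the
    initial case is conditioned on (an assumption of this sketch).
    """
    t_start = min(t for t in inf_time.values() if t is not None)
    ll = 0.0
    for t in range(t_start, t_max):
        infectious = [j for j, tj in inf_time.items() if tj is not None and tj <= t]
        for i, ti in inf_time.items():
            if ti is not None and ti <= t:
                continue                      # already infectious at time t
            p = ilm_prob(i, infectious, coords, alpha, beta)
            if ti == t + 1:
                ll += math.log(p)             # i is in I(t+1) \ I(t)
            else:
                ll += math.log1p(-p)          # i is in S(t+1)
    return ll

# Toy data from Table 1: coordinates and "infectious" times.
coords = {1: (2.6, 1.5), 2: (3.7, 6.8), 3: (5.7, 6.5), 4: (5.9, 6.3)}
inf_time = {1: 5, 2: 4, 3: 2, 4: 3}
ll = ilm_loglik(inf_time, coords, alpha=0.7, beta=4.0, t_max=5)
```

An MCMC sampler has to re-evaluate a function like `ilm_loglik` at every iteration, and each evaluation loops over all susceptible-infectious pairs at every time point, which is the computational cost the text refers to.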

The ILM can be extended or adapted in various ways. In Sections 2.2–2.4 below, we discuss three such adaptations: the logistic ILM, the spatial logistic ILM, and the conditional logistic ILM. Table 13 in the Appendix also provides a brief explanation of the relationships between these models. These models define the probability of infection (exposure) and can be incorporated into various discrete-time compartmental frameworks, including the SI, SIR, SIRS (where reinfection is possible), and SEIR (where a latent period after exposure is included) frameworks, among others. The rates of infection from these models can also be embedded in various continuous-time models in which inter-event times are typically assumed to be exponential (Almutiry et al., 2021).

2.2. Logistic ILM

Here, we discuss the logistic ILM and its general form. The logistic ILM is the logistic version of the ILM that involves the relationship between log odds of infection and potential risk factors associated with susceptibility and transmissibility. The general form of the logistic ILM infection probability is defined as

$$\lambda_{it} = \Psi_S(i) \sum_{j \in I(t)} \Psi_T(j)\, K(i,j) + e_{it}, \tag{3}$$

where $\lambda_{it} = \log\left(\frac{P_{it}}{1 - P_{it}}\right)$, $\Psi_S(i)$ is the potential risk factors associated with the $i$th susceptible individual, $\Psi_T(j)$ is the potential risk factors associated with the $j$th infectious individual, $K(i, j)$ is the infection kernel, and $e_{it}$ is some infection from unexplained causes. The likelihood function for the model (3) can be written as,

$$L(D \mid \theta) = \prod_{i=1}^{n} \prod_{t=1}^{t_{\max}-1} P_{it}^{\,y_{it}} (1 - P_{it})^{1 - y_{it}},$$

where $y_{it}$ is the infection status of the $i$th individual at time $t$, with $y_{it} = 1$ if the individual is infected and 0 otherwise. We can fit this model to data by maximizing the likelihood and then take a frequentist approach to inference, or fit the model in a Bayesian framework using an MCMC algorithm, incorporating prior information on parameters.

2.3. Spatial logistic ILM

Here, we present a logistic version of the simple spatial ILM of equation (2). It can be considered as an alternative model in its own right, or as an approximation to the spatial ILM. It is given by,

$$\log\left(\frac{P_{it}}{1 - P_{it}}\right) = X\alpha = \alpha_0 + \alpha_1 X_{it}, \tag{4}$$

where $\Psi_S(i) = \alpha$, $\Psi_T(j) = 1$, $K(i,j) = d_{ij}^{-\beta_0}$, and $e_{it} = 0$ in equation (3), and where $X = (1, X_{it})$, $\alpha^T = (\alpha_0, \alpha_1)$ and $X_{it} = \sum_{j \in I(t)} d_{ij}^{-\beta_0}$. That is, we relate the force of infection ($\alpha_1 \sum_{j \in I(t)} d_{ij}^{-\beta_0}$) to the log odds of infection rather than the probability of infection. We can write the probability of being infected as

$$P_{it} = \frac{\exp(X\alpha)}{1 + \exp(X\alpha)} = \frac{\exp(\alpha_0 + \alpha_1 X_{it})}{1 + \exp(\alpha_0 + \alpha_1 X_{it})} = \frac{1}{1 + \exp\{-(\alpha_0 + \alpha_1 X_{it})\}}.$$

However, standard statistical software for fitting logistic models (e.g., the glm command in R) cannot cope with this model directly, because $X_{it}$ is non-linear in $\beta_0$. If we fix, or condition on, $\beta_0$, we can calculate $X_{it}$ for each susceptible $i$ at each time $t$ and then use standard software to fit the model.

2.4. Conditional logistic ILM

The conditional logistic ILM involves conditioning on the spatial parameter β0. In such cases, the probability of being infected can be written as

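$$P(Y_{it} = 1 \mid \beta_0 = \tilde{\beta}_0) = \frac{\exp\left(\alpha_0 + \alpha_1 \sum_{j \in I(t)} d_{ij}^{-\tilde{\beta}_0}\right)}{1 + \exp\left(\alpha_0 + \alpha_1 \sum_{j \in I(t)} d_{ij}^{-\tilde{\beta}_0}\right)} = \frac{1}{1 + \exp\left\{-\left(\alpha_0 + \alpha_1 \sum_{j \in I(t)} d_{ij}^{-\tilde{\beta}_0}\right)\right\}},$$

where $\tilde{\beta}_0$ is our fixed value of $\beta_0$. The conditional likelihood function can be written as

$$L(\alpha \mid \beta_0) = \prod_{i=1}^{n} \prod_{t=1}^{t_{\max}-1} P(Y_{it} \mid \tilde{\beta}_0)^{y_{it}} \left\{1 - P(Y_{it} \mid \tilde{\beta}_0)\right\}^{1 - y_{it}}. \tag{5}$$

One simple way to choose $\tilde{\beta}_0$ is to fit the model for each of a finite set of possibilities and choose the $\tilde{\beta}_0$ which maximizes the likelihood.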
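The grid search just described can be sketched as follows. This is an illustrative stand-in, not the authors' code: it uses a hand-written Newton-Raphson fit in place of a standard routine such as R's glm, and the `records` layout (a list of distances to the currently infectious individuals plus the binary outcome) and the synthetic data are hypothetical.

```python
import math
import random

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson fit of logit P(y = 1) = a0 + a1 * x; returns (a0, a1, loglik)."""
    a0 = a1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p                      # score for a0
            g1 += (yi - p) * xi               # score for a1
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break                             # information matrix near-singular
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        a0, a1 = a0 + d0, a1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    loglik = sum(yi * (a0 + a1 * xi) - math.log1p(math.exp(a0 + a1 * xi))
                 for xi, yi in zip(x, y))
    return a0, a1, loglik

def choose_beta0(records, grid):
    """Fit the conditional model for each beta0 in grid; keep the likelihood maximiser."""
    y = [yi for _, yi in records]
    loglik = {}
    for b in grid:
        x = [sum(d ** (-b) for d in dists) for dists, _ in records]  # X_it given beta0 = b
        loglik[b] = fit_logistic(x, y)[2]
    return max(loglik, key=loglik.get), loglik

# Hypothetical synthetic records, generated with beta0 = 2 purely for illustration.
rng = random.Random(1)
records = []
for _ in range(300):
    dists = [rng.uniform(1.0, 3.0) for _ in range(rng.randint(1, 3))]
    eta = -2.0 + 1.5 * sum(d ** -2.0 for d in dists)
    records.append((dists, int(rng.random() < 1.0 / (1.0 + math.exp(-eta)))))

grid = [0.5 * k for k in range(1, 9)]         # 0.5, 1.0, ..., 4.0
best, loglik = choose_beta0(records, grid)
```

Because the covariate is recomputed once per grid value and each fit has only two parameters, the whole search costs a handful of small logistic regressions.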

2.5. Converting epidemic data to binary data

The epidemic data for a spatial SI ILM without covariates will contain information on infection times (and removal times if an SIR model is being fitted) of individuals with (X, Y) coordinates. Here, we explain how to convert such epidemic data to binary data to which we can fit the spatial logistic ILM. The infection pattern over time is shown in Table 1 for a hypothetical ‘toy example’ consisting of four individuals. Note that the column ‘Individual ID’ is not strictly needed but is included here to aid illustration. In this data, individual 1 is infected at t = 4, and so becomes infectious at t = 5; similarly, individual 2 is infected at t = 3, and so becomes infectious at t = 4, and so on. Here, the epidemic starts when individual 3 becomes infectious at time t = 2, and we condition on that infection.

Table 1.

Infection pattern over time.

| Individual ID | (X, Y) coordinate | “Infectious” time |
|---|---|---|
| 1 | (2.6, 1.5) | 5 |
| 2 | (3.7, 6.8) | 4 |
| 3 | (5.7, 6.5) | 2 |
| 4 | (5.9, 6.3) | 3 |

To convert the epidemic data to a data set suitable for fitting the CL-ILM using standard software, we create three columns. The first column contains the time points for each individual, running up to the time point at which they are infected. The second column contains the infection event status of the individual at each time point. We start to observe the binary data from t = 2 because the epidemic starts from one individual who was infected at time 1 (and so is infectious from t = 2). The third column is derived from the set of infectious individuals, $I_t$: it contains $X_{it}$ calculated for the fixed $\tilde{\beta}_0$. Note that $X_{it}$ will typically change over time for each individual. The binary data set for Table 1 is shown in Table 2. Here, time (t) and $I_t$ are supporting information that is not directly used in the fitting of our CL-ILM.

Table 2.

The binary epidemic data.

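| Individual ID | (X, Y) coordinate | Time (t) | Infection | $I_t$ | $X_{it}$ |
|---|---|---|---|---|---|
| 1 | (2.6, 1.5) | 2 | 0 | {3} | $\sum_{j \in I(2)} d_{1j}^{-\tilde{\beta}_0}$ |
| 1 | (2.6, 1.5) | 3 | 0 | {3, 4} | $\sum_{j \in I(3)} d_{1j}^{-\tilde{\beta}_0}$ |
| 1 | (2.6, 1.5) | 4 | 1 | {2, 3, 4} | $\sum_{j \in I(4)} d_{1j}^{-\tilde{\beta}_0}$ |
| 2 | (3.7, 6.8) | 2 | 0 | {3} | $\sum_{j \in I(2)} d_{2j}^{-\tilde{\beta}_0}$ |
| 2 | (3.7, 6.8) | 3 | 1 | {3, 4} | $\sum_{j \in I(3)} d_{2j}^{-\tilde{\beta}_0}$ |
| 4 | (5.9, 6.3) | 2 | 1 | {3} | $\sum_{j \in I(2)} d_{4j}^{-\tilde{\beta}_0}$ |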

Similarly, we can convert the epidemic data to binary data in the context of the SIR framework. In this case, the data contains the information on the time of infection and time of removal with (X, Y) coordinates for each individual. An infectious individual would move to the removal state after their infectious period. At that time, the individual would not be in the set of infectious individuals anymore, and this would feature in the calculation of Xit.
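The conversion for the SI case can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the tuple layout of the output rows, and the choice of $\tilde{\beta}_0 = 4$ are assumptions, while the input values are those of Table 1. For the SIR case, the set `I_t` would additionally need to drop individuals once their removal time has passed.

```python
import math

def to_binary_rows(coords, infectious_time, beta0, t_max=None):
    """Expand SI epidemic data into rows (id, t, y, I_t, X_it).

    infectious_time[i] is the time at which i becomes infectious; individuals
    that are never infected are simply absent from the dictionary.
    """
    t_start = min(infectious_time.values())   # the first infection is conditioned on
    if t_max is None:
        t_max = max(infectious_time.values())
    rows = []
    for i in sorted(coords):
        t_i = infectious_time.get(i)          # None if never infected
        if t_i == t_start:
            continue                          # initial case contributes no rows
        last = t_i - 1 if t_i is not None else t_max - 1
        for t in range(t_start, min(last, t_max - 1) + 1):
            I_t = sorted(j for j, tj in infectious_time.items() if tj <= t)
            x_it = sum(math.dist(coords[i], coords[j]) ** (-beta0) for j in I_t)
            y = 1 if (t_i is not None and t == t_i - 1) else 0
            rows.append((i, t, y, I_t, x_it))
    return rows

coords = {1: (2.6, 1.5), 2: (3.7, 6.8), 3: (5.7, 6.5), 4: (5.9, 6.3)}
infectious_time = {1: 5, 2: 4, 3: 2, 4: 3}
rows = to_binary_rows(coords, infectious_time, beta0=4.0)
for row in rows:
    print(row)
```

The first four fields of each row reproduce the ID, time, infection, and $I_t$ columns of Table 2; only the infection indicator and `x_it` are passed to the logistic fit.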

2.6. Posterior predictive distribution

To investigate the model accuracy or goodness of fit under the Bayesian framework, we can use a posterior predictive approach as introduced by Guttman (1967). We can generate realizations from the posterior predictive distribution (PPD) of various epidemiological statistics such as some form of epidemic curve or the final size of the epidemic, and then compare that with the equivalent statistic calculated from the observed data, to assess the model fit. Here, we consider the number of newly infectious individuals (incidence) over time, which we refer to as the epidemic curve. The algorithm for producing posterior predictive realizations in the case of an ILM or CL-ILM consists of the following steps:

  • (i) Sample a set of parameters from the MCMC-estimated posterior distribution.
  • (ii) Simulate an epidemic from the model using the parameters sampled in Step (i).
  • (iii) Summarize the simulated epidemic from Step (ii) via the epidemic curve (or some other statistic of interest).
  • (iv) Repeat Steps (i) to (iii) a large number of times. For this study, we repeated 500 times.

Then, we examine and compare the PPD of the epidemic curve to the original observed epidemic curve to check for accuracy and precision. We consider a model to be a good fit for the data if the observed data lies in the areas of high mass of the PPDs, and the PPD has low variance.
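Steps (i) to (iv) can be sketched for an SI epidemic under the CL-ILM as follows. This is a minimal illustration rather than the code used in the study: the function names, the population, and the three "posterior draws" are hypothetical, and the linear predictor is clipped only to avoid floating-point overflow.

```python
import math
import random

def simulate_cl_ilm(coords, initial, a0, a1, beta0, t_end, rng):
    """Simulate one discrete-time SI epidemic; return new infections per time step."""
    infectious = set(initial)
    curve = []
    for _t in range(1, t_end):
        new = set()
        for i in coords:
            if i in infectious:
                continue
            x = sum(math.dist(coords[i], coords[j]) ** (-beta0) for j in infectious)
            eta = max(-700.0, min(700.0, a0 + a1 * x))   # guard against overflow
            if rng.random() < 1.0 / (1.0 + math.exp(-eta)):
                new.add(i)
        infectious |= new                     # newly infected become infectious next step
        curve.append(len(new))
    return curve

def posterior_predictive_curves(draws, n_rep, rng, **model_args):
    curves = []
    for _ in range(n_rep):                    # step (iv): repeat
        a0, a1 = rng.choice(draws)            # step (i): sample parameters
        curves.append(simulate_cl_ilm(a0=a0, a1=a1, rng=rng, **model_args))  # (ii), (iii)
    return curves

rng = random.Random(0)
coords = {i: (rng.uniform(0, 10), rng.uniform(0, 10)) for i in range(30)}
draws = [(-4.0, 0.5), (-3.5, 0.4), (-4.2, 0.6)]   # hypothetical posterior draws
curves = posterior_predictive_curves(draws, n_rep=20, rng=rng, coords=coords,
                                     initial=[0], beta0=3.0, t_end=10)
```

Each element of `curves` is one posterior predictive epidemic curve; plotting them together with the observed curve gives the kind of comparison described above.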

To quantify the posterior predictive model fit, we can also use metrics such as the mean square error (MSE). Here, the MSE is calculated by taking the average of the squared differences between predicted and actual values of new infections over time, which is then averaged over the total number of epidemic simulations. The MSE is given as,

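$$\mathrm{MSE} = \frac{1}{500\, t_{\max}} \sum_{s=1}^{500} \sum_{t=1}^{t_{\max}} \left(Y_{st} - \hat{Y}_{st}\right)^2,$$

where $Y_{st}$ is the number of new cases of the $s$th sample at time $t$, and $\hat{Y}_{st}$ is the predicted number of new cases of the $s$th sample at time $t$.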
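As a small worked sketch of this formula, the function below averages squared differences over simulations and time points. It assumes that the observed epidemic curve is the same for every posterior predictive sample $s$ (an interpretation made for this illustration), and the numbers are made up.

```python
def posterior_predictive_mse(observed, simulated):
    """Mean squared difference between observed and simulated incidence,
    averaged over time points and over simulated epidemics."""
    t_max = len(observed)
    total = sum((y_hat - y) ** 2
                for curve in simulated
                for y, y_hat in zip(observed, curve))
    return total / (len(simulated) * t_max)

observed = [1, 2, 3]                     # hypothetical observed new cases per time point
simulated = [[1, 2, 3], [2, 2, 5]]       # two hypothetical posterior predictive curves
mse = posterior_predictive_mse(observed, simulated)
```

Here the squared differences sum to 0 + (1 + 0 + 4) = 5 over 2 × 3 terms, so `mse` is 5/6.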

3. Simulation study

A simulation study is carried out to assess the performance of our CL-ILMs when the underlying data is generated by the spatial ILM (equation (2)). That is, we examine how well a spatial CL-ILM can approximate the basic spatial ILM. Here, we consider a log transformation of the Xit covariate to enhance stability in the data. Each data analysis is carried out in two stages. In the first stage, we use maximum likelihood over a finite set of β0 values to tune β0. In the second stage, we fit the CL-ILM under the Bayesian framework. Then, we examine the model accuracy via the posterior predictive approach described above. In addition, we compare the prediction error between the basic spatial ILM and CL-ILM under the SI and SIR frameworks. Furthermore, we compute the computational time required to fit both models under these frameworks.

In this study, we simulate epidemic data under four scenarios with different spatial ILM parameter values. The true parameter values of (α, β) are (0.7, 4), (0.5, 3), (0.2, 4), and (0.9, 5) for the four scenarios, respectively. For each set of parameters, we produce 30 epidemics. These are used to fix $\beta_0 = \tilde{\beta}_0$. We then take an arbitrarily chosen subset of 20 epidemics and fit the CL-ILM to these using a Bayesian MCMC framework. Note that the subset of only 20 is taken at the second stage to reduce the computational burden associated with carrying out multiple MCMC analyses. For each simulated epidemic, we randomly generate the spatial locations of 500 individuals uniformly within a 10 × 10 unit square. To generate epidemic data from the ILM, we use the epidata function from the ‘EpiILM’ R package. Then, we convert the epidemic data to binary data suitable for analysing with the glm command in R.
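For readers who want to see the data-generating mechanism spelled out, the following Python sketch simulates one SI epidemic from the spatial ILM of equation (2) on uniformly scattered individuals. It is a stand-in written for illustration and does not reproduce the internals of the epidata function; the function name, the single randomly chosen initial case, and the small population in the usage line are assumptions.

```python
import math
import random

def simulate_spatial_ilm(n, alpha, beta, t_end, rng, side=10.0):
    """Simulate an SI epidemic from P_it = 1 - exp(-alpha * sum_j d_ij^(-beta))."""
    coords = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    inf_time = [None] * n
    inf_time[rng.randrange(n)] = 1            # one initial infectious individual at t = 1
    for t in range(1, t_end):
        infectious = [j for j in range(n) if inf_time[j] is not None and inf_time[j] <= t]
        for i in range(n):
            if inf_time[i] is not None:
                continue                      # already infected
            pressure = sum(math.dist(coords[i], coords[j]) ** (-beta) for j in infectious)
            if rng.random() < 1.0 - math.exp(-alpha * pressure):
                inf_time[i] = t + 1           # infected in (t, t+1], infectious from t + 1
    return coords, inf_time

# Scenario (alpha, beta) = (0.7, 4), shown here with a small population for speed.
coords, inf_time = simulate_spatial_ilm(n=50, alpha=0.7, beta=4.0, t_end=10,
                                        rng=random.Random(1))
```

Setting `n=500` matches the population size used in the study, at the cost of a longer run time for this unoptimised loop.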

3.1. Fixing β0

We compare a number of spatial logistic ILMs to find the optimal tuning parameter (β0). For comparing these models, the maximum likelihood approach is used here. We compare the logistic models with spatial parameter β0 ∈ {−1, 0.5, 1, …, 9.5, 10}. By fixing β0, we construct the conditional logistic model.

We fit models and calculate the likelihood values using the glm function in R with a logit link. We also calculate the proportion of time each possible value of β0 is selected.

3.2. Model fit

Here, we fit the basic spatial ILM using the mcmc function of the package ‘adaptMCMC’ in R. For the basic spatial ILM, the marginal prior distributions of the parameters, α and β, are U(0, 5) and U(0, 10), respectively. Posterior predictive simulations are produced using the epidata command in the ‘EpiILM’ R package.

In the case of the CL-ILM, we fit the model using the MCMClogit function from the ‘MCMCpack’ package in R. Here, the marginal prior distributions of the parameters, $\alpha_0$ and $\alpha_1$, are independent Cauchy distributions with location and scale parameters 0 and 1, respectively. We use our own R code to produce the epidemic curves under the posterior. Then we compare the fit of the spatial ILM and the CL-ILMs. We use the average MSE to measure the prediction error and the average standard deviation (SD) to capture the variation in the posterior realizations. Moreover, we report the average proportion of time points at which posterior predictive 95% credible intervals capture the true numbers of new infections.
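The capture proportion described above can be computed along the following lines. This is a hedged sketch: the equal-tailed interval, the linear-interpolation quantile rule, and the toy numbers are assumptions, since the text does not specify how the 95% credible intervals were constructed.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def coverage_proportion(observed, simulated, level=0.95):
    """Proportion of time points whose observed incidence lies inside the
    equal-tailed posterior predictive interval."""
    tail = (1.0 - level) / 2.0
    hits = 0
    for t, y in enumerate(observed):
        vals = sorted(curve[t] for curve in simulated)
        if quantile(vals, tail) <= y <= quantile(vals, 1.0 - tail):
            hits += 1
    return hits / len(observed)

# Toy check: simulated incidence is 0..100 at the first time point and always 10 at the second.
simulated = [[s, 10] for s in range(101)]
coverage = coverage_proportion([50, 11], simulated)
```

In the toy check the first observation (50) falls inside the interval [2.5, 97.5] and the second (11) falls outside the degenerate interval [10, 10], so `coverage` is 0.5.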

4. Results

4.1. SI framework

Under the SI compartmental framework, we assess the performance of CL-ILMs when data is generated from a basic spatial ILM.

4.1.1. Choosing β0 via maximizing the likelihood

For the true parameter (α, β) = (0.7, 4), the spatial parameter $\tilde{\beta}_0$ was found to be either 3.5, 4.0, or 4.5 by maximizing the likelihood (Table 3). Further, $\tilde{\beta}_0 = 4.0$ (the true value) was chosen under a majority of epidemics (0.60). Similarly, when the true parameter value was (0.5, 3), the highest proportion was 0.733 for the true value of $\tilde{\beta}_0 = 3.0$. When the true values were (0.2, 4) and (0.9, 5), the highest proportion was 0.533 and 0.600 for $\tilde{\beta}_0 = 4.0$ and $\tilde{\beta}_0 = 5.0$, respectively.

Table 3.

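Proportion of times each $\tilde{\beta}_0$ was selected for the four scenarios under the SI framework.

| True parameter values (α, β) | Selected $\tilde{\beta}_0$ | Proportion |
|---|---|---|
| (0.7, 4) | 3.5 | 0.133 |
| | 4.0 | 0.600 |
| | 4.5 | 0.267 |
| (0.5, 3) | 3.0 | 0.733 |
| | 3.5 | 0.267 |
| (0.2, 4) | 4.0 | 0.533 |
| | 4.5 | 0.467 |
| (0.9, 5) | 4.5 | 0.167 |
| | 5.0 | 0.600 |
| | 5.5 | 0.167 |
| | 6.0 | 0.067 |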

This implies that the likelihood-maximization approach can be successfully used to fix $\tilde{\beta}_0$ for the CL-ILM, picking either the true value or one close to it under all epidemic scenarios tested.

4.1.2. Model fit

Table 4 shows the average MSE and SD under posterior prediction of the incidence-based epidemic curves under the SI framework. For each scenario, the average MSE and SD were higher for the CL-ILM compared to the spatial ILM. This would be expected, of course, since the actual data observed was simulated from the ILM.

Table 4.

Comparing average MSE and average SD between the ILM and CL-ILM under the SI framework.

| True parameter values (α, β) | ILM Avg(MSE) | ILM Avg(SD) | CL-ILM Avg(MSE) | CL-ILM Avg(SD) |
|---|---|---|---|---|
| (0.7, 4) | 596.626 | 601.064 | 840.756 | 629.226 |
| (0.5, 3) | 625.753 | 668.041 | 836.540 | 792.532 |
| (0.2, 4) | 514.355 | 426.319 | 694.453 | 455.477 |
| (0.9, 5) | 491.494 | 415.214 | 729.006 | 436.251 |

In Table 5, we summarize the mean proportion of time points at which the credible intervals capture the true number of infections, together with its standard deviation across epidemic datasets. The mean proportion was slightly lower for the CL-ILM than for the ILM: it varied between 0.968 and 0.992 for the spatial ILM and between 0.881 and 0.950 for the CL-ILM. As expected, the values of SD were higher for the CL-ILM than for the spatial ILM. However, we note that under the CL-ILM the lowest capture proportion was 0.881, and in the other scenarios it was larger than 0.900.

Table 5.

Mean proportion of time credible interval captures the true distribution of infection with its SD under the SI framework.

| True parameter values (α, β) | ILM Mean | ILM SD | CL-ILM Mean | CL-ILM SD |
|---|---|---|---|---|
| (0.7, 4) | 0.992 | 0.037 | 0.912 | 0.148 |
| (0.5, 3) | 0.990 | 0.045 | 0.950 | 0.110 |
| (0.2, 4) | 0.968 | 0.082 | 0.881 | 0.178 |
| (0.9, 5) | 0.973 | 0.072 | 0.911 | 0.173 |

Figs. 1–4 (see Appendix) compare the posterior predictive epidemic curves (number of newly infectious individuals over time) under the ILM and CL-ILM for each scenario under the SI framework. We observed that the width of the posterior predictive intervals was a little larger under the ILM than the CL-ILM for each scenario, suggesting less uncertainty in epidemic prediction under the CL-ILM. This is presumably due to the fixing of the spatial parameter. As we have already observed, there is also a higher chance of failing to capture the true incidence under the CL-ILM. However, the patterns of the posterior predictive distributions were fairly similar under the ILM and CL-ILM. Overall, this suggests that the CL-ILM provides a reasonable approximation to the basic spatial ILM.

Fig. 1.

The posterior predictive distribution and 95% credible intervals (black lines) for ILM and CL-ILM under the SI framework across various scenarios. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample.

Fig. 2.

The posterior predictive distribution and 95% credible intervals (black lines) for ILM and CL-ILM under the SI framework across various scenarios. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample. (continue).

Fig. 3.

The posterior predictive distribution and 95% credible intervals (black lines) for ILM and CL-ILM under the SI framework across various scenarios. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample. (continue).

Fig. 4.

The posterior predictive distribution and 95% credible intervals (black lines) for ILM and CL-ILM under the SI framework across various scenarios. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample. (continue).

4.2. SIR framework

Under the SIR compartmental framework, we evaluate the performance of CL-ILMs when data is generated from a basic spatial ILM. Here, we assume that the infectious period follows a Poisson distribution with a mean of 4.

4.2.1. Choosing β0 via maximizing the likelihood

For the true parameter value of (α, β) = (0.7, 4), the spatial parameter $\tilde{\beta}_0$ was found to be either 3.5, 4.0, 4.5, or 5.5 by maximizing the likelihood (Table 6). The highest proportion was 0.500 for the true value of $\tilde{\beta}_0 = 4.0$. Similarly, when the true parameter value was (0.5, 3), $\tilde{\beta}_0 = 3.0$ was chosen for the majority of epidemics (0.667). When the true values were (0.2, 4) and (0.9, 5), the highest proportion was 0.633 and 0.400 for $\tilde{\beta}_0 = 4.0$ and $\tilde{\beta}_0 = 5.0$, respectively.

Table 6.

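Proportion of times each $\tilde{\beta}_0$ was selected for the four scenarios under the SIR framework.

| True parameter values (α, β) | Selected $\tilde{\beta}_0$ | Proportion |
|---|---|---|
| (0.7, 4) | 3.5 | 0.133 |
| | 4.0 | 0.500 |
| | 4.5 | 0.333 |
| | 5.5 | 0.033 |
| (0.5, 3) | 2.5 | 0.033 |
| | 3.0 | 0.667 |
| | 3.5 | 0.233 |
| | 4.0 | 0.067 |
| (0.2, 4) | 3.5 | 0.033 |
| | 4.0 | 0.633 |
| | 4.5 | 0.300 |
| | 5.0 | 0.033 |
| (0.9, 5) | 4.5 | 0.200 |
| | 5.0 | 0.400 |
| | 5.5 | 0.200 |
| | 6.0 | 0.167 |
| | 6.5 | 0.033 |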

Once again, the findings imply that the likelihood-maximization approach can be effectively used to fix $\tilde{\beta}_0$ for the CL-ILM, picking either the true value or one close to it under all epidemic scenarios tested.

4.2.2. Model fit

Table 7 shows the average MSE and SD under posterior prediction of the incidence-based epidemic curves under the SIR framework. Once again, the average MSE and SD were higher for the CL-ILM compared to the spatial ILM for all scenarios. This would be anticipated, of course, since the real data observed was simulated from the ILM.

Table 7.

Comparing average MSE and average SD between the ILM and CL-ILM under the SIR framework.

| True parameter values (α, β) | ILM Avg(MSE) | ILM Avg(SD) | CL-ILM Avg(MSE) | CL-ILM Avg(SD) |
|---|---|---|---|---|
| (0.7, 4) | 716.932 | 644.098 | 1315.145 | 1105.329 |
| (0.5, 3) | 609.157 | 676.928 | 889.988 | 897.006 |
| (0.2, 4) | 600.913 | 507.291 | 2811.673 | 1304.466 |
| (0.9, 5) | 612.103 | 494.885 | 2472.659 | 1544.632 |

In Table 8, we summarize the mean proportion of time points at which credible intervals capture the true number of infections with its standard deviation under the SIR framework. For each scenario, the mean proportion value was slightly lower for the CL-ILM compared to the spatial ILM. Under the CL-ILM, the lowest capture proportion was 0.843, and the highest capture proportion was 0.950. Under the ILM, the mean proportion varied between 0.959 and 0.990. Moreover, the standard deviations were higher for the CL-ILM compared to the ILM for all scenarios except (0.9, 5).

Table 8.

Mean proportion of time credible interval captures the true distribution of infection with its SD under the SIR framework.

| True parameter values (α, β) | ILM Mean | ILM SD | CL-ILM Mean | CL-ILM SD |
|---|---|---|---|---|
| (0.7, 4) | 0.959 | 0.073 | 0.892 | 0.109 |
| (0.5, 3) | 0.990 | 0.045 | 0.950 | 0.078 |
| (0.2, 4) | 0.987 | 0.041 | 0.846 | 0.070 |
| (0.9, 5) | 0.966 | 0.060 | 0.843 | 0.042 |

Fig. 5, Fig. 6, Fig. 7, Fig. 8 (see Appendix) show the posterior predictive distribution of the epidemic curves under the ILM and CL-ILM for each scenario under the SIR framework. Here, we notice that the width of the posterior predictive intervals is larger for the CL-ILM compared to the ILM for almost all scenarios, suggesting more uncertainty in epidemic prediction under the CL-ILM. Moreover, the patterns of the posterior predictive distributions are slightly different for the CL-ILM compared to ILM. Note that this differs from performance under the SI model. However, the credible interval mostly captures the true number of infections, suggesting that the CL-ILM provides a reasonable approximation to the basic spatial ILM.

Fig. 5.

The posterior predictive distribution and 95% credible intervals (black lines) for ILM and CL-ILM under the SIR framework across various scenarios. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample.

Fig. 6.

The posterior predictive distribution and 95% credible intervals (black lines) for ILM and CL-ILM under the SIR framework across various scenarios. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample. (continue).

Fig. 7.

The posterior predictive distribution and 95% credible intervals (black lines) for ILM and CL-ILM under the SIR framework across various scenarios. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample. (continue).

Fig. 8.

The posterior predictive distribution and 95% credible intervals (black lines) for ILM and CL-ILM under the SIR framework across various scenarios. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample. (continue).

4.3. Computational time

We measured the computational time required to fit both the spatial ILM and the CL-ILM under the SI and SIR frameworks. The ILM was fitted using MCMC with 100,000 iterations, while the CL-ILM was fitted using the glm function. The computational time was influenced by the sample size, the number of infected individuals, the duration of the epidemic, and, in the case of the ILM, the number of MCMC iterations. Tables 11 and 12 (see Appendix) display the average computational times for epidemics in populations of 500 and 1000 individuals, respectively. These tables present the average computational time across 10 epidemic simulations, with the results expressed in minutes (min).

In all scenarios under both the SI and SIR frameworks, the ILM consistently took longer to fit than the CL-ILM when the sample size was 500. The CL-ILM required less than a minute to fit, while the spatial ILM needed between roughly 5 and 10 min, depending on the scenario (Table 11).

When the sample size was increased to 1000, under the SI framework, the CL-ILM took between 2 and 6 min to fit, whereas the ILM required between 30 and 77 min. Similarly, under the SIR framework, the CL-ILM took 2–4 min, while the ILM fitting time ranged from 27 to 55 min. As expected, the time savings differential becomes noticeably larger as the sample size increases.
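The comparison above can be checked directly from the reported averages. The short sketch below copies the timings from Tables 11 and 12 and computes, for each population size and framework, the mean absolute saving and the mean ILM to CL-ILM time ratio. The names `TIMES` and `summarize` are illustrative only and are not part of the original analysis.

```python
# Average fitting times in minutes as (ILM, CL-ILM), copied from Tables 11 and 12.
TIMES = {
    (500, "SI"): [(6.99, 0.48), (5.35, 0.40), (9.94, 0.70), (8.51, 0.60)],
    (500, "SIR"): [(6.59, 0.50), (5.26, 0.43), (8.85, 0.58), (7.89, 0.55)],
    (1000, "SI"): [(44.91, 4.19), (30.11, 2.80), (77.18, 6.27), (73.18, 5.64)],
    (1000, "SIR"): [(39.77, 3.22), (27.85, 2.88), (55.65, 4.02), (55.43, 3.86)],
}


def summarize(pairs):
    """Return (mean absolute saving in minutes, mean ILM / CL-ILM ratio)."""
    savings = [ilm - cl for ilm, cl in pairs]
    ratios = [ilm / cl for ilm, cl in pairs]
    return sum(savings) / len(pairs), sum(ratios) / len(pairs)


for (n, framework), pairs in TIMES.items():
    saving, ratio = summarize(pairs)
    print(f"n={n:4d} {framework:3s}: saves {saving:5.2f} min on average, "
          f"about {ratio:4.1f}x faster")
```

On these figures the individual ratios stay in the range of roughly 10 to 15 for both population sizes, so it is the absolute saving, rather than the ratio, that grows as the population increases.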

5. Application to semi-real data

Here, we fit the CL-ILM to a simulated epidemic based on data from the UK foot-and-mouth disease (FMD) epidemic of 2001. The reason for using this ‘semi-real’ data rather than the actual data set is that the culling strategy imposed by the UK government in 2001 is very hard to mimic, and so the posterior predictive performance of the epidemic model tends to be poor. The culling strategy varied over time and space but essentially aimed at pre-emptively culling animals on farms thought to be at high risk. Thus, we simulate a new ‘true’ epidemic under our ILM, utilizing only the spatial location coordinates (X, Y) from the 2001 UK FMD dataset, without incorporating a culling strategy. Then we compare the performance of the ILM and CL-ILM based on the ‘semi-real’ data. We consider a subset of 1101 farms from the Cumbria region, with infection times ranging from t = 30 to t = 71 days (t = 1 being the day of the first infection).

In this study, we consider a conditional logistic ILM of the following form,

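$$\log\left(\frac{P(Y_{it}\mid\tilde{\beta}_0)}{1-P(Y_{it}\mid\tilde{\beta}_0)}\right)=\alpha_0+\alpha_1\sum_{j\in I(t)}d_{ij}^{-\tilde{\beta}_0}, \tag{6}$$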

where α0 is the intercept and α1 is the slope of the model. We then compare this model with the spatial ILM, given by

$$P_{it}=1-\exp\left(-\alpha\sum_{j\in I(t)}d_{ij}^{-\beta}\right). \tag{7}$$
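To make the spatial ILM of equation (7) concrete, the following is a minimal sketch of a discrete-time SI simulator in which each susceptible individual i is infected at time t with probability 1 - exp(-α Σ d_ij^(-β)), the sum running over currently infectious individuals j. It is an illustration in Python, not the authors' code (the paper relies on the EpiILM package in R). The function names, the default seed, and the choice of a single initial case are assumptions.

```python
import math
import random


def infection_prob(i, infectious, coords, alpha, beta):
    """Spatial ILM: P_it = 1 - exp(-alpha * sum_j d_ij^(-beta)) over infectious j."""
    pressure = sum(math.dist(coords[i], coords[j]) ** (-beta) for j in infectious)
    return 1.0 - math.exp(-alpha * pressure)


def simulate_si(coords, alpha, beta, t_max, first_case=0, seed=1):
    """Simulate an SI epidemic; return infection times (None if never infected).

    Assumes all coordinates are distinct, so every pairwise distance is positive.
    """
    rng = random.Random(seed)
    inf_time = [None] * len(coords)
    inf_time[first_case] = 1  # hypothetical choice: one index case at t = 1
    for t in range(1, t_max):
        # Individuals infected at or before t exert infectious pressure at time t.
        infectious = [j for j, tj in enumerate(inf_time) if tj is not None and tj <= t]
        for i, ti in enumerate(inf_time):
            if ti is None and rng.random() < infection_prob(i, infectious, coords, alpha, beta):
                inf_time[i] = t + 1  # becomes infectious from the next time point
    return inf_time


# Example: a 5 x 5 unit grid with illustrative parameter values.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
times = simulate_si(grid, alpha=0.9, beta=2.0, t_max=10)
print(sum(t is not None for t in times), "of", len(grid), "individuals infected")
```

An SIR version would additionally draw an infectious period for each new case and drop the individual from the infectious set once that period has elapsed.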

To simulate the epidemics, we used the true parameter values (α, β) = (0.9, 9) for both the SI and SIR frameworks. In the SIR case, the infectious period was modeled using a Poisson distribution with a mean of 8. To fix β0, we fitted spatial logistic ILMs over a grid of candidate values and selected the value that maximized the likelihood. The values of β0 considered were 1, 0.5, 1, …, 14.5, 15.0. The spatial parameter β̃0 was found to be 9.5 and 8.5 under the SI and SIR frameworks, respectively. Both values are close to the true β = 9, which suggests that the maximum likelihood approach can be used successfully to fix β̃0 for the CL-ILM.
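The procedure for fixing the spatial parameter can be sketched as a grid search: for each candidate β0, compute the covariate Σ d_ij^(-β0) for every susceptible individual and time point, fit a logistic regression, and keep the candidate with the largest log-likelihood. The paper fits these models with the glm function in R; the Python version below swaps in a small hand-written Newton-Raphson fit so that it has no dependencies. All names (`fit_logistic`, `choose_beta0`, the layout of `rows`) are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(x, y, max_iter=100, tol=1e-10):
    """Fit logit(p) = a0 + a1 * x by Newton-Raphson; return (a0, a1, loglik)."""
    a0 = a1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(a0 + a1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:  # information matrix is (near) singular: stop updating
            break
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        a0 += s0
        a1 += s1
        if abs(s0) + abs(s1) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(x, y):
        p = min(max(sigmoid(a0 + a1 * xi), 1e-12), 1.0 - 1e-12)
        loglik += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return a0, a1, loglik


def kernel_sum(i, infectious, coords, beta0):
    """Covariate sum_j d_ij^(-beta0) over the infectious set."""
    return sum(math.dist(coords[i], coords[j]) ** (-beta0) for j in infectious)


def choose_beta0(rows, coords, grid):
    """rows: (susceptible id, infectious ids, infected-at-next-step 0/1) per person-time."""
    y = [r[2] for r in rows]
    loglik = {}
    for b in grid:
        x = [kernel_sum(i, inf, coords, b) for i, inf, _ in rows]
        loglik[b] = fit_logistic(x, y)[2]
    best = max(loglik, key=loglik.get)
    return best, loglik
```

Once β̃0 is chosen, the covariate is computed once and the CL-ILM reduces to an ordinary two-parameter logistic regression, which is what makes standard software applicable.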

Then, we assess the posterior predictive performance of the CL-ILM and compare it with that of the spatial ILM. Table 9 shows the average MSE and SD for posterior prediction of the incidence-based epidemic curves in the SI and SIR frameworks. The table also reports the proportion of time points at which the credible intervals capture the true number of new infections.
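The three summaries can be computed from a set of posterior predictive epidemic curves as sketched below. This assumes that the MSE is computed for each simulated curve against the observed incidence curve, that the reported average and SD are taken over those per-curve MSEs, and that the interval at each time point is the central 95% range of the simulated counts. These are assumptions about the definitions, and the function names are illustrative.

```python
import math
import statistics


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def predictive_summary(observed, simulated, level=0.95):
    """Return (average MSE, SD of MSE, proportion of time points covered).

    observed: new infections per time point; simulated: list of curves of equal length.
    """
    mses = [
        sum((s - o) ** 2 for s, o in zip(curve, observed)) / len(observed)
        for curve in simulated
    ]
    tail = (1.0 - level) / 2.0
    covered = 0
    for t, obs in enumerate(observed):
        draws = sorted(curve[t] for curve in simulated)
        if quantile(draws, tail) <= obs <= quantile(draws, 1.0 - tail):
            covered += 1
    return statistics.mean(mses), statistics.stdev(mses), covered / len(observed)


# Toy example with three simulated curves (the paper uses 500 posterior samples).
observed = [1, 2, 3]
simulated = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
avg_mse, sd_mse, proportion = predictive_summary(observed, simulated)
print(avg_mse, sd_mse, proportion)
```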

Table 9.

Average MSE, SD, and proportion of time points at which the true epidemic curve is captured by the credible intervals under the SI and SIR frameworks.

| | SI: ILM | SI: CL-ILM | SIR: ILM | SIR: CL-ILM |
|---|---|---|---|---|
| Avg MSE | 61.392 | 59.401 | 51.199 | 44.676 |
| SD | 16.739 | 15.027 | 18.162 | 11.721 |
| Proportion | 0.952 | 0.905 | 0.928 | 0.881 |

Under the SI framework, the average mean squared error (MSE) and standard deviation (SD) were similar for the spatial ILM and CL-ILM. However, the proportion of time points at which the true curve was captured by the 95% predictive intervals was marginally lower for the CL-ILM (0.905) than for the spatial ILM (0.952). In contrast, under the SIR framework, the average MSE and SD were lower for the CL-ILM than for the spatial ILM. Nonetheless, the proportion of capture was slightly lower for the CL-ILM (0.881) than for the spatial ILM (0.928). Moreover, fitting the CL-ILM (using the glm function) to the semi-real FMD data took 13.60 min under the SI framework and 5.63 min under the SIR framework, whereas the ILM (using MCMC) required 52.17 min and 29.39 min under the SI and SIR frameworks, respectively.

Fig. 9 (see Appendix) presents the posterior predictive distribution and the 95% credible interval for both the spatial ILM and CL-ILM models applied to the semi-real FMD data. Under the SI framework, the posterior predictive distributions for the spatial ILM and CL-ILM were quite similar, although, under the SIR framework, the CL-ILM exhibited lower posterior uncertainty compared to the spatial ILM, particularly at later time points. Overall, however, these findings suggest that the CL-ILM provides a reasonable approximation to the basic spatial ILM.

Fig. 9.

The posterior predictive distribution and 95% credible intervals (black lines) of ILM and CL-ILM for semi-real FMD data. The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample.

6. Application to TSWV data

The performance of the CL-ILM model is evaluated using a dataset from an experiment on the spread of tomato spotted wilt virus (TSWV) (Hughes et al., 1997). This dataset contains the TSWV infection status of 520 pepper plants grown in a greenhouse. The plants were arranged in 26 rows, each spaced 1 m apart, with 20 plants per row, placed 0.5 m apart. The experiment began on May 26, 1993, and ended on August 16, 1993. The infection period is divided into 14-day intervals, represented as t = 1 to t = 7 (the first infection occurred at t = 2). The TSWV data, available in the EpiILM package in R, include individual ID numbers, location coordinates (X, Y), infectious time, and removal time. We analysed 507 individuals with infection times ranging from t = 4 to t = 7. For the SIR framework, the infectious period was assumed to be constant at three time units.
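The planting layout described above determines the pairwise distances that enter the spatial kernel. As a rough sketch, the coordinates can be generated as a regular grid. The orientation of the axes and the origin are assumptions made for illustration; in practice the coordinates supplied with the dataset should be used.

```python
import math


def greenhouse_coords(n_rows=26, per_row=20, row_gap=1.0, plant_gap=0.5):
    """Regular grid of plant positions in metres: rows along x, plants along y."""
    return [(r * row_gap, c * plant_gap) for r in range(n_rows) for c in range(per_row)]


coords = greenhouse_coords()
print(len(coords))                       # number of plants in the layout
print(math.dist(coords[0], coords[1]))   # neighbouring plants in the same row
print(math.dist(coords[0], coords[20]))  # same position in adjacent rows
```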

We fitted model (6) and model (7) to the TSWV dataset under both the SI and SIR frameworks. To fix the spatial parameter β0, we considered the candidate values 1, 0.2, 0.4, …, 3.8, 4.0. The optimal spatial parameter β̃0 was found to be 1.4 under both frameworks.

Then, we evaluated the CL-ILM's performance using posterior estimates and compared it to the spatial ILM model using average MSE and SD. Table 10 presents the average MSE, and SD for posterior prediction of the incidence-based epidemic curves of TSWV data in both the SI and SIR frameworks. In addition, we summarize the proportion of time points at which credible intervals captured the true number of new infections in this table.

Table 10.

Average MSE, SD, and proportion of time points at which the true epidemic curve is captured by the credible intervals under the SI and SIR frameworks.

| | SI: ILM | SI: CL-ILM | SIR: ILM | SIR: CL-ILM |
|---|---|---|---|---|
| Avg MSE | 155.441 | 143.444 | 152.342 | 142.441 |
| SD | 101.730 | 90.745 | 89.762 | 87.280 |
| Proportion | 1.000 | 1.000 | 1.000 | 1.000 |

The CL-ILM demonstrated lower average MSE and SD than the spatial ILM in both frameworks. Additionally, the proportion of time points at which the true curve was captured by the 95% predictive intervals was exactly 1 for both the spatial ILM and CL-ILM across the SI and SIR frameworks. Moreover, fitting the CL-ILM (using the glm function) to the TSWV data took less than half a minute, whereas the ILM (using MCMC) required 8.26 min and 6.90 min under the SI and SIR frameworks, respectively.

Fig. 10 (see Appendix) shows the posterior predictive distribution and the 95% credible interval for both the spatial ILM and CL-ILM applied to the TSWV data. The posterior predictive patterns were quite similar between the two models, and the level of uncertainty was nearly identical for both ILM and CL-ILM within the SI and SIR frameworks. Once again, this indicates that the CL-ILM serves as a reasonable approximation to the basic spatial ILM.

Fig. 10.

The posterior predictive distribution and 95% credible intervals (black lines) of ILM and CL-ILM for TSWV data (infection times t = 4 to t = 7). The red line indicates the observed epidemic curve, the green lines are the 500 samples and the blue line is the mean of the sample.

7. Discussion

This article has proposed a logistic ILM as both an alternative to, and approximation of, the individual-level model. Generally, the ILM is a complicated model, and inference for it is computationally expensive, especially when it involves a large population. Moreover, the ILM generally calls for coding in a low-level language, which makes the analysis harder for researchers with limited expertise in computational statistics. To address this, we introduced a new modelling framework, the CL-ILM. The logistic model is well understood, with an extensive choice of statistical software for fitting it to data, and the CL-ILM is associated with a substantially lower computational burden.

Overall, we find that the CL-ILM provides a good and computationally efficient approximation to the standard spatial ILM. Comparing parameters directly between these models is not generally helpful, but we note that the MLE-based estimates of the spatial parameter β0 in the CL-ILM are generally close to those of the spatial parameter β in the standard spatial ILM. It is more useful to consider epidemic curve prediction under the models, and we see that, generally, posterior predictive performance under the spatial ILM and CL-ILM is similar. However, the posterior predictive uncertainty was found to be slightly greater under the SIR framework than under the SI framework in our simulation study. In the case of the TSWV data, the posterior predictive uncertainty was approximately the same across the SI and SIR frameworks, whereas for the semi-real FMD data, less uncertainty was observed under the SIR framework than under the SI framework. These findings suggest that the results vary slightly depending on factors such as the number of infected individuals and the duration of the epidemic.

Of course, this study has some limitations, and there are other avenues of research worthy of exploration. Firstly, we supposed that event times (infection and removal times) are known. In practice, event times are typically not observed, and MCMC is used to address this issue. It would therefore be advisable to check that our conclusions are robust when allowing for uncertain event times, though this would undoubtedly increase computational costs. Secondly, we use the maximum likelihood approach for tuning the spatial parameter in the CL-ILM, but other methods, such as probability scoring rules, could be considered here. In addition, if susceptibility and/or transmissibility covariates are included in the model, then the choice of fixed spatial parameter will need to incorporate model uncertainty regarding the covariates. Thus, we might want to consider criteria such as AIC or BIC for a few covariates, or methods such as the LASSO or spike-and-slab priors for large numbers of covariates.

Spatial ILMs can be applied in the public health realm, generally to model disease spread between households or regions that have static spatial locations. CL-ILMs will have direct relevance to such problems. However, we often wish to incorporate different levels of transmission risk in public health models. For example, Liu et al. (2021) extended an ILM to allow for homogeneous within-city spread and distance-based between-city spread, along with individual-level spatial risk, when modelling COVID-19. Further, Mahsin et al. (2022) extended spatial ILMs to allow for a regional spatial structure within regions, when modelling influenza. Both of these situations, ‘metapopulation ILMs’ and ‘geographically dependent ILMs’ respectively, are prone to a high level of computational complexity. Thus, the extension of CL-ILMs to these problems would be of great interest.

Finally, we have only considered the SI and SIR compartmental frameworks for our CL-ILM here, but extension to others, such as SEIR, would be possible. However, the EpiILM package (in R) does not currently support simulating epidemic data under the SEIR or other frameworks. We can also consider the introduction of more complex data structures and dynamics into our CL-ILM framework. For example, we could consider behaviour change mechanisms (e.g., Ward et al., 2023), missing covariate information (Amiti et al., 2023), and contact-network-based continuous-time ILMs (Almutiry & Deardon, 2020).

CRediT authorship contribution statement

Tahmina Akter: Writing – review & editing, Writing – original draft, Visualization, Software, Methodology, Formal analysis, Conceptualization. Rob Deardon: Writing – review & editing, Visualization, Validation, Supervision, Methodology, Conceptualization.

Declaration of competing interest

The authors declare that they have no known financial conflicts of interest or personal connections that might have influenced the work reported in this paper.

Acknowledgements

This project was funded by an Alberta Innovates Graduate Student Scholarship for Data-Enabled Innovation and a University of Calgary Eyes High Doctoral Scholarship, Doctoral Completion Scholarship, Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants program (RGPIN/03292-2022) and the Alberta Innovates Advance - NSERC Alliance program (222302037).

Handling Editor: Yiming Shao

Footnotes

Peer review under responsibility of KeAi Communications Co., Ltd.

Contributor Information

Tahmina Akter, Email: tahmina.akter@ucalgary.ca.

Rob Deardon, Email: rob.deardon@ucalgary.ca.

Appendix.

Table 11.

Average computational time for fitting the ILM and CL-ILM models with a sample size of 500.

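| Framework | True parameter values (α, β) | ILM (min) | CL-ILM (min) |
|---|---|---|---|
| SI | (0.7, 4) | 6.99 | 0.48 |
| SI | (0.5, 3) | 5.35 | 0.40 |
| SI | (0.2, 4) | 9.94 | 0.70 |
| SI | (0.9, 5) | 8.51 | 0.60 |
| SIR | (0.7, 4) | 6.59 | 0.50 |
| SIR | (0.5, 3) | 5.26 | 0.43 |
| SIR | (0.2, 4) | 8.85 | 0.58 |
| SIR | (0.9, 5) | 7.89 | 0.55 |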

Table 12.

Average computational time for fitting the ILM and CL-ILM models with a sample size of 1000.

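| Framework | True parameter values (α, β) | ILM (min) | CL-ILM (min) |
|---|---|---|---|
| SI | (0.7, 4) | 44.91 | 4.19 |
| SI | (0.5, 3) | 30.11 | 2.80 |
| SI | (0.2, 4) | 77.18 | 6.27 |
| SI | (0.9, 5) | 73.18 | 5.64 |
| SIR | (0.7, 4) | 39.77 | 3.22 |
| SIR | (0.5, 3) | 27.85 | 2.88 |
| SIR | (0.2, 4) | 55.65 | 4.02 |
| SIR | (0.9, 5) | 55.43 | 3.86 |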

Table 13.

Model descriptions

| Model | Description |
|---|---|
| Individual-level model (ILM) | $P_{it}=1-\exp\left\{-\Omega_S(i)\sum_{j\in I(t)}\Omega_T(j)\,k(i,j)-\varepsilon(i,t)\right\}$, where $P_{it}$ is the probability that susceptible individual i is infected at time t; $I(t)$ represents the set of infectious individuals; $\Omega_S(i)$ and $\Omega_T(j)$ are susceptibility and transmissibility functions, respectively; $k(i,j)$ is an infection kernel involving risk factors for both; and $\varepsilon(i,t)$ accounts for random infection behavior. |
| Logistic ILM | $\log\left(\frac{P_{it}}{1-P_{it}}\right)=\Psi_S(i)\sum_{j\in I(t)}\Psi_T(j)\,K(i,j)+e_{it}$, where $\Psi_S(i)$ and $\Psi_T(j)$ are susceptibility and transmissibility functions in the logistic ILM, respectively; $K(i,j)$ is the infection kernel; and $e_{it}$ represents unexplained infection. The force of infection ($\Psi_S\sum\Psi_T K$) relates to the log odds of infection probability. |
| Spatial logistic ILM | $\log\left(\frac{P_{it}}{1-P_{it}}\right)=\alpha_0+\alpha_1\sum_{j\in I(t)}d_{ij}^{-\beta_0}$, where $\alpha_0$ is the intercept, $\alpha_1$ is the coefficient of the spatial term, and $\beta_0$ is the spatial parameter. The force of infection ($\alpha_1\sum_{j\in I(t)}d_{ij}^{-\beta_0}$) is linearly related to the log odds of infection probability. |
| Conditional logistic ILM | $\log\left(\frac{P_{it}}{1-P_{it}}\right)=\alpha_0+\alpha_1\sum_{j\in I(t)}d_{ij}^{-\tilde{\beta}_0}$, where $\tilde{\beta}_0$ is a fixed value of $\beta_0$. Here, the condition applies to the spatial parameter $\beta_0$. |

References

  1. Almutiry W., Deardon R. Incorporating contact network uncertainty in individual level models of infectious disease using approximate Bayesian computation. The International Journal of Biostatistics. 2020;16(1):20170092. doi: 10.1515/ijb-2017-0092.
  2. Almutiry W., Warriyar V.K.V., Deardon R. Continuous-time individual-level models of infectious disease: EpiILMCT. Journal of Statistical Software. 2021;98(10):1–44.
  3. Amiti L., Torabi M., Deardon R. Analyzing COVID-19 data in the Canadian province of Manitoba: A new approach. Spatial Statistics. 2023;55. doi: 10.1016/j.spasta.2023.100729.
  4. Barro R.J., Ursua J.F., Weng J. The coronavirus and the great influenza pandemic: Lessons from the “Spanish flu” for the coronavirus's potential effects on mortality and economic activity. National Bureau of Economic Research; 2020.
  5. Deardon R., Brooks S.P., Grenfell B.T., Keeling M.J., Tildesley M.J., Savill N.J., Shaw D.J., Woolhouse M.E.J. Inference for individual-level models of infectious diseases in large populations. Statistica Sinica. 2010;20(1):239–261.
  6. Gardner A., Deardon R., Darlington G.A. Bayesian goodness-of-fit measures for individual-level models of infectious disease. Spatial and Spatio-Temporal Epidemiology. 2011;2(4):273–281. doi: 10.1016/j.sste.2011.07.012.
  7. Guttman I. The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society: Series B. 1967;29:83–100.
  8. Hughes G., McRoberts N., Madden L.V., Nelson S.C. Validating mathematical models of plant-disease progress in space and time. Mathematical Medicine and Biology: A Journal of the IMA. 1997;14(2):85–112.
  9. Jin R., Yan F., Zhu J. Application of logistic regression model in an epidemiological study. Science Journal of Applied Mathematics and Statistics. 2015;3(5):225–229.
  10. Liu Z., Deardon R., Fu Y., Ferdous T., Ware T., Cheng Q. Estimating parameters of two-level individual-level models of the COVID-19 epidemic using ensemble learning classifiers. Frontiers in Physics. 2021;8.
  11. Mahsin M., Deardon R., Brown P. Geographically dependent individual-level models for infectious diseases transmission. Biostatistics. 2022;23:1–17. doi: 10.1093/biostatistics/kxaa009.
  12. Tildesley M.J., Savill N.J., Shaw D.J., Deardon R., Brooks S.P., Woolhouse M.E., Grenfell B.T., Keeling M.J. Optimal reactive vaccination strategies for a foot-and-mouth outbreak in the UK. Nature. 2006;440(7080):83–86. doi: 10.1038/nature04324.
  13. Ward C., Deardon R., Schmidt A.M. Bayesian modeling of dynamic behavioral change during an epidemic. Infectious Disease Modelling. 2023;8(4):947–963. doi: 10.1016/j.idm.2023.08.002.
  14. Warriyar V.K.V., Almutiry W., Deardon R. Individual-level modelling of infectious disease data: EpiILM. R Journal. 2020;12(1).

Articles from Infectious Disease Modelling are provided here courtesy of KeAi Publishing
