2024 Jul 22;80(3):ujae068. doi: 10.1093/biomtc/ujae068

A Gaussian-process approximation to a spatial SIR process using moment closures and emulators

Parker Trostle, Joseph Guinness, Brian J Reich
PMCID: PMC11261348  PMID: 39036985

ABSTRACT

The dynamics that govern disease spread are hard to model because infections are functions of both the underlying pathogen and human or animal behavior. This challenge is increased when modeling how diseases spread between different spatial locations. Many proposed spatial epidemiological models require trade-offs to fit, either by abstracting away theoretical spread dynamics, by fitting a deterministic model, or by requiring large computational resources for many simulations. We propose an approach that approximates the complex spatial spread dynamics with a Gaussian process. We first propose a flexible spatial extension to the well-known SIR stochastic process, and then we derive a moment-closure approximation to this stochastic process. This moment-closure approximation yields ordinary differential equations for the evolution of the means and covariances of the susceptibles and infectious through time. Because these ODEs are a bottleneck to fitting our model by MCMC, we approximate them using a low-rank emulator. This approximation serves as the basis for our hierarchical model for noisy, underreported counts of new infections by spatial location and time. We demonstrate the use of our model to conduct inference on simulated infections from the underlying, true spatial SIR jump process. We then apply our method to model counts of new Zika infections in Brazil from late 2015 through early 2016.

Keywords: emulator models, moment-closure approximations, SIR models, spatiotemporal epidemiology

1. INTRODUCTION

Understanding the dynamics of disease spread is vital for control efforts. Without a deep comprehension of what drives new infections, efforts and resources targeting prevention may be misdirected. Epidemiological modeling plays an important role in this process by allowing researchers to analyze historical infection data. Many of these models are proposed by subject-area experts, such as veterinarians using their understanding of the dynamics of within- and between-farm spread to develop models for porcine diseases. These theory-based models are often compartmental models (eg, Paeng and Lee, 2017; Jones et al., 2021; Galvis et al., 2022). The most realistic of these models are stochastic compartmental models, which are flexible enough to handle the variability in real infection data. Unfortunately, these models are computationally demanding to fit and often require simulation-based approaches such as approximate Bayesian computing (ABC; Beaumont, 2010). This challenge is even greater when the data are spatially indexed, leading to potentially long computational run times that could last weeks or more.

The goal of our work is to demonstrate a novel statistical method to fit spatial, stochastic compartmental models for disease spread without requiring ABC. Such an approach will permit researchers to fit these disease models more efficiently without long simulations, while still properly accounting for the variability inherent to their stochastic processes. The key to our methodology is to approximate the complex epidemiological process with a simpler Gaussian process. In this paper, we demonstrate our approach by proposing a stochastic, spatial susceptible-infectious-recovered (SIR) model based on work by Paeng and Lee (2017) and then using our approximation to conduct inference on the true process.

Modeling data from such a stochastic process is hard in part because characterizing the moments of a continuous-time Markov process is not simple (eg, see Allen, 2008, on the SIR process). We rely therefore on a moment-closure approximation, which allows us to approximate the moments of our spatial SIR process with those of a Gaussian process. The terminology refers to “closing” the dependency on higher-order moments because the higher-order moments in a Gaussian process are fully characterized by its mean and covariance function. Moment-closure approximations were originally proposed in Whittle (1957), and such an approximation was used in Isham (1991) for the non-spatial SIR process. Moment-closure approximations have been used extensively in the natural sciences (eg, Kuehn, 2016; Forgues et al., 2019). There have been some proposed spatial moment-closure models, particularly in ecology (eg, Murrell et al., 2004; Ernst et al., 2019) and for disease spread on networks where nodes are binary-valued (eg, Sharkey et al., 2015; Chen et al., 2020). Our approach is more general and relies on fewer simplifying assumptions than some of the above literature. However, there is a significant drawback to the moment-closure approach. These approximations are characterized by coupled, ordinary differential equations (ODEs). These ODEs are used to calculate the means and spatial covariances for the Gaussian-process approximation. This is a computational bottleneck for even a modest number of spatial locations.

To solve this bottleneck, we implement an emulator-based approximation to the moment-closure forward equations. Emulators, also known as surrogate models, are used to approximate the output of computationally intensive models (Gramacy, 2020). Much of the framework for modern emulators can be traced to Kennedy and O’Hagan (2001), an influential paper that proposed a Bayesian approach to calibrating computer models. There are multiple methods building upon their framework (Higdon et al., 2004; Goldstein and Rougier, 2006; Qian et al., 2006; Bayarri et al., 2007). An influential follow-up methodology was Higdon et al. (2008), which used an SVD-based approach to design emulators. This approach was extended in Hooten et al. (2011) and Leeds et al. (2014). Recently Pratola and Chkrebtii (2018) and Gopalan and Wikle (2022) extended SVD-based approaches to computational output stored in tensors, an approach we adapt in this paper. Many other approaches to constructing emulators are possible, however (eg, Reich et al., 2012; Massoud, 2019; Thakur and Chakraborty, 2022).

In this paper, we describe our methodology and show how it may be used in a spatiotemporal, Bayesian hierarchical model for retrospective statistical inference given noisy counts of new infections. We do so by modeling the latent number of susceptibles using a low-rank emulator model based on a moment-closure approximation to our underlying continuous-time spatial SIR process. Our work makes multiple novel contributions, including extending the moment-closure work of Isham (1991) to a spatial domain, demonstrating the use of emulators to approximate the resulting ODEs, and showing, to the best of our knowledge, the first demonstrated use of an emulator for a covariance function in a physical-statistical model.

2. SPATIAL SIR MODEL

We begin by describing our spatial extension to the SIR model. We base our model on that of Paeng and Lee (2017) but generalize the spatial infections to not necessarily depend on distance between spatial locations, to allow Inline graphic to vary spatially, and by making the process stochastic. We first propose our spatial SIR jump process in Section 2.1, and then we derive its moment-closure approximation and the resulting forward equations in Section 2.2. A background overview for the closed-population SIR model is provided in Web Appendix A.

2.1. Spatial jump process

Let Inline graphic be a spatial lattice with Inline graphic spatial coordinates Inline graphic. Let Inline graphic be the set of spatial locations adjacent to Inline graphic. We denote the susceptibles at Inline graphic at time Inline graphic as Inline graphic, and we similarly denote the infectious at Inline graphic at time Inline graphic as Inline graphic. Each location has population size Inline graphic that does not vary in time. The number of recovered is therefore determined by Inline graphic and Inline graphic, so we do not model them directly.

Define the current state at time Inline graphic as Inline graphic. Consider a sufficiently small Inline graphic such that only one new infection or recovery at most may occur at any spatial location. Then, there are Inline graphic possible events in the interval Inline graphic for this sufficiently small Inline graphic: a new infection in one of the Inline graphic locations, a recovery in one of the Inline graphic locations, or no change. For Inline graphic, let Inline graphic denote the new infection event such that Inline graphic except Inline graphic and Inline graphic, and let Inline graphic denote the recovery event such that Inline graphic except Inline graphic. Then, our spatial SIR jump process is defined via the following conditional probabilities:

2.1. (1)

with the probability of no event, Inline graphic, equal to one minus the sum of the Inline graphic probabilities in (1). These probabilities are controlled by local infection parameters Inline graphic, spatial infection parameters Inline graphic, and a recovery parameter Inline graphic.

Our model makes a few simplifying assumptions about the nature of the disease. First, we assume Inline graphic does not vary spatially because it reflects an intrinsic property of the pathogen, though this may be a simplification in cases where individuals have spatially varying access to healthcare. Second, we do not consider Inline graphic, Inline graphic, or Inline graphic to vary temporally, although in practice they may change as diseases evolve or as healthcare measures are put into place. In this paper, we focus on generally shorter time domains, but our methodology could be extended to allow for temporally varying parameters, e.g., by regressing Inline graphic onto spatiotemporal covariates. Furthermore, our model could be extended to include other common compartments, such as an “exposed” compartment in an SEIR model (Abou-Ismail, 2020). We discuss this further in Web Appendix B.4.

2.2. Forward equations for the spatial SIR jump process

Working with the spatial SIR jump process directly to model new infection counts is not practical. Therefore, we now describe the moment-closure approximation to this jump process in which we use an approximating Gaussian process. As described in Isham (1991), there are two approaches to derive the governing ODEs for the means and spatial covariances through time, which we call the forward equations. We begin by demonstrating the simpler of the two approaches, which we call the heuristic approach, by deriving the forward equation for the mean number of susceptibles.

As in Section 2.1, consider Inline graphic and a sufficiently small Inline graphic that only the Inline graphic transition events described earlier may occur. For a particular Inline graphic, there can be two events that occur:

2.2. (2)

where Inline graphic was defined above in (1). The expected change in Inline graphic is therefore Inline graphic. Taking the expectation of this quantity with respect to the spatially indexed Inline graphic following a Gaussian process and then taking the limit as Inline graphic yields:

2.2. (3)

where Inline graphic is the mean susceptibles at Inline graphic at time Inline graphic, Inline graphic is the mean infectious at Inline graphic at time Inline graphic, and Inline graphic is the covariance between the susceptibles at Inline graphic and infectious at Inline graphic at time Inline graphic. Though not present in (3), we define Inline graphic and Inline graphic similarly.

Deriving the remainder of the moment-closure forward equations to the spatial SIR jump process for Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic may be done in a similar fashion as shown in (3). In general, this is straightforward though tedious algebra. The full set of forward equations is provided in Web Appendix B along with derivation details. The alternative derivation approach is based on Whittle (1957) and involves moment and cumulant generating functions. Details on that method are also provided in Web Appendix B. The two derivation methods yield the same forward equations.
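As an illustration of the heuristic derivation, the following minimal sketch integrates the Gaussian moment-closure equations for the simpler non-spatial SIR jump process, in the spirit of Isham (1991). It is not the spatial system of this paper. The symbols `beta`, `gamma`, and `n_pop`, the numerical values, and the choice of SciPy's LSODA solver are all assumptions made for the example. Each right-hand side follows from taking expectations of the jump rates and setting third central moments to zero, so that Cov(SI, S) = mu_S cov_SI + mu_I var_S and Cov(SI, I) = mu_S var_I + mu_I cov_SI.

```python
import numpy as np
from scipy.integrate import solve_ivp


def sir_moment_closure_rhs(t, y, beta, gamma, n_pop):
    """Gaussian moment-closure ODEs for a non-spatial SIR jump process.

    State y = (mu_S, mu_I, var_S, cov_SI, var_I). Infection occurs at rate
    beta * S * I / n_pop and recovery at rate gamma * I (assumed forms).
    """
    mu_s, mu_i, var_s, cov_si, var_i = y
    b = beta / n_pop
    # E[S * I] under the Gaussian approximation
    e_si = mu_s * mu_i + cov_si
    # Cov(S * I, S) and Cov(S * I, I) with third central moments set to zero
    cov_si_s = mu_s * cov_si + mu_i * var_s
    cov_si_i = mu_s * var_i + mu_i * cov_si

    d_mu_s = -b * e_si
    d_mu_i = b * e_si - gamma * mu_i
    d_var_s = b * e_si - 2.0 * b * cov_si_s
    d_cov_si = -b * e_si + b * cov_si_s - b * cov_si_i - gamma * cov_si
    d_var_i = (b * e_si + gamma * mu_i
               + 2.0 * b * cov_si_i - 2.0 * gamma * var_i)
    return [d_mu_s, d_mu_i, d_var_s, d_cov_si, d_var_i]


# Hypothetical values chosen only for illustration
n_pop, i0 = 100_000.0, 100.0
beta, gamma = 0.25, 0.1
y0 = [n_pop - i0, i0, 0.0, 0.0, 0.0]  # known initial state, so zero variance
t_eval = np.linspace(0.0, 150.0, 151)

sol = solve_ivp(
    sir_moment_closure_rhs, (0.0, 150.0), y0,
    args=(beta, gamma, n_pop), t_eval=t_eval,
    method="LSODA", rtol=1e-8, atol=1e-6,
)
mu_s, mu_i, var_s, cov_si, var_i = sol.y

# Simple diagnostic: is the mean susceptible curve nonincreasing?
is_monotone = bool(np.all(np.diff(mu_s) <= 1e-6))
```

The spatial forward equations have the same structure but carry one mean per location and one covariance per pair of locations, which is why their cost grows quickly with the number of locations.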

We assume the starting conditions of the outbreak do not correspond to a high probability of the infection dying out. This phenomenon, referred to as “epidemic fadeout” (Lloyd, 2004), is a source of failure for moment-closure approximations. Only a subspace of the parameter space yields realistic epidemic behavior. Checking that the curves for the mean susceptibles are monotonically nonincreasing serves as a check on these assumptions, so in practice our prior is truncated to the space of values that do not lead to epidemic fadeout.

3. EMULATOR FOR THE MEAN AND COVARIANCE FUNCTIONS

The moment-closure approximation for the number of susceptibles Inline graphic is a Gaussian process with mean function Inline graphic and spatial covariance function Inline graphic. Both of these functions vary over space and time and depend on the model parameters Inline graphic, where Inline graphic and Inline graphic is the source of the outbreak. We assume that the recovery rate Inline graphic and starting time of the outbreak Inline graphic are known. Typically Inline graphic may be estimated using patient reports, and Inline graphic may be imputed based on initial reported cases. In this section, we develop a statistical emulator that approximates the forward equations and can be used for repeated function calls in an MCMC algorithm. To do so, we evaluate the forward equations at Inline graphic space-filling input parameters Inline graphic and then use the realizations of the forward equations to build a statistical prediction of their values at other inputs Inline graphic. We denote the output of running the forward equations for input Inline graphic as Inline graphic and Inline graphic, where Inline graphic.

Much of the discussion in this section will use tensor terminology and notation. We provide a brief overview of the necessary background in Web Appendix C. We first describe our methodology in Sections 3.1 and 3.2 in terms of the higher-order singular value decompositions (HOSVD) of tensors, as in Gopalan and Wikle (2022). In Section 3.3, we discuss our imputation methodology for arbitrary Inline graphic. In Section 3.4, we describe our model for new infections.

3.1. Low-rank approximation to the mean function

We store our simulated output for the mean susceptibles in a third-order tensor Inline graphic, where the slice Inline graphic is the matrix Inline graphic. We construct a low-dimensional approximation to Inline graphic using an HOSVD (De Lathauwer et al., 2000; Kolda and Bader, 2009), such that the low-dimensional approximation is Inline graphic with Inline graphic and Inline graphic. The HOSVD is frequently compared with the well-known SVD for matrices, and it is a method to calculate a “low-rank approximation [to a tensor] with small reconstruction error” (Zare et al., 2018).

We denote the spatial factor matrix of Inline graphic as Inline graphic and the temporal factor matrix of Inline graphic as Inline graphic with columns Inline graphic and Inline graphic. This corresponds to a Inline graphic low-rank approximation to Inline graphic, with Inline graphic and Inline graphic. The low-rank approximation to the mean susceptibles is therefore

3.1. (4)

The basis weights Inline graphic control the dependence of the mean function on the model parameters Inline graphic as well as the interactions between Inline graphic and Inline graphic. These weights are the focus of our interpolation in Section 3.3, and they result from calculating Inline graphic, where Inline graphic and Inline graphic are the n-mode products. We note that if Inline graphic and Inline graphic, then our approximation to Inline graphic is exact (Kolda and Bader, 2009). Finally, we implement streaming algorithms for this low-rank approximation for large Inline graphic, Inline graphic, and Inline graphic to avoid storing the entire tensor Inline graphic. These streaming algorithms let us calculate Inline graphic and Inline graphic by holding only one Inline graphic in memory at a time. See Web Appendix D.
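The low-rank construction can be sketched in a few lines of NumPy. The example below is a hedged illustration rather than the authors' implementation: it assumes the tensor is ordered as space by time by parameter draw, obtains each factor matrix from the leading left singular vectors of the corresponding unfolding, and forms the weights by n-mode products with the transposed factor matrices. All function and variable names are hypothetical, and the whole tensor is held in memory instead of being streamed.

```python
import numpy as np


def unfold(tensor, mode):
    """Unfold a tensor into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def leading_left_singular_vectors(matrix, rank):
    """Return the first `rank` left singular vectors of a matrix."""
    u, _, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank]


def low_rank_mean(tensor, rank_space, rank_time):
    """Compress a (space x time x draw) tensor along its first two modes."""
    u_space = leading_left_singular_vectors(unfold(tensor, 0), rank_space)
    u_time = leading_left_singular_vectors(unfold(tensor, 1), rank_time)
    # n-mode products with the transposed factor matrices give the weights
    weights = np.einsum("stl,sj,tk->jkl", tensor, u_space, u_time)
    return u_space, u_time, weights


def reconstruct(u_space, u_time, weights):
    """Map basis weights back to the (space x time x draw) scale."""
    return np.einsum("jkl,sj,tk->stl", weights, u_space, u_time)


# Synthetic tensor with multilinear rank (3, 2) in space and time
rng = np.random.default_rng(0)
n_space, n_time, n_draws = 20, 15, 8
core = rng.normal(size=(3, 2, n_draws))
a_space = rng.normal(size=(n_space, 3))
a_time = rng.normal(size=(n_time, 2))
tensor = np.einsum("jkl,sj,tk->stl", core, a_space, a_time)

u_space, u_time, weights = low_rank_mean(tensor, rank_space=3, rank_time=2)
approx = reconstruct(u_space, u_time, weights)
max_err = np.abs(approx - tensor).max()
```

Because the synthetic tensor has exactly the ranks requested, the reconstruction error is at the level of floating-point rounding. With real forward-equation output the ranks would be tuning choices that trade accuracy against the number of weights to interpolate.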

3.2. Low-rank approximation to the covariance function

Our approach to emulating the spatial covariance function is similar to our approach for the mean function. We store our simulated output for the spatial covariance matrices in a tensor Inline graphic. The marginal spatial covariance matrix for Inline graphic at time Inline graphic is therefore Inline graphic. We now wish to calculate a low-rank approximation Inline graphic to Inline graphic based on an HOSVD, where Inline graphic and Inline graphic.

There is symmetry in Inline graphic because the slice corresponding to Inline graphic is a covariance matrix. Therefore, the factor matrices for the first and second mode are identical. To ensure symmetry while avoiding redundancies, we build an emulator for Inline graphic such that Inline graphic. It is more efficient to emulate Inline graphic because no matrix decompositions are required while model fitting.

We first calculate the spatial factor matrix Inline graphic, where Inline graphic are the orthonormal spatial basis functions that capture spatial variability in the unfolded tensor along the first mode (equivalently, the second mode). We then calculate Inline graphic, which calculates time- and parameter-indexed weights for the interaction of the spatial basis functions. In the case of Inline graphic, we may write Inline graphic. It follows there exists a Cholesky decomposition Inline graphic such that Inline graphic, which naturally implies that we may set Inline graphic (see Web Appendix E).

Approximating Cholesky decompositions in an analogous fashion serves as a further dimension reduction. This can be accomplished by performing a low-rank approximation to Inline graphic. We perform this low-rank approximation by forming a tensor Inline graphic, where Inline graphic, and calculating a temporal factor matrix for Inline graphic, i.e., we calculate the first Inline graphic left singular vectors of Inline graphic unfolded along its third, temporal mode. We denote this temporal factor matrix as Inline graphic, where Inline graphic are orthonormal temporal basis functions. We then calculate the weight tensor Inline graphic by calculating Inline graphic.

Therefore, we approximate the covariance function as

3.2. (5)
3.2. (6)

By modeling the covariance function as Inline graphic with Inline graphic comprised of elements Inline graphic, we ensure that our approximation for Inline graphic is non-negative definite for all Inline graphic and Inline graphic. We note as before that all our approximations are exact when Inline graphic and Inline graphic (Kolda and Bader, 2009), and we also note these calculations may be performed using streaming algorithms to avoid memory problems (see Web Appendix D).
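The reason for emulating a Cholesky-type factor rather than the covariance itself can be seen in a small numerical sketch. The example below uses a hypothetical covariance matrix, a hypothetical orthonormal basis (leading eigenvectors standing in for the spatial factor matrix), and an arbitrary perturbation standing in for interpolation error. Whatever error enters the factor, the product of the factor with its transpose remains non-negative definite.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, rank = 12, 4

# Hypothetical covariance matrix standing in for one forward-equation output
a = rng.normal(size=(n_sites, n_sites))
sigma = a @ a.T + 1e-6 * np.eye(n_sites)

# Orthonormal spatial basis: leading eigenvectors of sigma
# (an assumed stand-in for the spatial factor matrix)
eigvals, eigvecs = np.linalg.eigh(sigma)
u_space = eigvecs[:, ::-1][:, :rank]

# Project onto the basis and factor the small rank x rank matrix
w = u_space.T @ sigma @ u_space
chol = np.linalg.cholesky(w)

# Perturb the factor, as interpolation error in an emulator might
chol_hat = chol + 0.05 * np.tril(rng.normal(size=chol.shape))

# The approximate covariance is a Gram matrix, hence non-negative definite
factor = u_space @ chol_hat
sigma_hat = factor @ factor.T
min_eig = np.linalg.eigvalsh(sigma_hat).min()
```

Had the entries of the covariance matrix been interpolated directly, nothing would prevent the interpolated matrix from having negative eigenvalues; interpolating the factor avoids this and also avoids a matrix decomposition at every MCMC iteration.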

3.3. Gaussian process regression for interpolation

To fit our data model using MCMC, we need to evaluate Inline graphic and Inline graphic for proposed parameters Inline graphic. Given the low-rank approximations above, evaluating Inline graphic and Inline graphic reduces to evaluating Inline graphic and Inline graphic. Therefore, we require a model for the weights as functions of the input parameters Inline graphic. To do so, we build a Gaussian-process regression for Inline graphic and Inline graphic assuming independence between the indices Inline graphic and using the same covariance for all predictions. We then predict Inline graphic and Inline graphic using local kriging with a Matérn correlation function. We provide additional details in Web Appendix D.
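A minimal sketch of local kriging for a single basis weight is given below. It is an illustration under stated assumptions, not the procedure of Web Appendix D: the smoothness is fixed at 3/2, the length scale and nugget are hypothetical fixed values, the local mean is the average of the neighboring weights, and a regular grid stands in for a space-filling design. The function names are hypothetical.

```python
import numpy as np


def matern32(dist, length_scale):
    """Matern correlation function with smoothness 3/2."""
    scaled = np.sqrt(3.0) * dist / length_scale
    return (1.0 + scaled) * np.exp(-scaled)


def local_krige(theta_new, theta_train, w_train, n_neighbors=10,
                length_scale=0.3, nugget=1e-8):
    """Predict one weight at theta_new from its nearest design points."""
    dist_to_new = np.linalg.norm(theta_train - theta_new, axis=1)
    idx = np.argsort(dist_to_new)[:n_neighbors]
    x, w = theta_train[idx], w_train[idx]
    w_mean = w.mean()
    pairwise = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    k_xx = matern32(pairwise, length_scale) + nugget * np.eye(len(idx))
    k_x_new = matern32(dist_to_new[idx], length_scale)
    # Simple kriging predictor around the local mean
    return w_mean + k_x_new @ np.linalg.solve(k_xx, w - w_mean)


# Hypothetical design: a 15 x 15 grid over a two-dimensional parameter space
grid = np.linspace(0.0, 1.0, 15)
theta_train = np.array([(a, b) for a in grid for b in grid])
# Hypothetical smooth "weight" as a function of the parameters
w_train = np.sin(3.0 * theta_train[:, 0]) + theta_train[:, 1] ** 2

pred_design = local_krige(theta_train[100], theta_train, w_train)
theta_star = np.array([0.52, 0.37])
pred_new = local_krige(theta_star, theta_train, w_train)
truth_new = np.sin(3.0 * theta_star[0]) + theta_star[1] ** 2
```

Restricting each prediction to a few neighbors keeps the linear solve small, so the dominant cost becomes the neighbor search over the design points.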

3.4. Emulator model for new infections

Using the emulators described above, our model for the new infections Inline graphic is

3.4. (7a)
3.4. (7b)
3.4. (7c)
3.4. (7d)

Under this parameterization, the mean of Inline graphic is Inline graphic with variance Inline graphic, with Inline graphic being the reporting rate and Inline graphic controlling overdispersion. These parameters model real-world variance in the number of new infections reported.
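To make the roles of the reporting rate and the overdispersion parameter concrete, the sketch below draws underreported counts from a negative-binomial distribution. The parameterization is an assumption for illustration, since the exact form in (7a) to (7d) is not reproduced here: the mean is the reporting rate times the number of new infections, and the variance is the mean plus the squared mean divided by a size parameter. The names `reporting_rate` and `size_param` are hypothetical.

```python
import numpy as np


def draw_reported_counts(new_infections, reporting_rate, size_param, rng):
    """Draw negative-binomial counts of reported infections.

    Assumed parameterization: mean = reporting_rate * new_infections and
    variance = mean + mean**2 / size_param.
    """
    mean = reporting_rate * np.asarray(new_infections, dtype=float)
    # NumPy uses (n, p) with mean n * (1 - p) / p, so p = n / (n + mean)
    prob = size_param / (size_param + mean)
    return rng.negative_binomial(size_param, prob)


rng = np.random.default_rng(1)
# 200,000 replicates of 100 true new infections, half of which are reported
counts = draw_reported_counts(np.full(200_000, 100.0), 0.5, 10.0, rng)
sample_mean = counts.mean()  # should be close to 50
sample_var = counts.var()    # should be close to 50 + 50**2 / 10 = 300
```

Under this assumed form, smaller values of the size parameter give heavier overdispersion, and the distribution approaches a Poisson as the size parameter grows.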

We use B-spline basis curves as a low-rank approximation to a time series to capture temporal correlation in the susceptibles. Let Inline graphic, and let Inline graphic be a matrix containing Inline graphic B-spline basis curves standardized so that the diagonal entries of Inline graphic are equal to 1. Therefore, Inline graphic and Inline graphic.
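One way to build such a standardized basis is sketched below. It assumes, as a reading of the text rather than a confirmed detail, that the matrix being standardized is the product of the basis matrix with its transpose, so that independent standard normal coefficients give a temporal process with unit marginal variance. The number of time points, the number of basis functions, and the clamped cubic knot vector are hypothetical choices.

```python
import numpy as np
from scipy.interpolate import BSpline

n_times, n_basis, degree = 40, 8, 3

# Clamped knot vector on [0, 1] that yields n_basis cubic B-splines
interior = np.linspace(0.0, 1.0, n_basis - degree + 1)
knots = np.concatenate([np.zeros(degree), interior, np.ones(degree)])

# Evaluate at interval midpoints so every point lies strictly inside [0, 1]
x = (np.arange(n_times) + 0.5) / n_times

# An identity coefficient matrix evaluates all basis functions at once
basis = BSpline(knots, np.eye(n_basis), degree)(x)  # shape (n_times, n_basis)

# Scale each row to unit Euclidean norm so that diag(B B^T) = 1
basis_std = basis / np.linalg.norm(basis, axis=1, keepdims=True)
gram_diag = np.sum(basis_std ** 2, axis=1)
```

With this scaling the off-diagonal entries of the Gram matrix can be read directly as temporal correlations implied by the basis.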

We use a truncated normal prior for Inline graphic with mean 3, variance 25, and lower bound 1; uninformative normal priors for Inline graphic and Inline graphic; a discrete uniform prior for Inline graphic over Inline graphic; and a standard normal prior for Inline graphic.

3.5. Computational details

We briefly summarize the three main steps in our approach and their computational burdens, given the streaming algorithms in Web Appendix D:

  1. Running the forward equations requires Inline graphic  Inline graphic loops, where Inline graphic is the number of time steps between Inline graphic and Inline graphic to approximate the continuous curves (see Web Appendix F). The memory requirements are on the order of Inline graphic. Sparse-matrix operations greatly assist in this step, making the Inline graphic matrix much faster in practice. For example, the most costly step naively is instead Inline graphic with sparse matrices, where Inline graphic is the number of edges on the spatial graph (Davis, 2006). Furthermore, the forward calculations may be calculated in parallel.

  2. Calculating the basis functions for Inline graphic requires Inline graphic matrix multiplications (which may be done in parallel), a subsequent Inline graphic eigendecomposition, and then Inline graphic matrix multiplications. The memory requirements are on the order of Inline graphic, though this could be eased as described in Web Appendix D.

  3. Fitting the model using MCMC requires storing the weights for the tensor-based approximations in memory (on the order of Inline graphic for Inline graphic and Inline graphic for Inline graphic). The slowest step is the repeated Inline graphic search for the nearest neighbors in the parameter space for proposed values of Inline graphic.

We provide additional suggestions on implementation details as well as our MCMC algorithm in Web Appendix F. Though the emulator setup requires nontrivial computational resources, it yields about a 40-fold speed-up in computing Inline graphic and Inline graphic compared with running the forward equations directly when fitting the model. Table 1 provides some example computational times for preparing the emulator. These times are based on a university cluster with nodes consisting mostly of Intel Xeon Gold 6130, 6226R, and Platinum 8358 processors.

TABLE 1.

Runtime estimates to perform all forward-equation and emulator computations for Inline graphic and Inline graphic on a university cluster.

| Inline graphic | Mean-function runtime | Covariance-function runtime |
| --- | --- | --- |
| 100 | 8.6 sec | 149.6 sec |
| 225 | 21.0 sec | 15.3 min |
| 400 | 24.3 sec | 59.1 min |

4. SIMULATION STUDY

We analyze the performance of our method on data simulated from the spatial SIR jump process described in Section 2.1. We simulate counts of daily susceptibles from this jump process using the Gillespie stochastic simulation algorithm (Gillespie, 1976, 1977). The simulated response data Inline graphic are generated by drawing from an overdispersed negative-binomial distribution with mean Inline graphic and with overdispersion Inline graphic.

4.1. Simulation study with constant β

We define a regular lattice of 25 spatial locations with rook adjacencies. The population size is uniformly 100 000, and the center of the square grid is Inline graphic. We assume 100 individuals were infected at time 0 at Inline graphic and all others were susceptible. We use Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic. We begin our analysis at day 61, so Inline graphic, reflecting data not being collected in the early days of an outbreak. We simulate 200 data sets for each simulation study.
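For readers who want to experiment with this kind of setup, the following is a minimal sketch of a Gillespie simulation on a rook-adjacency lattice. The transition rates are an assumed form, not a reproduction of (1): infection at a location occurs at rate S (beta I + phi times the neighboring infectious) / N, and recovery at rate eta I. The parameter names and values are hypothetical, and the lattice and populations are kept much smaller than in the study so that the example runs quickly.

```python
import numpy as np


def rook_neighbors(n_rows, n_cols):
    """Adjacency lists for a lattice where cells sharing an edge are adjacent."""
    nbrs = []
    for r in range(n_rows):
        for c in range(n_cols):
            cell = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    cell.append(rr * n_cols + cc)
            nbrs.append(cell)
    return nbrs


def gillespie_spatial_sir(s0, i0, pop, nbrs, beta, phi, eta, t_max, rng):
    """Simulate one path of a spatial SIR jump process (assumed rate forms)."""
    s, i = np.array(s0, dtype=int), np.array(i0, dtype=int)
    n_sites = len(s)
    t = 0.0
    times, s_path, i_path = [t], [s.copy()], [i.copy()]
    while True:
        nbr_inf = np.array([i[nb].sum() for nb in nbrs])
        infect = s * (beta * i + phi * nbr_inf) / pop  # assumed infection rate
        recover = eta * i                              # assumed recovery rate
        rates = np.concatenate([infect, recover])
        total = rates.sum()
        if total <= 0.0:
            break  # no infectious individuals remain
        t += rng.exponential(1.0 / total)  # waiting time to the next event
        if t > t_max:
            break
        event = rng.choice(2 * n_sites, p=rates / total)
        if event < n_sites:      # new infection at location `event`
            s[event] -= 1
            i[event] += 1
        else:                    # recovery at location `event - n_sites`
            i[event - n_sites] -= 1
        times.append(t)
        s_path.append(s.copy())
        i_path.append(i.copy())
    return np.array(times), np.array(s_path), np.array(i_path)


# Hypothetical small example: 3 x 3 lattice, outbreak seeded at the center
rng = np.random.default_rng(42)
nbrs = rook_neighbors(3, 3)
pop = np.full(9, 200)
i0 = np.zeros(9, dtype=int)
i0[4] = 5
s0 = pop - i0
times, s_path, i_path = gillespie_spatial_sir(
    s0, i0, pop, nbrs, beta=0.5, phi=0.1, eta=0.2, t_max=30.0, rng=rng
)
```

Daily counts of new infections can be obtained from such a path by differencing the susceptibles at integer times, which is the quantity to which an observation model would then be applied.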

We tune our emulators to use Inline graphic, Inline graphic, Inline graphic, and Inline graphic (see Web Appendix G for details). For the first simulation, we use Inline graphic so there is no underreporting, and we use Inline graphic. For the second simulation, we use Inline graphic so not all new infections get reported. For the third simulation, we use Inline graphic but set Inline graphic. For all simulations, we also fit a simpler model that only uses the mean function, i.e., sets all Inline graphic equal to 0. Table 2 shows our results. It takes 2.0 minutes to draw 1000 posterior samples for these simulations.

TABLE 2.

Results for the first set of simulation studies using the full emulator approximation to the moment-closure equations (“full”), using only the mean emulator (“mean only”), approximate Bayesian computation (“ABC”), or least-squares optimization (“ODE”).

| Scenario | Model | Inline graphic, 95% cov. | Inline graphic, 99% cov. | Inline graphic, 95% cov. | Inline graphic, 99% cov. | Inline graphic, MSE (SE) | Inline graphic, MSE (SE) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Full | 96.0% | 99.0% | 86.0% | 96.5% | 3.507 (0.366) | 0.547 (0.058) |
| Baseline | Mean only | 41.0% | 51.5% | 31.0% | 38.5% | 26.153 (5.375) | 3.363 (0.590) |
| Baseline | ABC | 100.0% | 100.0% | 100.0% | 100.0% | 6.304 (0.585) | 0.737 (0.068) |
| Baseline | ODE | 38.0% | 59.0% | 39.0% | 64.0% | 19.773 (2.224) | 2.698 (0.301) |
| Inline graphic | Full | 96.0% | 99.0% | 89.0% | 95.5% | 4.523 (0.434) | 0.595 (0.058) |
| Inline graphic | Mean only | 46.5% | 60.0% | 38.0% | 48.5% | 15.005 (1.644) | 2.018 (0.217) |
| Inline graphic | ABC | 100.0% | 100.0% | 100.0% | 100.0% | 7.276 (0.638) | 0.771 (0.070) |
| Inline graphic | ODE | 41.5% | 59.0% | 44.0% | 65.0% | 15.634 (1.681) | 2.054 (0.218) |
| Inline graphic | Full | 95.5% | 97.0% | 92.0% | 96.5% | 7.873 (0.819) | 0.832 (0.081) |
| Inline graphic | Mean only | 63.0% | 74.0% | 49.5% | 61.0% | 18.574 (2.934) | 2.354 (0.321) |
| Inline graphic | ABC | 99.5% | 100.0% | 99.5% | 100.0% | 12.493 (1.189) | 1.210 (0.117) |
| Inline graphic | ODE | 40.5% | 60.0% | 43.5% | 63.0% | 17.203 (1.739) | 2.181 (0.213) |

Empirical interval coverage is shown along with MSE and its standard error, multiplied by Inline graphic.

We compare our results in Table 2 with two standard approaches for spatial epidemiological data. First, we fit the model using ABC (Beaumont et al., 2009). We use the R package SimInf to simulate from the stochastic spatial SIR process (Bauer et al., 2016; Widgren et al., 2019), and we use the same priors as in our proposed methodology. We use the means and standard deviations of the spatial infection counts to accept or reject proposed parameter draws, and we follow Beaumont et al. (2009) by repeatedly decreasing the acceptance threshold by 10% over ten iterations of the algorithm. Second, we fit the model analogously to Paeng and Lee (2017) via least-squares optimization using ODE curves based on (1). We also estimate confidence intervals for these parameters using parametric bootstrapping (Chowell, 2017).

As Table 2 shows, our good coverage and parameter estimates come from the covariance emulator. Fitting only mean curves results in worse coverage, a consequence of failing to allow sufficient model flexibility. As expected, our mean squared errors (MSEs) increase as overdispersion increases or reporting rates decrease. Our results compare favorably to the competing methods, each of which has material drawbacks. Though easy to implement, ABC is slow, taking two to three days to draw 200 posterior samples. Its coverage typically remains at 100%, with high MSEs. Even easier to implement is the least-squares optimization in the ODE approach, but its estimates are poor and have inadequate coverage. It is the fastest approach, running in under an hour.

4.2. Simulation study with spatially varying β(s)

For our second set of simulations, we allow Inline graphic to vary spatially such that Inline graphic. We draw Inline graphic and then calculate Inline graphic to center the spatially-varying covariate. We use Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic while keeping Inline graphic and Inline graphic. We begin our analysis at day 21, i.e., Inline graphic. We again use Inline graphic, Inline graphic, Inline graphic, and Inline graphic (see Web Appendix G).

Table 3 shows coverage compared with a mean-emulator-only model, and Table 4 similarly shows the MSEs for Inline graphic, Inline graphic, and Inline graphic. The difference between including or excluding the covariance emulator is striking. The problem with the mean-only model is insufficient flexibility to estimate the parameters well. The inclusion of the covariance emulator yields good coverage and MSEs. It takes 5.5 minutes to draw 1000 posterior samples for these simulations.

TABLE 3.

Empirical credible-interval coverage results for second set of simulation studies using the full emulator approximation (“full model”) to the moment-closure equations or using only the mean emulator (“mean only”) as well as an approximate Bayesian computation (“ABC”) fit and a least-squares fit using the ODEs (“ODE”).

| Study | Inline graphic, 95% cov. | Inline graphic, 99% cov. | Inline graphic, 95% cov. | Inline graphic, 99% cov. | Inline graphic, 95% cov. | Inline graphic, 99% cov. |
| --- | --- | --- | --- | --- | --- | --- |
| Full model | 92.0% | 96.0% | 89.5% | 96.0% | 88.5% | 95.5% |
| Mean only | 24.5% | 27.5% | 14.0% | 17.0% | 21.5% | 28.0% |
| ABC | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| ODE | 100.0% | 100.0% | 59.5% | 65.5% | 99.0% | 100.0% |

In this study, Inline graphic where Inline graphic is a spatially varying covariate.

TABLE 4.

MSEs and their standard errors for second set of simulation studies using the full emulator approximation (“full model”) to the moment-closure equations or using only the mean emulator (“mean only”) as well as an approximate Bayesian computation (“ABC”) fit and a least-squares fit using the ODEs (“ODE”).

| Study | Inline graphic, MSE (SE) | Inline graphic, MSE (SE) | Inline graphic, MSE (SE) |
| --- | --- | --- | --- |
| Full model | 25.202 (2.843) | 1.298 (0.130) | 0.010 (0.001) |
| Mean only | 171.197 (7.682) | 7.682 (1.301) | 0.071 (0.005) |
| ABC | 5,946.657 (187.692) | 435.759 (12.136) | 0.554 (0.014) |
| ODE | 110.416 (14.678) | 17.660 (1.266) | 0.033 (0.005) |

In this study, Inline graphic where Inline graphic is a spatially varying covariate. Both MSEs and SEs are multiplied by Inline graphic.

We again compare our results with ABC and a least-squares fit using ODE curves. As before, ABC was slow, and we decreased the acceptance tolerance threshold only five times to keep the runtime under four days. This led to wide credible intervals and poor MSE, suggesting yet more time was needed to fit this model. Unlike the first simulation study, the ODE approach sometimes yielded 100% coverage. We found the estimated sampling distributions exhibited strong skew with long tails, caused by unlikely simulated infections leading to wildly different parameter estimates.

5. DATA ANALYSIS

We demonstrate the application of our model to Zika virus outbreak data in Brazil. The Zika virus is typically a mild virus, causing fevers, rashes, and various pains (CDC, 2019b). Unfortunately, there are serious complications for pregnant women, whose children may be born with multiple severe birth defects including microcephaly (CDC, 2019a). Several papers have studied the outbreak in Brazil as a whole but left spatial and geographic heterogeneities for future work (Wang et al., 2017; Sadeghieh et al., 2021). One paper fit separate models for the outbreaks in eight different states independently (Zhao et al., 2019).

The Pan American Health Organization (PAHO) reports government-supplied data on weekly new cases of the Zika virus (PAHO, 2021). The data are reported for all 27 states of Brazil, including the Federal District. The first significant outbreak began in Maranhao in early 2015 with infections peaking in the second half of 2015 and into 2016 (PAHO, 2016). We fit our model to the 40 weeks of outbreak data beginning in the 41st week of 2015 through the 28th week of 2016, corresponding roughly to October 2015 through July 2016 (PAHO, 2021).

We relied on the estimate of weekly recovery rate Inline graphic from Sadeghieh et al. (2021), which studied Zika spread by mosquitoes. We assumed the outbreak began in the first week of 2015 in Maranhao with 100 infections. We expected weekly new infections to be underreported; one paper suggests the reporting rate for Zika infections was between 7% and 17% for Central and South America as a whole (Shutt et al., 2017). These rates likely varied by country. To pick starting values and to evaluate reporting rates, we fit a simple model to the Zika data with a Poisson likelihood and mean function based on a deterministic SIR model. We found that even the lower bound of 7% yielded poor fits to the data, and we estimated that the reporting rates were often closer to 1% and varied by state. We fixed those reporting rates for all subsequent analysis. Additional suggestions on using available literature and estimating initial parameters are in Web Appendix H, which may be useful in cases when the disease is not as well documented as Zika.
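A preliminary fit of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the Euler integration, the grid search over a single transmission rate `beta`, and every numerical value (population size, recovery rate, reporting rate) are hypothetical, and the counts are simulated rather than taken from the PAHO data.

```python
import numpy as np

def sir_weekly_new_infections(beta, gamma, n_pop, i0, n_weeks, steps_per_week=20):
    """Euler-integrate a deterministic SIR model for one or more beta values.

    Returns an array of shape (n_weeks, n_beta) of new infections per week.
    """
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    dt = 1.0 / steps_per_week
    s = np.full(beta.shape, float(n_pop - i0))
    i = np.full(beta.shape, float(i0))
    new_cases = np.zeros((n_weeks, beta.size))
    for week in range(n_weeks):
        for _ in range(steps_per_week):
            infections = beta * s * i / n_pop * dt
            s = s - infections
            i = i + infections - gamma * i * dt
            new_cases[week] += infections
    return new_cases

def poisson_neg_log_lik(counts, mean):
    """Poisson negative log-likelihood (up to a constant) for each column of mean."""
    mean = np.maximum(mean, 1e-12)  # guard against log(0)
    return np.sum(mean - counts[:, None] * np.log(mean), axis=0)

# Hypothetical settings: weekly rates, one state, 1% reporting.
rng = np.random.default_rng(0)
gamma, n_pop, i0, report_rate, n_weeks = 0.5, 1_000_000, 100, 0.01, 40
beta_true = 1.2

# Simulate reported counts: Poisson noise around report_rate * new infections.
true_mean = report_rate * sir_weekly_new_infections(
    beta_true, gamma, n_pop, i0, n_weeks)[:, 0]
counts = rng.poisson(true_mean)

# Grid search for the transmission rate that minimizes the Poisson NLL.
beta_grid = np.linspace(0.6, 2.0, 281)
grid_mean = report_rate * sir_weekly_new_infections(
    beta_grid, gamma, n_pop, i0, n_weeks)
nll = poisson_neg_log_lik(counts, grid_mean)
beta_hat = beta_grid[np.argmin(nll)]
```

Repeating the search over candidate reporting rates, and comparing the minimized negative log-likelihoods, is one simple way to check whether a literature value such as 7% is compatible with the observed counts.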

Using the above starting values, we prepared our emulator simulations. We model Inline graphic, where Inline graphic is the scaled and centered log population density of state Inline graphic (IBGE, 2021). We used Inline graphic, Inline graphic, Inline graphic, and Inline graphic, and we used the 20 nearest neighbors when kriging for both emulators. This was to help with the large parameter space relative to Inline graphic.
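The nearest-neighbor kriging step can be illustrated with a short sketch. It assumes a squared-exponential covariance with a fixed length scale, a constant local mean, and a smooth stand-in function in place of the ODE output; none of these choices is taken from the paper, and the function names are hypothetical.

```python
import numpy as np

def sq_exp_cov(a, b, length_scale, variance=1.0):
    """Squared-exponential covariance between rows of a and rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def knn_krige(theta_new, design, values, k=20, length_scale=0.3, nugget=1e-6):
    """Predict at theta_new by kriging on the k nearest design points only."""
    dist = np.linalg.norm(design - theta_new, axis=1)
    idx = np.argsort(dist)[:k]          # indices of the k nearest neighbors
    x, y = design[idx], values[idx]
    mu = y.mean()                       # constant local mean (an assumption)
    cov = sq_exp_cov(x, x, length_scale) + nugget * np.eye(len(idx))
    cross = sq_exp_cov(theta_new[None, :], x, length_scale)[0]
    kriging_weights = np.linalg.solve(cov, cross)
    return mu + kriging_weights @ (y - mu)

# Hypothetical design: 500 parameter settings on the unit square.
rng = np.random.default_rng(0)
design = rng.uniform(0.0, 1.0, size=(500, 2))
# Stand-in for one emulated output of the simulator at each design point.
values = np.sin(3.0 * design[:, 0]) + design[:, 1] ** 2

theta_new = np.array([0.4, 0.6])
pred = knn_krige(theta_new, design, values)
truth = np.sin(3.0 * 0.4) + 0.6**2
```

Restricting the solve to 20 neighbors keeps each prediction at the cost of a 20-by-20 linear system regardless of the design size, which matters when the prediction is repeated at every MCMC iteration.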

We estimate Inline graphic, Inline graphic, and Inline graphic  Inline graphic, where the estimates are posterior means and the numbers in parentheses are 95% credible intervals. Fitting the emulator took 232 minutes using 20 cores, and it took 2.5 minutes to draw 1000 posterior samples during model fitting (totaling 14.5 hours for 250 000 samples after discarding 100 000 as burn-in). By comparison, drawing fifty posterior samples using ten iterations of ABC took 4.5 days and yielded Inline graphic, Inline graphic, and Inline graphic, leading to much wider credible intervals. As a further comparison, we fit a non-spatial model using the original Isham (1991) Gaussian process approximation to the non-spatial, stochastic SIR model (see Web Appendix A). Because this model takes only 38 seconds to fit, we do not rely on an emulator, though otherwise the methodology is analogous to our approach described in Section 3.4. We estimate Inline graphic, where Inline graphic is the sole transmission parameter. As expected, Inline graphic is much larger here because all transmission is local to Brazil as a whole. We again fix Inline graphic.
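The reported run times can be checked with simple arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
# MCMC cost: 2.5 minutes per 1000 draws, 250 000 kept plus 100 000 burn-in.
minutes_per_1000 = 2.5
kept, burn_in = 250_000, 100_000
mcmc_minutes = (kept + burn_in) / 1000 * minutes_per_1000   # 875 minutes
mcmc_hours = mcmc_minutes / 60                              # about 14.6 hours

# Adding the one-time emulator fit of 232 minutes.
emulator_minutes = 232
total_hours = (emulator_minutes + mcmc_minutes) / 60        # about 18.5 hours

# ABC cost: 4.5 days for fifty posterior samples.
abc_hours = 4.5 * 24                                        # 108 hours
```

The two totals are not a like-for-like comparison per sample, because MCMC draws are autocorrelated while the fifty ABC samples are not. They do show that the emulator-based fit finished in under a day against several days for ABC.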

There are two implications from our model fit. The first is that, as expected, higher population densities are associated with higher rates of transmission. The second is that there is evidence of strong spatial spread of Zika, as shown by the credible interval for Inline graphic. We note that the parameter estimates and credible intervals did not fall in the region of the emulator design space corresponding to problematically low Inline graphic and high Inline graphic, as discussed previously.

Our model fit best in states with the highest counts of outbreaks. In Figure 1, we plot the new counts of Zika infections in Rio de Janeiro and Bahia, the two states with the highest counts of reported new infections, along with overall state-level Zika infections during this time. Additional discussion of these results is available in Web Appendix I.

FIGURE 1

Plots of reported new Zika infections in Brazil in late 2015 through early 2016 by state and normalized by population (A), in Rio de Janeiro only (B), and in Bahia only (C). Rio de Janeiro and Bahia are the states with the most reported new infections. The lines in the (B) and (C) plots are the fitted mean curves.

6. DISCUSSION

We have proposed a novel method to analyze spatial infection data based on combining moment-closure approximations with emulators. Both components of our method are vital. The emulator approach is needed because of the computational run time of the forward equations. However, without the forward equations the emulator-based approximation would need to be based on the stochastic, spatial SIR jump process, a much harder modeling problem.

Though this paper has been about a spatial SIR jump process, our approach can be extended to other continuous-time Markov processes. There is a need for efficient approximations in domains like the veterinary literature, which often relies on repeated simulations of complex compartmental models for epidemiological studies (eg, Jones et al., 2021; Galvis et al., 2022). In our experience, even simple spatial models in that literature may take weeks to fit. Our method could significantly reduce the computational burden of this work.

There are several areas for improvement. First, we assume there is no epidemic fadeout; a more sophisticated moment closure may not need this assumption. Second, as the dimension of Inline graphic grows, Inline graphic may have to grow as well, suggesting a need for a more sophisticated experimental design. Third, we assume that Inline graphic, Inline graphic, and the starting time of the infection are known or can be estimated in a preliminary analysis. The final limitation is practical: our approach requires a significant amount of hands-on work, as the user must derive forward equations, tune emulator designs, and then tune an MCMC algorithm. Finding ways to simplify the approach would allow for its wider adoption.

Supplementary Material

ujae068_Supplemental_Files

Web Appendices referenced in Sections 2–5 along with all code for Sections 4–5 are available with this paper at the Biometrics website on Oxford Academic. The code is also available at: https://github.com/jptrostle/SpatialSIRGPMC.

Acknowledgement

The authors thank the associate editor and the referee for their constructive and insightful comments and suggestions that greatly improved the paper. The authors also acknowledge the computing resources provided by North Carolina State University High Performance Computing Services Core Facility (RRID:SCR_022168).

Contributor Information

Parker Trostle, Department of Statistics, North Carolina State University, Raleigh, NC, 27607, United States.

Joseph Guinness, Department of Statistics and Data Science, Cornell University, Ithaca, NY, 14853, United States.

Brian J Reich, Department of Statistics, North Carolina State University, Raleigh, NC, 27607, United States.

FUNDING

This work was partially supported by the National Science Foundation (DMS2152887) and the National Institutes of Health (R01ES031651-01).

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

The data that support the findings in this paper are available at the Pan American Health Organization (PAHO) Health Information Platform for the Americas (PLISA) website at https://www3.paho.org/data/index.php/en/mnu-topics/zika.html.

References

  1. Abou-Ismail A. (2020). Compartmental models of the Covid-19 pandemic for physicians and physicians-scientists. SN Comprehensive Clinical Medicine, 2, 852–858.
  2. Allen L. J. S. (2008). An introduction to stochastic epidemic models. In: Mathematical Epidemiology, (eds. Brauer F. P.), 81–130. Berlin: Springer.
  3. Bauer P., Engblom S., Widgren S. (2016). Fast event-based epidemiological simulations on national scales. The International Journal of High Performance Computing Applications, 30, 438–453.
  4. Bayarri M. J., Walsh D., Berger J. O., Cafeo J., Garcia-Donato G., Liu F. et al. (2007). Computer model validation with functional output. The Annals of Statistics, 35, 1874–1906.
  5. Beaumont M. A. (2010). Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41, 379–406.
  6. Beaumont M. A., Cornuet J.-M., Marin J.-M., Robert C. P. (2009). Adaptive approximate Bayesian computation. Biometrika, 96, 983–990.
  7. CDC (2019a). Birth Defects | Zika virus | CDC. https://www.cdc.gov/zika/healtheffects/birth_defects.html. [Accessed 20 July 2022].
  8. CDC (2019b). Symptoms | Zika Virus | CDC. https://www.cdc.gov/zika/symptoms/symptoms.html. [Accessed 20 July 2022].
  9. Chen X., Ogura M., Preciado V. M. (2020). SDP-based moment closure for epidemic processes on networks. IEEE Transactions on Network Science and Engineering, 7, 2850–2865.
  10. Chowell G. (2017). Fitting dynamic models to epidemic outbreaks with quantified uncertainty: A primer for parameter uncertainty, identifiability, and forecasts. Infectious Disease Modelling, 2, 379–398.
  11. Davis T. A. (2006). Direct Methods for Sparse Linear Systems. Philadelphia: Society for Industrial and Applied Mathematics.
  12. De Lathauwer L., De Moor B., Vandewalle J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21, 1253–1278.
  13. Ernst O. K., Bartol T. M., Sejnowski T. J., Mjolsness E. (2019). Learning moment closure in reaction-diffusion systems with spatial dynamic Boltzmann distributions. Physical Review E, 99, 063315.
  14. Forgues F., Ivan L., Trottier A., McDonald J. G. (2019). A Gaussian moment method for polydisperse multiphase flow modelling. Journal of Computational Physics, 398, 108839.
  15. Galvis J. A., Corzo C. A., Machado G. (2022). Modelling and assessing additional transmission routes for porcine reproductive and respiratory syndrome virus: Vehicle movements and feed ingredients. Transboundary and Emerging Diseases, 69, e1549–e1560.
  16. Gillespie D. T. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics, 22, 403–434.
  17. Gillespie D. T. (1977). Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry, 81, 2340–2361.
  18. Goldstein M., Rougier J. (2006). Bayes linear calibrated prediction for complex systems. Journal of the American Statistical Association, 101, 1132–1143.
  19. Gopalan G., Wikle C. K. (2022). A higher-order singular value decomposition tensor emulator for spatiotemporal simulators. Journal of Agricultural, Biological and Environmental Statistics, 27, 22–45.
  20. Gramacy R. B. (2020). Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences. Boca Raton: Chapman and Hall/CRC.
  21. Higdon D., Gattiker J., Williams B., Rightley M. (2008). Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103, 570–583.
  22. Higdon D., Kennedy M., Cavendish J. C., Cafeo J. A., Ryne R. D. (2004). Combining field data and computer simulations for calibration and prediction. SIAM Journal on Scientific Computing, 26, 448–466.
  23. Hooten M. B., Leeds W. B., Fiechter J., Wikle C. K. (2011). Assessing first-order emulator inference for physical parameters in nonlinear mechanistic models. Journal of Agricultural, Biological, and Environmental Statistics, 16, 475–494.
  24. IBGE (2021). Portal do IBGE. https://www.ibge.gov.br/en/home-eng.html. [Accessed 21 July 2022].
  25. Isham V. (1991). Assessing the variability of stochastic epidemics. Mathematical Biosciences, 107, 209–224.
  26. Jones C. M., Jones S., Petrasova A., Petras V., Gaydos D., Skrip M. M. et al. (2021). Iteratively forecasting biological invasions with PoPS and a little help from our friends. Frontiers in Ecology and the Environment, 19, 411–418.
  27. Kennedy M. C., O’Hagan A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 425–464.
  28. Kolda T. G., Bader B. W. (2009). Tensor decompositions and applications. SIAM Review, 51, 455–500.
  29. Kuehn C. (2016). Moment closure – a brief review. In: Control of Self-Organizing Nonlinear Systems, Understanding Complex Systems, Cham, Switzerland: Springer.
  30. Leeds W. B., Wikle C. K., Fiechter J. (2014). Emulator-assisted reduced-rank ecological data assimilation for nonlinear multivariate dynamical spatio-temporal processes. Statistical Methodology, 17, 126–138.
  31. Lloyd A. L. (2004). Estimating variability in models for recurrent epidemics: Assessing the use of moment closure techniques. Theoretical Population Biology, 65, 49–65.
  32. Massoud E. C. (2019). Emulation of environmental models using polynomial chaos expansion. Environmental Modelling & Software, 111, 421–431.
  33. Murrell D. J., Dieckmann U., Law R. (2004). On moment closures for population dynamics in continuous space. Journal of Theoretical Biology, 229, 421–432.
  34. Paeng S.-H., Lee J. (2017). Continuous and discrete SIR-models with spatial distributions. Journal of Mathematical Biology, 74, 1709–1727.
  35. PAHO (2016). Timeline of the emergence of Zika virus in the Americas. www.paho.org. [Accessed 1 August 2022].
  36. PAHO (2021). Data - BRA Zika Report. www.paho.org. [Accessed 1 August 2022].
  37. Pratola M., Chkrebtii O. (2018). Bayesian calibration of multistate stochastic simulators. Statistica Sinica, 28, 693–719.
  38. Qian Z., Seepersad C. C., Joseph V. R., Allen J. K., Jeff Wu C. F. (2006). Building surrogate models based on detailed and approximate simulations. Journal of Mechanical Design, 128, 668–677.
  39. Reich B. J., Kalendra E., Storlie C. B., Bondell H. D., Fuentes M. (2012). Variable selection for high dimensional Bayesian density estimation: Application to human exposure simulation. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61, 47–66.
  40. Sadeghieh T., Sargeant J. M., Greer A. L., Berke O., Dueymes G., Gachon P. et al. (2021). Zika virus outbreak in Brazil under current and future climate. Epidemics, 37, 100491.
  41. Sharkey K., Kiss I., Wilkinson R., Simon P. (2015). Exact equations for SIR epidemics on tree graphs. Bulletin of Mathematical Biology, 77, 614–645.
  42. Shutt D. P., Manore C. A., Pankavich S., Porter A. T., Del Valle S. Y. (2017). Estimating the reproductive number, total outbreak size, and reporting rates for Zika epidemics in South and Central America. Epidemics, 21, 63–79.
  43. Thakur A., Chakraborty S. (2022). A deep learning based surrogate model for stochastic simulators. Probabilistic Engineering Mechanics, 68, 103248.
  44. Wang L., Zhao H., Oliva S. M., Zhu H. (2017). Modeling the transmission and control of Zika in Brazil. Scientific Reports, 7, 1–14.
  45. Whittle P. (1957). On the use of the normal approximation in the treatment of stochastic processes. Journal of the Royal Statistical Society. Series B (Methodological), 19, 268–281.
  46. Widgren S., Bauer P., Eriksson R., Engblom S. (2019). SimInf: An R package for data-driven stochastic disease spread simulations. Journal of Statistical Software, 91, 1–42.
  47. Zare A., Ozdemir A., Iwen M. A., Aviyente S. (2018). Extension of PCA to higher order data structures: An introduction to tensors, tensor decompositions, and tensor PCA. Proceedings of the IEEE, 106, 1341–1358.
  48. Zhao S., Musa S. S., Fu H., He D., Qin J. (2019). Simple framework for real-time forecast in a data-limited situation: The Zika virus (ZIKV) outbreaks in Brazil from 2015 to 2016 as an example. Parasites & Vectors, 12, 1–13.

Articles from Biometrics are provided here courtesy of Oxford University Press
