Abstract
A better understanding of various patterns in the coronavirus disease 2019 (COVID-19) spread in different parts of the world is crucial to its prevention and control. Motivated by the previously developed Global Epidemic and Mobility (GLEaM) model, this paper proposes a new stochastic dynamic model to depict the evolution of COVID-19. The model allows spatial and temporal heterogeneity of transmission parameters and involves transportation between regions. Based on the proposed model, this paper also designs a two-step procedure for parameter inference, which utilizes the correlation between regions through a prior distribution that imposes graph Laplacian regularization on transmission parameters. Experiments on simulated data and real-world data in China and Europe indicate that the proposed model achieves higher accuracy in predicting the newly confirmed cases than baseline models.
Subject terms: Epidemiology, Statistics
Introduction
The outbreak of coronavirus disease 2019 (COVID-19) has significantly impacted all aspects of the world for a long time. As of 26 Oct 2021, over 243 million confirmed cases of COVID-19 have been reported, including over 4 million deaths1. Therefore, it is essential to study the spread of COVID-19 for better prediction and prevention of the disease. This paper proposes a new stochastic dynamical model that can describe different spread patterns of COVID-19 in multiple regions. We also develop an algorithm to estimate the corresponding transmission parameters and their posterior distributions. Our model is inspired by the Global Epidemic and Mobility (GLEaM) model proposed in Ref.2. GLEaM is a stochastic dynamic model that depicts the spread of epidemics, integrating multiple data layers. The GLEaM model involves 3362 subpopulations in 220 countries obtained from Voronoi tessellation, centered around major airports. These subpopulations are connected by a multi-layered mobility network composed of processes ranging from short-range commuting between nearby subpopulations to international flights. In each subpopulation, the transmission of epidemics is modeled by a variant of a Susceptible-Exposed-Infected-Removed (SEIR) compartmental model3. Please see “Review of the SEIR model and GLEaM model” for a more detailed review of the SEIR model and GLEaM model.
In the vast majority of GLEaM’s applications4–11, the parameters are estimated based on Ref.12. Reference12 performed the maximum likelihood analysis of the reproduction number in the seed region, Mexico. For each value of the reproduction number, the method generated the distribution of the arrival time of the influenza A(H1N1) in 12 countries produced by GLEaM simulations. Then, the optimal reproduction number was chosen by maximizing the likelihood function of arrival time. Reference12 and subsequent works following its settings4–11 assumed that the epidemic was seeded from one region and that the transmission parameters or the other parameters (like the introduction date and location) were estimated through the maximum likelihood analysis of arrival time or other events. In particular, the method in Ref.12 was adopted in Ref.6 to estimate the posterior distribution of the reproduction number of COVID-19, which was assumed to be uniform for all subpopulations at all times. However, this setting is unsuitable for the current scenario of the COVID-19 pandemic since COVID-19 has lasted for a long time, and community transmission has been widespread in most countries in the world13. To model the spread of COVID-19, both spatial and temporal heterogeneity of the transmission parameters are needed, rather than directly modeling the reproduction number solely as a periodic function of time as in Ref.12. This is because the social behaviors, containment measures, medical conditions, and other elements that affect the spread of COVID-19 may vary among different countries and over time.
Recently, Ref.14 improved the inference method in Ref.12 based on the GLEaM model by involving spatial heterogeneity. Specifically, Ref.14 estimated the initially infected individuals in each subpopulation through microblogging data from Twitter and also estimated the reproduction number for the USA, Italy, and Spain separately. However, international travel was not considered in this study, and GLEaM was applied to each of the aforementioned countries independently, treating each as an isolated system. The transmission rates in all subpopulations of these countries were again presumed to be homogeneous. Furthermore, Ref.14 also assumed that the initially infected individuals for each subpopulation were proportional to the total number of Twitter users in that subpopulation. Thus it still assumed that the severity of the pandemic at the initial outbreak of COVID-19 was uniform over the country, which is not the case for COVID-19.
In addition to the abovementioned issues, other potential concerns exist in applying GLEaM to model the spread of COVID-19. As mentioned in the last section of Ref.15, GLEaM can be used to simulate the spread of the epidemic under normal conditions since it uses the “steady-state” mobility data around the world. However, since the outbreak of COVID-19, the social order has been disrupted, and travel has been restricted in most countries. Thus GLEaM might not work well with its multi-layered mobility networks. Furthermore, the estimate of parameters using GLEaM is based on a large number of simulations to explore the space of parameters, which may potentially take much computational time2 when the epidemic parameters to be estimated are spatially heterogeneous. In addition, although the social behavior, medical conditions, and other factors that affect the spread of COVID-19 may vary among different regions, these factors for regions that are geographically close or have similarities in other aspects still bear some resemblance. Hence, the transmission rates for COVID-19 should not only have their own heterogeneity but also be correlated to each other. To the best of our knowledge, neither of the features is reflected in GLEaM or most of its applications.
As a consequence of the possible constraints of GLEaM described above, most of the papers using GLEaM to model epidemics mainly focus on estimating only the transmission parameter in the seed region at the very beginning of the outbreak. However, for the current long-lasting spread of the COVID-19 pandemic all over the globe, the spatial and temporal heterogeneity of the transmission parameters needs to be taken into full consideration.
In this paper, we propose a new stochastic model that incorporates transportation between regions and at the same time enables spatial and temporal heterogeneity of transmission parameters. We model n regions as a graph having n nodes, and the transportation pattern between the regions is encoded as n-by-n matrices. Our graphical model of epidemic dynamics is a general abstract one motivated by and simplified from the GLEaM framework. Figure 1 shows a diagram of the proposed model. In contrast to most applications of GLEaM, which mainly focus on the initial outbreak, our proposed model is able to model the long-lasting spread of epidemics. For the inference of model parameters, we introduce an optimization algorithm that utilizes the correlation between districts. Furthermore, the posterior distribution of parameters is estimated by a Markov Chain Monte Carlo (MCMC) sampling procedure, where we set the initial value of the Markov chain to the optimal parameter obtained by the optimization algorithm. This approach can potentially accelerate the convergence of MCMC sampling.
Figure 1.
A diagram illustrating the model proposed in this paper. The example includes three regions, marked by blue circles and indexed by 1, 2, and 3. The weight on each edge (k, j) represents the similarity between regions k and j. The square nodes associated with each region denote its compartments: susceptible, exposed, hospitalized, and removed. The arrows connecting square nodes denote the transitions between compartments in each region; the details can be found in “Model description”. Note that the transmission parameters in the three regions are allowed to be spatially heterogeneous. The red double arrows denote the transportation between the three regions on the l-th day; for each pair of distinct regions k and j, the arrow label represents the total transportation volume from region k to region j on the l-th day.
In summary, the main contributions of our paper are:
- We propose a new stochastic model to describe the epidemic’s long-lasting spread, allowing spatial and temporal heterogeneity of transmission parameters and transportation between districts.
- Based on the proposed model, we also design an algorithm that first makes inference for the parameters through a two-step procedure and then estimates the posterior distribution efficiently by MCMC sampling with the estimated parameters as the initial points. The parameter inference combines the information of correlation between districts, which is equivalent to imposing graph Laplacian regularization on the transmission parameters.
- We compare the performance of the proposed model with the baseline models on both simulated and real-world data.
  - For the simulated data, the results show that combining heterogeneity and transportation into the model helps improve the performance of trajectory prediction and parameter estimation. Moreover, our inference algorithm that integrates the correlation of districts leads to further improvement in predicting the future trajectories.
  - For the real-world data in China and Europe, the proposed model outperforms the baselines in trajectory prediction.
A strength of the proposed model resides in introducing spatial and temporal heterogeneity of transmission parameters. We compare with more related works and comment on the differences and relations in “More related works”. Datasets used in this paper are publicly available at Refs.16–18. Our work focuses on the methodology development and we aim at a new stochastic dynamic model that is generally applicable.
We list the default notations and parameters used throughout the paper in Table 1. The rest of the paper is structured in the following way: In “Methods”, we introduce the stochastic dynamic model and the corresponding inference algorithm. In “Experimental results for simulated data”, we compare the performance of trajectory prediction and parameter estimation of the models with or without mobility, heterogeneity, and using correlation information in the inference part for the simulated data. Section “Experimental results on COVID-19 data” describes the real-world data used in this paper, and presents the results and findings of applying the proposed model to the COVID-19 data in China and Europe. We discuss the limitations and possible extensions in “Discussion”.
Table 1.
List of notations and parameters.
| Notation | Description | Notes |
|---|---|---|
| n | Total number of regions | |
| T | Total number of days | |
| $S_k(t)$ | Susceptible individuals in the k-th region at time t, $k = 1, \dots, n$, $t \in [0, T]$ | The same range of k and t also holds below |
| $E_k(t)$ | Exposed individuals in the k-th region at time t | Initial value needs to be inferred |
| $H_k(t)$ | Hospitalized individuals in the k-th region at time t | Initial value needs to be inferred |
| $R_k(t)$ | Removed individuals in the k-th region at time t | |
| $N_k(t)$ | Total individuals in the k-th region at time t | Assumed to be constant over time |
| | Traveling volume matrix on the l-th day | |
| | Deterministic counterpart of $S_k(t)$, determined by (2.4) | Other compartments follow the same convention of notations |
| | Accumulated confirmed cases determined by (2.4) | |
| | Accumulated removed cases determined by (2.4) | |
| | Accumulated confirmed cases on the i-th day | |
| | Accumulated removed cases on the i-th day | |
| | Newly confirmed cases determined by (2.4) on the i-th day | |
| | Newly confirmed cases on the i-th day | |
| | Infection rate in the k-th region | Needs to be inferred; allowed to be time-varying |
| | Inverse of average incubation period | Prefixed as 0.14 |
| | Inverse of average removed time in the k-th region | Needs to be inferred |
| | Set of parameters to be estimated | |
| W | The matrix characterizing the proximity between regions | |
| d | Total number of groups that the n regions are divided into | |
| | The m-th group of regions | |
| | Penalty factor of graph Laplacian regularization | |
| | Parameter reducing regularization between inter-group regions | |
| A | Affinity matrix constructed from W, the penalty factor, and the inter-group parameter as in (2.12) | |
| | Parameter in the prior (2.10) | |
Review of the SEIR model and GLEaM model
In this section, we provide a more detailed introduction to the SEIR model and the GLEaM model so as to provide a background of our study and augment the following context.
To depict the evolution of epidemics, Ref.19 proposed the celebrated Susceptible-Infected-Removed (SIR) model and characterized the development of the pandemic with a deterministic ordinary differential equation (ODE). There are many extensions of the SIR model, including the Susceptible-Exposed-Infected-Removed (SEIR) model for diseases with a latent period, the Susceptible-Infected-Susceptible (SIS) model for diseases that do not confer immunity after recovery, etc.
These deterministic transmission models are constructed under certain assumptions, including that the population is large, closed, and homogeneous. Due to the random nature of the transmission process, many stochastic dynamic models have been developed20–22. Under rather general conditions, the deterministic models can be seen as the mean-field equations of the corresponding stochastic processes. However, this approximation may not hold when the size of the outbreak has not grown to the same order as the total population, which is the case in many applications23. More details can be found in Ref.24 and the references therein.
The Global Epidemic and Mobility (GLEaM) model proposed in Ref.2 used a meta-population scheme which balanced between the agent-based stochastic models and the deterministic compartmental models. Specifically, Ref.2 adapted a high-resolution population database that divided the surface of the earth into cells of 15 min × 15 min of arc, and then used Voronoi tessellation to assign each cell to one of the major airports around the world. The obtained subdivisions were then called subpopulations.
The stochastic dynamic in the subpopulations was then coupled with two layers of mobility flows apart from the infection dynamic within each subpopulation. The first layer was the worldwide airport network between the airports in the subpopulations, which could be seen as a weighted graph whose edges represented the number of passengers between each pair of airports. This layer was integrated into the model through stochastic transportation between subpopulations. The second layer was the commuting network that connected geographically close subpopulations. This layer was integrated by using it to compute the effective population and infection in each subpopulation. More details can be found in Ref.2.
More related works
Several recent works also involved different levels of heterogeneity in their models in various ways. References25,26 utilized randomness in reproduction numbers to reflect the heterogeneity of the population, using a plate model with a Bayesian method and heterogeneous well-mixed theory27 with an age-of-infection method19, respectively. References28,29 used functional data analysis tools. Specifically, Ref.28 captured two different epidemic patterns in different regions of Italy using the probKMA algorithm30, and Ref.29 revealed different patterns of the epidemic across countries with functional principal component analysis. In addition, Refs.31–35 adapted SEIR / Susceptible-Exposed-Infected (SEI) / Susceptible-Infected (SI) compartmental models similar to this paper. Among these works, Refs.31–33 considered heterogeneity in the aspects of age groups, social links, and vaccination status, respectively. References34,35 bore more similarity with our paper since they also allowed transmission parameters to be spatially heterogeneous and involved transportation between different regions. However, Ref.34 only considered intracounty data, and the transportation was used to compute the effective size of compartments and did not affect the dynamic model. Furthermore, the transmission rates in Ref.34 were determined by an SDE whose parameters were to be fitted. Therefore, Ref.34 focused on a different scope from our study. The settings of compartments in Ref.35 were more realistic than the ones considered in our paper because they accounted for reporting rates. Nevertheless, compared with the model and inference algorithm described in “Methods”, transmission rates in Ref.35 did not have temporal heterogeneity or correlation with each other. Both Refs.34,35 used the Ensemble Kalman Filter, which samples particles in the state space according to the prior distribution and obtains the posterior distribution in the process of moving particles at each time step. This might be computationally less efficient than directly applying MCMC according to the posterior distribution with the initial point maximizing the posterior distribution, as implemented in this paper.
Methods
Ethics statement
The medical record data in China and Europe used in this paper are publicly available and can be found on the official websites of the National Health Commission of the People’s Republic of China16, the Chinese Center for Disease Control and Prevention17, and European Centre for Disease Prevention and Control18. The collection of data is performed in compliance with local government regulations. More details about data sources can be found in “Data sources”.
Model description
Compartmental model over multiple regions
In GLEaM2 and other epidemic models involving transportation34,35, the whole area is usually divided into subdivisions. For example, the GLEaM model divides the total area of 220 countries into over 3300 subpopulations centered around major airports, and Ref.34 divided Milwaukee County and Dane County in the state of Wisconsin into several regions. In this paper, we consider abstract subdivisions of the whole area, which will be referred to as “regions” hereinafter until further specification in the later experiment sections. We denote n as the number of regions.
In our model, we use continuous time $t \in [0, T]$, where it is assumed that the evolution of the epidemic lasts for a period of T time units. The unit of time is fixed as one day throughout this paper. Note that when we introduce the transportation model below, the traveling matrix is assumed to be constant within each day, and the observed data are also collected on a daily basis. Thus we will use the notation of discrete time (days) hereinafter; however, the evolution dynamic itself is modeled over continuous time.
For each region, we consider the following epidemiology compartments adapted from the SEIR model:
$S_k(t)$: Susceptible.
$E_k(t)$: Exposed and infectious.
$H_k(t)$: Hospitalized.
$R_k(t)$: Removed (recovered or dead).
The subscript k of the states denotes that they belong to the k-th region, and the dependence on the continuous time t is addressed by expressing the states as functions of t.
At time t, we use $N_k(t)$ to denote the total population in the k-th region. The population may vary over time due to inter-region mobility, especially in the days before travel restrictions are implemented. However, since the traveling volume is small compared with the total population of a region, the fluctuation of the total population is not significant. In this paper, we assume that $N_k(t)$ remains constant over time, which means that we consider a closed system in which cases exported from or imported into the system are not considered. However, it is worth noting that we do allow the transportation of active virus carriers between regions within our system. We remark in advance that this assumption is reasonable for the real-world data sets considered in this paper: from January to February 2020, strict international travel restrictions were imposed in China, while for the data in Europe, from May to August 2020, the local spread of the epidemic had reached a relatively high level, and the imported cases were few compared with the indigenous cases. We also introduce a notation for the total population permitted to move in the k-th region, which excludes the hospitalized individuals.
Transportation between regions and the stochastic model
Transportation plays an essential role in the spread of COVID-19. Actually, Refs.36,37 indicated that the travel restrictions were remarkably important in mitigating the transmission of COVID-19, especially in the early stage of the pandemic. Recently, as detailed in Ref.38, the Omicron variant had spread to 110 countries and had become dominant in many of them by 22 December 2021, only one month after its first report from South Africa on 24 November 2021. This motivates us also to take transportation into consideration in this paper. In our model, we introduce the transportation between regions via a traveling matrix, which is similar to the notation in the GLEaM model2. Specifically, we denote as the traveling volume from region k to j on the l-th day (). Then the traveling matrix on the l-th day can be written as
| 2.1 |
Given the transportation matrix, we describe below the stochastic model of the dynamics of the compartments over the n regions, denoted as $\{(S_k(t), E_k(t), H_k(t), R_k(t))\}_{k=1}^{n}$, where $t \in [0, T]$ is the continuous time, and the variables $S_k(t)$, $E_k(t)$, $H_k(t)$, and $R_k(t)$ take integer values from 0 to N. The proposed stochastic model is illustrated in Fig. 1.
Transmission in the k-th region: A case from chooses an individual from randomly at Poisson rate ( are allowed to be spatially heterogeneous), and the individual chosen is infected if it is of state . Note that this is different from traditional SEIR models, since we assume that for the COVID-19 case, the pre-symptomatic patients from can be contagious.
Hospitalization in the k-th region: Each individual in will be hospitalized with Poisson rate .
Recovery or death in the k-th region: Each individual in will transfer into with Poisson rate . The rate owns spatial heterogeneity due to the uneven distribution of medical resources.
- Transportation between regions: At the end of the l-th day, all the individuals in region k except the ones in have the same probability of traveling from region k to region j, and the total traveling volume from region k to region j is . We assume that there are no transmissions happening during the transportation between regions. If we denote as the number of people transported from to at the end of the l-th day, then follows a multinomial distribution. Specifically,
with2.2
As a consequence, is a piece-wise constant function of t which only changes at the end of each day. Specifically, for any time ,
| 2.3 |
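The transportation step can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the implementation used in this paper: the function name `split_travel_volume` is hypothetical, the population counts are toy values, and the multinomial cell probabilities are assumed to be proportional to the sizes of the non-hospitalized compartments in the origin region.

```python
import random

def split_travel_volume(volume, movable, rng):
    """Split `volume` travellers among compartments with a multinomial draw.

    movable: dict mapping compartment name -> head count in the origin
    region; hospitalized individuals are excluded, as they do not travel.
    Assumption: cell probabilities are proportional to compartment sizes.
    """
    names = list(movable)
    total = sum(movable.values())
    if total == 0 or volume == 0:
        return {name: 0 for name in names}
    weights = [movable[name] / total for name in names]
    counts = {name: 0 for name in names}
    # A multinomial draw equals `volume` independent categorical draws.
    for name in rng.choices(names, weights=weights, k=volume):
        counts[name] += 1
    return counts

rng = random.Random(0)
region_k = {"S": 9000, "E": 50, "R": 10}  # toy counts, hospitalized excluded
moved = split_travel_volume(120, region_k, rng)
```

In a full simulation, the drawn counts would be subtracted from the origin region and added to the destination region at the end of each day.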
A more comprehensive stochastic dynamic model has been previously developed in Ref.24. However, the work did not consider transportation between regions, which is a focus of this study.
Differential equation with spatial heterogeneity
Following Refs.39,40, we derive the corresponding mean-field differential Eq. (2.4) of the stochastic dynamic introduced in “Transportation between regions and the stochastic model”, which is continuous in time, and the compartments take real values.
| 2.4 |
The first four equations of (2.4) describe the evolution of the in the deterministic version of our model, which is governed by the transition dynamic explained in “Transportation between regions and the stochastic model”. The fifth equation characterizes the deterministic total population, which is a piece-wise linear function of time t (since the traveling volume is a piece-wise constant function of t) and coincides with expressed as in (2.3) when t takes integer values. The last two equations depict the evolution of accumulated confirmed and removed cases, denoted by and respectively, in the deterministic model. It is worth noting that in the calculation of and , each case is only accounted for once. In (2.4), is the deterministic counterpart of and the same for .
Furthermore, we assume that the accumulated confirmed and removed cases are available from data on a daily basis, which are denoted as and , respectively. We also assume that is 0, while and are left to be inferred for each . For inference of parameters, we further denote as the deterministic newly confirmed cases on the i-th day determined by (2.4), and as the newly confirmed cases computed from data, namely , . The same convention holds for the definitions of and . Note that the data are random in nature.
Note that the model and the inference algorithm described below can be applied to estimate parameters as long as , , and are available. The availability of and is required in many works that use the SEIR model to estimate transmission rates of the epidemic41–43, and the transportation network is also used in GLEaM2 and its applications. However, in contrast to the works based on GLEaM4–6,12,15, we allow parameters to possess both spatial and temporal heterogeneity, and further utilize the correlation between regions in the inference of parameters. We remark that the spatial and temporal heterogeneity is reflected in the fact that the transmission parameters are allowed to vary in both space and time in our model. The temporal heterogeneity is introduced in more detail for the real-world data in “Model extension by allowing time-varying parameters”.
Estimation of model parameters
Based on the model described in “Model description”, the parameters that need to be specified are and . Using a simplification in Remark 1, we prefix the parameter , and estimate the rest in a two-step procedure to be described in this section. As a brief summary,
Step 1. We first make inference for by maximizing the likelihood of the observed newly removed cases. Details in “Step 1: Estimate ”.
Step 2. After the estimation of , are then estimated by maximizing the posterior probability, where we introduce a prior distribution combining the information of correlation between regions. Details in “Step 2: Estimate ”.
Finally, at the end of Step 2, we introduce an MCMC sampling approach to estimate the marginal posterior distributions of the estimated parameters. This provides information about the uncertainty of these parameters, which are of scientific interest. We summarize the two-step procedure in this section together in Algorithm 1.
Remark 1
Among the unknown parameters, we prefix the inverse of the average time for a person from being exposed to being hospitalized to be 0.14 universally in the algorithm. According to Ref.44, the mean duration of the incubation period is 5.2 days. Furthermore, we assume that the average time for an individual from showing symptoms to being hospitalized is 2 days45,46. Thus, the mean duration for an individual from being exposed to being hospitalized is 7.2 days, whose inverse is approximately 0.14.
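The arithmetic behind the prefixed value can be written out explicitly, using only the two durations quoted above:

```latex
\frac{1}{5.2 + 2} = \frac{1}{7.2} \approx 0.139 \approx 0.14 \ \text{per day}
```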
Step 1: Estimate
We first estimate by maximizing the likelihood
over the removal rate for each k. We assume that the newly removed cases in one day follow a Poisson distribution whose mean equals the product of the removal rate and the accumulated hospitalized cases the day before (the latter is the difference between the accumulated confirmed cases and the accumulated removed cases, and thus is observable). Then, the likelihood can be written as
where () denotes the probability that k occurrences are observed for a discrete random variable X having a Poisson distribution with mean .
Then, we estimate for each k separately.
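Under the Poisson assumption stated above, the maximizer has a closed form: setting the derivative of the log-likelihood to zero gives the total number of newly removed cases divided by the sum of the previous-day hospitalized counts. The sketch below illustrates this for one region; the function name `estimate_removal_rate` is hypothetical, the input series are synthetic toy numbers, and a numerical optimizer could equally be used.

```python
def estimate_removal_rate(confirmed, removed):
    """Closed-form Poisson MLE of the removal rate for one region.

    confirmed[i], removed[i]: accumulated confirmed / removed cases on day i.
    Model: new_removed[i] ~ Poisson(rate * hospitalized[i - 1]).
    """
    days = range(1, len(removed))
    new_removed = [removed[i] - removed[i - 1] for i in days]
    # Hospitalized the day before = accumulated confirmed - accumulated removed.
    hospitalized_prev = [confirmed[i - 1] - removed[i - 1] for i in days]
    exposure = sum(hospitalized_prev)
    if exposure == 0:
        raise ValueError("no hospitalized cases observed; rate not identifiable")
    return sum(new_removed) / exposure

# Synthetic toy series for a single region (not real data).
confirmed = [100, 130, 170, 220, 260]
removed = [0, 5, 12, 20, 32]
gamma_hat = estimate_removal_rate(confirmed, removed)
```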
Step 2: Estimate
Next, we estimate the remaining parameters , by finding that achieves maximum a posteriori probability (MAP).
Posterior distribution and MAP estimate. By Bayes' formula, the posterior distribution of the parameters given the data satisfies
| 2.5 |
where is the prior distribution of to be determined and is a constant irrelevant to . We further denote
| 2.6 |
then
| 2.7 |
To fit the realistic evolution of the epidemic more precisely, is estimated as
| 2.8 |
with a reasonable prior distribution. Then, an MCMC sampling scheme starting from this MAP estimate is applied to obtain the posterior distribution of the parameters. This process might be computationally more efficient than choosing the initial point for MCMC randomly or empirically.
Next, we specify the formulas for the likelihood function and the prior distribution .
Likelihood function of . Notice that the ODE system (2.4) is the mean-field version of our stochastic model, and are determined by the parameters ( and are treated as given), thus by the Markov property, are all independent for , conditioned on the parameters . Furthermore we suppose that . Thus, the likelihood of can be written as
| 2.9 |
Choice of prior distribution of . The remaining problem is to choose the prior distribution . Presuming that the transmission rates in the regions owning more similarities are closer, is designed to combine the information of correlations between regions. In particular, given a matrix A which characterizes the pairwise similarities between the regions, and if we denote ,
| 2.10 |
where is the degree matrix of A with , is a constant depending on and A. Here, a small is chosen for to be a probability measure without imposing much restriction on . Then, by (2.6) and (2.10)
| 2.11 |
The parameter estimation procedure could be extended to the case when are time-varying by modifying (2.13), which we will introduce in more detail in “Model extension by allowing time-varying parameters” for the real-world data in China and Europe.
Construction of affinity matrix A appearing in prior distribution (2.10). Now, we specify the construction of affinity matrix A in (2.10) that reflects the similarity between regions. A is constructed from affinity matrix W by further addressing the correlation between regions with more similarities. To treat different data sets and W with a unified approach, we assume that (W can be re-scaled entry-wise if necessary).
For a given whose choice is detailed later, the next step of attaining A is to divide the n regions into d groups (, , and , ) where the regions in the same groups have more similarities. Then, for given and a given penalty factor , is constructed as follows:
| 2.12 |
We remark that in (2.12) is taken to be 0.1 for all the experiments in this paper. By constructing A as in (2.12), correlations for regions in the same groups are further addressed, whose transmission parameters are imposed with stronger restrictions.
Now we specify the choice of W for data sets that will be analyzed later in this paper. For simulated data and real-world data in China, in which cases the transportation data are available, we construct W from the traveling volume matrices . Specifically, , where . Nevertheless, for real-world data in Europe, where we are not aware of traveling data publicly available that are sufficient for the proposed model, W is just the all 1 adjacency matrix. We remark that the affinity matrix W may also be obtained by ways other than using the transportation data, as long as it reflects the similarities between districts.
Specified formula for MAP estimate of . From the MAP estimate (2.8), definition of V (2.11) given prior distribution (2.10), and the definition of in (2.12), the inference of can be equivalently written as follows
| 2.13 |
where is given in (2.9).
It can be seen that by choosing the prior distribution as in (2.10), a regularization term is imposed for better generalization.
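To make the regularization effect concrete, the following sketch checks numerically that the graph Laplacian quadratic form equals a weighted sum of squared differences between the parameters of connected regions, so regions with a large affinity are pushed toward similar transmission parameters. It assumes a symmetric affinity matrix; the matrix entries, the parameter values, and the function names are illustrative only, and the small extra term that makes the prior a proper probability measure is omitted.

```python
def laplacian_penalty(beta, A):
    """beta^T (D - A) beta, with D the degree matrix of the affinity A."""
    n = len(beta)
    degree = [sum(A[k]) for k in range(n)]
    quad = sum(degree[k] * beta[k] ** 2 for k in range(n))
    quad -= sum(A[k][j] * beta[k] * beta[j] for k in range(n) for j in range(n))
    return quad

def pairwise_penalty(beta, A):
    """0.5 * sum over (k, j) of A[k][j] * (beta[k] - beta[j])**2."""
    n = len(beta)
    return 0.5 * sum(
        A[k][j] * (beta[k] - beta[j]) ** 2 for k in range(n) for j in range(n)
    )

# Toy symmetric affinity: regions 0 and 1 are similar, region 2 is weakly linked.
A = [[0.0, 1.0, 0.1],
     [1.0, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
beta = [0.30, 0.32, 0.60]

lap = laplacian_penalty(beta, A)
pair = pairwise_penalty(beta, A)
```

Both expressions evaluate to the same number, and the penalty vanishes when all regions share the same parameter value.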
Estimation of the marginal posterior distribution. Finally, after choosing the prior distribution, which is determined by A and the parameter in (2.10), the optimization process as in (2.13) is accomplished by a BFGS algorithm47–50. To obtain the posterior distribution of the parameters, we use a classical MCMC sampling scheme starting from the solution of the optimization.
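The idea of starting the chain at the optimizer's solution can be sketched as follows. The text above does not specify which MCMC scheme is used, so the random-walk Metropolis sampler below is an assumption, as are the function names, the step size, and the toy Gaussian log-posterior whose mode stands in for the MAP estimate returned by the optimizer.

```python
import math
import random

def metropolis(log_post, start, step, n_samples, rng):
    """Random-walk Metropolis sampler started at `start` (e.g. a MAP estimate)."""
    current = list(start)
    current_lp = log_post(current)
    samples = []
    for _ in range(n_samples):
        proposal = [x + rng.gauss(0.0, step) for x in current]
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, exp(proposal_lp - current_lp)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(list(current))
    return samples

# Toy log-posterior: independent Gaussians centred at a hypothetical MAP estimate.
map_estimate = [0.3, 0.5]

def log_post(theta):
    return -0.5 * sum(((t - m) / 0.05) ** 2 for t, m in zip(theta, map_estimate))

rng = random.Random(1)
samples = metropolis(log_post, map_estimate, step=0.05, n_samples=5000, rng=rng)
posterior_mean = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
```

Because the chain starts at a high-density point, no long burn-in is needed in this toy setting; in practice, the step size would still need tuning.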
Prediction of the epidemic trajectories with the estimated parameters
Once the parameters are estimated from the optimizations and as described in “Estimation of model parameters”, the trajectories of newly confirmed cases can be simulated according to the stochastic dynamic process with . Furthermore, trajectories can also be sampled from the posterior distribution of instead of using alone, which also takes the randomness from into account. In particular, this can be achieved by sampling from MCMC and then simulating trajectories with the sampled . Additionally, deterministic trajectories determined by (2.4) can be computed by the explicit Euler method.
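The explicit Euler step mentioned above can be sketched on a plain single-region SEIR system. This is an assumption made for illustration only: the ODE system (2.4) of the paper includes multiple regions and transportation terms and is not reproduced here, and the names `beta`, `sigma`, `gamma`, the step size, and the initial state are hypothetical.

```python
def seir_euler(beta, sigma, gamma, state, n_days, dt=0.1):
    """Integrate a basic SEIR ODE with the explicit Euler method.

    state is (S, E, I, R); returns one state per day, including day 0.
    """
    s, e, i, r = state
    n = s + e + i + r
    steps_per_day = round(1 / dt)
    trajectory = [(s, e, i, r)]
    for _day in range(n_days):
        for _step in range(steps_per_day):
            # Flows between compartments over one Euler step of length dt.
            new_exposed = beta * s * i / n * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        trajectory.append((s, e, i, r))
    return trajectory


trajectory = seir_euler(
    beta=0.4, sigma=0.2, gamma=0.1,
    state=(9990.0, 0.0, 10.0, 0.0), n_days=30,
)
```

Because every flow leaves one compartment and enters another, the total population is conserved up to floating-point error, which gives a simple sanity check on the integrator.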
Experimental results for simulated data
Two specific cases are considered for simulated data. We first remark that the regions in “Methods” are called provinces in this section. Section “Four provinces case” considers four provinces separated into two groups (the provinces in the same group are assumed to have more similarities) and with traffic between each pair of the provinces. Section “Thirty provinces case” considers thirty provinces randomly separated into three groups, with the other settings similar to the previous case. Section “More details of experimental settings and sensitivity analysis” includes more details of the experimental settings and sensitivity analysis.
More details of experimental settings and sensitivity analysis
Experimental settings
The results in “Experimental results for simulated data” are for 100 replicas. In each replica, three random trajectories are sampled independently according to the stochastic model with prefixed parameters, part of which are treated as the ground truth training, validation, and testing trajectory, respectively (see more details in Sect. A.1.1 of Supplementary Information).
For each model, we first fit the parameters using the training trajectory and then predict the testing trajectory using the estimated parameters. Note that the choice of hyper-parameters for the proposed model is detailed in Sect. A.2 of Supplementary Information. In particular, the penalty factor is chosen by cross-validation as the value minimizing the validation error, since for the simulated data the validation error has the same distribution as the testing error. The comparison of trajectory prediction is from one typical realization, for which we compare the ground truth training and testing trajectories with the fitted training and predicted testing trajectories for all the models. Additionally, parameter estimation and quantitative evaluations are compared with mean and standard deviation over all 100 replicas. The detailed computations of training, validation, and testing errors can be found in Sect. B of Supplementary Information.
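The selection rule for the penalty factor can be sketched as follows: average the validation error over replicas for each candidate value and keep the minimizer. The candidate values and error numbers below are made up purely for illustration and are not results from the paper; the function name is hypothetical.

```python
def select_penalty(validation_errors):
    """Pick the candidate penalty with the smallest mean validation error.

    validation_errors maps each candidate penalty value to a list of
    per-replica validation errors.
    """
    mean_error = {
        penalty: sum(errors) / len(errors)
        for penalty, errors in validation_errors.items()
    }
    best = min(mean_error, key=mean_error.get)
    return best, mean_error


# Made-up per-replica validation errors for three candidate penalty values.
validation_errors = {
    0.1: [0.30, 0.28, 0.33],
    1.0: [0.22, 0.25, 0.21],
    10.0: [0.27, 0.26, 0.29],
}
best_penalty, mean_error = select_penalty(validation_errors)
```

Averaging over replicas before taking the minimum reduces the chance of selecting a value that only happens to fit one particular random trajectory.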
Sensitivity analysis of
Note that parameter inference with the proposed model involves the parameter , as shown in (2.13). Therefore, the sensitivity analysis for the parameter in (2.13) is performed for the four provinces case. Specifically, the results for varying from to are presented and compared. Details can be found in “Results of parameter estimation” and “Further model evaluation”. Similar results are obtained for other data sets, and details are omitted.
Mismatched partition of regions
We also note that the graph Laplacian penalty of the proposed model depends on the partitioning of the regions into several groups, as described in “Estimation of model parameters”. Since the graph knowledge is usually not fully known, it is natural to ask whether our method can still perform well without accurate prior knowledge. For the thirty provinces case, we report the results of the proposed model with a mismatch between the partition of the regions and the ground truth division, the details of which can be found in “Thirty provinces case”.
Four provinces case
Data description
In this simulated study, we let , , and set the threshold separating training and testing data to be 10 (more details can be found in Supplementary Information S1). The other prefixed parameters are listed below:
For , , , .
For , and , .
, , , , .
The four provinces are divided into two groups, with the first group consisting of Provinces 1 and 2 and the second group consisting of Provinces 3 and 4. The similarities within groups are reflected in the settings that the values of are closer for provinces in the same group.
Models to compare
The proposed model and four baseline models. We first specify the models to be compared below. The last one is the proposed model, and the first four models serve as baselines with different settings.
The model with uniform prior distribution, without heterogeneity or migration.
The model with uniform prior distribution, without heterogeneity but with migration.
The model with uniform prior distribution, with heterogeneity but without migration.
The model with uniform prior distribution, with both heterogeneity and migration.
The model with prior distribution based on graph Laplacian, with both heterogeneity and migration.
For better illustration and comparison between the models in the experimental results, Models 1–5 are summarized in Table 2 below.
Table 2.
Models to be compared when the transportation data are available.
| Model | Migration | Heterogeneity | Prior of |
|---|---|---|---|
| 1 | No | No | Uniform prior |
| 2 | Yes | No | Uniform prior |
| 3 | No | Yes | Uniform prior |
| 4 | Yes | Yes | Uniform prior |
| 5 | Yes | Yes | Graph Laplacian prior |
Model 5 is the proposed model in this paper, and Models 1–4 are baseline models with different settings.
First, the models with uniform prior distributions themselves (Models 1–4) are compared according to whether two key assumptions exist in the model:
Whether the transmission rates are allowed to vary over regions.
Whether there exists transportation between regions.
Then, the model with prior distribution based on graph Laplacian (Model 5) is compared with those using uniform distributions as prior distributions (Models 1–4). The former one utilizes the correlation between subpopulations by adding a regularization term for the model. In contrast, only lower and upper bounds are imposed on parameters without other prior information being used in the latter ones.
Parameter inference of the five models and sensitivity of . For Model 5, the proposed model, the parameters are estimated following the two-step procedure described in “Estimation of model parameters”, where . For the estimation of , following the general formula (2.13) in “Estimation of model parameters”, the specific formula of for the four provinces case is as follows,
| 3.1 |
where is taken to be 0.1. For Model 5, we conduct the sensitivity analysis for parameter in (3.1) and present results for and respectively in “Results of parameter estimation” and “Further model evaluation”.
We remark that in Models 1–4, the estimation of still follows a two-step procedure similar to that of Model 5, and the first step of obtaining remains formally the same. The difference lies in the optimization objective of . First, the regularization term is replaced by prior knowledge of the parameters’ upper and lower bounds. Second, for models without heterogeneity of parameters, are forced to be the same in the ODE system (2.4). Third, for models without transportation between regions, the terms involving disappear from (2.4). Additionally, for the other data sets considered in the following sections, the parameter estimation methods for the baseline models are similar and thus will not be repeated.
Finally, note that the model in Ref.12 is similar to Models 1 and 2, since they all assume a spatially homogeneous transmission parameter. However, Ref.12 assumed that the epidemic was seeded from one seed region, while Models 1 and 2 do not make such an assumption. Moreover, Ref.12 focused more on the spread of the epidemic from the seed region at the early stage of the pandemic, and only the introduction dates in the other regions were utilized for the estimation of transmission parameters. In comparison, the estimation of the transmission rate in Models 1 and 2 exploits the data in all regions over the whole process.
Results of trajectory prediction
First, we remark that in Model 5, is chosen to be the minimizer of the averaged validation errors over 100 replicas over a range of values of . The weighted (simply averaged) validation errors, MAE and MSE (MAE and MSE), are defined as in Sect. B of Supplementary Information. We remark that the superscript refers to when the error is computed on validation data, and the subscripts and denote that the errors are the weighted and simple average of relative errors over time respectively.
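The distinction between a simple and a weighted average of daily relative errors can be sketched as below. The exact formulas are given in Eq. (S5) of the Supplementary Information and are not reproduced in this section, so the per-day relative error, the caller-supplied weights, the function names, and the toy numbers are all assumptions made for illustration.

```python
def relative_errors(predicted, truth):
    """Per-day relative absolute errors |predicted - truth| / truth."""
    return [abs(p - t) / t for p, t in zip(predicted, truth)]


def simple_average_error(predicted, truth):
    """Unweighted mean of the daily relative errors."""
    errors = relative_errors(predicted, truth)
    return sum(errors) / len(errors)


def weighted_average_error(predicted, truth, weights):
    """Weighted mean of the daily relative errors with caller-supplied weights."""
    errors = relative_errors(predicted, truth)
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


# Toy daily counts (made up). The weights here are the true counts, which is
# only one possible choice and is not claimed to match Eq. (S5).
truth = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 44.0]
simple_error = simple_average_error(predicted, truth)
weighted_error = weighted_average_error(predicted, truth, weights=truth)
```

With these toy weights, days with more cases contribute more to the weighted error, whereas the simple average treats every day equally.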
The averaged weighted validation errors MAE and MSE over replicas are shown in Supplementary Fig. S2, and the simply averaged counterparts are shown in Supplementary Fig. S3. For parameter inference using Model 5, is chosen to be , at which all the averaged validation errors (MAE, MSE, MAE and MSE) over 100 replicas are minimized, as can be seen from Supplementary Figs. S2 and S3.
The trajectories of a typical realization are plotted in Fig. 2 and the absolute errors of the fitted trajectories are shown in Fig. 3. As can be seen in these two figures, heterogeneity helps improve the prediction of testing data more than transportation, while introducing migration without heterogeneity of parameters worsens the estimate as can also be noticed from Table 4. More explanations can be found in “Further model evaluation”.
Figure 2.
True and fitted trajectories for the simulated data with four provinces. The vertical lines show the threshold of training-testing split. In Model 5, is chosen, at which the averaged validation errors over 100 replicas are minimized as shown in Supplementary Figs. S2 and S3.
Figure 3.
Absolute errors of fitted trajectories of Models 1–5 for the simulated data with four provinces. The vertical lines show the threshold of training-testing split.
Table 4.
Training and testing errors with standard deviation of Models 1–5 for simulated data with four provinces.
| Model | M | H | GL prior | MAE | MAE | MSE | MSE |
|---|---|---|---|---|---|---|---|
| 1 | |||||||
| 2 | |||||||
| 3 | |||||||
| 4 | |||||||
| 5 | |||||||
The formulas of errors are detailed in Sect. B (Eq. (S5)) of Supplementary Information. Remarks for Columns 1–4 and the choice of in Model 5 are the same as those in Table 3. MAE, MAE, MSE and MSE are computed for each of the 100 replicas, then the mean and standard deviation are presented in the table above.
Additionally, Model 5 with prior distribution based on graph Laplacian lowers the absolute errors of predicted trajectories compared with Model 4.
As shown in Figs. 2 and 3, Models 1 and 2 have slightly better generalization accuracy than Model 5 for Province 3. On the one hand, for data in this replica, the estimated is 0.3860 using Model 5 and 0.4640 using Models 1 or 2 (recalling that the ground truth is 0.4). On the other hand, due to the randomness of the generated testing data, the sampled newly confirmed cases in Province 3 far exceed the deterministic ones obtained by running (2.4) with the ground truth parameters. Hence, although all these estimates of deviate from the ground truth 0.4, the estimates using Models 1 and 2, which are biased upward, lead to smaller absolute errors.
Results of parameter estimation
The mean and standard deviation of estimated by the five models for the four provinces case are reported in Table 3. It can be observed that models allowing heterogeneity estimate the parameters more accurately, and Model 5, which integrates the correlation, leads to a slightly better estimate for . Compared with Model 4, the estimates of smaller ’s (such as ) become larger, and the estimate of , which has the largest value, becomes smaller, since the graph Laplacian penalty tends to make closer to each other.
Table 3.
Estimated with standard deviation using Models 1–5 for simulated data with four provinces.
| Model | M | H | GL prior | ||||
|---|---|---|---|---|---|---|---|
| 1 | |||||||
| 2 | |||||||
| 3 | |||||||
| 4 | |||||||
| 5 | |||||||
Recall that the ground truth is , , , . Models in Column 1 are detailed in Table 2. Columns 2–4 indicate whether the model permits migration between provinces (for “M”), the heterogeneity of parameters (for “H”) and the Graph Laplacian prior (for “GL prior”), respectively. are inferred for each of the 100 replicas, and then the mean and standard deviation are presented in the table above. In addition, for parameter inference using Model 5 , ( respectively), at which the averaged relative validation errors over 100 replicas are minimized as shown in Supplementary Figs. S2 and S3.
Moreover, we performed a sensitivity analysis for the hyper-parameter to check that the results are robust to . The last two rows of Table 3 show the parameter estimation results for Model 5 with and respectively (more values of are tested and the results are similar as well). We observe that the variation of parameters estimated by Model 5 with varying from to does not exceed . Therefore, the parameter estimation results are not sensitive to the choice of as long as is not too large.
Further model evaluation
The training and testing errors, MAE, MAE, MSE, MSE, as defined in Sect. B (Eq. (S5)) of Supplementary Information, are listed in Table 4 with mean and standard deviation. We remind the readers that the superscripts and indicate that the errors are computed on training and testing data respectively, and the subscript denotes that the error is the weighted average of the daily relative errors over time. Comparing the first four models in Table 4 shows that the presence of both heterogeneity and transportation helps reduce the training and testing errors. Comparing Model 4 with Model 5 shows that the graph Laplacian regularization leads to better prediction performance on average, although this might not be obvious in this case due to the relatively large variance. The advantage of the proposed Model 5 is more evident when a larger number of regions is involved in the dynamic system, as shown in “Thirty provinces case”.
In addition, it can be seen from Table 4 that the errors increase greatly after transportation is included while heterogeneity remains absent. A possible explanation is that, without heterogeneity of parameters and transportation between provinces, the estimated values of ’s are lower than the true values of ’s for group 1, so that the estimated newly confirmed cases are fewer than the true ones for provinces in group 1. For the same reason, the estimated newly confirmed cases are higher than the true ones in group 2. When the transportation is considered, more confirmed cases in group 1 are transferred to group 2 than the cases transported in the opposite direction. As a result, when the transmission parameters do not have heterogeneity, migration between provinces will worsen the prediction performance compared to the case without migration.
Furthermore, the last two rows of Table 4 report the training and testing errors for Model 5 with the same while and respectively. As a consequence of the robustness of the parameter estimation regarding , the errors of Model 5 are also robust to . A similar analysis was performed for the other data sets with similar results, which we do not report to avoid repetition. Hereinafter, the results are presented with .
The plots of the mean of weighted and simply averaged testing errors MAE, MSE, MAE and MSE against varying are shown in Supplementary Figs. S2 and S3 respectively. Recall that the subscripts (w) and (s) denote the weighted and simple average respectively. Note that Model 5 with , at which the averaged validation errors over replicas are minimized, achieves the minimal values of testing errors MAE and MSE (also MAE and MSE). This is because validation and testing errors have the same distribution in this case.
Thirty provinces case
In this simulation study, we present the results for the simulated data involving a larger number of regions.
Data description
We set the total number of regions , the total days considered , and set the threshold that separates the training and testing set as . The other prefixed parameters are listed below:
For , , , .
For , and , .
As in the four provinces case, we assign thirty provinces into different groups and reflect the similarity between provinces in the choice of ’s, namely the values of ’s being closer for provinces in the same groups. The three groups are denoted as below and the partition is denoted as P. Then, the proposed model (Model 5) can be applied as described in “Estimation of model parameters” with or without the graph information (the ground truth partition P) being fully known.
Figure 4 shows maps of the thirty provinces colored by transmission parameters or their estimates. It can be observed that ’s of provinces in the same groups are closer. Furthermore, the estimates of ’s using Model 5 with heterogeneity are close to the ground truth. In contrast, the estimates using Model 1 without heterogeneity are the same for all the provinces and deviate from the ground truth for most of the provinces. More results of trajectory prediction and parameter estimation can be found in “Results of trajectory prediction” and “Results of parameter estimation”.
Figure 4.
Maps of the thirty provinces divided into three groups, colored by transmission parameters or their estimates. In all the three panels, the circles on top of the panels denote the nine provinces in Group 1 (indexed by 1–9), the squares in the middle of the panels denote twelve provinces in Group 2 (indexed by 10–21), and the diamonds at the bottom of the panels denote nine provinces in Group 3 (indexed by 22–30). The provinces are colored by the ground truth in the left panel, by the averaged estimates using Model 5 in the middle panel, and by the averaged estimates using Model 1 in the right panel. The superscripts and denote the models used to obtain the estimates of . In the experiments, the traveling volumes between provinces are taken to be constant.
Details of assignment of the thirty provinces, choice of transmission rates and their values are listed in Sect. C of Supplementary Information.
Models to compare
We compare the Models 1–5 listed in Table 2 as detailed in “Models to compare”.
Parameter inference of the five models. For Model 5, the estimation of still follows the procedure in “Estimation of model parameters”. For the inference of , when we assume that the ground truth group division P is known, the general formula (2.13) can be specified as follows:
| 3.2 |
where . The parameter inference for Models 1–4 is the same as described in “Models to compare”.
Mismatched partitions of regions. For Model 5, we also report results when there are mismatches in partition of the provinces, since in real-world applications the graph information may not be fully known. Specifically, since most of the results reported in the following sections are for Provinces 1 (in ), 10 (in ), and 22 (in ), we consider the other two partitions and , which are different from P in the assignments of Provinces 1, 10, and 22, and another three provinces respectively. Comparison between the results using the ground truth partition and the results of these two different kinds of mismatches may reflect the potentially different impact of these mismatches on the results. For Model 5 with partitions or , is estimated by (3.2) with replaced by or .
The partition deviates from P in the assignment of Provinces 1 (in ), 10 (in ), and 22 (in ):
Group 1 (): Provinces 22, 2–9.
Group 2 (): Provinces 1, 11–21.
Group 3 (): Provinces 10, 23–30.
The partition deviates from P in the assignment of Provinces 2 (in ), 11 (in ), and 23 (in ):
Group 1 (): Provinces 1, 23, 3–9.
Group 2 (): Provinces 10, 2, 12–21.
Group 3 (): Provinces 22, 11, 24–30.
For mismatched partitions and , the provinces assigned to the wrong groups are marked in bold.
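The two mismatched partitions can be checked programmatically against the ground truth grouping (Provinces 1–9, 10–21, and 22–30, as stated in the caption of Fig. 4). The sketch below encodes the three partitions as listed above and reports which provinces are assigned to a different group; the variable and function names are hypothetical.

```python
def make_partition(groups):
    """Map each province index to its group label."""
    return {
        province: label
        for label, members in groups.items()
        for province in members
    }


def mismatched(partition, truth):
    """Provinces whose group differs from the ground truth partition."""
    return sorted(p for p in truth if partition[p] != truth[p])


# Ground truth partition P.
P = make_partition({1: range(1, 10), 2: range(10, 22), 3: range(22, 31)})

# First mismatched partition: Provinces 1, 10, and 22 are moved.
P_first = make_partition({
    1: [22, *range(2, 10)],
    2: [1, *range(11, 22)],
    3: [10, *range(23, 31)],
})

# Second mismatched partition: Provinces 2, 11, and 23 are moved.
P_second = make_partition({
    1: [1, 23, *range(3, 10)],
    2: [10, 2, *range(12, 22)],
    3: [22, 11, *range(24, 31)],
})

moved_first = mismatched(P_first, P)    # [1, 10, 22]
moved_second = mismatched(P_second, P)  # [2, 11, 23]
```

Both mismatched partitions keep the group sizes (9, 12, and 9 provinces) unchanged, so they differ from P only in which provinces are swapped.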
Results of trajectory prediction
Note that for Model 5 (with the ground truth partition P), we choose , at which the averaged validation errors over replicas are minimized, as marked with orange pentagrams in Fig. 7 and Supplementary Fig. S4. The following results for Model 5 are all obtained with .
Figure 7.
Testing and validation errors on simulated data with thirty provinces. Left: The weighted prediction errors on validation set (MAE) and testing set (MAE) respectively, plotted vs. the values of . The errors are averaged over 100 replicas of experiment. Right: Same plot of MSE error. In each plot, the blue and red horizontal lines show the values of the averaged errors when . Note that both the MAE and MSE validation errors are minimized at , which is marked by blue squares in both plots. The construction of training/validation/testing data is detailed in Sect. A.1.1 of Supplementary Information, and the formulas of computing the errors can be found in Sect. B of Supplementary Information.
We choose three provinces from the total thirty provinces (one province from each group), and plot the predicted trajectories and the absolute errors from one specific replica in Figs. 5 and 6. It can be observed from Fig. 6 that Model 5 (the green lines) achieves the most accurate prediction results due to the graph Laplacian regularization. Model 4 (the dark purple lines) also has good generalization performance, thanks to the heterogeneity and transportation involved in this model.
Figure 5.
True and fitted trajectories for the simulated data in the three provinces chosen from the total thirty provinces. The vertical lines show the threshold of training-testing split. For Model 5, , at which the averaged validation errors over replicas are minimized (as shown in Fig. 7).
Figure 6.
Absolute errors of the fitted trajectories for simulated data in the three provinces chosen from the total thirty ones. The vertical lines show the threshold of training-testing split.
We note that Model 3 also behaves much worse than the best models, especially in Provinces 1 (in ) and 22 (in ). In fact, the estimates for the ’s in these three provinces using Model 3 are 0.2908, 0.4198, and 0.4639, respectively, which are not much different from the ground truth values. Nevertheless, Model 3 does not involve transportation between provinces. As a result, the confirmed cases of Province 22 in (which far outnumber the cases of provinces in the other two groups) are not exported to other provinces; hence the predicted newly confirmed cases are much higher than the truth. A similar situation occurs for Province 1 (in ). Namely, Model 3 does not take the imported cases from the provinces in the other two groups into account, causing the predicted newly confirmed cases to be much lower than the truth.
Results of parameter estimation
The mean and standard deviation of ’s in Provinces 1 (in ), 10 (in ), and 22 (in ) are reported in Table 5. We can still see that Models 3–5 estimate the transmission parameters more accurately than Models 1 and 2 especially for Provinces 1 and 22. In addition, the estimates by Model 5 of are closer than those by Model 4, due to the existence of the regularization term.
Table 5.
Estimated with standard deviation using Models 1–5 for simulated data with thirty provinces.
| Model | M | H | GL prior | |||
|---|---|---|---|---|---|---|
| 1 | ||||||
| 2 | ||||||
| 3 | ||||||
| 4 | ||||||
| (P) | ||||||
| 5() | ||||||
| () | ||||||
The ground truth is (in ), (in ), (in ) as listed in Sect. C of Supplementary Information. Remarks for Columns 1–4 and estimates of ’s are the same as those in Table 3. In addition, for inference of parameters using Model 5, , at which the averaged validation errors over 100 replicas (computed with partition P) are minimized as shown in Fig. 7 and Supplementary Fig. S4. The partitions of Model 5 are P, , and respectively. P is the ground truth underlying graph structure, and and are mismatched partitions introduced in “Models to compare”.
The last three rows show the parameter estimates using Model 5 with three partitions P, , and respectively. Recall that P is the ground truth underlying graph structure, and and are mismatched partitions introduced in “Models to compare”. It can be seen that the estimates using Model 5 with , which differs from P in the grouping of Provinces 2, 11, and 23, are closer to the estimates using the ground truth partition P than are the estimates using Model 5 with , which differs from P in the grouping of Provinces 1, 10, and 22. Hence, the results imply that for the parameter inference of simulated data generated from a certain partition, an incorrect division causes more discrepancy in the mismatched regions, relative to the estimates obtained with the ground truth partition.
Further model evaluation
The mean and standard deviation of the weighted training and testing errors (formulas can be found in Sect. B of Supplementary Information, specifically in Eq. (S5)) are presented in Table 6. We see that Model 5 with achieves the minimum testing errors among all the models, and hence has the best generalization performance.
Table 6.
Training and testing errors of Models 1–5 with standard deviation for simulated data with thirty provinces.
| Model | M | H | GL prior | MAE | MAE | MSE | MSE |
|---|---|---|---|---|---|---|---|
| 1 | |||||||
| 2 | |||||||
| 3 | |||||||
| 4 | |||||||
| (P) | |||||||
| 5 () | |||||||
| () |
The last three rows show the comparison of errors for Model 5 with three partitions P, , and respectively. The testing errors of Model 5 with the two incorrect partitions are nearly the same as each other and larger than those with the ground truth partition. Nevertheless, these errors are still smaller than those of the baseline models. Therefore, an incorrect division might worsen the generalization performance of Model 5; however, when the division does not deviate much from the ground truth, Model 5 still performs better than Models 1–4 with uniform priors.
Moreover, as shown in Table 6, both Model 2, which allows transportation but not heterogeneity, and Model 3, which introduces heterogeneity but no transportation, perform worse than the baseline Model 1. Model 2 has larger testing errors for the same reason as in the four provinces case, as analyzed in “Further model evaluation”. The reason Model 3 does not predict well is explained in detail in “Results of trajectory prediction”.
However, we also note that Model 4, which involves both transportation and heterogeneity of parameters, has greater testing errors than Model 1 on average. This phenomenon may be partly explained by the fact that, on the one hand, there may be more infected cases imported from regions with higher ’s to those with lower transmission rates. On the other hand, although the transmission parameters are the same for all provinces (all around 0.42) in Model 1, and are still allowed to be spatially heterogeneous, which keeps the trajectories not too far from the ground truth. Meanwhile, although the ’s are allowed to be spatially heterogeneous in Model 4, it often underestimates ’s for provinces in and overestimates ’s for provinces in , making the trajectories deviate from the ground truth.
Figure 7 and Supplementary Fig. S4 plot the change of weighted and simply averaged testing errors MAE, MSE, MAE and MSE regarding (recall that the subscripts (w) and (s) denote weighted and simple average respectively). We can still observe that Model 5 with the minimizer of validation errors also achieves the minimal testing errors, since the validation and testing errors have the same distribution for the simulated data.
Experimental results on COVID-19 data
In this section, we apply our model to two real-world data sets, the COVID-19 data in China (from January to early February 2020) and Europe (from May to August 2020), in “Results for COVID-19 data in China” and “Results for COVID-19 data in Europe” respectively. We remark that the regions in “Methods” refer to provinces or municipalities for the data in China and refer to countries for the data in Europe. Before reporting results, we first introduce the data sources in “Data sources”. Then, we generalize our inference method to the case that the transmission rates have temporal heterogeneity in “Model extension by allowing time-varying parameters”, which might happen in real-world data due to travel restrictions or other reasons. For the two real-world data sets, we compare the proposed model with other baselines in the aspects of trajectory prediction and quantitative evaluation. We also show the estimation of transmission parameters with posterior distributions using our proposed model. Details can be found in “Results for COVID-19 data in China” and “Results for COVID-19 data in Europe”.
For the real-world data in Europe, since there is no consensus on a proper division, we present the results from two different partitions, both of which are based on the geographical locations of the countries. The results imply that, in contrast to the simulated data, where the data are generated in artificially divided regions, the generalization performance on the real-world data is similar for both partitions and much better than that of the baseline models.
Data sources
Real-world data studied in this section involve the data in China and Europe.
Real-world data in China
The publicly available pandemic data in China include the number of confirmed cases and removed cases (consisting of recovered cases and fatalities) in China’s major provinces and municipalities from January 21st to March 28th, 2020. These publicly available data are downloaded from the websites of the National Health Commission of the People’s Republic of China16 and Chinese Center for Disease Control and Prevention17. The corresponding total population in each province or municipality is from Ref.51.
We also utilize two transportation data sets in China extracted from Baidu Qianxi52, covering January 10th to February 10th, 2020: (1) the migration index, which reflects the number of people moving out of each province, and (2) the migration percentage, which reflects the percentage of the population moving to a destination province from the origin province. The traveling volume is estimated by combining these two data sets.
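One plausible way to combine an outflow index with destination percentages is to multiply them, as in the sketch below. This is an assumption made for illustration: the section does not state the exact combination rule or any scaling from index units to head counts, and the function name, the `scale` parameter, and the toy numbers are hypothetical.

```python
def travel_volume(outflow_index, destination_pct, scale=1.0):
    """Estimate origin-to-destination volumes.

    outflow_index[i]      : index reflecting how many people leave region i
    destination_pct[i][j] : percentage of region i's outflow going to region j
    scale                 : hypothetical factor converting index units to people
    """
    n = len(outflow_index)
    return [
        [scale * outflow_index[i] * destination_pct[i][j] / 100.0 for j in range(n)]
        for i in range(n)
    ]


# Made-up data for three regions; each row of percentages sums to 100.
outflow_index = [2.0, 1.0, 0.5]
destination_pct = [
    [0.0, 60.0, 40.0],
    [70.0, 0.0, 30.0],
    [50.0, 50.0, 0.0],
]
volume = travel_volume(outflow_index, destination_pct)
```

When each row of percentages sums to 100, the row sums of the resulting matrix recover the outflow index, which is a quick consistency check on the inputs.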
Real-world data in Europe
The publicly available pandemic data in Europe include the number of confirmed and removed cases (consisting of recovered cases and fatalities) in the following 11 countries from May 1st to August 31st in 2020: Denmark, Finland, Norway, Austria, Germany, Switzerland, Italy, Spain, Belgium, France, and Ireland. The data are downloaded from Refs.13,53, where the data are collected from European Center for Disease Prevention and Control18. The data of total population in each country is obtained from Ref.54. During the study, we are not able to obtain suitable transportation data between these European countries needed by our model, and thus we simplify the proposed model by incorporating only spatial and temporal heterogeneity but no transportation in this experiment, see details in “Results for COVID-19 data in Europe”.
We remark that the data in China are from the initial outbreak of the epidemic. In contrast, the data in Europe are from a period when the epidemic had been ongoing for about six months. These time periods are chosen to illustrate that the proposed model can handle both the case when the epidemic breaks out from some seed region and the case when the epidemic has already established a sizable local spread within different regions.
Model extension by allowing time-varying parameters
To capture the trend that the transmission parameters might change in real-world data due to travel restrictions or other factors, we first extend the model and parameter estimation procedure described in “Methods” so that the transmission rates are allowed to be time-varying.
Specifically, suppose that the transmission rate of each region is a piecewise constant function of t, taking one value for days before some threshold day and another value afterwards.
We denote and as the vectors of transmission rates in the two periods respectively. The stochastic dynamic model introduced in “Model description” remains unchanged except that the transmission rates will vary as time increases.
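As a minimal illustration of the piecewise constant assumption (a sketch, not the authors' implementation), the rate in a single region can be written as a function of the day index. The function name, the numerical values, and the convention that the threshold day itself belongs to the second period are all assumptions made for this example.

```python
def transmission_rate(t, rate_before, rate_after, change_day):
    """Piecewise constant transmission rate for one region.

    Assumption: days t < change_day use rate_before and days
    t >= change_day use rate_after. The source text does not say
    which period the threshold day itself belongs to.
    """
    return rate_before if t < change_day else rate_after


# Hypothetical values, chosen only to show the step in the rate.
rates = [transmission_rate(t, 0.35, 0.07, change_day=3) for t in range(6)]
print(rates)  # [0.35, 0.35, 0.35, 0.07, 0.07, 0.07]
```

In the multi-region setting, each region would carry its own pair of rates, which is what the two vectors of transmission rates collect.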
As for the parameter estimation, the first step of obtaining also stays the same as described in “Estimation of model parameters”.
Then, when estimating and the marginal posterior distributions of , we modify the prior distribution as below,
| 4.1 |
where , D and A have the same meaning as in (2.10), and is still the normalizing constant. Consequently,
| 4.2 |
Therefore, by the definition of in (2.12), the estimation of by minimizing can be equivalently written as
| 4.3 |
After choosing W and , can be inferred through (4.3). Then, we could estimate the marginal posterior distributions of by MCMC sampling with the initial point .
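For intuition about what a graph Laplacian penalty measures, the sketch below evaluates the standard quadratic form on a small example. It does not reproduce (4.1)–(4.3); the helper names, the toy affinity matrix, and the reading of the Laplacian as the degree matrix minus a symmetric affinity matrix W are assumptions for illustration.

```python
def pairwise_penalty(beta, W):
    """0.5 * sum_{i,j} W[i][j] * (beta[i] - beta[j])**2."""
    n = len(beta)
    return 0.5 * sum(
        W[i][j] * (beta[i] - beta[j]) ** 2 for i in range(n) for j in range(n)
    )


def laplacian_quadratic_form(beta, W):
    """beta^T (Deg - W) beta, where Deg is the diagonal matrix of row sums of W."""
    n = len(beta)
    degree = [sum(W[i]) for i in range(n)]
    total = 0.0
    for i in range(n):
        total += degree[i] * beta[i] ** 2
        for j in range(n):
            total -= W[i][j] * beta[i] * beta[j]
    return total


# Toy symmetric affinity matrix for three regions (hypothetical values).
W_toy = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 2.0],
         [0.0, 2.0, 0.0]]
beta_toy = [0.3, 0.1, 0.2]

print(pairwise_penalty(beta_toy, W_toy))
print(laplacian_quadratic_form(beta_toy, W_toy))
```

For a symmetric W the two expressions agree, so the penalty is small when strongly connected regions have similar transmission rates and vanishes when all connected regions share the same rate.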
We would like to remark that, as in the simulation study, since the proposed model depends on the penalty factor as shown in (4.3), the penalty factor is chosen so that the corresponding validation error achieves or is slightly higher than the minimum. Specifically, results from multiple choices of the penalty factor are presented, since for the real-world data, the validation data may not have the same distribution as the testing data (the choice of the training, validation, and testing sets is detailed in Sects. A, D, and E of Supplementary Information).
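One simple way to shortlist such candidates is to keep every penalty value whose validation error lies within a small tolerance of the minimum. The sketch below is only an illustration of that idea: the function name, the 5% tolerance, and the error values are hypothetical and are not taken from the paper.

```python
def near_minimal_candidates(validation_errors, tolerance=0.05):
    """Return penalty values whose validation error is within
    (1 + tolerance) times the smallest validation error.

    validation_errors: dict mapping penalty value -> validation error.
    """
    best = min(validation_errors.values())
    threshold = (1.0 + tolerance) * best
    return sorted(
        penalty for penalty, err in validation_errors.items() if err <= threshold
    )


# Hypothetical validation errors for four penalty values.
errors = {0.0: 0.50, 1.0: 0.40, 10.0: 0.41, 100.0: 0.47}
candidates = near_minimal_candidates(errors)
print(candidates)  # [1.0, 10.0]
```

The shortlisted values can then be compared on their predicted trajectories rather than committing to the validation minimizer alone.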
Results for COVID-19 data in China
Data description
After the selection process described in Sect. D of Supplementary Information, 21 provinces or municipalities are taken into consideration, which are Anhui, Beijing, Fujian, Gansu, Guangdong, Guangxi, Hebei, Henan, Hubei, Hunan, Jiangsu, Jiangxi, Liaoning, Ningxia, Shandong, Shan-Xi, Shanxi, Shanghai, Sichuan, Zhejiang, and Chongqing. Furthermore, since the epidemic data last from January 21st to February 10th, 2020, there are 20 days in total.
In addition, from observation of the data from January 21st to February 10th, the transmission rates in the selected provinces or municipalities change after some specific time point. Thus, we allow the transmission rates to be time-varying for this data set with appropriate changes to the model described in “Methods”, which are detailed in “Model extension by allowing time-varying parameters”. More details of the experimental settings can be found in Sect. D of Supplementary Information, including the construction of training/validation/testing data and the choice of the day on which the transmission rates change.
Furthermore, we remark that the data from Baidu Qianxi might not be the exact traveling volumes between municipalities and provinces. We assume that the actual traveling volume from an origin to a destination is proportional to the product of the Baidu migration index (which reflects the number of people departing from the origin) and the percentage of the population traveling from this origin to the destination. The corresponding scaling parameter also needs to be inferred from the data for all the models.
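The proportionality assumption can be sketched as follows. The function name and the numbers are hypothetical, the destination share is taken here as a fraction rather than a percentage, and in the paper the scaling parameter is inferred from the data rather than fixed in advance.

```python
def travel_volume(scale, migration_index, destination_share):
    """Estimated number of travelers from an origin to a destination.

    scale: scaling parameter (unknown in practice, inferred from data).
    migration_index: index reflecting how many people leave the origin.
    destination_share: fraction of those travelers going to the destination.
    """
    return scale * migration_index * destination_share


# Hypothetical inputs: index 4.0, a quarter of travelers head to the destination.
volume = travel_volume(scale=100.0, migration_index=4.0, destination_share=0.25)
print(volume)  # 100.0
```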
Models to compare
As in “Models to compare”, the same Models 1–5 are compared for COVID-19 data in China. Recall that Model 5 is the model proposed in this paper and the other four are the baseline models for comparison.
For Model 5, the parameter inference adopts the method in “Estimation of model parameters”, while allowing the transmission rates to be time-varying as described in “Model extension by allowing time-varying parameters”. Note that the medical resources were overwhelmed in Wuhan at the early stage of the pandemic55. When applying the general formula (4.3), we divide the provinces into two groups: Hubei, and all other provinces. Furthermore, the affinity matrix W is constructed as averaged traveling volumes, that is , where . Then, for COVID-19 data in China, (4.3) can be specified as follows to estimate and scaling parameter :
| 4.4 |
where we take , and and are the transmission rates in the k-th province before and after the -th day respectively.
For Models 1–4, if the model does not have heterogeneity of transmission parameters, then the transmission rates in the first period are forced to be the same across provinces, and so are those in the second period, while the transmission rates in the two periods are allowed to be different; if the model does not allow transportation between provinces, then the transportation terms disappear in (2.4) as described in “Models to compare”.
Results of trajectory prediction
We first remark that for Model 5, three values of the penalty factor are chosen, and the trajectory prediction results presented below are from these three choices, which serve as a sensitivity analysis of the penalty factor. The first value is the one at which the relative validation errors are minimized, marked with blue squares in Fig. 12 and Supplementary Fig. S5. The other two values are obtained by perturbing this minimizer and have slightly larger validation errors; they are marked with orange pentagrams and green diamonds in Fig. 12 and Supplementary Fig. S5 respectively. Since for the real-world data in China, the validation error does not necessarily have the same distribution as the testing error, the results from various choices of the penalty factor are presented for better comparison.
Figure 12.
Testing and validation errors on real-world COVID-19 data in China. The two plots are similar to those in Figure 7, with errors computed on the real-world data in China instead of simulated data. Both the MAE and MSE validation errors are minimized at , which is marked by blue squares in both plots. and are chosen by perturbing the minimizer , have slightly larger validation errors, and are marked with yellow pentagrams and green diamonds in both plots. The construction of training/validation/testing data is detailed in Sects. A.1.2 and D of Supplementary Information.
Figure 8 shows the true and predicted trajectories for newly confirmed cases in Hubei. In Fig. 8,
The orange line with circles shows the true trajectory.
The blue lines with crosses show the predicted deterministic trajectories obtained by running (2.4) with inferred using (4.4) with . The predicted trajectories with and are close to the ones with , and are therefore not included in Fig. 8.
The blue scatter plots show 100 stochastic trajectories with inferred using Model 5 with .
The orange scatter plots show 100 stochastic trajectories sampled from the posterior distribution of , which is estimated by MCMC iterations starting from inferred using Model 5 with .
The deterministic trajectories obtained by the other four models and the corresponding absolute errors are also plotted for better comparison. It can be seen that all the predicted deterministic trajectories generally capture the trend of the true trajectory. Besides, it can be seen from Fig. 8 that the performances of all the models that include heterogeneity of parameters are similar for Hubei. This may be explained by the fact that the cases in Hubei outnumber those in other provinces or municipalities, which makes all the models tend to fit the trajectory of Hubei best. Moreover, sampling trajectories from the posterior distribution of generates more randomness than sampling trajectories with .
Figure 8.
True and fitted trajectories in Hubei. The orange line with circles shows the true trajectory, the blue lines with crosses show the predicted deterministic trajectories using Model 5, and the blue and orange scatter plots show 100 stochastic trajectories with inferred using Model 5 and sampled from the posterior distribution of respectively. In each figure, the black vertical line shows the threshold of training-testing split. For Model 5, is chosen, at which the validation errors achieve the minimum value, marked by blue squares in Fig. 12 and Supplementary Fig. S5.
Similarly, Fig. 9 shows the predicted trajectories in Henan. When the transmission parameters are forced to be the same in all provinces, the predicted newly confirmed cases grow faster than the true trajectory in the first period. Thus, heterogeneity helps improve the performance of fitting and predicting. Additionally, although the validation errors achieve the minimum at , it can be seen from Fig. 9 that the deterministic trajectories obtained by Model 5 with and achieve slightly better performance in prediction than the one obtained by Model 5 with . Therefore, the result suggests that the choice of the penalty factor is not necessarily limited to the minimizer of the validation error. Instead, we may also compare trajectories with penalty factors that have slightly larger validation errors for possibly better generalization performance.
Figure 9.
True and fitted trajectories in Henan. The remarks for the lines and scatter plots are the same as those in Figure 8. For Model 5, , and are chosen. As shown in Fig. 12 and Supplementary Fig. S5, the validation errors are minimized at . and are obtained by perturbing the minimizer without increasing validation errors much. , and are marked by yellow pentagrams, blue squares, and green diamonds respectively in Fig. 12 and Supplementary Fig. S5.
Moreover, from Fig. 10 which shows the fitted trajectories in Anhui, it can be seen that heterogeneity helps the models fit the training trajectory better. Furthermore, adding regularization helps Model 5 capture the trend of the trajectory better than Model 4.
Figure 10.
True and fitted trajectories in Anhui. The remarks for the lines and scatter plots are the same as those in Fig. 8. The choice of , and has been explained in Fig. 9.
Results of parameter estimation
To illustrate the estimate of the posterior of parameters, Fig. 11 shows the estimated posterior distributions of and in Hubei using Model 5, in which the vertical black lines show the inferred and with , and Table 7 shows the inferred and with the mean and standard deviation of their estimated posterior distributions from MCMC.
Figure 11.
Estimated posterior distribution of in Hubei. The vertical black lines represent the values of corresponding using Model 5.
Table 7.
Estimated transmission rates in Hubei.
| Transmission rate | Inferred value | Mean of estimated posterior | Standard deviation of estimated posterior |
|---|---|---|---|
| First period | 0.3566 | 0.3566 | |
| Second period | 0.0723 | 0.0716 | 0.0076 |
By comparing the transmission rates in Hubei in the two periods, it can be seen that the transmission rate decreases greatly after February 2nd, which indicates that the containment measures adopted in Hubei against the COVID-19 outbreak are effective.
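To make the size of this decrease concrete, the short calculation below computes the relative reduction between the two periods, using the first numeric entry reported for each period in Table 7. The variable names are ours, and the calculation is only an arithmetic illustration, not an additional result.

```python
rate_first_period = 0.3566   # first value reported for the first period in Table 7
rate_second_period = 0.0723  # first value reported for the second period in Table 7

# Fraction by which the transmission rate dropped between the two periods.
relative_reduction = 1.0 - rate_second_period / rate_first_period
print(f"relative reduction: {relative_reduction:.1%}")  # relative reduction: 79.7%
```

In other words, the transmission rate for Hubei in the second period is roughly one fifth of its value in the first period.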
Further model evaluation
Table 8 shows the training errors and testing errors of the five models computed as detailed in Sect. B of Supplementary Information. As can be seen from Table 8, both heterogeneity of parameters and transportation between provinces help reduce the training and testing errors, as in the simulated data case. The utilization of the graph Laplacian prior helps further reduce the testing errors. Note that the testing errors of Model 4 are close to those of Model 3; a possible explanation is that after the traveling restrictions were imposed on January 23rd56, transportation no longer had much influence on the spread of the epidemic.
Table 8.
Training and testing errors of Models 1–5 for real-world data in China.
| Model | M | H | GL prior | MAE (training) | MAE (testing) | MSE (training) | MSE (testing) |
|---|---|---|---|---|---|---|---|
| 1 | No | No | No | 0.399 | 0.497 | 0.650 | 0.626 |
| 2 | Yes | No | No | 0.388 | 0.459 | 0.645 | 0.590 |
| 3 | No | Yes | No | 0.296 | 0.398 | 0.565 | 0.511 |
| 4 | Yes | Yes | No | 0.296 | 0.400 | 0.562 | 0.514 |
| 5 | Yes | Yes | Yes | 0.296 | 0.373 | 0.565 | 0.477 |
| | | | | 0.298 | 0.343 | 0.572 | 0.438 |
| | | | | 0.303 | 0.328 | 0.582 | 0.416 |
Figure 12 and Supplementary Fig. S5 plot weighted and simply averaged testing errors against varying respectively. We can see from Table 8 and Fig. 12 that though achieves the minimal validation error, it has larger testing error than .
Results for COVID-19 data in Europe
Data description
The dataset contains the numbers of daily COVID-19 cases in 11 countries in Europe, which are Denmark, Finland, Norway, Austria, Germany, Switzerland, Italy, Spain, Belgium, France, and Ireland. After preprocessing and removing the first and last three days of the original data, the dataset spans over a period of days. More details of experimental settings can be found in Sect. E of Supplementary Information, including the selection of the countries, the preprocessing of the data, the construction of training/validation/testing data, and the choice of the day on which the transmission rates change in the selected countries.
Models to compare
Three models for comparison. For COVID-19 data in Europe, only three models detailed below are compared, which allow heterogeneity of parameters but do not include transportation. The last model is the proposed one in this paper, and the first two are baseline models. The transportation data needed in this model are not available to the best of our knowledge. However, the impact of transportation is expected to be less significant due to travel restrictions56.
Model 1’: Model with uniform prior distribution, without heterogeneity or migration.
Model 2’: Model with uniform prior distribution, with heterogeneity but without migration.
Model 3’: Model with prior distribution based on graph Laplacian, with heterogeneity but without migration.
Two different partitions of the countries. For Model 3’, the 11 countries are partitioned into four groups according to geographical locations:
Group 1 (Northern Europe): Denmark, Finland, Norway;
Group 2 (Central Europe): Austria, Germany, Switzerland;
Group 3 (Southern Europe): Italy, Spain;
Group 4 (Western Europe): Belgium, France, Ireland.
We denote this partition as P. Since this might not be the only appropriate partition of the countries, we also present results from a second partition in the following sections, which likewise groups geographically close countries together:
Group 1: Finland, Norway;
Group 2: Denmark, Austria, Germany;
Group 3: Spain, Belgium, France, Ireland;
Group 4: Switzerland, Italy.
The results for Model 3’ in “Further model evaluation” are presented with both the partitions P and .
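To illustrate how a partition of countries could enter a graph-based prior, the sketch below builds a binary affinity matrix that connects two distinct countries exactly when they fall in the same group of the partition P. This construction is an assumption made for illustration only; the weights actually used in the paper's formulation (4.5) are not reproduced here, and the names `partition_P` and `group_affinity` are hypothetical.

```python
# Partition P from the text, keyed by region name.
partition_P = {
    "Northern Europe": ["Denmark", "Finland", "Norway"],
    "Central Europe": ["Austria", "Germany", "Switzerland"],
    "Southern Europe": ["Italy", "Spain"],
    "Western Europe": ["Belgium", "France", "Ireland"],
}


def group_affinity(partition):
    """Return (countries, W), where W[i][j] = 1.0 if countries i and j
    are distinct members of the same group and 0.0 otherwise."""
    countries = [c for members in partition.values() for c in members]
    group_of = {c: g for g, members in partition.items() for c in members}
    W = [
        [1.0 if a != b and group_of[a] == group_of[b] else 0.0 for b in countries]
        for a in countries
    ]
    return countries, W


countries, W = group_affinity(partition_P)
i, j, k = (countries.index(c) for c in ("Denmark", "Finland", "Austria"))
print(W[i][j], W[i][k])  # 1.0 0.0
```

Swapping in a different partition only changes the dictionary, which makes it easy to compare how sensitive the results are to the grouping.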
Parameter inference of the three models. By the time-varying extension (4.3) detailed in “Model extension by allowing time-varying parameters”, for Model 3’, are estimated by the optimization problem in (4.5):
| 4.5 |
or by (4.5) but with the partition replaced by the partition . Here, we still take .
Results of trajectory prediction
We again remark that for Model 3’ (with P), as in the case of real-world data in China, part of the reported results are from three choices of the penalty factor, since the validation error may not have the same distribution as the testing error. Note that Fig. 17 and Supplementary Fig. S6 plot the weighted and simply averaged relative validation errors, respectively. One choice is the value at which the validation errors (computed with P) achieve the minimum, marked with blue squares in Fig. 17 and Supplementary Fig. S6. The other two choices have larger validation errors and are marked with orange pentagrams and green diamonds respectively in Fig. 17 and Supplementary Fig. S6.
Figure 17.
Testing and validation errors on real-world COVID-19 data in Europe. The two plots are similar to those in Fig. 7, with the errors computed on the real-world data in Europe instead of simulated data. Both the MAE and MSE validation errors are minimized at , which is marked by blue squares in both plots. and are chosen by perturbing the minimizer , have slightly larger validation errors, and are marked with yellow pentagrams and green diamonds in both plots. The construction of training/validation/testing data is detailed in Sects. A.1.2 and E of Supplementary Information.
Figures 13, 14 and 15 present the true and predicted trajectories in Austria, Germany, and Italy, respectively. As can be seen from Figs. 13, 14 and 15, heterogeneity of transmission parameters in Model 2’ (green lines) and Model 3’ (blue lines) helps improve the performance of fitting and generalization. Furthermore, utilization of correlation between countries in Model 3’ further improves the prediction of the trajectories as shown in Fig. 13. We also note that for Model 3’, compared to the results for , at which the validation errors are minimized, and achieve slightly more accurate prediction for Italy.
Figure 13.
True and fitted trajectories in Austria. The remarks for the lines and scatter plots are the same as those in Fig. 8. The vertical lines show the threshold of training-testing split of COVID-19 data in Europe. For Model 3’, , and are chosen. As shown in Fig. 17 and Supplementary Fig. S6, the validation errors are minimized at , and and are obtained by perturbing the minimizer . , and are marked by yellow pentagrams, blue squares, and green diamonds respectively in Fig. 17 and Supplementary Fig. S6.
Figure 14.
True and fitted trajectories in Germany. The remarks for the lines and scatter plots are the same as those in Fig. 8. The choice of , and has been explained in Fig. 13.
Figure 15.
True and fitted trajectories in Italy. The remarks for the lines and scatter plots are the same as those in Fig. 8. The choice of , and has been explained in Fig. 13.
Results of parameter estimation
Same as before, Fig. 16 shows the estimated posterior distributions of and using Model 3’, in which the vertical black lines show the inferred and with . Table 9 shows the inferred and and the mean and standard deviation of their estimated posterior distribution from MCMC. The estimated is larger than the estimated , which is consistent with the trend that the newly confirmed cases in Italy first decrease and then increase.
Figure 16.
Estimated posterior distributions of in Italy. The vertical black lines represent the values of corresponding using Model 3’.
Table 9.
Estimated transmission rates in Italy.
| Transmission rate | Inferred value | Mean of estimated posterior | Standard deviation of estimated posterior |
|---|---|---|---|
| First period | 0.1034 | 0.1035 | |
| Second period | 0.1719 | 0.1719 | |
Further model evaluation
Table 10 shows the training and testing errors of the three models. It can be seen from Table 10 that Models 2’ and 3’ have both smaller training errors and smaller testing errors than Model 1’, due to the introduction of heterogeneity of transmission parameters. We can also see that compared with Model 2’, Model 3’, which adds the graph Laplacian prior, has better generalization performance. In particular, Model 3’ with has smaller testing error than with , although does not achieve the minimal validation errors.
Table 10.
Training and testing errors of Models 1’–3’ for real-world data in Europe.
| Model | M | H | GL prior | MAE (training) | MAE (testing) | MSE (training) | MSE (testing) |
|---|---|---|---|---|---|---|---|
| 1’ | No | No | No | 0.330 | 0.639 | 0.424 | 0.691 |
| 2’ | No | Yes | No | 0.171 | 0.543 | 0.228 | 0.599 |
| 3’ (P) | No | Yes | Yes | 0.203 | 0.450 | 0.268 | 0.489 |
| | | | | 0.214 | 0.447 | 0.285 | 0.488 |
| | | | | 0.225 | 0.441 | 0.302 | 0.486 |
| 3’ () | No | Yes | Yes | 0.209 | 0.436 | 0.273 | 0.511 |
| | | | | 0.222 | 0.415 | 0.291 | 0.492 |
| | | | | 0.237 | 0.399 | 0.311 | 0.480 |
The formulas of errors are detailed in Sect. B (Eq. (S5)) of Supplementary Information. The remarks for Columns 2–4 are the same as those in Table 4. In addition, the choice of , and in Model 3’ has been explained in Fig. 13. Note that the validation errors are minimized at as shown in Fig. 17. The partitions in Model 3’ are P and respectively. P and are two different partitions of countries in Europe introduced in “Models to compare”.
Recall that P and are two different partitions of countries in Europe introduced in “Models to compare”, both based on geographical locations. By comparing the corresponding results in Table 10 for Model 3’ with partitions P and respectively, it can be seen that the testing errors are close. This is different from the implications of the simulated data. A possible explanation for this might be that the regions in the real world have more complicated correlations than the simulated regions where clustering information is uniquely and artificially prefixed. Although reasonable groupings may not always be unique for real-world cases, the proposed model could still predict the trajectories more accurately than the baseline models.
Figure 17 and Supplementary Fig. S6 plot the weighted and simply averaged testing errors (whose definitions are in Supplementary Information) against varying values of the penalty factor respectively, from which one may see that imposing regularization properly by choosing a moderate penalty factor helps improve the generalization performance. By comparing the blue solid lines and red dashed lines in Fig. 17 (and Supplementary Fig. S6), we see that the validation and testing errors do not change in the same way as the penalty factor increases. This further supports the previous findings that it would be better to examine and compare the results from multiple choices of the penalty factor with relatively low validation errors for Model 3’, in order to choose a better value for the specific data sets.
Discussion
In this paper, we propose a stochastic dynamic model inspired by Ref.2, which accounts for inter-district transportation as well as spatially and temporally heterogeneous transmission parameters, and which can model the ongoing and lasting spread of the epidemic in multiple districts. Based on the proposed model, we also introduce a two-step procedure for estimating the parameters, which utilizes graph Laplacian regularization to address the correlation between districts. Experiments on both simulated and real-world data show that compared with the baseline models, the proposed model improves the performance of fitting and generalization.
We acknowledge that there are limitations to this work. First, the stochastic dynamic model proposed in this paper might not be comprehensive enough to characterize the spread of COVID-19 in reality. For example, the model does not account for the ascertainment rate of the positive cases, the asymptomatic virus carriers, and the infectious latent period. Second, the proposed model does not consider the heterogeneity with respect to transmission risk in different populations, for example, populations with different ages, jobs, or health conditions, and needs further modification to incorporate the large-scale COVID-19 vaccination as well as waning of immunity over time. Finally, the parameter inference depends on the correlation between regions determined by the clustering patterns among regions. However, such division is usually not fully known in the real-world cases, and the inference of graph structure is yet to be explored in this paper.
The current methods could be extended in several directions in our future study. First, the dynamic model could be refined to be closer to reality, for example, by taking the asymptomatic virus carriers and contact tracking into account and by modeling the change of the transmission parameters over time in a more sophisticated way. Second, the model can be modified to accommodate the heterogeneity of various populations, including different age groups and vaccination statuses. For example, the compartments could be further divided into subgroups, with parameters such as transmission rates and mortality rates allowed to differ between subgroups, and the dynamic model adjusted accordingly. Third, the method could further include detecting the underlying graph structure of the regions so that the construction of the graph Laplacian matrix could be more self-contained and systematic. Finally, the method could be further used to assess the influence of containment measures taken by different countries, for example, by adjusting the traveling volume to make it different from the actual transportation data and then analyzing the corresponding influence on the size and trend of the pandemic.
Acknowledgements
The authors thank Dr. Huaiyu Tian, Dr. Chong You, Omar Melikechi for helpful discussions, and Qiushi Lin and Feng Zhou for assistance with the collection of real-world data. Y. Z. and X.-H. Z. acknowledge support by National Natural Science Foundation of China grant 82041023 and The Bill & Melinda Gates Foundation (INV-016826). Y. T. and X. C. acknowledge support by NSF (DMS-2007040), and X.C. is also partially supported by the Alfred P. Sloan Foundation.
Author contributions
Y.T., Y.Z., and X.C. contributed to the development of the model and algorithm, as well as numerical experiments, data analysis and interpretation of results. All authors contributed to the study design and writing of the manuscript.
Data availability
The data and code used in the simulation study (“Experimental results for simulated data”) and the application to COVID-19 data in Europe (“Results for COVID-19 data in Europe”) are publicly available in the repository https://github.com/Yixuan-Tan/Statistical_Inference_Using_GLEaM_with_Spatial_Heterogeneity_and_Correlation. The COVID-19 data in China (“Results for COVID-19 data in China”) are not publicly available because the official websites of the health commissions of some provinces in China no longer maintain the COVID-19 data from before March 2020, but they are available from the corresponding author on request.
Competing interests
The authors declare no competing interests.
Footnotes
These authors contributed equally: Yixuan Tan and Yuan Zhang.
Contributor Information
Xiuyuan Cheng, Email: xiuyuan.cheng@duke.edu.
Xiao-Hua Zhou, Email: azhou@math.pku.edu.cn.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-022-18775-8.
References
- 1.World Health Organization (WHO). https://covid19.who.int.
- 2.Balcan D, et al. Modeling the spatial spread of infectious diseases: The global epidemic and mobility computational model. J. Comput. Sci. 2010;1:132–145. doi: 10.1016/j.jocs.2010.07.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Keeling, M. J. & Rohani, P. Modeling infectious diseases in humans and animals (Princeton university press, 2011).
- 4.Gomes, M. F. et al. Assessing the international spreading risk associated with the 2014 west african ebola outbreak. PLoS Curr.6 (2014). [DOI] [PMC free article] [PubMed]
- 5.Pastore-Piontti, A. et al. Real-time assessment of the international spreading risk associated with the 2014 west african ebola outbreak. In Mathematical and Statistical Modeling for Emerging and Re-emerging Infectious Diseases, 39–56 (Springer, 2016).
- 6.Chinazzi M, et al. The effect of travel restrictions on the spread of the 2019 novel coronavirus (covid-19) outbreak. Science. 2020;368:395–400. doi: 10.1126/science.aba9757. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Colizza, V. et al. Estimate of novel influenza a/h1n1 cases in mexico at the early stage of the pandemic with a spatially structured epidemic model. PLoS Curr.1, RRN1129–RRN1129 (2009). [DOI] [PMC free article] [PubMed]
- 8.Balcan, D. et al. Modeling the critical care demand and antibiotics resources needed during the fall 2009 wave of influenza a(h1n1) pandemic. PLoS Curr.1, RRN1133–RRN1133 (2009). [DOI] [PMC free article] [PubMed]
- 9.Bajardi, P. et al. Modeling vaccination campaigns and the fall/winter 2009 activity of the new a(h1n1) influenza in the northern hemisphere. Emerg. Health Threats J.2, e11–e11 (2009;2008;2010;). [DOI] [PMC free article] [PubMed]
- 10.Tizzoni M, et al. Real-time numerical forecast of global epidemic spreading: Case study of 2009 a/h1n1pdm. BMC Med. 2012;10:165. doi: 10.1186/1741-7015-10-165. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Poletto C, et al. Assessment of the middle east respiratory syndrome coronavirus (mers-cov) epidemic in the middle east and risk of international spread using a novel maximum likelihood analysis approach. Eurosurveillance. 2014;19:20824. doi: 10.2807/1560-7917.es2014.19.23.20824. [DOI] [PubMed] [Google Scholar]
- 12.Balcan D, et al. Seasonal transmission potential and activity peaks of the new influenza a (h1n1): a monte carlo likelihood analysis based on human mobility. BMC Med. 2009;7:1–12. doi: 10.1186/1741-7015-7-45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Dong E, Du H, Gardner L. An interactive web-based dashboard to track covid-19 in real time. Lancet. Infect. Dis. 2020;20:533–534. doi: 10.1016/S1473-3099(20)30120-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Zhang, Q. et al. Forecasting seasonal influenza fusing digital indicators and a mechanistic disease model. In Proceedings of the 26th international conference on world wide web, 311–319 (2017).
- 15.Vespignani A. Predicting the behavior of techno-social systems. Science. 2009;325:425–428. doi: 10.1126/science.1171990. [DOI] [PubMed] [Google Scholar]
- 16.National Health Commission of the People’s Republic of China. http://en.nhc.gov.cn/antivirusfight.html. [DOI] [PMC free article] [PubMed]
- 17.Chinese Center for Disease Control and Prevention. https://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm. [DOI] [PMC free article] [PubMed]
- 18.European Centre for Disease Prevention and Control. https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases.
- 19.Kermack, W. O. & McKendrick, A. G. A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. Ser. A115, 700–721 (1927).
- 20.Kendall, D. G. Deterministic and stochastic epidemics in closed populations. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 4: Contributions to Biology and Problems of Health, 149–165 (University of California Press, 1956).
- 21.Bailey, N. T. A simple stochastic epidemic. Biometrika 193–202 (1950). [PubMed]
- 22.Bartlett M. Some evolutionary stochastic processes. J. Roy. Stat. Soc.: Ser. B (Methodol.) 1949;11:211–229. [Google Scholar]
- 23.Britton T, et al. Stochastic epidemic models with inference. Berlin: Springer; 2019. [Google Scholar]
- 24.Zhang Y, et al. Prediction of the covid-19 outbreak in china based on a new stochastic dynamic model. Sci. Rep. 2020;10:1–10. doi: 10.1038/s41598-020-76630-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Donnat, C. & Holmes, S. Modeling the heterogeneity in covid-19’s reproductive number and its impact on predictive scenarios. J. Appl. Stat. 1–29 (2021). [DOI] [PMC free article] [PubMed]
- 26.Tkachenko, A. V. et al. Time-dependent heterogeneity leads to transient suppression of the covid-19 epidemic, not herd immunity. Proc. Natl. Acad. Sci.118 (2021). [DOI] [PMC free article] [PubMed]
- 27.Moreno Y, Pastor-Satorras R, Vespignani A. Epidemic outbreaks in complex heterogeneous networks. Eur. Phys. J. B Condens. Matter Complex Syst. 2002;26:521–529. [Google Scholar]
- 28.Boschi T, Di Iorio J, Testa L, Cremona MA, Chiaromonte F. Functional data analysis characterizes the shapes of the first covid-19 epidemic wave in italy. Sci. Rep. 2021;11:1–15. doi: 10.1038/s41598-021-95866-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Carroll C, et al. Time dynamics of covid-19. Sci. Rep. 2020;10:1–14. doi: 10.1038/s41598-020-77709-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Cremona, M. A. & Chiaromonte, F. Probabilistic -mean with local alignment for clustering and motif discovery in functional data. arXiv preprintarXiv:1808.04773 (2018).
- 31.Hilton J, Keeling MJ. Estimation of country-level basic reproductive ratios for novel coronavirus (sars-cov-2/covid-19) using synthetic contact matrices. PLoS Comput. Biol. 2020;16:e1008031. doi: 10.1371/journal.pcbi.1008031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Szapudi I. Heterogeneity in sir epidemics modeling: superspreaders and herd immunity. Appl. Netw. Sci. 2020;5:1–12. doi: 10.1007/s41109-020-00336-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Volpert, V., Banerjee, M. & Sharma, S. Epidemic progression and vaccination in a heterogeneous population. application to the covid-19 epidemic. Ecol. Complex. 100940 (2021).
- 34.Hou, X. et al. Intracounty modeling of covid-19 infection with human mobility: Assessing spatial heterogeneity with business traffic, age, and race. Proc. Natl. Acad. Sci. 118 (2021).
- 35.Chen S, Li Q, Gao S, Kang Y, Shi X. State-specific projection of covid-19 infection in the united states and evaluation of three major control measures. Sci. Rep. 2020;10:1–9. doi: 10.1038/s41598-020-80044-3.
- 36.Tian H, et al. An investigation of transmission control measures during the first 50 days of the covid-19 epidemic in china. Science. 2020;368:638–642. doi: 10.1126/science.abb6105.
- 37.Kraemer MU, et al. The effect of human mobility and control measures on the covid-19 epidemic in china. Science. 2020;368:493–497. doi: 10.1126/science.abb4218.
- 38.World Health Organization (WHO). Enhancing response to omicron sars-cov-2 variant. https://www.who.int/publications/m/item/enhancing-readiness-for-omicron-(b.1.1.529)-technical-brief-and-priority-actions-for-member-states.
- 39.Kurtz TG. Solutions of ordinary differential equations as limits of pure jump markov processes. J. Appl. Probab. 1970;7:49–58.
- 40.Kurtz TG. Limit theorems for sequences of jump markov processes approximating ordinary differential processes. J. Appl. Probab. 1971;8:344–356.
- 41.He S, Peng Y, Sun K. Seir modeling of the covid-19 and its dynamics. Nonlinear Dyn. 2020;101:1667–1680. doi: 10.1007/s11071-020-05743-y.
- 42.López L, Rodo X. A modified seir model to predict the covid-19 outbreak in spain and italy: simulating control scenarios and multi-scale epidemics. Res. Phys. 2021;21:103746. doi: 10.1016/j.rinp.2020.103746.
- 43.Yang Z, et al. Modified seir and ai prediction of the epidemics trend of covid-19 in china under public health interventions. J. Thorac. Dis. 2020;12:165. doi: 10.21037/jtd.2020.02.64.
- 44.He X, et al. Temporal dynamics in viral shedding and transmissibility of covid-19. Nat. Med. 2020;26:672–675. doi: 10.1038/s41591-020-0869-5.
- 45.He, G. et al. When and how to adjust non-pharmacological interventions concurrent with booster vaccinations against Covid-19 - Guangdong, China, 2022. China CDC Week. 4, 199 (2022). https://weekly.chinacdc.cn//article/id/397ce3f9-9388-46c1-862e-d6e8bee63a56.
- 46.Zhang, M. et al. Transmission dynamics of an outbreak of the covid-19 delta variant b.1.617.2 - guangdong province, china, may-june 2021. China CDC Week. 3, 584 (2021). https://weekly.chinacdc.cn//article/id/eb772589-1584-4ef9-beac-cac3ab2fbb12.
- 47.Broyden CG. The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA J. Appl. Math. 1970;6:76–90.
- 48.Fletcher R. A new approach to variable metric algorithms. Comput. J. 1970;13:317–322.
- 49.Goldfarb D. A family of variable-metric methods derived by variational means. Math. Comput. 1970;24:23–26.
- 50.Shanno DF. Conditioning of quasi-newton methods for function minimization. Math. Comput. 1970;24:647–656.
- 51.National Bureau of Statistics of China. Annual data by province (2019). http://www.stats.gov.cn/english/Statisticaldata/AnnualData/.
- 52.Baidu Qianxi. https://qianxi.baidu.com/#/2020chunyun.
- 53.Johns Hopkins University. Covid-19 data repository by the center for systems science and engineering (csse) at johns hopkins university (2020). https://github.com/CSSEGISandData/COVID-19.
- 54.Wikipedia. List of european countries by population. https://en.wikipedia.org/wiki/List_of_European_countries_by_population.
- 55.Information Office of Hubei Provincial People’s Government. Prevention and control of pneumonia outbreak of new coronary virus infection (2020). https://www.hubei.gov.cn/hbfb/xwfbh/202002/t20200210_2023490.shtml.
- 56.Blavatnik School of Government. Covid-19 government response tracker. https://www.bsg.ox.ac.uk/research/research-projects/covid-19-government-response-tracker.
Data Availability Statement
The data and code used in the simulation study (“Experimental results for simulated data”) and the application to COVID-19 data in Europe (“Results for COVID-19 data in Europe”) are publicly available in the repository https://github.com/Yixuan-Tan/Statistical_Inference_Using_GLEaM_with_Spatial_Heterogeneity_and_Correlation. The COVID-19 data in China (“Results for COVID-19 data in China”) are not publicly available because the official websites of the health commissions of some provinces in China no longer maintain COVID-19 data from before March 2020; these data are available from the corresponding author on request.

















