2025 Jun 6;44(13-14):e70135. doi: 10.1002/sim.70135

A Comparison Between Markov Switching Zero‐Inflated and Hurdle Models for Spatio‐Temporal Infectious Disease Counts

Mingchi Xu 1, Dirk Douwes‐Schultz 1, Alexandra M Schmidt 1
PMCID: PMC12142456  PMID: 40474791

ABSTRACT

In epidemiological studies, zero‐inflated and hurdle models are commonly used to handle excess zeros in reported infectious disease cases. However, they cannot model the persistence (transition from presence to presence) and reemergence (transition from absence to presence) of a disease separately. Covariates can sometimes have different effects on the reemergence and persistence of a disease. Recently, a zero‐inflated Markov switching negative binomial model was proposed to accommodate this issue. We introduce a Markov switching negative binomial hurdle model as a competitor of that approach, as hurdle models are often also used as alternatives to zero‐inflated models for accommodating excess zeros. We begin the comparison by inspecting the underlying assumptions made by both models. Hurdle models assume perfect detection of the disease cases, while zero‐inflated models implicitly assume the case counts can be under‐reported. We therefore investigate when a negative binomial distribution can approximate the true distribution of reported counts. A comparison of the fit of the two types of Markov switching models is undertaken on chikungunya cases across the neighborhoods of Rio de Janeiro. We find that, among the fitted models, the Markov switching negative binomial zero‐inflated model produces the best predictions, and both Markov switching models produce remarkably better predictions than more traditional negative binomial hurdle and zero‐inflated models.

Keywords: Bayesian inference, chikungunya, endemic‐epidemic model, under‐reporting

1. Introduction

In epidemiological studies, disease counts taken at different spatial locations across different instants in time often contain a great number of zeros. In this case, a count distribution, like the Poisson or negative binomial distribution, is often unable to capture the large number of observed zero counts present in the data. Zero‐inflated (ZI) and hurdle models [1, 2] are the two primary types of models that have been proposed to deal with count data with excess zeros.

The first paper on ZI Poisson (ZIP) regression models handled count data with excess zeros by mixing a Poisson distribution and a distribution with a point mass at zero [3]. In practice, due to the need for model flexibility, we can mix count distributions other than the Poisson with a distribution that has point mass at zero, like a ZI negative binomial model (ZINB) [4]. Overall, we will refer to these models as ZI count (ZIC) models [5]. In an epidemiological application of a ZIC model, a Bernoulli process is used to determine whether the disease is present [6]. A one from the Bernoulli process indicates the disease is present, and the number of cases comes from the count process, while a zero indicates the disease is absent and the number of cases is zero. Zeros can come from the zero mass process or the count process. Correspondingly, zero counts produced by a ZI model are often distinguished by “structural zeros”, from the zero mass process, that correspond to the absence of the disease, and “sampling zeros”, produced by the count process, which imply unreported cases from the at‐risk population during the study period [7]. We can also relate associated factors to the Bernoulli process that controls the presence/absence of the disease [8].

In comparison with ZI models, hurdle models also consist of two mixed parts: one is a zero‐generating process, while the other is a zero‐truncated count process. For example, the count process can be a zero‐truncated negative binomial distribution, which leads to a negative binomial hurdle model (NBH) [9]. We can associate certain factors with the probability of disease presence in the same way as with ZI models [10]. However, unlike ZI models, in hurdle models, zeros cannot be produced by the at‐risk population. Namely, in hurdle models, all zero counts are “structural zeros” by construction. Therefore, compared to ZI models, a zero in a hurdle model, within an epidemiological context, can only arise due to the actual absence of the disease rather than the disease going undetected. Implicitly, this means that the disease is perfectly detected or that undetected cases are too few to be relevant, which is the main difference in the interpretation of zeros between ZI and hurdle models.

Under‐reporting is another challenge for researchers in epidemiology, where the reported disease counts can be less than the true counts. A zero‐truncated count process, like a zero‐truncated negative binomial distribution, can be applied to the reported counts when the disease is present under the perfect detection assumption of a hurdle model. However, a count distribution, such as the Poisson or negative binomial, can fail to approximate the true distribution of reported cases when the disease is present under the imperfect detection assumption of a ZI model. In Section 2, we explore when the approximation can be acceptable.

Under the framework of spatio‐temporal data, we can separate the presence of the disease into two categories: persistence (changing from presence to presence) and reemergence (changing from absence to presence). ZI models can only accommodate the characteristics of overall disease presence and cannot model reemergence and persistence separately. Sometimes, covariate effects can be quite different between the reemergence and persistence of an infectious disease [11]. Recently, Douwes‐Schultz and Schmidt (2022) [11] extended the finite mixture ZIC model to a zero‐state coupled Markov switching negative binomial model (ZS‐CMSNB). They assumed the disease switched between periods of presence and absence in each area through a series of coupled Markov chains, where the reemergence and persistence were modeled separately [11]. As a counterpart to the ZI models, in our framework, we follow the structure of hurdle models to assume that the zero mass process represents the reported cases when the disease is absent, and a truncated count distribution (e.g., a zero truncated negative binomial distribution) represents the reported cases when the disease is present. We then assume that a non‐homogeneous Markov chain in each area switches the disease between the presence and absence states. We compare the Markov‐switching negative binomial hurdle model to its zero‐inflated counterpart on the fit, plausibility of assumptions, and interpretation when modeling weekly reported chikungunya cases in Rio de Janeiro.

1.1. Motivating Example: Chikungunya Cases in Rio De Janeiro

Chikungunya is an infectious disease that became endemic in Rio de Janeiro, Brazil, in 2016 [12]. For our study, we obtained publicly available data from the website of the Municipal Health Secretariat of Rio de Janeiro. The data comprises weekly counts across the 160 administrative districts of Rio de Janeiro. The data spans the period between January 2015 and May 2022. It is suspected that chikungunya started circulating unnoticed in the city before the first reported transmission [13]. Due to a lack of social index information, we decided to exclude one small district, Paquetá Island.

Figure 1 illustrates weekly chikungunya cases for two administrative districts of Rio de Janeiro, one with a small population (Saúde) and the other (Campo Grande) with a relatively large one. In the Saúde district, chikungunya cases were only reported for a couple of weeks, and in most weeks, no cases were reported (96.61% of the study period). In the Campo Grande district, the disease showed a longer time of observed persistence (54.95% of the study period) and is observed to reemerge (go from absence to presence) more quickly. These differences in chikungunya persistence and reemergence probabilities at the district level could be explained by population differences, as there is a well‐known inverse relationship between population and the rate of disease extinction in epidemiology [14]. Socioeconomic factors may also partly explain the distinct patterns [15], since districts with lower Human Development Index (HDI) tend to lack tap water supply, which allows mosquitoes to breed in water storage containers and transmit disease [16]. Because the mosquitoes are accustomed to urban utilities, the level of which is inversely correlated with the proportion of green areas (areas with agriculture, swamps and shoals, tree and shrub cover, and woody‐grass cover) [17], the level of green area in a district could be inversely correlated with the disease transmission there. In this motivating example, we are mainly interested in investigating associations between certain factors, such as HDI and green areas, and chikungunya reemergence and persistence, as well as the problem of future case prediction in a district to help policymakers better direct resources to districts in need.

FIGURE 1. Weekly chikungunya infectious cases for (a) Saúde district and (b) Campo Grande district; red solid circles represent 0 cases at the reported week during the study period.

This paper is organized as follows. In Section 2, we explore mathematically how differences in hurdle and ZI model structures are implicitly related to assumptions about disease detection. In Section 3, we detail our proposed coupled Markov switching hurdle model and review its ZI counterpart, the recently proposed coupled Markov switching ZI model [11]. Section 4 details the Bayesian inferential procedure for both models. In Section 4.2, we include a simulation study comparing predictions between the Markov switching ZI and hurdle models under different reporting rates. We aim to investigate at which reporting rates our proposed Markov switching hurdle model performs better than its zero‐inflated counterpart. Section 5 presents the analysis of the chikungunya data and a comparison in terms of prediction between the hurdle and zero‐inflated Markov switching models, and also more conventional zero‐inflated and hurdle alternatives. The paper concludes with a discussion in Section 6.

2. Motivating a Comparison Between Zero‐Inflated and Hurdle Negative Binomial Models

In epidemiology, both ZI and hurdle models assume that, for every area and/or time period, the disease can be either present or absent and that when the disease is absent, no cases will be reported [15, 18]. The main difference between hurdle and ZI models is that, when the disease is present, a hurdle model assumes the reported cases come from a zero‐truncated distribution, while a ZI model assumes the reported cases are generated by a count distribution that can produce zeroes, such as the negative binomial distribution. To illustrate how these differences relate to assumptions about disease detection, let z be the actual counts when the disease is present, and y be the reported counts when the disease is present. When the disease is present, the actual counts z must be greater than zero, and so it is reasonable to assume z follows a zero‐truncated negative binomial model, that is,

$$z \mid \lambda, r \sim \mathrm{ZTNB}(\lambda, r) \tag{1}$$

where λ is the mean value and r is the over‐dispersion parameter of the original negative binomial distribution before the truncation. Then, it is reasonable to assume that the distribution of the reported cases y, given the actual number of cases z, follows a binomial distribution, that is,

$$y \mid z, p_0 \sim \mathrm{Binomial}(z, p_0) \tag{2}$$

where p0 is the probability of reporting any one case when the disease is present. Under these assumptions, it can be shown that the marginal distribution of y (with respect to z) is given by

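$$
p(y \mid \lambda, r, p_0) =
\begin{cases}
\dfrac{\left(\frac{r}{r+\lambda}\right)^{r} p_0^{y}\,(r+y-1)!\,\left(\frac{\lambda}{\lambda+r}\right)^{y}\left(\frac{\lambda p_0+r}{\lambda+r}\right)^{-r-y}}{(r-1)!\;y!\,\left(1-\left(\frac{r}{r+\lambda}\right)^{r}\right)}, & y>0,\\[2ex]
\dfrac{\left(\frac{r}{\lambda}\right)^{r}\left(1-\left(\frac{\lambda p_0+r}{\lambda+r}\right)^{-r}\right)}{\left(\frac{r}{\lambda}\right)^{r}-\left(\frac{\lambda+r}{\lambda}\right)^{r}}, & y=0
\end{cases}
\tag{3}
$$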

See Section 1 of the Supporting Information (SM) for the derivation of this distribution.

Under perfect detection, when p0=1, then y=z; that is, y follows a hurdle model with the count part given by Equation (1). If, instead, p0<1, the observed zeroes could represent the situation wherein the disease is present but is undetected. This leads to a ZI model with count part given by Equation (3). Although this looks like a reasonable model, in practice, it cannot be used. This is because the p.m.f. in Equation (3) involves the reporting probability p0 which is not identifiable from the reported cases alone [19].

Commonly, one chooses a negative binomial random variable W to accommodate the count part of a ZI model. Therefore, one can think that the distribution of W approximates the distribution of the true reported counts y [20]. To check the appropriateness of this approximation, one can match the mean and variance of W with those of the exact distribution in Equation (3),

$$W \sim \mathrm{NB}\left(\mu^{(w)}, r^{(w)}\right) \tag{4}$$

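where $\mu^{(w)} = \dfrac{p_0\lambda}{1-\left(\frac{r}{r+\lambda}\right)^{r}}$ and $r^{(w)} = \dfrac{1}{\left(1-\left(\frac{r}{r+\lambda}\right)^{r}\right)\left(1+\frac{1}{r}\right)-1}$. See Section 1 of the SM for this derivation.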

Figure 2 compares the exact distribution of the reported counts in Equation (3) with the approximating negative binomial distribution in Equation (4) under three scenarios. When the reporting rate is small (p0=10%), the negative binomial approximation is close to the exact distribution of the reported counts. This suggests that a ZI model with the commonly used negative binomial count part is a reasonable choice under a small reporting rate. However, the approximation becomes poor as the reporting rate increases. This suggests that a hurdle model may be more applicable when the reporting rate p0 is close to 1. Note that this result is intuitive: if the reporting rate is close to 1, we would not expect many zeros due to a failure to detect the disease, even if the expected number of actual cases were small. The negative binomial distribution places a lot of weight on 0 when the mean is small, and thus it would not fit the true reported cases well.
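The comparison can be checked numerically. The following Python sketch is only an illustration and is not the authors' code: the function names are hypothetical, and the exact p.m.f. is evaluated through an algebraically equivalent form of Equation (3), namely a negative binomial with mean λp0 and overdispersion r divided by the probability that the untruncated actual count is positive, with the mass of the removed zero subtracted at y=0.

```python
import math

def nb_pmf(k, mu, r):
    """Negative binomial pmf with mean mu > 0 and overdispersion r > 0."""
    log_p = (math.lgamma(r + k) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def reported_pmf(y, lam, r, p0):
    """Exact pmf of the reported counts y (equivalent form of Equation (3))."""
    zero_mass = (r / (r + lam)) ** r      # P(z = 0) before truncation
    c = 1.0 - zero_mass                   # normalizing constant of the ZTNB
    if y == 0:
        return (nb_pmf(0, lam * p0, r) - zero_mass) / c
    return nb_pmf(y, lam * p0, r) / c

def matched_nb_params(lam, r, p0):
    """Mean and overdispersion of the moment-matched NB in Equation (4)."""
    c = 1.0 - (r / (r + lam)) ** r
    mu_w = p0 * lam / c
    r_w = 1.0 / (c * (1.0 + 1.0 / r) - 1.0)
    return mu_w, r_w

def zero_gap(lam, r, p0):
    """Absolute difference between exact and approximate P(y = 0)."""
    mu_w, r_w = matched_nb_params(lam, r, p0)
    return abs(reported_pmf(0, lam, r, p0) - nb_pmf(0, mu_w, r_w))

if __name__ == "__main__":
    # Same lambda and r as in Figure 2; reporting rates from its three panels
    for p0 in (0.1, 0.3, 0.8):
        print(p0, zero_gap(8.0, 2.0, p0))
```

Under these assumptions, the gap in the probability of a zero is one simple way to see where the two distributions disagree; the full p.m.f.s can be compared in the same way by looping over y.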

FIGURE 2. Probability mass functions of the true distribution of reported counts (y) and the approximated NB distribution (W) under (a) a low reporting rate (p0=0.1), (b) a moderate reporting rate (p0=0.3), and (c) a large reporting rate (p0=0.8). The over‐dispersion of the actual counts (r=2) and the incidence of actual cases (λ=8) are the same for all three scenarios.

Note that the discussion above is simply to motivate the comparison between ZI and hurdle models when modeling reported cases of a disease. The proposed approaches in the following subsections do not accommodate under‐reporting. This is outside the scope of this paper. See Stoner et al. [19] and Oliveira et al. [21] for examples of models that handle under‐reporting. For the zero‐inflated models considered here, we only focus on modeling the mean and overdispersion of the reported, not actual, counts, that is, μ(w) and r(w) in Equation (4). Since we only model the mean of the reported counts, we cannot tell if a covariate effect is due to changes in disease transmission or reporting rates [22]. Note that, from Equation (4), both μ(w) and r(w) depend on λ, the mean of the actual counts, which likely varies across space and time. Therefore, in Section 3 below, we assume that both the mean and overdispersion of the Markov switching zero‐inflated and hurdle models can depend on covariates.

3. Modeling Zeros: Zero‐Inflated and Hurdle Models

Assume we have infectious disease cases in areas $i=1,2,\dots,N$ and at times $t=1,2,\dots,T$. Let $y_{it}$ be the reported disease case counts from area $i$ at time $t$. Let $X_{it}$ be a binary random variable that indicates the true presence or absence of the disease in area $i$ at time $t$, that is, $X_{it}=1$ if the disease is present and $X_{it}=0$ if the disease is absent. Let $y_k=(y_{1k},y_{2k},\dots,y_{Nk})^T$ be the vector containing the observed counts of the disease across the areas at time $k$. Additionally, let $y^{(t-1)}=(y_1,y_2,\dots,y_{t-1})^T$ be the vector of counts up to time $t-1$. Finally, all available observations are stacked onto the vector $y=(y_1,y_2,\dots,y_T)^T$.

Under the assumptions of a zero‐inflated or hurdle model, when the disease is present, the reported cases are generated by a count distribution, $p(y_{it} \mid \theta_{it}, X_{it}=1, y^{(t-1)})$, and when the disease is absent, no cases are reported. Note that $\theta_{it}$ is the parameter vector defining the distribution of $y_{it}$ when the disease is present. To save space, we suppress the conditioning on $X_{it}=1$ and $y^{(t-1)}$ and write $p(y_{it} \mid \theta_{it})$. Generally, the reported cases for area $i$ at time $t$, given the disease's presence/absence status and the vector of counts up to time $t-1$, can be expressed as

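$$
y_{it} \mid X_{it}, y^{(t-1)} \sim
\begin{cases}
0, & \text{if } X_{it}=0 \text{ (absence)},\\
p(y_{it} \mid \theta_{it}), & \text{if } X_{it}=1 \text{ (presence)}
\end{cases}
\tag{5}
$$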

As discussed in Section 2, a negative binomial distribution is a reasonable specification for p(yit|θit) if the reporting rate is small. Then a zero that comes from the negative binomial distribution represents the disease going undetected, while a zero from the zero mass distribution represents the true absence of the disease. Such models are known as zero‐inflated models [15]. In contrast to zero‐inflated models, if we assume perfect detection of the disease, we can specify p(yit|θit) as a zero‐truncated negative binomial distribution since, under perfect detection, there will always be cases reported when the disease is present. Such models are known as hurdle models [18].

The practical difference between ZI and hurdle models is that, in ZI models, zeros can come from either disease absence or undetected cases, while in hurdle models, zero counts can only arise from the actual absence of the disease. That is, hurdle models assume perfect detection of the cases, or at least that undetected cases are too few to be relevant, whereas zero‐inflated models allow for the imperfect detection of disease cases.

3.1. Modeling the Presence/Absence of Disease

Whether the disease is present or absent, we would expect it to be more likely to remain in the same state at the next reporting time. Therefore, we assume that $X_{it}$, conditional on all the cases reported before time $t$, $y^{(t-1)}$, and on the presence/absence of the disease in all neighboring areas at the previous time, denoted by $X_{(-i)(t-1)}$, follows a two‐state non‐homogeneous Markov chain. The transition probability matrix of the Markov chain is given by

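$$
\begin{array}{c|cc}
 & X_{it}=0 & X_{it}=1\\ \hline
X_{i,t-1}=0 & 1-p01_{it} & p01_{it}\\
X_{i,t-1}=1 & 1-p11_{it} & p11_{it}
\end{array}
\tag{6}
$$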

where

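$$
\begin{aligned}
p01_{it} &= P\left(X_{it}=1 \mid X_{i,t-1}=0,\, y^{(t-1)},\, X_{(-i)(t-1)}\right) && \text{(probability of reemergence)},\\
p11_{it} &= P\left(X_{it}=1 \mid X_{i,t-1}=1,\, y^{(t-1)},\, X_{(-i)(t-1)}\right) && \text{(probability of persistence)}
\end{aligned}
$$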

We also want a statistical model that can investigate how disease persistence and reemergence may be explained by multiple risk factors. Due to the characteristics of infectious diseases, the disease is more likely to be present in an area when the disease is present in its neighboring areas in the previous week. Therefore, the probability of reemergence in area i at time t, that is, p01it, can depend on a vector of risk factors git and the spatial neighbors as

$$\mathrm{logit}(p01_{it}) = \alpha_0^{(0)} + g_{it}^T\alpha^{(0)} + \gamma_1 \sum_{j \in Nei(i)} X_{j,t-1} \tag{7}$$

where $Nei(i)$ represents the set of all neighboring areas of area $i$, and $g_{it}=(g_{it}^{(1)},g_{it}^{(2)},\dots,g_{it}^{(D)})^T$ is a $D$‐dimensional covariate vector. Similarly, the probability of persistence in area $i$ at time $t$, that is, $p11_{it}$, is modeled as

$$\mathrm{logit}(p11_{it}) = \alpha_0^{(1)} + \delta^{(1)}\log(y_{i,t-1}+1) + g_{it}^T\alpha^{(1)} + \gamma_2 \sum_{j \in Nei(i)} X_{j,t-1} \tag{8}$$

where $\log(y_{i,t-1}+1)$ is a term representing the reported case counts for area $i$ at time $t-1$. The term $\log(y_{i,t-1}+1)$ is included in the model for $p11_{it}$ because it is reasonable to assume that the disease is less likely to go extinct if there were many cases in the previous week. In Equations (7) and (8), $\alpha^{(0)}$ and $\alpha^{(1)}$ represent the effects of the covariates on the disease's reemergence and persistence probabilities, respectively, and they can differ. Note that this is distinct from more classical zero‐inflated and hurdle models where $\alpha^{(0)}=\alpha^{(1)}=\alpha$ and $\alpha_0^{(0)}=\alpha_0^{(1)}=\alpha_0$, that is, each covariate must have the same effect on the reemergence and persistence of the disease [23]. For the Markov chain, we also need to set initial state distributions for the first time point in each area, which we denote by $p(X_{i1})$ for $i=1,2,\dots,N$.
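As a minimal sketch of Equations (7) and (8), the following Python function maps the linear predictors to the reemergence and persistence probabilities for a single area and week. The function and argument names are hypothetical and chosen only for illustration; the number of neighboring areas with the disease present in the previous week is passed in directly rather than computed from a neighborhood structure.

```python
import math

def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def transition_probs(g, y_prev, n_neighbors_present,
                     alpha0_re, alpha_re, gamma1,
                     alpha0_pe, alpha_pe, delta1, gamma2):
    """Return (p01, p11) for one area-week.

    g                    covariate vector g_it
    y_prev               reported cases in the same area at time t-1
    n_neighbors_present  sum of X_{j,t-1} over the neighbors of the area
    *_re / gamma1        reemergence coefficients, Equation (7)
    *_pe / delta1 / gamma2  persistence coefficients, Equation (8)
    """
    lin_re = (alpha0_re
              + sum(a * x for a, x in zip(alpha_re, g))
              + gamma1 * n_neighbors_present)
    lin_pe = (alpha0_pe
              + delta1 * math.log(y_prev + 1)
              + sum(a * x for a, x in zip(alpha_pe, g))
              + gamma2 * n_neighbors_present)
    return inv_logit(lin_re), inv_logit(lin_pe)
```

Keeping two separate coefficient sets is what lets a covariate act differently on reemergence and persistence; setting them equal (and the neighbor and lagged-count effects to zero) makes the two probabilities coincide.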

3.1.1. Modeling the Parameters of the Count Part

It is assumed that p(yit|θit) in Equation (5) follows either a negative binomial distribution, in the case of the ZI models, or a truncated negative binomial distribution, in the case of the hurdle models, with mean μit and overdispersion parameter rit (for the ZTNB μit and rit are the mean and overdispersion parameter of the NB before truncation).

For infectious disease counts, previous cases are likely to transmit the disease to other individuals, creating new cases. That is, for an area $i$, the previously reported cases, $y_{i,t-1}$, may affect the expected value $\mu_{it}$ of the reported cases $y_{it}$. Thus, we decompose $\mu_{it}$ as in Bauer and Wakefield (2018) [24], that is,

$$\mu_{it} = \mu_{it}^{AR}\, y_{i,t-1} + \mu_{it}^{EN} \tag{9}$$

where μitAR is the autoregressive rate, which is a multiplier on the previous week's cases that is meant to capture transmission from the previous cases, and μitEN is an endemic component meant to capture infectious risk from other sources like the environment and imported cases.

The autoregressive rate $\mu_{it}^{AR}$ is modeled as

$$\mu_{it}^{AR} = \exp\left(b_{0i} + g_{it}^T\beta^{AR}\right) \tag{10}$$

where $b_{0i} \mid \sigma_{b_0}^2 \overset{IID}{\sim} N(\beta_0^{AR}, \sigma_{b_0}^2)$ is an area‐level random effect and $\beta^{AR}$ represents the possible effects of risk factors $g_{it}$ on $\mu_{it}^{AR}$. The endemic part $\mu_{it}^{EN}$ is modeled as

$$\mu_{it}^{EN} = \exp\left(b_i + \beta_2^{EN}\sin\left(\frac{t}{52}2\pi\right) + \beta_3^{EN}\cos\left(\frac{t}{52}2\pi\right)\right) \tag{11}$$

where $b_i \mid \sigma_b^2 \overset{IID}{\sim} N(\beta_0^{EN} + \beta_1^{EN}\log(N_i), \sigma_b^2)$ is an area‐level random effect whose mean is a linear function of the log population size, $\log(N_i)$, of the $i$th district. A possible annual seasonal component is modeled by the sine and cosine components [25]. It is known that environmental variables such as temperature and precipitation impact the life cycle of the mosquito that transmits chikungunya [15]. As we did not have access to these environmental variables in Rio de Janeiro, we include sine/cosine components as a surrogate to account for the seasonal structure that might be present in the data. We expect there to be more reported cases in the summer than in the winter due to the strong effects of climate variables on the mosquito's life cycle [26].
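The decomposition in Equations (9) to (11) can be sketched in a few lines of Python. This is an illustrative sketch only: the function and argument names are hypothetical, and the random effects are passed in as fixed numbers rather than sampled from their normal distributions.

```python
import math

def expected_count(y_prev, t, g, b0_i, beta_ar, b_i, beta2_en, beta3_en):
    """Mean of the count part for one area-week, following Equations (9)-(11).

    y_prev    reported cases in the same area at week t-1
    t         week index, with a 52-week seasonal period
    g         covariate vector g_it
    b0_i      area-level random effect of the autoregressive rate
    beta_ar   covariate effects on the autoregressive rate
    b_i       area-level random effect of the endemic component
    beta2_en, beta3_en  sine and cosine seasonal coefficients
    """
    mu_ar = math.exp(b0_i + sum(b * x for b, x in zip(beta_ar, g)))
    angle = 2.0 * math.pi * t / 52.0
    mu_en = math.exp(b_i + beta2_en * math.sin(angle) + beta3_en * math.cos(angle))
    return mu_ar * y_prev + mu_en
```

With all coefficients set to zero, both rates equal one, so the mean reduces to the previous week's count plus one; the endemic term keeps the mean strictly positive even when no cases were reported in the previous week.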

As shown in Equation (4), the overdispersion parameter of the negative binomial approximation to the true reported counts depends on the expected number of actual counts and so should vary across space and time with covariates. Therefore, we model the overdispersion parameter of the NB and ZTNB distributions, rit, as a log‐linear function of covariates and past cases,

$$\log(r_{it}) = \alpha_0^{(2)} + g_{it}^T\alpha^{(2)} + \delta^{(2)}\log(y_{i,t-1}+1) \tag{12}$$

In this paper, we will refer to the models defined by the following equations:

  • ZINB: Equation (5) with $p(y_{it} \mid \theta_{it})=\mathrm{NB}(\mu_{it},r_{it})$; Equations (6), (7), (8) with $\alpha^{(0)}=\alpha^{(1)}=\alpha$, $\alpha_0^{(0)}=\alpha_0^{(1)}=\alpha_0$ and $\gamma_1=\gamma_2=0$; Equations (9), (10), (11), (12),

  • NBH: Equation (5) with $p(y_{it} \mid \theta_{it})$ following a zero‐truncated negative binomial distribution, that is, $p(y_{it} \mid \theta_{it})=\mathrm{ZTNB}(\mu_{it},r_{it})$; Equations (6), (7), (8) with $\alpha^{(0)}=\alpha^{(1)}=\alpha$, $\alpha_0^{(0)}=\alpha_0^{(1)}=\alpha_0$ and $\gamma_1=\gamma_2=0$; Equations (9), (10), (11), (12),

  • ZS‐MSNB (Zero‐state Markov switching negative binomial) [11]: Equation (5) with $p(y_{it} \mid \theta_{it})=\mathrm{NB}(\mu_{it},r_{it})$; Equations (6), (7), (8), (9), (10), (11), (12),

  • ZS‐MSNBH (Proposed zero‐state Markov switching negative binomial hurdle): Equation (5) with $p(y_{it} \mid \theta_{it})=\mathrm{ZTNB}(\mu_{it},r_{it})$; Equations (6), (7), (8), (9), (10), (11), (12).

The ZINB and NBH models represent classical commonly fit versions of ZI and hurdle models [7, 27] while the ZS‐MSNB and ZS‐MSNBH models represent their Markov switching counterparts. The Markov switching models have some important advantages, including allowing for separate covariate effects between the reemergence and persistence. They can also more easily account for many consecutive 0s and positive counts since when the disease is in the presence or absence states, it is usually more likely to remain there due to the Markov chain [11].

There are some similarities between the specifications of the ZS‐MSNB and ZS‐MSNBH models. For a specific number of reported cases in district i at time t, yit, they both assume a latent indicator variable Xit to distinguish the case‐generating process. However, the indicator variable Xit in a ZS‐MSNB model is assumed to be not observed when there are zero reported cases [6]. In the ZS‐MSNB model, both the negative binomial process and the zero process can produce a zero count, which means an observed zero count could be due to either the disease being absent or undetected. These differences in the model specification of yit|Xit lead to divergent interpretations. The ZS‐MSNBH model assumes perfect detection of the counts, while the ZS‐MSNB model allows for the imperfect detection of the disease counts.

Furthermore, ZS‐MSNB and ZS‐MSNBH are a priori plausible for different patterns of case data. When a time series shows switching between long periods of only zero counts and long periods of positive counts, interspersed with some zeros, a ZS‐MSNB model is more applicable, like the time series of reported cases shown in Figure 1b. In contrast, for the case where a time series shows switching between long periods of zero counts and long periods of only positive counts, a ZS‐MSNBH model may fit better than a ZS‐MSNB model.

4. Inferential Procedure

Let $X=(X_1,X_2,\dots,X_T)^T$ be the vector of all state indicators, where $X_t=(X_{1t},X_{2t},\dots,X_{Nt})^T$. Let $\Theta_0$ be the whole parameter vector apart from the state indicators $X$.

In a ZS‐MSNBH model, the marginal likelihood function given y, marginalizing out the state indicators, is given by

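$$
\begin{aligned}
p(y \mid \Theta_0) &= \prod_{i=1}^{N}\prod_{t=2}^{T} p\left(y_{it} \mid y^{(t-1)}, \Theta_0\right)\\
&= \prod_{i=1}^{N}\prod_{t=2}^{T}\Big\{\mathrm{ZTNB}(y_{it} \mid \mu_{it}, r_{it})\; p01_{it}^{\,1-I[y_{i,t-1}>0]}\; p11_{it}^{\,I[y_{i,t-1}>0]}\\
&\qquad\qquad + I[y_{it}=0]\,(1-p01_{it})^{1-I[y_{i,t-1}>0]}\,(1-p11_{it})^{I[y_{i,t-1}>0]}\Big\}
\end{aligned}
\tag{13}
$$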

where $I[\cdot]$ represents an indicator function, and $\mathrm{ZTNB}(y_{it} \mid \mu_{it}, r_{it})$ represents a zero‐truncated negative binomial distribution, where the mean and over‐dispersion parameters of the associated negative binomial are given by $\mu_{it}$ and $r_{it}$, respectively. We follow the Bayesian paradigm to estimate the parameters of the models. One of the reasons for using the Bayesian approach is that we cannot marginalize out $X$ in the ZS‐MSNB model, and so it is the only tractable method for that model [11]. Further, the Bayesian approach provides uncertainty quantification about the estimates of interest in a straightforward fashion. We assume prior independence among the components of $\Theta_0$. We specify independent, zero‐mean normal prior distributions, with some large variance, for $\beta^{AR}$, $\beta_2^{EN}$, $\beta_3^{EN}$, $\beta_0^{EN}$, $\beta_1^{EN}$, $\alpha_0^{(0)}$, $\alpha_0^{(1)}$, $\alpha_0^{(2)}$, $\alpha^{(0)}$, $\alpha^{(1)}$, $\alpha^{(2)}$, $\delta^{(1)}$, $\delta^{(2)}$, $\gamma_1$ and $\gamma_2$; an inverse gamma prior for $\sigma_{b_0}^2$; and a uniform prior for $\sigma_b$. Regardless of the prior specification, the posterior distribution is not available in closed form. Thus, we use Markov chain Monte Carlo methods, particularly a Gibbs sampler with some Metropolis‐Hastings steps, to draw samples from the resulting posterior distribution.
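To make the structure of the hurdle likelihood concrete, the Python sketch below evaluates the log of a single factor of Equation (13) for one area and week. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the mean, overdispersion and transition probabilities are passed in as already computed numbers.

```python
import math

def nb_logpmf(k, mu, r):
    """Log pmf of a negative binomial with mean mu and overdispersion r."""
    return (math.lgamma(r + k) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

def ztnb_pmf(k, mu, r):
    """Zero-truncated NB pmf; mu and r refer to the NB before truncation."""
    if k == 0:
        return 0.0
    return math.exp(nb_logpmf(k, mu, r)) / (1.0 - (r / (r + mu)) ** r)

def hurdle_loglik_term(y, y_prev, mu, r, p01, p11):
    """Log of one (i, t) factor of the marginal likelihood in Equation (13).

    Under the hurdle model the previous state is observed as I[y_prev > 0],
    so the relevant probability of presence is p11 if y_prev > 0, else p01.
    """
    p_present = p11 if y_prev > 0 else p01
    if y == 0:
        # A zero can only arise from absence of the disease
        return math.log(1.0 - p_present)
    # A positive count requires presence and a draw from the ZTNB
    return math.log(ztnb_pmf(y, mu, r) * p_present)
```

Because the state is a deterministic function of the observed count in the hurdle model, no summation over latent states is needed, which is why this likelihood is available in closed form while the ZS‐MSNB likelihood is not.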

In a ZS‐MSNB model, the joint likelihood function considering X and y is given by

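$$p(y, X \mid \Theta_0) = \prod_{i=1}^{N}\prod_{t=2}^{T} p\left(y_{it} \mid X_{it}, y^{(t-1)}, \Theta_0\right) \times \prod_{i=1}^{N} p(X_{i1}) \prod_{t=2}^{T} p\left(X_{it} \mid X_{t-1}, y^{(t-1)}, \Theta_0\right) \tag{14}$$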

The Gibbs sampler procedure for the ZS‐MSNB model is challenging as X is not fully observed and X cannot be marginalized from the likelihood function. Therefore, we follow a data augmentation algorithm to obtain samples from the posterior distribution of this model [11].

4.1. Model Comparison Criteria, Temporal Prediction and Missing Values

In our Bayesian framework, we can use the Watanabe‐Akaike information criterion (WAIC) [28] to compare different model specifications. For a ZS‐MSNBH model, the WAIC is calculated by

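$$
\begin{aligned}
\mathrm{lpdd} &= \sum_{i=1}^{N}\sum_{t=2}^{T}\log\left(\frac{1}{Q-M}\sum_{m=M+1}^{Q} p\left(y_{it} \mid y^{(t-1)}, \Theta_0^{[m]}\right)\right),\\
p_{waic} &= \sum_{i=1}^{N}\sum_{t=2}^{T}\mathrm{Var}_{m=M+1}^{Q}\,\log p\left(y_{it} \mid y^{(t-1)}, \Theta_0^{[m]}\right),\\
\mathrm{WAIC} &= -2\left(\mathrm{lpdd} - p_{waic}\right)
\end{aligned}
\tag{15}
$$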

where the superscript $[m]$ denotes a draw from the posterior distribution of the parameter, $M$ is the size of the burn‐in period, $Q$ is the size of the MCMC sample, and $\mathrm{Var}_{m=M+1}^{Q} z_m = \frac{1}{Q-M}\sum_{m=M+1}^{Q}(z_m-\bar{z})^2$ represents the sample variance. The WAIC calculation is different for the ZS‐MSNB model. We follow the method where the calculation is conditional on the state in the ZS‐MSNB model [11], while the state is marginalized in the ZS‐MSNBH model as shown in Equation (15). Since it is not easy to integrate out $X$ from Equation (14), applying WAIC to compare a ZS‐MSNB model to the other models can be unfair because it has many more parameters [29]. Therefore, we only use WAIC for choosing between separate specifications of the same class of models, while we use proper scoring rules, explained in more detail below, for comparing the predictive performance of different models. A model specification with the lowest WAIC is considered to have the best fit, and two specifications with a difference of 10 or more in WAIC are usually considered to differ significantly.
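The WAIC in Equation (15) only requires the pointwise log‐likelihood evaluated at each retained posterior draw. The Python sketch below assumes these values are already stored in a nested list with one row per post‐burn‐in draw and one column per observation; the function name and the input layout are hypothetical choices made for illustration.

```python
import math

def waic(loglik_draws):
    """Compute (WAIC, lpdd, p_waic) as in Equation (15).

    loglik_draws[m][n] holds log p(y_n | y^(t-1), Theta_0^[m]) for retained
    draw m and observation n (one observation per area-week pair).
    """
    n_draws = len(loglik_draws)
    n_obs = len(loglik_draws[0])
    lpdd = 0.0
    p_waic = 0.0
    for n in range(n_obs):
        column = [draw[n] for draw in loglik_draws]
        # log of the posterior mean likelihood, via log-sum-exp for stability
        peak = max(column)
        lpdd += peak + math.log(sum(math.exp(v - peak) for v in column) / n_draws)
        # variance of the log-likelihood across draws, with divisor n_draws
        mean = sum(column) / n_draws
        p_waic += sum((v - mean) ** 2 for v in column) / n_draws
    return -2.0 * (lpdd - p_waic), lpdd, p_waic
```

The log‐sum‐exp step is a numerical convenience rather than part of the definition: it avoids underflow when individual likelihood values are very small.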

Proper scoring rules [30] compare different models based on their out‐of‐sample predictive performance. Scoring rules measure how good the probabilistic forecasts are by assigning scores based on the predictive distribution and the observation [30]. One of the most popular proper scoring rules is the Ranked Probability Score (RPS) [31]. The model with the lowest RPS is considered the best predictive model.

To produce K‐step‐ahead temporal predictions, we used a simulation process, detailed in SM Section 2, to draw multiple samples from the posterior predictive distributions. Let $T_0$ be the final time point that was used for model fitting; the out‐of‐sample prediction is performed by obtaining a sample from the posterior predictive distribution at time $T_0+k$ for $k=1,2,\dots,K$, where $K$ is the maximum number of steps ahead we are interested in. A realization from the posterior predictive distribution is denoted as $y_{i,T_0+k}^{[m]} \sim p(y_{i,T_0+k} \mid y)$. See SM Section 2 for the Monte Carlo approximations of the posterior predictive distributions for the ZS‐MSNB and ZS‐MSNBH models.

To compare the models in terms of their ability to predict the cases, we used the ranked probability score approximated by draws from the posterior predictive distributions. The ranked probability score [31] for the k‐th step ahead prediction in district i is defined as

$$\mathrm{rps}(i,T_0,k) = \sum_{j=0}^{\infty}\left(P_{i,T_0,k}(j) - I\left[y_{i,T_0+k}^{(obs)} \le j\right]\right)^2 \tag{16}$$

where $y_{i,T_0+k}^{(obs)}$ is the observed future value for district $i$, and $P_{i,T_0,k}(j)$ is the empirical cumulative distribution function calculated using the draws $y_{i,T_0+k}^{[m]} \sim p(y_{i,T_0+k} \mid y)$, evaluated at $j$. The overall ranked probability score for $k$‐step‐ahead prediction is given by the average ranked probability score over all districts and over a set of time points from $T_a$ to $T_b$, that is,

$$\mathrm{rps}(k) = \frac{1}{N(T_b-T_a+1)}\sum_{i=1}^{N}\sum_{T_0=T_a}^{T_b}\mathrm{rps}(i,T_0,k) \tag{17}$$

The model with the lowest rps(k) is considered to be the best model at k‐step‐ahead prediction for the evaluation period Ta to Tb.
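Equation (16) can be approximated directly from posterior predictive draws. The Python sketch below is a minimal illustration with hypothetical function names; it assumes the draws are non‐negative integers and truncates the infinite sum at the largest value among the draws and the observation, beyond which every term is zero.

```python
def rps_from_draws(draws, y_obs):
    """Ranked probability score of Equation (16) for one district and origin.

    draws  posterior predictive draws of the future count (non-negative ints)
    y_obs  observed future count
    """
    n = len(draws)
    upper = max(max(draws), y_obs)
    counts = [0] * (upper + 1)
    for d in draws:
        counts[d] += 1
    score = 0.0
    cumulative = 0
    for j in range(upper + 1):
        cumulative += counts[j]
        ecdf = cumulative / n                      # empirical CDF at j
        indicator = 1.0 if y_obs <= j else 0.0     # I[y_obs <= j]
        score += (ecdf - indicator) ** 2
    return score

def average_rps(scores):
    """Average of individual scores, as in Equation (17)."""
    return sum(scores) / len(scores)
```

For example, `rps_from_draws([0, 0], 2)` accumulates a squared error of one at j=0 and at j=1, giving a score of 2, while a forecast whose draws all equal the observation scores 0.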

Finally, if observations are missing from the middle part of the time series, we show that the proposed ZS‐MSNBH can accurately estimate the missing cases. See Section 7 of the SM for details.

4.2. Simulation Study

In Section 2, we showed that a hurdle model implicitly assumes the disease is perfectly detected, while a negative binomial ZI model gives a good approximation to the true distribution of reported counts when reporting rates are low. Therefore, the Markov switching hurdle model should better fit data with high reporting rates, and the Markov switching ZI model should better fit data with low reporting rates. Since the reporting rate is not known in a typical application, we designed a simulation study to investigate this hypothesis.

We simplified the models slightly for the simulation study to reduce the computational cost of running many simulations. For $i=1,2,\dots,159$ and $t=1,2,\dots,84$, the true cases, $z_{it}$, are simulated from Equation (5). We assume $p(z_{it} \mid \theta_{it})=\mathrm{ZTNB}(\mu_{it},r)$, where $\mu_{it}=\exp(\beta_0+\beta_1 \mathrm{HDI}_i+\beta_2 \mathrm{temp}_t)$. Here, $\mathrm{HDI}_i$ is the human development index for the $i$th neighborhood of Rio de Janeiro; see Section 5 below. The covariate $\mathrm{temp}_t$ is the monthly maximum temperature in Rio between 2011 and 2017 [11]. We also let the persistence and reemergence probabilities in Equations (7) and (8) depend on $\mathrm{HDI}_i$ and $\mathrm{temp}_t$. Therefore, we have the coefficients $\alpha^{(0)}=(\alpha_{HDI}^{(01)},\alpha_{temp}^{(01)})^T$ and $\alpha^{(1)}=(\alpha_{HDI}^{(11)},\alpha_{temp}^{(11)})^T$. For simplicity, we did not include $\delta^{(1)}$ in the model. The reported cases are then simulated from

$$y_{it} \mid z_{it} \sim \mathrm{Binomial}(p_0, z_{it}) \qquad (18)$$

where p0 is the reporting rate.

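In the simulation study, we set $\beta_0 = 0.5$, $\beta_1 = 0.1$, $\beta_2 = 0.4$, $\alpha^{(0)}_0 = 3$, $\alpha^{(01)}_{\mathrm{HDI}} = 1.15$, $\alpha^{(01)}_{\mathrm{temp}} = 1.1$, $\gamma_1 = 0.6$, $\alpha^{(1)}_0 = 1.5$, $\alpha^{(11)}_{\mathrm{HDI}} = 1.18$, $\alpha^{(11)}_{\mathrm{temp}} = 1.2$, $\gamma_2 = 0.3$ and $r = 1.5$. This corresponds to the disease being present around 50% of the time.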

Four reporting rates are considered in the simulation study: $p_0^{(1)} = 1$, $p_0^{(2)} = 0.8$, $p_0^{(3)} = 0.6$ and $p_0^{(4)} = 0.1$. For each of these scenarios, we simulated data as described above. Then, we fitted the ZS‐MSNB and ZS‐MSNBH models to investigate which model predicts the reported counts better under each reporting rate, according to proper scoring rules. We used $T_a = 40$ to $T_b = 80$ as the evaluation period. The ranked probability scores for the Markov‐switching ZI and hurdle models are shown in Figure 3, while a comparison of logarithmic scores is shown in SM Section 3 Figure S1. In Figure 3, the permutation test p‐values for the first forecast week under the four reporting rates are $1.55 \times 10^{-5}$ (100% reporting), 0.002 (80% reporting), 0.18 (60% reporting) and $3.8 \times 10^{-6}$ (10% reporting).
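To make the role of the reporting rate concrete, the following Python sketch applies the binomial thinning of Equation (18) to zero‐truncated negative binomial counts and records how often a present disease yields zero reported cases. It is a deliberately reduced version of the simulation design: the Markov chain, the covariates and the spatial terms are omitted, the mean `mu = 2.0` and the sample size `n` are arbitrary illustrative values, and all function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_ztnb(mu, r, rng):
    """Zero-truncated negative binomial: gamma-Poisson mixture, zeros rejected.

    mu is the mean and r the dispersion of the untruncated negative binomial.
    """
    while True:
        lam = rng.gammavariate(r, mu / r)  # shape r, scale mu / r, so E[lam] = mu
        z = sample_poisson(lam, rng)
        if z > 0:
            return z


def thin(z, p0, rng):
    """Binomial thinning: each true case is reported with probability p0."""
    return sum(1 for _ in range(z) if rng.random() < p0)


def zero_share_when_present(p0, mu=2.0, r=1.5, n=5000, seed=1):
    """Share of zero reported counts while the disease is present (z > 0)."""
    rng = random.Random(seed)
    reported = [thin(sample_ztnb(mu, r, rng), p0, rng) for _ in range(n)]
    return sum(1 for y in reported if y == 0) / n


# Reporting rates taken from the simulation study; everything else is illustrative.
shares = {p0: zero_share_when_present(p0) for p0 in (1.0, 0.8, 0.6, 0.1)}
```

With perfect reporting, the reported counts inherit the zero truncation, which matches the hurdle assumption. As the reporting rate falls, an increasing share of zeros arises while the disease is present, which is the situation a zero‐inflated count part is meant to accommodate.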

FIGURE 3.

Averaged RPS under the 100% (top left), 80% (top right), 60% (bottom left) and 10% (bottom right) reporting rates. The solid line represents the ZS‐MSNB model, while the dashed line represents the ZS‐MSNBH model.

The results from the RPS and logarithmic scores show that the Markov‐switching hurdle model gives more accurate predictions, compared to the Markov‐switching ZI model, under a large reporting rate (100%, 80% reporting). The two models produce similar predictions at a medium reporting rate (60% reporting), and the Markov switching ZI model produces more accurate predictions under a small reporting rate (10% reporting). This conclusion agrees with the motivation in Section 2.

5. Analysis of the Chikungunya Infection Data in Rio De Janeiro

In this section, we explore different model structures for the chikungunya dataset described in Section 1.1. We first assign a prior distribution to the parameter vector. We assume the parameters are independent a priori: all unbounded parameters receive a zero‐mean normal prior distribution with a large variance, and we set $\sigma^2_{b_0} \sim \mathrm{InvGamma}(0.1, 0.1)$ and $\sigma_b \sim \mathrm{Unif}(0, 10)$. For the ZS‐MSNB model, when $y_{i1} = 0$ we assign the initial state the prior distribution $X_{i1} \sim \mathrm{Bernoulli}(0.5)$, while if $y_{i1} > 0$ then $X_{i1} = 1$. For the ZS‐MSNBH model, there is no need to specify the initial distribution of $X_{i1}$, as $X_{i1} = I[y_{i1} > 0]$.

We first investigate the a priori plausibility of the ZI and hurdle models based on their assumptions. When chikungunya was introduced, its circulation was usually not characterized by health authorities, in which case substantial under‐reporting of cases is expected [32]. Therefore, the assumptions of the ZINB/ZS‐MSNB models, which allow for undetected disease cases, are more plausible than those of the NBH/ZS‐MSNBH models. Also, as discussed in Section 2, the likely low reporting rates suggest that a negative binomial count part for the ZI models is appropriate.

For each of the ZS‐MSNB and ZS‐MSNBH models, we use WAIC to compare the inclusion or exclusion of the spatial neighbor terms in Equations (7) and (8), that is, $\gamma_1$ and $\gamma_2$. As shown in Table 1, the WAIC supports the inclusion of the spatial terms for both Markov switching models. Therefore, in this section, we considered the ZS‐MSNB and ZS‐MSNBH models with spatial terms, as well as the NBH and ZINB models, as defined in Section 3. We also considered a model which assumes the disease is always present, that is, $X_{it} = 1$ for all $i$ and $t$, which we call the negative binomial (NB) model.
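For readers who want to reproduce this kind of comparison, the Python sketch below computes WAIC on the deviance scale from a matrix of pointwise log‐likelihood values, following the variance‐based penalty described by Gelman et al. [28]. It assumes that `log_lik[s][n]` holds the log‐likelihood of observation `n` under posterior draw `s`. Whether the conditional or the marginal likelihood is used (see [29]) is a modeling choice that this sketch does not address, and the function names are hypothetical.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def waic(log_lik):
    """WAIC on the deviance scale from log_lik[s][n]: draw s, observation n."""
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters (variance form)
    for n in range(n_obs):
        column = [draw[n] for draw in log_lik]
        lppd += log_mean_exp(column)
        p_waic += variance(column)
    return -2.0 * (lppd - p_waic)


# Toy input: 10 draws, 5 observations, each with log-likelihood -1.
toy_waic = waic([[-1.0] * 5 for _ in range(10)])
```

In the toy input the log‐likelihood does not vary across draws, so the penalty is zero and WAIC reduces to minus twice the summed log‐likelihood. As in Table 1, lower values indicate a better expected out‐of‐sample fit.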

TABLE 1. Different model specifications fitted to the chikungunya data, compared using WAIC. “Spatial/no spatial” represents the inclusion/exclusion of the spatial neighbor terms $\gamma_1$ and $\gamma_2$ (in Equations (7) and (8)). The best specification for each model is indicated in italics.

| Model | Specification | WAIC |
|---|---|---|
| ZS‐MSNB | No spatial | 75524.15 |
| ZS‐MSNB | Spatial | 68450.44 |
| ZS‐MSNBH | No spatial | 84958.10 |
| ZS‐MSNBH | Spatial | 80379.48 |

Motivated by our discussion in Section 1.1, the vector of covariates is specified as $g_{it} = (\mathrm{HDI}_i, \mathrm{pop}_i, \mathrm{greenarea}_i)^T$, where $\mathrm{HDI}_i$ is the Human Development Index in district $i$, $\mathrm{pop}_i$ is the population in district $i$ obtained from the 2010 Census, the latest available, and $\mathrm{greenarea}_i$ is the proportion of green areas in district $i$ [15]. We obtain the Human Development Index data from ipeadata (http://www.ipeadata.gov.br/Default.aspx), and we obtain the green area data from datario (http://www.data.rio).

The posterior distribution of the fitted models is obtained through MCMC methods, as described above and in Section 4, using the R package NIMBLE [33]. For all five models, we ran the Gibbs sampler for 80 000 iterations on 3 chains, with the initial 30 000 iterations discarded as burn‐in. All chains were started from random initial values to reduce the risk of becoming stuck in a local mode. The code to run the MCMC is available from GitHub (https://github.com/MingchiXu/Markov_Switching_Hurdle_code). To check the convergence of the chains, we used the Gelman‐Rubin statistic (< 1.05 for all estimated parameters) and the minimum effective sample size (> 1000) [34]. The fitted values in two example districts are shown in Section 4 of the SM for the ZS‐MSNB and ZS‐MSNBH models. The fitted values were constructed by simulating from the fitted models and show good agreement between the models and the observed data. However, as SM Section 4 shows, an issue with the ZS‐MSNBH model is that it must always switch states when the counts change from positive to zero and vice versa. This leads to rapid switching between periods of disease presence and absence, which seems unrealistic. Autocorrelation functions (ACF) of the Pearson residuals in two example districts are shown in SM Section 6. No significant autocorrelation is observed in the ACF plots, which suggests that the Markov‐switching hurdle model effectively captured the autocorrelation structure in the data.
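The convergence check can be illustrated with a short Python sketch of the basic Gelman‐Rubin potential scale reduction factor for a single parameter. This is the textbook form computed from within‐chain and between‐chain variances. Packages such as CODA [34] may apply further corrections, so values can differ slightly from those reported by such software. The function name `gelman_rubin` and the toy chains are hypothetical.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter; equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)   # W: average within-chain variance
    between = n * variance(chain_means)          # B: scaled variance of chain means
    pooled = (n - 1) / n * within + between / n  # pooled posterior variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(42)
# Three chains sampling the same distribution: the statistic should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains centred at different values: the statistic should be well above 1.05.
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 5.0, 10.0)]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

Starting the chains from dispersed random values, as done here, is what gives the statistic its diagnostic power: chains that settle in different regions inflate the between‐chain variance and push the value above the 1.05 threshold.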

Table 2 shows the posterior summaries from the count part of the ZS‐MSNB and ZS‐MSNBH models, that is, Equations (10) and (11) for the two fitted models. The coefficients for the population in both the autoregressive and endemic parts of the mean reported cases are positive, which means that higher‐populated districts have higher transmission of the disease. However, we found there is no evidence of an association between HDI and disease transmission. One possible explanation is that districts with higher HDI are likely associated with higher reporting rates. Therefore, the effects of reporting and disease transmission could cancel each other out. We also did not find an association between green areas and disease transmission.

TABLE 2. Posterior mean and 95% CI of the parameters in the structure of the expected reported cases for the models fitted to the chikungunya data. (A bold‐faced estimate means 0 is not included in the 95% CI.)

| Parameter | ZS‐MSNB: posterior mean (95% CI) | ZS‐MSNBH: posterior mean (95% CI) |
|---|---|---|
| $\beta_0^{AR}$ (Intercept) | −0.379 (−0.419, −0.341) | −0.216 (−0.244, −0.192) |
| $\beta_1^{AR}$ (pop) | 0.002 (0.001, 0.002) | 0.001 (0.000, 0.001) |
| $\beta_2^{AR}$ (HDI) | 0.046 (−0.473, 0.558) | 0.136 (−0.315, 0.336) |
| $\beta_3^{AR}$ (greenarea) | 0.161 (−0.324, 0.000) | −0.038 (−0.149, 0.068) |
| $\beta_2^{EN}$ (sine) | 0.583 (0.536, 0.630) | 0.65 (0.566, 0.733) |
| $\beta_3^{EN}$ (cosine) | −0.357 (−0.403, −0.310) | −0.335 (−0.414, −0.256) |
| $\beta_0^{EN}$ (Intercept) | −0.776 (−0.860, −0.694) | −1.352 (−1.465, −1.242) |
| $\beta_1^{EN}$ (log(pop) on Endemic) | 0.435 (0.357, 0.515) | 0.277 (0.195, 0.363) |

Figure 4 shows the estimated seasonal trend of the endemic rate under both Markov switching models. The seasonal trend is highest in the summer when mosquito activity is at its highest, and lowest during the winter. The ZS‐MSNBH model shows a similar seasonal variation compared to the ZS‐MSNB model.
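A quick way to gauge the strength of the seasonal pattern is to combine the sine and cosine coefficients from Table 2 into a single amplitude. The Python sketch below does this under the assumption that the seasonal part enters the log endemic rate as $\beta_2^{EN} \sin(\omega t) + \beta_3^{EN} \cos(\omega t)$ with one common frequency $\omega$. The exact form of the endemic component is given in Equation (11) and is not restated here, so this form is an assumption. The calculation plugs in posterior means, which only approximates the posterior mean of the ratio, and the function name is hypothetical.

```python
import math


def seasonal_summary(beta_sin, beta_cos):
    """Amplitude of a*sin(x) + b*cos(x) and the implied peak-to-trough rate ratio."""
    amplitude = math.hypot(beta_sin, beta_cos)  # sqrt(a^2 + b^2)
    # On the log scale the swing from trough to peak is 2 * amplitude.
    peak_to_trough = math.exp(2.0 * amplitude)
    return amplitude, peak_to_trough


# Posterior means of the sine and cosine coefficients from Table 2.
amp_zi, ratio_zi = seasonal_summary(0.583, -0.357)         # ZS-MSNB
amp_hurdle, ratio_hurdle = seasonal_summary(0.65, -0.335)  # ZS-MSNBH
```

Under these assumptions, both models imply an endemic rate at the seasonal peak that is roughly four times the rate at the trough, in line with the similar seasonal variation visible in Figure 4.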

FIGURE 4.

Posterior mean and 95% credible intervals of endemic rates for (a) Saúde district and (b) Campo Grande district by the ZS‐MSNB and ZS‐MSNBH models. Summer (December to March)/winter (June to September) seasons are highlighted in red/blue.

Table 3 presents the odds ratios for the Markov chain part (see Equations (7) and (8)) of the fitted ZS‐MSNB and ZS‐MSNBH models. The intercept row can be interpreted as the probabilities of reemergence or persistence, assuming one case reported previously for the persistence, keeping other covariates at their average values, and assuming no disease presence in any neighboring areas previously. From Table 3, the estimated probabilities of persistence are much higher than those of reemergence for both the ZS‐MSNB and ZS‐MSNBH models, meaning the models expect that if the disease is present (absent), it is more likely to remain present (absent) at the next time point. Also, in the ZS‐MSNBH model, the average probability of staying in the current state (presence or absence) is lower than in the ZS‐MSNB model. This is reasonable because the ZS‐MSNBH model is forced to switch states whenever the counts go from positive to zero and vice versa.

TABLE 3. Odds ratio or probability and 95% posterior credible intervals of parameters in the probabilities of reemergence and persistence for the Markov switching models (including spatial terms) fitted to the chikungunya data. The intercept row shows the probabilities of reemergence or persistence assuming one case reported previously for the persistence, keeping other covariates at their average values, and assuming no disease presence in any neighboring areas previously. A bold‐faced estimate means 1 is not included in the 95% CI.

| Covariates | Persistence (presence to presence): ZS‐MSNBH | Persistence (presence to presence): ZS‐MSNB | Reemergence (absence to presence): ZS‐MSNBH | Reemergence (absence to presence): ZS‐MSNB |
|---|---|---|---|---|
| Intercept (shifted avg prob) | 0.226 (0.212, 0.24) | 0.230 (0.196, 0.267) | 0.046 (0.044, 0.048) | 0.029 (0.026, 0.032) |
| HDI (.1) | 1.18 (0.53, 2.272) | 0.744 (0.098, 2.706) | 1.148 (0.671, 1.83) | 1.866 (0.553, 4.701) |
| Population (1 000s) | 1.004 (1.003, 1.005) | 0.999 (0.997, 1.001) | 1.009 (1.008, 1.009) | 1.006 (1.004, 1.008) |
| Green area (%) | 0.512 (0.406, 0.638) | 1.64 (0.932, 2.759) | 0.328 (0.28, 0.383) | 0.377 (0.258, 0.528) |
| Spatial effects (1) | 1.456 (1.416, 1.496) | 2.009 (1.893, 2.136) | 1.99 (1.946, 2.035) | 3.485 (3.22, 3.767) |
| $\log(y_{i(t-1)} + 1)$ | 7.7 (6.79, 8.726) | 3.427 (2.989, 3.911) | | |

A higher population size is associated with higher odds of chikungunya persistence and reemergence. This means the disease is less likely to go extinct and reemerges quickly in high‐population areas, which follows well‐known epidemiological theory [14]. Also, population size has a larger effect on the reemergence of the disease than on its persistence. Areas with more green space generally have lower odds of chikungunya reemergence and persistence. Disease presence in neighboring districts is associated with higher odds of both disease reemergence and persistence. The previous number of reported counts has a larger effect on disease persistence for the ZS‐MSNBH model than for the ZS‐MSNB model. This is likely because, in the hurdle model, the disease cannot persist when zero cases were reported previously, whereas this is possible in the ZI model. Therefore, to reach a comparable probability of persistence, the Markov switching hurdle model needs a larger slope than the Markov switching ZI model. Interestingly, HDI has no significant effect on the risk of chikungunya reemergence or persistence according to either model.
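To translate the odds ratios in Table 3 into probabilities, the baseline probability in the intercept row can be converted to odds, multiplied by the odds ratio, and converted back. The Python sketch below does this for the spatial effect on reemergence. It assumes a logistic link in Equations (7) and (8), takes a one‐unit increase in the spatial term as defined there, and plugs in posterior means rather than propagating posterior uncertainty. The function name `apply_odds_ratio` is hypothetical.

```python
def apply_odds_ratio(baseline_prob, odds_ratio, units=1.0):
    """Probability after multiplying the baseline odds by odds_ratio ** units."""
    odds = baseline_prob / (1.0 - baseline_prob)
    shifted = odds * odds_ratio ** units
    return shifted / (1.0 + shifted)


# Baseline reemergence probabilities and spatial odds ratios from Table 3.
reemergence_zi = apply_odds_ratio(0.029, 3.485)     # ZS-MSNB
reemergence_hurdle = apply_odds_ratio(0.046, 1.99)  # ZS-MSNBH
```

Under these assumptions, a one‐unit increase in the spatial term raises the reemergence probability from about 3% to about 9% in the ZS‐MSNB model and from about 5% to about 9% in the ZS‐MSNBH model. The two models therefore imply similar absolute probabilities despite quite different odds ratios.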

Posterior summaries of the eight‐weeks‐ahead forecasts for the ZS‐MSNB and ZS‐MSNBH models in two example districts are shown in Section 5 of the SM. The panels of Figure 5 show the four‐step‐ahead predictions at week T=231 for the five models considered in this section (NB, ZINB, NBH, ZS‐MSNB, and ZS‐MSNBH), for two districts. The posterior predictive means differ little across the models in both districts, but the widths of the 95% credible intervals differ. The Markov switching ZI model has narrower credible intervals than the other four models, which is likely due to the ZI models switching less between presence and absence compared to hurdle models.

FIGURE 5.

Four steps ahead prediction for (a) Penha district and (b) Estácio district at the observation week T=231 by different models and their 95% prediction interval.

Figure 6 shows the averaged RPS, up to four steps ahead, for the five models (NB, ZINB, NBH, ZS‐MSNB, and ZS‐MSNBH). We fit each model up to the time points $T_0 = 280, 281, \ldots, 380$, and calculated the averaged RPS of the 1st to 4th step‐ahead forecasts according to Equations (16) and (17). Samples from the models fit up to $T_0$ from 280 to 380 all converged according to the Gelman‐Rubin statistics [34]. The results suggest the NB, NBH, and ZINB models are close to each other in prediction. However, they are all significantly worse than our Markov switching zero‐inflated and hurdle models by a large gap at each forecast horizon. Between the two Markov switching models, we performed the permutation test [35] and found that the ZS‐MSNB model shows a significantly better predictive performance than the ZS‐MSNBH model at all forecast horizons (p‐values $< 2.2 \times 10^{-16}$). The test also shows there are significant differences between the best predictive model (the ZS‐MSNB model) and the others (all two‐tailed p‐values $< 2.2 \times 10^{-16}$).
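The idea behind a permutation test for two sets of forecast scores can be sketched in Python as follows. This is a generic paired sign‐flip version and is offered only as an illustration. The procedure in [35] may differ in how scores are paired, how many permutations are used and how the p‐value is computed. The function name and the toy scores are hypothetical.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided sign-flip permutation test for a difference in mean scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Under the null the two models are exchangeable, so each sign is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy scores: model A is worse than model B by 0.5 on every forecast.
scores_a = [1.0 + 0.01 * i for i in range(40)]
scores_b = [0.5 + 0.01 * i for i in range(40)]
p_different = paired_permutation_test(scores_a, scores_b)
p_identical = paired_permutation_test(scores_b, scores_b)
```

With this Monte Carlo version, the smallest attainable p‐value is 1 / (n_perm + 1), so the resolution of the test depends on the number of permutations drawn.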

FIGURE 6.

Averaged ranked probability scores across forecast horizons (weeks) according to five models.

6. Discussion

This paper focused on a comparison between Markov switching zero‐inflated and hurdle models when accounting for an excess of zeros in spatio‐temporal infectious disease counts. The class of zero‐state coupled Markov switching negative binomial (ZS‐CMSNB) models was recently introduced by Douwes‐Schultz and Schmidt (2022) [11]. Their approach assumes the disease switches between periods of presence and absence in each area through a series of coupled Markov chains. This has several advantages over more standard finite mixture zero‐inflated approaches, such as allowing one to model differently the persistence (presence at time $t-1$ to presence at time $t$) and reemergence (absence at time $t-1$ to presence at time $t$) of the disease [11]. The ZS‐MSNB model can be considered a type of zero‐inflated model since it assumes that when the disease is present in an area, the reported cases are generated by a negative binomial distribution, which can produce zeroes. In contrast, hurdle models, which are also often used to account for many zeroes in spatio‐temporal infectious disease counts, assume that the reported cases, when the disease is present, are generated by a zero‐truncated distribution [18]. Therefore, we explored differences in model assumptions and fit between hurdle and zero‐inflated models, specifically in an epidemiological and Markov switching context.

In Section 2, we showed that the hurdle model implicitly assumes the disease is perfectly detected, while the zero‐inflated model assumes there is some non‐zero probability of zero reported cases arising due to a failure to detect the disease. Further, we showed that using a negative binomial distribution as the count part of a zero‐inflated model, as is common, is only appropriate when the reporting rate is small. Therefore, we would recommend practitioners consider hurdle models for diseases with high detection rates and zero‐inflated models for diseases with low detection rates, which is further supported by our simulation study (Section 4.2).

The analysis of the first epidemic of chikungunya experienced in the city of Rio de Janeiro between 2015 and 2016 was discussed in Section 5. Different models were fitted to the data, the usual NB, ZINB, and NBH models, together with different structures of the ZS‐MSNB and ZS‐MSNBH models. Both Markov‐switching models produced better predictions compared to the NBH, ZINB, and negative binomial models. Among the fitted models, the ZS‐MSNB model was the one with the best predictions according to the ranked probability score. The result is plausible, as the ZS‐MSNB model makes the most realistic assumptions by allowing for zeroes due to undetected cases. The reporting rates for chikungunya are around 40% [36], which agrees with our speculation that a ZI model is more plausible for the chikungunya dataset than a hurdle model, since the reporting rate is likely to be small. We also found that chikungunya is more likely to persist and reemerge in areas with high population size, when there is a high amount of neighboring disease presence, when there is less green area, and when there are more cases reported in the previous week. We did not find an important association between HDI and the reemergence or persistence of chikungunya.

There are also some limitations to our approach. The Markov switching ZI model assumes the transition probabilities in Equation (6) can depend on covariates and latent disease states in neighboring areas. Therefore, the transition matrix may fluctuate between persistence and non‐persistence, leading to rapid switching between disease presence and absence, which is not realistic. From examining the fitted values, we found some evidence of rapid state switching in the smaller districts for the Markov switching ZI model, although the problem was much worse for the hurdle model, see Figure S2. The issue of rapid switching is innate for the Markov switching hurdle model, as it always switches states when the counts change from zero to positive or vice versa. However, for the ZI model, one solution is to introduce “clone states” with determined transitions to enforce minimum disease state durations, like in Douwes‐Schultz et al. (2024) [22].

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Data S1. Supporting Information.

SIM-44-0-s001.zip (860KB, zip)

Acknowledgments

Schmidt is grateful for financial support from the Natural Sciences and Engineering Research Council (NSERC) of Canada (Discovery Grant RGPIN‐2017‐04999) and Institut de Valorisation des Données (IVADO) (Schmidt – PRF‐2019‐6839748021 and Douwes‐Schultz PhD‐2021‐9070375349).

Funding: This work was supported by Institut de Valorisation des Données (Grant Nos. PhD‐2021‐9070375349, RF‐2019‐6839748021) and the Natural Sciences and Engineering Research Council of Canada (Grant No. RGPIN‐2017‐0499).

Data Availability Statement

The data and codes to run all models are available on GitHub (https://github.com/MingchiXu/Markov_Switching_Hurdle_code).

References

  • 1. Mullahy J., “Specification and Testing of Some Modified Count Data Models,” Journal of Econometrics 33, no. 3 (1986): 341–365.
  • 2. Heilbron D. C., “Zero‐Altered and Other Regression Models for Count Data With Added Zeros,” Biometrical Journal 36, no. 5 (1994): 531–547.
  • 3. Lambert D., “Zero‐Inflated Poisson Regression, With an Application to Defects in Manufacturing,” Technometrics 34, no. 1 (1992): 1–14, 10.2307/1269547.
  • 4. Greene W. H., “Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models” (1994).
  • 5. Young D. S., Roemmele E. S., and Yeh P., “Zero‐Inflated Modeling Part I: Traditional Zero‐Inflated Count Regression Models, Their Applications, and Computational Tools,” Wiley Interdisciplinary Reviews: Computational Statistics 14, no. 1 (2022): e1541.
  • 6. Fernandes M. V., Schmidt A. M., and Migon H. S., “Modelling Zero‐Inflated Spatio‐Temporal Processes,” Statistical Modelling 9, no. 1 (2009): 3–25.
  • 7. Feng C. X., “A Comparison of Zero‐Inflated and Hurdle Models for Modeling Zero‐Inflated Count Data,” Journal of Statistical Distributions and Applications 8, no. 1 (2021): 1–19.
  • 8. Vergne T., Korennoy F., Combelles L., Gogin A., and Pfeiffer D. U., “Modelling African Swine Fever Presence and Reported Abundance in The Russian Federation Using National Surveillance Data From 2007 to 2014,” Spatial and Spatio‐temporal Epidemiology 19 (2016): 70–77.
  • 9. Pohlmeier W. and Ulrich V., “An Econometric Model of the Two‐Part Decisionmaking Process in the Demand for Health Care,” Journal of Human Resources 30, no. 2 (1995): 339–361.
  • 10. Sengupta P., Biswas B., Kumar A., Shankar R., and Gupta S., “Examining the Predictors of Successful Airbnb Bookings With Hurdle Models: Evidence From Europe, Australia, USA and Asia‐Pacific Cities,” Journal of Business Research 137 (2021): 538–554.
  • 11. Douwes‐Schultz D. and Schmidt A. M., “Zero‐State Coupled Markov Switching Count Models for Spatio‐Temporal Infectious Disease Spread,” Journal of the Royal Statistical Society, Series C 71, no. 3 (2022): 589–612.
  • 12. De Souza T. M. A., Ribeiro E. D., Corrêa V. C., et al., “Following in the Footsteps of the Chikungunya Virus in Brazil: The First Autochthonous Cases in Amapá in 2014 and Its Emergence in Rio de Janeiro During 2016,” Viruses 10, no. 11 (2018): 623.
  • 13. Xavier J., Giovanetti M., Fonseca V., et al., “Circulation of Chikungunya Virus East/Central/South African Lineage in Rio de Janeiro, Brazil,” PLoS One 14, no. 6 (2019): e0217871.
  • 14. Bartlett M. S., “Measles Periodicity and Community Size,” Journal of the Royal Statistical Society. Series A (General) 120, no. 1 (1957): 48–70, 10.2307/2342553.
  • 15. Freitas L. P., Schmidt A. M., Cossich W., Cruz O. G., and Carvalho M. S., “Spatio‐Temporal Modelling of the First Chikungunya Epidemic in an Intra‐Urban Setting: The Role of Socioeconomic Status, Environment and Temperature,” PLoS Neglected Tropical Diseases 15, no. 6 (2021): e0009537.
  • 16. Schmidt W. P., Suzuki M., Dinh Thiem V., et al., “Population Density, Water Supply, and the Risk of Dengue Fever in Vietnam: Cohort Study and Spatial Analysis,” PLoS Medicine 8, no. 8 (2011): e1001082.
  • 17. Freitas M. G. R., Tsouris P., Reis I. C., et al., “Dengue and Land Cover Heterogeneity in Rio de Janeiro,” Oecologia Australis 14, no. 3 (2010): 641–667.
  • 18. Harris M., Caldwell J. M., and Mordecai E. A., “Climate Drives Spatial Variation in Zika Epidemics in Latin America,” Proceedings of the Royal Society B: Biological Sciences 286, no. 1909 (2019): 20191578.
  • 19. Stoner O., Economou T., and Silva G., “A Hierarchical Framework for Correcting Under‐Reporting in Count Data,” Journal of the American Statistical Association 114, no. 528 (2019): 1481–1492.
  • 20. Combelles L., Corbiere F., Calavas D., Bronner A., Hénaux V., and Vergne T., “Impact of Imperfect Disease Detection on the Identification of Risk Factors in Veterinary Epidemiology,” Frontiers in Veterinary Science 6 (2019): 66.
  • 21. Oliveira G., Argiento R., Loschi R., Assunção R., Ruggeri F., and Branco M., “Bias Correction in Clustered Underreported Data,” Bayesian Analysis 17, no. 1 (2022): 95–126.
  • 22. Douwes‐Schultz D., Schmidt A. M., Shen Y., and Buckeridge D., “A Three‐State Coupled Markov Switching Model for COVID‐19 Outbreaks Across Quebec Based on Hospital Admissions,” Annals of Applied Statistics 19, no. 1 (2024): 371–396.
  • 23. Young D. S., Roemmele E. S., and Shi X., “Zero‐Inflated Modeling Part II: Zero‐Inflated Models for Complex Data Structures,” Wiley Interdisciplinary Reviews: Computational Statistics 14, no. 2 (2022): e1540.
  • 24. Bauer C. and Wakefield J., “Stratified Space–Time Infectious Disease Modelling, With an Application to Hand, Foot and Mouth Disease in China,” Journal of the Royal Statistical Society: Series C: Applied Statistics 67, no. 5 (2018): 1379–1398.
  • 25. Meyer S., Held L., and Höhle M., “Spatio‐Temporal Analysis of Epidemic Phenomena Using the R Package Surveillance,” Journal of Statistical Software 77, no. 11 (2017): 1–55.
  • 26. Bala Murugan S. and Sathishkumar R., “Chikungunya Infection: A Potential Re‐Emerging Global Threat,” Asian Pacific Journal of Tropical Medicine 9, no. 10 (2016): 933–937.
  • 27. Tawiah K., Iddrisu W. A., and Asampana Asosega K., “Zero‐Inflated Time Series Modelling of COVID‐19 Deaths in Ghana,” Journal of Environmental and Public Health 2021, no. 1 (2021): 5543977, 10.1155/2021/5543977.
  • 28. Gelman A., Hwang J., and Vehtari A., “Understanding Predictive Information Criteria for Bayesian Models,” Statistics and Computing 24, no. 6 (2014): 997–1016.
  • 29. Merkle E. C., Furr D., and Rabe‐Hesketh S., “Bayesian Comparison of Latent Variable Models: Conditional Versus Marginal Likelihoods,” Psychometrika 84 (2019): 802–829.
  • 30. Gneiting T. and Raftery A. E., “Strictly Proper Scoring Rules, Prediction, and Estimation,” Journal of the American Statistical Association 102, no. 477 (2007): 359–378.
  • 31. Epstein E. S., “A Scoring System for Probability Forecasts of Ranked Categories,” Journal of Applied Meteorology (1962–1982) 8, no. 6 (1969): 985–987, 10.1175/1520-0450(1969)008<>2.0.CO;2.
  • 32. Russo G., Subissi L., and Rezza G., “Chikungunya Fever in Africa: A Systematic Review,” Pathogens and Global Health 114, no. 3 (2020): 111–119.
  • 33. Valpine P., Turek D., Paciorek C. J., Anderson‐Bergman C., Lang D. T., and Bodik R., “Programming With Models: Writing Statistical Algorithms for General Model Structures With NIMBLE,” Journal of Computational and Graphical Statistics 26, no. 2 (2017): 403–413, 10.1080/10618600.2016.1172487.
  • 34. Plummer M., Best N., Cowles K., and Vines K., “CODA: Convergence Diagnosis and Output Analysis for MCMC,” R News 6, no. 1 (2006): 7–11.
  • 35. Bracher J. and Held L., “Endemic‐Epidemic Models With Discrete‐Time Serial Interval Distributions for Infectious Disease Prediction,” International Journal of Forecasting 38, no. 3 (2022): 1221–1233.
  • 36. Riou J., Poletto C., and Boëlle P. Y., “A Comparative Analysis of Chikungunya and Zika Transmission,” Epidemics 19 (2017): 43–52.


Articles from Statistics in Medicine are provided here courtesy of Wiley
