Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2023 Aug 15.
Published in final edited form as: J Am Stat Assoc. 2012 Dec 21;107(500):1410–1426. doi: 10.1080/01621459.2012.713876

Tracking Epidemics With Google Flu Trends Data and a State-Space SEIR Model

Vanja Dukic 1, Hedibert F Lopes 2, Nicholas G Polson 3
PMCID: PMC10426794  NIHMSID: NIHMS1921101  PMID: 37583443

Abstract

In this article, we use Google Flu Trends data together with a sequential surveillance model based on state-space methodology to track the evolution of an epidemic process over time. We embed a classical mathematical epidemiology model [a susceptible-exposed-infected-recovered (SEIR) model] within the state-space framework, thereby extending the SEIR dynamics to allow changes through time. The implementation of this model is based on a particle filtering algorithm, which learns about the epidemic process sequentially through time and provides updated estimated odds of a pandemic with each new surveillance data point. We show how our approach, in combination with sequential Bayes factors, can serve as an online diagnostic tool for influenza pandemic. We take a close look at the Google Flu Trends data describing the spread of flu in the United States during 2003–2009 and in nine separate U.S. states chosen to represent a wide range of health care and emergency system strengths and weaknesses. This article has online supplementary materials.

Keywords: Flu, Google correlate, Google insights, Google searches, Google trends, H1N1, Infectious Diseases, Influenza, IP surveillance, Nowcasting, Online surveillance, Particle filtering

1. INTRODUCTION

The 2009 H1N1 strain was met with little immunity in humans and was able to infect almost 300,000 people worldwide by mid-September of 2009, according to the World Health Organization (WHO). Unlike H5N1 (the avian influenza), which is slowspreading but a more deadly strain, the fast-spreading H1N1 influenza was quickly declared a pandemic. A pandemic toll far exceeds that of a regular seasonal influenza, which usually severely sickens 3 to 6 million people, and results in between a quarter and a half million deaths worldwide each year (Vaillant et al. 2009).

Infectious disease surveillance has traditionally played a sentinel role in public health preparedness of a pandemic. In the United States, the Centers for Disease Control and Prevention (CDC) serve as the main agency in charge of surveillance of “reportable” infectious diseases, such as SARS, influenza, or West Nile virus. Similarly, WHO tracks infectious diseases throughout the world, including endemic diseases in the developing countries. Public health officials rely on surveillance data to estimate disease activity levels and prepare intervention strategies. To this end, epidemic models have become an important part of public health response planning and early warning systems (Kaplan, Craft, and Wein 2002; Webby and Webster 2003; Eubank et al. 2004; Elderd, Dukic, and Dwyer 2006).

1.1. Mathematical Models for Epidemics

Modern mathematical epidemiology models date back to the early twentieth century, most notably to the work by Kermack and McKendrick (1927) whose susceptible-infectious-recovered (SIR) model was used for modeling the plague (London 1665–1666, Bombay 1906) and cholera (London 1865) epidemics. The basic SIR model assumes that at any given time, a fixed population can be split into three compartments (fractions): susceptible people (those naive to the disease), infectious people (those with disease that are able to infect others), and recovered people (those who had the disease and are now immune). The total number of people in all three compartments, N, is assumed constant through time, with no births and no deaths from causes other than the disease itself. These models assume homogeneous mixing, where each individual is equally likely to come in contact with any other.

The SIR model is an example of models commonly referred to as “compartmental models,” as they describe the flow (transition) of people through different compartments (which represent the stages of disease) over time. When considering influenza, however, an immediate extension of the original SIR model is to introduce a fourth compartment corresponding to the incubation (disease latency) stage—when a person is infected with influenza but still not infectious enough to be able to transmit it. This extension is called the “susceptible-exposed-infected-recovered” (SEIR) model (Anderson and May 1991; Hethcote 2000) and describes the epidemic over time as follows:

S˙t=βStIt/NE˙t=βStIt/NαEtI˙t=αEtγItR˙t=γIt. (1)

Here, the dot denotes a time derivative, and the parameters θ=(β,α,γ) are related to the transition rates from one disease stage to the next. The first equation describes disease transmission resulting from contacts between susceptible and infectious people—each infectious individual transmits the pathogen to β individuals per unit time, but the new disease cases only arise if the contact is with a susceptible person (i.e., with probability St/N). Thus, at time t, the individuals in the class S move to the “exposed but not yet infectious” class E at the rate βIt/N. The exposed but not yet infectious individuals move to the infectious class at the rate α per unit time, while γ is the rate (per unit time) at which infectious individuals I cease to be infectious because of recovery (or, in some cases, death). In the contact process terminology, α and γ correspond to the inverse of the average of an exponentially distributed time to onset of infectiousness and to recovery, respectively.

The model (1) is completed with the specification of initial values, S0,E0,I0, and R0 : often, epidemics are modeled with an introduction of a single infectious person into a society where everyone else is susceptible, meaning that I0=1,S0=(N-1), E0=0, and R0=0. It is also possible to consider I0=k, where k is an unknown number of initially infected people, to be estimated from the data. As in the classic SIR model, the SEIR model in this form assumes constant population size: St+Et+It+Rt=N, for all t. Though extensions of the SIR-type models exist where the population size is allowed to vary (e.g., via birth, death, and migration processes), for many fast evolving outbreaks in large populations N can be considered approximately constant and estimated from the census statistics.

Mathematically, one can prove that the epidemic will not be able to take off if E˙+I˙<0 for all times, or equivalently, βS0/Nγ<1. If S0N, the quantity β/γ is commonly of interest instead, and is referred to as the basic reproductive ratio, or R0. That quantity can be interpreted as the number of secondary infections a single infected person would cause during his or her infectious stage in an entirely susceptible population. Higher values of R0 are associated with faster spreading infections. When γ=1, that is, when there is on average 1 recovery per unit time, the value of R0 equals the value of transmission parameter β.

Solving the system of equations (1) is done numerically, using a solver such as the one implemented in the lsoda function in the statistical software R, based on the method originally developed by Petzold (1983) and Hindmarsh (1983). An example of the solution to the deterministic SEIR system of equations (1), for a specific value of θ, is shown in Figure 1. The solution describes St,Et,It, and Rt trajectories over time, thus allowing the fraction of susceptible, latent, infectious, and recovered people to be determined at any point in time t. Compartmental models with various modifications (including birth and death rates for example, or migration) have proven useful in analyzing epidemics and particularly for modeling the spread of moderately to highly infectious diseases in a larger and well-mixed society (Anderson and May 1991; Gani and Leach 2001; Ferguson et al. 2003; Elderd, Dukic, and Dwyer 2006; Cauchemez and Ferguson 2008).

Figure 1.

Figure 1.

An example solution to an SEIR system specified in Equation (1), in a population of size 100.

1.2. State-Space Models for Epidemics

The main appeal of compartmental models lies in their simplicity, well-understood behavior, and intuitive interpretation of the model parameters. Their simplicity is, however, also a limiting factor when it comes to capturing changes in the epidemic course, such as those induced by a public health intervention or a media event, variations in behavior, contact, and vaccination patterns. Casting the traditional compartmental models in a state-space framework is one way to allow the models to capture changes in the dynamics over time in a flexible way. In this article, we provide a state-space extension of the SEIR model, specifically designed to track epidemic behavior based on surveillance data.

In addition, epidemic outbreaks are almost always observed with error, making it necessary to estimate the solution of the system in (1) in the presence of statistical noise. In such situations, the true solution (the true susceptible, latent, infected, and recovered fractions) is referred to as the hidden state of the system. In many state-space models, estimation of the trajectory of the hidden state over time is the primary objective.

In our state-space SEIR model, one objective will be to estimate the trajectory of the hidden state vector xt=St,Et,It,Rt, based on a noisy time series of epidemic surveillance data yt (e.g., counts of the newly infected people or some function thereof). However, along with the hidden state, we will also want to estimate the parameter vector driving the SEIR system, θ=(β,α,γ), which contains the transmission, latency, and recovery parameters, and quantify the uncertainty in those parameters. Joint inference for states and parameters has been a topic of interest in the recent state-space modeling literature (Liu and West 2001; Fearnhead 2002; Storvik 2002; Fearnhead 2008; Doucet and Johansen 2009; Kantas et al. 2009; Lopes, Carvalho, Johannes, and Polson 2011).

2. INFLUENZA DATA

In the United States, flu surveillance starts with the sentinel network of health care establishments, including individual health care professionals, clinics, diagnostic test laboratories, and public health departments, called the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). Some 2400 sites in over 122 cities and 50 states are responsible for monitoring and reporting observed influenza-like cases to the CDC, which then analyzes and publishes consolidated reports on estimated flu activity in nine major U.S. regions. ILINet tracks several indicators of flu activity throughout the United States (hospitalizations, mortality, and outpatient visits due to “influenza-like illness” (ILI)) on a weekly basis during the regular flu season from October through mid-May. According to the CDC guidelines, ILI is defined as fever of 100°F (or higher) and a cough and/or sore throat in the absence of a known cause other than influenza.

According to the CDC estimates, the average number of U.S. ILI-related patient visits is about 16 million per year. The reported fraction of ILI-visits among all patient visits is weighted based on the population of each state and averaged to form the overall U.S. ILI activity, as well as the activity for 10 major U.S. regions. Estimates for the finer geographic resolution are not provided due to unevenly distributed locations and catchment areas of the ILINet members, and consequently, lower precision for ILI estimates. As with many traditional surveillance systems, the CDC reports are published with a delay of approximately 2 weeks, and all past postings are subject to a retroactive adjustment reflecting receipt of corrected reports from the ILINet members. More information about the CDC surveillance program and the definition of the 10 regions can be found on the CDC website (http://www.cdc.gov/flu).

2.1. Google Flu Trends

Due to a remarkable increase in the online community and search engine activity over the last decade, several alternative surveillance systems have been proposed. Some are based on search engines, such as Google or Bing, and some on tracking micro-blogging content, such as Twitter. Following an extensive variable selection process in collaboration with CDC, Ginsberg et al. (2009) were first to identify a set of search words, termed “ILI-related queries,” that were most highly predictive of the CDC’s ILI counts.

The Flu Trends algorithm that Google uses for prediction of ILI cases is based on a regression model that links the logit-transformed fraction of ILI visits to the logit-transformed fractions of the top search terms. The algorithm was found to track the ILI percentages well (see Figure 2) and now consistently predicts the ILI activity 1–2 weeks ahead of CDC publication. The results are archived every week as a part of the Google Flu Trends project (http://www.google.org/flutrends/). Unlike the CDC surveillance, these reports are made available daily and are not in general subject to future revisions. Flu Trends provides localized predictions, based on the IP address of the computer from which a search was done. IP addresses are usually tied to a specific metropolitan area, allowing for “IP surveillance” at the level of individual states as well as cities.

Figure 2.

Figure 2.

Google Flu Trends estimated ILI percentages (dashed line) and CDC ILI surveillance percentages (solid line) for the United States, from June 2003 until September 2009. Separate plots correspond to separate influenza years, with each new influenza season starting in autumn and ending in spring. Note that CDC did not post ILI reports in summers prior to 2009, and thus no solid line appears during summer months prior to 2009.

The “National Report Card on the State of Emergency Medicine” (American College of Emergency Physicians 2009) has found that the overall U.S. emergency care system has been under a severe strain. However, as with other health-care aspects, states vary in their quality of emergency care. As a result, they vary in their pandemic preparedness, in addition to varying in their density of population and contact networks. For this reason, we will also examine individual results for nine states spanning a wide range of quality of care. We focus on two dimensions of the emergency medicine report, “public health” and “disaster preparedness,” as they are directly relevant to the management of influenza epidemics. For example, one of the fields of the “public health” category is the percentage of adults 65 years of age or older who have received an influenza vaccine in the past 12 months. Similarly, “disease preparedness” measures characteristics such as the fraction of nurses and physicians registered in a state-based emergency system, presence of rapid notification systems, and regular drills for medical and emergency personnel. Perhaps not surprisingly, states that are largely rural and face challenges, such as workforce shortages, lack of large medical facilities, and large uninsured populations, are found to have the most difficulty with this category and might be particularly vulnerable during a pandemic outbreak. According to the report, states that are among the best prepared are Maryland, Massachusetts, and Pennsylvania, while those that did not rank highly in the areas of “disaster preparedness” and “public health” include South Carolina, Oklahoma, Mississippi, South Dakota, Tennessee, and Arkansas. Google Flu Trends estimates for these nine states are shown in Figure 3.

Figure 3.

Figure 3.

Google Flu Trends ILI surveillance in nine representative states, 2003–2009. The states were chosen to span a range of health care preparedness criteria based on the results published in the American College of Emergency Physicians 2009 Report. The states that are ranked among the best in quality of health care are Maryland, Massachusetts, and Pennsylvania. The states that ranked low in the areas of “disaster preparedness,” “emergency care access,” and “public health” include South Carolina, Oklahoma, Mississippi, South Dakota, Tennessee, and Arkansas. Note some states’ search term counts were too low to procure the Flu Trends surveillance data early on, from 2003 through 2005.

In addition to individual U.S. states and cities, Google Flu Trends has recently expanded to other countries including most of Europe, Russia, Japan, Australia, New Zealand, Canada, and Mexico. We also employ our method to study the influenza epidemic in New Zealand, a southern hemisphere and a relatively rural and well-off country with a good health care system, and two separated islands. We chose New Zealand, as it may provide insight into the subsequent influenza season in the United States, since the southern hemisphere flu epidemics generally precede the northern hemisphere ones. The New Zealand analysis is basically similar to the United States one, and the full discussion and results can be found in the online supplementary material.

2.2. Influenza Epidemics and Pandemics in the Past

Pandemics are relatively rare, with only a handful of influenza pandemics occurring in the last hundred years. The most infamous one was the H1N1 pandemic in 1918/1919, also known as the “Spanish Flu,” estimated to have caused 20–50 million deaths—more deaths than any pandemic since the bubonic plague (the Black Death) of the fourteenth century. The estimates of its basic reproductive number R0 range from 1.8 to 3.5 in different communities (Mills, Robins, and Lipsitch 2004; Chowell et al. 2006, Chowell, Nishiura, and Bettencourt 2007; Nishiura 2007). The other notable influenza pandemics were the Asian Influenza (H2N2) of 1957–1958 with 70,000 estimated deaths in the United States, and the Hong Kong Flu of 1968–1969 (H3N2) with 34,000 estimated deaths in the United States. Both had basic reproductive numbers in the range of 1.5–2.2 (Longini et al. 2004; Gani et al. 2005; Vynnycky and Edmunds 2008). A pandemic is considered mild if its reproductive rate is below 1.5, moderate if between 1.5 and 1.8, and severe if above 1.9 (Yang et al. 2009). On the other hand, seasonal influenza’s basic reproductive number is lower and historically estimated to range up to 1.35 (Cintrón-Arias et al. 2009).

In the most recent H1N1 epidemic in 2009, the virus’ potential for a pandemic was deemed nonnegligible (Fraser et al. 2009). Its overall basic reproductive rate was estimated between 1.3 and 1.7 based on the first few months of data, but in some instances it was found to be as high as 2.9 based on data from several city initial outbreaks (Yang et al. 2009). In terms of the other influenza parameters, namely the latency (α) and recovery rate (γ), most estimates seem to point to the average incubation time being between 3 and 4 days, while the average infectious time is 7–8 days (Tuite et al. 2010). People infected with the recent H1N1 virus are thought to be infectious longer however, as continued viral shedding was observed for over 10 days post infection, with nearly half of the people continuing to shed the virus on and after the 7th day of illness (Center for Infectious Disease Research & Policy 2009). Under the best fit exponential distribution, these preliminary studies would imply the mean recovery time of about 10 days.

3. STATE-SPACE SEIR MODELS

State-space modeling [often termed dynamic modeling, West and Harrison (1997)] usually relies on sequential Bayes inference that facilitates sequential learning by incorporating additional information with every new surveillance data point. It can be designed to sequentially learn about the epidemic parameters, to produce near real-time estimates of the epidemic states while accounting for uncertainty in the epidemic parameters, and to provide the posterior odds of a pandemic at any point in time. In this section, we describe a state-space extension of the classic SEIR-type model for influenza dynamics and introduce a sequential learning algorithm to update the posterior distributions of the hidden (dynamic) states xt=St,Et,It,Rt (the vector of susceptible, latent, infectious, and recovered fractions in the population) at any time t, and the parameters guiding the disease evolution θ=(β,α,γ). We also show how the algorithm can be used to provide online pandemic alerts based on sequential Bayes factors.

3.1. Notation

The dynamics of influenza are described by the evolution of hidden (unobserved) states of the SEIR-type epidemics, xt=St,Et,It,Rt, which depends on the unknown three-dimensional vector of epidemic parameters θ=(β,α,γ) as in Equation (1). A discretized version of the influenza dynamics in (1), assuming a discretization time-step of 1 week, can be expressed as follows:

St=St1βSt1It1/NEt=(1α)Et1+βSt1It1/NIt=(1γ)It1+αEt1Rt=Rt1+γIt1, (2)

where N is the total population size. The discretization replaces S˙t in Equation (1) by the weekly change in the susceptible fraction, St-St-1, and does so analogously for E˙t,I˙t, and R˙t.

Due to the nature of ILI surveillance data, our observations will consist only of noisily observed weekly counts of ILI visits, I˜t, which can be thought of as a proxy to the true fraction of infected population, It, in each week-long time period (t-1,t]. Instead of working directly with I˜t however, we will model the observed growth rate of the infectious population, yt=I˜t-I˜t-1/I˜t-1. This leads to the following state-space model for the growth rate:

yt=gt+εty       εty~N(0,σy2) (3)
gt=γ+αEt1It1+εtg       εtg~N(0,σg2). (4)

We will refer to Equation (3) as the “observation equation” and Equation (4) as the “evolution equation” for the growth rate. The mean component of Equation (4) is derived from the deterministic evolution of It-1 based on the discretized SEIR model (2) above, with the true number of infections It related to gt via It=1+gtIt-1. With the infectious state It modeled directly in the growth rate evolution Equation (4), the state-space SEIR model is then completed with the evolution of the rest of the state components:

(StEtRt)=(St1Et1Rt1)+(βSt1/N0βSt1/Nαγ0)(It1Et1). (5)

Given that we are now working with the growth rate that can be both positive and negative, it may be computationally convenient to assume that εty and εtg are normally distributed, with means 0 and variances σy2 (observation variance) and σg2 (evolution variance), respectively. Before doing so, we recommend a normality check for the growth rates. In the Google dataset, normality seems to be a reasonable assumption (see Figure 4 for the U.S. growth rates). However, if normality had not seemed appropriate, a transformation of the growth rate (e.g., a log transformation) could have been employed to help achieve approximate normality.

Figure 4.

Figure 4.

Normality assumption checks: the left column shows the box plots of growth rates and the right column shows the empirical (unfilled circles) and normal cumulative distribution functions (CDFs) (filled circles). The top row shows the 2003/2004 season and the bottom row shows the 2008/2009 season.

The classical SEIR formulation assumes that σg2=0. In fact, the magnitude of σg2 can in essence be viewed as a measure of the underlying deterministic SEIR model fit, while the relative magnitudes of the two variances, σy2 and σg2, can be viewed as confidence in observations (data) and the underlying autonomous SEIR model, respectively.

While it is tempting to translate concepts and intuition from the classical compartmental models directly to their state-space counterparts, it is important to note that there are substantial differences between the two. For example, while the classical mathematical biology models produce smooth solutions for the entire disease trajectory over time, the state-space models will only yield a set of pointwise state estimates. The latter only gives an illusion of the trajectory. Also, in general, largestep discretizations and addition of weekly error pulses would not be recommended in pure nonlinear compartmental models (Atkinson 1978; Cauchemez and Ferguson 2008; King et al. 2008; He, Ionides, and King 2009); however, the state-space models are, in principle, able to compensate for the consequences of such errors via their evolution variances.

3.2. Sequential Learning Algorithm

Recently, particle filtering methods have been proposed for surveillance and early detection of epidemics (Rodeiro and Lawson 2006; Jagat et al. 2008), though not within the context of state-space compartmental models. While powerful for rapid online estimation, particle filter methods can suffer from the “particle impoverishment” problem and loss of inferential capability as the process evolves (Storvik 2002; Fearnhead 2008). Motivated by the desire for a fast online surveillance method, we implement a sequential learning algorithm based on a particle filter that is a hybrid of the Liu-West (2001) filter and the particle learning filter (Carvalho et al. 2010), relying on the use of sufficient statistics to help alleviate particle impoverishment and information loss over time (Fearnhead 2002; Kantas et al. 2009; Lopes, Carvalho, Johannes, and Polson 2011).

The proposed sequential learning algorithm proceeds as follows. For notational convenience, we introduce Zt, the “essential state vector” containing the hidden state vector xt=It,Et,It,Rt, the vector of unknown static disease parameters θ=(α,β,γ), the observation and evolution variances σy2 and σg2, and all (partial) sufficient statistics st. The sufficient statistics st govern sequential parameter learning via pθst. (We will talk more about sufficient statistics in Section 3.3.) The goal of the algorithm is to track the distribution of the essential state vector at each point in time t via sequential Monte Carlo (SMC)—that is, sets of M particles, Zt(1),,Zt(M) (denoted hereafter by Zt(i)i=1M. The set of particles at time t will thus need to be sampled from the posterior distribution of the essential state vector Zt, given the observed infection growth rates up to time t,yt=y1,y2,,yt. Formally, Zt(i)i=1M will need to be iid draws from pZtyt.

The algorithm for sampling Zt(i)i=1M from pZtyt is based on the following decomposition of the posterior distribution:

p(Zt+1yt+1)p(Zt+1Zt,yt+1)p(yt+1Zt)dP(Ztyt), (6)

which is a consequence of the following:

p(Ztyt+1) p(yt+1Zt)p(Ztyt) (7)
p(Zt+1yt+1) = p(Zt+1Zt,yt+1)dP(Ztyt+1). (8)

Here, and throughout this section, p() refers to the appropriate continuous/discrete measure and

p(yt+1Zt)= p(yt+1Zt+1)p(Zt+1Zt)dZt+1 (9)

plays the role of the predictive density of yt+1.

Expressions (6)–(9) suggest a two-step algorithm for sampling Zt+1(i)i=1M from the posterior pZt+1yt+1 at time t+1, given that we have stored the set of particles from the previous time t,Zt(i)i=1M. The first step would be to resample the old particles Zt(i)i=1M with weights proportional to pyt+1Zt(i) and generate M resampled particles Zt(*)i=1M. These resampled particles can be viewed as a sample from pZtyt+1 in (7). Once we have the resampled particles Zt(*)i=1M, we will sample a new set of particles Zt+1(i)i=1M from the mixture of densities pZt+1Zt(*),yt+1, an approximation to the integral in Equation (8). In short, the sequential learning algorithm comprises repeating the following steps for i=1,,M, at each time point:

  • Step 1 (Resample): Sample, with replacement, integers ki from the set {1,,M}, such that Prki=jpyt+1Zt(j), for each j=1,,M.

  • Step 2 (Sample): Sample Zt+1(i) from pZt+1Ztki,yt+1.

The key ingredients in the two-step algorithm are thus the posterior predictive density pyt+1Zt and the posterior updating rule pZt+1Zt,yt+1.

The sequential learning algorithm above can be used to produce out-of-sample forecasts and to provide estimates of sequential predictive densities and Bayes factors. This comes from the fact that the predictive density for h periods ahead, pyt+hyt, can be approximated by

pM(yt+hyt)=1Mi=1Mp(yt+hZt(i)), (10)

where Zt(i) come from the current set of particles Zt(i)i=1M, acting as an approximation to pZtyt.

A natural further application of the above approximations is to sequential Bayes factors, which can be used to sequentially test a set of hypotheses. For example, we could sequentially compare the evidence for a seasonal epidemic 1 with the evidence for a pandemic 2, given all the observed data up to the week t. The approximate sequential Bayes factor is computed via the following equation:

BFtM(1,2)=pM(yt1)pM(yt2),

where

pM(ytm)=k=1tpM(ykyk1,m).

Here, pMytyt-1,m are the one-step-ahead approximate predictive densities [based on Equation (10)], for m=1,2.

The two-step sequential learning algorithm presented above produces a sequence of particle sets, Z0(i)i=1M,,Zt(i)i=1M, which can also be used to perform online parameter learning for the static parameters (θ,σy2, and σg2). Given the current set of particles Zt(i)i=1M, one can simply draw, using the Metropolis–Hastings algorithm for example, a new set of θ(*i)i=1M~pθst(i),yt, which will in fact be a sample from the marginal density pθyt (recall that sufficient statistics are a part of Zt(i)i=1M). Similar learning can be done for the two variance parameters, σy2 and σg2. These additional sampling (“learning”) steps are, of course, unnecessary for posterior inference at time t, which can be performed via Rao-Blackwellization, but they are important to further replenish the particles and alleviate particle impoverishment (Lopes, Carvalho, Johannes, and Polson 2011).

The “look ahead” step in Equation (7) also provides extra protection against particle degeneration in the algorithm (see Kong, Liu, and Wong 1994; Pitt and Shephard 1999) and reduces the propagation of the Monte Carlo error (Lopes, Carvalho, Johannes, and Polson 2011). To alleviate particle degeneration even further, a Liu and West (2001) kernel-shrinkage approximation can be used to reweigh and propagate (“jitter”) the static parameters and can be added to the Sample step. Indeed, we do so in the implementation of the sequential surveillance algorithm described in Section 3.3.

Although we use only one sequential learning approach, it is important to note that there are multiple other filtering variations that could be used instead, as long as they take steps to alleviate and assess particle degeneration and information loss. For recent reviews of SMC methods and alternative filtering approaches, as well as issues with particle degeneration, see, among others, Arulampalam et al. (2002), Storvik (2002), Ristic, Arulampalam, and Gordon (2004), Cappé, Godsill, and Moulines (2007), Fearnhead (2008), Doucet and Johansen (2009), Kantas et al. (2009), and Lopes and Tsay (2011). They highlight some of the recent developments over the last decade, including efficient particle smoothers, particle filters for highly dimensional dynamical systems, parameter learning, and the interconnections between Markov chain Monte Carlo (MCMC) and SMC methods.

3.2.1. Example of the Sequential Learning Algorithm for AR(1) Model.

We give now an example of the sequential learning algorithm implemented for a simple AR(1) state-space model. We choose this model in part as an illustration, before moving to the full implementation of the sequential learning algorithm for the state-space SEIR model in the next subsection. The AR(1) model is also a simpler alternative that could be used to model the observed growth rate of infection, yt, instead of the more complex SEIR model. As such, we will also treat the AR(1) state-space model as a simple benchmark model and compare it with the state-space SEIR model performance in the Results section.

In the AR(1) plus noise model, the observed growth rate of infection, yt, is modeled via the standard first order dynamic linear model of West and Harrison (1997), with the hidden state gt (the growth rate at time t) evolving according to an autoregressive process of order one, that is,

ytgt,ξ~N(gt,V)gtgt1,ξ~N(μ+ϕgt1,W),

where ξ=(V,μ,ϕ,W) and g0 comes from an initial distribution Nm0,C0 with fixed values of m0 and C0.

When the joint prior distribution, p(ξ)=p(V)p(μ,ϕ,W) with V~IGa0,b0,W~IGc0,d0, and (μ,ϕW)~Nq0,WQ0, then the joint posterior distribution pξyt,gtpξst is given as pVstpμ,ϕ,Wst. Here, st is again the vector of conditional sufficient statistics for ξ given the data up to time t. More specifically, for gt=g1,,gt,xt=1,gt-1, and Xt=x1,,xt, we have μ,ϕW,gt,Xt~Nqt,WQt and Wgt,Xt~IGct,dt, where ct=ct-1+1/2,Qt-1=Qt-1-1+xtxt,Qt-1qt=Qt-1-1bt-1+gtxt, and dt=dt-1+gt-qtxtyt/2+qt-1-qtQt-1-1qt-1/2. Additionally, Vyt,gt~IGat,bt, where at=at-1+1/2 and bt=bt-1+yt-gt2/2. Therefore, st=at,bt,ct,dt,qt,Qt.

Furthermore, pgtyt,ξpgtstk,ξ~Nmt,Ct, where stk=mt(ξ),Ct(ξ) are the standard Kalman filter moments at time t. In this state-space model, the key ingredients in the sequential learning algorithm are thus all available: pytst-1k,ξ=pMyt;μ+ϕmt-1,V+W+ϕ2Ct-1,pstkst-1k,ξ (a deterministic mapping) and pξst (above updates). In this example, the essential state vector is Zt=st,stx,ξ) and Step 2 (sampling) of the sequential learning algorithm translates into deterministic updates for st given st-1,yt,gt and for stk given st-1k,ξ,yt.

3.3. Sequential Learning Algorithm Implementation for Flu Trends Data

This subsection describes the specifics of the sequential learning algorithm implemented for Google Flu Trends surveillance. The algorithm consists of three modules: predictive density, posterior updating rule, and parameter learning. Below we describe the details of each of the three modules. We refer the reader to the algorithm box for the detailed implementation steps for the Flu Trends data.

3.3.1. Predictive Density.

This is Step 1 (Resample) of the sequential learning algorithm of Section 3.2. The tracking and learning algorithm presented in the previous section depends crucially on the predictive density pyt+1Zt. To find this density, observe that yt+1gt+1,θ,σy2~Ngt+1,σy2, which follows from Equation (3) and the fact that εty~N0,σy2. Similarly, gt+1Zt~N-γ+αEt/It,σg2, based on Equation (4) and the fact that εtg~N0,σg2. Combining these two densities, and integrating gt+1 out, leads to the predictive density for next growth rate observation, that is, yt+1Zt~N-γ+αEt/It,σy2+σg2. Note that this computation can be done for any step size, including those smaller than the intervals at which the observations are collected, by solving the SEIR equations numerically forward, and using the final values at the previous time-step as the initial values for the next.

3.3.2. Posterior Updating Rule.

This is Step 2 (Sample) of the sequential learning algorithm of Section 3.2. After resampling the particles with weights proportional to the predictive distribution above, the next step is to “propagate” these particles and obtain a sample from the updated posterior at time t+1. The update for the hidden growth rate of infection, gt, follows from the conditional linear state-space model, and can be done by the standard Kalman-type recursions (West and Harrison 1997). More precisely, let the initial (time = 0) growth rate of infection be modeled as g0~Nm0,C0. Then, for any time t+1, it follows that gt+1Zt,yt+1~Nmt+1,Ct+1 with moments

mt+1=Ct+1(σy2yt+1+σg2(γ+αEt/It)) and Ct+11=σy2+σg2.

Then, It+1=1+gt+1It, and the other states of the SEIR model, St+1,Et+1,Rt+1 are deterministically updated via Equation (5). The particle set St+1,Et+1,It+1,Rt+1(i)i=1M serves as an approximation to pSt+1,Et+1,It+1,Rt+1yt+1.

3.3.3. Parameter Learning.

To carry out parameter learning, we also need to identify a set of conditional sufficient statistics for the next time t+1, which we denote by st+1. These conditional sufficient statistics are a part of Zt+1 and allow us to easily obtain new parameter samples from pθ,σy2,σg2Zt+1. Note, we have implicitly assumed that given the complete state history up to time t+1,xt+1=x1,,xt+1, the parameters admit conditional sufficient statistics, so that pθ,σy2,σg2xt+1,yt+1=pθ,σy2,σg2st+1, with st+1 being recursively and deterministically obtained from st,xt+1,yt+1, as follows.

Assuming an inverse gamma prior distribution for the observational variance σy2 in Equation (3), that is, σy2~IGa0,b0, it follows σy2yt+1,gt+1~IGat+1,bt+1, where at+1=at+1/2 and bt+1=bt+yt+1-gt+12. Then, st+1 is a deterministic function of st,yt+12,gt+12 and gt+1yt+1. Similarly, a bivariate normal-inverse gamma prior for γ,α,σg2 leads to a bivariate normal-inverse gamma posterior with sufficient statistics, Et/It,Et2/It2, and gt+1Et/It, included in st+1. The transmission parameter β appears nonlinearly via Et and It in the evolution equation and is sampled via the Liu and West (2001) filter, together with α and γ. For that reason, particle replenishing (via particle learning) is only performed for the two variances, σy2 and σg2.

3.3.4. Sequential Bayes Factors.

In situations where rapid decisions are needed, an estimate of the odds of pandemic might be the only quantity desired. In that case, we will be testing βpan versus βepi, with βepi corresponding to a regular (seasonal) epidemic and βpan to a pandemic regime. Sequential computation of the Bayes factor describing the odds of a pandemic through time is then straightforward following the details in Section 3.2.

Hence, for an online detection of a pandemic, we can append the sequential learning algorithm with the sequential Bayes factor computation, comparing the cases where the parameter β takes one of two levels. Evidence for the high-level β indicates that the epidemic is about to become a pandemic, and evidence for the low-level β indicates a regular seasonal epidemics where the disease spreads to a relatively small fraction of the population (CDC estimates 5%−20%) and dies out in a few months in a typical yearly cycle. Note that in the Bayes factor computation, different prior odds of a pandemic can be used: for example, they could be 1:20 (roughly corresponding to the historical frequency of flu pandemics in the past) or 1:1, which could be viewed as corresponding to a “pandemic vigilance” prior.

Sequential Learning Algorithm for state-space SEIR.
Definitions:
  • M = the number of particles used at each iteration (M used in the paper is 1,000,000)

  • α is the latency parameter, β transmission parameter, and γ recovery parameter in SEIR

  • ψ=(log α,log β,log γ) is the log-transformation of the SEIR parameters

  • mψ and Vψ are the sample mean and variance of the ψ(i) draws (i=1,,M), at each time point t,t=1,,T

  • η is the Liu-West shrinkage factor (η used in the paper was 0.99)

  • σg2 is the evolution variance, and σy2 is the observation variance

  • ILIt is the observed (Google Flu Trends) ILI percentage for week t

The algorithm:
  1. Draw the initial particle set β,α,γ,σg2,σy2(i)i=1M from the priors: β~N1.5,0.52Iβ>0,α~N2,0.52Iα>0,γ~N1,0.52Iγ>0,σg2~IG(1.1,0.005),σy2~IG(1.1,0.05) (see Section 4 of the article)

  2. Initialize the particle set for states (S,E,I,R)(i)=(1-ILI1,0,ILI1,0, for i=1,,M

Repeat the following steps for t=1,,T:

  1. Compute mψ,Vψ

  2. Compute ψ˜(i)=ηψ(i)+(1-η)mψ

  3. Obtain α˜(i)=expψ˜1(i),β˜(i)=expψ˜2(i), and γ˜(i)=expψ˜3(i)

  4. Compute μg(i)=-γ˜(i)+α˜(i)Et-1(i)/It-1(i)

  5. Compute weights ωt(i)pytμg(i),σg2(i)+σy2(i)

  6. Resample ψ˜,σg2,σy2,St-1,Et-1,It-1,Rt-1 with weights ωt(i)

  7. Draw ψ(i) from Nψ˜(i),(1-η)2Vψ

  8. Obtain α(i),β(i),γ(i) as in line 3 above

  9. Obtain μ˜g(i)=-γ(i)+α(i)Et-1(i)/It-1(i)

  10. Sample gt(i)~N(b,B), where b=Byt/σy2(i)+μ˜g(i)/σg2(i) and B=1/1/σg2(i)+1/σy2(i)

  11. Obtain
    1. It(i)=It-1(i)1+gt(i)
    2. Et(i)=β(i)It-1(i)St-1(i)+1-α(i)Et-1(i)
    3. Rt(i)=Rt-1(i)+γ(i)It-1(i)
    4. St(i)=1-It(i)-Rt(i)-Et(i)
  12. Compute weights πt(i)pytμ˜g(i),σg2(i)+σy2(i)/ωt(i)

  13. Resample ψ,σg2,σy2,St,Et,It,Rt with weights πt(i)

  14. Sample σg2 and σy2 based on updated conditional sufficient statistics (according to the parameter learning paragraph in Section 3.3)

4. RESULTS

In this section, we present the results for influenza tracking, based on the U.S. Google Flu Trends. Individual years will be analyzed separately, with each year having a different set of epidemic parameters (latency, transmission, and recovery parameters, as well as the evolution and observation variances). The population sizes in all years are assumed known, with yearly estimates provided by the Census Bureau (U.S. Census Bureau 2009). We assume that in each season, the epidemics were started by an unknown number of infected individuals, estimated separately from the data.

We use the season-specific SEIR model within the state-space framework to track the epidemics. As a result, season-specific issues like cross-immunity from previous years will be partly accounted for; for example, the estimated transmission rate is expected to be lower in the years with residual immunity. While any compartmental influenza model—for example, a model with nonconstant population size (e.g., migration) or more detailed contact patterns—could be embedded into a state-space model, our goal here is not to build a more complex SEIR model, but to show how a simple SEIR model within a state-space framework can be successfully used to track the epidemic.

Given the abundance of prior information available for influenza, the hyper-parameters used were derived largely from the information based on historical epidemics and pandemics (see Section 2.2) and are as follows:

Transmission parameter:β~N(1.5,0.52)Iβ>0 
Latency parameter:α~N(2,0.52)Iα>0
Recovery parameter:γ~N(1,0.52)Iγ>0
Evolution variance:σg2~IG(1.1,0.005)
Observation variance:σy2~IG(1.1,0.05).

Here, Ix>0 is an indicator function indicating that x is positive. The 95% ranges of the prior distributions were constructed so that they encapsulate most of the parameter estimates reported in published work. Though these priors are still somewhat informative, their influence is expected to diminish with time as more surveillance data points are incorporated into the analysis.

We show the results for two flu seasons in the United States: the first season, 2003/2004, and the last season, 2008/2009. The epidemics in these two seasons had moderately more complex trajectories than those in the other four seasons. The first season, 2003/2004, shown in the first plot in Figure 2, was characterized by a notable epidemic peak in January 2004, when the number of Google-derived ILI cases increased to around 8%. The sharpness of the peak of that epidemic is somewhat at odds with the slowness of its spread early in the season. In such situations, the classic SEIR model with a time-invariant transmission rate β and no evolution variance would likely have difficulties describing the disease activity adequately. The state-space formulation of the SEIR model however should be able to capture this sharp peak.

The 2008/2009 influenza season (the last plot in Figure 2) is the season with the most complexity in the epidemic trajectory. This season had multiple epidemic waves and multiple influenza strains merging together. The joint epidemic wave, widened by the late spring/summer H1N1 activity and the early second-wave onset of H1N1, would have presented an even greater challenge for the simple SEIR model without the state-space framework.

Although the state-space implementation is sensitive to the choice of variance parameters initially, the tracking algorithm is able to track the time progression of the 2003/2004 (see Figure 5) and 2008/2009 (see Figure 6) epidemics rather well. The uncertainty at each point in time is notable and can be assessed by examining the bottom, middle, and upper curves in all plots, which correspond to the lower 2.5th, median, and the upper 2.5th percentile of the posterior distribution for the hidden states and parameters as we learn more about them over time. For the 2003/2004 season, we see in Figure 5 that the transmission parameter decays over time as the epidemic subsides, while the latency and recovery parameters seem to stabilize: the latency parameter settled down around 1.45 (implying an average latency time of 4.8 days and median latency time of 3.2 days), while the recovery parameter settled down between 0.3 and 0.4 (implying an average recovery time between 2.5 and 3 weeks—with the median recovery time between 1.6 and 2 weeks). The 95% posterior ranges at the end of the epidemic were 0.15–0.9 for the transmission parameter, 1–2 for the latency parameter, and 0.1–0.6 for the recovery parameter. The estimate of R0=β/γ starts off between 1.5 and 2 but gradually settles down to around 1.1–1.3. This was in fact true for all seasons and regions we analyzed. The last panel in Figure 5 shows that even under 1:1 prior odds of pandemic, the Bayes factor steadily increases in favor of the regular epidemic as time progresses during the 2003/2004 season.

Figure 5.

Figure 5.

Flu tracking results in the United States for the 2003/2004 influenza season. In the I plot (second plot in the top row), the points represent weekly Google Flu Trends values, while the lines correspond to the lower 2.5 th percentile, median, and the upper 2.5 th percentile of the infectious state It posterior distribution as time progresses. In the other plots, the two lines present the lower and upper 2.5 th percentiles, while the points present the weekly posterior medians. The results for Bayes factors for the two competing basic reproductive ratios (1.25 vs. 2.2), under 1:1 prior odds, are presented in the last panel, with higher log-Bayes factor meaning stronger evidence in favor of seasonal epidemics.

Figure 6.

Figure 6.

Flu tracking results in the United States for the 2008/2009 influenza season. In the I plot, the points represent weekly Google Flu Trends values, while the lines correspond to the lower 2.5th percentile, median, and the upper 2.5th percentile of the infectious state It posterior distribution as time progresses. In the other plots, the two lines present the lower and upper 2.5th percentiles, while the points present the weekly posterior medians. The results for Bayes factors for the two competing basic reproductive ratios (1.25 vs 2.2), under 1:1 prior odds, are presented in the last panel, with higher log-Bayes factor meaning stronger evidence for seasonal epidemics.

For the 2008/2009 season, we see in Figure 6 a similar set of findings as in the 2003/2004 season. The transmission rate of H1N1 seems slightly lower than the one for the 2003/2004 flu, while the latency parameter is approximately the same as in 2003/2004. The recovery parameter however settled down around 0.25, implying the median recovery time of 3 weeks. This is consistent with the findings that the most recent H1N1 recovery may be longer on average than the recovery from the other recent flu strains (Center for Infectious Disease Research & Policy 2009). The 95% posterior ranges at the end of the epidemic were 0.1–0.4 for the transmission parameter, 0.75−2 for the latency parameter, and 0.1–0.35 for the recovery parameter. Again, the last panel in Figure 6 shows that even under 1:1 prior odds of pandemic, the Bayes factor steadily increased in favor of the regular epidemic as time progressed during the 2008/2009 season.

All results show that while the state-space SEIR can track the epidemic processes reasonably well, there does seem to be a fair amount of uncertainty in sequential state and parameter estimates. This is also reflected in the estimated variances, with evolution variance consistently higher than the observation variance. Note that this does not imply that the state-space SEIR model does not fit well—on the contrary, the state-space model tracks the observed data fairly well. However, the large evolution variance can be taken to indicate that the underlying autonomous SEIR model would likely not describe the epidemics trajectory adequately on its own without the state-space framework.

A notable consequence of using the state-space framework is that the updated information can result in estimates of hidden states without the classic monotonicity constraints. In particular, the number of susceptibles can be updated to a higher level than in the previous time period. The shown hidden states are not actual trajectories over time, as the classic SEIR forward simulation would produce, but rather a sequence of point-wise estimates of hidden states over time—as a result, they need not be monotone.

Figure 7 shows the prior sensitivity analysis under two additional priors on the transmission rate: the prior with mean of 1.4 and a slightly more “optimistic” prior with the mean of 1.1. As can be seen in both 2003/2004 and 2008/2009 seasons, the posterior means of the transmission parameter are similar under these two priors to the results under the prior mean of 1.5 shown in Figures 5 and 6. The similarity is increasing, albeit slowly, with additional data, as expected. The two Bayes factors (under 1:1 prior odds of pandemic) show slight differences under the two priors but are qualitatively the same: all still favor a regular epidemic over a pandemic.

Figure 7.

Figure 7.

Sensitivity analysis under two additional priors on transmission rate: the gray lines correspond to a prior with the mean of 1.4 and the black lines correspond to an “optimistic” prior with the prior β mean of 1.1. The three black and gray line sets in the left column plots correspond to the upper 97.5th percentile, posterior mean, and the 2.5 th percentile of the sequentially simulated marginal posteriors of the transmission parameter. The right column shows the log-Bayes factors, under 1:1 prior odds, with higher log Bayes factors indicating support for a regular epidemic. The top row shows the 2003/2004 season and the bottom row shows the 2008/2009 season.

In addition, Figure 8 shows the sensitivity analysis for the 1-week-ahead prediction (posterior mean and 95% credible interval) for the ILI counts in the 2003/2004 season (top row) and the 2008/2009 season (bottom row). The analysis was done under two different priors on transmission rate: the right column corresponds to the prior mean of 1.4 and the left column to the prior mean of 1.1. As we can see, 1-week-ahead prediction shows little sensitivity to the priors.

Figure 8.

Figure 8.

Sensitivity analysis for the 1-week-ahead prediction under two different priors on transmission rate: the right column corresponds to a prior with the mean of 1.4 and the left column to an optimistic prior (with the prior β mean of 1.1). The three gray lines correspond to the upper 97.5th percentile, posterior mean, and the 2.5 th percentile of the sequentially simulated predictive distributions, while the black line with points corresponds to the observed data. The top row shows the 2003/2004 season and the bottom row shows the 2008/2009 season. One-week-ahead prediction shows little sensitivity to the priors.

We also compared the performance of the state-space SEIR model with the simpler state-space AR(1) benchmark model. We only present a few of the interesting comparisons: Figure 9 shows the sequential posterior densities of the growth rate pgtyt and the infected fraction pItyt for both state-space SEIR and AR(1) model for the 2003/2004 flu season in the United States. It is immediately apparent that the AR(1) model has difficulty capturing changes in the epidemic behavior and fails to track the epidemic trajectory closely after its peak. Similarly, Figure 10 shows the one-step-ahead prediction of AR(1) and SEIR state-space models in the 2003/2004 and 2008/2009 flu seasons. The state-space SEIR model’s 1-week-ahead predictions seem to be closer to the actual observations, while the state-space AR(1) model’s predictions are not as accurate after the peak, reflecting the inability of this simple model to capture the structure of the epidemic process well. The relative mean squared error of the AR(1) model versus the state-space SEIR model is 5.09 for the 2003/2004 season and 2.34 for the 2008/2009 season.

Figure 9.

Figure 9.

Sequential posterior distributions for the state-space SEIR model (left column) and the simple AR(1) benchmark model (right column) presented in Section 3, for the 2003/2004 flu season. The top row presents results for the growth rate of the infected population and the bottom row for the infected population fraction. The black circles correspond to the observations, gray squares (with gray line) are the fitted weekly values, and gray dashed lines are the 95% pointwise credible intervals. The AR(1) model is unable to capture the structure of the process as well as the state-space SEIR model.

Figure 10.

Figure 10.

Comparison of the one-step ahead forecasts produced by the state-space SEIR model (left column) and the simple AR(1) benchmark model (right column) presented in Section 3. The top row presents results for the 2003/2004 flu season and the bottom row for the 2008/2009 flu season. The black squares correspond to the observations, gray circles (with gray line) are the predicted values (using data up to the previous week only), and gray dashed lines are the 95% pointwise credible intervals for the predictions. The AR(1) model predictions are not very accurate and reflect the inability of this simple model to capture the structure of the epidemic process well. The relative mean squared error of the AR(1) model versus the state-space SEIR model is 5.09 for the 2003/2004 season and 2.34 for the 2008/2009 season.

The other flu seasons for the entire United States showed no evidence of strong epidemics, and we do not present them for that reason. The nine individual states chosen as widely representative of the emergency health care systems present largely a similar story to the overall U.S. results. Consequently, we single out only two of the more severe epidemic states, Oklahoma and South Dakota, and present the tracking algorithm results for the 2008/2009 influenza season in those two states in Figure 11.

Figure 11.

Figure 11.

Flu tracking results in South Dakota (top row) and Oklahoma (bottom row) for the 2008/2009 influenza season. In the I plots (first plots in each row), the points represent weekly Google Flu Trends values, while the lines correspond to the lower 2.5 th percentile, median, and the upper 2.5th percentile of the posterior distribution of It as time progresses. In the other plots, the two lines present the lower and upper 2.5th percentiles, while the points present the weekly posterior medians. The log-Bayes factor results for the two competing basic reproductive ratios, a mild one (1.25) and severe one (2.2), under 1:1 prior odds, are presented in the last panel. There seems to be little evidence for a pandemic.

The Bayes factor results are shown in the last panel of all result figures, under the 1:1 prior odds of a pandemic. A higher log-Bayes factor represents the stronger evidence for a seasonal epidemic. In all our analyses, the evidence for a regular epidemic seems to be increasing steadily over the course of the epidemic, starting to level off toward the end. None of the Bayes factors supported evidence for a pandemic in the United States and New Zealand. The full analysis of the New Zealand data is provided in the online supplementary material.

4.1. Comparison With MCMC

Pure compartmental models (without the state-space extension) have traditionally been fitted off-line, using nonlinear least-squares estimation procedures, or (as of recently) Bayesian estimation and MCMC techniques (O’Neill and Roberts 1999; Meligkotsidou and Fearnhead 2004; Neal and Roberts 2004; Elderd, Dukic, and Dwyer 2006; Leman, Chen, and Lavine 2009; Jewel et al. 2009). However, the lack of explicit likelihoods for these models generally results in slow estimation and lengthy MCMC runs. There is a large body of recent work on MCMC algorithms for dynamic models (Gilks and Berzuini 2001; Fearnhead 2002, 2008; Polson, Stroud and Muller 2008), discussing some of the computational issues with MCMC in dynamic models.

However, comparing a particle-filtering based algorithm with MCMC is useful to assess whether particle collapse might have been a problem. The sequential learning algorithm proposed in this article should perform well for dynamic models when there is a high level of conditional sufficiency for parameters of interest, which is not necessarily the case in real-life epidemics. For that reason, we take a closer look at the posterior distribution of the epidemic parameters (α,β,γ) and the two variances, and assess how the posteriors estimated via MCMC compared with those estimated via the sequential learning algorithm. The results are shown in Figure 12, at the end of the 2003/2004 U.S. flu season. There seems to be little difference between the marginal posterior densities of the three epidemic parameters and two variances. However, there was a notable difference in the length of time MCMC and sequential learning algorithm required to run: the sequential learning algorithm with 1,000,000 particles took on average less than 2 min on a 3.1GHz i5 processor for this season, while the MCMC with 1,000,000 iterations took approximately 15hr on the same processor. While this does not make MCMC infeasible for online surveillance, the time savings with sequential learning are notable.

Figure 12.

Figure 12.

Comparison of posterior distributions between the sequential learning algorithm and MCMC, at the end of the 2003/2004 U.S. flu season. Gray histograms correspond to the marginal posterior distributions obtained via MCMC (based on 1500 samples), while the white histograms correspond to those obtained via the sequential learning algorithm (“SLA”) proposed in this article based on 1,000,000 particles

5. CONCLUSIONS

This article presents a state-space SEIR analysis of an IP influenza surveillance dataset, the Google Flu Trends. The U.S. Flu Trends surveillance has been found to closely track the CDC reports and is able to precede it by 1–2 weeks, holding potential for developing real-time surveillance mechanisms. As a result, flexible epidemic models and fast tracking algorithms capable of near real-time estimation and prediction, as new data become available, are particularly important. We present one approach to near real-time disease tracking based on the state-space methodology, compartmental modeling, and sequential Bayesian learning.

Classical compartmental models of mathematical epidemiology have been the staple of epidemic modeling for over a century. However, the unchanging dynamical structure, present in most classical models, is often not appropriate for real-life epidemics, due to seasonality (Cauchemez and Ferguson 2008), behavior changes, vaccination, quarantine, migration, or a myriad of other reasons that affect how people interact with and react to a disease. The state-space approach is one of the most flexible and yet simple ways to incorporate changes in the disease dynamics through time, as it relaxes the determinism of the compartmental models through the presence of the evolution variance. Yet, compartmental models also provide simple but powerful insight into the process of disease dynamics that can be readily tied to intervention (e.g., reducing contact intensity through school closures and hygiene, shortening recovery period through antiviral drugs, etc.). The simple state-space extension of the classic SEIR model presented in this article combines the familiar mathematical epidemiology theory with computational speed and statistical flexibility.

Although information loss and particle collapse can be a problem in our sequential learning algorithm as well as in all other particle filtering approaches, in modest-scale applications, for problems where parameters and states vary smoothly and slowly over time, any reasonable SMC scheme should perform well (Kitagawa 1998). However, when sharp changes in the dynamics are present, as may happen during real-life epidemics due to media activity or public health interventions, tracking might prove challenging. As computational power increases, taking advantage of GPU (graphics processing unit) and cloud computing, the serious information loss issues might be somewhat lessened through increased number of particles used in these algorithms.

The Bayesian framework used in the article is able to easily provide uncertainty estimates. As a result, the method such as the one presented here can be used to guide dynamic allocation of resources (Ludkovski and Niemi 2010) and to facilitate comparisons of different intervention strategies. Such comparisons can be done based on the predictive distribution of outcomes rather than just their expectations, allowing the full propagation of uncertainty in the nonlinear decision problems.

Although this article uses IP surveillance data, it is important to note that CDC surveillance plays a crucial role in the U.S. surveillance and threat preparedness and that the approach we present here is only one way among several designed to aid CDC in continuing their mission. Combining our approach with the CDC’s scan statistic methodology would be a valuable contribution, which we hope to pursue in the future as we extend our algorithm to account for spatial structure across the United States.

Finally, although CDC has done an extensive validation on Google Flu Trends, the Flu Trends algorithm has still not been validated specifically for most states and cities. There are many ways in which states and localities differ, and search terms may be correlated within state (or even within subregions of states and individual metropolitan areas). For example, the search terms found likely to be indicative of ILI for Rhode Island could differ from those used in California, especially when one allows the use of other languages. If more localized online surveillance is to be put in place, Google Trends algorithms will likely need further refinement, in collaboration with local public health authorities and CDC, to capture some of these region-specific differences. Expert opinion on geographical variations and relations among search terms might be able to shed light onto this issue.

Supplementary Material

Supplement_NZ_Analysis

Acknowledgments

The authors thank Google.org, NSF CNH (GEO-1211668), NSF EID and NIH NIGMS (U01GM087729 and R01GM096655), and NIH NIDA (R12DA027624-01) for partial support, as well as the Editor, Associate Editor, and two anonymous reviewers. Special thanks to Drs. David Bortz, Greg Dwyer, and John Younger for helpful discussions. All code is available from the authors.

Footnotes

SUPPLEMENTARY MATERIAL

The supplementary material contains data and code used to generate Figures 5 and 10 in this article. It also presents additional analyses for New Zealand data.

Contributor Information

Vanja Dukic, Applied Mathematics, University of Colorado at Boulder.

Hedibert F. Lopes, Department of Econometrics and Statistics, The University of Chicago Booth School of Business..

Nicholas G. Polson, Department of Econometrics and Statistics, The University of Chicago Booth School of Business..

REFERENCES

  1. American College of Emergency Physicians. (2009), The National Report Card on the State of Emergency Medicine, Irving, Texas: American College of Emergency Physicians. Available at: http://www.emreportcard.org/uploadedFiles/ACEP-ReportCard-10-22-08.pdf [Google Scholar]
  2. Anderson RM, and May RM (1991), Infectious Diseases of Humans: Dynamics and Control, Oxford: Oxford University Press. [Google Scholar]
  3. Arulampalam M, Maskell S, Gordon N, and Clapp T (2002), “A Tutorial on Particle Filters for On-Line Nonlinear/Non-Gaussian Bayesian Tracking,” IEEE Transactions on Signal Processing, 50, 174–188. [Google Scholar]
  4. Atkinson KE (1978), Introduction to Numerical Analysis, New York: Wiley. [Google Scholar]
  5. Cappé O, Godsill S, and Moulines E (2007), “An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo,” IEEE Proceedings in Signal Processing, 95, 899–924. [Google Scholar]
  6. Carvalho CM, Johannes M, Lopes HF, and Polson NG (2010), “Particle Learning and Smoothing,” Statistical Science, 25, 88–106. [Google Scholar]
  7. Cauchemez S, and Ferguson NM (2008), “Likelihood-Based Estimation of Continuous-Time Epidemic Models From Time-Series Data: Application to Measles Transmission in London,” Journal of the Royal Society Interface, 5, 885–897. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Center for Infectious Disease Research & Policy. (2009), “Novel H1N1 Influenza (swine flu),” Technical Report, Academic Health Center, University of Minnesota. [Google Scholar]
  9. Chowell G, Ammon CE, Hengartner NW, and Hyman JM (2006), “Transmission Dynamics of the Great Influenza Pandemic of 1918 in Geneva, Switzerland: Assessing the Effects of Hypothetical Interventions,” Journal of Theoretical Biology, 241, 193–204. [DOI] [PubMed] [Google Scholar]
  10. Chowell G, Nishiura H, and Bettencourt L (2007), “Comparative Estimation of the Reproduction Number for Pandemic Influenza From Daily Case Notification Data,” Journal of the Royal Society Interface, 4, 155166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cintrón-Arias A, Castillo-Chávez C, Bettencourt L, Lloyd A, and Banks HT (2009), “The Estimation of the Effective Reproductive Number From Disease Outbreak Data,” Mathematical Biosciences and Engineering, 6, 261–282. [DOI] [PubMed] [Google Scholar]
  12. Doucet A, and Johansen A (2009), “A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later,” in Handbook of Nonlinear Filtering, eds. Crisan D and Rozovskii B, Oxford: Oxford University Press. [Google Scholar]
  13. Elderd B, Dukic V, and Dwyer G (2006), “Uncertainty in Predictions of Disease Spread and Public-Health Responses to Bioterrorism and Emerging Diseases,” in Proceedings of the National Academy of Sciences, 103, pp. 15693–15697. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Eubank S, Guclu H, Kumar V, Marathe M, Srinivasan A, Toroczkai Z, and Wang N (2004), “Modelling Disease Outbreaks in Realistic Urban Social Networks,” Nature, 429, 180–184. [DOI] [PubMed] [Google Scholar]
  15. Fearnhead P (2002), “Markov Chain Monte Carlo, Sufficient Statistics, and Particle Filters,” Journal of Computational and Graphical Statistics, 11, 848–862. [Google Scholar]
  16. ———(2008), “MCMC for Space Models,” Technical Report, Lancaster University. [Google Scholar]
  17. Ferguson NM, Keeling MJ, Edmunds WJ, Gant R, Grenfell BT, Amderson RM, and Leach S (2003), “Planning for Smallpox Outbreaks,” Nature, 425, 681–685. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Fraser C, Donnelly CA, Cauchemez S, Hanage WP, Van Kerkhove MD, Hollingsworth TD, Griffin J, Baggaley RF, Jenkins HE, Lyons EJ, Jombart T, Hinsley WR, Grassly NC, Balloux F, Ghani AC, Ferguson NM, Rambaut A, Pybus OG, Lopez-Gatell H, Alpuche-Aranda Ietza Bojorquez Chapela CM, Palacios Zavala E, Espejo Guevara DM, Checchi F, Garcia E, Hugonnet S, Roth C, and The WHO Rapid Pandemic Assessment Collaboration (2009), “Pandemic Potential of a Strain of Influenza a (H1N1): Early Findings,” Science, 324, 1557–1561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Gani R, Hughes H, Fleming D, Griffin T, Medlock J, and Leach S (2005), “Potential Impact of Antiviral Drug Use During Influenza Pandemic,” Emerging and Infectious Diseases, 11, 1355–1362. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Gani R, and Leach S (2001), “Transmission Potential of Smallpox in Contemporary Populations,” Nature, 414, 748–751. [DOI] [PubMed] [Google Scholar]
  21. Gilks W, and Berzuini C (2001), “Following a Moving Target-Monte Carlo Inference for Dynamic Bayesian Models,” Journal of the Royal Statistical Society, Series B, 63, 127–146. [Google Scholar]
  22. Ginsberg J, Mohebbi M, Patel R, Brammer L, Smolinski M, and Brilliant L (2009), “Detecting Influenza Epidemics Using Search Engine Query Data,” Nature, 457, 1012–1014. [DOI] [PubMed] [Google Scholar]
  23. He D, Ionides EL, and King AA (2009), “Plug-and-Play Inference for Disease Dynamics: Measles in Large and Small Towns as a Case Study,” Journal of the Royal Society Interface, 7, 271–283. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Hethcote HW (2000), “The Mathematics of Infectious Diseases,” SIAM Review, 42, 599–653. [Google Scholar]
  25. Hindmarsh A (1983), “ODEPACK: A Systematized Collection of ODE Solvers,” in Scientific Computing, ed. R. S. Stepleman, Amsterdam: NorthHolland, pp. 55–64. [Google Scholar]
  26. Jègat C, Carrat F, Lajaunie C, and Wackernagel H (2008), “Early Detection and Assessment of Epidemics by Particle Filtering,” in geoENV VI - Geostatistics for Environmental Applications, Proceedings of the Sixth European Conference on Geostatistics for Environmental Applications, eds. Soares A, Pereira MJ, and Dimitrakopoulos R, Quantitative Geology and Geostatistics, 15, 23–35. Available at: http://www.springerlink.com/content/g5h7t3j980304372/ [Google Scholar]
  27. Jewel C, Kypraios T, Neal P, and Roberts G (2009), “Bayesian Analysis for Emerging Infectious Diseases,’ Bayesian Analysis, 4, 465496. [Google Scholar]
  28. Kantas N, Doucet A, Singh S, and Maciejowski J (2009), “An Overview of Sequential Monte Carlo Methods for Parameter Estimation on General State Space Models,” IFAC proceedings Volumes (IFAC-Papers Online), 15, 774–785. [Google Scholar]
  29. Kaplan EH, Craft DL, and Wein LM (2002), “Emergency Response to a Smallpox Attack: The Case for Mass Vaccination,” in Proceedings of the National Academy of Sciences of the United States of America, 99, pp. 10935–10940. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Kermack W, and McKendrick A (1927), “Contribution to the Mathematical Theory of Epidemics,” Proceedings of the Royal Society of London, Series A, 115, 700–721. [Google Scholar]
  31. King AA, Ionides EL, Pascual M, and Bouma MJ (2008), “Inapparent Infections and Cholera Dynamics,” Nature, 454, 877–880. [DOI] [PubMed] [Google Scholar]
  32. Kitagawa G (1998), “A Self-Organizing State-Space Model,” Journal of the American Statistical Association, 93, 1203–1215. [Google Scholar]
  33. Kong A, Liu JS, and Wong WH (1994), “Sequential Imputations and Bayesian Missing Data Problems,” Journal of the American Statistical Association, 89, 278–288. [Google Scholar]
  34. Leman S, Chen Y, and Lavine M (2009), “The Multiset Sampler,” Journal of the American Statistical Association, 104, 1029–1041. [Google Scholar]
  35. Liu J, and West M (2001), “Combined Parameters and State Estimation in Simulation-Based Filtering,” in Sequential Monte Carlo Methods in Practice, eds. Doucet A, de Freitas N, and Gordon N, New York: Springer-Verlag, pp. 197–223. [Google Scholar]
  36. Longini I, Halloran M, Nizam A, and Yang Y (2004), “Containing Pandemic Influenza With Antiviral Agents,” American Journal of Epidemiology, 159, 623–633. [DOI] [PubMed] [Google Scholar]
  37. Lopes HF, Carvalho CM, Johannes MS, and Polson NG (2011), “Particle Learning for Sequential Bayesian Computation,” in Bayesian Statistics 9, eds. Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, and West M, Oxford: Oxford Science Publications. [Google Scholar]
  38. Lopes HF, and Tsay RE (2011), “Particle Filters and Bayesian Inference in Financial Econometrics,” Journal of Forecasting, 30, 168209. [Google Scholar]
  39. Ludkovski M, and Niemi J (2010), “Optimal Dynamic Policies for Influenza Management,” Statistical Communications in Infectious Diseases, 2, available at http://www.degruyter.com/view/j/scid.2010.2.1/scid.2010.2.1.1020/scid.2010.2.1.1020.xml [Google Scholar]
  40. Meligkotsidou L, and Fearnhead P (2004), “Exact Filtering for PartiallyObserved Continuous-Time Models,” Journal of the Royal Statistical Society, Series B, 66, 771–789. [Google Scholar]
  41. Mills CE, Robins JM, and Lipsitch M (2004), “Transmissibility of 1918 Pandemic Influenza,” Nature, 432, 904–906. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Neal PJ, and Roberts GO (2004), “Statistical Inference and Model Selection for the 1861 Hagelloch Measles Epidemic,” Biostatistics, 5, 249–261. [DOI] [PubMed] [Google Scholar]
  43. Nishiura H (2007), “Time Variations in the Transmissibility of Pandemic Influenza in Prussia, Germany, From 1918–19,” Theoretical Biology and Medical Modelling, 4, 20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. O’Neill P, and Roberts GO (1999), “Bayesian Inference for Partially Observed Stochastic Epidemics,” Journal of the Royal Statistical Society, Series A, 162, 121–129. [Google Scholar]
  45. Petzold L (1983), “Automatic Selection of Methods for Solving Stiff and Nonstiff Systems of Ordinary Differential Equations,” SIAM Journal on Scientific and Statistical Computing, 4, 136–148. [Google Scholar]
  46. Pitt M, and Shephard N (1999), “Filtering via Simulation: Auxiliary Particle Filters,” Journal of the American Statistical Association, 94, 590599. [Google Scholar]
  47. Polson N, Stroud J, and Muller P (2008), “Practical Filtering With Sequential Parameter Learning,” Journal of the Royal Statistical Society, Series B, 70, 413–428. [Google Scholar]
  48. Ristic B, Arulampalam S, and Gordon N (2004), Beyond the Kalman Filter: Particle Filters for Tracking Applications, Boston, MA: Artech House. [Google Scholar]
  49. Rodeiro CV, and Lawson A (2006), “Online Updating of Space-Time Disease Surveillance Models via Particle Filters,” Statistical Methods in Medical Research, 15, 1–22. [DOI] [PubMed] [Google Scholar]
  50. Storvik G (2002), “Particle Filters in State Space Models With the Presence of Unknown Static Parameters,” IEEE Transactions on Signal Processing, 50, 281–289. [Google Scholar]
  51. Tuite AR, Greer AL, Whelan M, Winter A-L, Lee B, Yan P, Wu J, Moghadas S, Buckeridge D, Pourbohloul B, and Fisman DN (2010), “Estimated Epidemiological Parameters and Morbidity Associated With Pandemic H1N1 Influenza,” Canadian Medical Association Journal, 182, 131–136. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. U.S. Census Bureau. (2009), Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2000 to July 1, 2009, Washington, DC: Population Division. [Google Scholar]
  53. Vaillant L, La Ruche G, Tarantola A, and Barboza P (2009), “Epidemiology of Fatal Cases Associated With Pandemic H1N1 Influenza 2009,” Eurosurveillance, 14, 1–6. [DOI] [PubMed] [Google Scholar]
  54. Vynnycky E, and Edmunds WJ (2008), “Analyses of the 1957 (Asian) Influenza Pandemic in the United Kingdom and the Impact of School Closures,” Epidemiology & Infection, 136, 166–179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Webby RJ, and Webster RG (2003), “Are We Ready for Pandemic Influenza?,” Science, 302, 1519–1522. [DOI] [PubMed] [Google Scholar]
  56. West M, and Harrison J (1997), Bayesian Forecasting and Dynamic Models (2nd ed.), New York: Springer-Verlag. [Google Scholar]
  57. Yang Y, Sugimoto J, Halloran M, Basta N, Chao D, Matrajt L, Potter G, Kenah E, and Longini I (2009), “The Transmissibility and Control of Pandemic Influenza a (H1N1) Virus,” Science Express, 729733. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement_NZ_Analysis

RESOURCES