PLOS Computational Biology
2020 Aug 17;16(8):e1008117. doi: 10.1371/journal.pcbi.1008117

Real-time estimation of disease activity in emerging outbreaks using internet search information

Emily L Aiken 1,*, Sarah F McGough 2, Maimuna S Majumder 3, Gal Wachtel 4, Andre T Nguyen 5,6, Cecile Viboud 7, Mauricio Santillana 1,4,8,*
Editor: Juliet RC Pulliam
PMCID: PMC7451983  PMID: 32804932

Abstract

Understanding the behavior of emerging disease outbreaks in, or ahead of, real-time could help healthcare officials better design interventions to mitigate impacts on affected populations. Most healthcare-based disease surveillance systems, however, have significant inherent reporting delays due to data collection, aggregation, and distribution processes. Recent work has shown that machine learning methods leveraging a combination of traditionally collected epidemiological information and novel Internet-based data sources, such as disease-related Internet search activity, can produce meaningful “nowcasts” of disease incidence ahead of healthcare-based estimates, with most successful case studies focusing on endemic and seasonal diseases such as influenza and dengue. Here, we apply similar computational methods to emerging outbreaks in geographic regions where no historical presence of the disease of interest has been observed. By combining the limited historical epidemiological data available with disease-related Internet search activity, we retrospectively estimate disease activity in five recent outbreaks weeks ahead of traditional surveillance methods. We find that the proposed computational methods frequently provide useful real-time incidence estimates that can help fill temporal data gaps resulting from surveillance reporting delays. However, the proposed methods are limited by issues of sample bias and skew in search query volumes, perhaps as a result of media coverage.

Author summary

Public health officials regularly make choices about treatment and prevention in disease outbreaks that have the potential to impact entire affected populations. Often these decisions are based on incomplete or unreliable information due to inherent reporting delays in healthcare-based disease surveillance systems. This issue of public health decision-making based on limited data is even more salient in emerging outbreaks, which are typically characterized by uncertain disease dynamics and limited surveillance capacity. We demonstrate the potential for using digital trace data—in this case, Internet-based information from Google search trends—for estimating disease activity in emerging outbreaks in the absence of accurate real-time healthcare-based data sources. We evaluate how data-driven methods leveraging search trend data would have performed in real-time in five recent outbreaks (yellow fever in Angola, Zika in Colombia, Ebola in the DRC, plague in Madagascar, and cholera in Yemen), and find that the methods frequently provide useful signals of disease activity ahead of standard healthcare-based surveillance methods.

Introduction

Disease outbreaks have been major drivers of morbidity and mortality since the beginning of recorded history and continue to pose a major threat to humankind. Surveillance of disease outbreaks by healthcare systems is key to effective outbreak response. In particular, surveillance data is necessary to determine the overall scale of response to an outbreak, allocate limited resources for treatment and prevention, and effectively time interventions to minimize impacts [1]. Epidemiologists use surveillance data to estimate important features of an outbreak, such as morbidity and mortality burden, case fatality rate, and transmission patterns. In recent years, the use of mathematical modeling of disease activity and transmission to predict the likely trajectory of an outbreak and guide intervention strategies has been increasingly explored [1–4].

It is particularly challenging to monitor and characterize unexpected (emerging) disease outbreaks in regions that have not experienced the presence of a specific pathogen in recent times. Such emerging disease outbreaks, especially in their early stages, are characterized by incomplete, delayed, and biased epidemiological surveillance data [1]. Reporting delays in surveillance systems inevitably emerge from limited healthcare resources and coverage, as well as the time required to process lab tests and clean, anonymize, aggregate, and communicate data from distributed healthcare facilities to central authorities. These reporting delays and issues of missingness are manifested in epidemiological reports released by the World Health Organization (WHO) and other health authorities for several recent outbreaks [5–11].

Novel Internet-based data sources have the potential to fill some of these temporal “data gaps” in tracking emerging outbreaks. Research to date on using Internet-based data sources to provide early estimates of disease activity has shown promising results for endemic diseases in high- and middle-income countries, including influenza in the United States [12–17] and dengue in Brazil, Mexico, Thailand, Singapore, and Taiwan [18]. Digital epidemiological methods use mathematical techniques to combine Internet-based data—including Google search trends (data on aggregated Google query volumes) [12, 13, 18], Twitter microblogs [14, 15, 19], online news aggregators [20], electronic medical records [21, 22], and crowdsourced disease activity estimates [23, 24]—with historic epidemiological data to produce real-time estimates of disease activity (“nowcasts”).

The most famous digital epidemiological study to date, Google Flu Trends [12], tracked national influenza rates in the United States using Google query volumes and autoregressive epidemiological data. It was famously discontinued in 2015 after underestimating the H1N1 outbreak of 2009 and missing by large margins in subsequent influenza seasons [25]. Google Dengue Trends, a related project tracking dengue fever using Internet query data in Bolivia, Brazil, India, Indonesia, and Singapore [26], was similarly discontinued in August 2015 [18]. In 2015, researchers revised the Google Flu Trends algorithm for tracking seasonal and endemic diseases as ARGO, a machine learning approach based on a dynamic multivariate regularized regression that leverages autoregressive epidemiological data along with real-time generalized online data sources, including Google search trends, Twitter microblogs, electronic health records, and others [13, 27]. ARGO has been shown to produce meaningful and accurate national-level disease activity estimates for influenza in the US and Latin America, and dengue in several middle-income countries, weeks ahead of reports issued by traditional surveillance systems [13, 17, 18].

Adapting digital epidemiological methods like ARGO for tracking emerging outbreaks in developing regions brings up a host of new challenges relating to an absence of historical epidemiological data for training and validation, as well as a paucity of digital data due to poorer Internet coverage. To our knowledge, three past studies have experimented with Internet-based data for emerging infections: Majumder et al. [28] demonstrate the use of digital data sources (including Google search trends and news reports) to provide estimates of R0, the basic reproductive number, in the absence of real-time epidemiological surveillance data in the 2016 Latin American Zika outbreak. Chunara et al. [29] use Twitter and news report data to estimate R0 in the 2010 Haitian cholera outbreak. In the only work to date on nowcasting disease incidence in an emerging outbreak with digital data sources, McGough et al. [30] incorporate information from Google search trends, Twitter, and news reports to produce nowcasts of incidence in the 2015-2016 Latin American Zika outbreak 1-3 weeks in advance of standard epidemiological reports.

Our contribution

Here we expand on McGough et al. [30] to evaluate the performance of digital epidemiological methods for nowcasting five contemporary outbreaks: yellow fever in Angola (2016), Zika in Colombia (2015-2016), Ebola in the Democratic Republic of the Congo (2018-present), pneumonic plague in Madagascar (2017), and cholera in Yemen (2016-2017). We propose three simple data-driven predictive models: a linear autoregression that uses historic epidemiological data to produce real-time disease activity estimates (AR), a linear regression that leverages observed Google query volumes to estimate disease incidence (GT), and a regression on both historic epidemiological data and search query data (ARGO). We find that ARGO provides useful estimates of disease activity for yellow fever in Angola, Zika in Colombia, and plague in Madagascar weeks earlier than traditional healthcare-based surveillance data. We find that our data-driven methods are less effective at tracking Ebola in the DRC and cholera in Yemen, and hypothesize that issues of sample bias and skew in search query volumes as a result of media coverage may contribute to a poor signal in these cases.

Results

Motivation for digital epidemiological methods

To motivate the use of digital data streams to monitor emerging outbreaks, we produced a series of correlations assessing the relationship between each outbreak’s epidemiological curve and the volume of a simple Google search term querying the disease of interest (e.g. the search term “Zika” in the case of Colombia). As shown in Fig 1, the search volumes appear to track the time series of cases synchronously in most countries, and we observed high correlations for Angola (r = 0.84, yellow fever), Colombia (r = 0.80, Zika), and Madagascar (r = 0.73, plague), suggesting the potential utility of digital data-driven epidemiological models.

Fig 1. Motivation for digital epidemiological modeling of five emerging outbreaks.

Fig 1

In each case, the outbreak’s epidemiological curve (in grey, normalized to the range [0, 1]) is compared with normalized search volumes for a single related search term within the country in question.

For each disease outbreak, we built three machine learning models to produce (retrospective and out-of-sample) real-time disease activity estimates that use input information that would have been available at the time of prediction. Our three models were trained dynamically on a continuously expanding time window to incorporate new information as it became available, and are summarized as follows: (1) an autoregressive model (AR), which uses only historical cases from the past n weeks to predict current cases; (2) a Google search trends model (GT), a multivariate model that uses only synchronous Google search terms for prediction; and (3) ARGO, a multivariate model similar to the one presented in [13] that combines both autoregressive case information and Google searches to make predictions. A handful of simple search terms were selected for inclusion in the GT and ARGO models based on their obvious relevance to the disease in question. We assessed the predictive performance of each model against subsequent observations from healthcare-based disease surveillance systems. Details of model implementation can be found in the Materials and Methods section.
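The three model variants can be sketched as follows. This is a minimal illustration under stated assumptions: the paper's implementation uses regularized (LASSO-style) regression, for which plain least squares stands in here, and the function name, lag count, and minimum window size are hypothetical choices for the sketch.

```python
import numpy as np

def nowcast(cases, searches, model="argo", n_lags=3):
    """One-step-ahead nowcasts on an expanding training window.

    cases    : 1-D array of weekly case counts (the delayed ground truth)
    searches : 2-D array (weeks x terms) of Google search query volumes
    model    : "ar" (case lags only), "gt" (searches only), "argo" (both)

    Illustrative sketch only; plain least squares stands in for the
    regularized regression used in the paper.
    """
    preds = {}
    for t in range(n_lags + 4, len(cases)):   # need a few weeks to train
        rows = []
        for i in range(n_lags, t + 1):
            feats = []
            if model in ("ar", "argo"):
                feats.extend(cases[i - n_lags:i][::-1])  # autoregressive terms
            if model in ("gt", "argo"):
                feats.extend(searches[i])                # synchronous searches
            rows.append(feats)
        X = np.asarray(rows, dtype=float)
        X = np.hstack([X, np.ones((len(X), 1))])         # intercept column
        y = np.asarray(cases[n_lags:t + 1], dtype=float)
        # Fit on weeks up to t-1, then nowcast week t with real-time searches.
        beta, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)
        preds[t] = float(X[-1] @ beta)
    return preds
```

Retraining inside the loop mirrors the expanding-window strategy described above: each week's model sees every observation released so far, so coefficients can drift as the outbreak evolves.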

Evaluation assuming continuous flow of available epidemiological data

As a reality check, our first series of models compared nowcasts produced 1 and 2 weeks ahead of the release of case reports against the ground-truth incidence available retrospectively in weekly epidemiological updates produced by local health authorities. These models were trained and built with a strategy similar to the one used for endemic and seasonal diseases, to verify that our approach could produce meaningful disease estimates under the assumption that disease activity reports are continuously available with delays of one to two weeks. This assumption is not always satisfied in emerging disease outbreaks. Fig 2 shows these predictions over the full time series of each outbreak, while Table 1 summarizes the out-of-sample predictive performance across models and countries as captured by Pearson’s correlation (CORR), root-mean-square error (RMSE), and relative root-mean-square error (rRMSE).
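The three evaluation metrics can be computed as below. One caveat: the normalization used for rRMSE is not spelled out in the text, so dividing RMSE by the mean of the observed series is an assumption here, chosen because it makes errors comparable across outbreaks of different magnitudes.

```python
import numpy as np

def evaluate(pred, obs):
    """Pearson correlation, RMSE, and relative RMSE of nowcasts vs.
    observed cases. rRMSE here divides RMSE by the mean of the observed
    series (an assumed normalization, not taken from the paper)."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    corr = np.corrcoef(pred, obs)[0, 1]
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    rrmse = rmse / obs.mean()
    return corr, rmse, rrmse
```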

Fig 2. Series of plots comparing the nowcasts produced by three digital epidemiological models (available in real-time) to “ground truth” epidemiological data (available at a delay).

Fig 2

The left column shows how models perform assuming a 1-week reporting delay in the traditional surveillance system; the right column shows model performance assuming a 2-week reporting delay.

Table 1. Evaluations of three computational models (AR, GT, and ARGO) across five outbreaks, based on correlation (r, Panel A), root-mean-square error (RMSE, Panel B), and relative root-mean-square error (rRMSE, Panel C).

The result of the best-performing model for each prediction scenario and metric is bolded. It is important to note that the units of the error (RMSE) are different given that the magnitude of each outbreak was different. The relative error, however, is comparable across outbreaks.

                 Yellow Fever     Zika              Ebola            Plague           Cholera
Delay (weeks)    1       2        1       2         1       2        1      2        1         2

Panel A: Correlation (r)
AR               0.879   0.54     0.92    0.78      0.57    0.19     0.91   0.88     0.98      0.93
GT               0.79    0.80     0.78    0.73      0.582   0.50     0.74   0.68     0.65      0.59
ARGO             0.882   0.69     0.93    0.82      0.581   0.17     0.92   0.84     0.99      0.94

Panel B: Root-mean-square error (RMSE)
AR               17.60   62.65    644.24  1176.74   15.252  28.11    8.45   11.65    4224.88   9156.57
GT               17.66   17.63    997.45  1072.01   16.98   18.13    13.60  15.38    18532.22  19486.67
ARGO             13.22   20.42    542.39  823.34    15.246  27.41    7.97   11.85    3973.06   8497.43

Panel C: Relative root-mean-square error (rRMSE)
AR               0.55    2.10     0.31    0.54      0.81    1.40     0.45   0.53     0.23      0.48
GT               0.56    0.59     0.58    0.50      0.90    0.90     0.72   0.70     1.01      1.03
ARGO             0.42    0.69     0.26    0.38      0.81    1.37     0.42   0.54     0.22      0.44

We found that, based on RMSE and correlation, digital epidemiological models that incorporated Google information (GT and ARGO) led to reasonable disease estimates that were within range of the observed disease activity. Specifically, GT and ARGO outperformed a naïve autoregressive approach (AR) in all outbreaks and prediction horizons except for plague, for which a pure AR model performed best at a 2-week delay. In general, ARGO exhibited the lowest RMSE and highest correlation in a majority of countries and prediction horizons, though Google data alone improved predictions in the case of 2-week delays in two of the outbreaks (yellow fever and Ebola). We note, however, that nowcast models were generally not skillful enough to track Ebola in the DRC, which exhibited substantially lower predictive performance compared to the other countries (correlation range: 0.17-0.58). Moreover, we observed that the ARGO method does not improve significantly upon a naïve autoregressive approach for tracking either Ebola in the DRC or cholera in Yemen.

To assess the predictive power of the Google search terms used to nowcast cases each week, and to visualize changes in predictive power over the course of each epidemic, the sizes of the ARGO model coefficients for each week of prediction are shown for each country in Fig 3 and S1–S5 Figs. Because the models are dynamically retrained each week on an expanding time window, the predictive power of the variables fluctuates over the weeks of the outbreak, with many search terms appearing most important for prediction in the early stages of the outbreak.
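The week-by-week coefficient traces underlying heatmaps like Fig 3 can be reproduced schematically as follows. As before, plain least squares stands in for the paper's regularized fit, and the function name and starting window size are illustrative assumptions.

```python
import numpy as np

def weekly_coefficients(X, y, start=8):
    """Refit an expanding-window linear model each week and record its
    coefficient vector, as one would to build a feature-importance
    heatmap over the course of an outbreak.

    X : 2-D array (weeks x features) of predictors
    y : 1-D array of weekly case counts
    Returns a (weeks x features) matrix of fitted coefficients.
    """
    coefs = []
    for t in range(start, len(y) + 1):
        Xt = np.hstack([X[:t], np.ones((t, 1))])      # add intercept
        beta, *_ = np.linalg.lstsq(Xt, y[:t], rcond=None)
        coefs.append(beta[:-1])                       # drop intercept
    return np.vstack(coefs)
```

Plotting the returned matrix with weeks on one axis and features on the other yields the kind of coefficient heatmap shown in Fig 3.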

Fig 3. Evaluating feature importances (coefficients in linear regression) in ARGO for nowcasting plague in Madagascar assuming a reporting delay of one week.

Fig 3

In the heatmap, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero. Since the model is trained dynamically, feature importances shift from week to week. Note that the autoregressive term is extremely important, but information from Google search trends is also used, particularly early on in the outbreak.

Evaluation based on publicly released reports

The first evaluation approach assumed that the ground truth (weekly cases) was reported accurately within 1-2 weeks of occurrence, which is rarely the case in emerging outbreaks, in which surveillance may be constrained by limited resources.

In our second approach, we evaluated the performance of the same three models (AR, GT, and ARGO) under more realistic conditions, using partial and unrevised case reports as they were released in real-time (Fig 4). In contrast to the first approach, here the models were trained on a potentially (and frequently) unreliable ground truth, since case reports released at any given point may be continually updated by subsequent revisions of past disease activity. We assessed the feasibility of these models for estimating disease activity when no epidemiological data are available in real-time. This analysis was performed on all 7-17 reports for each of the five disease outbreaks; a selection of case studies is presented here, and full charts are included in S6–S10 Figs. In addition, Table 2 compares aggregate measures of the accuracy of each nowcasting model for predicting case counts in real-time alongside the accuracy of the epidemiological reports that were released in real-time.

Table 2. Comparison of cases reported in epidemiological bulletins to projections produced by our nowcasting models in real-time.

Accuracy is measured in comparison to the ground-truth case counts eventually reported at the end of each outbreak. Accuracy of each model is calculated separately for the second-to-last week with epidemiological data reported in each bulletin, the last week with epidemiological data reported, and the first week without any epidemiological data reported (we do not evaluate longer time horizons since in several outbreaks epidemiological bulletins are produced near-weekly, rendering projections based on previous bulletins obsolete). Accuracy measures are then averaged over epidemiological bulletins for each outbreak (ranging from seven for Zika in Colombia to 17 for Ebola in the DRC), with standard deviations shown in parentheses. The accuracy of the data source producing the most accurate point estimates on average is bolded.

                  Percent error                                         Absolute error
                  2nd-to-last wk    Last wk           First wk          2nd-to-last wk     Last wk            First wk
                  with epi data     with epi data     without data      with epi data      with epi data      without data

Yellow Fever in Angola (N = 11)
Report            67.2 (24.9)       80.6 (34.4)       --                12.8 (9.8)         17.0 (15.1)        --
AR                124.4 (189.8)     610.1 (1035.2)    958.8 (1543.4)    9.0 (5.4)          16.4 (12.6)        31.0 (11.8)
GT                156.5 (286.7)     397.1 (715.4)     430.9 (761.5)     11.1 (5.1)         14.4 (7.0)         13.4 (7.7)
ARGO              105.4 (180.1)     419.9 (720.5)     540.7 (901.4)     8.5 (6.1)          16.4 (7.8)         18.5 (10.9)

Zika in Colombia (N = 7)
Report            21.7 (27.4)       33.2 (8.8)        --                418.0 (377.2)      872.2 (182.0)      --
AR                34.0 (46.2)       38.1 (26.2)       31.3 (34.7)       710.3 (529.8)      1104.4 (1088.2)    1088.8 (1622.3)
GT                36.9 (14.2)       31.6 (14.6)       39.3 (10.7)       961.4 (316.6)      962.8 (611.4)      1214.7 (603.0)
ARGO              20.9 (39.0)       14.9 (13.9)       15.6 (18.1)       344.4 (416.0)      334.0 (198.3)      584.3 (809.0)

Ebola in the DRC (N = 17)
Report            41.8 (31.5)       82.5 (20.2)       --                11.1 (10.3)        21.2 (9.4)         --
AR                46.1 (23.3)       50.2 (20.2)       54.5 (31.1)       13.3 (10.7)        14.1 (7.5)         18.1 (12.4)
GT                53.2 (21.5)       50.8 (25.2)       58.4 (23.4)       16.8 (12.3)        15.8 (12.2)        19.8 (12.9)
ARGO              43.9 (25.7)       50.6 (19.2)       59.6 (26.7)       13.3 (11.1)        14.2 (7.4)         19.5 (11.7)

Plague in Madagascar (N = 12)
Report            11.6 (14.1)       14.9 (17.2)       --                2.6 (3.3)          5.2 (7.1)          --
AR                63.3 (72.0)       40.7 (23.0)       48.6 (42.1)       10.0 (5.1)         8.6 (4.9)          9.4 (8.3)
GT                65.4 (57.6)       46.4 (32.6)       56.9 (49.7)       13.0 (13.9)        12.2 (12.0)        12.4 (12.9)
ARGO              67.7 (76.0)       38.5 (26.8)       40.6 (37.1)       10.6 (5.6)         7.3 (4.2)          7.7 (7.0)

Cholera in Yemen (N = 12)
Report            5.1 (6.2)         18.9 (9.0)        --                2045.1 (3273.2)    6773.2 (4881.6)    --
AR                9.3 (5.6)         12.4 (7.0)        20.7 (18.3)       3160.3 (2252.2)    3516.3 (1289.1)    4739.7 (3236.1)
GT                49.9 (22.7)       48.0 (23.8)       46.6 (23.5)       18657.1 (11130.5)  17618.5 (11182.9)  16013.7 (10839.1)
ARGO              9.3 (5.6)         12.4 (7.0)        20.7 (18.3)       3160.3 (2252.2)    3516.3 (1289.1)    4739.7 (3236.1)
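The per-bulletin accuracy computation described in the Table 2 caption can be sketched as follows. The input format here is a hypothetical simplification (one week per bulletin rather than the three horizons evaluated in the table), and the function name is illustrative.

```python
import numpy as np

def bulletin_errors(bulletins, ground_truth):
    """Mean and standard deviation of percent and absolute error across
    bulletins, measured against the final ground-truth case counts.

    bulletins    : list of (week_index, reported_count) pairs, one per
                   bulletin, for a single evaluation horizon
                   (hypothetical input format for this sketch)
    ground_truth : dict mapping week_index -> final case count
    """
    pct, abs_err = [], []
    for week, reported in bulletins:
        truth = ground_truth[week]
        abs_err.append(abs(reported - truth))
        pct.append(100.0 * abs(reported - truth) / truth)
    return (np.mean(pct), np.std(pct)), (np.mean(abs_err), np.std(abs_err))
```

In the paper's setting this would be run once per horizon (second-to-last week, last week, first week without data) and once per data source (Report, AR, GT, ARGO) to fill the table.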

As shown in Fig 4, we observed that, even in these realistic circumstances, ARGO produced meaningful disease activity estimates that filled the temporal gap introduced by delayed availability of epidemiological reports. The value of Google search data and nowcast models like ARGO is most apparent in outbreaks with long delays between epidemiological updates: consider the upper-right panel of Fig 4, in which ARGO projections provide useful information in the absence of any up-to-date epidemiological data.

Fig 4. Summary of evaluation approach based on publicly released reports.

Fig 4

In each figure, the grey filled area is the ground truth data (available at, or after, the end of the outbreak). The black line shows surveillance data released in each report at the time of publication (the date of publication is denoted by the dashed black line), and the colored lines show the real-time predictions of our three models. Here, figures are included for three epidemiological situation reports for each outbreak; more plots for the same evaluation task can be found in S6–S10 Figs. For inclusion in this figure, we chose three evenly-spaced situation reports from the middle of each outbreak.

Further, as shown in Table 2, ARGO was as accurate as or more accurate than the other models for estimating case counts in real-time in most outbreaks (the exceptions being Ebola in the DRC, for which AR was most accurate, and yellow fever in Angola, for which ARGO and GT traded off as most accurate). Our models were frequently more reliable than the epidemiological data recorded for the final week in each epidemiological report (for yellow fever in Angola, Zika in Colombia, Ebola in the DRC, and cholera in Yemen, based on absolute error), and occasionally more accurate than data recorded for the second-to-last week in each epidemiological report (for yellow fever in Angola and Zika in Colombia, based on absolute error). These results suggest that even when epidemiological data are available in near-real-time, they can be complemented by nowcast models, which do not suffer from issues of under-reporting.

Discussion

We show that machine learning techniques that combine real-time disease-related Google search activity with (delayed and frequently incomplete) epidemiological information available during emerging outbreaks can provide useful real-time insights on the likely trajectory of disease transmission. By assessing model predictions in (i) a setting that assumes the continuous availability of delayed epidemiological information (reporting delays of 1-2 weeks with no case revision) and (ii) a set of realistic historical settings where delayed information was unavailable or unreliable (reporting delays of variable week lengths and with case revisions in subsequent epidemiological reports), we demonstrate that incorporating disease-related Google search information improves predictions across several disparate disease and country contexts.

In particular, we demonstrate, for the first time, how a digital nowcast model like ARGO would be deployed in real-time during multiple distinct emerging disease outbreaks with reporting delays and surveillance revisions. We show specifically the insights that would have been accessible in real-time had our approaches been implemented during the emergence of these outbreaks. Consider, for example, the real-time disease predictions for the 2017 plague outbreak in Madagascar shown in the right-middle panel of Fig 4. The black line, which indicates the number of known reported cases at the time of release of an epidemiological report (Oct. 16, 2017), suggests a sharp decline in cases in October. By the end of the outbreak, it would become clear that there was no decrease in cases in October (ground truth cases produced at the end of the outbreak are shown in gray shading), an insight which was not available in real-time, but which was captured by the Google-based model (GT, green line). We find that the pattern demonstrated in Madagascar generalizes to other diseases and regions: epidemiological reports frequently display a down-turn at the end of the case curve due to under-reporting, implying that the outbreak may be coming to an end. Since our models do not suffer from the under-reporting issue, they exhibit no such downturn, frequently suggesting that the outbreak is ongoing when the most up-to-date epidemiological data suggest otherwise. Moreover, as shown in Fig 4, predictions generated by our models could be used to fill temporal “data gaps” when up-to-date epidemiological data is unavailable.

In addition to showing the potential utility of real-time predictions trained on unreliable or incomplete epidemiological data, our analysis confirms the findings of other digital epidemiological studies that demonstrate the added value of combining Google-based predictions with autoregressive case information [13, 16, 18, 30]. Indeed, the ARGO coefficient heatmaps in Fig 3 and S1–S4 Figs reveal that the epidemiological case information from previous weeks has consistently strong predictive power over the course of the outbreak, while the importance of Google predictors fluctuates over time and appears to be most useful in the earlier stages of the studied outbreaks. The phenomenon that past cases are intrinsically linked to future cases is a common feature of infectious disease outbreaks: here, we leverage this fact to improve the accuracy of our predictions, evidenced by the fact that ARGO generally outperforms the Google-only and autoregressive models across diseases and prediction horizons.

Further, our findings suggest that the relative feature importance of autoregressive information and Google search data depends on the timescale of disease transmission (the serial interval). Specifically, we find that search data appears to possess greater predictive power for diseases with short serial intervals, like influenza, and less predictive power for diseases like cholera, where transmission timescales are typically longer. We hypothesize that diseases that spread quickly and affect large swaths of the population generate more data in general (both ground-truth epidemiological data and trace data from Internet searches), yielding a higher signal-to-noise ratio, so models are better able to generalize from the multiple data streams. Diseases with longer serial intervals, which spread more slowly, will naturally produce scarcer data from all sources.

While Google data holds much promise for tracking and predicting outbreaks, it has several limitations for epidemiological purposes. In the context of emerging outbreaks, these include bias in the sample of Google users, bias due to search term selection, and bias introduced by media coverage.

Google users are a non-random sub-sample of the population, and this bias is particularly significant in the context of most emerging outbreaks, which occur in developing regions where Internet penetration is relatively low and where there are significant rich-poor and urban-rural divides in Internet access. As a result, much of the disease-related Google search activity may occur in a country’s capital, while cases of the disease may occur all over the country or in a specific region with low Internet penetration. Exploring Google search activity at sub-national levels could help provide insight into this issue, though this bias will likely become less relevant as global Internet penetration in rural regions increases. Relatedly, not all Internet users are Google users; cultural relevance is an important factor in determining which Internet-based data sources are appropriate for digital epidemiological studies of emerging outbreaks.

Search term selection introduces another form of bias in modeling disease trajectories with search query data. As discussed in more detail in the Materials and Methods section, it is standard in the literature on digital epidemiology to select search terms by mining correlations between search query volumes and epidemiological data during a training period which is disjoint from the period of model evaluation [13, 18, 30]. In emerging outbreaks, however, there is little time for such calibration, so a “common sense” approach to selecting intuitive search terms like the one used in this paper may be more appropriate. Future work could examine the types of search terms that are particularly relevant during outbreaks, perhaps examining temporal heterogeneity in search term frequency to understand how the behavior of searchers changes over the course of an outbreak.

Finally, media coverage may confound the interpretation of our models. In using Google query volumes as a proxy for disease activity, the hope is that queries come from individuals who are infected or suspect infection. However, Google query volumes inevitably also contain signals driven by intense media coverage (often pervasive during novel and unexpected outbreaks), which prompts large numbers of people in the affected country to search for disease-related terms out of curiosity, seeking news articles. Consider the graph of search volumes for the term “peste” (French for “plague”) in Madagascar in Fig 1: there is a sharp spike in volumes in mid-October that appears anomalous relative to the incidence curve. It is reasonable to hypothesize that this spike is the result of the first media coverage of that outbreak.

To evaluate how media coverage may skew Google search volumes, we qualitatively compare signals in Google searches and news report volumes with the epidemiological time series. Fig 5 compares the volume of news articles (obtained from the GDELT Global Knowledge Graph [31]), Google search trends, and reported cases side by side for each outbreak. Based on this analysis, it is plausible that ARGO’s weaker performance on Ebola in the DRC and on cholera in Yemen is caused by premature spikes in Google searches. These premature spikes correlate with early spikes in news coverage, and no such early spikes are found for the other outbreaks, where ARGO performed better. It is likely that hype caused by media coverage biases predictions based on Google search volumes in these analyses.
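One simple way to quantify the "news leads search" pattern visible in Fig 5 is a lagged cross-correlation between the two weekly series. This is an illustrative sketch, not an analysis performed in the paper; the function name and lag window are assumptions.

```python
import numpy as np

def lagged_corr(news, searches, max_lag=4):
    """Correlation of weekly news volume with search volume at several
    leads/lags. A peak at a positive lag means news activity leads
    search activity by that many weeks."""
    news = np.asarray(news, dtype=float)
    searches = np.asarray(searches, dtype=float)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = news[:len(news) - lag], searches[lag:]
        else:
            a, b = news[-lag:], searches[:lag]
        out[lag] = np.corrcoef(a, b)[0, 1]
    return out
```

A correlation profile peaking at a positive lag for the Ebola and cholera series, but not for the others, would support the media-skew hypothesis quantitatively.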

Fig 5. Comparison of signals in ground-truth epidemiological data, Google search query volumes, and news alerts data from the GDELT Global Knowledge Graph.

Fig 5

Note how media coverage (as captured in the news alerts time-series) may bias predictions based on the GT data. Search term signals are drawn from the same search term as in Fig 1.

A final limitation of the work presented here is the use of “ground truth” epidemiological data. We assume that the final epidemiological situation report released for each outbreak we analyze represents the true timing and volume of cases in that outbreak, but in reality it is likely that many cases go unrecorded or are misrecorded [1]. Our work points to the need for investments in standard methods of epidemiological data collection, in addition to the potential to complement these standard sources with nontraditional data. Earlier release of case reports—even incomplete ones—could improve model accuracy and provide a sounder empirical basis for public health decision-making.

Here we have shown how Internet-based data streams can be mined to monitor the progression of emerging outbreaks in low-income settings where traditional surveillance may lag substantially or be rendered inaccurate due to backfilling. We have shown that digital epidemiological methods like ARGO perform well for nowcasting plague in Madagascar, yellow fever in Angola, and Zika in Colombia, but are less effective at tracking cholera in Yemen and Ebola in the DRC. The poor performance for the Ebola and cholera outbreaks could be linked to a combination of low Internet coverage, intense response to news alerts, and rapid shifts in disease dynamics due to population unrest and violence.

We suggest two main directions for future work in this space. First, previous studies have shown that multi-pronged approaches drawing on a variety of traditional and non-traditional data sources (including epidemiological data, search data, social media data, news data, electronic health records, and more) are superior to relying on just one or two proxy data sources [14, 15, 30]. Future work could assess the non-traditional and Internet-based data sources available in the settings of novel outbreaks, and consider multi-pronged approaches to tracking outbreaks with multiple novel data streams. Given the results on media skew from this study, models incorporating data on news coverage volume alongside searches and epidemiological data are of particular interest. A second line of future work should focus on the pathogen and population conditions (digital coverage, symptom specificity, serial interval, mode of transmission, behavior changes, and health interventions) that can make or break digital surveillance in low-income settings, and on how to adjust digital surveillance signals for intense media coverage and other exogenous forces.

Materials and methods

Data sources

We digitized daily or weekly national case counts from epidemiological situation reports for outbreaks of yellow fever in Angola (Jan. 3—July 31, 2016), Zika in Colombia (Aug. 9, 2015—July 10, 2016), Ebola in the Democratic Republic of the Congo (April 30—Dec. 31, 2018), pneumonic plague in Madagascar (Aug. 1—Nov. 25, 2017), and cholera in Yemen (Oct. 30, 2016—Nov. 26, 2017). We also downloaded country-specific time-series of Google query volumes from the Google Trends API for the same time periods.

Epidemiological data

Table 3 summarizes the sources of epidemiological data and key descriptive statistics on the epidemiological dataset for each of the five outbreaks analyzed. For each dataset, we consider the final epidemiological report to be the “ground truth” recording the true onset date for each of the cases in the outbreak; the earlier reports are considered estimates and subject to revision. Note that this assumption requires a larger leap for Ebola and cholera than for the other outbreaks analyzed, as these two outbreaks were ongoing at the time of data collection, whereas the others had concluded. Finally, note that, due to issues of data availability, the dataset for certain outbreaks consists only of laboratory-confirmed cases, while for others it contains both confirmed and probable (or suspected) cases.

Table 3. Epidemiological data sources.
Outbreak Time period Temporal granularity Total cases Reports Source
Yellow Fever in Angola Jan. 3—July 31, 2016 Weekly 879 (confirmed) 11 Digitized from plots in PDF situation reports released by WHO [5]
Zika in Colombia Aug. 9, 2015—July 10, 2016 Weekly 91,156 (suspected) 7 Digitized from plots in PDF epidemiological updates published after Feb. 17 (only updates with Colombia-specific data are included) [6]
Ebola in the DRC Apr. 30—Dec. 31, 2018 Weekly 628 (suspected) 17 Digitized from plots in PDF situation reports released by the WHO [7]
Pneumonic plague in Madagascar Aug. 1—Nov. 25, 2017 Daily 1,857 (confirmed) 12 Digitized from plots in PDF situation reports released by the IPM [9] and WHO [8] (only reports containing case counts specifically for pneumonic plague are included)
Cholera in Yemen Oct. 30, 2016—Nov. 26, 2017 Weekly 973,802 (suspected) 13 Digitized from plots in PDF situation reports released by WHO AFRO [10]

Google search trends data

Time-series downloaded from Google Trends [32] describe the number of people searching for a specific keyword, in a specified geographic region, each day, week, or month (normalized to a 0—100 range). Google Trends data was extracted for each outbreak over the same time period and at the same temporal granularity as the epidemiological data, limited to searches in the country of the outbreak. To avoid forward-looking bias, it is standard to select keywords by using Google Correlate to find search terms that correlate well with the epidemiological time-series in a training period (which is then excluded from the evaluation period) [13, 18, 30]. However, since Google Correlate data is not available for any of the countries we analyze, we instead select a few simple keywords for each outbreak that are clearly related to the disease in question. In certain cases, there is not enough Google search information to yield meaningful results in the sample available through Google Trends: for example, we identified “fièvre hémorragique” and “fievre hemorragique” as relevant search terms for Ebola in the DRC, but were unable to include them due to a lack of available search signal. Similarly, we experimented with including “diarrhea” and the Arabic versions of “cholera” and “diarrhea” for the outbreak of cholera in Yemen, but did not find an improvement in signal over using only “cholera” in English. All search terms used to model each outbreak are listed in Table 4.
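
The 0—100 normalization can be illustrated with a small sketch. This is our own illustration of the scaling described above, applied to hypothetical raw counts; Google additionally applies sampling and rounding that we cannot reproduce exactly.

```python
import numpy as np

def to_trends_scale(counts):
    """Rescale a raw search-count series to the 0-100 range used by
    Google Trends, where 100 marks peak interest in the requested
    period. Hypothetical illustration; Google also samples queries."""
    counts = np.asarray(counts, dtype=float)
    peak = counts.max()
    if peak == 0:
        return np.zeros_like(counts)  # no measurable search signal
    return np.round(100 * counts / peak)

# hypothetical weekly raw hit counts for one search term
weekly_hits = [3, 12, 48, 96, 60, 24]
print(to_trends_scale(weekly_hits))  # the peak week maps to 100
```

A consequence of this scaling, relevant to the media-skew discussion above, is that a single extreme spike compresses all other weeks toward zero.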

Table 4. Search terms by outbreak.
Outbreak Search terms
Yellow Fever in Angola ‘yellow fever’, ‘febre amarela’
Zika in Colombia ‘zika’, ‘zika sintomas’, ‘el zika’, ‘sintomas del zika’, ‘virus zika’, ‘zika colombia’, ‘el zika sintomas’, ‘el sica’
Ebola in the DRC ‘ebola’
Plague in Madagascar ‘plague’, ‘pesta’, ‘peste’, ‘peste pulmonaire’, ‘peste madagascar’
Cholera in Yemen ‘cholera’

News alert data

News alert data was obtained from the GDELT Global Knowledge Graph as the fraction of daily raw article counts relevant to a query. GDELT is a large, regularly updated open database and platform that monitors the world’s news media in over 100 languages [31].

Models

We explored three simple data-driven nowcasting models, emphasizing model simplicity because emerging outbreaks rarely provide enough data to train more complex models.

Linear autoregression (AR)

An autoregressive model uses a linear combination of past observations of disease incidence (“autoregressive terms,” $y_{t-i}$) to provide an estimate of synchronous incidence $y_t$. Here, we choose for simplicity to use only the single most recently observed autoregressive term, so the linear autoregression reduces to a univariate linear regression:

$y_t = \beta y_{t-h} + \alpha$ (1)

The linear regression is optimized over available training observations to minimize mean squared error loss. The time horizon of prediction h depends on the reporting delay in each outbreak; for instance, if there is a two-week reporting delay in a surveillance system, the autoregressive term will be the 2-week lag, so h = 2.

Regression on Google query volumes (GT)

Our second model is a multivariate regression mapping synchronous Google query volumes for the set of selected search terms $G = \{g_1, \ldots, g_k\}$ to estimated synchronous incidence. Depending on the number of search terms selected for each outbreak, this regression contains one to eight variables.

$y_t = \sum_{g \in G} \beta_g g_t + \alpha$ (2)

We adopt L1 (LASSO) regularization to prevent overfitting and to provide automatic feature selection, with the regularization parameter selected via 5-fold cross-validation on the training set from {10−5, 10−4, 10−3, 10−2, 10−1}. The LASSO regression is optimized over available training observations to minimize mean squared error loss.
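
A minimal sketch of this fitting procedure in scikit-learn, assuming query volumes arranged as a weeks-by-terms matrix (the function name is ours):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

ALPHAS = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]  # regularization grid from the text

def fit_gt(query_volumes, cases):
    """Eq (2): LASSO regression from synchronous query volumes
    (rows = weeks, columns = search terms) to incidence, with the
    L1 penalty chosen by 5-fold cross-validation on MSE."""
    search = GridSearchCV(Lasso(max_iter=100000),
                          {"alpha": ALPHAS}, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(np.asarray(query_volumes, dtype=float),
               np.asarray(cases, dtype=float))
    return search.best_estimator_
```

With the L1 penalty, coefficients of uninformative search terms shrink exactly to zero, which is what enables the automatic feature selection described above.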

Autoregression and regression on Google query volumes (ARGO)

ARGO combines the AR and GT methods in a single multivariate regression including both a single autoregressive term (the most recently observed incidence value) and a set of synchronous Google query volumes.

$y_t = \beta y_{t-h} + \sum_{g \in G} \beta_g g_t + \alpha$ (3)

As in GT, ARGO is made more robust with L1 regularization, with the regularization parameter selected via 5-fold cross-validation on the training set from {10−5, 10−4, 10−3, 10−2, 10−1}. The ARGO method used here is a somewhat simplified version of the linear regression on autoregressive data and synchronous Google query data originally developed to nowcast influenza in the United States [13]. When interpreting LASSO coefficients for ARGO, we average coefficients over ten model runs with ten different random seeds to ensure coefficient stability. In practice, we observe that coefficient sizes are stable from run to run.
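
The key implementation detail in Eq (3) is the alignment of the two data streams: cases arrive with an h-week delay while queries are available in real time. The sketch below (our own simplification, with hypothetical variable names) makes that alignment explicit.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

def argo_nowcast(cases, queries, h):
    """Eq (3): one LASSO on the h-week incidence lag plus synchronous
    query volumes. `cases` covers weeks 0..n-1 (reported with delay h);
    `queries` covers weeks 0..n+h-1 (available in real time). Returns
    the nowcast for the current week n+h-1."""
    y = np.asarray(cases, dtype=float)
    Q = np.asarray(queries, dtype=float)
    if Q.ndim == 1:
        Q = Q[:, None]              # single search term
    n = len(y)
    # training rows for weeks t = h..n-1: features [y_{t-h}, g_t]
    X = np.column_stack([y[:n - h], Q[h:n]])
    target = y[h:]
    grid = {"alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]}
    model = GridSearchCV(Lasso(max_iter=100000), grid, cv=5,
                         scoring="neg_mean_squared_error").fit(X, target)
    # current week: latest reported cases as the lag, latest queries
    x_now = np.concatenate([[y[-1]], Q[-1]]).reshape(1, -1)
    return float(model.predict(x_now)[0])
```

Note that dropping the query columns recovers the AR model and dropping the lag column recovers GT, which is why the three models are directly comparable.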

Evaluation

We had access only to publicly released epidemiological situation reports, which are typically released sporadically, exhibit long reporting delays, and leave gaps where no information is available at all. We assumed that the final situation report released for each outbreak was the “ground truth,” recording the true timing and volume of new cases. To capture two possible data-access scenarios, (1) an ideal scenario in which final case numbers are reported 1-2 weeks after they occur, and (2) a more realistic scenario in which case numbers are reported with some delay and possibly corrected at a later date, we adopted two separate methods of evaluation. The first evaluation method assumes a continuous flow of correct epidemiological data and a fixed reporting delay of one to two weeks. The second reflects the reality of many epidemiological reporting systems by using the data presented in publicly released epidemiological reports.

Evaluation assuming continuous flow of epidemiological data

The first form of evaluation uses only a single time-series of epidemiological data: the “ground truth” (taken as the last epidemiological report on the outbreak publicly released). We assumed an h-week reporting delay and experimented with h values of 1 and 2. This evaluation method thus represents a near-ideal data-access scenario in which case counts, once reported, are never adjusted or corrected. We adopted dynamic training (also known as online learning or walk-forward validation) so that, when predicting each week’s incidence, each model is trained on all data available up to that week. Models were then evaluated over the entire time-series based on Pearson’s correlation coefficient (CORR), root-mean-square error (RMSE), and relative root-mean-square error (rRMSE). Previous work has evaluated using shorter training windows or weighting recent data more heavily in training to focus models on recent disease dynamics [13]. In this case, we elected to use all available data in training due to the short length of each time-series and the resulting data scarcity.

$\mathrm{CORR} = \dfrac{\sum_{i=1}^{n}(y_i-\bar{y})(x_i-\bar{x})}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$ (4)
$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i-x_i)^2}$ (5)
$\mathrm{rRMSE} = \dfrac{1}{\bar{y}}\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i-x_i)^2}$ (6)
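
Eqs 4-6 translate directly to code. In this sketch (our own, with $y$ the ground truth and $x$ the model estimates) we read Eq 6 as the RMSE scaled by the ground-truth mean, which makes scores comparable across outbreaks of very different sizes.

```python
import numpy as np

def corr(y, x):
    """Pearson correlation (Eq 4) between ground truth y and estimates x."""
    return float(np.corrcoef(np.asarray(y, float), np.asarray(x, float))[0, 1])

def rmse(y, x):
    """Root-mean-square error (Eq 5)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sqrt(np.mean((y - x) ** 2)))

def rrmse(y, x):
    """Relative RMSE (Eq 6): RMSE divided by the mean of the ground
    truth, per our reading of the normalization."""
    y = np.asarray(y, float)
    return rmse(y, x) / float(y.mean())
```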

Evaluation based on publicly released epidemiological situation reports

The ideal data-access scenario described above is not always available in emerging outbreaks, which are characterized by reporting gaps and revisions of case counts after initial publication. The second method of evaluation recognizes this challenge, comparing the accuracy and timeliness of the epidemiological reports released during each outbreak with the accuracy and timeliness of our three digital epidemiological models. We first empirically estimated the average reporting delay for each outbreak as the average number of days or weeks from initial reporting to a stable count of cases for a given day or week of the outbreak. To account for small human errors in reporting and digitization of reports, we defined a “stable” case count as one that does not change by more than 1% from one week to the next. In practice, we observed a 2-week reporting delay for all five outbreaks presented. Note that while this empirical method requires several weeks of published epidemiological reports, a healthcare system’s reporting delay could likely be estimated a priori by its managers.
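
The stability heuristic can be sketched as follows. The report-matrix layout, NaN convention, and handling of the 1% tolerance for zero counts are our assumptions, not details given in the text.

```python
import numpy as np

def reporting_delay(reports, tol=0.01):
    """Estimate the average reporting delay from successive situation
    reports. reports[r][w] is the case count for week w as printed in
    report r (NaN if the week is not yet covered). A week's count is
    'stable' once all subsequent revisions change it by at most tol
    (1%); the delay is the average number of revisions to stability."""
    reports = [np.asarray(r, dtype=float) for r in reports]
    delays = []
    for w in range(len(reports[0])):
        series = [r[w] for r in reports if not np.isnan(r[w])]
        if len(series) < 2:
            continue  # never revised, so stability is unobservable
        stable = len(series) - 1
        for i in range(len(series) - 1):
            # max(count, 1) avoids a zero tolerance on zero counts
            if all(abs(series[j + 1] - series[j]) <= tol * max(series[j], 1)
                   for j in range(i, len(series) - 1)):
                stable = i
                break
        delays.append(stable)
    return float(np.mean(delays)) if delays else float("nan")
```

If reports are released weekly, a mean stability index of 2 corresponds to the 2-week reporting delay observed for all five outbreaks.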

For each report released during each outbreak, we trained the three digital epidemiological models listed above on the data that was stable in the report (according to the calculated reporting delay). We trained models for every time horizon between when stable data in the report ceased to be available and when the next epidemiological report was posted (as a way to evaluate what utility digital epidemiological models would have had at the time). We assumed that minimal time would be required for data entry or processing, so our models would be available more or less instantaneously upon the release of an epidemiological report.

In addition to generating and presenting the predictions for each model for each report released in each outbreak, we evaluated the accuracy of each model, on average, in comparison to the accuracy of epidemiological data reported in real time. Specifically, we calculated the absolute and percentage error (in comparison to ground-truth case data) for case counts reported in epidemiological bulletins in the second-to-last and last week of each bulletin:

$\mathrm{Absolute\ Error} = |y_i - x_i|$ (7)
$\mathrm{Percentage\ Error} = \dfrac{|y_i - x_i|}{y_i}$ (8)

We used absolute and percentage error here—rather than correlation and RMSE as in the first evaluation method—since errors are evaluated at single points in time rather than over an entire time-series. We then averaged these second-to-last-week and last-week errors across situation reports for each outbreak (ranging from seven situation reports for the Zika outbreak in Colombia to 17 case reports for the Ebola outbreak in the DRC). In the same way, we evaluated the accuracy of case counts projected by our models for the second-to-last and last week of each bulletin, and the first week after epidemiological data ceased in each bulletin. Again, we averaged the accuracy of each model across case reports for each outbreak, so that mean predictive accuracy could be compared across models and reported surveillance data.

Supporting information

S1 Fig. Feature importance heatmaps for nowcasting yellow fever in Angola with ARGO.

In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

(TIF)

S2 Fig. Feature importance heatmaps for nowcasting Zika in Colombia with ARGO.

In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

(TIF)

S3 Fig. Feature importance heatmaps for nowcasting Ebola in the DRC with ARGO.

In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

(TIF)

S4 Fig. Feature importance heatmaps for nowcasting plague in Madagascar with ARGO.

In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

(TIF)

S5 Fig. Feature importance heatmaps for nowcasting cholera in Yemen with ARGO.

In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

(TIF)

S6 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of yellow fever in Angola to the accuracy and timeliness of our digital epidemiological models.

(TIF)

S7 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of Zika in Colombia to the accuracy and timeliness of our digital epidemiological models.

(TIF)

S8 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of Ebola in the DRC to the accuracy and timeliness of our digital epidemiological models.

(TIF)

S9 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of plague in Madagascar to the accuracy and timeliness of our digital epidemiological models.

(TIF)

S10 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of cholera in Yemen to the accuracy and timeliness of our digital epidemiological models.

(TIF)

Acknowledgments

This study does not necessarily represent the views of the NIH or the US government.

Data Availability

All models and evaluation metrics are implemented in Python 3.6 with scikit-learn 0.19.1. All scripts and data used in this study are publicly available at https://github.com/emilylaiken/outbreak-nowcasting.

Funding Statement

MS was funded in part by the Bill and Melinda Gates Foundation (OPP 1195154, https://www.gatesfoundation.org/). MS was funded in part by the National Institute of General Medical Sciences of the National Institutes of Health (Award Number R01GM130668, https://www.nigms.nih.gov/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Lipsitch M, Santillana M. Enhancing Situational Awareness to Prevent Infectious Disease Outbreaks from Becoming Catastrophic In: Inglesby T, Adalja A, editors. Global Catastrophic Biological Risks. Springer; 2019. [DOI] [PubMed] [Google Scholar]
  • 2. Lipsitch M, Finelli L, Heffernan R, Leung G, Redd S. Improving the Evidence Base for Decision Making During a Pandemic: The Example of 2009 Influenza A/H1N1. Biosecurity and Bioterrorism: Biodefense Strategy, Practice, and Science. 2011;9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Probert W, Jewell C, Werkman M, Fonnesback C, Goto Y, Runge M, et al. Real-time decision making during emergency disease outbreaks. PLOS Computational Biology. 2018;14:e1006202 10.1371/journal.pcbi.1006202 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Brooks L, Farrow D, Hyun S, Tibshirani R, Rosenfeld R. Flexible Modeling of Epidemics with an Empirical Bayes Framework. PLOS Computational Biology. 2015;11:e1004382 10.1371/journal.pcbi.1004382 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.World Health Organization. Yellow Fever Situation Reports; 2016. https://www.who.int/emergencies/yellow-fever/situation-reports/archive/en/.
  • 6.Pan-American Health Organization. Archive by Disease—Zika virus infection; 2017. https://www.paho.org/hq/index.php?option=com_content&view=article&id=10898:2015-archive-by-disease-zika-virus-infection&Itemid=41443&lang=en.
  • 7.World Health Organization. Ebola situation reports: Democratic Republic of the Congo; 2018. https://www.who.int/ebola/situation-reports/drc-2018/en/.
  • 8.World Health Organization Regional Office for Africa. Plague outbreak situation reports; 2017. https://www.afro.who.int/health-topics/plague/plague-outbreak-situation-reports.
  • 9.Institut Pasteur de Madagascar. Synthese des résultats biologiques Peste; 2017. http://www.pasteur.mg/wp-content/uploads/2017/11/20171114_Bulletin_Peste_IPM_14112017_V5.pdf.
  • 10.World Health Organization Regional Office for the Eastern Mediterranean. Cholera; 2019. http://www.emro.who.int/health-topics/cholera-outbreak/cholera-outbreaks.html.
  • 11.Majumder M, Rose S. Vaccine Deployment and Ebola Transmission Dynamics Estimation in Eastern DR Congo. SSRN Pre-print. 2018;(3291591).
  • 12. Ginsberg J, Mohebbi M, Patel R, Brammer L, Smolinksi M, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457:1012–1014. 10.1038/nature07634 [DOI] [PubMed] [Google Scholar]
  • 13. Yang S, Santillana M, Kou S. Accurate estimation of influenza epidemics using google search data via ARGO. Proceedings of the National Academy of Sciences. 2015;112:14473–14478. 10.1073/pnas.1515373112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Santillana M, Nguyen A, Dredze M, Paul M, Nsoesie E, Brownstein J. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLOS Computational Biology. 2015;11:e1004513 10.1371/journal.pcbi.1004513 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Lu F, Hou S, Baltrusaitis K, Shah M, Leskovec J, Sosic R, et al. Accurate influenza monitoring and forecasting in the Boston metropolis using novel Internet data streams. Journal of Medical Internet Research. 2018;4:e4.7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Lu F, Hattab M, Clemente C, Biggerstaff M, Santillana M. Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches. Nature Communications. 2019;10(147). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Clemente L, Lu F, Santillana M. Improved real-time influenza surveillance using Internet search data in eight Latin American countries. JMIR Public Health Surveillance. 2019;5(2). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Yang S, Kou S, Lu F, Brownstein J, Brooke N, Santillana M. Advances in the use of Google searches to track dengue in Mexico, Brazil, Thailand, Singapore and Taiwan. PLOS Computational Biology. 2017;13:e1005607 10.1371/journal.pcbi.1005607 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Paul M, Dredze M, Broniatowski D. Twitter Improves Influenza Forecasting. PLOS Currents Outbreaks. 2014. 10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Freifeld C, Mandl K, Reis B, Brownstein J. HealthMap: Global infectious disease monitoring through automated classification and visualization of Internet media reports. Journal of the American Medical Informatics Association. 2008;15(2):150–157. 10.1197/jamia.M2544 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Viboud C, Charu V, Olson D, Ballesteros S, Gog J, Khan F, et al. Demonstrating the use of high-volume electronic medical claims data to monitor local and regional influenza activity in the US. PLOS One. 2014;9(7):e102429 10.1371/journal.pone.0102429 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Santillana M, Nguyen A, Louie T, Zink A, Gray J, Sung I, et al. Cloud-based Electronic Health Records for Real-time, Region-specific Influenza Surveillance. Scientific Reports. 2016;6 10.1038/srep25732 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Smolinksi M, Crawley A, Baltrusaitis K, Chunara R, Olsen J, Wojcik O, et al. Flu Near You: Crowdsourced Symptom Reporting Spanning 2 Influenza Seasons. American Journal of Public Health. 2015;105(10):2124–2130. 10.2105/AJPH.2015.302696 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Paolotti D, Carnahan A, Colizza V, Eames K, Edmunds J, Gomes G, et al. Web-based participatory surveillance of infectious diseases: the Influenzanet participatory surveillance experience. Clinical Microbiology and Infection. 2014;20(1):17–21. 10.1111/1469-0691.12477 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Lazer D, Kennedy R, King G, Vespignani A. The Parable of Google Flu: Traps in Big Data Analysis. Science. 2014;343(6176):1203–1205. 10.1126/science.1248506 [DOI] [PubMed] [Google Scholar]
  • 26. Chan E, Sahai V, Conrad C, Brownstein JS. Using Web Search Query Data to Monitor Dengue Epidemics: A New Model for Neglected Tropical Disease Surveillance. PLOS Neglected Tropical Diseases. 2011;5 10.1371/journal.pntd.0001206 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Yang S, Santillana M, Brownstein J, Gray J, Richardson S, Kou S. Using electronic health records and Internet search information for accurate influenza forecasting. BMC infectious diseases. 2017;17(1). 10.1186/s12879-017-2424-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Majumder M, Santillana M, Mekaru S, McGinnis S, Khan K, Brownstein J. Utilizing Nontraditional Data Sources for Near Real-Time Estimation of Transmission Dynamics During the 2015-2016 Colombian Zika Virus Disease Outbreak. JMIR Public Health Surveillance. 2016;2 10.2196/publichealth.5814 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Chunara R, Andrews J, Brownstein J. Social and news media enable estimation of epidemiological patterns early in the 2010 Haitian cholera outbreak. American Journal of Tropical Medicine and Hygiene. 2012;86:39–45. 10.4269/ajtmh.2012.11-0597 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. McGough S, Brownstein J, Hawkins J, Santillana M. Forecasting Zika Incidence in the 2016 Latin America Outbreak Combining Traditional Disease Surveillance with Search, Social Media, and News Report Data. PLOS Neglected Tropical Diseases. 2017;11:e0005295 10.1371/journal.pntd.0005295 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.GDELT; https://www.gdeltproject.org/.
  • 32.Google Trends; https://trends.google.com/.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008117.r001

Decision Letter 0

Virginia E Pitzer, Juliet RC Pulliam

1 Mar 2020

Dear Ms. Aiken,

Thank you very much for submitting your manuscript "Real-time Estimation of Disease Activity in Emerging Outbreaks using Internet Search Information" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Please consider all feedback from the reviewers to improve the paper, with particular attention to the following points:

  • Reviewer 2 had two major comments that should be addressed - the first with regard to quantification of performance based on publicly released reports, and the second a more technical critique regarding the methods for estimating coefficients with limited data.

  • Reviewers 1 and 3 have both requested that more guidance be given about when the methods are likely to be reliable. Fleshing out the minimum use case, as suggested by Reviewer 3, will be helpful here, but also consider including some discussion of what would be considered a 'strong' use case.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Juliet R.C. Pulliam, PhD

Guest Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this work, the authors predict the short-term trajectory ("nowcasting") of infectious disease outbreaks using regression approaches that consider case reports from earlier in the outbreak, as well as country-specific Internet search activity as provided by Google Trends. Because epidemiological reports are often delayed and sometimes substantially later revised, it makes a lot of sense that we should be leveraging other kinds of information to estimate what is actually happening on the ground. This paper builds on past work that considered more predictable epidemics like seasonal influenza, now applying these methods to harder settings, where analogous past outbreaks may not exist and local political or social unrest may make transmission dynamics both complex and harder to ascertain. As test cases, they chose outbreaks that represent a good challenge to such an effort--they are diverse geographically, in mode of transmission, pathogenicity, etc. Broadly speaking, the paper is well-written and the methods seem sound. I also took a look at the github repository and the IPython notebooks the authors provided, and I appreciate the clarity and transparency therein.

I hope the authors will find my suggestions and questions helpful in further improving the manuscript.

Detailed comments:

Throughout: Zika and Ebola are conventionally capitalized as they are named after places, but yellow fever, plague and cholera are not.

Line 75-76: At this point in the manuscript, it is not clear what is meant by "historic." I initially assumed that some kind of past outbreak data was used to inform a statistical model, but it seems the authors mean strictly data from earlier in the same outbreak.

Sentence spanning 77-80: A comma is missing between "countries" and "weeks", making it hard to read this sentence correctly.

Line 91: What does "accurate" mean? I doubt we are talking about exact numerical values, and anything else is subjective/requires context.

Lines 100-105 and Discussion: The authors find that their methods vary in effectiveness. I find this understandable, but practically speaking it is problematic. How should a policy maker determine whether a nowcast is reasonable to use? Do the authors have any suggestions about how to predict the reliability of a nowcast, or how to characterize the settings where the methods are sane to apply? For context, I raise this point because the authors are clearly concerned with solving a real problem, and thus consider competing scenarios with high quality, regular surveillance data versus sporadic reports subject to revision (an experimental design choice I commend).

Fig 1, possibly some other figures: I suggest not showing 1.25 on the y-axes, as that's an undefined value for data normalized on [0,1]. Also, these legends are highly redundant--they are all normalized to [0,1], so that should just be in the caption rather than twice in every panel; also, as each panel is labeled with the location, there's no need to state it twice in every legend.

Thought motivated by Fig 1: I don't believe the authors mention this, but it seems likely that populations have a kind of "search fatigue" or saturation effect. Early in an outbreak, there may be a heightened need for information about a disease, but unless looking for news updates, it seems that people would not keep searching for the same information. Relatedly, I would imagine there are meaningful trends in the types of search terms used, where people early on look for general information about a disease, how it's transmitted, etc., and later tend to look for information about symptoms and treatment. Is there any evidence for this kind of structure in search term data?

Methods and Discussion: It should be stated that the final surveillance data here is taken as "truth" (and I don't think there's a serious alternative), although real-world reporting practices may vary in time.

Fig 2: The late August spike in the Ebola/DRC nowcasting is somehow driven by autocorrelations in the AR and ARGO models. I understand why the nowcast peaks lag behind the ground truth data, and why the problem is worse for the 2-week lag than the 1-week lag, but what is it about this August peak that is so problematic for the autocorrelation model? I've looked at the methods, and it is not clear to me why this one part of the nowcasts would be so bad. Do the regressions normalize by current size of the outbreak? An increase from 1 to 3 cases from one week to the next is less meaningful than an increase from 100 to 130, but perhaps the AR for Ebola/DRC is "learning" too much from the 1.5 months of basically nothing at the beginning of that time series. The lag between the ground truth data and the nowcasts seems to be roughly double the reporting delay, but mysteriously only for Ebola/DRC--plague/Madagascar is on a similar scale ([0,~50]), but the predictions are much better. I would appreciate if the authors would dig into this a bit more.

Fig 3, and other figures: There is no legend for the heatmaps, nor is the meaning of the colors stated anywhere in text.

Line 171: I have no idea what "within-range" means. "Meaningful" is subjective, but at least it sounds subjective. Unstated, I think, is that the authors are applying some human intuition for what kind of accuracy would be required for appropriate public health responses. They probably should say that this is what they're doing (or explain what alternative criteria are being applied). Also, I think the word "that" is missing between "estimates" and "filled".

Throughout: The space between "Fig." and the number is too big. If this is LaTeX, the spaces aren't being escaped properly; e.g., it should be "Fig.\ 3". Same for e.g. "Jan.\ 3".

Lines 215-217: I do not have any intuition for why GT data would be more useful for diseases with short serial intervals. Do the authors have any idea why this might be the case?

Line 221: result, not results

Throughout: Inconsistent capitalization of Internet.

Line 228: Some explanation of how search terms are chosen should be mentioned here, or earlier in the results. This is a very important issue, and shouldn't be buried in the methods.

Lines 242-250: Can media coverage be used as a predictor (aka feature in ML jargon)? I can understand why the authors might not want to use media attention as a predictor of what's happening on the ground, but it seems like it could be very useful as a correction term for search activity. In other words, an increase in search activity is more meaningful if it does not coincide with an increase in media coverage. A locale-specific relationship between media coverage and search patterns could be determined during non-outbreak periods, or for an unrelated health problem.
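
The correction term described here could be prototyped in a few lines. The sketch below is purely illustrative, on synthetic weekly volumes (none of these numbers or variable names come from the manuscript): a baseline media-to-search relationship is fit during a non-outbreak period, and the residual search activity is then used as the media-adjusted signal during the outbreak.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative sketch with synthetic data. "media" stands in for some
# locale-specific media-coverage volume and "search" for the Google
# Trends series in the same locale (both hypothetical).
rng = np.random.default_rng(0)

# Baseline (non-outbreak) period: search activity tracks media coverage.
media_baseline = rng.uniform(0, 10, size=52).reshape(-1, 1)
search_baseline = 2.0 * media_baseline.ravel() + rng.normal(0, 1, size=52)
baseline_fit = LinearRegression().fit(media_baseline, search_baseline)

# Outbreak period: subtract the search activity explained by media
# coverage; the residual is the candidate media-adjusted signal.
media_outbreak = np.array([[4.0], [9.0], [9.5]])
search_outbreak = np.array([9.0, 40.0, 21.0])
residual_signal = search_outbreak - baseline_fit.predict(media_outbreak)
# Week 2's search spike far exceeds what its media coverage predicts,
# so its residual stands out even though weeks 2 and 3 have similar
# media volume.
```

Under this framing, a rise in search volume fully explained by media coverage produces a near-zero residual, while a rise in excess of media coverage stands out.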

Lines 298-301: Is Google universally the preferred search engine? Baidu in China? Is Google what Arabic speakers would principally be using?

AR model in Methods: If I understand correctly, at each time point, a linear regression is constructed using all past data to estimate the coefficients, but only the most recent observation (y_t-h) is used as an input. This means that having a long history of small, noisy values would result in a model that has a beta of roughly 0. If more recent data is not weighted as more informative than observations farther in the past, a dramatic increase might still be treated as uninformative, even after a couple such observations. If I am not understanding this, please clarify. In any case, what is going on with the AR model seems relevant to the Ebola/DRC prediction problems and warrants more discussion.
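
For concreteness, my reading of the AR model can be sketched as follows. This is an illustrative reimplementation under the stated assumptions, not the authors' code: every available lag-h pair is weighted equally in the fit, and only the single most recent observation enters at prediction time.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def ar_nowcast(y, h):
    """Sketch of the AR model as I read the Methods (not the authors'
    code): fit y_t ~ y_{t-h} on all lag-h pairs observed so far, all
    equally weighted, then predict from the latest observation alone."""
    X = np.asarray(y[:-h], dtype=float).reshape(-1, 1)  # lagged values
    z = np.asarray(y[h:], dtype=float)                  # targets
    model = LinearRegression().fit(X, z)
    return float(model.predict(np.array([[y[-1]]]))[0])

# A long quiet prefix followed by a sharp rise, loosely mimicking the
# Ebola/DRC series; comparing fits with and without the quiet prefix
# probes how much the early near-zero pairs influence the nowcast.
series = [1, 0, 2, 1, 0, 1, 2, 1, 0, 1, 1, 0, 5, 15, 40]
estimate_full = ar_nowcast(series, h=1)
estimate_recent = ar_nowcast(series[-6:], h=1)
```

With this setup one can also check directly whether down-weighting older observations (e.g., by sample weights in the fit) changes the behavior around the problematic August peak.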

Fig S2: Why are there more rows in the heatmaps than there are row labels?

References: These need to be cleaned up. There are random spaces in URLs, and inconsistent/incorrect capitalization.

A final thought: As perhaps the most famous (and now aborted) example of nowcasting, Google created Flu Trends and Dengue Trends. It might be appropriate to mention that effort in the introduction.

Reviewer #2: In this work, the authors propose that time series of Google search interest can be used to improve situational awareness during outbreaks of emerging diseases. The problem of delays in traditional surveillance channels is well-known and sorely in need of a solution. Although the use of internet search data is well established in fields such as influenza surveillance in the United States, the proposal here is to use such search data in settings where the disease is unusual and internet access is less common. This proposal is relatively unexplored and potentially important. Further, it seems to me that this manuscript contains interesting supporting evidence of this idea. However, the current manuscript has need of revisions to improve the clarity, rigor, and attention to detail of its methods.

Perhaps my greatest concern is the lack of quantification of the performance of the authors' forecasters in the subsection "Evaluation Based on Publicly Released Reports". This is clearly the scenario of greater interest, and although one can see from the figures that the forecasts are reasonable, it is difficult from figure 4 to see that "ARGO appears to most closely estimate the cases that would eventually be reported throughout each outbreak," as the authors write on line 173. A table comparable to Table 1 for this scenario seems like it could provide much clearer support for such a statement. Furthermore, the inclusion of a quantitative metric would allow for later work to easily be compared with this work. On line 381, the authors seem to indicate that such quantification was not possible due to a lack of ground truth data, but the figures invite visual comparison of the forecasts with a ground truth time series, so I find that statement of the authors confusing.

My next greatest concern is the interpretation of the size of the L1-penalized regression coefficients as variable importance in Figure 3 and similar. In my experience, with small data sets such as those analyzed by the authors, the value of these coefficients can be highly sensitive to the random choice of the folds used for cross-validation. Also, the choice of the folds for a time series application where there is a clear temporal correlation structure deserves some discussion. From the code, it seems that the temporal structure of the data is ignored in the choice of folds. How do the authors justify this, and what effect do they anticipate it has on the selected regularization parameter? Finally, since many of the linear model predictors are likely highly correlated, I think an elastic net penalty would be more appropriate than a lasso penalty if the authors would like to draw some conclusions about the relative importance of the different variables. Zou and Hastie (2005) have shown that the lasso penalty can result in one variable in a correlated group randomly getting a large coefficient, whereas the elastic net penalty can lead to more equal coefficient sizes among members of the group.
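
Both points can be demonstrated in a small sketch with scikit-learn (the library the authors use, per their code availability statement). The data here are synthetic and the variables hypothetical: `TimeSeriesSplit` keeps the cross-validation folds temporally ordered, and the near-duplicate pair of predictors illustrates the setting where the lasso and elastic net are expected to differ.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
n = 120
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)                    # irrelevant predictor
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + rng.normal(scale=0.1, size=n)

# Ordered, expanding-window folds instead of randomly shuffled ones,
# respecting the temporal correlation structure of the series.
cv = TimeSeriesSplit(n_splits=5)

# The lasso tends to concentrate weight on one member of a correlated
# pair, while the elastic net (Zou and Hastie 2005) tends to spread
# weight more evenly across the group.
lasso = LassoCV(cv=cv).fit(X, y)
enet = ElasticNetCV(cv=cv, l1_ratio=0.5).fit(X, y)
```

Inspecting `lasso.coef_` and `enet.coef_` across refits with perturbed data is a quick way to gauge how stable any "importance" interpretation would be.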

I will now list some smaller problems that I noticed when reading the manuscript.

1. Figure 4: The dates on the top of the panels do not always align with the vertical reference lines. For example, consider the panel in the upper left corner.

2. Figure 4 caption: Figs. S9-12 should be Figs. S6-10?

3. Line 215: "we find that GT data appears to posses greater predictive power in diseases with short serial intervals like influenza, and less predictive power in diseases like Cholera, where transmission time-scales are typically longer." The supporting evidence for this statement is unclear.

4. Line 314: "The time horizon of prediction h depends on the reporting delay in each outbreak." It is unclear to me why the authors link the prediction horizon and model lag in this manner. I would consider it simpler to use a lag-1 term in all models and to project forward multiple steps when necessary for prediction.

5. Equations (2) and (3) make use of variables that the authors never define and do not seem to accurately represent a linear model.

Reviewer #3: I applaud the authors for tackling the difficult problem of what to do in the context of inaccurate surveillance reports, particularly for diseases new to specific geographic areas (or new altogether). Current events are only the most recent reminder of how unfortunate delays can be. The paper is relevant, well-written, and the examples provided are comprehensive. Below are some suggestions for improvement.

Lines 171-172: Is there a word missing here? Please check.

The severe delays in getting even preliminary reports out are on painful display in the examples provided (Figure 4). Each epi curve in that figure has a ‘report released’ line and a ‘next report released’ line, but the ‘next report released’ timing does not align with the following epi curve’s ‘report released’ line. I see the supplemental files with all of the reports, but how did the authors choose which to present in the main paper?

Were the authors able to make some kind of quantitative measurement of the amount of ‘back corrections’ that appeared in consecutive reports, and was the magnitude of the ‘misreports’ associated with model performance?

Lines 196–200: The authors propose a minimal use case for these models: that they could at least signal whether the outbreaks are increasing or decreasing, or ‘over’ or ‘not’. Can the authors quantify in the results the number of times each model got this right when the report got it wrong? Some kind of quantification to justify this claim would be helpful. It’s clear that it happened in one case, but some additional evidence that this is a common occurrence would be useful. Also, it’s not clear in the methods, but is the assumption that the forecast would be available on the day that the reports came out? Is there any data entry or analysis time included in the assumptions of when forecasts would be available, or is it reasonable to assume these are instantaneous?

The authors should consider making a plug for investments in better traditional epi surveillance, particularly for emerging threats. Models are very helpful, but only once we identify outbreaks and have some data coming in. The findings showing that even incomplete reports could yield useful model results is a strong rationale for releasing data as early as possible. Early data release remains a barrier to protecting global health.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Thomas J. Hladish

Reviewer #2: Yes: Eamon B. O'Dea

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008117.r003

Decision Letter 1

Virginia E Pitzer, Juliet RC Pulliam

27 May 2020

Dear Ms. Aiken,

Thank you very much for submitting your revised manuscript "Real-time Estimation of Disease Activity in Emerging Outbreaks using Internet Search Information" for consideration at PLOS Computational Biology, and for your careful response to the previous round of reviews. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the review of the revision, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

In particular, please make sure that the GitHub repository is updated to include the data underlying the revised figures and summary statistics.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note that while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Juliet R.C. Pulliam, PhD

Guest Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: The authors have scrupulously and satisfactorily addressed all my prior concerns. On reviewing the revisions, I have just one final concern, which is the level of statistical significance of reported differences in forecasting accuracy. First, I may have missed it, but I did not see the meaning of the parenthesized values in Table 2 of the revised manuscript. I suspect these are standard errors. In the case that they are, it would appear that most differences in forecast accuracy do not have a very high level of statistical significance. Supposing that to be the case, I don't think that is necessarily a reason to remove the claims. My suggestion is simply that the authors should qualify any claims about the method of producing the most accurate forecasts if they cannot provide evidence that it is significantly better than other methods. Alternatively, the authors could make the results stronger by reporting the level of significance of a difference in accuracy if it is non-negligible.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #2: No: The GitHub repository the authors refer readers to in "Code Availability" has not been updated since September 2019, therefore it seems unlikely that the underlying data for revised figures and summary statistics are available.

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Eamon B. O'Dea


PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008117.r005

Decision Letter 2

Virginia E Pitzer, Juliet RC Pulliam

1 Jul 2020

Dear Ms. Aiken,

We are pleased to inform you that your manuscript 'Real-time Estimation of Disease Activity in Emerging Outbreaks using Internet Search Information' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Juliet R.C. Pulliam, PhD

Guest Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008117.r006

Acceptance letter

Virginia E Pitzer, Juliet RC Pulliam

10 Aug 2020

PCOMPBIOL-D-19-02045R2

Real-time Estimation of Disease Activity in Emerging Outbreaks using Internet Search Information

Dear Dr Aiken,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Sarah Hammond

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Feature importance heatmaps for nowcasting yellow fever in Angola with ARGO.

    In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

    (TIF)

    S2 Fig. Feature importance heatmaps for nowcasting Zika in Colombia with ARGO.

    In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

    (TIF)

    S3 Fig. Feature importance heatmaps for nowcasting Ebola in the DRC with ARGO.

    In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

    (TIF)

    S4 Fig. Feature importance heatmaps for nowcasting plague in Madagascar with ARGO.

    In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

    (TIF)

    S5 Fig. Feature importance heatmaps for nowcasting cholera in Yemen with ARGO.

    In the heatmaps, darkest reds correspond to largest positive coefficients and darkest blues correspond to largest negative coefficients; grey indicates a coefficient of zero.

    (TIF)

    S6 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of yellow fever in Angola to the accuracy and timeliness of our digital epidemiological models.

    (TIF)

    S7 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of Zika in Colombia to the accuracy and timeliness of our digital epidemiological models.

    (TIF)

    S8 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of Ebola in the DRC to the accuracy and timeliness of our digital epidemiological models.

    (TIF)

    S9 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of plague in Madagascar to the accuracy and timeliness of our digital epidemiological models.

    (TIF)

    S10 Fig. Comparing the accuracy and timeliness of publicly released epidemiological updates from the outbreak of cholera in Yemen to the accuracy and timeliness of our digital epidemiological models.

    (TIF)

    Attachment

    Submitted filename: Outbreaks Paper Review Responses.pdf

    Attachment

    Submitted filename: Outbreaks Paper Review Responses.pdf

    Data Availability Statement

    All models and evaluation metrics are implemented in Python 3.6 with scikit-learn 0.19.1. All scripts and data used in this study are publicly available at https://github.com/emilylaiken/outbreak-nowcasting.

