PLoS One. 2020 Mar 17;15(3):e0230322. doi: 10.1371/journal.pone.0230322

Automated monitoring of tweets for early detection of the 2014 Ebola epidemic

Aditya Joshi 1,*, Ross Sparks 1, Sarvnaz Karimi 1, Sheng-Lun Jason Yan 2, Abrar Ahmad Chughtai 2, Cecile Paris 1, C Raina MacIntyre 3,4
Editor: Eric Forgoston
PMCID: PMC7077840  PMID: 32182277

Abstract

First reported in March 2014, an Ebola epidemic impacted West Africa, most notably Liberia, Guinea and Sierra Leone. We demonstrate the value of social media for automated surveillance of infectious diseases such as the West African Ebola epidemic. We experiment with two variations of an existing surveillance architecture: the first aggregates tweets related to different symptoms together, while the second considers tweets about each symptom separately and then aggregates the sets of alerts generated by the architecture. Using a dataset of tweets posted from the affected region from 2011 to 2014, we obtain alerts in December 2013, three months prior to the official announcement of the epidemic. Of the two variations, the second, which produces a restricted but useful set of alerts, can potentially be applied to other infectious disease surveillance and alert systems.

Introduction

Infectious disease surveillance systems differ in terms of their objective and scope[1]. These systems have traditionally utilised information sources such as health encounters, medical records, hospital statistics, disease registries, over-the-counter drug sales data, laboratory results and surveys. However, for the purpose of early epidemic detection, traditional surveillance data are less timely and sensitive due to factors such as lengthy data validation processes, bureaucratic and political influences, and higher costs and resource requirements[2, 3]. The WHO website states that early indicators for more than 60% of epidemics can be found through informal sources such as social media. Therefore, traditional surveillance can be supplemented with publicly available data from internet-based or electronic platforms such as search engines, social media, blogs or forums[4, 5]. When combined with signals and information from traditional sources and agencies, social media-based surveillance of infectious diseases can assist the early detection of public health emergencies. While not all infections or symptoms are necessarily reported on social media, for a variety of reasons, we claim that a change in the prevalence of symptom reports on social media can be an indicator of an outbreak and can supplement traditional infectious disease surveillance.

In this paper, we investigate whether early indicators of the Ebola epidemic in West Africa in 2014 can be obtained from social media posts. This epidemic is regarded as one of the deadliest in recent times, resulting in a high loss of life and severe stress on medical services in the affected countries. We evaluate the possibility of surveillance using a dataset of tweets (social media posts published on Twitter) posted between 2011 and 2014 in key cities of the affected region in West Africa. A typical pipeline for a surveillance system that monitors infectious diseases using social media consists of two steps: (a) collection of social media posts reporting symptoms related to the infectious disease; and (b) application of a monitoring algorithm to generate alerts for an epidemic[6]. For step (a), we employ automatic techniques based on computational linguistics, and for step (b), statistical time series monitoring. When these automatic techniques are applied to tweets, a lead time of around three months can be obtained for the Ebola epidemic.

Most past work in social media-based infectious disease surveillance uses datasets created using a set of symptom words. The goal is to signal a disease outbreak before official health systems detect it. This is not to say that social media-based epidemic intelligence would replace human expertise; rather, social media brings value by being real-time and originating directly from human users. However, past work monitors several symptoms of an illness as a collection, constructing a single stream of tweets that is then monitored. For example, for the detection of influenza, tweets containing reports of cough, cold and fever may be considered together. However, different symptoms have different prior probabilities of being reported (for example, a fever may be more common than a rash, or a fever may be reported with less stigma than a rash) and different seasonal patterns (for example, a cold may be more common in winter, while dehydration may be more common in summer). Therefore, tweets reporting each of these symptoms may have different time series behaviours in terms of magnitude and trend. In this regard, our paper differs from past work in social media-based infectious disease surveillance. Specifically, we address the question:

Since different symptoms of a disease may have different attributes, in what ways can tweets that report each of these symptoms be combined to detect the disease?

An investigation into this question is likely to impact social media-based syndromic surveillance for other epidemics as well. Towards this, we adapt an architecture that we previously reported for early detection of disease events using social media[7]. The architecture takes as input a set of tweets and returns a set of alerts, where an alert indicates the possibility of unexpected behaviour, thereby indicating an epidemic. The original architecture handles only one symptom at a time. To address the question above, we experiment with two variations of the architecture. The first variation uses a combined stream of tweets reporting different symptoms. The second variation obtains alerts from the architecture for each symptom separately and then combines these alerts. We refer to the former as ‘Data Aggregation’ because it provides the data for different symptoms together as an input to the architecture, and to the latter as ‘Alert Aggregation’ because it collects the alerts generated by the architecture for each symptom. To the best of our knowledge, this is the first study that examines whether the monitoring algorithm should handle symptoms separately or together for social media-based infectious disease surveillance. In addition, it is the first automatic monitoring of the Ebola epidemic using symptomatic data.

The Ebola epidemic of 2014

The 2014–16 epidemic of Ebola Virus Disease (EVD) in West Africa highlighted the importance of Internet-based surveillance methods. The suspected index case for the epidemic was a 2-year-old boy in Guinea who died on December 6, 2013. The Guinean Ministry of Health first noted an unidentified disease on March 13, 2014. HealthMap retrieved the first public notification on March 14, 2014 from a French news website with headlines reporting a strange fever. Following laboratory confirmation, the WHO released a public statement confirming Ebola on March 23, 2014. Table 1 summarises the timeline of alerts from different sources for the Ebola epidemic of 2014.

Table 1. Comparison of web-based sources on the detection and dissemination of the West African Ebola epidemic [8, 9].

Source | Date of first detection | Type of dissemination
Meliandou Health Post | January 24, 2014 | Internal alert
Guinea Ministry of Health | March 13, 2014 | Internal alert
HealthMap | March 14, 2014 | Textual and graphical alert
Bing | March 20, 2014 | No specific information given
ProMED-mail | March 22, 2014 | Textual alert, RFI
WHO | March 23, 2014 | Textual online statement

The first ProMED report relevant to the Ebola outbreak was a request for information (RFI) on an undiagnosed viral haemorrhagic disease. Hossain et al. [9] found that keyword searches for ‘Ebola’ in ProMED reports show awareness of the spread of Ebola in early April 2014. Similar keyword searches in Google Trends revealed few initial results in March and April 2014. Alicino et al. [8] compared Ebola-related relative search volumes (RSVs) by region as reported by Google Trends. The highest RSVs were from the three main affected countries: Liberia had the highest score, followed by Sierra Leone and Guinea. However, the peak in searches for the keyword ‘Ebola’ occurred on October 16, 2014, when President Obama issued a press release calling up National Guard reserves to contain Ebola[8]. In contrast, the first tweet from the affected countries mentioning ‘Ebola’ came as early as December 26, 2013. However, the content of that tweet (“This Ebola of a virus come bad pass HIV … May God help us”) appeared ambiguous, as the user typically posted about football. The next tweet mentioning Ebola, posted on March 25, 2014 (“Guinea has banned the sale and consumption of bats to prevent the spread of the deadly Ebola virus”), showed stronger relevance to the outbreak[10]. Different programs or algorithms have been used to extract data from online sources, ranging from text parsing and the Twitter API to Google Trends[9, 11, 12]. Some studies did not specify their methods, the details of their data extraction and mining, or the extent of human moderation[8, 10, 13]. To our knowledge, no literature has yet analysed the utility of symptom data from tweets for the detection of EVD[14]. Thus, the aim of our study is to evaluate the value of symptomatic tweets for rapid infectious disease surveillance of the 2014 Ebola epidemic in West Africa, using automated monitoring.

Architecture

We adapt an architecture that has been reported for early detection of disease events using tweets[7]. This architecture consists of four steps:

  1. Initial selection: Tweets are selected based on location, date range and keywords. The keywords are words indicating a symptom. The location is derived from either the tweet geolocation, the author's profile location or a location mention. The code is attached as a python notebook, Step 1.ipynb. Please note that the Twitter authorisation keys needed to access the API have been removed, since they are personal credentials.

  2. Personal health mention classification: This step is necessary because a tweet containing a symptom word may not be the report of a symptom. The classifier uses tweet vectors as the feature representation for a statistical classification algorithm. A word embedding is a distributional representation of a word that is expected to capture its semantics; the embeddings used here are pre-trained on a large corpus. A tweet vector is the average of the word embeddings of the content words in a tweet. The classification step uses support vector machines trained on a labelled dataset where each tweet is represented as its tweet vector. The code is attached as a python notebook, Step 2.ipynb.

  3. Duplication removal: We retain the first tweet per day per user. This prevents multiple reports by the same user from swamping the system. It must be noted that this step follows the second step where a classifier has predicted a tweet as a health report. The code is attached as a python notebook, Step 3.ipynb.

  4. Monitoring algorithm: In this step, we use a monitoring algorithm based on time-between-events[15]. Time-between-events corresponds to the duration between consecutive events in a time series; the event of relevance to our algorithm is the posting of a tweet. Using in-control data, the algorithm fits a Weibull distribution to these durations and estimates its parameters. At test time, the algorithm computes the expected duration between the posting times of consecutive tweets; when the observed time between consecutive tweets is shorter than expected, the tweet is flagged. When p such consecutive tweets are flagged, an alert is generated. The detailed code of the monitoring algorithm has been implemented in R, and is included in the appendix. (A minimal Python sketch of steps 2 to 4 follows this list.)
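To make the pipeline concrete, the sketch below illustrates steps 2 to 4 in Python. It is a minimal sketch only: the helper names, the linear SVM stand-in, the flag run length p and the use of a low Weibull quantile as the flagging threshold are assumptions made here for illustration; the released notebooks and the R code in the appendix contain the actual implementation.

```python
import numpy as np
from scipy.stats import weibull_min
from sklearn.svm import LinearSVC

# Step 2 (sketch): a tweet vector is the average of the pre-trained word
# embeddings of the tweet's content words; an SVM is trained on these.
def tweet_vector(tweet, embeddings, dim=300):
    vecs = [embeddings[w] for w in tweet.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_phm_classifier(tweets, labels, embeddings):
    X = np.vstack([tweet_vector(t, embeddings) for t in tweets])
    return LinearSVC().fit(X, labels)  # stand-in for SVM / SVM-Perf

# Step 3 (sketch): retain the first tweet per user per day.
def deduplicate(records):  # records: dicts with 'user' and a datetime 'time'
    seen, kept = set(), []
    for r in sorted(records, key=lambda r: r["time"]):
        key = (r["user"], r["time"].date())
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

# Step 4 (sketch): fit a Weibull distribution to inter-tweet durations from
# in-control data; flag a tweet whose gap to the previous tweet is shorter
# than a low quantile of that distribution, and alert after p consecutive flags.
def monitor(gaps_in_control, gaps_test, quantile=0.001, p=3):
    shape, _, scale = weibull_min.fit(gaps_in_control, floc=0)
    threshold = weibull_min.ppf(quantile, shape, scale=scale)
    alerts, run = [], 0
    for i, gap in enumerate(gaps_test):
        run = run + 1 if gap < threshold else 0
        if run == p:
            alerts.append(i)
    return alerts
```

The quantile above plays a role analogous to the false discovery rate discussed in the Methods section (for example, 1 in 1000), although the exact correspondence in the released R code may differ.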

In the paper that reported the above architecture[7], we experimented with individual symptoms related to asthma. Therefore, the four steps above were applied in sequence separately for each symptom. Alerts for a symptom were used as a proxy for an alert of the disease. In this paper, we adapt the architecture to be able to handle a collection of symptoms pertaining to the disease being monitored. The two variations are called Data Aggregation and Alert Aggregation.

Data aggregation

Fig 1 shows the architecture for Data Aggregation. In this case, we use two instances of the initial selection step, one for each symptom. The data is aggregated into a common pool, indicated by the ‘+’; in other words, tweets for different symptoms are indistinguishable after this stage. The remaining three steps, namely personal health mention classification, duplication removal and the monitoring algorithm, remain the same as in the original architecture. Therefore, the monitoring algorithm works on tweets related to all symptoms of interest together. Since steps 2, 3 and 4 are unchanged from the base four-step architecture, Data Aggregation closely resembles the architectures used in past work.

Fig 1. Adapted architecture using Data Aggregation.

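Reusing the sketch functions defined above, Data Aggregation amounts to pooling the per-symptom streams before steps 2 to 4 run once over the combined stream. The records below are illustrative, and is_health_report is a placeholder for the trained classifier's prediction:

```python
from datetime import datetime

# Data Aggregation (sketch): the '+' node in Fig 1 pools the per-symptom
# streams; afterwards, tweets for different symptoms are indistinguishable.
fever_tweets = [{"user": "u1", "time": datetime(2013, 12, 26, 9), "text": "down with fever"}]
rash_tweets = [{"user": "u2", "time": datetime(2013, 12, 27, 11), "text": "rash all over"}]

def is_health_report(tweet):           # placeholder for the SVM prediction
    return True

pooled = fever_tweets + rash_tweets                    # step 1 outputs, pooled
reports = [t for t in pooled if is_health_report(t)]   # step 2
kept = deduplicate(reports)                            # step 3
gaps = [(b["time"] - a["time"]).total_seconds()
        for a, b in zip(kept, kept[1:])]               # inter-tweet durations
# alert_indices = monitor(gaps_in_control, gaps)       # step 4, given in-control gaps
```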

Alert aggregation

Fig 2 shows the architecture for Alert Aggregation. In this case, we use multiple channels of the four-step architecture, one channel for each symptom. The final step, indicated by the ‘+’, combines the alerts from all channels. This combination can be performed in two ways: (a) union, where an alert for a given day is generated by the overall architecture if it was generated by any one of the channels; and (b) intersection, where an alert for a given day is generated by the overall architecture only if an alert was generated by all the channels on that day. We refer to these as Alert Aggregation (union) and Alert Aggregation (intersection), respectively.

Fig 2. Adapted architecture for Alert Aggregation.

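The combination step itself is straightforward set algebra over per-channel alert dates. A minimal sketch, with illustrative dates rather than the study's actual per-symptom outputs:

```python
from datetime import date

# Alert dates produced by running the full four-step pipeline once per
# symptom channel (illustrative values only).
fever_alerts = {date(2013, 12, 26), date(2013, 12, 27)}
rash_alerts = {date(2013, 12, 27), date(2014, 1, 4)}

union_alerts = fever_alerts | rash_alerts         # alert if any channel fires
intersection_alerts = fever_alerts & rash_alerts  # alert only if all channels fire
```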

Experiment setup

Data

For our experiments, we created a dataset of tweets using the Twitter Premium Search API (TPSA) (https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search.html). TPSA offers two key advantages over the free API. First, it provides full access to historical tweets, while the free API restricts the caller to tweets posted within the last 30 days. Second, it allows composite search queries that combine location, keyword and date range parameters, while the free API restricts calls to one of these parameters (for example, either location or keyword). We used TPSA to download tweets using the following parameters (a sketch of such a composite request follows the list):

  1. Date range: December 2011 to December 2014.

  2. Locations: We search key locations in Liberia, Guinea and Sierra Leone; the details are provided in the Appendix. TPSA specifies a location using three arguments: a latitude and longitude (indicating the location) and a radius (the distance from that location, which can be up to 40 km). We observed that TPSA returns a significantly higher number of tweets from Monrovia, the capital city of Liberia. Therefore, Monrovia is the key location of our dataset. This is expected since, among Guinea, Sierra Leone and Liberia, Liberia had the highest access to the Internet during the outbreak (16.5%, as against 1.6% and 1.7%)[10].

  3. Symptoms: For viral haemorrhagic fevers, systems of high sensitivity are expected. To achieve high sensitivity, we use one early symptom of Ebola, fever, and one late symptom, rash (https://www.who.int/news-room/fact-sheets/detail/ebola-virus-disease). We also experimented with muscle pain, but the corresponding stream of tweets did not fit the distribution required by the time-between-events algorithm. We experimented with bleeding and red eyes, but the number of tweets obtained was too small. Therefore, as a simplistic setting, we restrict ourselves to two symptoms, fever and rash, which also keeps the union and intersection operations simple. While the architecture could in principle be applied to more than two symptoms, we limit our experiments to these two symptoms of Ebola.
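As an illustration of the initial-selection query, the sketch below shows the kind of full-archive premium search request used to combine these parameters. The endpoint label, bearer token and coordinates are placeholders (the authors' own authorisation keys were removed from the released notebook), and the exact request format here is our assumption based on the TPSA documentation:

```python
import requests

# Full-archive premium search: keyword + location + date range in one call.
ENDPOINT = "https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json"
HEADERS = {"Authorization": "Bearer <YOUR_TOKEN>"}  # placeholder credentials

payload = {
    # Keyword plus a point_radius operator ([longitude latitude radius]),
    # here centred on Monrovia with the maximum 40 km radius.
    "query": "fever point_radius:[-10.8047 6.3156 40km]",
    "fromDate": "201112010000",  # December 2011
    "toDate": "201412312359",    # December 2014
    "maxResults": 100,
}
response = requests.post(ENDPOINT, headers=HEADERS, json=payload)
tweets = response.json().get("results", [])
```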

As a limitation of the API, tweets from accounts that had been made private or deleted at the time of data collection cannot be accessed. Table 2 shows the number of tweets corresponding to each symptom for the specified date range and locations. Only 24 tweets, containing both fever and rash, are common to the two datasets.

Table 2. Dataset statistics.

Symptom | # tweets
Fever | 8507
Rash | 743

A common reluctance to use social media for surveillance stems from questions over its popularity in the affected region. The mobile subscriber penetration rate in Sub-Saharan Africa in 2016 was 43%, lower than the global average of 66% [16]. Only 1.6% of people in Guinea, 1.7% in Sierra Leone and 16.5% in Liberia had access to the Internet during the outbreak[10]. Hence, before applying the automated techniques, we wish to ascertain whether there are enough tweets from the locations of interest. Fig 3 shows the daily counts in the aggregated dataset over the date range. We observe spikes in the counts around the period of the epidemic, which encourages us to apply automated techniques to the Ebola epidemic. The counts do not track the epidemic curve because, once the epidemic becomes known, corresponding chatter is also observed on social media. A breakdown over multiple locations is not possible because the returned tweets may not contain an exact city name in any text field; they are matched instead on the latitude/longitude of the geolocation tagged with the tweet. We reiterate that, in the rest of the paper, we use a combination of computational linguistics and time series monitoring, and do not rely on manual selection or counting of either tweets or outbreak signals.

Fig 3. Daily counts in the aggregated dataset for both the symptoms.

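The daily counts behind Fig 3 are a simple aggregation; a minimal sketch with illustrative timestamps (not the actual data):

```python
import pandas as pd

# Count tweets per calendar day for the aggregated (fever + rash) stream.
times = pd.to_datetime(["2013-12-26 09:00", "2013-12-26 18:30",
                        "2013-12-27 11:15"])  # illustrative timestamps
daily_counts = pd.Series(1, index=times).resample("D").sum()
print(daily_counts)  # one row per day, as plotted in Fig 3
```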

Methods

In our previous work [7], we used two false discovery rates in step 4, the monitoring algorithm: 1 in 1000 and 1 in 2000. The alerts for the false discovery rate of 1 in 2000 were far too few and are not reported here. Similarly, that work used two classifiers for the second step, personal health mention classification: SVM and SVM-Perf. We choose support vector machines as the classifier training algorithm because they have been shown to perform better than decision trees and random forests for personal health mention classification [17]. For alerts obtained using SVM, we point the reader to the Appendix; the results there show that the adapted architecture using SVM-Perf obtains more relevant alerts than the one using SVM. Therefore, in the following section, we report results with SVM-Perf as the classification algorithm and a false discovery rate of 1 in 1000.

Results

Table 3 shows the alerts generated by the three adapted versions of the architecture from December 2013 to July 2014. In each of these cases, our architecture using social media-based monitoring obtains alerts as early as December 2013. This lead time of three months for the Ebola epidemic makes a case for social media-based infectious disease surveillance.

Table 3. Alerts generated by the three adapted versions of the architecture.

Entries list the days of the month on which alerts were generated.

Data Aggregation:
December 2013: 2, 4, 6, 7, 9, 10, 13, 14, 15, 16, 27, 28, 30
January 2014: 3, 4, 6, 10, 11, 13, 17, 18, 20, 24, 25, 27
February 2014: 21, 22, 23, 24, 28
March 2014: 1
April 2014: 24, 25
May 2014: 2, 3, 4, 5, 30
June 2014: 2, 5, 6, 7, 13, 14, 16, 18, 20, 21, 23, 24, 25
July 2014: 11, 12, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29

Alert Aggregation (Union):
December 2013: 26, 27, 28, 30
January 2014: 3, 4
February 2014: None
March 2014: 31
April 2014: 1, 27
May–July 2014: None

Alert Aggregation (Intersection):
December 2013: 27
January–July 2014: None

Early alerts

As expected, Alert Aggregation (intersection) is restrictive: it produces only one alert, on 27th December 2013. Data Aggregation produces alerts in early December 2013 (i.e., 2nd December onwards), while Alert Aggregation (union) produces alerts in late December 2013 (i.e., 26th December onwards).

Frequency of alerts

Data Aggregation results in many alerts, starting from early December 2013. In contrast, Alert Aggregation (union) results in few alerts, but they are as early as December 2013 and January 2014. Alerts that are too frequent may not be desirable because they may tend to be ignored.

First alert after the official announcement

For Alert Aggregation (union), we observe an alert on 31st March 2014, soon after the official announcement of the epidemic. In the case of Data Aggregation, however, the first alert after the official announcement is on 24th April 2014. Because the data streams are kept separate in the Alert Aggregation architecture, the alert comes sooner than with Data Aggregation.

The complete list of alerts over the entire date range and the alert graphs for the three versions are in the Appendix.

Conclusions & future work

We adapt an architecture for social media-based infectious disease surveillance and compare two variations: Data Aggregation and Alert Aggregation. We perform our experiments on a dataset of tweets posted by users in West Africa during the Ebola epidemic of 2014. We focus on fever and rash, two symptoms of Ebola. Our results lead us to two key conclusions:

Social media provides an alert for the 2014 Ebola epidemic, three months in advance

Using social media-based monitoring, we obtain the earliest alert in December 2013, three months before the announcement. This holds true for all versions of our architecture. It must be noted that the countries of interest have lower internet penetration and lower mobile subscriber penetration than the world average. We show that, despite this, social media-based infectious disease surveillance can lead to early alerts.

Data aggregation may result in more frequent alerts as compared to alert aggregation

Most work in social media-based surveillance downloads tweets containing a set of keywords and then applies monitoring algorithms (which we refer to as Data Aggregation). We compare this with an approach where tweets related to each symptom are separately analysed using the monitoring algorithm (we refer to this as Alert Aggregation). We observe that Alert Aggregation results in less frequent alerts than Data Aggregation. Therefore, depending on the desired frequency of these alerts, one of the two strategies can be chosen for future work in social media-based infectious disease surveillance.

As future work, our architecture could be adapted to other social media platforms or disease types. Our choice of Twitter as the social media platform is due to the availability of its API for research purposes. The architecture may be applicable to other platforms with usage frequency and content similar to Twitter, but may need to be modified for a platform that is used less frequently or features longer posts. While we apply the architecture to Ebola in this paper, it could also be applied to unexpected acute disease events or to common diseases such as influenza.

Supporting information

S1 Code

(ZIP)

S1 Appendix

(DOCX)

Data Availability

The dataset is available at: https://doi.org/10.25919/5e28cd7a698fb

Funding Statement

Raina was supported by a NHMRC Principal Research Fellowship, grant number 1137582. Abrar was employed at and Sheng-Lun was a student at University of New South Wales at the time of the research. Aditya, Cecile, Sarvnaz and Ross were employed by the Commonwealth Scientific and Industrial Research Organisation (CSIRO). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Peterson L. and Brossette S., Hunting Health Care-Associated Infections from the Clinical Microbiology Laboratory: Passive, Active, and Virtual Surveillance. Journal of Clinical Microbiology, 2002. 40(1): p. 1–4. doi: 10.1128/JCM.40.1.1-4.2002
  • 2. Chan E., et al., Using Web Search Query Data to Monitor Dengue Epidemics: A New Model for Neglected Tropical Disease Surveillance. PLOS Neglected Tropical Diseases, 2011. 5(5): p. e1206. doi: 10.1371/journal.pntd.0001206
  • 3. Yang Y.T., Horneffer M. and DiLisio N., Mining social media and web searches for disease detection. Journal of Public Health Research, 2013. 2(1): p. 17–21. doi: 10.4081/jphr.2013.e4
  • 4. Christaki E., New technologies in predicting, preventing and controlling emerging infectious diseases. Virulence, 2015. 6(6): p. 558–565. doi: 10.1080/21505594.2015.1040975
  • 5. Santillana M., et al., Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLOS Computational Biology, 2015. 11(10): p. 1–15.
  • 6. Joshi A., Karimi S., Sparks R., Paris C. and MacIntyre C.R., Survey of Text-based Epidemic Intelligence: A Computational Linguistic Perspective. CoRR abs/1903.05801, 2019.
  • 7. Joshi A., Sparks R., McHugh J., Karimi S., Paris C. and MacIntyre C.R., Harnessing Tweets for Early Detection of an Acute Disease Event. Epidemiology, 2020. 31(1).
  • 8. Alicino C., et al., Assessing Ebola-related web search behaviour: insights and implications. Infectious Diseases of Poverty, 2015. 4(54).
  • 9. Hossain L., et al., Social media in Ebola outbreak. Epidemiology & Infection, 2016: p. 1–8.
  • 10. Yom-Tov E., Ebola data from the Internet: An opportunity for syndromic surveillance or a news event? Proceedings of the 5th International Conference on Digital Health, 2015: p. 115–119.
  • 11. Lazard A., et al., Detecting themes of public concern: A text mining analysis of the Centers for Disease Control and Prevention's Ebola live Twitter chat. American Journal of Infection Control, 2015. 43: p. 1109–1111. doi: 10.1016/j.ajic.2015.05.025
  • 12. Towers S., et al., Mass Media and the Contagion of Fear: The Case of Ebola in America. PLOS ONE, 2015. 10(6): p. e0129179. doi: 10.1371/journal.pone.0129179
  • 13. Cleaton J., et al., Characterizing Ebola Transmission Patterns Based on Internet News Reports. Clinical Infectious Diseases, 2016. 62(1): p. 24–31. doi: 10.1093/cid/civ748
  • 14. Wong Z.S.Y., Bui C.M., Chughtai A.A. and MacIntyre C.R., A systematic review of early modelling studies of Ebola virus disease in West Africa. Epidemiology and Infection, 2017. 145: p. 1069–1094. doi: 10.1017/S0950268817000164
  • 15. Sparks R., Jin B., Karimi S., Paris C. and MacIntyre C.R., Real-time monitoring of events applied to syndromic surveillance. Quality Engineering, 2019. 31(1): p. 73–90.
  • 16. GSM Association, The Mobile Economy Sub-Saharan Africa 2017. Report, 2017.
  • 17. Aramaki E., Maskawa S. and Morita M., Twitter catches the flu: detecting influenza epidemics using Twitter. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011: p. 1568–1576.

Decision Letter 0

Eric Forgoston

5 Dec 2019

PONE-D-19-27142

Automated Monitoring of Tweets for Early Detection of the 2014 Ebola Epidemic

PLOS ONE

Dear Dr. Joshi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please note that both reviewers have made suggestions to improve the manuscript. You should address all the reviewer concerns including a discussion on how this article's approach improves upon existing, similar approaches.

We would appreciate receiving your revised manuscript by Jan 19 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Eric Forgoston

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

1. Thank you for including your funding statement; "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

  a) Please provide an amended Funding Statement that declares *all* the funding or sources of support received during this specific study (whether external or internal to your organization) as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now.

  b) Please state what role the funders took in the study. If any authors received a salary from any of your funders, please state which authors and which funder. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is a very interesting article where the investigators used datasets derived from social media (tweets) and applied two adaptations of an existing surveillance architecture with automated monitoring to test for early detection of the 2014 Ebola epidemic. Using this social media-based monitoring, an early alert was detected in December 2013, three months before the official announcement.

The authors took into account the probability and relevance of a symptom, its seasonal appearance and its occurrence over time, which are important in terms of specificity to the disease, when trying to combine symptom reports to detect a disease alert in social media.

Comments:

1. The sentence at the end of the abstract “An additional observation of relevance to infectious disease surveillance in general, is regarding the second adaptation which produces a restricted but useful set of alerts” could be edited in order to convey the message more strongly. It could be rephrased as “The second adaptation, which produces a restricted but useful set of alerts, could potentially be applied in other infectious disease surveillance and alert systems”

2. The authors correctly report that the symptoms they searched for were fever and rash. However, they do not adequately explain why they have not tried to incorporate other (than muscle pain) symptoms of ebola virus disease, like headache, diarrhea, vomiting or bleeding in the architecture.

3. The authors could discuss in more detail if and how their model is applicable in other social media platforms or other (rare or common) diseases.

Reviewer #2: In the paper “Automated monitoring of tweets for early detection of the 2014 Ebola epidemic” the authors present two methods for analyzing the language in social media posts (specifically Twitter “tweets”) to give epidemic alerts. These two methods are described as adaptations of an architecture which has just been accepted for publication elsewhere (the authors citation number 8). Using a data set of tweets from between 2011 and 2014 the authors are able to predict the Ebola epidemic in West Africa three months prior to what they call the official announcement of the epidemic. They do this by analyzing the frequency of the use of words relevant to a particular disease. I find the subject matter of the paper interesting and important, however I do not feel like the authors have done a sufficient job explaining why their approach is different or better from other methods being used in the literature. Although I find it likely that the method presented in the paper improves upon their previous architecture, I would have appreciated it if the need for improvement was addressed. The performance between these two architectures was discussed, but a performance comparison with their previous architecture for this particular data set was not mentioned. Unfortunately the data used is not publicly available, since a premium twitter API requires that the user pay a monthly fee, and so I have to recommend rejection, although I do believe that after some editing the paper deserves to be published, just not in a public library of science journal.

PLOS One has seven criteria for publication which I will address in turn.

1. “The study presents the results of primary scientific research”. The authors present a comparison of two novel architectures on a data set of tweets, and so this is a paper describing a new methodology, and does meet PLOS One’s criteria for “primary scientific research”.

2. “Results reported have not been published elsewhere”. As far as I can tell this is the case.

3. “Experiments, statistics, and other analyses are performed to a high technical standard and are described in sufficient detail”. Other papers that aim to use social media posts to predict disease often have more technical statistical methods for analysis, or are implementing more standard methods of data sorting for prediction, such as Decision Trees and Random Decision Forests. The lack of such methodology is not a failing of the paper, I assume that this is one of the merits of the methodology. If it works about as well, or maybe even better, but is simpler to understand and implement, then it qualifies as preferable by those standards. If that is the case I think that it should be highlighted, and other works should be cited for comparison. I think that the four step method deserves a bit more detail. For instance, “The algorithm computes the expected duration between consecutive tweets. When the time between consecutive tweets is shorter than an expected value, the tweet is flagged”. Although I can imagine how this is being done, I do not find it clearly stated how it is done.

4. “Conclusions are presented in an appropriate fashion and are supported by the data”. The conclusions stated seem to be factual.

5. “The article is presented in an intelligible fashion and is written in standard English”. There are instances in which the grammar needs to be fixed. As I understand it, there is no copy editor, and so the authors must do this themselves. As an example, “Most past work in social media-based infectious disease surveillance uses datasets created using a set of symptom words. before official health systems detect it.”, which appears in the third paragraph of the introduction. This certainly hurts the readability, and for this sentence in particular, does not qualify as intelligible. Other than the overt mistakes, the paper is intelligible.

6. “The research meets all applicable standards for the ethics of experimentation and research integrity”. I see no issues.

7. “The article adheres to appropriate reporting guidelines and community standards for data availability”. It is reported here that “For our experiments, we created a dataset of tweets. We used the Twitter Premium Search API…”. Twitter premium search API has a monthly fee associated with it, and so fails to meet the criteria of PLOS journals “PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction at the time of publication”. Having to pay for the data is a restriction.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Mar 17;15(3):e0230322. doi: 10.1371/journal.pone.0230322.r002

Author response to Decision Letter 0


10 Feb 2020

The submission includes a document detailing our responses to the reviewer and editor comments, and corresponding changes made to the manuscript.

Attachment

Submitted filename: PONE-D-19-27142-ResponseDocument.docx

Decision Letter 1

Eric Forgoston

27 Feb 2020

Automated Monitoring of Tweets for Early Detection of the 2014 Ebola Epidemic

PONE-D-19-27142R1

Dear Dr. Joshi,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Eric Forgoston

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: All reviewers' comments have been adequately addressed and significant improvements have been made to the manuscript after suggested revisions and clarifications.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Eric Forgoston

3 Mar 2020

PONE-D-19-27142R1

Automated Monitoring of Tweets for Early Detection of the 2014 Ebola Epidemic

Dear Dr. Joshi:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Eric Forgoston

Academic Editor

PLOS ONE
