AMIA Annual Symposium Proceedings
2018 Apr 16;2017:1362–1371.

Tracking Health Related Discussions on Reddit for Public Health Applications

Albert Park 1, Mike Conway 1
PMCID: PMC5977623  PMID: 29854205


We use Reddit to demonstrate social media’s potential for public health applications. First, we employ a lexicon-based approach to track the prevalence of keywords indicating public interest in Ebola, electronic cigarettes, influenza, and marijuana. Second, to better understand the public reactions, we use the Latent Dirichlet Allocation algorithm to identify either the general themes or motivations for extreme changes in the volume of discussion over time. We observe that discussions related to Ebola and influenza, infectious diseases of public health interest, surged when the first case of Ebola was diagnosed and a new strain of H1N1 influenza virus was confirmed in the United States. We also observed that discussions of a controversial health topic like marijuana increased with the announcement of a major change in United States federal policy. Discussions of electronic cigarettes highlighted opportunities for better health education. Lastly, we discuss the implications of our findings for utilizing Reddit data for public health applications.


Nearly two-thirds of American adults (65%) use social media, a nearly tenfold increase in the past 10 years1. Social media provides a platform for users to freely express their thoughts and an opportunity to interact with geographically dispersed like-minded individuals. These social media users discuss a wide variety of topics ranging from ordinary details of their daily life to information about infectious diseases of public health interest like Ebola2. Due to the popularity and ubiquitous nature of social media, researchers advocate for utilizing social media for public health applications3–5. Public health agencies are in an early adoption stage of using social media for information distribution6. In addition to the substantial potential for using social media as a disease surveillance tool3–5 and means of information distribution6, social media also has the potential to provide other opportunities to improve public-health practice.

Studying the reactions or opinions of a population has traditionally involved nationally distributed data collection, such as surveys from government agencies. However, these methods are expensive and, perhaps more importantly, time-consuming. Some researchers suggest that mining social media data can provide opportunities to reduce time and expense when understanding the reactions or opinions of a population on health issues7–10. For example, social media allows for accessing first-person accounts of experiences7,8, public sentiments9, public knowledge10, and public attitudes10 that may help public health agencies and researchers to develop policies that improve public health outcomes. Moreover, social media can provide the contextual information and prevalence of public interests more efficiently than traditional public health methods. Tracking the prevalence of public interests and understanding the general public reactions and opinions on various health issues have the potential to expand the scope of public-health practice.

In this paper, we report on findings derived from social media data gathered from Reddit for the purpose of tracking the prevalence of public interests and understanding public reactions towards infectious diseases of public health interests like Ebola and influenza as well as controversial health issues, such as electronic cigarettes and marijuana. In fact, although Reddit is one of the most popular public social media platforms, it has been underutilized for public health applications. Reddit’s size and range of topics make it difficult to make use of the data without any knowledge of how the platform is used in practice. Thus, we aim to fill this gap in the literature with the current study and answer the following two research questions (RQ):

  • (RQ1) Is Reddit an effective source for tracking the prevalence of public interests on infectious diseases (i.e., Ebola and influenza) and controversial health related issues (i.e., electronic cigarette and marijuana) over time?

  • (RQ2) What do Reddit members discuss regarding these health issues (a) in times of elevated discussion volume or (b) in general, if the issues have a steady level of discussions?

The work described in this paper was exempted from review by the University of Utah’s Institutional Review Board (IRB) [ethics committee] (IRB 00076188).


A growing body of research has demonstrated the successful use of social media for public health applications11–13. Often referred to as digital disease detection3, Infoveillance4, and digital epidemiology5, many studies have used Twitter data for applications in public health, primarily due to the real-time nature of the data. For example, Twitter data have been used to monitor or estimate influenza12,14, seasonal allergies12, alcohol sales and consumption15, cholera outbreaks16, earthquakes17, and smoking behavior18, as well as to examine sentiment towards marijuana use19. Although Twitter is highly popular and tweet analysis has performed well with the aforementioned topics, tweets provide relatively limited context due to a length limitation of 140 characters.

Other social media data, such as Facebook and online health community data, have also been mined to, for example, characterize and predict postpartum depression20, classify opioid addiction phrases21, and predict adverse drug reactions22. Google search queries allowed researchers to provide timely estimation of influenza rates23. However, a previous study suggested that Facebook users are reluctant to discuss certain negative topics on Facebook, due to users’ desire to convey positive images of themselves24. Online health communities can provide rich details of first-person accounts of experiences25; however, online health communities are typically single-topic groups, often with a small number of members and attracting a substantial number of “lurkers”26 (i.e., individuals who participate without posting) and dropouts27. Google search queries can be useful and timely; however, search queries are relatively limited in providing context and have been shown to overestimate disease rates, due (in part) to heightened media coverage28.

Recently, Reddit, due to the availability of a public Application Programming Interface (API)29, the capability of providing contextual information, and the support for throwaway accounts, has become a widely studied social media platform for controversial discussions. For example, using Reddit data, researchers have found empirical evidence that Reddit members openly discuss and exchange information support for potentially stigmatized issues like mental health illnesses30, detected increases in suicidal content following reports of several celebrity suicides31, identified distinct markers of shifts to suicidal ideation from mental illnesses32, explored the relationship between social feedback and community participation33, identified distinctive linguistic characteristics that are associated with mental illnesses34, characterized smoking and drinking problems35, and examined user experiences with different tobacco products36. Thus, in this study, we explore Reddit’s utility as a data source for public health applications for tracking and understanding public opinions and reactions to health issues.

Data: Social Media Site

The data for this study is hosted on the popular social media platform Reddit. We use Reddit to track and understand discussions of Ebola, influenza, electronic cigarettes, and marijuana for the following three reasons. First, Reddit is a highly active social media platform that had 83 billion page views from over 88,000 active sub-communities (subreddits) in 2015. Members of Reddit made over 73 million individual posts with over 725 million associated comments in the same year37. Second, Reddit allows for throwaway and unidentifiable accounts that are suitable for controversial discussions, such as thoughts and feelings on electronic cigarettes and marijuana as well as epidemic concerns like Ebola and influenza that may be inappropriate or sensitive for identifiable accounts. Third, Reddit content is publicly available, in contrast to other health focused social media platforms like Facebook Groups or specifically health-focused online communities like PatientsLikeMe, where the content is typically not available on the open web.

Reddit members converse via a forum-like platform. Reddit discussion consists of posts (i.e., a submission that starts a conversation) and associated comments (i.e., a submission that replies to posts or other comments) in various topically focused subreddits. Members who have achieved a certain status within the community are able to create new subreddits. For this study, we used a dataset38 released by a Reddit member. The dataset has been used in previous studies34,39,40. The dataset for the current study comprises 239,772 subreddits (including both active and inactive), 13,213,173 unique member IDs, 114,320,798 posts, and 1,659,361,605 associated comments that were made from October 2007 to May 2015.


RQ1. Is Reddit an effective source for tracking the prevalence of public interests on infectious diseases and controversial health related issues over time?

We used a lexicon-based approach to track discussions on Ebola, electronic cigarettes, influenza, and marijuana from all subreddits available in Reddit. First, we identified key terms associated with the topics of interest. A summary of key terms for each issue is shown in Table 1. Second, we preprocessed the entire dataset, which included converting text to lower case and removing punctuation. Third, to extract submissions (i.e., posts and comments) containing key terms from all available 239,772 subreddits, we employed a lexicon-based approach and extracted timestamps, comment or post IDs, member IDs, and subreddit IDs of the submissions. We extracted and included any partial matches in this process to cover a wide variation of terms. For example, a partial match of ‘cig’ can cover the variations ‘cig’, ‘cigs’, ‘cigarette’, and ‘cigarettes’ for electronic cigarette. Fourth, we counted unique member IDs, subreddits, posts, and comments containing key terms. Fifth, we normalized the frequencies over time by dividing the frequency counts by the total number of the respective variables from all available subreddits for that period. Since the total number of submissions in Reddit generally increases over time, we report normalized frequencies over time.

Table 1.

Key terms used in the lexicon-based approach

Issue | Key terms
Ebola | ebola
Electronic cigarette | e cig, elec cig, electronic cig
Influenza | flu, influenza, H1N1
Marijuana | weed, marijuana, ganja, cannabis, bong, spliff, Mary Jane
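The lexicon-based steps described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(month, text)` input format, and the choice to replace punctuation with spaces (so that ‘e-cig’ matches the key term ‘e cig’) are assumptions made for the example. Key terms are taken from Table 1 and lowercased to match the preprocessed text.

```python
import re
from collections import Counter

# Key terms from Table 1, lowercased to match the preprocessed text.
KEY_TERMS = {
    "ebola": ["ebola"],
    "electronic cigarette": ["e cig", "elec cig", "electronic cig"],
    "influenza": ["flu", "influenza", "h1n1"],
    "marijuana": ["weed", "marijuana", "ganja", "cannabis",
                  "bong", "spliff", "mary jane"],
}


def preprocess(text):
    """Lowercase the text and replace punctuation with spaces (assumed handling)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def matched_issues(text):
    """Return the issues whose key terms occur as substrings (partial matches)."""
    clean = preprocess(text)
    return {issue for issue, terms in KEY_TERMS.items()
            if any(term in clean for term in terms)}


def normalized_monthly_counts(submissions):
    """Compute, per issue and month, matching submissions / all submissions.

    `submissions` is an iterable of (month, text) pairs, e.g. ("2014-10", "...").
    """
    totals = Counter()
    hits = {issue: Counter() for issue in KEY_TERMS}
    for month, text in submissions:
        totals[month] += 1
        for issue in matched_issues(text):
            hits[issue][month] += 1
    return {issue: {month: hits[issue][month] / totals[month] for month in totals}
            for issue in KEY_TERMS}
```

Because matching is by substring, a text such as "She plays the flute" is counted under influenza in this sketch, which mirrors the false-positive risk of partial matching acknowledged in the limitations.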

RQ 2. What do Reddit members discuss on these health issues (a) in times of elevated activities or (b) in general, if the issues have a steady level of discussions?

Based on the results of RQ 1, we defined two scenarios for deciding which time periods to investigate further to understand the discussions on Ebola, electronic cigarettes, influenza, and marijuana. (a) If the issue has a sudden elevated level of discussion, we investigated the time period in which the elevation occurs along with prior discussions of the same temporal length to understand the underlying causes for these sudden changes in public interest. Similar methods that contrast with prior time periods have been used to detect emerging topics41,42. (b) If the issue has a steady level of discussion, we investigated the entire set of discussions on the issue to understand the main themes.
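Scenario (a) can be illustrated with a short Python sketch that flags months whose normalized count jumps relative to the previous month and then pairs the surge window with a prior window of equal length. The paper does not state a numeric threshold, so the `ratio` default of 1.8 (chosen to capture an "almost doubled" change) and both function names are hypothetical.

```python
def find_surges(monthly_rates, ratio=1.8):
    """Return months whose rate is at least `ratio` times the previous month's.

    `monthly_rates` is a chronologically ordered list of (month, rate) pairs.
    The threshold is an illustrative assumption, not a value from the study.
    """
    surges = []
    for (_, prev), (month, cur) in zip(monthly_rates, monthly_rates[1:]):
        if prev > 0 and cur / prev >= ratio:
            surges.append(month)
    return surges


def comparison_windows(months, surge_month, length=1):
    """Return (prior_window, surge_window), each covering `length` months."""
    i = months.index(surge_month)
    surge_window = months[i:i + length]
    prior_window = months[max(0, i - length):i]
    return prior_window, surge_window
```

Topic models built on the two windows can then be contrasted to surface topics that appear only in the surge window.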

We used natural language processing (NLP) and language modeling for this research question. Due to the size of the dataset and range of topics discussed on Reddit, we used automated methods. Similar automated methods have been used in the health care domain to extract information and analyze data, and to enhance personal health care experiences43–45. First, we preprocessed the entire dataset as we did in RQ1. Second, to improve the language modeling results, we removed URLs as well as comments and posts with fewer than 5 words, and then extracted nouns using the Python Natural Language Toolkit (NLTK) package46. The extracted nouns were used to create language models (i.e., a set of topics generated from document-level word co-occurrences for a given set of documents) using Latent Dirichlet Allocation47 (LDA) for the time periods of interest. We elected to use LDA, an unsupervised algorithm, due to the lack of a ground truth dataset. We considered each post and its associated comments as a single document.
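The document-building steps described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions: the input shapes (`posts` as a dict, `comments` as a list of pairs), the function names, and the order of operations (URLs removed before the five-word filter is applied) are not specified in the text. The noun-extraction step is omitted here; with NLTK it could be done by tagging tokens with `nltk.pos_tag` and keeping tags that start with "NN", though the exact calls used in the study are not reported.

```python
import re

# Matches http(s) links and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_tokens(text):
    """Lowercase, strip URLs, replace punctuation with spaces, and tokenize."""
    text = URL_RE.sub(" ", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def build_documents(posts, comments, min_words=5):
    """Merge each post with its comments into one token list per post ID.

    `posts` maps post_id -> text; `comments` is a list of (post_id, text).
    Submissions with fewer than `min_words` tokens are dropped before merging.
    """
    docs = {}
    submissions = list(posts.items()) + list(comments)
    for post_id, text in submissions:
        tokens = clean_tokens(text)
        if len(tokens) >= min_words:
            docs.setdefault(post_id, []).extend(tokens)
    return docs
```

The resulting token lists are in the form that topic modeling libraries typically expect as input documents.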

One advantage of using LDA as opposed to other unsupervised clustering techniques is that the algorithm treats each document as a mixture of multiple topics. A previous study of online health discussions suggested that discussions could have multiple topics due to topic drift48. Thus, we employed LDA for this study. One disadvantage of using LDA, however, is that it requires a pre-determined number of topics. After experimenting with varying numbers of topics, we generated 50 topics to understand Ebola, electronic cigarette, influenza, and marijuana related issues. We used the Python package gensim49 to conduct LDA analysis. We then present the main topics and their top 50 associated words as the word cloud overview using the Python package wordcloud50. Despite its simplicity, the word cloud overview remains one of the more preferred and user-friendly visualizations that can also scale to different data sizes51. We then manually investigated the identified topics and their associated words to thoroughly examine the LDA results.

Lastly, we performed two types of validity checks. First, for health issues with a sudden elevated level of discussion, we verified the LDA results via a systematic analysis of news at the time of the change. LDA results reflect motivations for the extreme changes, thus news can be an effective source for a validity check. Second, we extracted URLs using regular expressions and categorized the results. A previous study concerning electronic cigarettes—a product with few marketing restrictions in the US until recently—suggested that up to 90 percent of social media (in this case, Twitter) content could be related to product marketing52. Thus, because marketing content can skew our result, we used URLs as a proxy to marketing content and reported the percentage of posts with URLs. We also manually examined several extracted URLs to ensure the quality of the validation process.
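The URL-based validity check can be sketched as follows. The regular expression, function names, and the use of the domain as a starting point for categorization are assumptions for illustration; the study does not report the exact pattern or categorization procedure it used.

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs, stopping at whitespace or common closing delimiters.
URL_RE = re.compile(r"https?://[^\s)\]>]+")


def extract_urls(text):
    """Return all URLs found in a post or comment."""
    return URL_RE.findall(text)


def url_domains(text):
    """Return the lowercased domain of each URL, as a basis for categorization."""
    return [urlparse(url).netloc.lower() for url in extract_urls(text)]


def url_share(submissions):
    """Percentage of submissions that contain at least one URL."""
    submissions = list(submissions)
    if not submissions:
        return 0.0
    with_url = sum(1 for text in submissions if URL_RE.search(text))
    return 100.0 * with_url / len(submissions)
```

Grouping the extracted domains (for example, government sites, news outlets, and retailers) would then support the kind of manual categorization described above.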


RQ1. Is Reddit an effective source for tracking the prevalence of public interests on infectious diseases and controversial health related issues over time?

The lexicon-based approach identified Reddit posts, comments, and members discussing Ebola, electronic cigarette, influenza, and marijuana from October 2007 to May 2015 (Table 2). The most discussed matter was influenza, followed by marijuana, electronic cigarettes, and then Ebola. The raw counts of discussions and members who mentioned each topic generally increased with time.

Table 2.

The total number and average normalized count of posts, comments, members, and subreddits identified using the lexicon-based approach

Issue | Total posts and comments (n) | Average normalized count of posts and comments (%) | Total members (n) | Average normalized count of members (%) | Subreddits containing the key terms (n)
Ebola | 252,243 | 7.18E-05 | 113,546 | 9.68E-04 | 6,039
Electronic cigarette | 355,839 | 2.17E-04 | 176,252 | 3.75E-03 | 4,454
Influenza | 6,876,684 | 4.48E-03 | 1,443,223 | 0.06 | 30,856
Marijuana | 4,809,337 | 3.31E-03 | 968,892 | 3.75E-02 | 18,236

We identified one notable increase in discussion each for Ebola, influenza, and marijuana using the normalized frequencies over time (Figure 1). First, in February 2009, the normalized count for marijuana almost doubled from the previous month. The heightened level of discussion continued for two months, then slowly dropped back to the previous level. Second, in April of 2009, the normalized count for influenza almost doubled from the previous month. Third, discussions on Ebola surged in October 2014. The discussions on Ebola showed the largest increase, rising more than fivefold from the previous month. The number of members discussing each issue increased in a similar manner (Figure 1). Discussions on electronic cigarettes were relatively steady from October 2007 to May 2015.

Figure 1.

Line graphs of normalized frequencies over time for posts and comments with key terms and for members who used the key terms

Discussions on Ebola, electronic cigarettes, influenza, and marijuana, however, amounted to only a fraction of the overall discussions on Reddit (Table 2). Although the community as a whole did not frequently talk about these health-related issues, this still amounted to more than 3,000 members for Ebola, the least discussed issue, and more than 137,000 members for influenza, the most discussed issue, in a month with a normal level of discussion (May 2015).

New subreddits to discuss Ebola, Electronic Cigarette, Influenza, and Marijuana

Members of Reddit created a number of subreddits specifically focusing on Ebola, electronic cigarettes, influenza, and marijuana, although they also discussed the issues in many different subreddits (Table 3). Using the key terms (Table 1), we detected a total of 450 topically dedicated subreddits that were created between October 2007 and May 2015. For example, marijuana was casually discussed in 18,236 subreddits (i.e., subreddits with key terms in posts or comments), while members created at least 244 subreddits (i.e., key terms in names of subreddits) to talk about marijuana.

Table 3.

Newly created communities dedicated to Ebola, electronic cigarettes, influenza, and marijuana

Issue | Subreddits with key terms in names (n) | Subreddits with key terms in posts or comments (n) | Example subreddits
Ebola | 37 | 4,454 | EbolaOutbreak, batebola, Ebolaworld, ebola2014, ebolaUS, ILoveEbola, EbolaEpidemic, EbolaWestAfrica, ebolanews
Electronic cigarette | 3 | 6,039 | Ecigclassifieds, ecigclassifiedseu, ecigclassifiedsuk
Influenza | 166 | 30,856 | Influenza, Birdflu, flu
Marijuana | 244 | 18,236 | Marijuana, ganja, cannabiscultur, ganjaoutlaw, cannabis_marijuana

RQ 2. What do Reddit members discuss on these health issues (a) in times of elevated activities or (b) in general, if the issues have a steady level of discussions?

From RQ1, we learned that discussions concerning Ebola, influenza, and marijuana each had one sudden increase in activity. Thus, we created word cloud overviews of emerging topics for Ebola, influenza, and marijuana, and a general word cloud overview for electronic cigarettes (Figure 2).

According to the word cloud overview generated by the LDA topic modeling algorithm, we can infer that Reddit members are most concerned about ‘risk’ and ‘symptoms’ regarding Ebola. For influenza, members used terms like ‘Mexico’, ‘Obama’, ‘CDC’, and ‘conspiracy’, along with H1N1 influenza related terms (e.g., ‘H1N1’, ‘Swine’) as well as H5N1 related terms like ‘Egypt’ and ‘pig’. Topics regarding ‘legalization’, ‘prohibition’, ‘economy’, and ‘state’ appeared in discussions regarding marijuana. The general word cloud overview for electronic cigarettes has more commercially related terms such as ‘quality’, ‘prices’, ‘shop’, and ‘store’ than the other three discussions, however substantially more terms related to tobacco (e.g., ‘tobacco’, ‘cigarette’, ‘cigar’) are shown in Figure 2. Other notable topics for electronic cigarette that were identified via the LDA were ‘quitting smoking’, ‘fun experience’, and ‘health information’.

The LDA algorithm identified ‘quit’, ‘addiction’, ‘habit’, ‘cravings’, ‘gum’ and ‘turkey’ for ‘quitting smoking’, associated ‘fun’, ‘experience’, ‘safe’, and ‘pleasure’ with ‘fun experience’, and linked ‘cancer’, ‘risk’, ‘study’, ‘evidence’, ‘research’, ‘article’, ‘data’, and ‘science’ with ‘health information’. These topics highlighted a great opportunity for better health education (See Discussion).

To check the validity of the results, we extracted and investigated the URLs to ensure that frequencies are not inflated by marketing content. The types of URLs shared by members were similar in nature for all four issues. Members shared links to informational websites (e.g., Wikipedia, CDC), news (e.g., NY Times), personal stories (e.g., blogs), other social media platforms (e.g., YouTube), other Reddit posts, and commercial resources (e.g., Amazon). Although the proportion of each type of URL differs across issues, posts and comments with URLs made up a relatively small share of the overall posts and comments for all four issues.


Principal Findings

We examined four different infectious disease related or potentially stigmatized health related issues discussed on Reddit. We discovered three periods with higher levels of activity on Reddit. We observed that there were almost twice as many marijuana related discussions in February 2009 compared to the previous month, due, we suspect, to the announcement of a major shift in federal policy. Attorney General Eric Holder confirmed that the Drug Enforcement Administration would halt medical marijuana raids and give states the power to regulate medical marijuana usage for pain control in February of 200953. In April of 2009, discussions about influenza almost doubled from the previous month. This is likely because a novel strain of H1N1 influenza virus was discovered in North America in the spring of 200954 and the Centers for Disease Control and Prevention (CDC) confirmed the first two cases of human infection with H1N1 influenza virus in the United States in April of 200955. On September 30, 2014, the United States had its first diagnosed case of Ebola in Texas, and the first Ebola related death on October 8, 201456. We observed that discussions on Ebola, a potentially fatal infectious disease, increased more than fivefold from the previous month in October of 2014. The news related to Ebola, influenza, and marijuana aligns well with the results from the topic model analyses (RQ2). On the basis of these changes in activity, Reddit may be a valuable source of data for tracking the prevalence of public interests on infectious diseases (i.e., Ebola and influenza) and controversial health related issues (i.e., electronic cigarettes and marijuana) over time (RQ1).

The result of our analysis on electronic cigarette discussions suggests that Reddit contains more than just commercial content despite the fact there are at least three subreddits focusing on classified content (Table 4). For instance, a subreddit called ‘Ecigclassifieds’ consists mainly of commercial content, thus the content of these subreddits deserves further investigation to better utilize the data. From electronic cigarette discussion, we identified three topics, ‘quitting smoking’, ‘fun experience’, and ‘health information’ that highlighted opportunities for better health education. From their associated terms (see Results), we can infer that Reddit members are seeking information on these three topics. Information seeking behavior on Reddit suggests Reddit’s utility as another social media platform for information distribution and as a data source for understanding user groups (e.g., electronic cigarette smokers) and identifying better health education. Why members are seeking health information on Reddit is an unanswered research question, although a recent study suggests that electronic cigarette related health information from public health agencies may be too difficult for the general public to comprehend57.

Table 4.

Posts and comments containing URLs

Issue | Posts and comments with URLs (n) | Percentage of posts/comments with URLs out of the total number of posts/comments
Ebola | 32,863 | 13.03%
Electronic cigarette | 22,839 | 6.42%
Influenza | 783,350 | 11.39%
Marijuana | 390,675 | 8.12%

Reddit members also created at least 450 relevant new subreddits specifically focusing on these four issues. How the content from these subreddits contrasts with the content from multiple subreddits on the same issue is an unanswered question. Previous studies30,31,33,34,39 analyzed content from a handful of especially dedicated subreddits. However, our finding suggests that at least for discussions of Ebola, influenza, electronic cigarettes, and marijuana, members mentioned these issues on thousands of subreddits (Table 3). For instance, a common issue like influenza was discussed in over 30,000 subreddits, and even a focused topic like Ebola was discussed in over 4,400 subreddits. Thus, we believe analyzing a wider range of subreddits can improve recall of the relevant content.

Limitation, Future Directions, and User Privacy

Reddit offers substantial potential for understanding the public reactions to health-related topics, but not without a number of limitations. Although Reddit is a widely used platform, it is more frequently used by young males58,59 and may be subject to self-selection bias. Reddit members are not necessarily representative of the general public; however, the levels of activity on Reddit aligned with the United States news and deserve further investigation, especially with respect to location of postings and the overall reactions in Reddit. To better understand the reaction of the general public, studying different platforms and avenues, Facebook and Twitter for example, is warranted. Our analysis suggests that given the increasing popularity and use of Reddit, as well as the increasing frequency of discussions concerning our topics of interest, Reddit provides a productive starting point for investigating infectious disease related or controversial health issues.

Another limitation lies in the methodology. In RQ1, we used a relatively rudimentary lexicon-based approach to extract posts and comments explicitly mentioning variations of pre-specified key terms. One major shortcoming of such an approach is the selection of key terms. For example, utilizing a large set of key terms will undoubtedly create more false positives, whereas too limited a set of key terms will surely result in more false negatives. Moreover, partial matches can produce false-positive matches. We believe the figures for influenza were inflated because ‘flu’ can be a part of a longer word such as ‘fluorine’ or ‘flute’. In future studies, we suggest that precision rather than recall should be emphasized in order to eliminate irrelevant discussions. Other difficulties in mining social media data include the fact that social media text is frequently characterized by extensive use of acronyms, abbreviations, and slang terms60. Although we included the most frequently found abbreviations and slang terms, lexicon-based approaches are likely to omit unknown forms of abbreviations and slang. More sophisticated methods utilizing knowledge-based61 or corpus-based62 approaches could produce different results. Furthermore, a smaller timeframe can better measure the timeliness of the observed reactions as opposed to the one-month timeframe used in RQ1. In RQ2, we relied on a systematic analysis of the news to verify the result of our investigation. However, data-driven qualitative analysis63 can further bolster our findings and provide the contextual information on the discussions of interest. Sentiment analysis on the extracted discussion can also provide further clues about general public reactions on various health related topics9.
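The false-positive problem with partial matching, and one precision-oriented alternative, can be illustrated with a brief Python sketch. Whole-word matching with an optional plural "s" is an assumption introduced here as one possible remedy; it is not a method evaluated in this study, and both function names are hypothetical.

```python
import re


def partial_match(term, text):
    """Substring matching, as in the partial matching used for RQ1."""
    return term in text.lower()


def whole_word_match(term, text):
    """Match the term only as a whole word, allowing an optional plural 's'."""
    pattern = r"\b" + re.escape(term) + r"s?\b"
    return re.search(pattern, text.lower()) is not None
```

With these definitions, "She plays the flute" matches ‘flu’ under partial matching but not under whole-word matching, while "Caught the flu again" matches under both. The trade-off is lower recall: whole-word matching on ‘cig’ no longer covers ‘cigarette’, so such variants would need to be listed explicitly.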

Research and applications using social media data should be highly sensitive to user privacy, especially for potentially stigmatized topics. Although at least some social media data are publicly available, researchers should consider ethical implications when processing data even for population-level social media research using public data64–66. For this reason, we have refrained from using direct quotations from Reddit users in this paper.


As evidenced by the frequencies of discussions over time, elevated discussions after major news, and newly created subreddits specifically focusing on these health-related issues, Reddit could be a useful platform for understanding the concerns and opinions of the general public, especially for issues focusing on controversial topics, such as abuse and addiction, as well as infectious diseases of public health interest. By utilizing the content, we also identified opportunities for better health education that could improve public health outcomes. We created topic models using LDA, generated topically associated words, and created word cloud visualizations to show (1) emerging topics by contrasting with the prior topic models or (2) main themes of the discussions. We believe our insights and analyses can be generalized to other similar health related issues in the Reddit platform. Understanding public reactions to these issues has the potential to expand the scope of public-health practice.

Figure 2.

Word cloud overviews of emerging topics for Ebola (top left), influenza (top right), and marijuana (bottom left) as well as general word cloud overviews for electronic cigarette (bottom right)


We restricted our analysis to publicly available discussion content. The study was exempted from review by the University of Utah’s Institutional Review Board (Ethics Committee) [IRB 00076188].

Author AP was funded by National Library of Medicine of the National Institutes of Health under award number T15 LM007124. Author MC’s contribution to this research was supported by National Library of Medicine of the National Institutes of Health under award numbers R00LM011393 & K99LM011393.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


