AMIA Summits on Translational Science Proceedings. 2024 May 31;2024:670–678.

Enabling Semantic Topic Modeling on Twitter Using MetaMap

Rebecca Shyu 1, Chunhua Weng 1
PMCID: PMC11141808  PMID: 38827089

Abstract

Topic modeling performs poorly on short phrases or sentences and on ever-changing slang, both of which are common on social media platforms such as X, formerly known as Twitter. This study investigates whether concept annotation tools such as MetaMap can enable topic modeling at the semantic level. Using tweets mentioning “hydroxychloroquine” as a case study, we extracted 56,017 tweets posted between 03/01/2020 and 12/31/2021. The tweets were run through MetaMap to encode concepts with UMLS Concept Unique Identifiers (CUIs), and we then used Latent Dirichlet Allocation (LDA) to identify the optimal model for two datasets: 1) tweets with the original text and 2) tweets in which mapped terms were replaced with CUIs. We found that the MetaMap LDA models outperformed the non-MetaMap models in terms of coherence and representativeness and identified topics that were timely and relevant to social and political discussions. We concluded that integrating MetaMap to standardize tweets through UMLS concepts improved semantic topic modeling performance amidst noise in the text.

Introduction

The evolution of social media has led to many unintended social implications in areas such as politics, consumerism, and, most recently, public health. Across many platforms, including Twitter, Facebook, and Reddit, communities have formed around similar health interests and backgrounds, such as groups of patients with certain conditions.1,2 Patients who actively participate in these communities gain a multitude of benefits, including moral support, anecdotal advice, and general discussion about treatments.3 At the same time, social media can also be harmful through trending misinformation.4 There is great promise for health professionals to use this avenue to connect with and learn from patients and the general population.5–7 Similarly, communities of physicians have formed on social media to discuss clinical evidence, provide anecdotes, and dispense knowledge.8,9

The SARS-CoV-2 (COVID-19) pandemic swiftly generated conversations on social media, especially Twitter, among patients, physicians, and institutions. An overwhelming amount of information also became available for people to seek out and understand.10 At the same time, false information was being shared at an alarming rate, and social media engagement with false information was higher than with science-based information.11 A very large number of peer-reviewed publications about COVID-19 were being released with differing stances, leading to confusion among the public. In addition, there were many political and social consequences, including debate about the pandemic’s source, treatment, and authenticity. Social media became a platform for users of all backgrounds to openly discuss COVID-related topics and opinions.

Social media continues to grow as a data source and has immense potential within the biomedical and health policy sphere. Knowing exactly what users of all ages and backgrounds are discussing is useful for medical providers to understand how to conscientiously approach patients about certain topics. In addition, public health officials can develop strategies to combat misinformation and expand their reach through social media to users of all backgrounds. These opportunities allow the clinical relevance to span across many spheres, ultimately improving health literacy and outcomes. Specifically, by extracting semantic meanings from social media with a biomedical focus, the noise within social media data is reduced.

Previous studies aiming to categorize and analyze social media users and content have used a variety of algorithms. Topic modeling, specifically using Latent Dirichlet Allocation (LDA), is an established method for identifying common themes in text.12 Social media posts, especially tweets, have proven to be a poor fit for these algorithms, yielding low coherence scores because of the 280-character limit.13 In addition, the constantly evolving definitions and usage of slang and emojis make it difficult for established algorithms to synthesize and group concepts together. Current topic modeling methods are word-based, which makes detecting topics, each usually consisting of multiple words, difficult in a setting like social media. They are limited to words at face value rather than their meaning or semantics. This study introduces the use of MetaMap, a tool from the National Library of Medicine that maps text to Unified Medical Language System (UMLS) concepts, to address these pitfalls. MetaMap has typically been used to link biomedical text to UMLS concepts.14 Most uses are in natural language processing, including information retrieval and knowledge discovery within texts such as scientific literature and clinical notes.15 Analyzing social media using UMLS is an emerging application; however, it has so far been used only for relevancy filtering and information extraction.16

In this study, we hypothesize that MetaMap can enhance a current topic modeling method with meaningful semantic units by recognizing UMLS concepts. This case study focuses on tweets concerning hydroxychloroquine (HCQ) and COVID-19. Twitter was selected because a prior study found that it was the social media platform of choice for disease surveillance.17 Using HCQ as a treatment for COVID-19 patients was a major point of contention during the early stages of the pandemic, leading to an influx of studies and discourse. Much confusion about the effectiveness of hydroxychloroquine in treating COVID-19 stemmed from isolated events within different communities.18 The controversy began in March 2020 with a preliminary study, which was not peer-reviewed or rigorously generalized, but indicated the effectiveness of HCQ as a treatment for COVID; this study spread very quickly on social media, despite its lack of evidence.19 The US Food and Drug Administration then issued an Emergency Use Authorization (EUA) for HCQ as a treatment, which lasted from March 28, 2020, to June 15, 2020. On November 9, 2020, the JAMA article debunking HCQ for COVID was published.20 Throughout this period, some political figures vocally advocated for the use of HCQ while others were staunchly against it, leading to confusion about and resistance to emerging scientific literature. Each of these events sparked discourse beyond the scientific and medical community and played a large role in how the pandemic progressed. This research establishes a pipeline to identify the thematic content of tweets through topic modeling with a temporal aspect, introducing a novel application of MetaMap to improve LDA performance.

Methods

This study utilized a public, pre-defined COVID-19 Twitter dataset.21 This curated dataset pulled trending topics and keywords relating to COVID-19 from Twitter’s Streaming API and included additional tweets from another repository by Chen.22 Each data point had a unique tweet identifier, language, geolocation, and additional metadata; the actual text was not included in the dataset. Original English tweets that were not retweets (to prevent duplicates) were pulled (n = 734,393,952). To filter out bots and spam, tweets with no interactions (0 likes and 0 retweets) were also removed. This study focused on tweets posted between March 1, 2020, and December 31, 2021. This interval was chosen to span the beginning of the pandemic in early March 2020 through the end of 2021, in order to capture the evolution of the scientific community’s understanding of, and literature on, hydroxychloroquine and other treatments. Hydrator, an Electron-based application, was used to “hydrate” the tweets from the unique tweet IDs in the datasets.23 This, along with the Twitter Developer API, pulled the full text, metrics, and metadata for each existing tweet.24 Deleted and private tweets were not included (n = 95,858,524). For this proof-of-concept study, we focused on Twitter discourse surrounding hydroxychloroquine because of its controversial yet biomedically relevant nature, which drew opinions from many areas, including politics.18 We filtered tweets by string searching the full texts for the terms “hydroxychloroquine” and “hcq”, the main topic of our study. Our final dataset consisted of 56,017 tweets. Preprocessing of the tweets included removing all punctuation, non-ASCII characters, stop words from the NLTK toolkit, and links to other websites. Each tweet was also converted to lowercase.
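The filtering and preprocessing steps described above could be sketched roughly as follows. This is only an illustration, not the study's code: the function names, the URL pattern, the order of the cleaning steps, and the treatment of a tweet as relevant when it contains either search term are all assumptions, and the small `STOP_WORDS` set is a hypothetical stand-in for the NLTK stop word list.

```python
import re
import string

# Hypothetical stand-in for the NLTK English stop word list used in the study;
# in practice one would likely load nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}

# Search terms named in the text.
KEYWORDS = ("hydroxychloroquine", "hcq")

# Assumed pattern for "links to other websites".
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def mentions_hcq(text: str) -> bool:
    """Case-insensitive substring search; either keyword qualifies (assumption)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def preprocess(text: str) -> list[str]:
    """Lowercase, then drop links, non-ASCII characters, punctuation, and stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    # Dropping non-ASCII characters also removes emojis.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]


tweet = "HCQ is NOT a cure for COVID-19! 😷 https://example.com/x"
if mentions_hcq(tweet):
    print(preprocess(tweet))
```

Note that stripping punctuation turns "COVID-19" into a single token, "covid19", which matches the form in which that term appears in the topic tables.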

The next part of the pipeline was to use MetaMap (Linux, using the 2020AA USAbase Strict Data Model) to extract UMLS concepts from each tweet. This version included the COVID-19 terminology. Word sense disambiguation based on context was turned on, which limited the output to a single mapping based on the context determined by MetaMap, instead of multiple candidates for a word/phrase. For one tweet, there may be multiple concepts/Concept Unique Identifiers (CUIs), but for the purpose of our study, only one CUI was selected for each word/phrase. The MetaMap output found 729,341 total CUIs mapped across all 56,017 tweets. With each CUI, MetaMap provided the part of speech, score, and trigger from the original tweet (the exact word/phrase used to map to the UMLS counterpart).

Then, the full texts of the tweets were prepared for topic modeling with Gensim’s Latent Dirichlet Allocation (LDA), using the default symmetric alpha level. From here, we took two routes: (1) perform LDA on the preprocessed tweets and (2) perform LDA using MetaMap’s CUIs within the preprocessed tweets. For (1), we created bigrams and trigrams to feed into the LDA model. This allows words that frequently co-occur to be associated with each other and treated as a single term by the LDA model. We chose not to lemmatize the text because slang and references may lose their meaning in the lemmatization process. The second route (2) started from the same preprocessed tweets, but we substituted the UMLS CUIs for the MetaMap triggers. The input dataset was therefore a string of CUIs alongside any text from the original tweet that was not mapped to a biomedical term by MetaMap. Figure 1 shows an example: an original tweet is the input to MetaMap, the MetaMap output is passed through the preprocessing stage that removes stop words and non-ASCII characters, and the result is the input to the LDA model. This transformed data was fed into a second LDA model. For each of the models, coherence scores were calculated to identify the model with the optimal number of topics.
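As a rough illustration of the model selection step, the sketch below fits one Gensim LDA model per candidate number of topics, scores each with C_v coherence, and keeps the best. The function names, the candidate values of k, the random seed, and the rule of breaking ties in favor of fewer topics are assumptions made for illustration; the scores in the usage example are invented, not results from this study.

```python
def coherence_by_k(texts, candidate_ks, seed=0):
    """Fit one LDA model per k and return {k: C_v coherence}.

    `texts` is a list of token lists (words, or words mixed with CUIs).
    """
    # Imported lazily so pick_optimal_k below can be used without Gensim.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    scores = {}
    for k in candidate_ks:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       alpha="symmetric", random_state=seed)
        coherence_model = CoherenceModel(model=lda, texts=texts,
                                         dictionary=dictionary, coherence="c_v")
        scores[k] = coherence_model.get_coherence()
    return scores


def pick_optimal_k(scores):
    """Return the k with the highest coherence; ties go to the smaller k."""
    return min(scores, key=lambda k: (-scores[k], k))


# Invented scores, for illustration only.
example_scores = {6: 0.31, 8: 0.39, 10: 0.39, 14: 0.35}
print(pick_optimal_k(example_scores))
```

The same two functions would be applied once to the original token lists and once to the CUI-substituted token lists, so that both routes are selected by the same criterion.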

Figure 1.

MetaMap substitution example for the second route.
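The substitution in the second route can be approximated with a few lines of code. The sketch below is a simplified assumption about how triggers might be swapped for CUIs, not the authors' implementation: `substitute_cuis` is a hypothetical helper, the example tweet is invented, and matching is done case-insensitively on whole words with longer triggers handled first. The CUIs themselves are taken from Table 1.

```python
import re


def substitute_cuis(text: str, mappings: list[tuple[str, str]]) -> str:
    """Replace each MetaMap trigger phrase in `text` with its CUI.

    `mappings` holds (trigger, cui) pairs, one CUI per trigger.
    Text that matches no trigger is left unchanged.
    """
    # Longer triggers first so a phrase wins over a word it contains.
    ordered = sorted(mappings, key=lambda pair: len(pair[0]), reverse=True)
    for trigger, cui in ordered:
        pattern = re.compile(r"\b" + re.escape(trigger) + r"\b", re.IGNORECASE)
        text = pattern.sub(cui, text)
    return text


# Invented tweet; (trigger, CUI) pairs use CUIs listed in Table 1.
tweet = "fda revokes emergency use authorization for hcq"
pairs = [
    ("hcq", "C0020336"),
    ("fda", "C0041714"),
    ("emergency use authorization", "C3273147"),
    ("use", "C0042153"),
]
print(substitute_cuis(tweet, pairs))
```

Because "emergency use authorization" is replaced before "use", the shorter trigger no longer finds a match inside the phrase, which mirrors the one-CUI-per-phrase choice described in the Methods.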

We also performed a sub-analysis of the tweets based on the hydroxychloroquine timeline. Several political and social media events affected the public’s perception and understanding of hydroxychloroquine’s effect on COVID-19. Our first time chunk was March 1, 2020 – March 27, 2020, which marked the early stages of the pandemic, including the non-peer-reviewed study, before the Emergency Use Authorization went into effect. The second time chunk was March 28, 2020 – June 14, 2020, which covered the period in which the Emergency Use Authorization was in effect. The third was June 15, 2020 – November 8, 2020, which led up to the JAMA study that confirmed the ineffectiveness of HCQ as a treatment for COVID. Lastly, November 9, 2020 – December 31, 2021 marked the lingering discourse as other treatments backed by trials and scientific literature, like Paxlovid, were identified. We repeated the two LDA routes (with and without MetaMap-substituted tweets) and identified the optimal model for each of the four time periods.
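Assigning each tweet to one of these four periods is a simple date lookup. The following sketch uses the boundary dates given above; the `PERIODS` structure and the `time_chunk` function are hypothetical names introduced only for illustration.

```python
from datetime import date

# Inclusive (start, end) boundaries of the four time chunks described above.
PERIODS = [
    (date(2020, 3, 1), date(2020, 3, 27)),    # 1: before the EUA
    (date(2020, 3, 28), date(2020, 6, 14)),   # 2: EUA in effect
    (date(2020, 6, 15), date(2020, 11, 8)),   # 3: after the EUA, before the JAMA study
    (date(2020, 11, 9), date(2021, 12, 31)),  # 4: after the JAMA study
]


def time_chunk(posted: date):
    """Return the 1-based time chunk for a tweet date, or None if out of range."""
    for index, (start, end) in enumerate(PERIODS, start=1):
        if start <= posted <= end:
            return index
    return None


print(time_chunk(date(2020, 3, 20)))  # falls in the first chunk
```

Tweets grouped this way can then be passed separately through both LDA routes, one pair of models per period.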

Results

In general, there were recurring themes among the topics generated by both LDA models. Based on the optimal number of topics shown in the graph (8 and 14 for MetaMap and non-MetaMap respectively), we generated the topics (top 10 terms/concepts per topic). The CUIs and text terms are shown, with the human interpretable UMLS Concepts in parentheses (Table 1).

Table 1.

LDA models and topics for MetaMap and non-MetaMap models.

| LDA Model – With MetaMap | LDA Model – Without MetaMap (Topics 1–8) | LDA Model – Without MetaMap (Topics 9–14) |
| --- | --- | --- |
| Topic 1 – Information about HCQ: C0020336 (HCQ), C1442163 (Mom), C1548784 (Excellent), C0034850 (Endosome), C0418624 (Available Drug), C0750541 (Apparently), C0199244 (Antimalarial prophylaxis), ‘Per’, C1424731 (Nuance), C0013153 (Drug Administration Routes) | Topic 1 – Politics and Humor: ‘expensive’, ‘left’, ‘dead’, ‘kids’, ‘push’, ‘agenda’, ‘hydroxychloroquine’, ‘wuhan_lab’, ‘lol’, ‘funny’ | Topic 9 – Misinformation: ‘retweet’, ‘therapeutics’, ‘misinformation’, ‘hydroxychloroquine’, ‘lies’, ‘stay_home’, ‘science’, ‘facts’, ‘2020’, ‘wrote’ |
| Topic 2 – CDC and the Emergency Use Authorization: ‘Fauci’, C0020336 (HCQ), C1707664 (Delayed Release Dosage Form), C0027468 (NIH), C1070072 (Eua (gastropod)), C0007670 (CDC), C0042210 (Vaccines), C0041714 (FDA), C3273147 (Emergency Use Authorization), ‘Anthony’ | Topic 2 – Availability of Vaccines and Drugs: ‘ivermectin’, ‘hcq’, ‘vaccine’, ‘vax’, ‘cheap’, ‘like’, ‘available’, ‘people’, ‘make’, ‘work’ | Topic 10 – General Concerns: ‘important’, ‘blocking’, ‘anthony_fauci’, ‘death_rates’, ‘excellent’, ‘concerns’, ‘rick_bright’, ‘work’, ‘doesnt’, ‘gone’ |
| Topic 3 – Effectiveness of HCQ and Ivermectin: C0020336 (HCQ), C0087111 (Treatment), C0022322 (Ivermectin), C0013227 (Drug), ‘COVID19’, C1547282 (Show), C1280519 (Effectiveness), C0030705 (Patients), C0042153 (Use), ‘cheap’ | Topic 3 – Treatment Studies: ‘hydroxychloroquine’, ‘ivermectin’, ‘covid19’, ‘treatment’, ‘hcq’, ‘patients’, ‘effective’, ‘early’, ‘drugs’, ‘studies’ | Topic 11 – Effectiveness of HCQ: ‘hcq’, ‘fauci’, ‘cdc’, ‘people’, ‘knew’, ‘told’, ‘hydroxychloroquine’, ‘safe’, ‘effective’, ‘lied’ |
| Topic 4 – Misinformation: ‘RandPaul’, C1335624 (RIPK2 gene), ‘avoiding’, C0423899 (Above average intellect), C0020336 (HCQ), C0242664 (Husband), C4046016 (Hoax), C0020157 (Humanity), C1456647 (Childhood Vaccines), C1515187 (Take) | Topic 4 – HCQ and other Drugs: ‘hcq’, ‘ivermectin’, ‘hydroxychloroquine’, ‘people’, ‘like’, ‘take’, ‘well’, ‘zinc’, ‘got’, ‘took’ | Topic 12 – HCQ and other Drugs: ‘hcq’, ‘ivm’, ‘different’, ‘hydroxychloroquine’, ‘used’, ‘malaria’, ‘india’, ‘drugs’, ‘banned’, ‘doctors’ |
| Topic 5 – Vaccines vs HCQ: C0020336 (HCQ), C1515187 (Take), C0027361 (Person), C0043227 (Work), C0042210 (Vaccines), ‘like’, C4764086 (TRUMP), C1881534 (Make), ‘know’, C0205170 (Good) | Topic 5 – Politics and Fear: ‘causing’, ‘pandemic’, ‘biden’, ‘ppl’, ‘husband’, ‘immune_system’, ‘suppressed’, ‘criminal’, ‘fear’, ‘numbers’ | Topic 13 – Emergency Use Authorization: ‘ivermectin’, ‘use’, ‘hydroxychloroquine’, ‘treatments’, ‘hcq’, ‘fda’, ‘emergency’, ‘treatment’, ‘eua’, ‘alternative’ |
| Topic 6 – Influential Figures: C0020336 (HCQ), C2987476 (Exist), ‘Fauci’, C2349182 (Correct), C0007670 (CDC), C1521828 (Rate), ‘Vladimir’, C0392760 (Affecting), ‘charliekirk11’, C0596007 (HCQ Sulfate) | Topic 6 – Politics: ‘hydroxychloroquine’, ‘trump’, ‘political’, ‘pfizer’, ‘hes’, ‘drug’, ‘president’, ‘covid19’, ‘taking’, ‘right’ | Topic 14 – Ineffectiveness of HCQ: ‘vitamin’, ‘administered’, ‘guilty’, ‘team’, ‘hcq’, ‘fake’, ‘hydroxychloroquine’, ‘w’, ‘politicians’, ‘seems’ |
| Topic 7 – Treatment Options: C0022322 (Ivermectin), C0020336 (HCQ), C0087111 (Treatment), C0031831 (Physicians), C0019994 (Hospitals), ‘vax’, C1524063 (Use of), C2347840 (Save), C0018104 (Government), C0042153 (Use) | Topic 7 – General Negative Feelings about HCQ: ‘dr_vladmir’, ‘sulfate’, ‘zinc’, ‘hcq’, ‘actual’, ‘shut’, ‘evil’, ‘steroids’, ‘outpatient’, ‘early’ | |
| Topic 8 – Misinformation: C0596795 (Intravital Microscopy/IVM), C0022322 (Ivermectin), C0205848 (Death Rate), ‘McCullough’, C0020336 (HCQ), C0233660 (Blocking), C0043481 (Zinc), ‘Robert’, C1524075 (Chain), ‘Malone’ | Topic 8 – Politics and News: ‘hydroxychloroquine’, ‘pushing’, ‘pushed’, ‘cure’, ‘still’, ‘vaccine’, ‘covid19’, ‘trump’, ‘news’, ‘randpaul’ | |

The order of topics within each model corresponds with the order of prevalence, with topic 1 being the most prevalent. For the MetaMap model, two of the top topics seem to be about the initial confusion about the effectiveness and information about hydroxychloroquine; this includes HCQ’s previous application in preventing malaria and HCQ’s association with ivermectin. The second topic is about the Emergency Use Authorization and Dr. Anthony Fauci’s role with all the government agencies (NIH, CDC, FDA). Many of the remaining topics were centered around labelling COVID as a hoax and identifying individuals who were vocal about the topic. These themes also appeared in the topics of the non-MetaMap model. The topics emphasized more of the misinformation pushed around, surrounding the origin of the pandemic and the government’s role. Similarly, many names were included. Some were in favor of hydroxychloroquine, such as Topics 2-4, while others were more politically biased with assigning blame.

When incorporating the temporal aspect, there were some shifts in topics, but there were similar themes throughout as seen in Table 2. For the first time chunk, the most prevalent topics for both models were extremely similar and had 7/10 of the same terms, including “Azithromycin”, which was part of the controversial study published online on March 20, 2020. The second time chunk included more terms about the trials and studies being published, even mentioning the New England Journal of Medicine. The third and fourth time chunks had terms ranging from the effectiveness of hydroxychloroquine to politics to dropping the emergency use authorization order. As time went on, the topics became less focused but still relevant to the public’s understanding and concern about hydroxychloroquine.

Table 2.

Most prevalent topic of both models for each time chunk

| Time Chunk | LDA Model – With MetaMap | LDA Model – Without MetaMap |
| --- | --- | --- |
| 1: March 1, 2020 – March 27, 2020 | Topic 1: Initial Understanding of HCQ. C0020336 (HCQ), C0030705 (Patients), C1273517 (Used by), ‘COVID19’, C0013227 (Drug), C0052796 (Azithromycin), C0206750 (Coronavirus), C0008269 (Chloroquine), C4764086 (TRUMP), C0424324 (Fighting) | Topic 1: Initial Understanding of HCQ. ‘hydroxychloroquine’, ‘covid19’, ‘people’, ‘chloroquine’, ‘drug’, ‘trump’, ‘azithromycin’, ‘coronavirus’, ‘take’, ‘please’ |
| 2: March 28, 2020 – June 14, 2020 | Topic 2: HCQ Drug Trial and Publications. C0020336 (HCQ), C1515187 (Take), C0304229 (Drug Trial), C4764086 (TRUMP), ‘say’, ‘Covid19’, C1880198 (Cure), ‘COVID19’, ‘NEJM’, C0013227 (Drug) | Topic 2: HCQ Trials and Studies. ‘hydroxychloroquine’, ‘covid19’, ‘treatment’, ‘coronavirus’, ‘trial’, ‘study’, ‘hcq’, ‘trump’, ‘postexposure_prophylaxis’, ‘patients’ |
| 3: June 15, 2020 – November 8, 2020 | Topic 3: Effectiveness of HCQ and Political Influences. C0020336 (HCQ), C4764086 (TRUMP), C1833296 (Dementia), C1719958 (Pushing), C5203670 (COVID19), C0087111 (Treatment), C0043227 (Work), C1880198 (Cure), C0013227 (Drug), C1515187 (Take) | Topic 3: Effectiveness of HCQ and Similar Drugs. ‘hcq’, ‘hydroxychloroquine’, ‘covid19’, ‘remdesivir’, ‘zinc’, ‘drug’, ‘work’, ‘effective’, ‘vaccine’, ‘treatment’ |
| 4: November 9, 2020 – December 31, 2021 | Topic 4: Post-EUA and Vaccines. C0020336 (HCQ), ‘Says’, C1708059 (Fill), ‘effective’, ‘C19’, C0562019 (Dollar), C1456647 (Childhood Vaccines), C1705648 (Dropping), C3273147 (Emergency Use Authorization), C0011011 (Daughter) | Topic 4: General Pandemic. ‘hcq’, ‘ivermectin’, ‘doctors’, ‘people’, ‘trump’, ‘cdc’, ‘pandemic’, ‘ivm’, ‘saved’, ‘hydroxychloroquine’ |

MetaMap’s accuracy when assigning CUIs to the tweets was 84.2%. The error analysis was performed on a set of 50 randomly sampled tweets: 549 of the 652 concepts in the sample were correctly mapped, while some concepts were missed. In comparing the performance of the LDA models for MetaMap-substituted and non-MetaMap tweets, the MetaMap models were more coherent and more representative of the overall tweet corpus. The corpus is the structured collection of text/words formed by all the tweets. For the complete set of tweets (March 1, 2020 to December 31, 2021), we found that the model using the MetaMap-substituted tweets had a higher coherence score (0.386) than the model using the original text (0.337). Coherence scores (C_v) combine cosine similarity and normalized pointwise mutual information (NPMI) into one measure that represents topic interpretability and correlates with human topic rankings.25 The higher the coherence score, the more interpretable and relevant the topics. The score depends on the k parameter, which is the number of topics a model searches for. Ideally, the fewer topics needed, the better and less cluttered the model is. The optimal MetaMap model therefore had both a higher coherence score and a lower number of topics (8 vs 14), indicating that it did a better job of summarizing and identifying common terms/concepts. These trends are shown in Figure 2.
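To make the NPMI component of the coherence score more concrete, the sketch below computes NPMI for a pair of terms. It is a simplification and not the full C_v measure: co-occurrence is counted at the level of whole documents rather than within a sliding window, no cosine similarity step is included, and the function name and toy documents are hypothetical.

```python
import math


def npmi(term_a, term_b, documents):
    """Normalized pointwise mutual information of two terms.

    `documents` is a non-empty list of token lists. Probabilities are
    estimated from document-level co-occurrence (a simplifying assumption).
    The result lies in [-1, 1]; higher means the terms co-occur more often
    than expected by chance.
    """
    n = len(documents)
    doc_sets = [set(doc) for doc in documents]
    p_a = sum(term_a in d for d in doc_sets) / n
    p_b = sum(term_b in d for d in doc_sets) / n
    p_ab = sum(term_a in d and term_b in d for d in doc_sets) / n
    if p_ab == 0:
        return -1.0  # the terms never co-occur
    if p_ab == 1:
        return 1.0   # both terms appear in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


# Toy documents, invented for illustration.
docs = [["hcq", "zinc"], ["hcq", "zinc"], ["hcq"], ["vaccine"]]
print(round(npmi("hcq", "zinc", docs), 3))
```

A topic whose top terms have high pairwise NPMI tends to be judged more interpretable, which is the intuition behind using coherence to compare the MetaMap and non-MetaMap models.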

Figure 2.

Comparing coherence score trends across various numbers of topics (MetaMap vs. non-MetaMap)

Visualizing the topic distribution across the corpus further supported the effectiveness of using MetaMap. Figure 3 compares the breakdown of the topics for both MetaMap and non-MetaMap tweets. Each bubble represents one of the topics found by the models; the MetaMap model had 8 bubbles and the non-MetaMap model had 14. The larger the bubble, the more prevalent the topic is within the overall corpus of tweets. Overlap between bubbles indicates similarity between topics, which ideally would not be present. For the MetaMap model, only 3 topics overlapped, while many topics in two quadrants overlapped for the non-MetaMap model. Less overlap indicates that the topics are more distinguishable and more useful for human interpretation. In addition, the distance map for the non-MetaMap model does not have any topics within the bottom right quadrant, which shows a lack of representation for many tweets and terms within the corpus.

Figure 3.

Visualization of LDA topics (Left: MetaMap Model, Right: Non-MetaMap Model)

For the temporal aspect of our study, we repeated the LDA procedure for both the MetaMap and non-MetaMap routes within each of the four time chunks defined above. We found that the trends observed for the overall dataset continued within each time segment: the LDA model with MetaMap outperformed the LDA model with regular text, as shown in Table 3. In general, MetaMap models had higher coherence scores but needed more topics towards the beginning of the timeline. After the Emergency Use Authorization ended, however, the optimal number of topics went back down to 8, which was also the optimal number of topics for the overall model.

Table 3.

Comparison of MetaMap and non-MetaMap models across time segments

| Timeline | With MetaMap: Coherence Score | With MetaMap: Optimal # of Topics | Without MetaMap: Coherence Score | Without MetaMap: Optimal # of Topics |
| --- | --- | --- | --- | --- |
| March 1, 2020 – March 27, 2020 | 0.390 | 14 | 0.371 | 8 |
| March 28, 2020 – June 14, 2020 | 0.332 | 14 | 0.307 | 8 |
| June 15, 2020 – November 8, 2020 | 0.382 | 8 | 0.308 | 8 |
| November 9, 2020 – December 31, 2021 | 0.394 | 8 | 0.376 | 14 |

Discussion and Conclusion

As the COVID-19 pandemic progressed, so did the public discourse about biomedical issues. For hydroxychloroquine specifically, many outside influences (political and social events) increased interest and online activity. The topics we identified indicated that the public was aware of the studies and of the government’s role in understanding hydroxychloroquine’s effect on treating COVID-19. Many topics included misinformation and placed blame on individuals and groups, including news figures, politicians, and government organizations. When we identified topics within specific time chunks, some were clearly about the relevant events happening at the time. For example, the original, controversial study, which was published during the first time chunk, appeared in the topics for that period. From there, the topics coincided with the NIH-funded clinical trial and other scientific literature published in sources such as NEJM. A constant term in almost every topic throughout the timeline was “Trump.”

This study shows that there is potential for MetaMap to expand into social media to extract meaningful biomedical concepts. Its application as a natural language processing tool on a social media platform such as Twitter standardizes the text and provides the LDA algorithm with a smaller, more focused corpus than the original text. In general, the LDA models on the tweets substituted with MetaMap CUIs performed better (had a higher coherence score) and required fewer topics, which were also more distinguishable, compared to the LDA models with regular text. Additionally, MetaMap was able to match different terms, such as “hydroxychloroquine” and “hcq”, to the same CUI instead of having the models treat them separately. For public health efforts, social media can be leveraged to accurately assess the public’s opinion and confusion about pressing healthcare topics, such as an emerging pandemic. It can also help clinicians approach discussions about controversial topics with greater social context.

However, MetaMap has limited ability to take context into account, so it also made mistakes such as matching “Trump” to CUI C4764086, which corresponds to “Transplant Registry Unified Management Program”, or matching “Dems” (short for Democrats) to CUI C1833296, which stands for “Dementia”. In the error analysis, many well-known news organizations and acronyms were missed by MetaMap. In addition, many informal terms are not recognized by MetaMap, such as the slang “Rona” for “Coronavirus”. This limitation means that many concepts mentioned informally in tweets are missed. MetaMap also does not handle misspelled words, emojis, or photos, which are integral parts of Twitter. MetaMap’s vocabulary and CUIs are not continuously updated, and large software updates were required to include COVID-specific terms. This would hinder real-time use of MetaMap for future emerging public health problems and would require manual or additional vocabulary datasets. Another limitation of this study is that the dataset was restricted to tweets that were public at the time of the study. Deleted tweets and tweets from users with private accounts were not included, so the dataset does not represent the full extent of the content and topics that were discussed at the time.

Future work could include incorporating informal contexts into existing MetaMap concepts or expanding the scope beyond hydroxychloroquine and the social/political events included in this study. The pipeline could also utilize data from additional social media platforms such as Facebook, which allows posts with longer text and may reach a separate user base and audience. Integrating technology such as GPT models may improve the detection, recognition, and contextualization of emoticons and add another layer to the topic modeling pipeline. The clinical potential and relevance of using topic modeling on datasets beyond biomedically curated texts is increasing as social media continues to solidify its influence on society through politics and public health. Overall, the coherence scores are still not optimal for LDA, but they are an improvement over performing LDA on regular social media text. Future work on denoising tweets or using regularizers has the potential to further improve coherence scores. Incorporating semantic knowledge resources, such as UMLS’ MetaMap, can help address the pitfalls of topic modeling on non-biomedical text, leading to better interpretability and knowledge discovery.

Acknowledgments

This research was supported by NLM T15 grant T15LM007079, NLM R01 grants R01LM009886 and R01LM014344-01, and NCATS CTSA grant UL1TR001873.

References

  • 1. Naslund JA, Aschbrenner KA, Marsch LA, Bartels SJ. The future of mental health care: peer-to-peer support and social media. Epidemiology and Psychiatric Sciences. 2016 Apr 1;25(2):113–22. doi: 10.1017/S2045796015001067.
  • 2. Chen J, Wang Y. Social media use for health purposes: systematic review. J. Med. Internet Res. 2021 May 12;23(5):e17917. doi: 10.2196/17917.
  • 3. Moorhead SA, Hazlett DE, Harrison L, Carroll JK, Irwin A, Hoving C. A new dimension of health care: systematic review of the uses, benefits, and limitations of social media for health communication. J. Med. Internet Res. 2013 Apr 23;15(4):e85. doi: 10.2196/jmir.1933.
  • 4. Steffens MS, Dunn AG, Wiley KE, Leask J. How organisations promoting vaccination respond to misinformation on social media: a qualitative investigation. BMC Public Health. 2019 Dec 1;19(1). doi: 10.1186/s12889-019-7659-3.
  • 5. Hamm MP, Chisholm A, Shulhan J, et al. Social media use by health care professionals and trainees. Acad Med. 2013 Sep 1;88(9):1376–83. doi: 10.1097/ACM.0b013e31829eb91c.
  • 6. Yaagoob E, Hunter S, Chan S. The effectiveness of social media intervention in people with diabetes: An integrative review. J. Clin. Nurs. 2022 May 11. doi: 10.1111/jocn.16354.
  • 7. Smailhodzic E, Hooijsma W, Boonstra A, Langley DJ. Social media use in healthcare: A systematic review of effects on patients and on their relationship with healthcare professionals. BMC Health Serv. Res. [Internet] 2016 Dec 1;16(1). doi: 10.1186/s12913-016-1691-0.
  • 8. Rolls K, Hansen M, Jackson D, Elliott D. How health care professionals use social media to create virtual communities: an integrative review. J. Med. Internet Res. 2016 Jun 21;18(6):e166. doi: 10.2196/jmir.5312.
  • 9. Maloney S, Tunnecliff J, Morgan P, et al. Translating evidence into practice via social media: a mixed-methods study. J. Med. Internet Res. 2015 Oct 26;17(10):e242. doi: 10.2196/jmir.4763.
  • 10. Mheidly N, Fares J. Leveraging media and health communication strategies to overcome the COVID-19 infodemic. J. Public Health Policy. 2020 Dec 1;41(4):410–20. doi: 10.1057/s41271-020-00247-w.
  • 11. Pulido CM, Villarejo-Carballido B, Redondo-Sama G, Gómez A. COVID-19 infodemic: More retweets for science-based information on coronavirus than for false information. Int. J. Sociol. 2020 Jul 1;35(4):377–92.
  • 12. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003 Mar 1;3:993–1022.
  • 13. Zhao WX, Jiang J, Weng J, et al. Comparing Twitter and Traditional Media Using Topic Models. In: Clough P, Foley C, Gurrin C, Jones GJF, Kraaij W, Lee H, et al., editors. Advances in Information Retrieval [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2011. pp. 338–49. (Lecture Notes in Computer Science; vol. 6611)
  • 14. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010;17(3):229–36. doi: 10.1136/jamia.2009.002733.
  • 15. Amos L, Anderson D, Brody S, Ripple A, Humphreys BL. UMLS users and uses: a current overview. J Am Med Inform Assoc. 2020 Oct 1;27(10):1606–11. doi: 10.1093/jamia/ocaa084.
  • 16. Denecke K. Information extraction from medical social media. In: Denecke K, editor. Health Web Science: Social Media Data for Healthcare [Internet]. Cham: Springer International Publishing; 2015. pp. 61–73. (Health Information Science)
  • 17. Gupta A, Katarya R. Social media based surveillance systems for healthcare using machine learning: A systematic review. J. Biomed. Inform. 2020 Aug 1;108:103500. doi: 10.1016/j.jbi.2020.103500.
  • 18. Saag MS. Misguided Use of Hydroxychloroquine for COVID-19: The infusion of politics into science. JAMA. 2020 Dec 1;324(21):2161–2. doi: 10.1001/jama.2020.22389.
  • 19. Gautret P, Lagier JC, Parola P, Hoang VT, Meddeb L, Mailhe M, et al. Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial. Int. J. Antimicrob. Agents. 2020 Jul;56(1):105949. doi: 10.1016/j.ijantimicag.2020.105949. [Retracted]
  • 20. Self WH, Semler MW, Leither LM, Casey JD, Angus DC, Brower RG, et al. Effect of hydroxychloroquine on clinical status at 14 days in hospitalized patients with COVID-19: a randomized clinical trial. JAMA. 2020 Dec 1;324(21):2165–76. doi: 10.1001/jama.2020.22240.
  • 21. Lopez CE, Gallemore C. An augmented multilingual Twitter dataset for studying the COVID-19 infodemic. Soc Netw Anal Min. 2021;11(1). doi: 10.1007/s13278-021-00825-0.
  • 22. Chen E, Lerman K, Ferrara E. Tracking social media discourse about the COVID-19 pandemic: development of a public coronavirus twitter data set. JMIR Public Health Surveill. 2020 May 29;6(2):e19273. doi: 10.2196/19273.
  • 23. Hydrator [Internet]. Documenting the Now. 2022 [cited 2022 Dec 18]. Available from: https://github.com/DocNow/hydrator.
  • 24. Twitter. Twitter Developer API [Internet]. [cited 2022 Dec 18]. Available from: https://developer.twitter.com/en.
  • 25. Syed S, Spruit M. Full-Text or abstract? Examining topic coherence scores using Latent Dirichlet Allocation. In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA). 2017. pp. 165–74.

Articles from AMIA Summits on Translational Science Proceedings are provided here courtesy of American Medical Informatics Association
