Abstract
The prevalence of social media has driven a growing number of health-related applications that use information shared by online users. It is well known that a gap exists between healthcare professionals and laypeople in how they express the same health concepts. Filling this gap is particularly important for health-related applications that use social media data. A data-driven, attributional similarity-based method was developed to identify Twitter terms related to side effect concepts. For the 10 most common side effect (symptom) concepts, our method identified a total of 333 Twitter terms, of which only 37 matched any of the 90 terms for these concepts in the consumer health vocabulary (CHV). The identified Twitter terms are specific to Twitter data, indicating a need to expand the existing CHV, and many of them seem to have less ambiguity in word senses than those in CHV.
Keywords: Consumer health concepts, pharmacovigilance, social media, Twitter
1. Introduction
The prevalence of social media has yielded a growing number of health-related applications that use social media data. One active area of this endeavor is pharmacovigilance, whose primary task is the continuous detection of suspected unknown side effects of pharmaceutical products in order to promote their safe use. This is evidenced by a 2015 systematic review by Golder et al. [1], who searched for “pharmacovigilance” and “social media” in 16 databases and found more than 3,000 relevant published articles, with an upward trend.
It is well known that there is a gap between healthcare professionals and laypeople (consumers) in expressing health concepts [2]. This is particularly true in the era of social media, where consumers typically share personal experiences related to their health issues and use different expressions to describe health concepts. Therefore, there is a need to understand how social media users express health concepts.
2. Method
A data-driven, attributional similarity-based method was developed to identify Twitter terms corresponding to concepts of side effects (symptoms). Our method leverages a large volume of Twitter data and is based on attributional (syntactic and semantic) similarity, with which Mikolov et al. [3] demonstrated state-of-the-art results on many related tasks using a vector space model (VSM) of words. In June 2017, a total of 53 million tweets were collected from twitter.com using a set of 1,147 medication names as keywords. The collection was done with a custom-built web crawler to overcome the limitations of the Twitter APIs. The collected tweets were preprocessed to remove non-English and duplicate tweets. Phrases were learned from the preprocessed tweets using the gensim software, which implements the data-driven, co-occurrence-based phrase learning algorithm [4]. Afterward, the learned phrases were substituted into the text of the preprocessed tweets. A VSM was created from the preprocessed tweets using Google’s word2vec (skip-gram, window_size=10, min_count=5 and dimension=300). This VSM was used to generate Twitter terms similar to each health concept, where similarity is the cosine similarity between vectors and terms with a similarity of 0.2 or higher were retained. The SIDER side effect list was used as the standard set of side effect concepts, and Twitter terms similar to each concept were identified by comparing the vector similarity of each concept-term pair. To understand how the Twitter terms differ from the existing CHV in expressing side effect concepts, we compiled a list of CHV side effect terms (CHV+SE) by intersecting the CHV and the SIDER side effect list through the alignment of concept IDs (CUIs).
For each side effect concept, a list of Twitter terms similar to the concept was generated by including all terms with a similarity of 0.2 or higher. Terms matching any of the CHV+SE terms were assigned the corresponding concept ID. The list was sorted first by occurrence (the number of times a term appears in our corpus) and then by similarity. Stop words and irrelevant terms were removed from the list. Afterwards, the list was manually annotated by the first two authors, and each term was labeled with one of three choices: Yes, No, or Unsure. A Yes term is highly likely to express the side effect concept; a No term is highly unlikely to express the concept; and an Unsure term falls somewhere in between. All the Yes terms became the candidate terms for the concept, and all No terms were added to the irrelevant term list.
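The similarity filter described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the helper names `cosine_similarity` and `similar_terms` and the three-dimensional toy vectors are assumptions made for the example (the study used 300-dimensional word2vec vectors), and only the 0.2 threshold comes from the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_u * norm_v)

def similar_terms(concept, vectors, threshold=0.2):
    """Return (term, similarity) pairs at or above the threshold, most similar first."""
    target = vectors[concept]
    scored = [
        (term, cosine_similarity(target, vec))
        for term, vec in vectors.items()
        if term != concept
    ]
    kept = [(term, sim) for term, sim in scored if sim >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors; the values are made up for illustration only.
toy_vectors = {
    "headache": [0.9, 0.1, 0.0],
    "migrane": [0.8, 0.2, 0.1],
    "pizza": [-0.1, 0.0, 1.0],
}
candidates = similar_terms("headache", toy_vectors)
```

With these toy values, only `migrane` passes the threshold for `headache`, because the `pizza` vector points in a nearly orthogonal direction.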
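The list-building and annotation steps above might look roughly like the following sketch. All function names, the occurrence counts, and the placeholder concept ID `CUI_PAIN` are hypothetical. The text does not state the sort direction or how unlabeled terms were handled, so descending order and a default of Unsure are assumptions. The two similarity values are the ones reported for the pain concept in the Discussions section.

```python
def build_candidate_list(similar, occurrences, chv_se_cui, stop_words):
    """Build the per-concept term list.

    similar:     dict term -> similarity to the concept (already >= 0.2)
    occurrences: dict term -> number of times the term appears in the corpus
    chv_se_cui:  dict CHV+SE term -> concept ID (CUI)
    stop_words:  set of stop words and known irrelevant terms
    """
    rows = []
    for term, sim in similar.items():
        if term in stop_words:
            continue
        rows.append({
            "term": term,
            "similarity": sim,
            "occurrence": occurrences.get(term, 0),
            "cui": chv_se_cui.get(term),  # None when the term is not in CHV+SE
        })
    # Sort by occurrence first, then by similarity (descending is an assumption).
    rows.sort(key=lambda r: (r["occurrence"], r["similarity"]), reverse=True)
    return rows

def apply_annotations(rows, labels, irrelevant_terms):
    """Keep Yes terms as candidates and add No terms to the irrelevant term set."""
    accepted = []
    for row in rows:
        label = labels.get(row["term"], "Unsure")  # default is an assumption
        if label == "Yes":
            accepted.append(row["term"])
        elif label == "No":
            irrelevant_terms.add(row["term"])
    return accepted

# Hypothetical inputs for the concept "pain"; counts and CUI are placeholders.
similar = {"painful": 0.346, "excruciatingpain": 0.777, "the": 0.21, "game": 0.25}
occurrences = {"painful": 500, "excruciatingpain": 120, "the": 100000, "game": 90}
rows = build_candidate_list(similar, occurrences, {"painful": "CUI_PAIN"}, {"the"})

irrelevant = set()
labels = {"painful": "Yes", "excruciatingpain": "Yes", "game": "No"}
accepted = apply_annotations(rows, labels, irrelevant)
```

Carrying the irrelevant term set forward lets terms rejected for one concept be filtered out before annotation of the next.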
3. Results
Table 1 shows the statistics of discovered Twitter terms and CHV terms that are considered to have the same meaning and/or the same concept ID (CUI).
Table 1.
Statistics of side effect terms discovered in tweets and found in CHV. The first column lists the 10 most common side effect terms. The “# of discovered” column is the number of Twitter terms discovered in this study. The “# of match” column is the number of terms common to the Twitter terms and CHV, and the “% of match” column is the percentage of common terms with respect to the discovered Twitter terms.
| Side Effect | # of discovered | # in CHV | # of match | % of match | Occurrence |
|---|---|---|---|---|---|
| pain | 52 | 6 | 3 | 5.8% | 376,627 |
| headache | 57 | 16 | 3 | 5.3% | 135,953 |
| anxiety | 48 | 4 | 4 | 8.3% | 95,998 |
| depression | 37 | 13 | 4 | 10.8% | 63,001 |
| migraine | 28 | 13 | 2 | 7.1% | 53,460 |
| insomnia | 33 | 7 | 3 | 9.1% | 38,571 |
| stress | 8 | 4 | 4 | 50.0% | 34,120 |
| nausea | 35 | 13 | 6 | 17.1% | 29,777 |
| cough | 12 | 6 | 3 | 25.0% | 17,677 |
| asthma | 23 | 8 | 5 | 21.7% | 15,091 |
| total | 333 | 90 | 37 | 11.1% | 860,275 |
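The totals and the “% of match” column can be re-derived from the per-concept rows. The short sketch below does this arithmetic using the counts from Table 1. The names `table1` and `match_percent` are introduced here only for illustration.

```python
# (discovered, in CHV, matched) per side effect, copied from Table 1.
table1 = {
    "pain": (52, 6, 3),
    "headache": (57, 16, 3),
    "anxiety": (48, 4, 4),
    "depression": (37, 13, 4),
    "migraine": (28, 13, 2),
    "insomnia": (33, 7, 3),
    "stress": (8, 4, 4),
    "nausea": (35, 13, 6),
    "cough": (12, 6, 3),
    "asthma": (23, 8, 5),
}

def match_percent(discovered, matched):
    """Matched terms as a percentage of discovered Twitter terms, one decimal."""
    return round(100 * matched / discovered, 1)

# Column-wise sums give the "total" row.
total_discovered, total_in_chv, total_matched = (
    sum(column) for column in zip(*table1.values())
)
overall_percent = match_percent(total_discovered, total_matched)
per_concept = {
    name: match_percent(discovered, matched)
    for name, (discovered, _, matched) in table1.items()
}
```

The percentage is taken relative to the discovered Twitter terms, not the CHV terms, which is why the total row reads 37 of 333.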
For the top 10 side effect concepts, there are a total of 90 CHV terms and 333 Twitter terms. Summarized in Table 2 are the numbers of tweets with CHV terms and Twitter terms for each of the top 10 side effect concepts. Table 3 lists example tweets with Twitter terms of highest similarities.
Table 2.
Statistics of tweets containing CHV and Twitter side effect terms.
| Side Effect | Total # of tweets | # of tweets with CHV terms | # of tweets with Twitter terms | % of tweets with Twitter terms |
|---|---|---|---|---|
| pain | 376,627 | 366,780 | 9,847 | 2.6% |
| headache | 135,953 | 124,633 | 8,804 | 6.5% |
| anxiety | 95,998 | 90,778 | 5,220 | 5.4% |
| depression | 63,001 | 60,198 | 2,803 | 4.4% |
| migraine | 53,460 | 49,974 | 3,486 | 6.5% |
| insomnia | 38,571 | 29,379 | 9,192 | 23.8% |
| stress | 34,120 | 33,813 | 370 | 0.9% |
| nausea | 29,777 | 28,789 | 988 | 3.3% |
| cough | 17,677 | 16,676 | 1,001 | 5.7% |
| asthma | 15,091 | 13,651 | 1,440 | 9.5% |
Table 3.
Examples of tweets containing the Twitter side effect terms. The tweets are related to the personal experiences of the side effects.
| Twitter Term (similarity) | Example Tweet |
|---|---|
| massiveheadache(0.615) | Day 3 of a massive headache. No amount of acetaminophen, ibuprofen or caffeine helps for long. Poop. |
| excruciatingpain(0.777) | Im in excruciating pain who got some Xanax |
| anxietypanicattacks(0.716) | Yeah :/ Albuterol gives me anxiety/panic attacks because its a stimulant. Pulmozyme usually is ok... but tobramycin tastes NASTY. |
| cripplingdepression(0.595) | My crippling depression and addiction to Adderall |
| migrane(0.800) | @USER I used to take clonazepam but it always gave me migrane. I just stick to zolpidem now but even that isnt reliable. |
| stres(0.251) | i need to stay up for at least 3 more hours but im so stressed but if i take a xanax im afraid ill fall asleep but my head hurts bc stres |
| insomia(0.657) | I _ just took a AMBIEN CRI think that this is going to help me actually sleep tonight, insomia sucks and so do pills! at least I am sleepy’ |
| extremenausea(0.680) | Two days of extreme nausea and headaches...nothing works like my Ativan. Ugh I really need to find a new doctor.—feeling sick |
| caugh(0.450) | So the doc gave me prednisone for my caugh/asthma. Good news cough is getting better-bad news I cannot sleep ugh! #steriodsarebad |
| asthmasymptoms(0.559) | I use seretide my pneumoligist said its better and I take 2 different pills no asthma symptoms!Rinse mouth after puff |
4. Discussions
Table 1 shows that although there are 90 CHV terms and 333 Twitter terms in total for the top 10 side effect concepts, only 37 terms were found in both Twitter and CHV, indicating that not all the CHV terms appeared in our tweet corpus. In other words, Twitter users do not appear to use all the CHV terms when expressing health concepts, and they seem to have their own Twitter-specific expressions. This suggests a need to expand the existing CHV to include the Twitter-specific terms.
Even though more Twitter side effect terms than CHV terms were identified for each health concept (Table 1), the number of tweets containing the Twitter side effect terms is smaller than the number containing CHV terms (Table 2). This may indicate that the discovered Twitter side effect terms are used less frequently, but they may be more relevant to the corresponding health concepts than the CHV terms.
Table 3 gives actual examples of tweets containing the Twitter side effect terms, and the text of the examples shows that these Twitter terms are closely related to the corresponding health concepts. The discovered Twitter terms can be single words, phrases, and even misspellings (e.g., migrane, stres, insomia, and caugh). The table also includes the similarities of the Twitter terms, and the terms listed are those with the highest similarity to the corresponding side effects. The term painful, which was found in both the Twitter terms and CHV, has a similarity of 0.346 to the concept pain, whereas the Twitter term excruciating pain has a similarity of 0.777, indicating that the latter is more closely related to pain. This observation may signify that some CHV terms were used in senses unrelated to the health concepts (e.g., it is painful to get a medication), and that Twitter terms may have less ambiguity in word senses because of their higher (syntactic and/or semantic) similarities to the health concepts.
5. Conclusion
A data-driven, attributional similarity-based method was presented to identify Twitter terms that are most (syntactically and semantically) similar to health concepts. For the 10 most common side effect concepts, it discovered 333 Twitter terms, of which only 37 matched any of the 90 CHV terms for these concepts. Analysis of our data shows that (1) there is a need to expand the existing CHV by including Twitter terms that are more specific to Twitter data, and (2) many identified Twitter side effect terms seem to be less ambiguous in word senses than CHV terms because their similarities to the health concepts are higher than those of CHV terms.
Acknowledgement
The authors wish to thank the anonymous reviewers for their constructive comments. This work was supported by National Institutes of Health Grant 1R15LM011999–01.
References
- [1] Golder S, Norman G, Loke YK, Systematic review on the prevalence, frequency and comparative value of adverse events data in social media, British Journal of Clinical Pharmacology 80 (2015), 878–888.
- [2] Zeng QT, Tse T, Exploring and developing consumer health vocabularies, Journal of the American Medical Informatics Association 13 (2006), 24–29.
- [3] Mikolov T, Chen K, Corrado GS, Dean J, Efficient estimation of word representations in vector space, International Conference on Learning Representations (2013).
- [4] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems (2013), 3111–3119.
