Abstract
Individuals increasingly rely on social media to discuss health-related issues. One way to provide easier access to relevant information is through sentiment analysis – classifying text into polarity classes such as positive and negative. In this paper, we generated freely available datasets of WebMD.com drug reviews and star ratings for Common, Cancer, Depression, Diabetes, and Hypertension drugs. We explored four supervised learning models: Naive Bayes, Random Forests, Support Vector Machines, and Convolutional Neural Networks, for the purpose of determining the polarity of drug reviews. We conducted inter-domain and cross-domain evaluations. We found that SVM obtained the highest f-measure on average and that cross-domain training produced results similar to or higher than models trained directly on their respective datasets.
Introduction
Due to the rapid growth of online communication, individuals increasingly rely on medical social media to discuss health-related issues1. In 2013, The Pew Internet Health Tracking Survey2 reported that of the 81% of adults who use the internet, 72% have searched online for health-related issues, and 18% have consulted online reviews of a specific drug or medical treatment within the last 12 months. This plethora of patient-generated health information, specifically drug reviews, is not only useful to other patients, but also to the larger medical and scientific communities. Past work has found this data to be useful for a variety of tasks, including identifying adverse drug reactions3–5 and illicit drug use6; as well as building user recommendation7 and patient assisted healthcare systems1. A tool for extracting information about patient attitudes or sentiments from such drug reviews would offer valuable information that improves the effectiveness of these approaches.
Previous work on medical sentiment analysis has focused on biomedical literature8–11 and clinical narratives12–14 (generally objective and technical in nature15, but includes information relevant to determining patient outcomes) as well as social media1, 16, 17 (more subjective in nature15, but provides more patient-centric experience information18). Medical social media data presents particular difficulties for sentiment analysis due to the use of jargon, non-standard medical terminology, incorrect spelling, and incorrect grammar. In addition, medical texts commonly use sentiment words which may have the opposite meaning in the medical domain15. For example, a lab test with a negative result may indicate a positive patient outcome. Thus, determining sentiment in medical social media presents challenges that stem from both the medical and general sentiment domains.
In this paper, we explored four machine-learning based approaches to the sentiment analysis of online drug reviews. We present Medinify19, a Python package for collecting patient drug reviews published to WebMD.com, as well as training and evaluating the effectiveness of these machine-learning algorithms. Using Medinify’s review collection capabilities, we collect drug review datasets for five different drug classes and conduct an inter- and cross-domain evaluation over those datasets. Our results show that our approach performed better or on par with pre-existing work in the field.
Our error analysis reveals key areas of concept normalization and negation detection that may increase our basic method’s performance. However, some statements are difficult even for humans to assign a sentiment to, or are purely factual and contain no sentiment at all. A cross-domain analysis reveals that, in general, a model trained on a large dataset of reviews for a multitude of drugs classifies review sentiment equally or more accurately than a model trained only on reviews for the drug type being classified. The specific contributions of this paper are:
- Freely available datasets of drug reviews for sentiment analysis research
- Freely available open source package (Medinify) for scraping and processing drug review data from WebMD
- Freely available open source package (Medinify) for easily training sentiment analysis classifiers
- Inter-domain analysis using 10-fold stratified cross validation of four classifiers on five different datasets
- Cross-domain analysis of drug reviews for generalizability of the classifiers across drug types
Related Work
Sentiment analysis research focuses on analyzing opinions, sentiments, attitudes, and emotions20. Research separates sentiment analysis into two main areas: subjectivity analysis, which focuses on separating facts from opinions, and polarity classification, which focuses on whether the text is either negative or positive17. In this work, we focus on polarity classification. Current approaches to sentiment analysis include rule-based methods that utilize special lexicons21–24, machine learning approaches25, 26, and hybrid methods27, 28. All of these approaches have been applied to different domains and categories of sentiment27.
Most recently in the public health domain, there are three state-of-the-art sentiment analysis systems that most align with our work and have been evaluated on drug review data:
Jiménez-Zafra, et al.16 evaluated the sentiment of 1,620 reviews from two corpora of drug reviews. They used Support Vector Machines (SVMs), exploring four word representations: 1) term frequency-inverse document frequency (TF-IDF), 2) term frequency (TF), 3) unigrams, and 4) word embeddings. Their results showed that word embeddings obtained a slightly higher accuracy than TF-IDF.
Yadav, et al1 evaluated sentiment analysis over approximately 2,000 medication or 4,400 medical condition posts from patient.info for Anxiety, Depression, Asthma, and Allergies. They compared machine and deep learning approaches using unigrams, bigrams, trigrams, a medical abbreviation dictionary, and SentiWordnet as features into a multi-layer perceptron, Random Forest (RF), and SVM as well as a Convolutional Neural Network (CNN) using word embeddings. They found the CNN out-performed the other machine learning methods and that the system was able to classify the medication posts with a significantly higher precision and recall than the medical condition posts.
Carrillo-de-Albornoz17 evaluated sentiment analysis over 3,500 online health posts on breast cancer, Crohn’s disease, and allergies from the eDiseases website. They explored unigrams, word embeddings, and concept embeddings with additional syntactic and domain specific features into SVM, RF, and Naive Bayes (NB) classifiers. They found that unigrams obtained a higher accuracy than using the embeddings as features into the NB and RF but that the embeddings obtained a higher accuracy when used as a feature into an SVM.
Data
The datasets consist of WebMD reviews obtained through the Medinify WebMD Scraper19 which takes as input a list of drugs and returns a comma-separated value (CSV) file containing: 1) the text of the user’s comment, 2) the star ratings the user gave the drug, 3) the date the review was posted, and 4) the name of the reviewed drug. WebMD allows users to post free text reviews of a drug and provide a star rating for three categories: Effectiveness, Ease of Use, and Satisfaction.
We use both the Effectiveness and Satisfaction ratings as determiners for each review’s polarity class. While neither rating directly corresponds to the reviewer’s general sentiment toward the drug being reviewed, both ratings reflect some aspect of that sentiment. Across all datasets, Effectiveness ratings tend to be more positive (of all non-neutral Effectiveness ratings, 68.82% were positive), while Satisfaction ratings are more evenly distributed (53.27% of non-neutral Satisfaction ratings were positive). We do not use the Ease of Use rating, as these ratings are overwhelmingly positive. We believe this is due to the majority of drugs we use being in pill form, which patients may consider easy to take, regardless of the patient’s attitude toward the drug.
The five drug review datasets are separated by drug class: Common, Cancer, Depression, Diabetes, and Hypertension. Table 1 shows the number of reviews, number of reviews with each star rating, the ratio of positive to negative reviews, and the final number of reviews from the Effectiveness and Satisfaction star ratings for the datasets. The Common dataset is designated by WebMD and contains 93 of the most highly reviewed drugs. This dataset is the largest and contains a diverse set of drugs; however, it is worth noting that at least 16,471 (∼20%) of the Common reviews were for drugs that are used to treat depression. The Cancer dataset is made up of all reviews for the drugs listed at cancer.gov* at the time of writing. The Depression, Diabetes, and Hypertension datasets are made up of all reviews for the drugs listed at centerwatch.com† under “Depression,” “Diabetes Mellitus Types I and II” and “Diabetes Mellitus, Type 2”, and “High Blood Pressure (Hypertension)” respectively. WebMD maintains separate reviews for the generic and alternate forms of each drug, both of which are included in each dataset except Common.
Table 1.
Effectiveness and Satisfaction ratings class distributions. Positives include [4,5], negatives include [1,2].
Effectiveness rating class distributions

Dataset | Total Reviews | 1 Star | 2 Star | 3 Star | 4 Star | 5 Star | Pos:Neg Ratio | Training Reviews |
---|---|---|---|---|---|---|---|---|
Common | 84,390 | 14,230 | 7,299 | 14,168 | 19,554 | 29,139 | 2.268 | 70,222 |
Cancer | 9,946 | 1,586 | 718 | 1,800 | 2,264 | 3,578 | 2.320 | 8,146 |
Depression | 16,151 | 2,755 | 1,310 | 2,565 | 3,887 | 5,634 | 2.457 | 13,586 |
Diabetes | 5,094 | 920 | 578 | 983 | 1,027 | 1,586 | 1.674 | 4,111 |
Hypertension | 10,688 | 1,752 | 1,120 | 2,202 | 2,576 | 3,038 | 1.690 | 8,486 |
Satisfaction rating class distributions

Dataset | Total Reviews | 1 Star | 2 Star | 3 Star | 4 Star | 5 Star | Pos:Neg Ratio | Training Reviews |
---|---|---|---|---|---|---|---|---|
Common | 84,390 | 24,492 | 8,216 | 12,194 | 14,477 | 25,011 | 1.207 | 72,196 |
Cancer | 9,946 | 2,860 | 1,092 | 1,636 | 1,722 | 2,636 | 1.103 | 8,310 |
Depression | 16,151 | 4,478 | 1,453 | 2,340 | 3,117 | 4,763 | 1.329 | 13,811 |
Diabetes | 5,094 | 1,738 | 614 | 736 | 725 | 1,281 | 0.853 | 4,358 |
Hypertension | 10,688 | 3,966 | 1,421 | 1,664 | 1,481 | 2,156 | 0.675 | 9,024 |
Method
In this work, we developed a system, Medinify19, to automatically compile drug review datasets and evaluate machine-learning methods for classifying drug reviews as either positive or negative. We explored using unigrams and word embeddings29 as features, reporting the best-performing feature set for each classifier. The word embeddings were generated over the WebMD drug review datasets with a dimensionality of 100. We studied four supervised machine learning methods: Random Forest, Naive Bayes, Support Vector Machines, and Convolutional Neural Networks. The details of these four classifiers and their best-performing feature sets are described below. Each classifier’s hyperparameters were tuned with a grid search over a random sample consisting of half of the Common dataset.
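As a rough illustration, the snippet below sketches how such 100-dimensional embeddings could be trained over the scraped review text, assuming gensim's word2vec implementation; the file and column names are placeholders, not Medinify's actual interface.

```python
# Sketch: training 100-dimensional word embeddings over the scraped review corpus.
# "reviews.csv" and the "comment" column are hypothetical names for the scraped data.
import pandas as pd
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

reviews = pd.read_csv("reviews.csv")
sentences = [simple_preprocess(str(text)) for text in reviews["comment"]]

# vector_size=100 matches the dimensionality described above; other settings are assumptions.
embeddings = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)
embeddings.save("webmd_embeddings.model")
```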
Naive Bayes - Medinify uses the scikit-learn implementation of the Multinomial Naive Bayes algorithm30. When training the Naive Bayes classifier, review text is represented as unigram frequency vectors. The reported results are generated using the hyperparameter settings: α = 1.0; fit_prior = True.
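A minimal sketch of this configuration with scikit-learn, using a small toy corpus in place of the real review data:

```python
# Sketch: Multinomial Naive Bayes over unigram counts with alpha=1.0 and fit_prior=True.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the labeled review data (1 = positive, 0 = negative).
texts = ["this drug was a godsend for my depression", "made me dizzy and sick, total poison"]
labels = [1, 0]

nb_model = make_pipeline(
    CountVectorizer(),                          # unigram frequency vectors
    MultinomialNB(alpha=1.0, fit_prior=True),
)
nb_model.fit(texts, labels)
```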
Support Vector Machine - Medinify uses the scikit-learn implementation of an SVM-based classifier30 with the RBF kernel. Review data are represented as averaged word embeddings. The reported results are generated using the hyperparameter settings: C = 10; γ = 0.01.
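A corresponding sketch of how averaged word vectors can be fed to an RBF-kernel SVM with these hyperparameters; it reuses the `embeddings`, `texts`, and `labels` defined in the previous sketches.

```python
# Sketch: RBF-kernel SVM (C=10, gamma=0.01) over averaged word embeddings.
import numpy as np
from gensim.utils import simple_preprocess
from sklearn.svm import SVC

def average_embedding(tokens, w2v, dim=100):
    # Mean of the in-vocabulary word vectors; a zero vector if none are found.
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

X = np.array([average_embedding(simple_preprocess(t), embeddings) for t in texts])
svm_model = SVC(kernel="rbf", C=10, gamma=0.01)
svm_model.fit(X, labels)
```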
Random Forest - Medinify uses the scikit-learn implementation of a Random Forest classifier30 with the Gini criterion. Review text is represented as unigram frequency vectors. The reported results are generated using the hyperparameter setting: trees = 100.
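The equivalent sketch for the Random Forest configuration, reusing the unigram-count representation and toy data from the Naive Bayes snippet:

```python
# Sketch: Random Forest with 100 trees and the Gini criterion over unigram counts.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

rf_model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, criterion="gini"),
)
rf_model.fit(texts, labels)   # texts/labels as in the Naive Bayes sketch
```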
Convolutional Neural Network - Medinify implements a convolutional neural network in PyTorch31. The input is the review data represented as matrices of word embeddings. There are three hidden layers that each convolve 100 filters with a default stride of 1. These layers use filters of sizes 2x100, 3x100, and 4x100 to represent bigrams, trigrams, and 4-grams. The output of each layer is calculated, max pooled, and concatenated. This result is then sent through a dropout layer with a probability of 0.5, a fully connected linear layer with an output size of 50, and finally another fully connected linear layer that outputs a logit, which represents the predicted class when input into a sigmoid function.
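The following PyTorch sketch mirrors this description; the ReLU activations and the padded input length are assumptions not specified above.

```python
# Sketch: CNN over matrices of 100-dimensional word embeddings, with 100 filters each
# of sizes 2x100, 3x100, and 4x100, max pooling, dropout of 0.5, a 50-unit linear layer,
# and a final linear layer producing a single logit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReviewCNN(nn.Module):
    def __init__(self, embed_dim=100, n_filters=100, filter_sizes=(2, 3, 4), dropout=0.5):
        super().__init__()
        # One convolution per n-gram size; each filter spans the full embedding dimension.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_filters, (fs, embed_dim)) for fs in filter_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(n_filters * len(filter_sizes), 50)
        self.fc2 = nn.Linear(50, 1)            # single logit; sigmoid applied at prediction time

    def forward(self, x):
        # x: (batch, sequence_length, embed_dim)
        x = x.unsqueeze(1)                     # add a channel dimension for Conv2d
        conved = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        pooled = [F.max_pool1d(c, c.size(2)).squeeze(2) for c in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))
        return self.fc2(F.relu(self.fc1(cat)))

# Example forward pass on a batch of two padded reviews of length 50.
model = ReviewCNN()
logits = model(torch.randn(2, 50, 100))
probs = torch.sigmoid(logits)                  # probability of the positive class
```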
In our experiments, we focus on a binary classification of the reviews. We labeled 1-2 star reviews as negative and 4-5 star reviews as positive. This assumes that a user’s star rating corresponds to the sentiment in their comment, which is not always the case as the user may report in their text that the drug works but has terrible side-effects or that the drug was not sufficiently effective but was easier to take than another drug due to reduced side-effects. This limitation could be ameliorated by manually annotating the review data in future work.
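A sketch of this labeling rule with pandas; the file and column names are illustrative rather than Medinify's exact CSV schema.

```python
# Sketch: mapping star ratings to binary polarity labels.
# 1-2 stars -> negative (0), 4-5 stars -> positive (1); 3-star reviews are dropped.
import pandas as pd

reviews = pd.read_csv("reviews.csv")                            # hypothetical file name
reviews = reviews[reviews["effectiveness"] != 3]                # discard neutral reviews
reviews["label"] = (reviews["effectiveness"] >= 4).astype(int)  # 1 = positive, 0 = negative
texts, labels = reviews["comment"].tolist(), reviews["label"].tolist()
```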
Results and Discussion
To evaluate the efficacy of the machine learning systems, we conducted three evaluations:
Inter-domain Evaluation: we conducted an inter-domain evaluation to determine the effectiveness of four machine-learning algorithms for classifying the sentiment of each dataset: Naive Bayes, Random Forest, Support Vector Machines (SVM), and Convolutional Neural Networks (CNNs).
Rating Evaluation: we evaluated the effects of training models on two of the rating metrics offered by WebMD: Effectiveness and Satisfaction. These two metrics represent different aspects of reviewers’ sentiments and have different distributions of positive and negative reviews.
Cross-domain Evaluation: we conducted a cross-domain evaluation to analyze the generalizability of the models across reviews for all collected drug classes. For this, we evaluate using classifiers trained over the Common dataset to predict the sentiment on the Cancer, Hypertension, Diabetes, and Depression datasets. This cross-domain analysis leads to some of the best performance.
Inter-domain Evaluation. To perform an inter-domain evaluation of the four machine-learning models, we performed 10-fold stratified cross validation over the five drug datasets. Tables 2 to 5 show the precision, recall, and f-measure results and majority label baselines (Baseline). These baselines are the precision and f-measure obtained by naively guessing the majority class for all instances in a dataset (the recall baseline is not reported, as it is always 100% when naively guessing the majority class).
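A sketch of this evaluation protocol with scikit-learn, reusing the Naive Bayes pipeline and labeled data from the earlier sketches; the scoring names are scikit-learn's built-in binary metrics.

```python
# Sketch: 10-fold stratified cross validation reporting precision, recall, and f-measure.
from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(nb_model, texts, labels, cv=cv,
                        scoring=("precision", "recall", "f1"))
for metric in ("precision", "recall", "f1"):
    print(metric, scores[f"test_{metric}"].mean(), "+/-", scores[f"test_{metric}"].std())
```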
The results show that the Depression dataset was the easiest to classify for all of the models and also had the lowest standard deviations. This is likely due to differences in the lexical and semantic attributes present when expressing sentiment for this drug class (discussed below), as well as this dataset’s large size.
The SVM model obtained the highest f-measure across all datasets, while the CNN model obtained the lowest average f-measures. The Naive Bayes model achieved precision scores comparable to the SVM model but lower recall scores, indicating that the Naive Bayes model, when validated over review data from the same drug class it was trained on, is more conservative in classifying reviews as positive. The CNN model had a similar issue, with relatively high precision scores and low recall scores.
Rating Evaluation. Here, we evaluated the models trained on both the Effectiveness and Satisfaction ratings. Table 6 shows the precision, recall, and f-measure 10-fold cross validation scores for an SVM model trained over Satisfaction ratings; Table 3 shows the equivalent scores for an SVM model trained over Effectiveness ratings. The results for the other three machine-learning models were similar.
Table 6.
SVM 10-Fold Cross Validation Results for Satisfaction Ratings
Dataset | Baseline Precision | Baseline F-Measure | Prediction Precision | Prediction Recall | Prediction F-Measure |
---|---|---|---|---|---|
Common | 54.70% | 70.71% | 79.67% (+/- 2.67%) | 82.20% (+/- 2.94%) | 80.86% (+/- 1.81%) |
Cancer | 52.44% | 68.80% | 77.01% (+/- 2.68%) | 77.24% (+/- 3.63%) | 77.06% (+/- 2.20%) |
Depression | 57.06% | 72.66% | 82.15% (+/- 2.63%) | 84.80% (+/- 2.70%) | 83.41% (+/- 1.68%) |
Diabetes | 53.97% | 70.10% | 78.90% (+/- 5.94%) | 78.76% (+/- 6.02%) | 78.43% (+/- 2.14%) |
Hypertension | 59.70% | 74.76% | 78.21% (+/- 4.67%) | 71.91% (+/- 3.56%) | 74.82% (+/- 2.96%) |
Average | 55.57% | 71.41% | 79.19% | 78.98% | 78.91% |
Table 3.
SVM 10-Fold Cross Validation Results for Effectiveness Ratings
Dataset | Baseline Precision | Baseline F-Measure | Prediction Precision | Prediction Recall | Prediction F-Measure |
---|---|---|---|---|---|
Common | 69.34% | 81.90% | 80.78% (+/- 0.96%) | 89.77% (+/- 2.29%) | 85.02% (+/- 1.12%) |
Cancer | 71.72% | 83.53% | 80.21% (+/- 2.08%) | 92.04% (+/- 2.39%) | 85.68% (+/- 1.13%) |
Depression | 70.08% | 82.41% | 83.37% (+/- 1.06%) | 91.20% (+/- 1.50%) | 87.10% (+/- 0.77%) |
Diabetes | 63.56% | 77.72% | 78.14% (+/- 3.55%) | 82.67% (+/- 10.14%) | 79.82% (+/- 5.11%) |
Hypertension | 66.16% | 79.63% | 76.66% (+/- 1.04%) | 86.16% (+/- 1.89%) | 81.12% (+/- 0.82%) |
Average | 68.17% | 81.04% | 79.83% | 88.37% | 83.75% |
Much of the difference between the results for models trained over Effectiveness and Satisfaction ratings comes down to each metric’s class imbalance, specifically the positive bias in each dataset’s Effectiveness ratings. In models trained over the relatively balanced Satisfaction ratings, precision and recall scores tend to be comparable, while all models trained over Effectiveness ratings have higher recall than precision. For all four models, the average f-measure results when training over Satisfaction were lower than when training over Effectiveness.
Cross-Domain Evaluation. Here, we trained a new model for each of the machine learning algorithms on the entire Common dataset (with duplicate comments removed) and evaluated its performance in classifying the reviews of the other four datasets: Cancer, Depression, Diabetes, and Hypertension. Tables 7 and 8 show the precision, recall, and f-measure over each of the drug classes. Overall, the results indicate that models trained over a large drug review dataset with a diverse range of drugs being reviewed can generalize to smaller datasets composed of reviews for more specific drug classes. However, while this is the case for the Naive Bayes, SVM, and Random Forest models, these results did not hold for the CNN model.
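A sketch of the cross-domain protocol; the variable names for the Common and Cancer data are placeholders, and the Naive Bayes pipeline from the earlier sketch stands in for any of the models.

```python
# Sketch: train on the Common reviews, evaluate on another drug class (here, Cancer).
from sklearn.metrics import precision_recall_fscore_support

nb_model.fit(common_texts, common_labels)          # Common dataset, duplicates removed
predictions = nb_model.predict(cancer_texts)
precision, recall, f1, _ = precision_recall_fscore_support(
    cancer_labels, predictions, average="binary"
)
print(precision, recall, f1)
```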
The results show that the SVM models achieved the highest f-measure for the Cancer and Hypertension datasets, while the Random Forest model achieved the highest f-measures for the Depression and Diabetes datasets. The Random Forest model always had higher precision scores and lower recall scores than the Naive Bayes and SVM models, indicating that this model was more conservative than the other two when classifying reviews as positive.
The CNN model’s low performance indicates that this model was unable to adequately generalize to smaller and less diverse datasets. This, and the CNN’s performance during 10-Fold Cross Validation, indicates that this model is better at fitting to a specific dataset’s lexical and semantic attributes than the other three models, but this ability hinders the model when attempting to generalize. In the case that a certain drug or class of drugs had a sufficient number of reviews to train a model for that specific type of data, the CNN may perform better than the other three models. However, for drugs and drug classes without an adequately large number of reviews to train a model, this CNN model would perform worse at classification than the other three.
The cross-domain f-measure results for all models except the CNN were comparable with the f-measure results obtained through 10-fold cross validation, with the lowest negative variation being -2.63% for SVM models classifying the Diabetes dataset, and the highest positive variation being +4.55% for SVM models classifying the Hypertension dataset. All models trained over the Common dataset performed worse when classifying the Diabetes dataset than models trained over the Diabetes dataset during 10-fold cross validation. The Hypertension dataset saw the highest improvements, although the Random Forest model still performed 0.42% worse. The similarity between the Naive Bayes, SVM, and Random Forest models’ performance during cross validation and cross-domain evaluation is significant because it indicates that training on a large and diverse drug review dataset could be an effective way to classify a drug review dataset with too few reviews to adequately train a model.
Most Informative Words
Table 9 shows the top ten most informative words for each of the datasets as identified using the Naive Bayes classifier; a sketch of how such rankings can be computed follows the table. These are the word features with the highest probability of being classified positive or negative, based on their distribution across the two classes (positively oriented words are those whose pos value exceeds their neg value). Analysis of the most informative words for Depression shows that one possible reason this dataset produced some of the best classification results is that its lexicon contained more distinctive words. These words, such as psychosis, overdose, and godsend, have obvious positive or negative sentiments. The probabilities for the Depression dataset’s most discriminative words were also relatively high compared with those of the four other datasets, aside from the extremely high probability of the word saver being classified positive in the Common dataset.
Table 9.
Most informative features from the Naive Bayes classifier
Common term | pos | neg | Cancer term | pos | neg | Depression term | pos | neg | Diabetes term | pos | neg | Hypertension term | pos | neg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
saver | 37.9 | 1.0 | amgen | 1.0 | 27.9 | overdose | 1.0 | 21.1 | please | 23.1 | 1.0 | controls | 14.5 | 1.0 |
lawyer | 1.0 | 20.4 | admitted | 1.0 | 24.5 | saver | 20.8 | 1.0 | kill | 1.0 | 14.5 | mistake | 1.0 | 12.9 |
marketing | 1.0 | 18.8 | ruined | 1.0 | 17.7 | downside | 18.1 | 1.0 | broke | 1.0 | 14.5 | wonderful | 12.5 | 1.0 |
lifesaver | 18.6 | 1.0 | saved | 17.5 | 1.0 | saved | 16.6 | 1.0 | food | 12.4 | 1.0 | pleased | 12.4 | 1.0 |
enabled | 18.4 | 1.0 | ruin | 1.0 | 16.1 | psychosis | 1.0 | 14.8 | spasms | 1.0 | 11.0 | kill | 1.0 | 12.0 |
downside | 16.7 | 1.0 | vaccination | 1.0 | 14.4 | garbage | 1.0 | 13.6 | migraine | 1.0 | 11.0 | satisfied | 10.5 | 1.0 |
godsend | 15.6 | 1.0 | fda | 1.0 | 13.5 | poison | 1.0 | 13.4 | failure | 1.0 | 10.1 | sign | 1.0 | 9.8 |
downfall | 15.2 | 1.0 | label | 1.0 | 12.7 | daze | 1.0 | 13.3 | fever | 1.0 | 9.9 | internal | 1.0 | 9.8 |
stevens | 1.0 | 14.3 | wonders | 12.2 | 1.0 | lifesaver | 12.4 | 1.0 | swelled | 1.0 | 8.7 | fever | 1.0 | 8.7 |
steven | 1.0 | 14.3 | wonderful | 11.9 | 1.0 | godsend | 12.1 | 1.0 | list | 1.0 | 8.7 | americans | 1.0 | 8.5 |
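The per-class values reported in Table 9 can be approximated from a fitted scikit-learn pipeline as sketched below; this mirrors, rather than exactly reproduces, Medinify's informative-feature listing.

```python
# Sketch: ranking unigram features by how strongly the fitted Naive Bayes separates
# the two classes, using the pipeline from the earlier sketch.
import numpy as np

vectorizer = nb_model.named_steps["countvectorizer"]
nb = nb_model.named_steps["multinomialnb"]
vocab = np.array(vectorizer.get_feature_names_out())

# Difference of class-conditional log probabilities: large positive values favor
# the positive class, large negative values favor the negative class.
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
most_positive = vocab[np.argsort(log_ratio)[-10:][::-1]]
most_negative = vocab[np.argsort(log_ratio)[:10]]
print("positive:", most_positive)
print("negative:", most_negative)
```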
The Hypertension dataset was more difficult to classify because it had fewer highly informative features (pos:neg magnitudes shown in Table 9). The occurrence of traditionally neutral words like “americans” as negative also implies that there were not as many high-impact features as there were in the other datasets. An analysis of the comments that contained “americans” showed that of the 28 comments, 26 are users referring to themselves as “Black” or “African” Americans. Of these comments, 17 are for the oral version of the drug “Lisinopril.” Only 5 of these “Lisinopril” comments have a positive effectiveness rating, and of those, 4 include references to and concerns over adverse side effects and allergic reactions. Further research is required to evaluate the inclusion of the drug name as a feature in reviews and the potential for drug review sentiment analysis as a tool for adverse drug effect monitoring.
In addition to revealing unexpected features, the better performance of the Random Forest and SVM models on the Hypertension dataset during cross-domain analysis indicates that the lack of highly informative features may result in better generalizability. Further investigation into the correlation between the types of words used and the difficulty of classification is required and could reveal easy ways to determine the best dataset to use for training, depending on what type of reviews are classified.
Listing important features also revealed some interesting words like “steven” in the Common dataset. An analysis of the Common reviews that contained “steven” showed that in almost all cases it is a reference to Stevens-Johnson Syndrome, an acute life-threatening condition that can occur from drugs that appear in the Common dataset, such as “allopurinol.”32
Error Analysis
Analysis of review classifications across the datasets revealed certain patterns in how the four models mis-classified sentiment. We identified five areas that resulted in improper classification.
Negation. We did not implement negated term identification, therefore comments with negations were a consistent problem. For example, all four models consistently classified one review in the Depression dataset stating “...i dont thank [sic] it works because notthing [sic] has changed 4 me.” as positive, despite a clear negative sentiment. Implementation of some sort of rule-based method to recognize and account for negation should be a critical part of any sentiment analysis method, as a single token can completely reverse the entire statement’s orientation. This can be challenging, since a statement such as “been on it for about 6months [sic] now and it has helped me out alot [sic]. i am not like i use to be in the past with anger and depression issues” is clearly positive but contains a negation.
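As one possible direction, the following minimal rule-based sketch marks the few tokens following a negation cue, so that negated and non-negated mentions of the same word become distinct unigram features; this is not part of the current pipeline.

```python
# Sketch: mark a fixed window of tokens after a negation cue with a NEG_ prefix.
NEGATION_CUES = {"not", "no", "never", "dont", "didnt", "cant", "wont", "nothing"}

def mark_negation(tokens, scope=3):
    marked, remaining = [], 0
    for token in tokens:
        if token in NEGATION_CUES:
            remaining = scope                  # negate the next few tokens
            marked.append(token)
        elif remaining > 0:
            marked.append("NEG_" + token)
            remaining -= 1
        else:
            marked.append(token)
    return marked

print(mark_negation("it does not work for me".split()))
# ['it', 'does', 'not', 'NEG_work', 'NEG_for', 'NEG_me']
```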
Contextual Information. For all four models, but especially for the models that only took lexical data as input, complex contextual information also resulted in mis-classification. For example, the models rated the following comment, which clearly expresses a positive sentiment, as negative due to its extensive descriptions of side effects.
I have taken for over 15 years. It was a mircle drup [sic] for me. I thought I was going to be wheel chair bound but not any more... I was unable to climb stairs, raise hands over head, hurt all over and had nodules on elbows (that have gone away)...
Additionally, shorter reviews like “AWFUL” and “have not had any change in feelings” were classified as positive when they should be negative. Interpreting complex contextual information correctly requires more sophisticated parsing; however, with free-form reviews, it can be difficult to obtain enough information for a program to make a judgment call. For example, the short review “still have depression” may be too short and not contain enough data to accurately assess its polarity, while a human can easily do so.
Neutral Questions/Mixed Information. Some comments are difficult even for a human to classify because they contain either mixed or neutral information such as the question, “When should this pill be taken?” or, “can i get treatment done on claxa oral please thank you”. These types of reviews may contain negatively or positively associated words which are not meant to convey sentiment in this context, but rather ask a question. In our pre-processing of the dataset we removed all punctuation; however, for instances like this, utilizing punctuation may be beneficial to performance. Other errors may be due to the fact that users gave a high star rating for effectiveness, but discussed negative side effects in their comments. This error in particular is difficult to detect without manual review or more sophisticated text analysis due to the nature of our dataset.
Incorrect Spelling. Some comments with clearly negative or clearly positive sentiment, such as, “i dont [sic] thank it works because notthing [sic] has changed 4 me”, and, “i has relefe [sic] some of the sorness [sic] from my sholders [sic]”, were misclassified because the words expressing sentiment were misspelled. During pre-processing, we did not fix grammatical errors like these. Such corrections in the future may alleviate this error.
Acronyms. Comments often include acronyms. Some do not pertain to the overall sentiment classification such as the “4” in “nothing has changed 4 me.” However, some contribute to the overall sentiment of the review. For example, the Depression dataset contains 46 comments that include the acronym “lol” at least once, with 6 of those being negative, 8 neutral, and 32 positive. Interestingly with “lol”, we often saw it when individuals were using self-deprecating humor. For example:
Asked to switch from Venlafaxine ER 75mg (Effexor XR)because of bad side effects. Since switching I have been dizzy almost constantly. I am hoping that it is due to the switch in meds. Mood seems to be the same, which is good. But I don’t like being a “dizzy broad”. lol
Comparison with Previous Work
Table 10 shows an indirect comparison with three state-of-the-art sentiment analysis systems evaluated on drug review data. The table includes the language, dataset, data size, machine learning algorithm (ML), features, and top accuracy obtained by each of the systems, as well as our system with its dataset size and accuracy ranges. The accuracy range contains all accuracies that model obtained across both the 10-fold cross validation and the cross-domain evaluations. The results show that our system, using only unigrams as features, performs on par with current systems. We believe that the reason for this is twofold. First, the unigram results also evaluated by these systems – specifically Jiménez-Zafra, et al.16 and Yadav, et al.1 – obtained accuracy results that were close to those of the word embeddings. Second, the amount of data used in our experiments is 2 to 51 times that of these systems.
Table 10.
Indirect comparison with state-of-the-art systems over drug reviews
System | Language | Data | Size | ML | Features | Best results |
---|---|---|---|---|---|---|
Jiménez-Zafra, et al.16 | Spanish | Drug Opinions | 1,620 | SVM | word embeddings | 0.65 |
Yadav, et al.1 | English | patient.info | 2,000 | CNN | word embeddings | 0.82 |
Carrillo, et al.17 | English | eDisease | 3,500 | SVM | lexical; syntactic; sentiment; domain | 0.68 |
Medinify (our system) | English | WebMD | 6k-84k | NB | unigrams | 0.77 - 0.86 |
Medinify (our system) | English | WebMD | 6k-84k | SVM | embeddings | 0.80 - 0.87 |
Medinify (our system) | English | WebMD | 6k-84k | CNN | embeddings | 0.78 - 0.85 |
Medinify (our system) | English | WebMD | 6k-84k | RF | unigrams | 0.79 - 0.86 |
Conclusions and Future Work
In this work, we generated datasets of WebMD.com drug reviews and star ratings for Common, Cancer, Depression, Diabetes, and Hypertension drugs. We explored four supervised learning models: Naive Bayes, Random Forest, Support Vector Machine, and a Convolutional Neural Network, for the purpose of determining the sentiment polarity of drug reviews. We found that the SVM models performed with the highest f-measures on average. Additionally, we conducted a cross-domain evaluation by training models on a dataset consisting of all the reviews for WebMD’s “common drug” list, and found that these models produced similar results to models trained directly on the test dataset. Using a Common-trained model may be a viable technique when classifying a small set of reviews or comments that are difficult to classify because of their features. Analysis of the most informative words for the Naive Bayes models trained on each dataset reveals that the higher f-measures achieved when evaluating the Depression dataset may be due to reviewers using words highly associated with sentiment. In contrast, the Hypertension dataset’s most informative features had lower pos:neg ratios and were harder to identify as positive or negative. Our error analysis revealed that negation, complex contextual information, and grammatical issues each caused the models to misclassify reviews. Future work could mitigate these issues by integrating spelling and grammar correction and identifying negation.
Table 2.
Naive Bayes 10-Fold Cross Validation Results for Effectiveness Ratings
Dataset | Baseline Precision | Baseline F-Measure | Prediction Precision | Prediction Recall | Prediction F-Measure |
---|---|---|---|---|---|
Common | 69.34% | 81.90% | 80.15% (+/- 1.49%) | 84.87% (+/- 4.23%) | 82.36% (+/- 1.73%) |
Cancer | 71.72% | 83.53% | 80.13% (+/- 3.08%) | 89.90% (+/- 3.07%) | 84.63% (+/- 0.91%) |
Depression | 70.08% | 82.41% | 81.80% (+/- 2.08%) | 91.37% (+/- 2.93%) | 86.26% (+/- 0.81%) |
Diabetes | 63.56% | 77.72% | 77.16% (+/- 6.50%) | 80.51% (+/- 13.89%) | 77.70% (+/- 7.60%) |
Hypertension | 66.16% | 79.63% | 76.11% (+/- 1.45%) | 84.41% (+/- 2.07%) | 80.02% (+/- 0.98%) |
Average | 68.17% | 81.04% | 79.07% | 86.21% | 82.19% |
Table 4.
Random Forest 10-Fold Cross Validation Results for Effectiveness Ratings
Dataset | Baseline Precision | Baseline F-Measure | Prediction Precision | Prediction Recall | Prediction F-Measure |
---|---|---|---|---|---|
Common | 69.34% | 81.90% | 76.76% (+/- 0.94%) | 93.31% (+/- 1.57%) | 84.22% (+/- 0.87%) |
Cancer | 71.72% | 83.53% | 76.33% (+/- 2.06%) | 94.39% (+/- 1.16%) | 84.38% (+/- 1.00%) |
Depression | 70.08% | 82.41% | 78.24% (+/- 0.81%) | 95.48% (+/- 1.16%) | 86.00% (+/- 0.45%) |
Diabetes | 63.56% | 77.72% | 73.71% (+/- 3.52%) | 85.37% (+/- 10.72%) | 78.64% (+/- 5.26%) |
Hypertension | 66.16% | 79.63% | 73.59% (+/- 0.66%) | 87.96% (+/- 1.60%) | 80.12% (+/- 0.75%) |
Average | 68.17% | 81.04% | 75.73% | 91.30% | 82.67% |
Table 5.
CNN 10-Fold Cross Validation Results for Effectiveness Ratings
Dataset | Baseline Precision | Baseline F-Measure | Prediction Precision | Prediction Recall | Prediction F-Measure |
---|---|---|---|---|---|
Common | 69.34% | 81.90% | 83.80% (+/- 0.54%) | 85.73% (+/- 2.14%) | 84.74% (+/- 0.93%) |
Cancer | 71.72% | 83.53% | 80.71% (+/- 1.36%) | 84.29% (+/- 2.27%) | 82.43% (+/- 0.87%) |
Depression | 70.08% | 82.41% | 84.18% (+/- 0.79%) | 86.25% (+/- 1.50%) | 85.20% (+/- 0.83%) |
Diabetes | 63.56% | 77.72% | 76.79% (+/- 2.44%) | 81.40% (+/- 3.14%) | 78.96% (+/- 1.41%) |
Hypertension | 66.16% | 79.63% | 77.65% (+/- 0.91%) | 79.11% (+/- 2.89%) | 78.33% (+/- 1.06%) |
Average | 68.17% | 81.04% | 80.63% | 83.35% | 81.93% |
Table 7.
Naive Bayes and SVM Cross-Domain Results
Dataset | NB Precision | NB Recall | NB F-Measure | Dataset | SVM Precision | SVM Recall | SVM F-Measure |
---|---|---|---|---|---|---|---|
Cancer | 80.04% | 83.20% | 81.59% | Cancer | 88.37% | 80.69% | 84.35% |
Depression | 94.09% | 80.08% | 86.52% | Depression | 91.89% | 82.52% | 86.95% |
Diabetes | 84.50% | 76.67% | 80.39% | Diabetes | 91.10% | 74.48% | 81.72% |
Hypertension | 73.57% | 79.82% | 76.57% | Hypertension | 79.19% | 78.94% | 79.19% |
Average | 83.05% | 79.94% | 81.27% | Average | 87.82% | 78.94% | 83.06% |
Table 8.
Random Forest and CNN Cross-Domain Results
Dataset | RF Precision | RF Recall | RF F-Measure | Dataset | CNN Precision | CNN Recall | CNN F-Measure |
---|---|---|---|---|---|---|---|
Cancer | 92.57% | 77.59% | 84.42% | Cancer | 71.47% | 72.90% | 72.18% |
Depression | 95.98% | 77.12% | 85.52% | Depression | 69.73% | 75.12% | 72.32% |
Diabetes | 93.65% | 70.17% | 80.23% | Diabetes | 63.51% | 72.14% | 67.55% |
Hypertension | 88.83% | 73.66% | 80.54% | Hypertension | 65.75% | 68.90% | 67.29% |
Average | 92.76% | 74.64% | 82.68% | Average | 67.61% | 72.26% | 69.84% |
Footnotes
* https://www.cancer.gov/about-cancer/treatment/drugs
† https://www.centerwatch.com/drug-information/fda-approved-drugs/medical-conditions
References
- [1]. Yadav S, Ekbal A, Saha S, Bhattacharyya P. Medical sentiment analysis using social media: towards building a patient assisted system. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). 2018.
- [2]. Fox S, Duggan M. Health Online 2013. Pew Internet & American Life Project, Washington, DC. 2013.
- [3]. Tutubalina E, Nikolenko S. Exploring convolutional neural networks and topic models for user profiling from drug reviews. Multimed Tools Appl. 2018 Feb;77(4):4791–4809.
- [4]. Alimova I, Tutubalina E. Automated detection of adverse drug reactions from social media posts with machine learning. In: van der Aalst WMP, Khachay M, Ignatov D, Kuznetsov SO, editors. Analysis of Images, Social Networks and Texts. Lecture Notes in Computer Science. Springer, Cham. 2017:3–15.
- [5]. Sarker A, Ginn R, Nikfarjam A, O'Connor K, Smith K, Jayaraman S. Utilizing social media data for pharmacovigilance: a review. J Biomed Inform. 2015 Apr;54:202–212. doi: 10.1016/j.jbi.2015.02.004.
- [6]. Kazemi DM, Borsari B, Levine MJ, Dooley B. Systematic review of surveillance by social media platforms for illicit drug use. J Public Health (Oxf). 2017 Dec;39(4):763–776. doi: 10.1093/pubmed/fdx020.
- [7]. Jiang L, Yang CC. User recommendation in healthcare social media by assessing user similarity in heterogeneous network. Artificial Intelligence in Medicine. 2017;81:63–77. doi: 10.1016/j.artmed.2017.03.002.
- [8]. Mullen T, Collier N. Sentiment analysis using support vector machines with diverse information sources. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004.
- [9]. Xu J, Zhang Y, Wu Y, Wang J, Dong X, Xu H. Citation sentiment analysis in clinical trial papers. In: AMIA Annual Symposium Proceedings. vol. 2015. American Medical Informatics Association. 2015:1334.
- [10]. Yu B. Automated citation sentiment analysis: what can we learn from biomedical researchers. In: Proceedings of the 76th ASIS&T Annual Meeting: Beyond the Cloud: Rethinking Information Boundaries. American Society for Information Science. 2013:83.
- [11]. Wilbur WJ, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics. 2006;7(1):356. doi: 10.1186/1471-2105-7-356.
- [12]. Sabra S, Malik KM, Alobaidi M. Prediction of venous thromboembolism using semantic and sentiment analyses of clinical narratives. Computers in Biology and Medicine. 2018;94:1–10. doi: 10.1016/j.compbiomed.2017.12.026.
- [13]. Deng Y, Stoehr M, Denecke K. Retrieving attitudes: sentiment analysis from clinical narratives. MedIR@SIGIR. 2014:12–15.
- [14]. Velupillai S, Mowery D, South BR, Kvist M, Dalianis H. Recent advances in clinical natural language processing in support of semantic analysis. Yearbook of Medical Informatics. 2015;24(01):183–193. doi: 10.15265/IY-2015-009.
- [15]. Denecke K, Deng Y. Sentiment analysis in medical settings: new opportunities and challenges. Artif Intell Med. 2015;64(1):17–27. doi: 10.1016/j.artmed.2015.03.006.
- [16]. Jiménez-Zafra S, Martín-Valdivia M, Molina-González M, Ureña-López L. How do we talk about doctors and drugs? Sentiment analysis in forums expressing opinions for medical domain. Artif Intell Med. 2019;93:50–57. doi: 10.1016/j.artmed.2018.03.007.
- [17]. Carrillo-de Albornoz J, Vidal Rodríguez J, Plaza L. Feature engineering for sentiment analysis in e-health forums. PLoS ONE. 2018;13(11):e0207996. doi: 10.1371/journal.pone.0207996.
- [18]. Greaves F, Ramirez-Cano D, Millett C, Darzi A, Donaldson L. Use of sentiment analysis for capturing patient experience from free-text comments posted online. Journal of Medical Internet Research. 2013;15(11):e239. doi: 10.2196/jmir.2721.
- [19]. Medinify. 2019. Available from: https://github.com/NanoNLP/medinify.
- [20]. Giachanou A, Gonzalo J, Mele I, Crestani F. Sentiment propagation for predicting reputation polarity. European Conference on Information Retrieval. Springer. 2017:226–238.
- [21]. Taboada M, Brooke J, Tofiloski M, Voll K, Stede M. Lexicon-based methods for sentiment analysis. Comput Linguist. 2011 Apr;37(2):267–307.
- [22]. Asghar MZ, Khan A, Ahmad S, Qasim M, Khan IA. Lexicon-enhanced sentiment analysis framework using rule-based classification scheme. PLoS ONE. 2017 Feb;12(2):e0171649. doi: 10.1371/journal.pone.0171649.
- [23]. Na JC, Kyaing WYM, Khoo CSG, Foo S, Chang YK, Theng YL. Sentiment classification of drug reviews using a rule-based linguistic approach. In: The Outreach of Digital Libraries: A Globalized Resource Network. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. 2012:189–198.
- [24]. Goeuriot L, Na JC, Min Kyaing WY, Khoo C, Chang YK, Theng YL. Sentiment lexicons for health-related opinion mining. In: Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium. ACM. 2012:219–226.
- [25]. Wilson T, Wiebe J, Hoffmann P. Recognizing contextual polarity: an exploration of features for phrase-level sentiment analysis. Comput Linguist. 2009 Aug;35(3):399–433.
- [26]. Wang G, Sun J, Ma J, Xu K, Gu J. Sentiment classification: the contribution of ensemble learning. Decis Support Syst. 2014 Jan;57:77–93.
- [27]. Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl-Based Syst. 2015;89:14–46.
- [28]. Nguyen TD, Nguyen LDP, Cao T. Sentiment analysis on medical text using combination of machine learning and SO-CAL scoring. 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES). 2017:49–54.
- [29]. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. 2013:3111–3119.
- [30]. Bressert E. SciPy and NumPy: An Overview for Developers. 1st ed. 1005 Gravenstein Highway North, Sebastopol, CA 95472: O'Reilly Media. 2013.
- [31]. Ketkar N. Introduction to PyTorch. In: Deep Learning with Python. Springer. 2017:195–208.
- [32]. Roujeau JC, Kelly JP, Naldi L, Rzany B, Stern RS, Anderson T. Medication use and the risk of Stevens–Johnson syndrome or toxic epidermal necrolysis. New Engl J Med. 1995;333(24):1600–1608. doi: 10.1056/NEJM199512143332404.