AMIA Summits on Translational Science Proceedings. 2020 May 30;2020:136–141.

Towards Automatic Bot Detection in Twitter for Health-related Tasks

Anahita Davoudi 1, Ari Z Klein 1, Abeed Sarker 2, Graciela Gonzalez-Hernandez 1
PMCID: PMC7233076  PMID: 32477632

Abstract

With the increasing use of social media data for health-related research, the credibility of the information from this source has been questioned, as the posts may not originate from personal accounts. While automatic bot detection approaches have been proposed, none have been evaluated on users posting health-related information. In this paper, we extend an existing bot detection system and customize it for health-related research. Using a dataset of Twitter users, we first show that the system, which was designed for political bot detection, underperforms when applied to health-related Twitter users. We then incorporate additional features and a statistical machine learning classifier to significantly improve bot detection performance. Our approach obtains an F1-score of 0.7 for the “bot” class, representing an improvement of 0.339. Our approach is customizable and generalizable for bot detection in other health-related social media cohorts.

Introduction

In recent years, social media has evolved into an essential source of information for various types of health-related research. Social networks encapsulate large volumes of data associated with diverse health topics, generated by continuously growing bases of active users. Twitter, for example, has 330 million monthly active users worldwide who create almost 500 million micro-blogs (tweets) per day.i For some years, the use of the platform to share personal health information has been growing, particularly among people living with one or more chronic conditions and those living with a disability. Twenty percent of social network site users living with chronic conditions gather and share health information on the sites, compared with 12% of social network site users who report no chronic conditions.ii Social media data is thus being widely used for health-related research, for tasks such as adverse drug reaction detection1, syndromic surveillance2, subject recruitment for cancer trials3, and characterizing drug abuse4, to name a few. Twitter is particularly prevalent in research due to the availability of the public streaming API,iii which releases a sample of publicly posted data in real time. While early health-related research from social media focused almost exclusively on population-level studies, some very recent research has focused on performing longitudinal data analysis at the user level, such as mining health-related information from cohorts of pregnant women5.

When conducting user-level studies from social media, one challenge is to ascertain the credibility of the information posted. Notably, when deriving statistical estimates from user cohorts, it is important to verify that the user accounts represent humans and not bots (accounts that can be controlled to produce content automatically and interact with other profiles)6, 7. Bots may spread false information by automatically retweeting posts without a human to verify the facts, or may be deployed to deliberately influence public opinion on particular topics6, 8, 9. For example, a recent study10 showed that the highest proportion of anti-vaccine content is generated by accounts with unknown or intermediate bot scores, meaning that existing methods were not able to fully determine whether they were indeed bots. Automatic bot detection techniques mostly rely on extracting features from users’ profiles and their social networks11, 12. Some studies have used honeypot profiles on Twitter to identify and analyze bots13, while other studies have examined social proximity14 or both social and content proximity12, tweet timing intervals15, or user-level content-based and graph-based features16. However, in response to efforts towards keeping Twitter bot-free, bots have evolved to evade detection techniques17.

The objectives of this study are to (i) evaluate an existing bot detection system on user-level datasets selected for their health-related content, and (ii) extend the bot detection system for practical application within the health realm. Bot detection approaches have been published in the past few years, but most of the code and data necessary for reproducing the published results were not made available18–20. The only system for which we found both operational code and data available, Botometer21 (formerly BotOrNot), was chosen as the benchmark system for this study. To the best of our knowledge, this paper presents the first study on health-related bot detection. We have made the classification code and training set of annotated users availableiv.

Methods

Corpus

To identify bots in health-related social media data, we retrieved a sample of 10,417 users from a database containing more than 400 million publicly available tweets posted by more than 100,000 users who have announced their pregnancy on Twitter5. This sample is based on related work for detecting users who have mentioned various pregnancy outcomes in their tweets. Two professional annotators manually categorized the 10,417 users as “bot,” “non-bot,” or “unavailable,” based on their publicly available Twitter sites. Users were annotated broadly as “bot” if, in contrast to users annotated as “non-bot,” they did not appear to be posting personal information. We extended the definition of “bot” to include any user that is not a personal account because, in utilizing Twitter data for health-related studies, we seek to identify only accounts that are posting user-generated information. Users were annotated as “unavailable” if their Twitter sites could not be viewed at the time of annotation because the users had modified their privacy settings or had been removed or suspended from Twitter. Based on 1000 overlapping annotations, the inter-annotator agreement (IAA) was κ = 0.93 (Cohen’s kappa22), considered “almost perfect agreement”23. The IAA does not account for disagreements resulting from the change of a user’s status to or from “unavailable” in the time between the first and second annotations. Upon resolving the disagreements, 413 (4%) users were annotated as “bot”, 7849 (75.35%) as “non-bot”, and 2155 (20.69%) as “unavailable”.
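As a quick illustration of the agreement measure used above, Cohen’s kappa can be computed directly from the overlapping annotations; the following is a minimal sketch using scikit-learn’s cohen_kappa_score, where the two label lists are hypothetical placeholders rather than the study’s data.

```python
# Minimal sketch: Cohen's kappa on doubly annotated users.
# The two lists below are hypothetical placeholders standing in for the
# 1000 overlapping annotations ("bot", "non-bot", "unavailable").
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["non-bot", "bot", "non-bot", "unavailable", "non-bot"]
annotator_2 = ["non-bot", "bot", "non-bot", "non-bot", "non-bot"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # the paper reports 0.93 on the 1000 overlapping users
```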

Classification

We used the 8262 “bot” and “non-bot” users in experiments to train and evaluate three classification systems. We split the users into 80% (training) and 20% (test) sets, stratified based on the distribution of “bot” and “non-bot” users. The training set includes 61,160,686 tweets posted by 6610 users, and the held-out test set includes 15,703,735 tweets posted by 1652 users. First, we evaluated Botometer on our held-out test set. Botometer is a publicly available bot detection system designed for political bot detection. It outputs a score between 0 and 1 for a user, representing the likelihood that the user is a bot. Second, we used the Botometer score for each user as a feature in training a gradient boosting classifier, a decision-tree-based ensemble machine learning algorithm24 that can help address class imbalance. To adapt the Botometer scores to our binary classification task, we set the threshold to 0.47, based on performing 5-fold cross-validation over the training set. To further address the class imbalance, we used the Synthetic Minority Over-sampling Technique (SMOTE)25 to create artificial instances of “bot” users in the training set. We also performed 5-fold cross-validation over the training set to optimize parameters for the classifier; we used an exponential loss function, set the number of estimators to 200, and set the learning rate to 0.1. Third, we used the classifier with an extended set of features that are not used by Botometer. Based on our manual annotation, we consider the following features to be potentially informative for distinguishing “bot” and “non-bot” users in health-related data (a brief feature-extraction sketch follows the list):

  • Tweet Diversity. Considering that “bot” users may re-post the same tweets, we used the ratio of a user’s unique tweets to the total number of tweets posted by the user. Here, a value near 0 indicates that the user has repeatedly posted the same tweet, and 1 indicates that every tweet is unique and has been posted only once. As Figure 1 illustrates, a subset of “bot” users (in the training set) re-posted the same tweets more often than “non-bot” users.

  • URL score. During manual annotation, we found that “bot” users’ tweets frequently contain URLs (e.g., advertisements for health-related products, such as medications), so we used the ratio of the number of a user’s tweets containing a URL to the total number of tweets posted by the user.

  • Mean Daily Posts. Considering that “bot” users may post tweets more frequently than “non-bot” users, we measured the average and standard deviation of the number of tweets posted daily by a user. As Figure 1 illustrates, a subset of “bot” users post, on average, more tweets daily than “non-bot” users.

  • Topics. Considering that “bot” users may post tweets about a limited number of targeted topics, we used topic modeling to measure the heterogeneity of topics in a user’s tweets. We used Latent Dirichlet Allocation (LDA)26 to extract the top five topics from all of the users’ 1000 most recent tweets (or all tweets, if a user has posted fewer than 1000), and used the mean of the weights of each topic across all of a user’s tweets.

  • Mean Post Length. Considering that the length of tweets may differ between “bot” and “non-bot” users, we used the mean and standard deviation of the length of a user’s tweets.

  • Profile Picture. In addition to tweet-related features, we used features based on information in users’ profiles. Considering that a “non-bot” user’s profile picture may be more likely to contain a face, we used a publicly available systemv to detect the number of faces in a profile picture. As Figure 2 illustrates, a face was not detected in the profile picture of the majority of “non-bot” users (in the training set). In contrast, at least one face was detected in the profile picture of the majority of “bot” users.

  • User Name. Finally, we used a publicly available lexiconvi to detect the presence or absence of a person’s name in a user name. As Figure 2 illustrates, the name of a person is present (1) in approximately half of “non-bot” user names, whereas the name of a person is absent (0) in the majority of “bot” user names.
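The following is a minimal sketch (not the authors’ released code) of how the tweet- and profile-derived features above could be computed, assuming each user’s tweets are available as a list of strings, daily post counts have been derived from tweet timestamps, and the profile picture has been downloaded locally. The face_recognition and gender-guesser packages are the publicly available tools cited in the footnotes; the helper function names are illustrative.

```python
# Illustrative per-user feature extraction; function names are hypothetical.
from statistics import mean, pstdev

import face_recognition                    # footnote v: face detection in profile pictures
import gender_guesser.detector as gender   # footnote vi: person-name lexicon
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def tweet_diversity(tweets):
    """Ratio of unique tweets to total tweets (near 0 = mostly re-posts, 1 = all unique)."""
    return len(set(tweets)) / len(tweets)


def url_score(tweets):
    """Fraction of the user's tweets that contain a URL."""
    return sum("http" in t for t in tweets) / len(tweets)


def daily_post_stats(daily_counts):
    """Mean and standard deviation of the number of tweets posted per day."""
    return mean(daily_counts), pstdev(daily_counts)


def post_length_stats(tweets):
    """Mean and standard deviation of tweet length in words."""
    lengths = [len(t.split()) for t in tweets]
    return mean(lengths), pstdev(lengths)


def topic_weights(tweets, n_topics=5):
    """Mean weight of each LDA topic over the user's (up to) 1000 most recent tweets.
    For brevity the model is fit per user here; the paper extracts the top five
    topics from all users' tweets."""
    counts = CountVectorizer(stop_words="english").fit_transform(tweets[:1000])
    doc_topics = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit_transform(counts)
    return doc_topics.mean(axis=0)


def face_count(profile_image_path):
    """Number of faces detected in the downloaded profile picture."""
    image = face_recognition.load_image_file(profile_image_path)
    return len(face_recognition.face_locations(image))


def has_person_name(user_name):
    """1 if any token of the user name is recognized as a person's first name."""
    detector = gender.Detector()
    return int(any(detector.get_gender(token.capitalize()) != "unknown"
                   for token in user_name.split()))
```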

Figure 1. The distribution of features for “bot” and “non-bot” users in the training set.

Figure 2. The distribution of faces detected in profile pictures and names identified in user names for “bot” and “non-bot” users in the training set.
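Putting these pieces together, the following is a minimal sketch of the classification pipeline described above, not the authors’ released implementation, using imbalanced-learn’s SMOTE and scikit-learn’s GradientBoostingClassifier as stand-ins. X_train and X_test are assumed to be user-by-feature matrices whose first column holds the Botometer score, and y_train and y_test hold the 0/1 (“non-bot”/“bot”) labels.

```python
# Illustrative pipeline: Botometer-score thresholding, SMOTE oversampling,
# and gradient boosting with the parameters reported in the paper.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# 1) Baseline: threshold the Botometer score alone (0.47 chosen via 5-fold
#    cross-validation over the training set).
botometer_baseline = (X_test[:, 0] >= 0.47).astype(int)

# 2) Oversample the minority ("bot") class in the training set with SMOTE.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_train, y_train)

# 3) Gradient boosting: exponential loss, 200 estimators, learning rate 0.1.
clf = GradientBoostingClassifier(loss="exponential", n_estimators=200, learning_rate=0.1)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict(X_test)

# Per-class precision, recall, and F1, as reported in Table 1.
print(classification_report(y_test, y_pred, target_names=["non-bot", "bot"]))
```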

Results

Table 1 presents the precision, recall, and F1-scores for the three bot detection systems evaluated on the held-out test set. The F1-score for the “bot” class indicates that Botometer (0.361), designed for political bot detection, does not generalize well for detecting “bot” users in health-related data. Although the classifier with only the Botometer score as a feature (0.286) performs even worse than the default Botometer system, our extended feature set significantly improves performance (0.700). For imbalanced data, a higher F1-score for the majority class is typical; in this case, it reflects that we have modeled the detection of “bot” users based on their natural distribution in health-related data.

Table 1.

Precision, recall, and F1-score for three bot detection systems evaluated on a held-out test set of 1652 users. Precision, recall, and F1-scores are reported for the “non-bot” class (NB), the “bot” class (B), and an average of the two classes (avg.).

Classifier | Precision (NB, B, Avg.) | Recall (NB, B, Avg.) | F1-Score (NB, B, Avg.)
Botometer Default | 0.974, 0.276, 0.625 | 0.919, 0.542, 0.730 | 0.945, 0.361, 0.653
GB Classifier (Botometer score) | 0.963, 0.285, 0.624 | 0.962, 0.288, 0.625 | 0.962, 0.286, 0.624
GB Classifier (Botometer score + features) | 0.985, 0.678, 0.831 | 0.982, 0.724, 0.853 | 0.984, 0.700, 0.842

Discussion

Our results demonstrate that (i) a publicly available bot detection system, designed for political bot detection, underperforms when applied to health-related data, and (ii) extending the system with simple features derived from health-related data significantly improves performance. An F1-score of 0.700 for the “bot” class represents a promising benchmark for automatic classification of highly imbalanced Twitter data and, in this case, for detecting users who are not reporting information about their own pregnancy on Twitter. Identifying such users is particularly important in the process of automatically selecting cohorts27 from a population of social media users for user-level observational studies28.

A brief error analysis of the 25 false negative users (in the held-out test set of 1652 users) from the classifier with the extended feature set reveals that, while only one of the users is an account that automatically re-posts other users’ tweets, the majority of the errors can be attributed to our broad definition of “bot” users, which includes health-related companies, organizations, forums, clubs, and support groups that are not posting personal information. These users are particularly challenging to identify automatically as “bot” users because, with humans posting on behalf of an online maternity store or to a pregnancy forum, for example, their tweets resemble those posted by “non-bot” users. In future work, we will focus on deriving features for modeling the nuances that distinguish such “bot” users.

Conclusion

As the use of social networks, such as Twitter, in health research increases, there is a growing need to validate the credibility of the data before drawing conclusions. The presence of bots in social media poses a crucial problem, mainly because bots may be customized to perpetuate specific biased or false information or to execute advertising or marketing goals. We demonstrate that, while existing systems have been successful in detecting bots in other domains, they do not perform as well for detecting health-related bots. Using a machine learning algorithm on top of an existing bot detection system, together with a set of simple derived features, we were able to significantly improve bot detection performance in health-related data. Introducing more features would likely further enhance performance, which we will explore in future work.

Acknowledgments

This study was funded in part by the National Library of Medicine (NLM) (grant number: R01LM011176) and the National Institute on Drug Abuse (NIDA) (grant number: R01DA046619) of the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Footnotes

i. https://www.internetlivestats.com/twitter-statistics/. Accessed: 05/06/2019.

ii. https://www.pewinternet.org/2011/05/12/the-social-life-of-health-information-2011/. Accessed: 08/15/2019.

iii. https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data.html. Accessed: 05/06/2019.

iv. https://healthlanguageprocessing.org/towards-automatic-bot-detection/

v. https://github.com/ageitgey/face_recognition. Accessed: 08/15/2019.

vi. https://pypi.org/project/gender-guesser/. Accessed: 07/25/2019.

References

  • 1. Sarker A, Ginn R, Nikfarjam A. Utilizing social media data for pharmacovigilance: A review. Journal of Biomedical Informatics. 2015;54:202–212. doi: 10.1016/j.jbi.2015.02.004.
  • 2. Chen L, Hossain KSMT, Butler P, Ramakrishnan N, Prakash BA. Syndromic surveillance of Flu on Twitter using weakly supervised temporal topic models. Data Mining and Knowledge Discovery. 2016;30:681–710.
  • 3. Reuter K, Praveen A, NamQuyen L. Monitoring Twitter Conversations for Targeted Recruitment in Cancer Trials in Los Angeles County: Protocol for a Mixed-Methods Pilot Study. JMIR Res Protoc. 2018;7:e177. doi: 10.2196/resprot.9762.
  • 4. Sarker A, O’Connor K, Ginn R. Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug Safety. 2016;39:231–240. doi: 10.1007/s40264-015-0379-4.
  • 5. Sarker A, Chandrashekar P, Magge A, Cai H, Klein AZ, Gonzalez G. Discovering cohorts of pregnant women from social media for safety surveillance and analysis. Journal of Medical Internet Research. 2017;19:e361. doi: 10.2196/jmir.8164.
  • 6. Ferrara E, Varol O, Davis CA, Menczer F, Flammini A. The Rise of Social Bots. Communications of the ACM. 2016;59:96–104.
  • 7. Varol O, Ferrara E, Davis CA, Menczer F, Flammini A. Online Human-Bot Interactions: Detection, Estimation, and Characterization. Proc. Intl. AAAI Conf. on Web and Social Media. 2017:280–289.
  • 8. Bessi A, Ferrara E. Social bots distort the 2016 U.S. Presidential election online discussion. First Monday. 2016;21(11).
  • 9. Thomas K, Grier C, Song D, Paxson V. Suspended Accounts in Retrospect: An Analysis of Twitter Spam. Proceedings of the ACM SIGCOMM Internet Measurement Conference. 2011:243–258.
  • 10. Broniatowski DA, Jamison AM, Qi S. Weaponized health communication: Twitter bots and Russian trolls amplify the vaccine debate. American Journal of Public Health. 2018;108(10):1378–1384. doi: 10.2105/AJPH.2018.304567.
  • 11. Wu L, Hu X, Morstatter F, Liu H. Adaptive Spammer Detection with Sparse Group Modeling. ICWSM. 2017:319–326.
  • 12. Hu X, Tang J, Liu H. Online Social Spammer Detection. Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. 2014:59–65.
  • 13. Stringhini G, Kruegel C, Vigna G. Detecting Spammers on Social Networks. Proceedings of the 26th Annual Computer Security Applications Conference. 2010:1–9.
  • 14. Ghosh S, Viswanath B, Kooti F. Understanding and combating link farming in the Twitter social network. Proceedings of the International Conference on World Wide Web. 2012:61–70.
  • 15. Faisal AA, Tavares G. Scaling-laws of human broadcast communication enable distinction between human, corporate and robot Twitter users. PLoS One. 2013;8:e65774. doi: 10.1371/journal.pone.0065774.
  • 16. McCord M, Chuah M. Spam Detection on Twitter Using Traditional Classifiers. Proceedings of the 8th International Conference on Autonomic and Trusted Computing. 2011:175–186.
  • 17. Yang C, Harkreader RC, Gu G. Die Free or Live Hard? Empirical Evaluation and New Design for Fighting Evolving Twitter Spammers. Recent Advances in Intrusion Detection. 2011:318–337.
  • 18. Chavoshi N, Hamooni H, Mueen A. DeBot: Twitter Bot Detection via Warped Correlation. ICDM. 2016:817–822.
  • 19. Gilani Z, Wang L, Crowcroft J, Almeida M, Farahbakhsh R. Stweeler: A framework for Twitter bot analysis. Proceedings of the 25th International Conference Companion on World Wide Web. 2016:37–38.
  • 20. Minnich A, Chavoshi N, Koutra D, Mueen A. BotWalk: Efficient adaptive exploration of Twitter bot networks. Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 2017:467–474.
  • 21. Davis CA, Varol O, Ferrara E, Flammini A, Menczer F. BotOrNot: A system to evaluate social bots. Proceedings of the 25th International Conference Companion on World Wide Web. 2016:273–274.
  • 22. Cohen J. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement. 1960;20:37–46.
  • 23. Viera AJ, Garrett JM. Understanding interobserver agreement: the kappa statistic. Fam Med. 2005;37(5):360–363.
  • 24. Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001;29(5):1189–1232.
  • 25. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–357.
  • 26. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022.
  • 27. Klein AZ, Sarker A, Cai H, Weissenbacher D, Gonzalez-Hernandez G. Social media mining for birth defects research: a rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter. J Biomed Inform. 2018;87:68–78. doi: 10.1016/j.jbi.2018.10.001.
  • 28. Golder S, Chiuve S, Weissenbacher D. Pharmacoepidemiologic evaluation of birth defects from health-related postings in social media during pregnancy. Drug Saf. 2019;42(3):389–400. doi: 10.1007/s40264-018-0731-6.
