Skip to main content
Online Journal of Public Health Informatics logoLink to Online Journal of Public Health Informatics
. 2015 Feb 26;7(1):e132. doi: 10.5210/ojphi.v7i1.5798

User-Customizable Health Pattern Detector Framework: Twitter Analysis Example

Lianna M Hall 1,*, Kevin K Nam 1, Jason Thornton 1, Marianne DeAngelus 1, Timothy J Dasey 1
PMCID: PMC4512446

Objective

To demonstrate a framework for user-customizable text processing that can improve the efficiency and effectiveness of mining text for biosurveillance, with initial application to Twitter.

Introduction

Early detection of a disease outbreak using pre-diagnostic textual data is available in biosurveillance systems with the integration of data such as chief complaints. Social media has been identified as an additional pre-diagnostic data source of interest [1]. Textual data analysis in public health is usually based on a keyword search and often involves a complex Boolean combination of terms that produce results with many false alarms. Epidemiologists may wish to query the data differently based on the event of interest, yet the process is laborious to weed out uninteresting content. Specialized detectors that decide on the topical relevance of keyword search usually require developers to adapt methods to new uses, which is a time- and cost- prohibitive activity. Users need the ability to rapidly build text content detectors on their own.

Methods

A generalizable detector framework called Customizable Pattern Analytics (CPA) was adapted and tested with the Twitter biosurveillance data mining application. CPA was originally developed for detecting features in videos [2], but has a general purpose mathematical framework that allows migration to other data discrimination problems. CPA automatically reconfigures multiple stages of a detector processing chain (e.g. feature selection and classification) based on binary feedback from the user on the utility of returned results. It does so by computing a wide range of features about the data, and adjusting the feature weighting and the decision boundary on the combined features based on user feedback. The result is a user-built detector that can be specific to a situation.

For Twitter processing, CPA analyzes many characteristics of each Tweet that is returned from a keyword search, including term frequencies, common word combinations, content flags, and metadata (e.g. location). A user interface transparently shows ranked examples of the returned Tweets of suspected relevance to the user. The user can select examples as either relevant or irrelevant, and the interface progressively displays a new set of options based on an underlying reengineering of the detector by CPA.

Results

The figure shows an example user interface screenshot. The application is available for demonstration. Performance curves (i.e. true vs. false positive rate) show that CPA achieves superior performance than that of keyword search alone or from one specific type of text analysis. Importantly, the type of text processing to apply is varied based on the particular keywords used in the search. CPA can select the most productive combination of text processing methods to apply that best match the user-supplied yes/no labels. Empirical trials have demonstrated that there is a significant accuracy boost with as few as 5 yes/no labels, and that this accuracy approaches its upper limit after several dozen feedback labels.

Conclusions

Application of CPA to Twitter data analysis was demonstrated with superior performance to simpler text analysis and with the versatility to be applied to a wide range of microblog processing tasks. These methods should be directly applicable to similar text processing tasks (e.g., chief complaints). Automatic or user-guided keyword expansion methods are being investigated to extend the capability. Extensions to other text processing problems are imagined, such as for electronic health record analysis.

graphic file with name ojphi-07-e132-g001.jpg

Example user interface

Acknowledgments

This work is sponsored by the Assistant Secretary of Defense for Research and Engineering (ASD(R&E)) under Air Force Contract #FA8721- 05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

References

  • 1.CDC . 2014. Health Department Use of Social Media to Identify Foodborne Illness — Chicago, Illinois, 2013–2014. MMWR. 63(32), 681-85. [PMC free article] [PubMed] [Google Scholar]
  • 2.Thornton J, DeAngelus M, Chan M. Online Customization of Video Detection Capabilities. IEEE Int’l Conf. on Security Technology; 2014 Oct 13-16; Rome, Italy. Accepted for presentation. [Google Scholar]

Articles from Online Journal of Public Health Informatics are provided here courtesy of JMIR Publications Inc.

RESOURCES