
Table 4.

Evaluation and performance metrics

| Author | NLP/Text Mining | Primary Evaluation Metric | Comparative Evaluation |
|---|---|---|---|
| Brennan & Aronson, 2003^16 | NLP | Number of terms matched to vocabularies; matched frequencies reported as higher for nursing-complemented vocabularies | Compared several models on nursing, MeSH, and SNOMED terms |
| Portier et al, 2013^33 | Text mining | Descriptive results, including sentiment analysis | None reported |
| Freifeld et al, 2014^27 | NLP | Automated, dictionary-based symptom classification had 72% recall and 86% precision | Annotation results were compared to FDA Adverse Event Reporting System data |
| Gupta et al, 2014^19 | NLP | Extracts symptoms and conditions with an F-measure of 66–76% | Compared performance against two other programs, the OBA and the MetaMap annotator, with baseline and default parameters |
| Park & Ryu, 2014^25 | Text mining | Descriptive results, including symptoms and clinical distinctions | None reported |
| Janies et al, 2015^28 | NLP | No algorithm evaluation metrics reported | None reported |
| Jimeno-Yepes et al, 2015^20 | NLP | Highest-performing model (Micromed+Meta) had precision, recall, and F-measure of 72%, 60%, and 66%, respectively | Compared exact and partial matches across five models |
| Karmen et al, 2015^21 | NLP | Average precision of 84% and average F-measure of 79% | Compared algorithm results to independent expert ratings |
| Liu & Chen, 2015^23 | NLP | Average F-measure of 90% for drug entity extraction and 80% for medical event extraction | Compared several methods across patient-authored forums |
| Nikfarjam et al, 2015^24 | NLP | Precision, recall, and F-measure of 86%, 78%, and 82%, respectively | Compared several methods, including SVM, ADRMine, MetaMap, and a lexicon-based method |
| Tighe et al, 2015^32 | Text mining | Descriptive results, including an average degree centrality of 60.7 for the reduced pain-tweet corpus graph | Compared sentiment for relevant terms to objective terms |
| Eshleman & Singh, 2016^18 | NLP & text mining | Precision exceeding 85% and F-measure over 81% | Compared sentiment analysis to graph topology with co-occurring symptoms |
| Lee & Donovan, 2016^35 | Text mining | Descriptive results for symptom findings | None reported |
| Marshall et al, 2016^29 | Text mining | Descriptive results, including co-occurrence and clustering of symptom findings | None reported |
| Topaz et al, 2016^30 | NLP | Descriptive results, including symptom extraction | None reported |
| Sunkureddi et al, 2016^34 | NLP | Descriptive results, including frequency ranking of reactions and patients’ concerns | None reported |
| Cocos et al, 2017^17 | NLP | Approximate-match F-measure of 75% for RNN-based ADR identification | Compared the BLSTM-RNN ADR classifier, a baseline lexicon system, and a conditional random field model |
| Cronin et al, 2017^36 | NLP | Logistic regression for medical communications with an AUC of 0.899 | Compared naive Bayes, logistic regression, and random forest across different types of patient portal messages |
| Lamy et al, 2017^22 | NLP | No algorithm evaluation metrics reported | None reported |
| Lu et al, 2017^31 | Text mining | Descriptive results, including sentiment scores, clustering of groups, and Jaccard similarities | None reported |
| Patel et al, 2017^26 | NLP | No algorithm evaluation metrics reported | Compared the method between two datasets |

Note.


Studies have been arranged in chronological order to assess trends over time. ADR=adverse drug reaction; ADRMine=a machine learning-based concept extraction system that uses conditional random fields; AUC=area under the curve; BLSTM=bidirectional long short-term memory network; FDA=Food and Drug Administration; F-measure=also known as F1 score or F-score in the published literature; MeSH=Medical Subject Headings; MetaMap=a tool for recognizing Unified Medical Language System (UMLS) concepts in text; NLP=natural language processing; OBA=Open Biomedical Annotator; RNN=recurrent neural network; SVM=support vector machine
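
Most of the extraction studies in Table 4 report precision, recall, and F-measure. For readers less familiar with these metrics, the short Python sketch below (illustrative only; the counts are hypothetical and not taken from any of the reviewed studies) shows how all three are derived from true-positive, false-positive, and false-negative counts, with the F-measure being the harmonic mean of precision and recall.

```python
# Illustrative sketch only: the counts below are hypothetical and are not
# drawn from any study reviewed in Table 4. It shows how precision, recall,
# and F-measure relate to raw counts of true positives (tp), false
# positives (fp), and false negatives (fn).

def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) given confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure (F1) is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts chosen only to make the arithmetic concrete:
p, r, f1 = precision_recall_f1(tp=72, fp=28, fn=48)
print(f"precision={p:.0%}, recall={r:.0%}, F-measure={f1:.0%}")
# -> precision=72%, recall=60%, F-measure=65%
```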