J Am Med Inform Assoc. 2022 Dec 13;30(3):588–603. doi: 10.1093/jamia/ocac240

Table 2.

Outcomes and additional findings of the studies combining CDS and TM

Study Application field Comparison Quantitative outcome measure Outcomes estimates (%): A Sens Spec PPV NPV F Co K Additional findings
Aronsky et al 200168 Diagnosis Comparing a clinically valid gold standard (3-step diagnostic evaluation process and 8 independent physicians) versus the model, of ED patients whose CXR report was available during the encounter to extract pneumonia diagnosis from unstructured data, detect and prompt initiation of antibiotic treatment to reduce disease severity and mortality Sensitivity, specificity, PPV, NPV n/a 96 63 14.2 99.5 25a n/a The area under the receiver operating characteristic curve was 0.881 (95% CI, 0.822–0.925) for the Bayesian network alone and 0.916 (95% CI, 0.869–0.949) combined (P = .01)
Bozkurt et al 201657 Diagnosis Comparing reference standard decision support system with the NLP decision support systems of patients with mammography reports, to provide decision support as part of the workflow of producing the radiology report. Accuracy 97.58 n/a n/a n/a n/a n/a n/a The system performed extraction of imaging observations with their modifiers from text reports with precision = 94.9%, recall = 90.9%, and F-measure = 92%. They also compared the BI-RADS categories; accuracy rates of the Bayesian network outputs for each setting were calculated as 98.14% (history nodes included) and 98.15% (history nodes not included). The NLP-DSS and RS-DSS had closely matched probabilities, with a mean paired difference of 0.004 ± 0.025. The concordance correlation of these paired measures was 0.95.
Byrd et al 201258 Diagnosis Comparing the performance of the machine-learning and rule-based labelers of patients diagnosed with HF in primary care to identify heart failure with the Framingham diagnostic criteria to detect heart failure early. Precision, recall, F-score n/a 89.6 n/a 92.5 n/a 93.2 n/a Detection with the Framingham criteria had an F-score of 0.910. Encounter labeling achieves an F-score of 0.932.
Chen et al 201739 Prognosis Comparing clinical order suggestions against the “correct” set of orders that actually occurred within a follow-up verification time for the patient with an encounter from their initial presentation until hospital discharge to build probabilistic topic model representations of hospital admissions processes Sensitivity and PPV n/a 47 n/a 24 n/a 32a n/a Existing order sets predict clinical orders used within 24 h with area under the receiver operating characteristic curve 0.81, precision 16%, and recall 35%. This can be improved to 0.90, 24%, and 47% by using probabilistic topic models to summarize clinical data into up to 32 topics.
Cruz et al 201934b,c Prognosis Comparing the same patients in the primary care area of SESCAM with recommendations before and after implementation of the CDSS to improve adherence to clinical pathways and reduce clinical variability Qualitative n/a n/a n/a n/a n/a n/a n/a Adherence rates to clinical pathways improved in 8 out of 18 recommendations when the real-time CDSS was employed, achieving 100% compliance in some cases. The improvement was statistically significant in 3 cases (P < .05). Considering that this was a preliminary study, these results are promising.
Day et al 200754b Diagnosis Reading documentation to determine whether registry inclusion criteria were met and printing admission lists versus the tool of trauma patients or patients who died to automate the process of identifying trauma patients Qualitative n/a n/a n/a n/a n/a n/a n/a This program has made the job of identifying trauma patients much less complicated and time-consuming. It improved the efficiency and reduced the amount of time wasted using multiple formats to ensure that all patients who qualify for inclusion are found. The program also stores relevant patient information to a permanent electronic database.
Denny et al 201061 Prognosis Comparing the gold standard (expert physicians’ manual review of EMR notes) versus the tool of patients whose colonoscopy statuses were unknown to detect colorectal cancer screening status Recall and precision n/a TR: 91; CS: 82; CC: 93 n/a TR: 95; CS: 95; CC: 95 n/a TR: 93a; CS: 88a; CC: 94a n/a
Evans et al 201653b Diagnosis Comparing patients with HF treated using the new tool with HF patients who had received standard care at the same hospital before the tool was implemented, to help identify high-risk heart failure patients Sensitivity, specificity, PPV n/a 82.6–95.3 82.7–97.5 97.45 n/a 89–96a n/a
Fiszman et al 200069 Diagnosis Comparing SymText against 4 physicians, 2 different keyword searches, and 3 lay persons of patients with a primary ICD-9 hospital discharge diagnosis of bacterial pneumonia to find cases of CAP to support diagnosis and treatment. Recall, precision and specificity n/a 95 85 78 n/a 86a n/a
Friedlin and McDonald 200646 Diagnosis Comparing REX to the gold standard (specially trained human coders as well as an experienced physician) of patients who have had a chest x-ray with dictation reports to identify congestive heart failure Sensitivity, specificity, PPV, NPV n/a 100 100 95.4 100 98a n/a
Friedlin et al 200845 Diagnosis Comparing REX to the gold standard (human review) of patients with MRSA keywords to improve reporting of notifiable diseases with automated electronic laboratory MRSA reporting. Sensitivity, specificity, PPV, F-measure n/a 99.96 99.71 99.81 99.93 99.89 n/a REX identified over 2 times as many MRSA positive reports as the electronic lab system without NLP.
Garvin et al 201860 Diagnosis Comparing CHIEF versus reference standard and External Peer Review Program cases involving HF patients discharged from 8 VA medical centers to accurately automate the quality measure for inpatients with HF. Sensitivity and PPV n/a RS: 98.9; EPRP: 98.5 n/a 98.7 n/a RS: 99a; EPRP: 99a n/a Reference standard (RS); External Peer Review Program (EPRP). Of the 1083 patients available for the NLP system, the CHIEF evaluated and classified 100% of cases.
Hazlehurst et al 200556 Diagnosis Comparing MediClass versus the gold standard of patients who are smokers to detect clinical events in the medical record Specificity and sensitivity n/a 82 93 n/a n/a n/a n/a
Imler et al 201363 Diagnosis Comparing annotation by gastroenterologists (reference standard) versus the NLP system of veterans who had an index colonoscopy to extract meaningful information from free-text gastroenterology reports for secondary use, to connect the pathologic record that is generally disconnected from the reports Recall, precision, accuracy, and F-measure L: 97; S: 96; N: 84 L > 82; S > 92; N > 66 n/a L > 95; S > 95; N > 64 n/a L: 96; S: 96; N: 62 n/a
Imler et al 201464 Diagnosis Comparing NLP-based CDS surveillance intervals with those determined by paired, blinded, manual review of patients with an index colonoscopy for any indication except surveillance of a previous colorectal neoplasia to improve adherence to evidence-based practices and guidelines in endoscopy. Kappa statistic n/a n/a n/a n/a n/a n/a 74 Fifty-five reports differed between manual review and CDS recommendations. Of these, NLP error accounted for 54.5%, incomplete resection of adenomatous tissue accounted for 25.5%, and masses observed without biopsy findings of cancer accounted for 7.2%. NLP-based CDS surveillance intervals had higher levels of agreement with the standard than the level of agreement between experts.
Jain et al 199647b Diagnosis Comparing manual coding (gold standard) versus MedLEE of patients with culture-positive tuberculosis to find cases of tuberculosis in radiographic reports to identify eligible patients from the data. Sensitivity n/a With L: 78.9; w/o L: 85.4 n/a n/a n/a n/a n/a MedLEE agreed on the classification of 152/171 (88.9%) reports, 129/142 (90.8%) suspicious for TB and 23/29 (79.3%) not suspicious for TB, and 1072/1197 (89.6%) terms indicative of TB. Analysis showed that most of the discrepancies were caused by MedLEE not finding the location of the infiltrate. By ignoring the location (L) of the infiltrate, the agreement became 157/171 (91.8%) reports and 946/1026 (92.2%) terms.
Jones et al 201252b Diagnosis Screening tool sensitivity and specificity, as well as ICD-9 plus radiographic confirmation, were compared to physician review of ED patients with a chest x-ray or CT scan to identify patients with pneumonia Sensitivity, PPV, specificity, NPV n/a 61 96 52 97 56a n/a Among 41 true positive cases, ED physicians recognized and agreed with the tool in 39%. In only 6 cases did physicians proceed to complete the accompanying pneumonia decision support tool. Of the 39 false positive cases, the NLP incorrectly identified pneumonia in 74%. Of the 8 false negative cases, one was due to failure of the NLP to identify pneumonia.
Kalra et al 202041 Prognosis Comparing manual review versus the models (kNN, RF, DNN) of older men, including both primary and tertiary care indications, to enhance multispecialty CT and MRI protocol assignment quality and efficiency Precision and recall n/a kNN: 83.4; RF: 92.2; DNN: 91.5 n/a kNN: 77.5; RF: 81.7; DNN: 83.6 n/a kNN: 80a; RF: 86a; DNN: 87a n/a Baseline protocol assignment performance achieved weighted precision of 0.757–0.824. Simulating real-world deployment using combined thresholding techniques, the optimized deep neural network model assigned 69% of protocols in automation mode with a recall (weighted accuracy) of 95%. In the remaining 31% of cases, the model achieved 92% accuracy in CDS mode.
Karwa et al 202033 Prognosis Comparison of Colonoscopy Follow-up Recommendations between CDS algorithm and endoscopists of patients with a colonoscopy to assist clinicians to generate colonoscopy follow-up intervals based on the USMSTF guidelines Cohen’s Kappa n/a n/a n/a 69% n/a n/a 58.3 Discrepant recommendations by endoscopists were earlier than guidelines in 91% of cases.
Kim et al 201570 Diagnosis A 5-fold cross validation with the tool to measure the contribution of each of the 4 subsets (lexical, concept position, related concept, section) of patients with LVEF or LVSF to classify the contextual use of both quantitative and qualitative LVEF assessments in clinical narrative documents. Recall, precision and F-measure n/a QT: 95.6; QL: 94.2 n/a QT: 95.6; QL: 94.2 n/a QT: 95.6; QL: 94.2 n/a The experimental results showed that the classifiers achieved good performance, reaching 95.6% F1-measure for quantitative (QT) assessments and 94.2% F1-measure for qualitative (QL) assessments in a 5-fold cross validation evaluation.
Kivekäs et al 201640c Prognosis Comparing the review team versus SAS of adult epilepsy patients to test the functionality and validity of the previously defined triggers to describe the status of epilepsy patients’ well-being. Qualitative n/a n/a n/a n/a n/a n/a n/a In both medical and nursing data, the triggers described patients’ well-being comprehensively. The narratives showed that there was overlap between triggers.
Matheny et al 201259 Diagnosis Comparing manual reference versus algorithm of patients with at least one surgical admission to identify infectious symptoms F-measure, Fleiss’ kappa, precision, recall n/a ASD: 84; SDA: 62 n/a ASD: 91; SDA: 67 n/a ASD: 87; SDA: 64 n/a Among those instances in which the automated system matched the reference set determination for symptom, the system correctly detected 84.7% of positive assertions, 75.1% of negative assertions, and 0.7% of uncertain assertions.
Mendonça et al 200548 Diagnosis Comparing Clinicians’ judgments versus the tool of infants admitted at the NICU to identify pneumonia in newborns, to reduce manual monitoring Sensitivity, specificity and PPV n/a 71 99 7.5 n/a n/a n/a
Meystre and Haug 200537 Prognosis Comparing NLP tools versus a reference standard created with a chart review of patients admitted to a cardiovascular unit, with a stay of at least 48 h and a discharge diagnosis in the list of the 80 selected diagnoses, to extract medical problems from electronic clinical documents to maintain a complete and up-to-date problem list Precision and recall; Cohen’s Kappa n/a 74 n/a 75.6 n/a 75a 90 A custom data subset for MMTx was created, making it faster and significantly improving the recall to 0.896 with a non-significant reduction in precision.
Meystre and Haug 200836 Prognosis Comparing the control group (the standard electronic problem list) versus the intervention group (the Automated Problem List system) of inpatients of the 2 inpatients wards for at least 48h, > 18 years and not already enrolled in a previous phase of this study to improve the completeness and timeliness of an electronic problem list Sensitivity, specificity, PPV and NPV n/a 81.5 95.7 78.4 95.6 80a n/a
Nguyen et al 201931c Diagnosis, Therapy Comparing the resultant number of test results identified by the system for clinical review to the full set of test results that would have been manually reviewed of patients with ED encounters, to ensure important diagnoses are recognized and correct antibiotics are prescribed. PPV, sensitivity and F-measure n/a 94.3 n/a 85.8 n/a 89.8 n/a
Raja et al 201250b Diagnosis Pulmonary angiography use was compared before and after CDS implementation for ED patients who underwent CT pulmonary angiography, to decrease the use and increase the yield of CT for acute pulmonary embolism Accuracy, sensitivity, PPV, NPV and specificity 97.8 91.3 98.7 91.3 98.7 91.3 n/a Quarterly CT pulmonary angiography use increased 82.1% before CDSS implementation, from 14.5 to 26.4 examinations per 1000 patients (P<.0001). After CDSS implementation, quarterly use decreased 20.1%, from 26.4 to 21.1 examinations per 1000 patients (P=.0379).
Raja et al 201942b Prognosis Comparing the research assistant gold standard (3 physicians) versus the tool of patients < 50 years with a history of uncomplicated nephrolithiasis presenting to the ED to evaluate the impact of an appropriate use criterion for renal colic based on local best practice, implemented on the ED use of CT. Qualitative n/a n/a n/a n/a n/a n/a n/a The final sample included 467 patients (194 study site) before and 306 (88 study site) after AUC implementation. The study site’s CT of ureter rate decreased from 23.7% (46/194) to 14.8% (13/88) (P = .03) after implementation of the AUC. The rate at the control site remained unchanged, 49.8% (136/273) versus 48.2% (105/218) (P = .3).
Rosenthal et al 201951b Diagnosis Comparing UPMC Children’s Hospital of Pittsburgh earlier study results versus UPMC Hamot and Mercy of children < 2 years who triggered the EHR-based alert system to increase the number of young children identified as having injuries suspicious for physical abuse Qualitative n/a n/a n/a n/a n/a n/a n/a A total of 242 children triggered the system, 86 during the pre-intervention and 156 during the intervention. The number of children identified with suspicious injuries increased 4-fold during the intervention (P<.001). Compliance was 70% (7 of 10) in the pre-intervention period versus 50% (22 of 44) in the intervention, a change that was not statistically different (P=.55).
Shen et al 202043 Therapy Comparing Pre-existing workflow versus pilot workflow of patients undergoing outpatient endoscopy to decrease sedation-type order errors Precision, PPV, NPV, Sensitivity and Specificity. n/a 89.1 99.2 28.5 99.9 43a n/a
Smith et al 202155b Diagnosis Comparing the algorithm to manual review, to the expected performance of the model, and to prior work in adults, using CXR and other clinical data to recognize pneumonia in patients < 18 years. Sensitivity, specificity, PPV, F-measure n/a 89.9 94.9 78.1 n/a 83.5 n/a
Stultz et al 201944 Therapy Comparing different meningitis dosing alert triggers and dosing error rates between antimicrobials with and without meningitis order sentences of patients admitted to an inpatient pediatric service or the pediatric ED, to provide a meningitis-specific dosing alert for detecting meningitis management Sensitivity, PPV n/a 67.5 n/a 80.9 n/a 74a n/a Antimicrobials with meningitis order sentences had fewer dosing errors (19.8% vs 43.2%, P < .01).
Sung et al 201838c Prognosis Comparing MetaMap to IVT eligibility criteria of adult ED patients with AIS who presented within 3 h of onset but were not treated with IVT, to reduce errors in determining eligibility for IVT in stroke patients Precision, recall and F-score n/a PL: 81.2; DL: 97.2 n/a PL: 99.8; DL: 100 n/a PL: 89.5; DL: 98.6 n/a Users using the task-specific interface achieved a higher accuracy score than those using the current interface (91% vs 80%) in assessing the IVT eligibility criteria. The completion time between the interfaces was statistically similar (2.46 min vs 1.70 min).
Wadia et al 201735 Prognosis Comparing the gold standard (discussion between pathologists and oncologist) versus the tool of patients undergoing colonoscopy or surgery for colon lesions to identify cases that required close clinical follow-up Recall, specificity, precision and F-score n/a 100 98.5 95.2 n/a 97.5 n/a
Wagholikar et al 201265 Diagnosis Comparing the interpretation of free-text Pap reports of physician versus CDSS of patients with Pap reports to develop a computerized clinical decision support system for cervical cancer screening that can interpret free-text Pap reports. Qualitative n/a n/a n/a n/a n/a n/a n/a Evaluation revealed that the CDSS output the optimal screening recommendations for 73 out of 74 test patients, and it identified 2 cases for gynecology referral that were missed by the physician. The CDSS aided the physician to amend recommendations in 6 cases.
Wagholikar et al 201366 Diagnosis Comparing cervical cancer screening of care providers versus the CDSS of patients who had visited the Mayo clinic Rochester in March 2012 to ensure deployment readiness of the system. Accuracy 87 n/a n/a n/a n/a n/a n/a When the deficiencies were rectified, the system generated optimal recommendations for all failure cases, except one with incomplete documentation.
Watson et al 201162 Diagnosis Comparing the model versus reading and evaluating patient characteristics in the EHR notes of patients who are discharged with a principal diagnosis of HF to examine psychosocial characteristics as predictors of readmission in heart failure, to reduce hospital readmissions Sensitivity and specificity n/a >80 >80 n/a n/a n/a n/a Detection of 5 characteristics that were associated with an increased risk for hospital readmission.
Yang et al 201832c Diagnosis Comparing 4 machine learning algorithms, as well as our proposed model of patients with multiple diseases to assist diagnosis Accuracy, recall, precision and F-score 98.67 96.02 n/a 95.94 n/a 95.96 n/a
Zhou et al 201571 Diagnosis Comparing the gold standard (manual review) versus the tool of patients with a history of ischemic heart disease and hospitalized to identify patients with depression Sensitivity, specificity and PPV n/a HC: 92.4; IC: 77.4 n/a HC: 86.9; IC: 64.9 n/a HC: 89.6; IC: 70.6 n/a

A: accuracy; AIS: acute ischemic stroke; ASD: automated symptom detection; AUC: Area Under the Curve; CAP: community acquired pneumonia; CC: completed colonoscopies; CDSS: clinical decision support system; Co K: Cohen’s kappa; CS: colonoscopy status; CT: computed tomography; CXR: chest X-ray; DNN: deep neural network; DL: document level; ED: Emergency Department; EHR: Electronic Health Record; EMR: electronic medical record; EPRP: External Peer Review Program; F: F-score; HC: high confidence; HF: heart failure; IC: intermediate confidence; ICD-9: International Statistical Classification of Diseases and Related Health Problems; IVT: intravenous thrombolytic therapy; kNN: k-nearest neighbor; L: location; LVEF: left ventricular ejection fraction; LVSF: left ventricular systolic function; MMTx: MetaMap Transfer; MRI: magnetic resonance imaging; MRSA: methicillin-resistant Staphylococcus aureus; N: number; NICU: Neonatal Intensive Care Unit; NLP: natural language processing; NPV: negative predictive value; Pap: Papanicolaou; PL: phrase level; PPV: positive predictive value; Sens: sensitivity; QL: qualitative; QT: quantitative; RF: random forest; RS: reference standard; S: size; SDA: symptom detection with assertion; SESCAM: Servicio de Salud de Castilla-La Mancha; Spec: specificity; TB: tuberculosis; TR: timing references; UPMC: University of Pittsburgh Medical Center; USMSTF: US Multisociety Task Force on Colorectal Cancer; VA: Veterans Affairs.

a These F-scores were calculated according to the formula: 2*(sensitivity * PPV)/(sensitivity + PPV); a worked check of this calculation follows these footnotes.

b These studies were implemented in clinical practice.

c These studies were not performed in an English-speaking country.
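
As a check on footnote a, the brief Python sketch below recomputes the derived F-scores from the sensitivity and PPV values reported in the table rows above. The function name and the three example studies are chosen here for illustration only and are not part of the original review.

```python
# Minimal sketch: recompute the F-scores marked with footnote "a" in Table 2
# from the reported sensitivity (recall) and PPV (precision), using
# F = 2 * (sensitivity * PPV) / (sensitivity + PPV).

def f_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity and PPV, expressed in percent."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)

# Sensitivity and PPV pairs taken from the table rows above.
examples = {
    "Aronsky et al 2001": (96.0, 14.2),     # table reports F = 25a
    "Chen et al 2017": (47.0, 24.0),        # table reports F = 32a
    "Meystre and Haug 2005": (74.0, 75.6),  # table reports F = 75a
}

for study, (sens, ppv) in examples.items():
    print(f"{study}: F = {f_score(sens, ppv):.1f}")
# Expected output (each value rounds to the table's reported F-score):
# Aronsky et al 2001: F = 24.7
# Chen et al 2017: F = 31.8
# Meystre and Haug 2005: F = 74.8
```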