J Am Med Inform Assoc. 2022 Dec 13;30(3):588–603. doi: 10.1093/jamia/ocac240

Table 2.

Outcomes and additional findings of the studies combining CDS and TM

Study Application field Comparison Quantitative outcome measure Outcomes estimates (%): A Sens Spec PPV NPV F Co K Additional findings
Aronsky et al 200168 Diagnosis Comparing a clinically valid gold standard (3-step diagnostic evaluation process and 8 independent physicians) versus the model, of ED patients whose CXR report was available during the encounter to extract pneumonia diagnosis from unstructured data, detect and prompt initiation of antibiotic treatment to reduce disease severity and mortality Sensitivity, specificity, PPV, NPV n/a 96 63 14.2 99.5 25a n/a The area under the receiver operating characteristic curve was 0.881 (95% CI, 0.822–0.925) for the Bayesian network alone and 0.916 (95% CI, 0.869–0.949) combined (P = .01)
Bozkurt et al 201657 Diagnosis Comparing reference standard decision support system with the NLP decision support systems of patients with mammography reports, to provide decision support as part of the workflow of producing the radiology report. Accuracy 97.58 n/a n/a n/a n/a n/a n/a The system performed extraction of imaging observations with their modifiers from text reports with precision = 94.9%, recall = 90.9%, and F-measure = 92%. They also compared the BI-RADS categories; accuracy rates of the Bayesian network outputs for each setting were calculated as 98.14% (history nodes included) and 98.15% (history nodes not included). The NLP-DSS and RS-DSS had closely matched probabilities, with a mean paired difference of 0.004 ± 0.025. The concordance correlation of these paired measures was 0.95.
Byrd et al 201258 Diagnosis Comparing the performance of the machine-learning and rule-based labelers of patients diagnosed with HF in primary care to identify heart failure with the Framingham diagnostic criteria to detect heart failure early. Precision, recall, F-score n/a 89.6 n/a 92.5 n/a 93.2 n/a Detection with the Framingham criteria had an F-score of 0.910. Encounter labeling achieves an F-score of 0.932.
Chen et al 201739 Prognosis Comparing clinical order suggestions against the “correct” set of orders that actually occurred within a follow-up verification time for the patient with an encounter from their initial presentation until hospital discharge to build probabilistic topic model representations of hospital admissions processes Sensitivity and PPV n/a 47 n/a 24 n/a 32a n/a Existing order sets predict clinical orders used within 24 h with area under the receiver operating characteristic curve 0.81, precision 16%, and recall 35%. This can be improved to 0.90, 24%, and 47% by using probabilistic topic models to summarize clinical data into up to 32 topics.
Cruz et al 201934b,c Prognosis Comparing the same patients in the primary care area of SESCAM with recommendations before and after implementation of the CDSS to improve adherence to clinical pathways and reduce clinical variability Qualitative n/a n/a n/a n/a n/a n/a n/a Adherence rates to clinical pathways improved in 8 out of 18 recommendations when the real-time CDSS was employed, achieving 100% compliance in some cases. The improvement was statistically significant in 3 cases (P < .05). Considering that this was a preliminary study, these results are promising.
Day et al 200754b Diagnosis Reading documentation to determine whether registry inclusion criteria were met and printing admission lists versus the tool of trauma patients or patients who died to automate the process of identifying trauma patients Qualitative n/a n/a n/a n/a n/a n/a n/a This program has made the job of identifying trauma patients much less complicated and time-consuming. It improved the efficiency and reduced the amount of time wasted using multiple formats to ensure that all patients who qualify for inclusion are found. The program also stores relevant patient information to a permanent electronic database.
Denny et al 201061 Prognosis Comparing the gold standard (expert physicians’ manual review of EMR notes) versus the tool of patients whose colonoscopy statuses were unknown to detect colorectal cancer screening status Recall and precision n/a TR: 91; CS: 82; CC: 93 n/a TR: 95; CS: 95; CC: 95 n/a TR: 93a; CS: 88a; CC: 94a n/a
Evans et al 201653b Diagnosis Comparing patients with HF treated using the new tool with HF patients who had received standard care at the same hospital before the tool was implemented, to help identify high-risk heart failure patients Sensitivity, specificity, PPV n/a 82.6–95.3 82.7–97.5 97.45 n/a 89–96a n/a
Fiszman et al 200069 Diagnosis Comparing SymText against 4 physicians, 2 different keyword searches, and 3 lay persons of patients with a primary ICD-9 hospital discharge diagnosis of bacterial pneumonia to find cases of CAP to support diagnosis and treatment. Recall, precision and specificity n/a 95 85 78 n/a 86a n/a
Friedlin and McDonald 200646 Diagnosis Comparing REX to the gold standard (specially trained human coders as well as an experienced physician) of patients who have had a chest x-ray with dictation reports to identify congestive heart failure Sensitivity, specificity, PPV, NPV n/a 100 100 95.4 100 98a n/a
Friedlin et al 200845 Diagnosis Comparing REX to the gold standard (human review) of patients with MRSA keywords to improve reporting of notifiable diseases with automated electronic laboratory MRSA reporting. Sensitivity, specificity, PPV, F-measure n/a 99.96 99.71 99.81 99.93 99.89 n/a REX identified over 2 times as many MRSA positive reports as the electronic lab system without NLP.
Garvin et al 201860 Diagnosis Comparing CHIEF versus reference standard and External Peer Review Program cases involving HF patients discharged from 8 VA medical centers to accurately automate the quality measure for inpatients with HF. Sensitivity and PPV n/a RS: 98.9; EPRP: 98.5 n/a 98.7 n/a RS: 99a; EPRP: 99a n/a Reference standard (RS); External Peer Review Program (EPRP). Of the 1083 patients available for the NLP system, the CHIEF evaluated and classified 100% of cases.
Hazlehurst et al 200556 Diagnosis Comparing MediClass versus the gold standard of patients who are smokers to detect clinical events in the medical record Specificity and sensitivity n/a 82 93 n/a n/a n/a n/a
Imler et al 201363 Diagnosis Comparing annotation by gastroenterologists (reference standard) versus the NLP system of veterans who had an index colonoscopy to extract meaningful information from free-text gastroenterology reports for secondary use, to connect the pathologic record that is generally disconnected from the reports Recall, precision, accuracy, and F-measure L: 97; S: 96; N: 84 L > 82; S > 92; N > 66 n/a L > 95; S > 95; N > 64 n/a L: 96; S: 96; N: 62 n/a
Imler et al 201464 Diagnosis Comparing NLP-based CDS surveillance intervals with those determined by paired, blinded, manual review of patients with an index colonoscopy for any indication except surveillance of a previous colorectal neoplasia to improve adherence to evidence-based practices and guidelines in endoscopy. Kappa statistic n/a n/a n/a n/a n/a n/a 74 Fifty-five reports differed between manual review and CDS recommendations. Of these, NLP error accounted for 54.5%, incomplete resection of adenomatous tissue accounted for 25.5%, and masses observed without biopsy findings of cancer accounted for 7.2%. NLP-based CDS surveillance intervals had higher levels of agreement with the standard than the level of agreement between experts.
Jain et al 199647b Diagnosis Comparing manual coding (gold standard) versus MedLEE of patients with culture-positive tuberculosis to find cases of tuberculosis in radiographic reports to identify eligible patients from the data. Sensitivity n/a With L: 78.9; w/o L: 85.4 n/a n/a n/a n/a n/a MedLEE agreed on the classification of 152/171 (88.9%) reports, 129/142 (90.8%) suspicious for TB and 23/29 (79.3%) not suspicious for TB, and 1072/1197 (89.6%) terms indicative of TB. Analysis showed that most of the discrepancies were caused by MedLEE not finding the location of the infiltrate. By ignoring the location (L) of the infiltrate, the agreement became 157/171 (91.8%) reports and 946/1026 (92.2%) terms.
Jones et al 201252b Diagnosis Screening tool sensitivity and specificity, as well as ICD-9 plus radiographic confirmation, were compared to physician review of ED patients with a chest x-ray or CT scan to identify patients with pneumonia Sensitivity, PPV, specificity, NPV n/a 61 96 52 97 56a n/a Among 41 true positive cases, ED physicians recognized and agreed with the tool in 39%. In only 6 cases did physicians proceed to complete the accompanying pneumonia decision support tool. Of the 39 false positive cases, the NLP incorrectly identified pneumonia in 74%. Of the 8 false negative cases, one was due to failure of the NLP to identify pneumonia.
Kalra et al 202041 Prognosis Comparing manual review versus the models (kNN, RF, DNN) of older men, including both primary and tertiary care indications, to enhance multispecialty CT and MRI protocol assignment quality and efficiency Precision and recall n/a kNN: 83.4; RF: 92.2; DNN: 91.5 n/a kNN: 77.5; RF: 81.7; DNN: 83.6 n/a kNN: 80a; RF: 86a; DNN: 87a n/a Baseline protocol assignment performance achieved weighted precision of 0.757–0.824. Simulating real-world deployment using combined thresholding techniques, the optimized deep neural network model assigned 69% of protocols in automation mode with a recall (weighted accuracy) of 95%. In the remaining 31% of cases, the model achieved 92% accuracy in CDS mode.
Karwa et al 202033 Prognosis Comparison of Colonoscopy Follow-up Recommendations between CDS algorithm and endoscopists of patients with a colonoscopy to assist clinicians to generate colonoscopy follow-up intervals based on the USMSTF guidelines Cohen’s Kappa n/a n/a n/a 69% n/a n/a 58.3 Discrepant recommendations by endoscopists were earlier than guidelines in 91% of cases.
Kim et al 201570 Diagnosis A 5-fold cross validation with the tool to measure the contribution of each of the 4 subsets (lexical, concept position, related concept, section) of patients with LVEF or LVSF to classify the contextual use of both quantitative and qualitative LVEF assessments in clinical narrative documents. Recall, precision and F-measure n/a QT: 95.6; QL: 94.2 n/a QT: 95.6; QL: 94.2 n/a QT: 95.6; QL: 94.2 n/a The experimental results showed that the classifiers achieved good performance, reaching 95.6% F1-measure for quantitative (QT) assessments and 94.2% F1-measure for qualitative (QL) assessments in a 5-fold cross validation evaluation.
Kivekäs et al 201640c Prognosis Comparing the review team versus SAS of adult epilepsy patients to test the functionality and validity of the previously defined triggers to describe the status of epilepsy patients’ well-being. Qualitative n/a n/a n/a n/a n/a n/a n/a In both medical and nursing data, the triggers described patients’ well-being comprehensively. The narratives showed that there was overlap between triggers.
Matheny et al 201259 Diagnosis Comparing manual reference versus algorithm of patients with at least one surgical admission to identify infectious symptoms F-measure, Fleiss’ kappa, precision, recall n/a ASD: 84; SDA: 62 n/a ASD: 91; SDA: 67 n/a ASD: 87; SDA: 64 n/a Among those instances in which the automated system matched the reference set determination for symptom, the system correctly detected 84.7% of positive assertions, 75.1% of negative assertions, and 0.7% of uncertain assertions.
Mendonça et al 200548 Diagnosis Comparing Clinicians’ judgments versus the tool of infants admitted at the NICU to identify pneumonia in newborns, to reduce manual monitoring Sensitivity, specificity and PPV n/a 71 99 7.5 n/a n/a n/a
Meystre and Haug 200537 Prognosis Comparing NLP tools versus a reference standard created with a chart review of patients admitted to a cardiovascular unit, with a stay of at least 48 h and a discharge diagnosis in the list of the 80 selected diagnoses, to extract medical problems from electronic clinical documents to maintain a complete and up-to-date problem list Precision and recall; Cohen’s Kappa n/a 74 n/a 75.6 n/a 75a 90 A custom data subset for MMTx was created, making it faster and significantly improving the recall to 0.896 with a non-significant reduction in precision.
Meystre and Haug 200836 Prognosis Comparing the control group (the standard electronic problem list) versus the intervention group (the Automated Problem List system) of inpatients of the 2 inpatients wards for at least 48h, > 18 years and not already enrolled in a previous phase of this study to improve the completeness and timeliness of an electronic problem list Sensitivity, specificity, PPV and NPV n/a 81.5 95.7 78.4 95.6 80a n/a
Nguyen et al 201931c Diagnosis, Therapy Comparing the resultant number of test results identified by the system for clinical review to the full set of test results that would have been manually reviewed of patients with ED encounters, to ensure important diagnoses are recognized and correct antibiotics are prescribed. PPV, sensitivity and F-measure n/a 94.3 n/a 85.8 n/a 89.8 n/a
Raja et al 201250b Diagnosis Pulmonary angiography use was compared before and after CDS implementation for ED patients who underwent CT pulmonary angiography, to decrease the use and increase the yield of CT for acute pulmonary embolism Accuracy, sensitivity, PPV, NPV and specificity 97.8 91.3 98.7 91.3 98.7 91.3 n/a Quarterly CT pulmonary angiography use increased 82.1% before CDSS implementation, from 14.5 to 26.4 examinations per 1000 patients (P<.0001). After CDSS implementation, quarterly use decreased 20.1%, from 26.4 to 21.1 examinations per 1000 patients (P=.0379).
Raja et al 201942b Prognosis Comparing the research assistant gold standard (3 physicians) versus the tool of patients < 50 years with a history of uncomplicated nephrolithiasis presenting to the ED to evaluate the impact of an appropriate use criterion for renal colic based on local best practice, implemented on the ED use of CT. Qualitative n/a n/a n/a n/a n/a n/a n/a The final sample included 467 patients (194 study site) before and 306 (88 study site) after AUC implementation. The study site’s CT of ureter rate decreased from 23.7% (46/194) to 14.8% (13/88) (P = .03) after implementation of the AUC. The rate at the control site remained unchanged, 49.8% (136/273) versus 48.2% (105/218) (P = .3).
Rosenthal et al 201951b Diagnosis Comparing UPMC Children’s Hospital of Pittsburgh earlier study results versus UPMC Hamot and Mercy of children < 2 years who triggered the EHR-based alert system to increase the number of young children identified as having injuries suspicious for physical abuse Qualitative n/a n/a n/a n/a n/a n/a n/a A total of 242 children triggered the system, 86 during the pre-intervention and 156 during the intervention. The number of children identified with suspicious injuries increased 4-fold during the intervention (P<.001). Compliance was 70% (7 of 10) in the pre-intervention period versus 50% (22 of 44) in the intervention, a change that was not statistically different (P=.55).
Shen et al 202043 Therapy Comparing Pre-existing workflow versus pilot workflow of patients undergoing outpatient endoscopy to decrease sedation-type order errors Precision, PPV, NPV, Sensitivity and Specificity. n/a 89.1 99.2 28.5 99.9 43a n/a
Smith et al 202155b Diagnosis Comparing the algorithm to manual review, to the expected performance of the model, and to prior work in adults, using CXR and other clinical data to recognize pneumonia in patients < 18 years. Sensitivity, specificity, PPV, F-measure n/a 89.9 94.9 78.1 n/a 83.5 n/a
Stultz et al 201944 Therapy Comparing different meningitis dosing alert triggers and dosing error rates between antimicrobials with and without meningitis order sentences of patients admitted to an inpatient pediatric service or the pediatric ED, to provide a meningitis-specific dosing alert for detecting meningitis management Sensitivity, PPV n/a 67.5 n/a 80.9 n/a 74a n/a Antimicrobials with meningitis order sentences had fewer dosing errors (19.8% vs 43.2%, P < .01).
Sung et al 201838c Prognosis Comparing MetaMap to IVT eligibility criteria of adult ED patients with AIS who presented within 3 h of onset but were not treated with IVT, to reduce errors in determining eligibility for IVT in stroke patients Precision, recall and F-score n/a PL: 81.2; DL: 97.2 n/a PL: 99.8; DL: 100 n/a PL: 89.5; DL: 98.6 n/a Users using the task-specific interface achieved a higher accuracy score than those using the current interface (91% vs 80%) in assessing the IVT eligibility criteria. The completion time between the interfaces was statistically similar (2.46 min vs 1.70 min).
Wadia et al 201735 Prognosis Comparing the gold standard (discussion between pathologists and oncologist) versus the tool of patients undergoing colonoscopy or surgery for colon lesions to identify cases that required close clinical follow-up Recall, specificity, precision and F-score n/a 100 98.5 95.2 n/a 97.5 n/a
Wagholikar et al 201265 Diagnosis Comparing the interpretation of free-text Pap reports of physician versus CDSS of patients with Pap reports to develop a computerized clinical decision support system for cervical cancer screening that can interpret free-text Pap reports. Qualitative n/a n/a n/a n/a n/a n/a n/a Evaluation revealed that the CDSS output the optimal screening recommendations for 73 out of 74 test patients, and it identified 2 cases for gynecology referral that were missed by the physician. The CDSS aided the physician to amend recommendations in 6 cases.
Wagholikar et al 201366 Diagnosis Comparing cervical cancer screening of care providers versus the CDSS of patients who had visited the Mayo clinic Rochester in March 2012 to ensure deployment readiness of the system. Accuracy 87 n/a n/a n/a n/a n/a n/a When the deficiencies were rectified, the system generated optimal recommendations for all failure cases, except one with incomplete documentation.
Watson et al 201162 Diagnosis Comparing the model versus reading and evaluating patient characteristics in the EHR notes of patients who are discharged with a principal diagnosis of HF to examine psychosocial characteristics as predictors of readmission in heart failure, to reduce hospital readmissions Sensitivity and specificity n/a >80 >80 n/a n/a n/a n/a Detection of 5 characteristics that were associated with an increased risk for hospital readmission.
Yang et al 201832c Diagnosis Comparing 4 machine learning algorithms, as well as our proposed model of patients with multiple diseases to assist diagnosis Accuracy, recall, precision and F-score 98.67 96.02 n/a 95.94 n/a 95.96 n/a
Zhou et al 201571 Diagnosis Comparing the gold standard (manual review) versus the tool of patients with a history of ischemic heart disease and hospitalized to identify patients with depression Sensitivity, specificity and PPV n/a HC: 92.4; IC: 77.4 n/a HC: 86.9; IC: 64.9 n/a HC: 89.6; IC: 70.6 n/a

A: accuracy; AIS: acute ischemic stroke; ASD: automated symptom detection; AUC: Area Under the Curve; CAP: community acquired pneumonia; CC: completed colonoscopies; CDSS: clinical decision support system; Co K: Cohen’s kappa; CS: colonoscopy status; CT: computed tomography; CXR: chest X-ray; DNN: deep neural network; DL: document level; ED: Emergency Department; EHR: Electronic Health Record; EMR: electronic medical record; EPRP: External Peer Review Program; F: F-score; HC: high confidence; HF: heart failure; IC: intermediate confidence; ICD-9: International Statistical Classification of Diseases and Related Health Problems; IVT: intravenous thrombolytic therapy; kNN: k-nearest neighbor; L: location; LVEF: left ventricular ejection fraction; LVSF: left ventricular systolic function; MMTx: MetaMap Transfer; MRI: magnetic resonance imaging; MRSA: methicillin-resistant Staphylococcus aureus; N: number; NICU: Neonatal Intensive Care Unit; NLP: natural language processing; NPV: negative predictive value; Pap: Papanicolaou; PL: phrase level; PPV: positive predictive value; Sens: sensitivity; QL: qualitative; QT: quantitative; RF: random forest; RS: reference standard; S: size; SDA: symptom detection with assertion; SESCAM: Servicio de Salud de Castilla-La Mancha; Spec: specificity; TB: tuberculosis; TR: timing references; UPMC: University of Pittsburgh Medical Center; USMSTF: US Multisociety Task Force on Colorectal Cancer; VA: Veterans Affairs.

a These F-scores were calculated according to the formula: 2*(sensitivity * PPV)/(sensitivity + PPV); a worked check of this calculation follows these footnotes.

b These studies were implemented in clinical practice.

c These studies were not performed in an English-speaking country.
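
As a check on footnote a, the brief Python sketch below recomputes the derived F-scores from the sensitivity and PPV values reported in the table rows above. The function name and the three example studies are chosen here for illustration only and are not part of the original review.

```python
# Minimal sketch: recompute the F-scores marked with footnote "a" in Table 2
# from the reported sensitivity (recall) and PPV (precision), using
# F = 2 * (sensitivity * PPV) / (sensitivity + PPV).

def f_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity and PPV, expressed in percent."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)

# Sensitivity and PPV pairs taken from the table rows above.
examples = {
    "Aronsky et al 2001": (96.0, 14.2),     # table reports F = 25a
    "Chen et al 2017": (47.0, 24.0),        # table reports F = 32a
    "Meystre and Haug 2005": (74.0, 75.6),  # table reports F = 75a
}

for study, (sens, ppv) in examples.items():
    print(f"{study}: F = {f_score(sens, ppv):.1f}")
# Expected output (each value rounds to the table's reported F-score):
# Aronsky et al 2001: F = 24.7
# Chen et al 2017: F = 31.8
# Meystre and Haug 2005: F = 74.8
```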