Table 2.

Study | Application field | Comparison | Quantitative outcome measure | A (%) | Sens (%) | Spec (%) | PPV (%) | NPV (%) | F (%) | Co K (%) | Additional findings |
---|---|---|---|---|---|---|---|---|---|---|---|
Aronsky et al 2001⁶⁸ | Diagnosis | Comparing a clinically valid gold standard (3-step diagnostic evaluation process and 8 independent physicians) versus the model, of ED patients whose CXR report was available during the encounter, to extract pneumonia diagnosis from unstructured data and to detect and prompt initiation of antibiotic treatment to reduce disease severity and mortality | Sensitivity, specificity, PPV, NPV | n/a | 96 | 63 | 14.2 | 99.5 | 25ᵃ | n/a | The area under the receiver operating characteristic curve was 0.881 (95% CI, 0.822–0.925) for the Bayesian network alone and 0.916 (95% CI, 0.869–0.949) combined (P = .01) |
Bozkurt et al 2016⁵⁷ | Diagnosis | Comparing a reference standard decision support system with the NLP decision support system of patients with mammography reports, to provide decision support as part of the workflow of producing the radiology report | Accuracy | 97.58 | n/a | n/a | n/a | n/a | n/a | n/a | The system extracted imaging observations with their modifiers from text reports with precision = 94.9%, recall = 90.9%, and F-measure = 92%. Comparing the BI-RADS categories, the accuracy of the Bayesian network outputs was 98.14% (history nodes included) and 98.15% (history nodes not included). The NLP-DSS and RS-DSS had closely matched probabilities, with a mean paired difference of 0.004 ± 0.025. The concordance correlation of these paired measures was 0.95. |
Byrd et al 2012⁵⁸ | Diagnosis | Comparing the performance of the machine-learning and rule-based labelers of patients diagnosed with HF in primary care, to identify heart failure with the Framingham diagnostic criteria and detect heart failure early | Precision, recall, F-score | n/a | 89.6 | n/a | 92.5 | n/a | 93.2 | n/a | Detection with the Framingham criteria had an F-score of 0.910. Encounter labeling achieved an F-score of 0.932. |
Chen et al 2017³⁹ | Prognosis | Comparing clinical order suggestions against the "correct" set of orders that actually occurred within a follow-up verification time for the patient, from initial presentation until hospital discharge, to build probabilistic topic model representations of hospital admission processes | Sensitivity and PPV | n/a | 47 | n/a | 24 | n/a | 32ᵃ | n/a | Existing order sets predict clinical orders used within 24 h with area under the receiver operating characteristic curve 0.81, precision 16%, and recall 35%. This can be improved to 0.90, 24%, and 47% by using probabilistic topic models to summarize clinical data into up to 32 topics. |
Cruz et al 2019³⁴ᵇ,ᶜ | Prognosis | Comparing the same patients in the primary care area of SESCAM with recommendations before and after implementation of the CDSS, to improve adherence to clinical pathways and reduce clinical variability | Qualitative | n/a | n/a | n/a | n/a | n/a | n/a | n/a | Adherence rates to clinical pathways improved in 8 out of 18 recommendations when the real-time CDSS was employed, achieving 100% compliance in some cases. The improvement was statistically significant in 3 cases (P < .05). Considering that this was a preliminary study, these results are promising. |
Day et al 2007⁵⁴ᵇ | Diagnosis | Reading documentation to determine whether registry inclusion criteria were met and printing admission lists versus the tool, of trauma patients or patients who died, to automate the process of identifying trauma patients | Qualitative | n/a | n/a | n/a | n/a | n/a | n/a | n/a | This program has made the job of identifying trauma patients much less complicated and time-consuming. It improved efficiency and reduced the amount of time wasted using multiple formats to ensure that all patients who qualify for inclusion are found. The program also stores relevant patient information in a permanent electronic database. |
Denny et al 2010⁶¹ | Prognosis | Comparing the gold standard (expert physicians' manual review of EMR notes) versus the tool of patients whose colonoscopy status was unknown, to detect colorectal cancer screening status | Recall and precision | n/a | | n/a | | n/a | n/a | n/a | – |
Evans et al 2016⁵³ᵇ | Diagnosis | Comparing patients with HF treated using the new tool with HF patients who had received standard care at the same hospital before the tool was implemented, to help identify high-risk heart failure patients | Sensitivity, specificity, PPV | n/a | 82.6–95.3 | 82.7–97.5 | 97.45 | n/a | 89–96ᵃ | n/a | – |
Fiszman et al 2000⁶⁹ | Diagnosis | Comparing SymText against 4 physicians, 2 different keyword searches, and 3 lay persons, of patients with a primary ICD-9 hospital discharge diagnosis of bacterial pneumonia, to find cases of CAP to support diagnosis and treatment | Recall, precision and specificity | n/a | 95 | 85 | 78 | n/a | 86ᵃ | n/a | – |
Friedlin and McDonald 2006⁴⁶ | Diagnosis | Comparing REX to the gold standard (specially trained human coders as well as an experienced physician) of patients who had a chest x-ray with dictation reports, to identify congestive heart failure | Sensitivity, specificity, PPV, NPV | n/a | 100 | 100 | 95.4 | 100 | 98ᵃ | n/a | – |
Friedlin et al 2008⁴⁵ | Diagnosis | Comparing REX to the gold standard (human review) of patients with MRSA keywords, to improve reporting of notifiable diseases with automated electronic laboratory MRSA reporting | Sensitivity, specificity, PPV, F-measure | n/a | 99.96 | 99.71 | 99.81 | 99.93 | 99.89 | n/a | REX identified over 2 times as many MRSA-positive reports as the electronic lab system without NLP. |
Garvin et al 2018⁶⁰ | Diagnosis | Comparing CHIEF versus a reference standard of External Peer Review Program cases involving HF patients discharged from 8 VA medical centers, to accurately automate the quality measure for inpatients with HF | Sensitivity and PPV | n/a | | n/a | 98.7 | n/a | n/a | n/a | The reference standard (RS) was the External Peer Review Program (EPRP). Of the 1083 patients available for the NLP system, the CHIEF evaluated and classified 100% of cases. |
Hazlehurst et al 2005⁵⁶ | Diagnosis | Comparing MediClass versus the gold standard of patients who are smokers, to detect clinical events in the medical record | Specificity and sensitivity | n/a | 82 | 93 | n/a | n/a | n/a | n/a | – |
Imler et al 2013⁶³ | Diagnosis | Comparing annotation by gastroenterologists (reference standard) versus the NLP system of veterans who had an index colonoscopy, to extract meaningful information from free-text gastroenterology reports for secondary use and to connect the pathology record, which is generally disconnected from the reports | Recall, precision, accuracy, and F-measure | | | n/a | | n/a | | n/a | – |
Imler et al 2014⁶⁴ | Diagnosis | Comparing NLP-based CDS surveillance intervals with those determined by paired, blinded, manual review of patients with an index colonoscopy for any indication except surveillance of a previous colorectal neoplasia, to improve adherence to evidence-based practices and guidelines in endoscopy | Kappa statistic | n/a | n/a | n/a | n/a | n/a | n/a | 74 | Fifty-five reports differed between manual review and CDS recommendations. Of these, NLP error accounted for 54.5%, incomplete resection of adenomatous tissue accounted for 25.5%, and masses observed without biopsy findings of cancer accounted for 7.2%. NLP-based CDS surveillance intervals had higher levels of agreement with the standard than the level of agreement between experts. |
Jain et al 1996⁴⁷ᵇ | Diagnosis | Comparing manual coding (gold standard) versus MedLEE of patients with culture-positive tuberculosis, to find cases of tuberculosis in radiographic reports and identify eligible patients from the data | Sensitivity | n/a | | n/a | n/a | n/a | n/a | n/a | MedLEE agreed on the classification of 152/171 (88.9%) reports: 129/142 (90.8%) suspicious for TB and 23/29 (79.3%) not suspicious for TB; and on 1072/1197 (89.6%) terms indicative of TB. Analysis showed that most of the discrepancies were caused by MedLEE not finding the location of the infiltrate. By ignoring the location (L) of the infiltrate, the agreement became 157/171 (91.8%) reports and 946/1026 (92.2%) terms. |
Jones et al 2012⁵²ᵇ | Diagnosis | Comparing screening tool sensitivity and specificity, as well as ICD-9 plus radiographic confirmation, to physician review of ED patients with a chest x-ray or CT scan, to identify patients with pneumonia | Sensitivity, PPV, specificity, NPV | n/a | 61 | 96 | 52 | 97 | 56ᵃ | n/a | Among 41 true positive cases, ED physicians recognized and agreed with the tool in 39%. In only 6 cases did physicians proceed to complete the accompanying pneumonia decision support tool. Of the 39 false positive cases, the NLP incorrectly identified pneumonia in 74%. Of the 8 false negative cases, one was due to failure of the NLP to identify pneumonia. |
Kalra et al 2020⁴¹ | Prognosis | Comparing manual review versus the models (kNN, RF, DNN) of older men, including both primary and tertiary care indications, to enhance multispecialty CT and MRI protocol assignment quality and efficiency | Precision and recall | n/a | | n/a | | n/a | n/a | n/a | Baseline protocol assignment performance achieved weighted precision of 0.757–0.824. Simulating real-world deployment using combined thresholding techniques, the optimized deep neural network model assigned 69% of protocols in automation mode with a recall (weighted accuracy) of 95%. In the remaining 31% of cases, the model achieved 92% accuracy in CDS mode. |
Karwa et al 2020³³ | Prognosis | Comparing colonoscopy follow-up recommendations between the CDS algorithm and endoscopists of patients with a colonoscopy, to assist clinicians in generating colonoscopy follow-up intervals based on the USMSTF guidelines | Cohen's kappa | n/a | n/a | n/a | 69 | n/a | n/a | 58.3 | Discrepant recommendations by endoscopists were earlier than guidelines in 91% of cases. |
Kim et al 2015⁷⁰ | Diagnosis | A 5-fold cross-validation with the tool to measure the contribution of each of the 4 feature subsets (lexical, concept position, related concept, section) of patients with LVEF or LVSF, to classify the contextual use of both quantitative and qualitative LVEF assessments in clinical narrative documents | Recall, precision and F-measure | n/a | | n/a | | n/a | | n/a | The experimental results showed that the classifiers achieved good performance, reaching 95.6% F1-measure for quantitative (QT) assessments and 94.2% F1-measure for qualitative (QL) assessments in a 5-fold cross-validation evaluation. |
Kivekäs et al 2016⁴⁰ᶜ | Prognosis | Comparing the review team versus SAS of adult epilepsy patients, to test the functionality and validity of the previously defined triggers describing the status of epilepsy patients' well-being | Qualitative | n/a | n/a | n/a | n/a | n/a | n/a | n/a | In both medical and nursing data, the triggers described patients' well-being comprehensively. The narratives showed that there was overlap among triggers. |
Matheny et al 2012⁵⁹ | Diagnosis | Comparing a manual reference versus the algorithm of patients with at least one surgical admission, to identify infectious symptoms | F-measure, Fleiss' kappa, precision, recall | n/a | | n/a | | n/a | | n/a | Among those instances in which the automated system matched the reference set determination for symptom, the system correctly detected 84.7% of positive assertions, 75.1% of negative assertions, and 0.7% of uncertain assertions. |
Mendonça et al 2005⁴⁸ | Diagnosis | Comparing clinicians' judgments versus the tool of infants admitted to the NICU, to identify pneumonia in newborns and reduce manual monitoring | Sensitivity, specificity and PPV | n/a | 71 | 99 | 7.5 | n/a | n/a | n/a | – |
Meystre and Haug 2005³⁷ | Prognosis | Comparing NLP tools versus a reference standard created by chart review of patients admitted to a cardiovascular unit, with a stay of at least 48 h and a discharge diagnosis among the 80 selected diagnoses, to extract medical problems from electronic clinical documents and maintain a complete and up-to-date problem list | | n/a | 74 | n/a | 75.6 | n/a | 75ᵃ | 90 | A custom data subset for MMTx was created, making it faster and significantly improving the recall to 0.896 with a non-significant reduction in precision. |
Meystre and Haug 2008³⁶ | Prognosis | Comparing the control group (the standard electronic problem list) versus the intervention group (the Automated Problem List system) of inpatients of 2 inpatient wards for at least 48 h, aged > 18 years and not already enrolled in a previous phase of this study, to improve the completeness and timeliness of an electronic problem list | Sensitivity, specificity, PPV and NPV | n/a | 81.5 | 95.7 | 78.4 | 95.6 | 80ᵃ | n/a | – |
Nguyen et al 2019³¹ᶜ | Diagnosis, Therapy | Comparing the number of test results identified by the system for clinical review to the full set of test results that would have been manually reviewed, of patients with ED encounters, to ensure important diagnoses are recognized and correct antibiotics are prescribed | PPV, sensitivity and F-measure | n/a | 94.3 | n/a | 85.8 | n/a | 89.8 | n/a | – |
Raja et al 2012⁵⁰ᵇ | Diagnosis | Comparing CT pulmonary angiography use before and after CDS implementation among ED patients who underwent CT pulmonary angiography, to decrease the use and increase the yield of CT for acute pulmonary embolism | Accuracy, sensitivity, PPV, NPV and specificity | 97.8 | 91.3 | 98.7 | 91.3 | 98.7 | 91.3 | n/a | Quarterly CT pulmonary angiography use increased 82.1% before CDSS implementation, from 14.5 to 26.4 examinations per 1000 patients (P < .0001). After CDSS implementation, quarterly use decreased 20.1%, from 26.4 to 21.1 examinations per 1000 patients (P = .0379). |
Raja et al 2019⁴²ᵇ | Prognosis | Comparing a research assistant gold standard (3 physicians) versus the tool of patients < 50 years with a history of uncomplicated nephrolithiasis presenting to the ED, to evaluate the impact of an appropriate use criterion for renal colic, based on local best practice, on the ED use of CT | Qualitative | n/a | n/a | n/a | n/a | n/a | n/a | n/a | The final sample included 467 patients (194 at the study site) before and 306 (88 at the study site) after AUC implementation. The study site's CT of ureter rate decreased from 23.7% (46/194) to 14.8% (13/88) (P = .03) after implementation of the AUC. The rate at the control site remained unchanged, 49.8% (136/273) versus 48.2% (105/218) (P = .3). |
Rosenthal et al 2019⁵¹ᵇ | Diagnosis | Comparing UPMC Children's Hospital of Pittsburgh earlier study results versus UPMC Hamot and Mercy of children < 2 years who triggered the EHR-based alert system, to increase the number of young children identified as having injuries suspicious for physical abuse | Qualitative | n/a | n/a | n/a | n/a | n/a | n/a | n/a | A total of 242 children triggered the system: 86 during the pre-intervention period and 156 during the intervention. The number of children identified with suspicious injuries increased 4-fold during the intervention (P < .001). Compliance was 70% (7 of 10) in the pre-intervention period versus 50% (22 of 44) in the intervention, a change that was not statistically different (P = .55). |
Shen et al 2020⁴³ | Therapy | Comparing the pre-existing workflow versus the pilot workflow of patients undergoing outpatient endoscopy, to decrease sedation-type order errors | Precision, PPV, NPV, sensitivity and specificity | n/a | 89.1 | 99.2 | 28.5 | 99.9 | 43ᵃ | n/a | – |
Smith et al 2021⁵⁵ᵇ | Diagnosis | Comparing the algorithm to manual review, to the expected performance of the model, and to prior work in adults using CXR and other clinical data, to recognize pneumonia in patients < 18 years | Sensitivity, specificity, PPV, F-measure | n/a | 89.9 | 94.9 | 78.1 | n/a | 83.5 | n/a | – |
Stultz et al 2019⁴⁴ | Therapy | Comparing different meningitis dosing alert triggers and dosing error rates between antimicrobials with and without meningitis order sentences, of patients admitted to an inpatient pediatric service or the pediatric ED, to provide a meningitis-specific dosing alert for detecting meningitis management | Sensitivity, PPV | n/a | 67.5 | n/a | 80.9 | n/a | 74ᵃ | n/a | Antimicrobials with meningitis order sentences had fewer dosing errors (19.8% vs 43.2%, P < .01). |
Sung et al 2018³⁸ᶜ | Prognosis | Comparing MetaMap to IVT eligibility criteria of adult ED patients with AIS who presented within 3 h of onset but were not treated with IVT, to identify errors in determining eligibility for IVT in stroke patients | Precision, recall and F-score | n/a | | n/a | | n/a | | n/a | Users of the task-specific interface achieved a higher accuracy score than those using the current interface (91% vs 80%) in assessing the IVT eligibility criteria. The completion time between the interfaces was statistically similar (2.46 min vs 1.70 min). |
Wadia et al 2017³⁵ | Prognosis | Comparing the gold standard (discussion between pathologists and oncologists) versus the tool of patients undergoing colonoscopy or surgery for colon lesions, to identify cases that required close clinical follow-up | Recall, specificity, precision and F-score | n/a | 100 | 98.5 | 95.2 | n/a | 97.5 | n/a | – |
Wagholikar et al 2012⁶⁵ | Diagnosis | Comparing the physician's interpretation of free-text Pap reports versus the CDSS of patients with Pap reports, to develop a computerized clinical decision support system for cervical cancer screening that can interpret free-text Pap reports | Qualitative | n/a | n/a | n/a | n/a | n/a | n/a | n/a | Evaluation revealed that the CDSS output the optimal screening recommendations for 73 out of 74 test patients, and it identified 2 cases for gynecology referral that were missed by the physician. The CDSS aided the physician in amending recommendations in 6 cases. |
Wagholikar et al 2013⁶⁶ | Diagnosis | Comparing cervical cancer screening by care providers versus the CDSS of patients who had visited the Mayo Clinic in Rochester in March 2012, to ensure deployment readiness of the system | Accuracy | 87 | n/a | n/a | n/a | n/a | n/a | n/a | When the deficiencies were rectified, the system generated optimal recommendations for all failure cases, except one with incomplete documentation. |
Watson et al 2011⁶² | Diagnosis | Comparing the model versus reading and evaluating patient characteristics in the EHR notes of patients discharged with a principal diagnosis of HF, to examine psychosocial characteristics as predictors of heart failure readmission and thereby reduce hospital readmissions | Sensitivity and specificity | n/a | >80 | >80 | n/a | n/a | n/a | n/a | Detection of 5 characteristics that were associated with an increased risk of hospital readmission. |
Yang et al 2018³²ᶜ | Diagnosis | Comparing 4 machine learning algorithms, as well as the authors' proposed model, of patients with multiple diseases, to assist diagnosis | Accuracy, recall, precision and F-score | 98.67 | 96.02 | n/a | 95.94 | n/a | 95.96 | n/a | – |
Zhou et al 2015⁷¹ | Diagnosis | Comparing the gold standard (manual review) versus the tool of patients with a history of ischemic heart disease who were hospitalized, to identify patients with depression | Sensitivity, specificity and PPV | n/a | | n/a | | n/a | | n/a | – |
A: accuracy; AIS: acute ischemic stroke; ASD: automated symptom detection; AUC: area under the curve; CAP: community-acquired pneumonia; CC: completed colonoscopies; CDSS: clinical decision support system; Co K: Cohen's kappa; CS: colonoscopy status; CT: computed tomography; CXR: chest x-ray; DL: document level; DNN: deep neural network; ED: emergency department; EHR: electronic health record; EMR: electronic medical record; EPRP: External Peer Review Program; F: F-score; HC: high confidence; HF: heart failure; IC: intermediate confidence; ICD-9: International Classification of Diseases, Ninth Revision; IVT: intravenous thrombolytic therapy; kNN: k-nearest neighbor; L: location; LVEF: left ventricular ejection fraction; LVSF: left ventricular systolic function; MMTx: MetaMap Transfer; MRI: magnetic resonance imaging; MRSA: methicillin-resistant Staphylococcus aureus; N: number; NICU: neonatal intensive care unit; NLP: natural language processing; NPV: negative predictive value; Pap: Papanicolaou; PL: phrase level; PPV: positive predictive value; QL: qualitative; QT: quantitative; RF: random forest; RS: reference standard; S: size; SDA: symptom detection with assertion; Sens: sensitivity; SESCAM: Servicio de Salud de Castilla-La Mancha; Spec: specificity; TB: tuberculosis; TR: timing references; UPMC: University of Pittsburgh Medical Center; USMSTF: US Multi-Society Task Force on Colorectal Cancer; VA: Veterans Affairs.
ᵃThese F-scores were calculated according to the formula: 2 × (sensitivity × PPV)/(sensitivity + PPV).
ᵇThese studies were implemented in clinical practice.
ᶜThese studies were not performed in an English-speaking country.
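The formula in footnote ᵃ is the harmonic mean of sensitivity (recall) and PPV (precision). As a minimal sketch of that calculation in Python (the function name `f_score` is ours; the spot-check values are taken from the table rows above), the derived F-scores can be reproduced as follows:

```python
def f_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision),
    as defined in footnote a; inputs and output are percentages."""
    return 2 * (sensitivity * ppv) / (sensitivity + ppv)

# Spot-checks against rows carrying footnote a:
print(round(f_score(96, 14.2)))  # Aronsky et al 2001: ~25
print(round(f_score(47, 24)))    # Chen et al 2017: ~32
print(round(f_score(95, 78)))    # Fiszman et al 2000: ~86
```

Rounded to whole percentages, these reproduce the footnoted values of 25, 32, and 86 in the F column.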