2021 Feb 22;11(2):372. doi: 10.3390/diagnostics11020372

Table 2.

The 44 reviewed articles, reporting for each the validation method, the machine learning (ML) models compared, the best-performing model, the metrics used by the authors to evaluate the models, the study results, the most relevant laboratory features, and issues with the study.

Reference | Validation | Comparison | Best Performer | BP’s family | Metrics Used | Results | Most Important Laboratory Features for the Model | Issues/Notes
Awad et al. (2017) [18] CV RF, DT, NB, PART, Scores (SOFA, SAPS-I, APACHE-II, NEWS, qSOFA) RF Trees AUROC RF best performance (VS subset) predicting hospital mortality: 0.90 ± 0.01 AUROC

AUROC
RF (15 variables) at 6 h: 0.82 ± 0.04
SAPS at 24 h (best performer among scores): 0.650 ± 0.012
Vital Signs, age, serum urea nitrogen, respiratory rate max, heart rate max, heart rate min, creatinine max, care unit name, potassium min, GCS min and systolic blood pressure min Performance metrics for comparison referred to cross-validation results
Escobar et al. (2017) [19] CV 3 LoR models, Zilberberg model LoR (automated model) Regression AUROC, pseudo-R2, Sensitivity, Specificity, PPV, NPV, NNE, NRI, IDI AUROC; R2 Performance metrics for comparison referred to cross-validation results
Age ≥ 65 years 0.546; −0.1131
Basic model 0.591; −0.0910
Zilberberg model 0.591; −0.0875
Enhanced model 0.587; −0.0924
Automated model 0.605; −0.1033
Richardson and Lidbury (2017) [20] CV RF (variables selection) + SVM *** NE SVM *** AUROC, F1, Sensitivity, Specificity, Precision For both HBV and HCV, 3 balancing methods and 2 feature selectors were tested, showing how they can change SVM performance HBV: ALT, Age and Sodium
HCV: Age, ALT and Urea
Zhang et al. (2017) [21] CV GBT *** NE Ensemble *** RI, H-statistic (features) AUROC, Sensitivity, Specificity (model) WBC count ≥ 15 × 10⁹/L (RI: 49.47, p < 0.001), spinal cord involvement (RI: 26.62, p < 0.001), spinal nerve roots involvement (RI: 10.34, p < 0.001), hyperglycaemia (RI: 3.40, p < 0.001), brain or spinal meninges involvement (RI: 2.45, p = 0.003), EV-A71 infection (RI: 2.24, p < 0.001).
Interaction between elevated WBC count and hyperglycaemia (H statistic: 0.231, 95% CI: 0–0.262, p = 0.031), between spinal cord involvement and duration of fever (H statistic: 0.291, 95% CI: 0.035–0.326, p = 0.035), and between brainstem involvement and body temperature (H statistic: 0.313, 95% CI: 0–0.273, p = 0.017)

GBT model: 92.3% prediction accuracy, AUROC 0.985, Sensitivity 0.85, Specificity 0.97
Takeuchi et al. (2017) [22] OOB Scores (Gunma Score, Kurume Score and Osaka Score), RF RF Trees AUROC, Sensitivity, Specificity, PPV, NPV, Out-of-Bag error estimation RF: AUROC 0.916, Sensitivity 79.7%, Specificity 87.3%, PPV 85.2%, NPV 82.1%, OOB error rate 15.5%
Sensitivity and Specificity of the scores: 69.8% and 60.0% (GS); 60.6% and 55.4% (KS); 24.1% and 77.0% (OS).
PPV (28.2%–45.1%), NPV (82.0%–86.8%)
Aspartate aminotransferase, lactate dehydrogenase concentrations, percent neutrophils
Performance metrics for comparison referred to cross-validation results
Hernandez et al. (2017) [23] CV DT, RF, SVM, Naive Bayes SVM SVM AUROC, AUPRC, Sensitivity, Specificity, PPV, NPV, TP, FP, TN, FN SVM with SMOTE sampling method and considering 6 features obtained the best results
AUROC, AUPRC, Sensitivity, Specificity
0.830, 0.884, 0.747, 0.912
Bertsimas et al. (2018) [24] VS LoR, Regularized LoR, Optimal Classification Tree, CART, GBT Optimal Classification Tree * Trees Accuracy (threshold 50%), PPV at Sensitivity of 0.6, AUC Optimal Classification Tree results
60-day mortality, 90-day, 120-day
Accuracy: 94.9, 93.3, 86.1
PPV: 20.2, 27.5, 43.1
AUC: 0.86, 0.84, 0.83
Albumin, change in weight, Pulse, WBC count, Haematocrit according to the kind of cancer The validation set was used only for NN, KNN, and SVM
Jeong et al. (2018) [25] CV CERT, CLEAR, PACE, RF, L1-regularized LoR, SVM, NN RF Trees AUROC, F1, Sensitivity, Specificity, PPV, NPV ML models produced higher averaged F1-measures (0.629–0.709) and AUROC (0.737–0.816) than the original methods (AUROC 0.020–0.597; F1 0.475–0.563)
Rosenbaum and Baron (2018) [26] NA Univariate models, LoR, SVM SVM SVM AUC, Specificity, PPV AUROC on testing set (simulated WIBT)
best univariate (BUN): 0.84 (interquartile range 0.83–0.84)
SVM (difference and values): 0.97 (0.96–0.97)
LoR (Difference and values): 0.93
Difference and Values together Data on the comparison among the ML models not available
Ge et al. (2018) [27] CV RNN-LSTM + LoR vs LoR RNN-LSTM DL AUROC, TP, FP AUROC cross-validation, AUROC testing set
Logistic Regression: 0.7751, 0.7412
RNN-LSTM model: 0.8076, 0.7614
Associated with ICU Mortality:
Do Not Reanimate, Prednisolone, Disseminated intravascular coagulation;
Associated with ICU Survival:
Arterial blood gas pH, Oxygen saturation, Pulse
Jonas et al. (2018) [28] CV LoR (LASSO), RF *** NE NE NE LASSO identified as the most predictive of a positive response to vasoreactivity test: 6-MWD, diabetes, HDL-C, creatinine, right atrial pressure, and cardiac index
RF identified as the most predictive: NT-proBNP, HDL-C, creatinine, right atrial pressure, and cardiac index

6-MWD, HDL-C, hs-CRP, and creatinine levels best discriminated between long-term responders and non-responders
Performance metrics for comparison referred to cross-validation results
Tool available online
Sahni et al. (2018) [29] NA LoR, RF RF Trees AUROC AUROC
RF (demographics, physiological, lab, all comorbidities) 0.85 (0.84–0.86)
LoR (demographics, physiological, lab, all comorbidities) 0.91 (0.90–0.92)
Age, BUN, platelet count, haemoglobin, creatinine, systolic blood pressure, BMI, and pulse oximetry readings Performance metrics for comparison referred to cross-validation results
Rahimian et al. (2018) [30] CV CPH, RF, GBC GBC Ensemble AUROC AUROC (CI95), internal validation
variables, CPH, RF, GBC
QA: 0.740 (0.739, 0.741), 0.752 (0.751, 0.753), 0.779 (0.777, 0.781)
T: 0.805 (0.804, 0.806), 0.825 (0.824, 0.826), 0.848 (0.847, 0.849)
external validation
QA: 0.736, 0.736, 0.796
T: 0.788, 0.810, 0.826
age, cholesterol ratio, haemoglobin, and platelets, frequency of lab tests, systolic blood pressure, number of admissions during the last year Tool available online
Foysal et al. (2019) [31] CV Regression analysis and SVM *** NE SVM R2 score, Standard error of detection, Accuracy Accuracy: 98% NE Performance metrics for comparison referred to cross-validation results
Xu et al. (2019) [32] CV L1 Logistic Regression, Regress and Round, Naive Bayes, NN-MLP, DT, RF, AdaBoost, XGBoost. XGBoost, RF NA AUROC, Sensitivity, Specificity, NPV, PPV Mean AUROC: 0.77 on testing set
AUROC > 0.90 on 22 lab tests out of 43
On external validation: results were different according to lab test considered
NE DL missed Albumin as OS predictor
Burton et al. (2019) [33] CV Heuristic model (LoR) with microscopy thresholds, NN, RF, XGBoost XGBoost * Ensemble AUROC, Accuracy, PPV, NPV, Sensitivity, Specificity, Relative Workload Reduction (%) AUC Accuracy PPV NPV Sensitivity (%) Specificity (%) Relative Workload Reduction (%)
Pregnant patients 0.828, 26.94, 94.6 [±0.56], 26.84 [±1.88], 25.29 [±0.92]
Children (<11 years) 0.913, 62.00, 94.8 [±0.88], 55.00 [±2.12], 46.24 [±1.48]
Pregnant patients 0.894, 71.65, 95.3 [±0.24], 60.93 [±0.65], 43.38 [±0.41]
Combined performance 0.749, 65.65, 47.64 [±0.51], 97.14 [±0.28], 95.2 [±0.22], 60.93 [±0.60], 41.18 [±0.39]
WBC count, Bacterial count, Age, Epithelial cell count, RBC count
Fillmore et al. (2019) [34] CV L1 LoR (LASSO), SVM, RF RF Trees Accuracy LabTest: LR, SVM, RF
ALP: 0.98, 0.97, 0.98
ALT: 0.98, 0.94, 0.92
ALB: 0.97, 0.92, 0.98
HDLC: 0.98, 0.91, 0.98
Na: 0.97, 0.98, 0.99
Mg: 0.97, 0.95, 0.99
HGB: 0.97, 0.95, 0.99
Precise performance data on the testing set not provided
Zimmerman et al. (2019) [35] CV LiR, LoR, RF, NN-MLP NN-MLP DL AUROC, Accuracy, Sensitivity, Specificity, PPV, NPV LiR Regression task: RMSEV
Linear Backward Selection Model 0.224
Linear All Variables Model 0.224

AUROC, Accuracy, Sensitivity, Specificity, PPV, NPV
LR, Backward Selection Model: 0.780, 0.724, 0.697, 0.730, 0.337, 0.924
LR, All Variables Model: 0.783, 0.729, 0.698, 0.736, 0.342, 0.925
RF, Backward Selection Model: 0.772, 0.739, 0.660, 0.754, 0.346, 0.918
RF, All Variables Model: 0.779, 0.742, 0.673, 0.756, 0.352, 0.921
MLP, Backward Selection Model: 0.792, 0.744, 0.684, 0.756, 0.356, 0.924
MLP, All Variables Model: 0.796, 0.743, 0.694, 0.753, 0.357, 0.926
Sex, age, ethnicity, Hypoxemia, mechanical ventilation, Coagulopathy, calcium, potassium, creatinine level Performance metrics for comparison referred to cross-validation results
Sharafoddini et al. (2019) [36] CV LASSO for choosing most important variables.
DT, LoR, RF, SAPS-II (score)
Logistic Regression Regression AUROC Including indicators improved the AUROC in all modelling techniques, on average by 0.0511; the maximum improvement was 0.1209 BUN, RDW, anion gap all 3 days.
day 1: TBil, phosphate, Ca, and Lac
day 2&3: Lac, BE, PO2, and PCO2
day 3: PTT and pH
Matsuo et al. (2019) [37] CV NN, CPH, CoxBoost, CoxLasso, Random Survival Forest NN DL Concordance Index, Mean Absolute Error Progression-free survival (PFS):
Concordance index, Mean absolute error (mean ± standard error)
CPH: 0.784 ± 0.069, 316.2 ± 128.3
DL: 0.795 ± 0.066, 29.3 ± 3.4
Overall survival (OS):
CPH: 0.607 ± 0.039, 43.6 ± 4.3
DL: 0.616 ± 0.041, 30.7 ± 3.6
PFS: BUN, Creatinine, Albumin,
(Only DL) WBC, Platelet, Bicarbonate, Haemoglobin

OS: BUN
(only DL) Bicarbonate
(only CPH) Platelet, Creatinine, Albumin
Yang et al. (2019) [38] OOB RF *** NE Trees *** OOB Predicting Outcome (discharge/death)
Out-of-bag error 0.073
Accuracy: 0.927
Recall/sensitivity: 0.702
Specificity: 0.973
Precision: 0.840
bicarbonate, phosphate, anion gap, white cell count (total), PTT, platelet, total calcium, chloride, glucose and INR Unclear how the dataset was split and which results are reported
Daunhawer et al. (2019) [39] CV L1 Regularized LoR (LASSO), RF RF+LASSO NE AUROC AUROC: cross-validation, test set, external set
RF: 0.933 ± 0.019, 0.927, 0.9329
LASSO: 0.947 ± 0.015, 0.939, 0.9470
RF + LASSO: 0.952 ± 0.013, 0.939, 0.9520
Gestational Age, weight, bilirubin level, and hours since birth
Estiri et al. (2019) [40] Pl CAD (Standard deviation and Mahalanobis distance), Hierarchical k-means Hierarchical k-means Clustering FP, TP, FN, TN, Sensitivity, Specificity, and fallout across the eight thresholds Specificity increases as threshold decreases. The lowest was 0.9938
Sensitivity > 0.85 in 39/41 variables; Troponin I = 0.0545, LDL = 0.4867
Regarding sensitivity: 39/41 CAD~ML, 9/41 CAD > ML
Regarding FP: in 45/50 ML had fewer FP than CAD
Kayhanian et al. (2019) [41] CV LoR, SVM SVM SVM Sensitivity, Specificity, AUC, J-statistic Sensitivity, Specificity, J-statistic, AUC
Linear model, all variables: 0.75, 0.99, 0.7, 0.9
Linear model, three variables: 0.71, 0.99, 0.74, 0.83
SVM, all variables: 0.63, 1, 0.79, N/A
SVM, three variables: 0.8, 0.99, 0.63, N/A
Lactate, pH and glucose
Wang et al. (2019) [42] CV Auto-Weka (39 ML algorithms) RF Trees Sensitivity, Specificity, AUROC, Accuracy Time after ICH, Case number, Best algorithm, Sensitivity, Specificity, Accuracy, AUC
1 month: 307, Random forest, 0.774, 0.869, 0.831, 0.899
6 months: 243, Random forest, 0.725, 0.906, 0.839, 0.917
1 month: ventricle compression, GCS, ICH volume, location, Hgb;
6 months: GCS, location, age, ICH volume, gender, DBP, WBC
Connection between HDL-C and reactivity of the pulmonary vasculature is a novel finding
Ye et al. (2019) [43] NA Retrospective: RF, XGBoost, Boosting, SVM, LASSO, KNN
Prospective: RF
RF Trees AUROC, PPV, Sensitivity, Specificity RF’s AUROC: 0.884 (highest among all other ML models)

high-risk sensitivity, PPV, low–moderate risk sensitivity, PPV
EWS: 26.7%, 69%, 59.2%, 35.4%
ViEWS: 13.7%, 35%, 35.7%, 21.4%
Diagnoses of cardiovascular diseases, congestive heart failure, or renal diseases No information about tuning
Yang et al. (2020) [44] CV LoR, DT (CART), RF, and GBDT GBDT Ensemble AUROC, sensitivity, specificity, agreement with RT-PCR (Agr-PCR) AUROC; Sensitivity; Specificity; Agr-PCR
GBDT 0.854 (0.829–0.878); 0.761 (0.744–0.778); 0.808 (0.795–0.821); 0.791 (0.776–0.805); on cross-validation;
GBDT 0.838; 0.758; 0.740 on independent testing set
LDH, CRP, Ferritin No information about model, training, validation, test
Ma et al. (2020) [45] CV RF, XGBoost, LoR for selecting variables for the new model
New Model vs Score (CURB-65), XGBoost
New Model Other AUROC AUROC on testing set (13 patients), AUROC on cross-validation
New Model: 0.9667, 0.9514
CURB-65: 0.5500, 0.8501
XGBoost: 0.3333, 0.4530
LDH, CRP, Age Tool available online
Hyun et al. (2020) [46] NE k-means *** NE Clustering *** NE 3 Clusters
Cluster 2: abnormal haemoglobin and RBC
Cluster 3: highest mortality, intubation, cardiac medications and blood administration
BUN, creatinine, potassium, haemoglobin, and red blood cell
Lee et al. (2020) [47] CV RF, SVM, LASSO, Ridge, Elastic Net Regularization, MEWS RF Trees AUROC, AUPRC, BA, Sensitivity, Specificity, F1, PLR, and NLR AUROC AUPRC Sensitivity Specificity
RF OSO: 0.80 (0.76 to 0.84); 0.25 (0.18 to 0.33); 0.70 (0.62 to 0.82); 0.78 (0.66 to 0.83)
RF OSR: 0.88 (0.85 to 0.91); 0.39 (0.30 to 0.47); 0.81 (0.76 to 0.89); 0.81 (0.75 to 0.83)
OSO: Troponin I, creatine kinase and CK-MB;
OSR: Lactic Acid
Performance metrics for comparison referred to cross-validation results
Morid et al. (2020) [48] CV RF, XGBT, Kernel-based Bayesian Network, SVM, LoR, Naive Bayes, KNN, ANN RF Trees AUC, F1, Accuracy RF Model performances according to the detection method,
Accuracy AUC
Last recorded Value: 0.581, 0.589
Symbolic pattern detection: 0.706, 0.694
Local structural pattern: 0.781, 0.772
Global structural pattern: 0.744, 0.730
Local & Global: 0.813, 0.809
NE
Yu et al. (2020) [49] NA ANN *** NE DL *** Checking Proportions (CP), Prediction Accuracy, Aggregated Accuracy (AA) Threshold for performing test: CP; AA
0.15: 90.14%; 95.83%
0.25: 85.78%; 95.05%
0.35: 79.71%; 93.32%
0.45: 71.70%; 90.95%
0.6: 50.46%; 85.30%
NE Performance data not included; only a graph of the AUROC for prediction at 1 month (with 4-month history)
Chicco and Jurman (2020) [50] VS LiR, RF, One-Rule, DT, ANN, SVM, KNN, Naive Bayes, XGBoost RF Trees MCC, F1, Accuracy, TP, TN, PRAUC, AUROC MCC F1 Accuracy TP TN PRAUC AUROC
All features, RF: +0.384, 0.547, 0.740, 0.491, 0.864, 0.657, 0.800
Cr + EF, RF: +0.418, 0.754, 0.585, 0.541, 0.855, 0.541, 0.698
Cr + EF + FU time, LoR: +0.616, 0.719, 0.838, 0.785, 0.860, 0.617, 0.822
Serum Creatinine and Ejection Fraction
Ye et al. (2020) [51] CV GBDT, AdaBoost, LGB, Logistic, Vote, XGB, Decision Tree, and Random Forest, stepwise LoR, LoR with RCS GBDT Ensemble AUROC, Recall, Precision, F1 Discrimination AUC
GBDT 73.51%, 95% CI 71.36%–75.65%
LoR with RCS 70.9%, 95% CI 68.68%–73.12%

0.3 and 0.7 were set as cut-off points for predicting outcomes (GDM or adverse pregnancy outcomes)
GBDT: Fasting blood glucose, HbA1c, triglycerides, and maternal BMI
LoR: HbA1c and high-density lipoprotein
Macias et al. (2020) [52] CV RF (features) + RNN-LSTM, RF RNN-LSTM (all variables) DL AUROC AUROC mortality prediction
1 month
RF 0.737
RNN (many) expert variables 0.781 ± 0.021
RNN RF variables 0.820 ± 0.015
RNN all variables 0.873 ± 0.021
Lobo et al. (2020) [53] VS RNN-LSTM + NN + RNN-LSTM *** NE DL Mean Error (ME), Mean Absolute Error (MAE), Mean Squared Error (MSE) Best model performance
ME: 0.017; MAE: 0.527; MSE: 0.489; predicting at 1 month with 5 months of history data
Roimi et al. (2020) [54] CV 6 RF+2 XGBoost, RF, XGBoost, LoR 6 RF+2 XGBoost Other AUROC, Brier score Modelling approach BIDMC RHCC
AUROC Derivation set, CV Validation set, Derivation set, CV Validation set
Logistic-regression: 0.75 ± 0.06, 0.70 ± 0.02, 0.80 ± 0.08, 0.72 ± 0.02
Random-Forest: 0.82 ± 0.03, 0.85 ± 0.01, 0.90 ± 0.03, 0.88 ± 0.02
Gradient Boosting Trees: 0.84 ± 0.04, 0.84 ± 0.02, 0.93 ± 0.04, 0.88 ± 0.01
Ensemble of models: 0.87 ± 0.03, 0.89 ± 0.01, 0.93 ± 0.03, 0.92 ± 0.01

When validating the BIDMC models on the RHCC dataset and vice versa, the AUROCs of the models deteriorated to 0.59 ± 0.07 and 0.60 ± 0.06 for BIDMC and RHCC, respectively
Most of the strongest features included patterns of change in the time-series variables Performance metrics for comparison referred to cross-validation results
Kirk et al. (2020) [55] NA SVM (cut-offs features), LoR, Random Forest regression Algorithm RF Trees AUROC AUROC
baseline clinical and demographic values 0.52
inclusion of laboratory value thresholds from the day of discharge 0.54
add daily postoperative laboratory thresholds to the demographic and clinical variables 0.59
add postoperative complications 0.62
random forest regression all features 0.68
white blood cell count, bicarbonate, BUN, and creatinine
Li et al. (2020) [56] VS RF, LoR LoR Regression AUROC, Accuracy, Precision, F1, Recall Prospective cohort results
AUROC, Accuracy, Precision, F1 score, Recall
RF: 0.830 (0.770–0.887), 0.916 (0.891–0.936), 0.907 (0.881–0.928), 0.901 (0.874–0.922), 0.917 (0.892–0.937)
LoR: 0.858 (0.808–0.903), 0.905 (0.879–0.926), 0.887 (0.859–0.910), 0.883 (0.855–0.906), 0.905 (0.879–0.926)
RBC, SI, BE, Lac, DBP, pH
Balamurugan et al. (2020) [57] CV Auto-Weka (Naive Bayes, DT-J48, MLP, SVM) & 4 features selectors *** NE NE AUROC, F1, Precision, Accuracy, Recall, MCC, TPR, FPR Proposed model: features selected; Accuracy; TP Rate; FP Rate
GA + J48: 9; 94.32; 0.925; 0.118;
PSO + J48: 9; 96.25; 0.963; 0.163;
CFS + J48: 11; 84.63; 0.861; 0.871;
EWSORA + J48: 4; 98.72; 0.950; 0.165;
RBC, HGB, HCT, WBC Performance metrics for comparison referred to cross-validation results
Hu et al. (2020) [58] CV XGBoost, RF, LR, Score (APACHE II, PSI) XGBoost Ensemble AUROC AUROC
XGBoost 0.842 (95% CI 0.749–0.928)
RF 0.809 (95% CI 0.629–0.891)
LR 0.701 (95% CI 0.573–0.825)
APACHE II 0.720 (95% CI 0.653–0.784)
PSI 0.720 (95% CI 0.654–0.7897)
Fluid balance domain, Laboratory data domain, severity score domain, Management domain, Demographic and symptom domain, Ventilation domain
Aydin et al. (2020) [59] CV Naïve Bayes, KNN, SVM, GLM, RF, and DT DT * Trees AUC, Accuracy, Sensitivity, Specificity AUC (%) Accuracy (%) Sensitivity (%) Specificity (%)
RF 99.67; 97.45; 97.79; 97.21
KNN 98.68; 95.58; 95.08; 95.93
NB 98.71; 94.76; 94.06; 95.25
DT 93.97; 94.69; 93.55; 96.55
SVM 96.76; 91.24; 90.32; 91.86
GLM 96.83; 90.96; 90.66; 91.16
Platelet distribution width (PDW),
white blood cell count (WBC),
neutrophils,
lymphocytes
Metsker et al. (2020) [60] CV KNN for clustering data and then comparison among Linear Regression, Logistic Regression, ANN, DT, and SVM ANN DL AUROC, F1, Precision, Accuracy, Recall Model Precision Recall F1 score Accuracy AUC
29 variables, Linear Regression: 0.6777, 0.7911, 0.7299, 0.7472
31 variables, ANN: 0.7982, 0.8152, 0.8064, 0.8261, 0.8988
Age, Mean Platelet Volume
Voglis et al. (2020) [61] Bt Generalized Linear Models (GLM), GLMBoost, Naïve Bayes classifier, and Random Forest GLMBoost Ensemble AUROC, Accuracy, F1, PPV, NPV, Sensitivity, Specificity AUROC: 84.3% (95% CI 67.0–96.4)
Accuracy: 78.4% (95% CI 66.7–88.2)
Sensitivity: 81.4%
Specificity: 77.5%
F1 score: 62.1%
NPV: 93.9%
PPV: 50%
preoperative serum prolactin
preoperative serum insulin-like growth factor 1 level (IGF-1)
BMI
preoperative serum sodium level

* It was chosen as the most useful, although it was not the best performer; ** Different models were trained with a different number of features; *** A comparison of the ML models was not made; NA: Not available; NE: Not evaluable (meaning not pertinent). For all the other abbreviations, see Appendix B.
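Several of the threshold-based metrics listed in the "Metrics Used" column (Sensitivity, Specificity, PPV, NPV, Accuracy, F1, J-statistic, MCC) can be derived from the four confusion-matrix counts. The following is a minimal sketch of those standard definitions, not code from any of the reviewed studies; the function name `confusion_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
from math import sqrt


def confusion_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes each denominator is non-zero, i.e. at least one actual positive,
    one actual negative, one predicted positive and one predicted negative.
    """
    sensitivity = tp / (tp + fn)            # recall, true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    ppv = tp / (tp + fp)                    # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    youden_j = sensitivity + specificity - 1  # J-statistic
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
        "youden_j": youden_j,
        "mcc": mcc,
    }


# Hypothetical counts, not taken from any study in the table.
example = confusion_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With such definitions, an out-of-bag classification error and the corresponding accuracy are complements, which is consistent with the Yang et al. (2019) row (error 0.073, accuracy 0.927). AUROC, AUPRC, and the concordance index are ranking-based and cannot be obtained from a single confusion matrix.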