Table 2. Performance of the MDL architecture versus the LASSO benchmark on the early- and late-surgery prediction tasks.
Prediction | Prevalence | N | Model | Recall | Precision | Balanced accuracy | F1 | AUC | AUPRC |
---|---|---|---|---|---|---|---|---|---|
Early Surgery | 0.024 | 824 | MDL | 0.300 ± 0.077* | **0.086 ± 0.021**\* | 0.610 ± 0.039* | **0.133 ± 0.033**\* | **0.725 ± 0.040**\* | **0.061 ± 0.014**\* |
 | | | Benchmark | **0.375 ± 0.076** | 0.069 ± 0.014 | **0.624 ± 0.038** | 0.116 ± 0.023 | 0.597 ± 0.050 | 0.047 ± 0.011 |
Late Surgery | 0.049 | 851 | MDL | **0.595 ± 0.051**\* | **0.080 ± 0.007**\* | **0.619 ± 0.026**\* | **0.140 ± 0.012**\* | **0.655 ± 0.026**\* | 0.077 ± 0.009* |
 | | | Benchmark | 0.440 ± 0.056 | 0.076 ± 0.009 | 0.580 ± 0.028 | 0.129 ± 0.016 | 0.635 ± 0.031 | **0.079 ± 0.011** |
We compared the performance of the MDL architecture against the benchmark (i.e., LASSO) by drawing 1,000 bootstrap samples from the test set. For each sample, we computed the performance metrics shown: recall, precision, balanced accuracy, F1-score, AUC, and AUPRC. We then report the mean and standard deviation across samples. For each prediction task, the model with the best performance on each metric is shown in bold. Finally, we performed a t-test between the two models' bootstrap distributions of each metric; significant differences are indicated with an asterisk.
AUC, Area Under the Curve; AUPRC, Area Under the Precision-Recall Curve; LASSO, Least Absolute Shrinkage and Selection Operator; MDL, Multimodal Deep Learning
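As a minimal sketch of the evaluation procedure described in the caption, the following Python snippet bootstraps the test set, computes the six tabulated metrics per resample, and compares the two models' bootstrap distributions with a t-test. The function names, random seeds, and input arrays (`y_true`, per-model probability scores, and thresholded labels) are illustrative assumptions, not the paper's actual code.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

METRICS = ("recall", "precision", "balanced_accuracy", "f1", "auc", "auprc")

def bootstrap_metrics(y_true, y_score, y_pred, n_boot=1000, seed=0):
    """Metric -> array of values over bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = {m: [] for m in METRICS}
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        yt, ys, yp = y_true[idx], y_score[idx], y_pred[idx]
        if yt.min() == yt.max():                  # skip one-class resamples
            continue
        values["recall"].append(recall_score(yt, yp))
        values["precision"].append(precision_score(yt, yp, zero_division=0))
        values["balanced_accuracy"].append(balanced_accuracy_score(yt, yp))
        values["f1"].append(f1_score(yt, yp))
        values["auc"].append(roc_auc_score(yt, ys))
        values["auprc"].append(average_precision_score(yt, ys))
    return {m: np.asarray(v) for m, v in values.items()}

def compare(y_true, mdl_score, mdl_pred, lasso_score, lasso_pred):
    """Mean ± SD per metric for both models, plus a t-test p-value."""
    mdl = bootstrap_metrics(y_true, mdl_score, mdl_pred)
    lasso = bootstrap_metrics(y_true, lasso_score, lasso_pred, seed=1)
    for m in METRICS:
        _, p = stats.ttest_ind(mdl[m], lasso[m])
        print(f"{m}: MDL {mdl[m].mean():.3f} ± {mdl[m].std():.3f} vs "
              f"LASSO {lasso[m].mean():.3f} ± {lasso[m].std():.3f} (p = {p:.1e})")
```

With outcome prevalences this low (0.024 and 0.049), some resamples may contain a single class, hence the degenerate-resample guard; whether an independent or paired t-test is appropriate depends on whether the same resample indices are reused for both models, which the caption does not specify.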