BMC Med Inform Decis Mak. 2023 Jan 6;23:2. doi: 10.1186/s12911-022-02096-x

Table 2.

Classical performance assessment of multimodal deep learning against benchmark

| Prediction | Prevalence | N | Model | Recall | Precision | Balanced accuracy | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|---|---|---|
| Early Surgery | 0.024 | 824 | MDL | 0.300 ± 0.077* | **0.086 ± 0.021**\* | 0.610 ± 0.039* | **0.133 ± 0.033**\* | **0.725 ± 0.040**\* | **0.061 ± 0.014**\* |
| | | | Benchmark | **0.375 ± 0.076** | 0.069 ± 0.014 | **0.624 ± 0.038** | 0.116 ± 0.023 | 0.597 ± 0.050 | 0.047 ± 0.011 |
| Late Surgery | 0.049 | 851 | MDL | **0.595 ± 0.051**\* | **0.080 ± 0.007**\* | **0.619 ± 0.026**\* | **0.140 ± 0.012**\* | **0.655 ± 0.026**\* | 0.077 ± 0.009* |
| | | | Benchmark | 0.440 ± 0.056 | 0.076 ± 0.009 | 0.580 ± 0.028 | 0.129 ± 0.016 | 0.635 ± 0.031 | **0.079 ± 0.011** |

We compared the performance of the MDL architecture against the benchmark (i.e., LASSO). We drew 1000 bootstrap samples from the test set and, for each sample, calculated the performance metrics reported in the table: recall, precision, balanced accuracy, F1-score, AUC, and AUPRC. We then computed the mean and standard deviation across samples. For each prediction task, the best value for each metric is shown in bold. Finally, we performed a t-test comparing the two models on each performance metric for each prediction task; significance is indicated with an asterisk.
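As a rough illustration of this evaluation procedure (a minimal sketch, not the authors' code), the following Python snippet bootstraps a test set, computes the same metrics with scikit-learn, and compares two models' metric distributions with a t-test. The arrays `y_test`, `mdl_probs`, and `lasso_probs` are hypothetical placeholders for the test labels and each model's predicted probabilities.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import (
    recall_score, precision_score, balanced_accuracy_score,
    f1_score, roc_auc_score, average_precision_score,
)

rng = np.random.default_rng(0)

def bootstrap_metrics(y_true, y_prob, threshold=0.5, n_boot=1000):
    """Resample the test set with replacement; return one metrics dict per sample."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    rows = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        yt, yp = y_true[idx], y_prob[idx]
        if yt.min() == yt.max():
            continue  # skip degenerate samples containing a single class
        yhat = (yp >= threshold).astype(int)
        rows.append({
            "recall": recall_score(yt, yhat),
            "precision": precision_score(yt, yhat, zero_division=0),
            "balanced_accuracy": balanced_accuracy_score(yt, yhat),
            "f1": f1_score(yt, yhat),
            "auc": roc_auc_score(yt, yp),
            "auprc": average_precision_score(yt, yp),
        })
    return rows

# Hypothetical usage: y_test, mdl_probs, lasso_probs are assumed inputs.
# mdl = bootstrap_metrics(y_test, mdl_probs)
# lasso = bootstrap_metrics(y_test, lasso_probs)
# auc_mdl = [r["auc"] for r in mdl]
# auc_lasso = [r["auc"] for r in lasso]
# print(f"MDL AUC: {np.mean(auc_mdl):.3f} ± {np.std(auc_mdl):.3f}")
# t, p = stats.ttest_ind(auc_mdl, auc_lasso)  # asterisk if p < 0.05
```

Reporting the mean ± standard deviation over bootstrap samples, as in the table above, characterizes how sensitive each metric is to the composition of the test set; the t-test then asks whether the two models' metric distributions differ beyond that resampling noise.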

AUC, Area Under the Curve; AUPRC, Area Under the Precision-Recall Curve; MDL, Multimodal Deep Learning