Table 2. Performance of the MDL architecture versus the LASSO benchmark on the early- and late-surgery prediction tasks.
Prediction | Prevalence | N | Model | Recall | Precision | Balanced accuracy | F1 | AUC | AUPRC |
---|---|---|---|---|---|---|---|---|---|
Early Surgery | 0.024 | 824 | MDL | 0.300 ± 0.077* | **0.086 ± 0.021**\* | 0.610 ± 0.039* | **0.133 ± 0.033**\* | **0.725 ± 0.040**\* | **0.061 ± 0.014**\* |
 | | | Benchmark | **0.375 ± 0.076** | 0.069 ± 0.014 | **0.624 ± 0.038** | 0.116 ± 0.023 | 0.597 ± 0.050 | 0.047 ± 0.011 |
Late Surgery | 0.049 | 851 | MDL | **0.595 ± 0.051**\* | **0.080 ± 0.007**\* | **0.619 ± 0.026**\* | **0.140 ± 0.012**\* | **0.655 ± 0.026**\* | 0.077 ± 0.009* |
 | | | Benchmark | 0.440 ± 0.056 | 0.076 ± 0.009 | 0.580 ± 0.028 | 0.129 ± 0.016 | 0.635 ± 0.031 | **0.079 ± 0.011** |
We compared the performance of the MDL architecture against the benchmark (i.e., LASSO) by drawing 1,000 bootstrap samples from the test set. For each sample, we computed the performance metrics shown: recall, precision, balanced accuracy, F1-score, AUC, and AUPRC. We then report the mean and standard deviation across samples. For each prediction task, the model with the best performance on each metric is shown in bold. Finally, we performed a t-test between the two models' bootstrap distributions of each metric; significant differences are indicated with an asterisk.
AUC, Area Under the Curve; AUPRC, Area Under the Precision-Recall Curve; LASSO, Least Absolute Shrinkage and Selection Operator; MDL, Multimodal Deep Learning
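As a minimal sketch of the evaluation procedure described in the caption, the following Python snippet bootstraps the test set, computes the six tabulated metrics per resample, and compares the two models' bootstrap distributions with a t-test. The function names, random seeds, and input arrays (`y_true`, per-model probability scores, and thresholded labels) are illustrative assumptions, not the paper's actual code.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

METRICS = ("recall", "precision", "balanced_accuracy", "f1", "auc", "auprc")

def bootstrap_metrics(y_true, y_score, y_pred, n_boot=1000, seed=0):
    """Metric -> array of values over bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = {m: [] for m in METRICS}
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        yt, ys, yp = y_true[idx], y_score[idx], y_pred[idx]
        if yt.min() == yt.max():                  # skip one-class resamples
            continue
        values["recall"].append(recall_score(yt, yp))
        values["precision"].append(precision_score(yt, yp, zero_division=0))
        values["balanced_accuracy"].append(balanced_accuracy_score(yt, yp))
        values["f1"].append(f1_score(yt, yp))
        values["auc"].append(roc_auc_score(yt, ys))
        values["auprc"].append(average_precision_score(yt, ys))
    return {m: np.asarray(v) for m, v in values.items()}

def compare(y_true, mdl_score, mdl_pred, lasso_score, lasso_pred):
    """Mean ± SD per metric for both models, plus a t-test p-value."""
    mdl = bootstrap_metrics(y_true, mdl_score, mdl_pred)
    lasso = bootstrap_metrics(y_true, lasso_score, lasso_pred, seed=1)
    for m in METRICS:
        _, p = stats.ttest_ind(mdl[m], lasso[m])
        print(f"{m}: MDL {mdl[m].mean():.3f} ± {mdl[m].std():.3f} vs "
              f"LASSO {lasso[m].mean():.3f} ± {lasso[m].std():.3f} (p = {p:.1e})")
```

With outcome prevalences this low (0.024 and 0.049), some resamples may contain a single class, hence the degenerate-resample guard; whether an independent or paired t-test is appropriate depends on whether the same resample indices are reused for both models, which the caption does not specify.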