BMC Med Inform Decis Mak. 2023 Jan 6;23:2. doi: 10.1186/s12911-022-02096-x

Table 3.

Generalizability performance assessment of multimodal deep learning against benchmark

| Prediction | System | Prevalence | N | Model | Recall | Precision | Balanced Accuracy | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|---|---|---|---|
| Early Surgery | Group Health | 0.021 | 239 | MDL | 0.600 ± 0.161* | 0.075 ± 0.020* | 0.720 ± 0.081* | 0.132 ± 0.036* | 0.731 ± 0.109* | 0.105 ± 0.050* |
| | | | | Benchmark | 0.300 ± 0.152 | 0.056 ± 0.028 | 0.595 ± 0.076 | 0.094 ± 0.047 | 0.656 ± 0.113 | 0.149 ± 0.114 |
| | Henry Ford | 0.039 | 324 | MDL | 0.640 ± 0.097* | 0.127 ± 0.021* | 0.732 ± 0.050* | 0.212 ± 0.033* | 0.795 ± 0.047* | 0.128 ± 0.031* |
| | | | | Benchmark | 0.200 ± 0.079 | 0.087 ± 0.034 | 0.557 ± 0.040 | 0.120 ± 0.047 | 0.714 ± 0.050 | 0.088 ± 0.023 |
| Late Surgery | Group Health | 0.079 | 254 | MDL | 0.425 ± 0.079* | 0.143 ± 0.026* | 0.603 ± 0.041* | 0.214 ± 0.038* | 0.630 ± 0.046* | 0.120 ± 0.020 |
| | | | | Benchmark | 0.600 ± 0.080 | 0.109 ± 0.014 | 0.590 ± 0.042 | 0.185 ± 0.024 | 0.641 ± 0.044 | 0.119 ± 0.023 |
| | Henry Ford | 0.042 | 325 | MDL | 0.482 ± 0.099* | 0.085 ± 0.017* | 0.628 ± 0.051* | 0.145 ± 0.029* | 0.700 ± 0.053* | 0.091 ± 0.024* |
| | | | | Benchmark | 0.556 ± 0.096 | 0.112 ± 0.019 | 0.682 ± 0.048 | 0.186 ± 0.031 | 0.707 ± 0.057 | 0.097 ± 0.022 |

We compared the generalizability of the MDL architecture against the benchmark (i.e., LASSO). For each test system, we evaluated model performance using the metrics listed above. To estimate the significance of performance differences between models, we drew 1000 bootstrap samples for each test system. For each prediction task and system, we then performed a t-test comparing the bootstrapped samples between the two models across the performance metrics; an asterisk in the MDL row indicates a significant difference. The model with the best average performance on each metric for each system is underlined.
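The significance procedure described above (bootstrap each test set 1000 times, recompute the metric on each resample, then t-test the two models' bootstrap distributions) can be sketched roughly as follows. Everything here is a synthetic stand-in: the labels, the two score vectors, and the single `recall_at_half` metric are illustrative assumptions, not the study's models or its full metric panel.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def bootstrap_metric(y_true, y_score, metric, n_boot=1000):
    """Resample the test set with replacement n_boot times,
    recomputing the metric on each resample."""
    n = len(y_true)
    samples = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # bootstrap indices
        samples[b] = metric(y_true[idx], y_score[idx])
    return samples

def recall_at_half(y_true, y_score):
    """Recall with a fixed 0.5 decision threshold (hypothetical choice)."""
    y_pred = (y_score >= 0.5).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Synthetic low-prevalence labels and scores for two hypothetical models;
# the "MDL" scores are made more separable than the "benchmark" scores.
n = 300
y = rng.binomial(1, 0.05, n)
score_mdl = np.clip(y * 0.4 + rng.normal(0.3, 0.2, n), 0, 1)
score_lasso = np.clip(y * 0.2 + rng.normal(0.3, 0.2, n), 0, 1)

boot_mdl = bootstrap_metric(y, score_mdl, recall_at_half)
boot_lasso = bootstrap_metric(y, score_lasso, recall_at_half)

# t-test comparing the two bootstrap distributions of the metric
t_stat, p_val = stats.ttest_ind(boot_mdl, boot_lasso)
print(f"MDL recall:       {boot_mdl.mean():.3f} ± {boot_mdl.std():.3f}")
print(f"Benchmark recall: {boot_lasso.mean():.3f} ± {boot_lasso.std():.3f}")
print(f"t = {t_stat:.2f}, p = {p_val:.3g}")
```

The mean ± standard deviation of each bootstrap distribution corresponds to the "value ± spread" entries in the table, and the p-value drives the asterisk.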

AUC, Area Under the Curve; AUPRC, Area Under the Precision-Recall Curve; LASSO, Least Absolute Shrinkage and Selection Operator; MDL, Multimodal Deep Learning