Table 3.
| Prediction | System | Prevalence | N | Model | Recall | Precision | Balanced Accuracy | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|---|---|---|---|
| Early Surgery | Group Health | 0.021 | 239 | MDL | 0.600 ± 0.161* | 0.075 ± 0.020* | 0.720 ± 0.081* | 0.132 ± 0.036* | 0.731 ± 0.109* | 0.105 ± 0.050* |
| | | | | Benchmark | 0.300 ± 0.152 | 0.056 ± 0.028 | 0.595 ± 0.076 | 0.094 ± 0.047 | 0.656 ± 0.113 | 0.149 ± 0.114 |
| | Henry Ford | 0.039 | 324 | MDL | 0.640 ± 0.097* | 0.127 ± 0.021* | 0.732 ± 0.050* | 0.212 ± 0.033* | 0.795 ± 0.047* | 0.128 ± 0.031* |
| | | | | Benchmark | 0.200 ± 0.079 | 0.087 ± 0.034 | 0.557 ± 0.040 | 0.120 ± 0.047 | 0.714 ± 0.050 | 0.088 ± 0.023 |
| Late Surgery | Group Health | 0.079 | 254 | MDL | 0.425 ± 0.079* | 0.143 ± 0.026* | 0.603 ± 0.041* | 0.214 ± 0.038* | 0.630 ± 0.046* | 0.120 ± 0.020 |
| | | | | Benchmark | 0.600 ± 0.080 | 0.109 ± 0.014 | 0.590 ± 0.042 | 0.185 ± 0.024 | 0.641 ± 0.044 | 0.119 ± 0.023 |
| | Henry Ford | 0.042 | 325 | MDL | 0.482 ± 0.099* | 0.085 ± 0.017* | 0.628 ± 0.051* | 0.145 ± 0.029* | 0.700 ± 0.053* | 0.091 ± 0.024* |
| | | | | Benchmark | 0.556 ± 0.096 | 0.112 ± 0.019 | 0.682 ± 0.048 | 0.186 ± 0.031 | 0.707 ± 0.057 | 0.097 ± 0.022 |
We compared the generalizability of the MDL architecture and the benchmark (i.e., LASSO). For each test system, we evaluated each model on the performance metrics above. To estimate the significance of performance differences between models, we drew 1,000 bootstrap samples for each test system. For each prediction task and system, we performed a t-test comparing the two models' bootstrapped samples on each performance metric; an asterisk in the MDL row indicates a significant difference. For each system and metric, we underline the model with the best average performance.
AUC, Area Under the Curve; AUPRC, Area Under the Precision-Recall Curve; LASSO, Least Absolute Shrinkage and Selection Operator; MDL, Multimodal Deep Learning
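The bootstrap comparison described in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: synthetic prediction scores stand in for the real MDL and LASSO outputs, and only one metric (AUC) is shown; the same loop would be repeated for recall, precision, balanced accuracy, F1, and AUPRC.

```python
# Sketch of the significance test: resample the test set 1,000 times,
# compute the metric for each model on each bootstrap sample, then
# compare the two bootstrap distributions with a t-test.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: rare-event labels and two models' risk scores.
n = 300
y = rng.binomial(1, 0.05, size=n)
scores_a = np.clip(y * 0.30 + rng.normal(0.3, 0.15, n), 0, 1)  # "MDL"
scores_b = np.clip(y * 0.15 + rng.normal(0.3, 0.15, n), 0, 1)  # "Benchmark"

def bootstrap_auc(y_true, y_score, n_boot=1000):
    """AUC computed on n_boot bootstrap resamples of the test set."""
    out = []
    while len(out) < n_boot:
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample must contain both classes for AUC
        out.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.array(out)

boot_a = bootstrap_auc(y, scores_a)
boot_b = bootstrap_auc(y, scores_b)

# Mean ± SD, as reported in the table, and the between-model t-test.
t_stat, p_value = ttest_ind(boot_a, boot_b)
print(f"Model A AUC: {boot_a.mean():.3f} ± {boot_a.std():.3f}")
print(f"Model B AUC: {boot_b.mean():.3f} ± {boot_b.std():.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

The asterisks in the table correspond to comparisons where this p-value falls below the chosen significance threshold.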