TABLE 3.
Calibration performance of the different models in the training and test cohorts.
| Model | Dataset | Brier score | Log loss | HL statistic | HL p-value | MSE | MAE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Clinical | Train | 0.198 | 0.582 | 72.354 | <0.01 | 0.198 | 0.400 |
| Clinical | Test | 0.163 | 0.505 | 84.241 | <0.01 | 0.163 | 0.372 |
| Image | Train | 0.200 | 0.589 | 79.223 | <0.01 | 0.200 | 0.405 |
| Image | Test | 0.193 | 0.576 | 65.189 | <0.01 | 0.193 | 0.408 |
| Combined | Train | 0.075 | 0.156 | 10.27 | 0.246 | 0.075 | 0.181 |
| Combined | Test | 0.101 | 0.172 | 5.825 | 0.666 | 0.101 | 0.219 |
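The combined model shows the best calibration in both cohorts: its Brier score, log loss, MSE, and MAE are lower than those of the clinical and image models, and its Hosmer-Lemeshow (HL) p-values are non-significant (0.246 in training, 0.666 in test), whereas the clinical and image models both have HL p-values below 0.01. The HL test therefore finds no evidence of miscalibration for the combined model, indicating reliable probability estimates.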
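For readers who want to reproduce metrics of this kind, the following is a minimal sketch of how the columns of Table 3 are conventionally computed from binary outcomes and predicted probabilities. It is not the authors' code: the function names, the equal-size risk-sorted grouping, and the default of 10 groups are assumptions made for illustration. For a binary outcome the Brier score is the mean squared error of the predicted probabilities, so a single function covers both columns. The chi-square tail probability uses a closed form that holds only for even degrees of freedom, which is sufficient for the usual choice of 10 groups (8 degrees of freedom).

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome.

    For binary outcomes this is identical to the MSE of the probabilities.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def mean_absolute_error(y_true, y_prob):
    """Mean absolute difference between predicted probability and outcome."""
    return sum(abs(p - y) for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """HL statistic over risk-sorted groups of near-equal size (assumed grouping)."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        denom = expected * (1 - expected / size)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def hl_p_value(y_true, y_prob, groups=10):
    """HL p-value with the conventional groups - 2 degrees of freedom."""
    return chi2_sf_even_df(hosmer_lemeshow(y_true, y_prob, groups), groups - 2)
```

Applied to the combined model's HL statistics with 8 degrees of freedom, `chi2_sf_even_df(10.27, 8)` and `chi2_sf_even_df(5.825, 8)` give approximately 0.247 and 0.667, close to the reported 0.246 and 0.666. This suggests, but does not confirm, that 10 groups were used.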