2025 Oct 13;12:1655302. doi: 10.3389/fmed.2025.1655302

TABLE 3.

Calibration performance of the different models in the training and test cohorts.

Model     Dataset  Brier score  Log loss  HL statistic  HL p-value  MSE    MAE
Clinical  Train    0.198        0.582     72.354        <0.01       0.198  0.400
Clinical  Test     0.163        0.505     84.241        <0.01       0.163  0.372
Image     Train    0.200        0.589     79.223        <0.01       0.200  0.405
Image     Test     0.193        0.576     65.189        <0.01       0.193  0.408
Combined  Train    0.075        0.156     10.27         0.246       0.075  0.181
Combined  Test     0.012        0.172     5.825         0.666       0.101  0.219

The combined model exhibits the best calibration in both cohorts, with the lowest Brier score, log loss, MSE, and MAE, and non-significant Hosmer-Lemeshow (HL) p-values, indicating reliable probability estimates.
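
For readers who want to see how metrics of this kind are typically computed, the following is a minimal sketch in plain Python. It is not the authors' code: the function names, the ten-group default, and the equal-size grouping in the HL statistic are assumptions made for illustration. For a binary outcome, the Brier score is by definition the mean squared error of the predicted probabilities, which is why the Brier score and MSE columns coincide in every row except Combined/Test (0.012 versus 0.101).

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome.

    For binary outcomes this is the same quantity as the MSE.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def mean_absolute_error(y_true, y_prob):
    """Mean absolute difference between predicted probability and 0/1 outcome."""
    return sum(abs(p - y) for y, p in zip(y_true, y_prob)) / len(y_true)


def hosmer_lemeshow_stat(y_true, y_prob, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic over groups of sorted predictions.

    Assumption: observations are sorted by predicted probability and split
    into n_groups bins of (nearly) equal size. The p-value is conventionally
    taken from a chi-square distribution with n_groups - 2 degrees of
    freedom; it is not computed here.
    """
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        n_g = len(group)
        expected = sum(p for p, _ in group)  # expected number of events
        observed = sum(y for _, y in group)  # observed number of events
        mean_p = expected / n_g
        denom = n_g * mean_p * (1 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat


# Toy data (hypothetical, not from the study)
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(brier_score(y_true, y_prob))
print(log_loss(y_true, y_prob))
print(mean_absolute_error(y_true, y_prob))
print(hosmer_lemeshow_stat(y_true, y_prob, n_groups=2))
```

Lower values of the Brier score, log loss, and MAE indicate predicted probabilities closer to the observed outcomes, while a small HL statistic (and hence a non-significant p-value) indicates no detectable disagreement between observed and expected event counts across risk groups.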