2025 Oct 13;12:1655302. doi: 10.3389/fmed.2025.1655302

TABLE 3.

Calibration performance of the different models in the training and test cohorts.

Model	Dataset	Brier score	Log loss	HL statistic	HL p-value	MSE	MAE
Clinical	Train	0.198	0.582	72.354	<0.01	0.198	0.400
Clinical	Test	0.163	0.505	84.241	<0.01	0.163	0.372
Image	Train	0.200	0.589	79.223	<0.01	0.200	0.405
Image	Test	0.193	0.576	65.189	<0.01	0.193	0.408
Combined	Train	0.075	0.156	10.27	0.246	0.075	0.181
Combined	Test	0.012	0.172	5.825	0.666	0.101	0.219

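HL, Hosmer-Lemeshow test; MSE, mean squared error; MAE, mean absolute error. The combined model shows the best calibration in both cohorts: it has the lowest Brier score, log loss, MSE, and MAE, and it is the only model with non-significant HL p-values (0.246 in training, 0.666 in test), so the test finds no evidence of miscalibration. The clinical and image models both have HL p-values below 0.01. These results are consistent with the combined model producing reliable probability estimates.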
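
For readers who want to reproduce metrics of this kind, the following is a minimal sketch of how the quantities in the table are commonly computed from binary outcomes and predicted probabilities. It is not the authors' code. The function names, the default of 10 equal-size risk groups, and the use of `groups - 2` degrees of freedom for the Hosmer-Lemeshow test are all assumptions made for illustration. For binary outcomes, the Brier score and the MSE of the predicted probabilities are the same quantity by definition.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome.

    For binary outcomes this is identical to the MSE of the probabilities.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def mean_absolute_error(y_true, y_prob):
    """Mean absolute difference between predicted probability and outcome."""
    return sum(abs(p - y) for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability (closed form, valid for even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """HL statistic and p-value over equal-size groups of sorted predictions.

    Assumption: df = groups - 2, the usual choice for development data.
    """
    pairs = sorted(zip(y_prob, y_true), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        # Variance term n_g * pbar_g * (1 - pbar_g), with pbar_g = expected / n_g
        denom = expected * (1 - expected / len(chunk))
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat, chi2_sf_even_df(stat, groups - 2)
```

Under these assumptions, `chi2_sf_even_df(10.27, 8)` and `chi2_sf_even_df(5.825, 8)` evaluate to roughly 0.25 and 0.67, close to the HL p-values reported for the combined model. This is consistent with, but does not establish, a 10-group test with 8 degrees of freedom.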