2020 Feb 27;27(4):621–633. doi: 10.1093/jamia/ocz228

Table 4.

Summary of advantages and disadvantages of calibration measurement methods presented in this tutorial

| Calibration measure (examples of studies in which the measure was used) | Pros | Cons |
| --- | --- | --- |
| Brier score [44–46] | Easy calculation. Measures a combination of discrimination and calibration. | The contribution of each component (discrimination, calibration) is not easy to calculate or interpret. |
| Spiegelhalter’s z test [47,48] | Extension of the Brier score that measures calibration only. P value can serve as a guide for how calibrated a model is. | Not intuitive. |
| Average absolute error | Easy calculation. Intuitive. | Same problems as the Brier score. Rarely used. |
| H-L test [28,49] | Widely used in the biomedical literature. P value can serve as a guide for how calibrated a model is. | Not designed to handle sample sizes >25 000. Use of the H-L C-statistic and the H-L H-statistic can result in different significance. |
| Reliability diagram [25,26,50,51] | Allows for visualization of regions of miscalibration and the “direction” of miscalibration (ie, underestimation, overestimation). | Not a continuous graph. Hard to read when estimates are clustered in certain regions (zooming into a portion of the graph may be needed). |
| Expected calibration error and maximum calibration error [25,52,53] | Intuitive. | No statistical test to help determine whether a model is adequately calibrated. |
| Cox’s slope and intercept [54–56] | Summarizes direction of miscalibration (ie, overall underestimation or overestimation). | Can still yield the perfect-calibration values (intercept of 0 and slope of 1) even if regions are miscalibrated. |
| Integrated calibration index | Can capture regions of miscalibration that Cox’s slope and intercept cannot. | Requires Loess to build the calibration model. Not intuitive. |

H-L: Hosmer-Lemeshow.
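
To make a few of the measures in the table concrete, the following is a minimal Python sketch of the Brier score, Spiegelhalter’s z statistic, and the expected and maximum calibration errors. It is an illustration under stated assumptions, not the tutorial’s own implementation: the function names are hypothetical, NumPy is assumed to be available, outcomes are coded 0/1, and the binned errors assume 10 equal-width bins on [0, 1] (other binning schemes, such as equal-frequency bins, give different values).

```python
import math

import numpy as np


def brier_score(y, p):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))


def spiegelhalter_z(y, p):
    """Spiegelhalter's z statistic and a two-sided p value from the normal
    distribution. Predictions of exactly 0 or 1 contribute zero variance."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    numerator = np.sum((y - p) * (1.0 - 2.0 * p))
    denominator = math.sqrt(np.sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p)))
    z = float(numerator / denominator)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail area
    return z, p_value


def binned_calibration_errors(y, p, n_bins=10):
    """Return (ECE, MCE) using equal-width bins (an assumption of this sketch).

    ECE is the bin-size-weighted mean of |observed rate - mean prediction|;
    MCE is the largest such gap over the non-empty bins.
    """
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each prediction to a bin index in 0..n_bins-1.
    bin_index = np.digitize(p, edges[1:-1])
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if not in_bin.any():
            continue  # empty bins do not contribute
        gap = abs(y[in_bin].mean() - p[in_bin].mean())
        ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of cases in the bin
        mce = max(mce, gap)
    return float(ece), float(mce)


if __name__ == "__main__":
    # Toy data: within each bin the observed rate equals the mean prediction.
    outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
    predictions = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
    print("Brier score:", brier_score(outcomes, predictions))
    print("Spiegelhalter z, p:", spiegelhalter_z(outcomes, predictions))
    print("ECE, MCE:", binned_calibration_errors(outcomes, predictions))
```

The toy data also illustrate the first row of the table: the predictions are perfectly calibrated within each bin, yet the Brier score is not zero, because it reflects discrimination as well as calibration.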