Table 4.
Calibration measure (examples of studies in which the measure was used) | Pros | Cons |
---|---|---|
Brier score44–46 | | The contribution of each component (discrimination, calibration) is not easy to calculate or interpret. |
Spiegelhalter’s z test47,48 | Extension of Brier score that measures calibration only. P value can serve as a guide for how calibrated a model is. | Not intuitive. |
Average absolute error | Easy calculation. Intuitive. | Same problems as Brier score. Rarely used. |
H-L test28,49 | Widely used in the biomedical literature. P value can serve as a guide for how calibrated a model is. | |
Reliability diagram25,26,50,51 | Allows for visualization of regions of miscalibration and the “direction” of miscalibration (ie, underestimation, overestimation). | Not a continuous graph. Hard to read when estimates are clustered in certain regions (zooming in on a portion of the graph may be needed). |
Expected calibration error and maximum calibration error25,52,53 | Intuitive. | No statistical test to help determine whether a model is adequately calibrated or not. |
Cox’s slope and intercept54–56 | Summarizes direction of miscalibration (ie, overall underestimation or overestimation). | Can still yield the ideal values (intercept of 0 and slope of 1) even if regions are miscalibrated. |
Integrated calibration index | Can capture regions of miscalibration that Cox’s slope and intercept cannot. | Requires LOESS smoothing to build the calibration model. Not intuitive. |
H-L: Hosmer-Lemeshow.
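
As an illustration of three measures in Table 4, the following minimal sketch computes the Brier score, the expected calibration error (ECE), and the maximum calibration error (MCE) for a set of predicted probabilities and binary outcomes. The function names, the use of 10 equal-width bins, and the toy data are assumptions made for this example and are not taken from the studies cited in the table; other binning schemes (eg, equal-frequency bins) are also common and can give different values.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_bins(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width bins (an assumption of this sketch).

    Returns (mean predicted probability, observed event rate, count) for each
    non-empty bin. These triples are the points of a reliability diagram.
    """
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    return [(sp / n, sy / n, n) for sp, sy, n in sums if n > 0]


def ece_mce(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> Tuple[float, float]:
    """ECE: count-weighted mean of per-bin |predicted - observed| gaps.
    MCE: largest per-bin gap."""
    bins = calibration_bins(probs, outcomes, n_bins)
    total = sum(n for _, _, n in bins)
    gaps = [(abs(mean_pred - obs_rate), n) for mean_pred, obs_rate, n in bins]
    ece = sum(gap * n / total for gap, n in gaps)
    mce = max(gap for gap, _ in gaps)
    return ece, mce


# Toy data (hypothetical): four predictions and their observed outcomes.
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]

bs = brier_score(probs, outcomes)
ece, mce = ece_mce(probs, outcomes)
print(f"Brier score = {bs:.3f}, ECE = {ece:.3f}, MCE = {mce:.3f}")
```

Consistent with the "Cons" column, the sketch returns only descriptive numbers: nothing in the ECE or MCE calculation provides a statistical test of whether the model is adequately calibrated, and the Brier score does not separate discrimination from calibration.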