| Measure | Description and advantages | Limitations |
| --- | --- | --- |
| **Likelihood-based measures** | Reflect the probability of obtaining the observed data | Based on an assumed model |
| Likelihood ratio (LR), change in AIC or BIC | The LR test is the uniformly most powerful test for nested models. The AIC and BIC can be used to assess non-nested models. | While powerful, a statistical association or model improvement may not be of clinical importance. |
| **Discrimination** | Assesses separation of cases and non-cases | Only one component of model fit |
| Difference in ROC curves, AUC, c-statistic | Assesses discrimination between those with and without the outcome of interest across the whole range of a continuous predictor or score. Useful for classification. | Based on ranks only. Does not assess calibration. Differences may not be of clinical importance. |
| **Clinical risk reclassification** | Examines differences in assignment to clinically important risk strata | Strata should be pre-defined. Loses information if strata are not clinically important. |
| Reclassification calibration statistic | Assesses calibration within cross-classified risk strata | A test for each model is needed |
| Categorical NRI | Can assess changes in important risk strata. Cases and non-cases can be considered separately. | Depends on the number of categories and cut points used |
| NRI(p) | Nice statistical properties. Does not vary by event rate in the data. | May not be clinically relevant |
| Conditional NRI | Indicates improvement within clinically important risk subgroups | Biased in its crude form; a correction based on the full data is needed. |
| **Category-free measures** | Do not require cut points | May lose clinical intuition |
| Brier score | Proper scoring rule | May be difficult to interpret; the maximum value depends on the incidence of the outcome. |
| NRI(0) | Continuous; does not depend on categories | Based on ranks only. A measure of association rather than of model improvement. Behavior may be erratic if the new predictor is not normally distributed. |
| IDI | Nice statistical properties. Related to the difference in model R² | Depends on event rate. Values are low and may be difficult to interpret. |
| **Decision analytics** | Estimates the clinical impact of using a model | Not a direct estimate of model fit or improvement. Needs reasonable estimates of decision thresholds. |
| Decision curve | Displays the net benefit across a range of thresholds | Does not compare model improvement directly, but rather the clinical consequences of using the models for treatment decisions |
| Cost-benefit analysis | Compares costs and benefits of one model or treatment strategy vs. another | Needs detailed estimates of the costs and benefits of misclassification, including further diagnostic workup and treatments |
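
To make several of the measures above concrete, the following is a minimal Python sketch, not a reference implementation. It assumes the usual textbook definitions: the c-statistic as the proportion of case/non-case pairs ranked correctly, the Brier score as the mean squared difference between predicted risk and outcome, the IDI as the difference in discrimination slopes, the NRI as net upward movement in cases plus net downward movement in non-cases, and net benefit as TP/n − (FP/n) × pt/(1 − pt) at threshold pt. All function names, the cut points (0.10, 0.20), the threshold, and the eight-person data set are hypothetical and made up for illustration; they do not come from any study.

```python
from itertools import product

def split(y, p):
    """Return (predicted risks of cases, predicted risks of non-cases)."""
    return ([pi for yi, pi in zip(y, p) if yi == 1],
            [pi for yi, pi in zip(y, p) if yi == 0])

def c_statistic(y, p):
    """Share of case/non-case pairs where the case has the higher risk (ties count 1/2)."""
    cases, controls = split(y, p)
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a, b in product(cases, controls))
    return wins / (len(cases) * len(controls))

def brier_score(y, p):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def idi(y, p_old, p_new):
    """Difference between models in (mean risk in cases - mean risk in non-cases)."""
    def slope(p):
        cases, controls = split(y, p)
        return sum(cases) / len(cases) - sum(controls) / len(controls)
    return slope(p_new) - slope(p_old)

def nri(y, p_old, p_new, cuts=()):
    """Return (NRI in cases, NRI in non-cases).

    With cut points this is the categorical NRI; with none, any change in
    predicted risk counts as movement, which gives the category-free NRI(0).
    """
    def level(r):
        return sum(r >= c for c in cuts) if cuts else r
    net = {1: 0, 0: 0}
    for yi, old, new in zip(y, p_old, p_new):
        move = (level(new) > level(old)) - (level(new) < level(old))
        net[yi] += move if yi == 1 else -move  # up is good for cases, down for non-cases
    n_cases = sum(y)
    return net[1] / n_cases, net[0] / (len(y) - n_cases)

def net_benefit(y, p, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return tp / len(y) - fp / len(y) * threshold / (1 - threshold)

# Hypothetical toy data: outcome (1 = case) and predicted risks from two models.
y     = [1, 1, 1, 0, 0, 0, 0, 0]
p_old = [0.30, 0.15, 0.08, 0.25, 0.12, 0.06, 0.04, 0.03]
p_new = [0.40, 0.22, 0.06, 0.18, 0.10, 0.07, 0.03, 0.02]

print("c-statistic old/new:", c_statistic(y, p_old), c_statistic(y, p_new))
print("Brier old/new:", brier_score(y, p_old), brier_score(y, p_new))
print("IDI:", idi(y, p_old, p_new))
print("Categorical NRI (cases, non-cases):", nri(y, p_old, p_new, cuts=(0.10, 0.20)))
print("NRI(0) (cases, non-cases):", nri(y, p_old, p_new))
print("Net benefit at 0.20 old/new:",
      net_benefit(y, p_old, 0.20), net_benefit(y, p_new, 0.20))
```

In this toy data the c-statistic is identical for the two models, while the reclassification and net-benefit figures differ. That is consistent with the table's note that rank-based discrimination is only one component of model fit. Evaluating `net_benefit` over a grid of thresholds would give the points of a decision curve.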