New approaches have been developed in recent years to quantify the improvement in prediction performance gained by adding a novel marker to a set of baseline predictors of risk. The paper by Tzoulaki et al.1 concerns risk reclassification techniques and focuses specifically on the net reclassification improvement (NRI) index. Their review shows that use of risk reclassification analysis is extremely common in practice, with 51 papers using the technique published within only 3 years of its introduction. Unfortunately and alarmingly, the review also shows that the quality of reporting is dismal. Investigators seem confused about the roles and interpretations of risk reclassification metrics. Guidance on how to report results of risk reclassification analysis would be helpful to authors, reviewers and the field in general.
The risk reclassification table was first introduced by Cook.2 The table is constructed by choosing clinically meaningful risk categories and cross-classifying individuals according to their risks calculated with the baseline risk model and with the expanded risk model. The top panel of Table 1 provides an illustration. Cook and Ridker3 developed a whole analysis strategy around the risk reclassification table, including new hypothesis tests and a new metric called ‘percent correct reclassification’. However, the value of these analysis techniques is doubtful and results can be misleading.4 Pencina et al.5 argued that the reclassification table itself was problematic, at least as proposed by Cook, because it did not distinguish between subjects with events (cases) and subjects without events (controls). They suggested constructing separate event and non-event reclassification tables, as shown in the middle and bottom panels of Table 1. Entries above the diagonal correspond to risks that are higher with the expanded vs the baseline model, representing improved prediction for subjects with events; correspondingly, entries below the diagonal represent worse prediction for them. The event-NRI is the difference between the proportions of subjects above vs below the diagonal in the event reclassification table. Using similar logic, the non-event-NRI is calculated from the non-event reclassification table by taking the difference between the proportions of subjects below vs above the diagonal. The NRI summary index that gained immediate popularity in the literature following Pencina’s paper is the sum of these two components: NRI = event-NRI + non-event-NRI.
Table 1. Risk reclassification tables cross-classifying individuals by risk category under the baseline model (rows) and the expanded model (columns)

All subjects (n = 10 000)

| Baseline model | 0–5% | 5–20% | >20% | Total |
|---|---|---|---|---|
| 0–5% | 5558 | 437 | 25 | 6020 |
| 5–20% | 1036 | 1095 | 386 | 2517 |
| >20% | 40 | 329 | 1094 | 1463 |
| Total | 6634 | 1861 | 1505 | 10 000 |

Events only (n = 1017)

| Baseline model | 0–5% | 5–20% | >20% | Total |
|---|---|---|---|---|
| 0–5% | 72 | 38 | 4 | 114 |
| 5–20% | 21 | 105 | 114 | 240 |
| >20% | 0 | 33 | 630 | 663 |
| Total | 93 | 176 | 748 | 1017 |

Non-events only (n = 8983)

| Baseline model | 0–5% | 5–20% | >20% | Total |
|---|---|---|---|---|
| 0–5% | 5486 | 399 | 21 | 5906 |
| 5–20% | 1015 | 990 | 272 | 2277 |
| >20% | 40 | 296 | 464 | 800 |
| Total | 6541 | 1685 | 757 | 8983 |
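As a concrete illustration of these definitions, the event-NRI, non-event-NRI and overall NRI can be computed directly from the cell counts in Table 1. The short Python sketch below does so; the array and function names are illustrative only.

```python
import numpy as np

# Cell counts from Table 1; rows index the baseline-model category and
# columns the expanded-model category (0-5%, 5-20%, >20%).
events = np.array([[  72,  38,   4],
                   [  21, 105, 114],
                   [   0,  33, 630]])
non_events = np.array([[5486, 399,  21],
                       [1015, 990, 272],
                       [  40, 296, 464]])

def net_reclassification(table, up_is_good):
    """Net proportion reclassified in the favourable direction.

    For events, moving above the diagonal (to a higher risk category) is
    favourable; for non-events, moving below the diagonal is favourable."""
    up = np.triu(table, k=1).sum()     # reclassified to a higher category
    down = np.tril(table, k=-1).sum()  # reclassified to a lower category
    net = (up - down) if up_is_good else (down - up)
    return net / table.sum()

event_nri = net_reclassification(events, up_is_good=True)           # ~0.100
non_event_nri = net_reclassification(non_events, up_is_good=False)  # ~0.073
print(f"event-NRI = {event_nri:.1%}, non-event-NRI = {non_event_nri:.1%}, "
      f"NRI = {event_nri + non_event_nri:.1%}")                     # NRI = 17.4%
```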
A prerequisite for considering risk reclassification is that the risk models are well calibrated, in the sense that the observed event rates for subgroups defined by the predictors in the models are close to the values calculated from the models. A poorly calibrated risk model is considered invalid for calculating risk as a function of the modelled predictors. It is of great concern therefore that almost half of the papers reporting risk reclassification results do not report assessment of model calibration. A second basic premise for considering risk reclassification is that the chosen categories of risk are clinically meaningful in the sense that changing risk categories has clinical consequences. The review indicates that only 27% of papers provided justification for the particular risk categories used. This is a very poor state of affairs.
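For concreteness, one common way to examine calibration is to group subjects by predicted risk (e.g. into deciles) and compare the mean model-based risk with the observed event rate in each group. The Python sketch below, which uses simulated data purely for illustration, shows such a check; the function name and the choice of deciles are ours, not a prescribed procedure.

```python
import numpy as np

def calibration_by_decile(predicted_risk, outcome, n_groups=10):
    """Compare mean predicted risk with the observed event rate within
    groups defined by deciles of predicted risk (a simple calibration check)."""
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    outcome = np.asarray(outcome, dtype=int)
    # Assign each subject to a decile of predicted risk (0 .. n_groups-1).
    cuts = np.quantile(predicted_risk, np.linspace(0, 1, n_groups + 1))
    group = np.clip(np.searchsorted(cuts, predicted_risk, side="right") - 1,
                    0, n_groups - 1)
    summary = []
    for g in range(n_groups):
        in_g = group == g
        summary.append((g + 1, int(in_g.sum()),
                        predicted_risk[in_g].mean(),   # model-based risk
                        outcome[in_g].mean()))         # observed event rate
    return summary

# Hypothetical usage on simulated data that is well calibrated by construction.
rng = np.random.default_rng(0)
risk = rng.uniform(0.01, 0.40, size=10_000)
y = rng.binomial(1, risk)
for decile, n, mean_pred, obs_rate in calibration_by_decile(risk, y):
    print(f"decile {decile:2d}: n = {n:4d}, predicted = {mean_pred:.3f}, "
          f"observed = {obs_rate:.3f}")
```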
Even if the risk models are valid and the risk categories chosen are clinically relevant, is the NRI a good way of summarizing improvement in risk reclassification performance? We do not find this single numeric summary very enlightening. Calculated as 17.4% in our example, the NRI seems to fall short of the task of gauging whether or not a substantial improvement has been obtained. Somewhat more revealing are its components, the event-NRI and non-event-NRI. If only two risk categories were involved, the event-NRI would be the increase in the proportion of subjects with events classified as high risk, and correspondingly the non-event-NRI would be the increase in the proportion of subjects without events deemed at low risk. These are simple, useful summaries. However, with more than two categories the interpretations are far less appealing, because all upward movements of risk category are counted equally and all downward movements are counted equally, yet the clinical implications are usually not equal. For example, moving from the lowest to the highest category typically has very different consequences from moving from the lowest to the intermediate category. Perhaps a single numeric summary is not needed, or at least should not be the main focus of analysis. One alternative suggestion is to report the net changes in the proportions of subjects classified into each of the risk categories.6 These are (−2.1%, −6.3%, 8.4%) for subjects with events and (7.1%, −6.6%, −0.5%) for subjects without events in Table 1. In other words, among subjects with events, 8.4% more are in the high-risk category and 2.1% fewer are in the low-risk category, whereas among subjects without events, 7.1% more are in the low-risk category and 0.5% fewer are in the high-risk category. These are simple summaries of reclassification performance that seem more clinically relevant than the NRI index of 17.4%.
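These category-specific net changes follow directly from the margins of Table 1: for each risk category, subtract the baseline-model row total from the expanded-model column total and divide by the number of subjects with (or without) events. A short Python sketch (array names illustrative) reproduces the figures quoted above.

```python
import numpy as np

# Cell counts from Table 1 (rows: baseline category; columns: expanded category).
events = np.array([[  72,  38,   4],
                   [  21, 105, 114],
                   [   0,  33, 630]])
non_events = np.array([[5486, 399,  21],
                       [1015, 990, 272],
                       [  40, 296, 464]])

def net_category_change(table):
    """Percentage-point change in the proportion classified into each risk
    category: expanded-model column totals minus baseline-model row totals."""
    return 100 * (table.sum(axis=0) - table.sum(axis=1)) / table.sum()

print(np.round(net_category_change(events), 1))      # [-2.1 -6.3  8.4]
print(np.round(net_category_change(non_events), 1))  # [ 7.1 -6.6 -0.5]
```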
Although risk reclassification analysis with the NRI has spread like wildfire in applications, it is not yet a highly developed, rigorous statistical technique. Unfortunately, this point is not widely appreciated, nor is it acknowledged in the review. In particular, statistical techniques used for hypothesis testing and for calculating confidence intervals for the NRI have not been validated. Of particular concern, current techniques do not account for sampling variability in model parameter estimates, an omission that has been shown to render hypothesis tests grossly invalid when applied to the change in area under the ROC curve (AUC) statistic and to the integrated discrimination improvement (IDI) statistic.7,8 Similar issues are likely to pertain to the NRI. Furthermore, concerns were raised about use of the NRI in a series of commentaries on the original 2008 paper. Pencina et al.9 have since suggested important modifications to the NRI statistic, but many of the original concerns, including the technical ones, remain. In our opinion, the jury is still out on whether and how the NRI should be applied in practice, and reliance on it to summarize improvement in prediction performance seems premature for now.
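To make the technical point concrete, one way to propagate sampling variability in the fitted coefficients is to refit both risk models within every bootstrap resample before recomputing the NRI, rather than treating the predicted risks as fixed. The sketch below is purely illustrative (simulated data, the risk categories of Table 1, and illustrative function names); it is not a validated procedure, and it does not address the further problem that the same data are used for model fitting and evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

RISK_CUTS = [0.05, 0.20]  # the 0-5%, 5-20%, >20% categories of Table 1

def categorical_nri(base_risk, new_risk, y, cuts=RISK_CUTS):
    """Category-based NRI = event-NRI + non-event-NRI."""
    b = np.digitize(base_risk, cuts)   # baseline-model risk category
    e = np.digitize(new_risk, cuts)    # expanded-model risk category
    up, down = e > b, e < b
    event_nri = up[y == 1].mean() - down[y == 1].mean()
    non_event_nri = down[y == 0].mean() - up[y == 0].mean()
    return event_nri + non_event_nri

def bootstrap_nri_ci(X_base, X_new, y, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the NRI in which both risk models are
    refitted in every resample, so that sampling variability in the fitted
    coefficients is propagated into the interval."""
    rng = np.random.default_rng(seed)
    n, stats = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        yb = y[idx]
        if yb.min() == yb.max():   # skip degenerate resamples with no events
            continue
        base = LogisticRegression(max_iter=1000).fit(X_base[idx], yb)
        new = LogisticRegression(max_iter=1000).fit(X_new[idx], yb)
        stats.append(categorical_nri(base.predict_proba(X_base[idx])[:, 1],
                                     new.predict_proba(X_new[idx])[:, 1], yb))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Hypothetical usage on simulated data: one baseline predictor, one new marker.
rng = np.random.default_rng(0)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.8 * x1 + 0.5 * x2))))
X_base, X_new = x1.reshape(-1, 1), np.column_stack([x1, x2])
print(bootstrap_nri_ci(X_base, X_new, y, n_boot=200))
```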
A consensus set of guidelines for reporting of genetic risk prediction studies has appeared recently in several journals as ‘The GRIPS Statement’.10 This is a welcome step forward and, since most of the guidelines apply to risk prediction studies in general, journals would do well to adopt them more broadly. With regard to analysis, the guidelines stress assessment of model calibration. We can hope, therefore, that future papers will report measures of model fit more consistently than those reviewed by Tzoulaki et al.1 However, no specific guidance is offered on how to quantify the improvement in prediction performance gained by adding genetic factors to a risk prediction model, reflecting the current lack of consensus among experts in the field. Perhaps in the future a consensus on quantifying improvement in risk prediction performance will be developed and can be added to the GRIPS statement, offering more complete guidance to investigators, readers and journal editors.
Conflict of interest: None declared.
Funding
This work was supported by the National Institutes of Health [R01 GM054438 and U24 CA086368].
References
- 1. Tzoulaki I, Liberopoulos G, Ioannidis JPA. Use of reclassification for assessment of improved prediction: an empirical evaluation. Int J Epidemiol. 2011;40:1094–105. doi: 10.1093/ije/dyr013.
- 2. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–35. doi: 10.1161/CIRCULATIONAHA.106.672402.
- 3. Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. Ann Intern Med. 2009;150:795–802. doi: 10.7326/0003-4819-150-11-200906020-00007.
- 4. Pepe MS. Problems with risk reclassification methods for evaluating prediction models. Am J Epidemiol. 2011. doi: 10.1093/aje/kwr013 (in press).
- 5. Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27:157–72. doi: 10.1002/sim.2929.
- 6. Janes H, Pepe MS, Gu W. Assessing the value of risk predictions by using risk stratification tables. Ann Intern Med. 2008;149:751–60. doi: 10.7326/0003-4819-149-10-200811180-00009.
- 7. Vickers AJ, Cronin AM, Begg CB. One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol. 2011;11:13. doi: 10.1186/1471-2288-11-13.
- 8. Kerr KF, McClelland RL, Brown ER, Lumley T. Evaluating the incremental value of new biomarkers with integrated discrimination improvement. Am J Epidemiol. 2011. doi: 10.1093/aje/kwr086 (in press).
- 9. Pencina MJ, D’Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30:11–21. doi: 10.1002/sim.4085.
- 10. Janssens ACJW, Ioannidis JPA, van Duijn CM, Little J, Khoury MJ, for the GRIPS Group. Strengthening the reporting of genetic risk prediction studies: the GRIPS Statement. PLoS Med. 2011;8:e1000420. doi: 10.1371/journal.pmed.1000420.