Abstract
When evaluating a new risk factor for disease (eg, a measurement from imaging studies), many investigators examine its value above and beyond existing biomarkers and risk factors. They compare the performance of an “old” risk model using established predictors and a “new” risk model that adds the new factor. Net reclassification index (NRI) statistics are a family of metrics for comparing two risk models. NRI statistics became popular in some medical fields and have appeared in high-impact journals. This article reviews NRI statistics and describes several issues with them. Problems include unacceptable statistical behavior, incorrect statistical inferences, and lack of interpretability. NRI statistics are unhelpful (at best) and misleading (at worst).
© RSNA, 2022
Summary
Net reclassification index statistics can be misleading, giving an optimistically biased assessment of a new biomarker; their use is discouraged.
Essentials
■ Net reclassification index (NRI) statistics are a family of statistics for comparing two risk models; categorical NRI statistics are based on predefined risk categories (eg, low-, medium-, and high-risk categories), whereas the category-free NRI statistic does not use any categorization of risk (it is also called the continuous NRI).
■ Overall NRI statistics lack any clear interpretation, and incorrect interpretations appear in the literature (eg, NRI does not measure the proportion of patients correctly reclassified); if you see a simple interpretation of an NRI statistic, there is a good chance it is wrong.
■ NRI statistics ignore the relative severity of different types of errors (eg, large overestimation and underestimation of risk are two types of risk model errors, but they carry different harms to patients); NRI statistics are not recommended due to lack of clinical relevance and unacceptable statistical properties, and alternatives to NRI include the area under the receiver operating characteristic curve (C-index), net benefit, and relative utility metrics.
Introduction
Risk models and diagnostic scoring systems are common in clinical medicine. A familiar example is the risk model used to assess an individual’s risk for atherosclerotic disease and suitability for statin therapy. Insurance coverage for statin therapy is partly based on a widely accepted risk model (an equation) to estimate an individual’s 10-year risk for atherosclerotic disease. This equation combines seven factors: age, sex, ethnicity, blood pressure, cholesterol, and diabetes and smoking history.
A researcher may seek to add more information to the risk model to improve the model for assessing an individual’s risk. For example, consider adding an eighth risk factor, perhaps the diameter of the aorta. How do we decide if the equation, or risk model, is improved? In this example, two types of improvement must be considered. First, we want to identify more of the individuals who will eventually have disease as high risk. Second, we also want to correctly classify individuals who will not get disease as low risk. Note that both of these aspects separately contribute to the utility of the model.
In the domain of diagnosis, a risk model or scoring system might be used to classify patients and inform appropriate clinical care. For example, the Breast Imaging Reporting and Data System (BI-RADS) is a six-category system to summarize radiologic assessment after mammography or MRI (1). The BI-RADS categories are ordered from 1 to 6, which can be considered categories of risk from low to high. A useful system will give women who truly have breast cancer high scores and women without breast cancer low scores. If an alternative scoring system to BI-RADS were proposed, then one would need to assess how it performs on both of these dimensions and whether it performs better than BI-RADS.
Net reclassification index (NRI) statistics are a family of metrics that have been used to compare the performance of two medical risk models. Figure 1 shows the mathematical calculation for NRI. NRI statistics have mostly been used when an existing risk model is extended with one or more additional risk factors, as in the example earlier, but they can be used to compare any two risk models for the same condition (2). The NRI can also be used for statistical testing, and an NRI P value of less than .05 has been presented as evidence of superior performance of the new risk model. Despite widespread use, the results from using an NRI statistic can be misleading or may even result in incorrect statistical conclusions.
Figure 1:
Mathematical definition of the net reclassification index (NRI). The NRI compares a new risk model with an existing model. “Event” and “nonevent” refer to patients who would and would not go on to experience an event without intervention. “Up” refers to patients who move to a higher risk category with the new risk model; “down” refers to patients who move to a lower risk category. The event NRI is P(up | event) − P(down | event), and the nonevent NRI is P(down | nonevent) − P(up | nonevent). The (overall) NRI is the sum of the event NRI and the nonevent NRI. Note that the NRI combines four proportions but is not itself a proportion; its value can range from −2 to 2. P = probability.
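The calculation in Figure 1 can be sketched in a few lines of code. The formulas follow the standard categorical NRI definition; the counts in the example call are hypothetical.

```python
def nri(events_up, events_down, n_events,
        nonevents_up, nonevents_down, n_nonevents):
    """Categorical net reclassification index.

    'Up' = moved to a higher risk category under the new model;
    'down' = moved to a lower category. Moving up is an improvement
    for events, and moving down is an improvement for nonevents.
    """
    event_nri = (events_up - events_down) / n_events
    nonevent_nri = (nonevents_down - nonevents_up) / n_nonevents
    return event_nri, nonevent_nri, event_nri + nonevent_nri

# Hypothetical data set: 100 events and 900 nonevents.
ev, nonev, overall = nri(events_up=30, events_down=10, n_events=100,
                         nonevents_up=45, nonevents_down=90, n_nonevents=900)
print(ev, nonev, overall)  # 0.2 0.05 0.25
```

Note that each component is a difference of two proportions computed within its own group, which is why the overall NRI is not a proportion of the whole sample.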
Where did NRI statistics come from? Why were they invented? A common experience across many fields of medicine is that investigators become excited about a new biomarker, let’s call it B. In the example of atherosclerotic disease risk, B might be aortic diameter. The investigators are excited about B because they think it can dramatically improve clinicians’ ability to identify the patients most likely to experience a cardiac event in the next 5 years. A preliminary evaluation of B reveals a large and statistically significant odds ratio, area under the receiver operating characteristic curve (AUC), or other performance metric. For the example of aortic diameter, individuals with larger aortas are indeed likely to have worse survival than those with normal aortic dimensions.
Brimming with anticipation of a major advance in clinical care, the investigators move on and evaluate how well B improves risk assessment above and beyond the conventional risk factors for atherosclerotic disease. In the parlance of the field, they assess the incremental value of B. This stage of investigation is frequently disappointing. The investigators are likely to calculate the AUC, also known as the C-index, which can be used to assess a model’s ability to discriminate between patients who will and will not be diagnosed with atherosclerotic disease within 10 years. But the two AUC values—the AUC for the risk model with and without B—might be nearly the same; the investigators may find that their new risk factor B changes the AUC by only 0.01–0.02, or even less. A statistical comparison of the AUCs of the conventional model and the model with B added often yields a P value that is not statistically significant at the .05 level.
This common experience has led some investigators to conclude that AUC is “insensitive.” In 2007, an influential article proposed that new biomarkers should be evaluated based on the concept of reclassification (3): Given predefined categories of risk, the proposal was that a new biomarker is valuable if using it causes a sizeable proportion of patients to move to different risk categories. This concept was appealing to researchers—perhaps especially to those investigators who were disappointed in AUC results for their latest biomarker pursuit.
In a data set composed of events (in the prognostic setting, patients who go on to have the clinical event) and nonevents (patients who do not), a good model should put events into high-risk categories and nonevents into low-risk categories. Researchers quickly recognized a big problem with the initial proposal to examine reclassification: all reclassifications were considered equally good, regardless of whether they made things better or worse. NRI statistics (4) entered the scene as metrics that rely on the concept of reclassification but avoid the biggest issue with the 2007 proposal. However, the concept of reclassification warrants further careful consideration. Despite its intuitive appeal, reclassification turns out not to be a helpful framework for evaluating new risk models.
Whenever there are two candidate risk models, such as an “old” model and a “new” model, the overriding question is “Which risk model should we use?” Ideally, a risk model’s performance will be assessed in a way that aligns with what the risk model will be used for (5,6). For example, if a risk model will be used to recommend a preventative intervention to high-risk individuals, then net benefit or relative utility metrics (7–10) may be best suited to assess performance. Or, suppose a clinical trial is being planned to evaluate an intervention for preventing a bad clinical event. A prognostic model might be used to enrich the trial for individuals at high risk of the event (11). One can evaluate the model based on its ability to improve trial metrics (12,13). In these examples, we can assess the performance of the old and new risk models and compare these assessments. In stark contrast, NRI statistics cannot summarize a risk model’s performance. NRI statistics can only be used to compare risk models. This fact alone implies that NRI does not summarize the clinical or population impact of using the risk model.
Figure 2 provides examples in which the NRI rewards poor performance. The figure presents simple examples of new and old risk models with three categories of risk. In example 1, the new risk model improves risk categorization for more event patients than it worsens it for, and likewise improves categorization for more nonevent patients than it worsens it for; the overall NRI is 0.35. In example 2, the new risk model improves risk categorization for some event patients and worsens it for none. However, in example 2 the new risk model worsens risk categorization for many more nonevent patients than it improves. Inspection of the results in example 2 would likely lead to the conclusion that the new risk model is unacceptable: most of the population is nonevent patients, and nonevent patients are mostly harmed by the new risk model. Yet the overall NRI statistic is 0.45 in example 2 compared with 0.35 in example 1.
Figure 2:
Three-category net reclassification index (NRI) for a sample of 1000 patients from a population with 10% events (patients who would go on to experience an event without intervention) and 90% nonevents (patients who would not go on to experience an event without intervention). Overall NRI is higher in example 2 compared with example 1. Yet, in example 2 more patients have worse risk classification (orange cells) compared with better risk classification (green cells) with a switch from the old to the new risk model. In other words, more patients would be harmed than helped by switching from the old risk model to the new risk model, even though the NRI statistic is positive. This behavior arises because the NRI ignores the relative proportions of events and nonevents in the population. In example 3, the percentage of event patients reclassified and the percentage of nonevent patients reclassified are identical to those in example 1; the NRI statistics are also identical to those in example 1. However, compared with example 1, the new risk model in example 3 reclassifies more event patients from high risk to low risk instead of from high risk to medium risk and reclassifies more nonevent patients from low risk to high risk instead of from low risk to medium risk. The NRI statistics are identical for example 1 and example 3 because the NRI only accounts for the direction of changes and not their magnitude; thus, the NRI ignores crucial information about the clinical impact of reclassification.
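The behavior described in the caption, a positive overall NRI even though more patients are harmed than helped, can be reproduced with simple counts. The numbers below are hypothetical and illustrative; they are not the counts in Figure 2.

```python
# Hypothetical population: 100 events, 900 nonevents (10% event rate).
n_events, n_nonevents = 100, 900

# The new model moves 40 events up (better for events) and none down...
events_up, events_down = 40, 0
# ...but moves 200 nonevents up (worse for nonevents) and only 50 down.
nonevents_up, nonevents_down = 200, 50

event_nri = (events_up - events_down) / n_events              # 0.40
nonevent_nri = (nonevents_down - nonevents_up) / n_nonevents  # about -0.167
overall_nri = event_nri + nonevent_nri                        # about +0.23

helped = events_up + nonevents_down   # 90 patients classified better
harmed = events_down + nonevents_up   # 200 patients classified worse
print(overall_nri > 0, harmed > helped)  # True True
```

Because each group's NRI component is normalized by its own group size, the 900 nonevents carry no more weight than the 100 events, which is exactly how the harm to the larger group gets diluted.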
Similarly, Cook (14) shows an example where all standard assessments of a biomarker suggest it has positive incremental value, yet the category-free NRI statistic is negative because more patients’ risks change in the wrong direction than the right direction when the biomarker is incorporated into risk prediction. The problem is that the category-free NRI statistic ignores the magnitude of change. All changes contribute equally to the statistic, no matter how large or infinitesimally small. To some extent the same issue can arise with categorical NRI statistics because they also only account for the direction of risk reclassification, not the magnitude (15). This is illustrated by example 3 in Figure 2. In addition, overall NRI statistics ignore differing harms of downgrading risk for event patients versus the harms of upgrading risk for nonevent patients (16).
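A small sketch of the category-free (continuous) NRI makes the magnitude problem concrete: each patient contributes only the sign of their risk change, so many negligibly small changes in the right direction can outweigh a few clinically large changes in the wrong direction. The risk changes below are hypothetical.

```python
def continuous_nri(event_changes, nonevent_changes):
    """Category-free NRI: each patient contributes only the sign of
    their risk change (new risk minus old risk), never its size."""
    def sign(x):
        return (x > 0) - (x < 0)
    event_nri = sum(sign(d) for d in event_changes) / len(event_changes)
    nonevent_nri = sum(-sign(d) for d in nonevent_changes) / len(nonevent_changes)
    return event_nri + nonevent_nri

# Nine events whose risks rise by a negligible 0.1 percentage point,
# one event whose risk drops by a clinically huge 40 points:
events = [0.001] * 9 + [-0.40]
# Nine nonevents unchanged, one whose risk rises sharply by 30 points:
nonevents = [0.0] * 9 + [0.30]

print(round(continuous_nri(events, nonevents), 3))  # 0.7
```

The statistic comes out strongly positive even though the only changes of any clinical consequence go in the wrong direction, which is the essence of Cook's counterexample.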
Most NRI statistics lack any clear interpretation, and incorrect NRI interpretations regularly appear in the literature (15,17). In addition, NRI statistics have problematic statistical properties, with the potential to mislead researchers about the value of new biomarkers (18–21). Some have wondered whether NRI popularity stems from the fact that NRI statistics often yield larger values than other metrics (6,14,16). While this can help explain the appeal of the NRI, we should be alarmed if a method is adopted, not due to its merits, but because it gives favorable results (16).
When there are two candidate risk models, it can be interesting to explore if there are substantive differences between them in how they classify different individuals. However, all things considered, risk reclassification is not a helpful framework to evaluate new risk models. NRI statistics, which are based on reclassification, are flawed. NRI statistics do not quantify risk model performance, are not interpretable, and are prone to give misleading results. Although the AUC is criticized, fairly, because it does not quantify the clinical or public health benefit of a risk model, NRI statistics are a step in the wrong direction and should not be used.
Supported by National Institutes of Health award R01HL085757.
Disclosures of conflicts of interest: K.F.K. No relevant relationships.
Abbreviations:
- AUC
- area under the receiver operating characteristic curve
- NRI
- net reclassification index
References
- 1. American College of Radiology. Breast Imaging Reporting and Data System (BI-RADS). Reston, Va: American College of Radiology, 2013.
- 2. McKearnan SB, Wolfson J, Vock DM, Vazquez-Benitez G, O’Connor PJ. Performance of the Net Reclassification Improvement for Nonnested Models and a Novel Percentile-Based Alternative. Am J Epidemiol 2018;187(6):1327–1335.
- 3. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007;115(7):928–935.
- 4. Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med 2008;27(2):157–172; discussion 207–212.
- 5. Pencina MJ, Fine JP, D’Agostino RB Sr. Discrimination slope and integrated discrimination improvement - properties, relationships and impact of calibration. Stat Med 2017;36(28):4482–4490.
- 6. Kerr KF, Janes H. First things first: risk model performance metrics should reflect the clinical application. Stat Med 2017;36(28):4503–4508.
- 7. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006;26(6):565–574.
- 8. Kerr KF, Brown MD, Zhu K, Janes H. Assessing the Clinical Impact of Risk Prediction Models With Decision Curves: Guidance for Correct Interpretation and Appropriate Use. J Clin Oncol 2016;34(21):2534–2540.
- 9. Baker SG. Putting risk prediction in perspective: relative utility curves. J Natl Cancer Inst 2009;101(22):1538–1542.
- 10. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 2016;352:i6.
- 11. Temple R. Enrichment of clinical study populations. Clin Pharmacol Ther 2010;88(6):774–778.
- 12. Kerr KF, Roth J, Zhu K, et al. Evaluating biomarkers for prognostic enrichment of clinical trials. Clin Trials 2017;14(6):629–638.
- 13. Cheng S, Kerr KF, Thiessen-Philbrook H, Coca SG, Parikh CR. BioPETsurv: Methodology and open source software to evaluate biomarkers for prognostic enrichment of time-to-event clinical trials. PLoS One 2020;15(9):e0239486.
- 14. Cook NR. Clinically relevant measures of fit? A note of caution. Am J Epidemiol 2012;176(6):488–491.
- 15. Kerr KF, Wang Z, Janes H, McClelland RL, Psaty BM, Pepe MS. Net reclassification indices for evaluating risk prediction instruments: a critical review. Epidemiology 2014;25(1):114–121.
- 16. Vickers AJ, Pepe M. Does the net reclassification improvement help us evaluate models and markers? Ann Intern Med 2014;160(2):136–137.
- 17. Leening MJ, Vedder MM, Witteman JC, Pencina MJ, Steyerberg EW. Net reclassification improvement: computation, interpretation, and controversies: a literature review and clinician’s guide. Ann Intern Med 2014;160(2):122–131.
- 18. Hilden J, Gerds TA. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index. Stat Med 2014;33(19):3405–3414.
- 19. Pepe MS, Janes H, Li CI. Net risk reclassification p values: valid or misleading? J Natl Cancer Inst 2014;106(4):dju041.
- 20. Pepe MS, Fan J, Feng Z, Gerds T, Hilden J. The Net Reclassification Index (NRI): a Misleading Measure of Prediction Improvement Even with Independent Test Data Sets. Stat Biosci 2015;7(2):282–295.
- 21. Gerds TA, Hilden J. Calibration of models is not sufficient to justify NRI. Stat Med 2014;33(19):3419–3420.



