Author manuscript; available in PMC: 2020 Sep 10.
Published in final edited form as: Stat Med. 2019 Jun 18;38(20):3817–3831. doi: 10.1002/sim.8204

Measures for evaluation of prognostic improvement under multivariate normality for nested and nonnested models

Danielle M Enserro 1,2, Olga V Demler 3, Michael J Pencina 4, Ralph B D’Agostino Sr 5
PMCID: PMC6827341  NIHMSID: NIHMS1055105  PMID: 31211443

Abstract

When comparing performances of two risk prediction models, several metrics exist to quantify prognostic improvement, including the change in the area under the Receiver Operating Characteristic curve, the Integrated Discrimination Improvement, the Net Reclassification Index at event rate, the change in Standardized Net Benefit, the change in Brier score, and the change in scaled Brier score. We explore the behavior and interrelationships between these metrics under multivariate normality in nested and nonnested model comparisons. We demonstrate that, within the framework of linear discriminant analysis, all six statistics are functions of squared Mahalanobis distance, a robust metric that properly measures discrimination by quantifying the separation between the risk scores of events and nonevents. These relationships are important for overall interpretability and clinical usefulness. Through simulation, we demonstrate that the performance of the theoretical estimators under normality is comparable or superior to empirical estimation methods typically used by investigators. In particular, the theoretical estimators for the Net Reclassification Index and the change in Standardized Net Benefit exhibit less variability in their estimates as compared to their empirically estimated counterparts. Finally, we explore how these metrics behave with potentially nonnormal data by applying these methods in a practical example based on the sex-specific cardiovascular disease risk models from the Framingham Heart Study. Our findings aim to give greater insight into the behavior of these measures and the connections existing among them and to provide additional estimation methods with less variability for the Net Reclassification Index and the change in Standardized Net Benefit.

Keywords: AUC, Brier score, IDI, net benefit, NRI, risk prediction

1 | INTRODUCTION

Public health practice and quality of medical care rely heavily on the accuracy, precision, and robustness of risk prediction models. Health care providers use these models to assess a patient’s risk of developing an event during a specified time-frame given specific patient characteristics. For example, the Framingham Heart Study cardiovascular disease (CVD) risk model estimates the risk of a CVD event within 10 years with sex-specific prediction models, considering age, total and high-density lipoprotein (HDL) cholesterol, systolic blood pressure (SBP), treatment for hypertension, current smoking status, and diabetes mellitus.1 Based on an individual’s estimated risk, the health provider recommends a course of treatment for preventive action. Regarding risk estimation, a patient’s estimated risk level is dependent on the state of clinical practice at the time the data are collected. In the case of preventive care, the data analysis will assign higher risk levels to patients in need of more aggressive treatment than the standard care.

Due to the importance of risk prediction models in personalized medicine, their accurate estimation and evaluation is crucial for the development of valid and replicable algorithms. Proper validation techniques assist in the process of translating prediction models into decision rules used in clinical practice.2 This includes developing new risk prediction models, determining whether the addition of novel biomarkers improves the prognostic performance of a well-established, standard risk prediction model, or comparing the performances of two nonnested models predicting the same outcome. Quantifying the improvement in prognostic performance of the model is typically of interest. In this context, prognostic improvement refers to improved prognosis-making ability. The assessment of model calibration and discrimination is a key step in the prognostic validation process. Discrimination quantifies the model’s ability to distinguish correctly the two classes of outcome (ie, event vs nonevent) while calibration quantifies how close predicted risks are to observed event proportions (eg, for patients with a predicted risk of 20%, on average 20 out of 100 patients in the data should experience the event of interest).3,4 A number of measures quantifying prognostic improvement exist, including the change in the area under the Receiver Operating Characteristic curve (AUC), the Integrated Discrimination Improvement, the Net Reclassification Index, the change in Standardized Net Benefit, the change in Brier score, and the change in scaled Brier score.

A widely used measure of discrimination, the AUC, is the probability that the risk percentage assigned to a randomly selected patient who did not experience the event is less than the risk percentage of a randomly selected patient who did experience the event.3,5,6 The nonparametric Mann-Whitney statistic is a common estimator of AUC.7 The change in AUC (ΔAUC) is the difference between the AUC of the new or expanded model and the AUC of the standard or baseline model.7 When comparing two AUCs, some authors argue that problems may arise when investigators are not interested in the entire area under the curve (ie, the full range of false-positive rates) and that a comparison of partial AUCs should be considered.8-10 However, other authors argue that the overall AUC is the reporting standard,11,12 and thus the overall AUC remains the focus of this paper. Regarding hypothesis testing for AUC, Demler et al demonstrated that, if the two risk models compared are correct under normality, the significance of the new predictor is equivalent to significant improvement in AUC.13 This finding renders testing for significant improvement in AUC redundant, and investigators should instead focus on accurate estimation of ΔAUC. The magnitude of the improvement in AUC depends on the strength of the baseline model; as the strength of the baseline model increases, the potential increase in AUC decreases. This phenomenon may lead investigators to be overly conservative,14 and it was one of the motivating factors in the search for new methods for assessing discrimination improvement.15,16 Another motivation was the use of reclassification tables in model validation; reclassification tables tabulate patients into clinically relevant risk categories assuming risk thresholds chosen a priori.16 For example, the ACC/AHA Cholesterol Guidelines proposed by the American College of Cardiology (ACC) and the American Heart Association (AHA) categorize patients into low-risk, intermediate-risk, and high-risk categories based on risk thresholds of 5% and 7.5%, in order to aid practitioners in making treatment decisions.17 In response, Pencina et al proposed two new measures for assessing discrimination improvement: the Integrated Discrimination Improvement (IDI) and the Net Reclassification Index (NRI).14 IDI is the difference in discrimination slopes between two models, where each discrimination slope is the difference in average predicted probabilities between events and nonevents. NRI is a risk category-based measure, summarizing movement between assigned risk categories within a reclassification table for events and nonevents separately while penalizing incorrect movements. Sample data tabulated in the reclassification table empirically estimate the proportions of patients who move up and move down in categories among events and nonevents.
In the case of two risk categories with the risk threshold set at the event rate or incidence rate, NRI becomes a special version known as NRI at event rate (NRI(y)), a metric recommended by Pencina et al for its beneficial properties, including invariance with respect to event rates and its relation to decision-analytic metrics.18 While some researchers debate the behavior, interpretation, and application of IDI and NRI,19-23 many authors support their use and recommend properly calibrating risk models before estimating either metric.18,24 Finally, the original derivations of ΔAUC, IDI, and NRI assume risk prediction functions where the analysis method predicts a binomial outcome over a fixed time point; however, extensions utilizing survival analysis for time-to-event outcomes do exist.25-27

The first contribution of this paper is extending previously derived theoretical expressions under normality linking squared Mahalanobis distance (M²) to ΔAUC, IDI, and NRI(y) from nested models to nonnested models (Section 2). Pencina et al demonstrated that, for nested models satisfying the assumptions of linear discriminant analysis (LDA), relationships link ΔAUC and IDI to M².28 Pencina et al further extended the relationship linking M² to NRI(y) for nested models under the same conditions.18 The Mahalanobis distance (M) is a robust metric that properly measures discrimination by quantifying the separation between the risk scores of patients with and without the event. M possesses good discriminative performance for data from a wide range of probability distributions.29 By adopting a similar framework, we show that the formulas also hold for nonnested model comparisons, increasing their usability in practice. Within these derivations and additional derivations that follow, we mention several equivalencies among the measures of prognostic improvement. In this context, equivalence means that, if a hypothetical change within the scenario implies a favorable shift in one measure, the equivalent measure also demonstrates a favorable shift.

Next, we derive the theoretical expression linking M² to the change in Standardized Net Benefit (ΔSNB(t)) (Section 2). Vickers and Elkin’s SNB(t) is a decision-theoretic metric used in decision curve analysis to help address the potential harms and benefits associated with misclassifying events and nonevents with a risk prediction model.30 The construction of the decision curve considers all possible choices of the Cost-to-Benefit ratio for classifying patients (ie, C/B; the ratio of false-positive loss to false-negative loss). The application of the curve assumes that the ensuing risk threshold C/(C + B) is adhered to; in the event a chosen threshold does not match the relevant patients’ C/B ratio, the utility consequences differ from the assumed decision curve. The resulting plot shows the Net Benefit (NB(t)) of treating patients according to the prediction model.31 NB(t) is defined as the difference between the number of true-positive classifications and the number of false-positive classifications (each typically expressed per patient), with the latter weighted by the C/B ratio. NB(t) is interpreted in units of true positives: how many excess patients were correctly treated at the same rate of not treating the patients who did not require treatment.31 In cost analysis, utility is defined as monetary cost or harm. If no predictions were made, there is the utility of treating no patients (ie, all patients classified as negative) or treating all patients (all patients classified as positive). The SNB(t), or relative utility, of a prediction model is the ratio of the maximum utility of the prediction model (vs no prediction) to the utility of perfect prediction (vs no prediction), where the no-prediction utility is equal to the larger of the utility of treating no patients or the utility of treating all patients.32 SNB(t) expresses the usefulness of the risk model as the proportion of the maximum gain relative to the best standard strategy at a given threshold.33 ΔSNB(t) has also been extended for use in time-to-event analysis.34
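To make these definitions concrete, the following minimal Python sketch computes NB(t) and SNB(t) empirically from binary outcomes and model-based risks. The function names and array-based interface are ours, and the SNB form follows the relative-utility definition above with C/B = t/(1 − t); this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def net_benefit(p: np.ndarray, y: np.ndarray, t: float) -> float:
    """NB(t): per-capita true positives minus per-capita false positives
    weighted by C/B = t/(1 - t), since the threshold is t = C/(C + B)."""
    pos = p >= t                       # patients the model would treat
    n = len(y)
    tp = np.sum(pos & (y == 1)) / n
    fp = np.sum(pos & (y == 0)) / n
    return tp - fp * t / (1 - t)

def standardized_net_benefit(p: np.ndarray, y: np.ndarray, t: float) -> float:
    """SNB(t) as relative utility: model NB relative to the better of
    treat-none (NB = 0) and treat-all, scaled by perfect prediction."""
    rho = y.mean()                             # event rate
    nb_all = rho - (1 - rho) * t / (1 - t)     # treat every patient
    best_default = max(nb_all, 0.0)            # treat-none has NB = 0
    nb_perfect = rho                           # all events, no false positives
    return (net_benefit(p, y, t) - best_default) / (nb_perfect - best_default)
```

One can verify algebraically that this relative-utility form reproduces the piecewise ΔSNB(t) expression given later in Section 2: when y ≤ t the treat-none strategy is the better default, and when y > t the treat-all strategy is.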

Third, we derive the theoretical expressions linking M² to the change in Brier score (ΔBS) and the change in scaled Brier score (ΔSBS) (Section 2). BS is a quadratic scoring rule originally utilized for the verification of weather forecasts.35 Similar to mean squared error, its estimation uses the squared differences between actual outcomes and their corresponding predicted probabilities.36 Lower values of BS indicate better model predictions, and the maximal attainable value of BS depends on the prevalence of the outcome in the data.31 SBS is a rescaled version of BS, where BS is scaled by its maximum score assuming a noninformative model and a given event rate. This allows SBS to range from 0% to 100%, with larger values indicating better predictions by the model.31 While all aforementioned measures are affected by miscalibration, BS can be expressed in terms that specifically demonstrate the effect of miscalibration on its estimation.36 Specifically, Yates showed that BS can be decomposed into four components: event prevalence, calibration-in-the-large, the variance of the predicted probabilities, and the covariance between predicted probabilities and observed outcomes.37 Calibration-in-the-large is satisfied when the average of the model-based risks equals the event rate, and it is considered the simplest global measure of calibration. If the model is unbiased, this component will be zero. However, poor calibration may still exist if some groups of patients are badly under-predicted while others are badly over-predicted, and these potential issues will affect the third and fourth components of Yates’ decomposition.37 Due to the relationships demonstrated in this paper, we also show that any significant improvement in SBS and BS is equivalent to the significance of the added regression coefficients in nested model comparisons (Section 3).
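Yates' four-component decomposition can be checked numerically. The sketch below, on simulated data of our own choosing, verifies that the Brier score equals the outcome variance (a function of prevalence) plus the squared calibration-in-the-large bias plus the forecast variance minus twice the forecast-outcome covariance; the decomposition is an exact algebraic identity.

```python
import numpy as np

rng = np.random.default_rng(7)
p = rng.uniform(0.01, 0.60, 100_000)   # predicted probabilities
y = rng.binomial(1, p).astype(float)   # outcomes drawn from those risks

bs = np.mean((y - p) ** 2)

# Yates' decomposition: outcome variance (a function of prevalence),
# squared calibration-in-the-large bias, forecast variance, and
# forecast-outcome covariance
outcome_var = y.mean() * (1 - y.mean())
bias_sq = (p.mean() - y.mean()) ** 2
forecast_var = np.var(p)
cov = np.mean((p - p.mean()) * (y - y.mean()))

# The identity is exact up to floating point
print(bs, outcome_var + bias_sq + forecast_var - 2 * cov)
```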

Through simulation, we compare the performance of the theoretical estimators for ΔAUC, IDI, NRI(y), ΔSNB(t), ΔSBS, and ΔBS with their empirical estimator counterparts (Section 4). While the simulations assume multivariate normality among the predictors, we do explore how these metrics perform with data with some departure from normality by applying both the theoretical estimators and the empirical estimators to a practical example using data from the Framingham Heart Study (Section 5). Finally, we discuss the implications of the theoretical derivations and the potential for further research, and we make recommendations for their use in practice (Section 6).

2 | DISCRIMINATION IMPROVEMENT MEASURES UNDER NORMALITY FOR NESTED AND NONNESTED MODELS

Table 1 summarizes the theoretical estimators under normality for ΔAUC, IDI, NRI(y), ΔSNB(t), ΔSBS, and ΔBS. These formulas apply to both nested and nonnested model comparisons. Pencina et al derived the formulas for ΔAUC, IDI, and NRI(y) in the nested model case.14,18 We demonstrate that these formulas also extend to nonnested models. We also provide formulas under normality for ΔSNB(t), ΔSBS, and ΔBS. We assume a data framework satisfying the assumptions of LDA, including multivariate normality, homogeneity of variance and covariance, and independence among sample observations.38 Under the assumption of multivariate normality of predictor variables, LDA produces more efficient regression coefficient estimates than logistic regression.38,39 We provide abbreviated details of the assumptions and derivations here; full details are available in the Appendix.

TABLE 1.

Risk model validation measures: equivalents under normality

Measure | Theoretical Estimator Under Normality for Nested and Nonnested Models

ΔAUC:

$$\Phi\!\left(\sqrt{\tfrac{M^2_{new}}{2}}\right) - \Phi\!\left(\sqrt{\tfrac{M^2_{old}}{2}}\right) \tag{1}$$

IDI:

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi M^2_{new}}} \exp\!\left(-\frac{(u - 0.5M^2_{new})^2}{2M^2_{new}}\right)\left(\frac{1}{1 + re^{-u}} - \frac{1}{1 + re^{u}}\right) du \; - \; \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi M^2_{old}}} \exp\!\left(-\frac{(u - 0.5M^2_{old})^2}{2M^2_{old}}\right)\left(\frac{1}{1 + re^{-u}} - \frac{1}{1 + re^{u}}\right) du \tag{2}$$

NRI(y):

$$2\left[\Phi\!\left(\frac{\sqrt{M^2_{new}}}{2}\right) - \Phi\!\left(\frac{\sqrt{M^2_{old}}}{2}\right)\right] \tag{3}$$

ΔSNB(t), writing $u_t = \ln\!\left(\frac{t(1-y)}{(1-t)y}\right)$ for compactness:

$$\Delta SNB(t) = \begin{cases} \left[\Phi\!\left(\frac{\sqrt{M^2_{new}}}{2} - \frac{u_t}{\sqrt{M^2_{new}}}\right) - \Phi\!\left(\frac{\sqrt{M^2_{old}}}{2} - \frac{u_t}{\sqrt{M^2_{old}}}\right)\right] + \frac{t(1-y)}{y(1-t)}\left[\Phi\!\left(\frac{\sqrt{M^2_{new}}}{2} + \frac{u_t}{\sqrt{M^2_{new}}}\right) - \Phi\!\left(\frac{\sqrt{M^2_{old}}}{2} + \frac{u_t}{\sqrt{M^2_{old}}}\right)\right], & y \le t \\[1ex] \frac{y(1-t)}{t(1-y)}\left[\Phi\!\left(\frac{\sqrt{M^2_{new}}}{2} - \frac{u_t}{\sqrt{M^2_{new}}}\right) - \Phi\!\left(\frac{\sqrt{M^2_{old}}}{2} - \frac{u_t}{\sqrt{M^2_{old}}}\right)\right] + \left[\Phi\!\left(\frac{\sqrt{M^2_{new}}}{2} + \frac{u_t}{\sqrt{M^2_{new}}}\right) - \Phi\!\left(\frac{\sqrt{M^2_{old}}}{2} + \frac{u_t}{\sqrt{M^2_{old}}}\right)\right], & y > t \end{cases} \tag{4}$$

ΔSBS (identical in form to Relationship (2)):

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi M^2_{new}}} \exp\!\left(-\frac{(u - 0.5M^2_{new})^2}{2M^2_{new}}\right)\left(\frac{1}{1 + re^{-u}} - \frac{1}{1 + re^{u}}\right) du \; - \; \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi M^2_{old}}} \exp\!\left(-\frac{(u - 0.5M^2_{old})^2}{2M^2_{old}}\right)\left(\frac{1}{1 + re^{-u}} - \frac{1}{1 + re^{u}}\right) du \tag{5}$$

ΔBS:

$$y(1-y)\left[\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi M^2_{old}}} \exp\!\left(-\frac{(u - 0.5M^2_{old})^2}{2M^2_{old}}\right)\left(\frac{1}{1 + re^{-u}} - \frac{1}{1 + re^{u}}\right) du - \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi M^2_{new}}} \exp\!\left(-\frac{(u - 0.5M^2_{new})^2}{2M^2_{new}}\right)\left(\frac{1}{1 + re^{-u}} - \frac{1}{1 + re^{u}}\right) du\right] \tag{6}$$

ΔAUC = Change in area under the Receiver Operating Characteristic curve; IDI = Integrated Discrimination Improvement;

NRI(y) = Net Reclassification Index at event rate; ΔSNB(t) = Change in Standardized Net Benefit;

ΔSBS = Change in scaled Brier score; ΔBS = Change in Brier score.

$M^2_{new}$ is the squared Mahalanobis distance for the new or updated model; $M^2_{old}$ is the squared Mahalanobis distance for the standard or baseline model; r is the prevalence or incidence ratio of nonevents to events; y is the event rate; t is the assumed threshold.

Let Y be a binary outcome, where Y = 1 for an event of interest and Y = 0 for a nonevent. We define a column vector X of p + q variables conditional on Y that follow a multivariate normal distribution such that X | Y = 1 ∼ N(μ1, Σ1) and X | Y = 0 ∼ N(μ0, Σ0). We assume that Σ1 = Σ0 = Σ and that Σ−1 denotes its inverse. We also assume that Y and X are nonmissing for all N independent observations. We let δ = μ1 − μ0 denote the vector of mean differences between the events and nonevents. We assume that we wish to evaluate improvement in the performance of a risk model with p + q variables over a model with the first p variables only (the nested case) or that we wish to compare a model with the first p predictors to a model with the last q predictors (the nonnested case). The first model is the new or updated model and the second model is the standard or baseline model. We define $M^2 = \delta^T \Sigma^{-1} \delta$. The predicted probability of an event is $p(X) = \frac{1}{1 + re^{-L^*(X)}}$, where r is the prevalence or incidence ratio of nonevents to events and L*(X) is the LDA classification function based on the predictors in X (eg, $L^*_p(X) = a^T X - \frac{1}{2} a^T(\mu_1 + \mu_0)$, $L^*_q(X) = b^T X - \frac{1}{2} b^T(\mu_1 + \mu_0)$, and $L^*_{p+q}(X) = c^T X - \frac{1}{2} c^T(\mu_1 + \mu_0)$ are the LDA classification functions based on the p predictors, the q predictors, and the p + q predictors, respectively). We let p(X | Y = 1) and p(X | Y = 0) be the predicted probabilities of event in events and nonevents, respectively. Se(t) is the model sensitivity (ie, the true-positive rate) and Sp(t) is the model specificity (ie, the true-negative rate), both assuming the threshold t. For NRI(y), we set the threshold t equal to the event rate, y.
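As a concrete companion to these definitions, the following minimal Python sketch (our own; the function name is hypothetical) estimates M² from sample data, using the pooled covariance implied by the homogeneity assumption Σ1 = Σ0 = Σ.

```python
import numpy as np

def mahalanobis_sq(x_events: np.ndarray, x_nonevents: np.ndarray) -> float:
    """M^2 = delta' Sigma^{-1} delta, with Sigma the pooled covariance
    required by the LDA homogeneity assumption."""
    delta = x_events.mean(axis=0) - x_nonevents.mean(axis=0)
    n1, n0 = len(x_events), len(x_nonevents)
    pooled = ((n1 - 1) * np.cov(x_events, rowvar=False) +
              (n0 - 1) * np.cov(x_nonevents, rowvar=False)) / (n1 + n0 - 2)
    pooled = np.atleast_2d(pooled)
    return float(delta @ np.linalg.solve(pooled, delta))
```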

Denoting metrics from the new or updated model and the standard or baseline model with the subscripts “new” and “old”, respectively, the following three definitions describe ΔAUC, IDI, and NRI(y).18,28

A. Increase in the AUC:

$$\Delta AUC = P\big(p_{new}(X \mid Y=1) > p_{new}(X \mid Y=0)\big) - P\big(p_{old}(X \mid Y=1) > p_{old}(X \mid Y=0)\big).$$

B. IDI (as the difference in discrimination slopes):

$$IDI = \{E(p_{new}(X \mid Y=1)) - E(p_{new}(X \mid Y=0))\} - \{E(p_{old}(X \mid Y=1)) - E(p_{old}(X \mid Y=0))\} = E\!\left(\frac{1}{1 + re^{-L^*_{new}(X \mid Y=1)}} - \frac{1}{1 + re^{-L^*_{new}(X \mid Y=0)}}\right) - E\!\left(\frac{1}{1 + re^{-L^*_{old}(X \mid Y=1)}} - \frac{1}{1 + re^{-L^*_{old}(X \mid Y=0)}}\right).$$

C. NRI at event rate y:

$$NRI(y) = \big(Se_{new}(y) - Se_{old}(y)\big) + \big(Sp_{new}(y) - Sp_{old}(y)\big).$$

Employing the aforementioned LDA assumptions, we have the first three relationships listed in Table 1 for ΔAUC, IDI, and NRI(y). Relationship (1) follows from the work of Su and Liu.40 Relationship (2) relies on the normality assumption and the definition of expectations. Numerical integration is necessary for Relationship (2) since it does not have a closed-form solution. Relationship (3) follows from the work of Pencina et al.18
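Relationships (1) to (3) are directly computable once $M^2_{new}$ and $M^2_{old}$ are in hand. The sketch below implements them, using numerical integration for Relationship (2) as noted above; the helper names are ours, and r = (1 − y)/y follows the definition of r in Table 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def delta_auc(m2_new: float, m2_old: float) -> float:
    # Relationship (1): AUC = Phi(sqrt(M^2 / 2))
    return norm.cdf(np.sqrt(m2_new / 2)) - norm.cdf(np.sqrt(m2_old / 2))

def nri_at_event_rate(m2_new: float, m2_old: float) -> float:
    # Relationship (3): NRI(y) = 2[Phi(M_new / 2) - Phi(M_old / 2)]
    return 2 * (norm.cdf(np.sqrt(m2_new) / 2) - norm.cdf(np.sqrt(m2_old) / 2))

def _slope(m2: float, r: float) -> float:
    # Discrimination slope when the linear predictor is N(M^2/2, M^2)
    # in events and N(-M^2/2, M^2) in nonevents
    f = lambda u: (norm.pdf(u, loc=0.5 * m2, scale=np.sqrt(m2))
                   * (1 / (1 + r * np.exp(-u)) - 1 / (1 + r * np.exp(u))))
    return quad(f, -np.inf, np.inf)[0]

def idi(m2_new: float, m2_old: float, y: float) -> float:
    # Relationship (2), integrated numerically; r = (1 - y) / y
    r = (1 - y) / y
    return _slope(m2_new, r) - _slope(m2_old, r)
```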

We now move to ΔSNB(t) under normality. ΔSNB(t) for a given threshold t, estimated by $SNB_{new}(t) - SNB_{old}(t)$, is defined such that

$$\Delta SNB(t) = \begin{cases} \Delta Se(t) + \frac{t(1-y)}{y(1-t)} \Delta Sp(t), & y \le t \\[1ex] \frac{y(1-t)}{t(1-y)} \Delta Se(t) + \Delta Sp(t), & y > t, \end{cases}$$

where $\Delta Se(t) = Se_{new}(t) - Se_{old}(t)$ and $\Delta Sp(t) = Sp_{new}(t) - Sp_{old}(t)$. When comparing risk models with ΔSNB(t), both models should assume the same value for t. Using the definition of ΔSNB(t) and the assumption of normality, we can write the expression for ΔSNB(t) explicitly (Table 1). Relationship (4) follows from the definitions of Se(t) and Sp(t) under normality.12 When we set the threshold to the event rate, ΔSNB(t) is equivalent to NRI(y).18,33
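Relationship (4) can be computed from Se(t) = Φ(M/2 − u_t/M) and Sp(t) = Φ(M/2 + u_t/M), where u_t = ln(t(1 − y)/((1 − t)y)) is the linear-predictor cutoff corresponding to the risk threshold t. A minimal sketch under those definitions (function names are ours):

```python
import numpy as np
from scipy.stats import norm

def se_sp(m2: float, t: float, y: float):
    """Se(t) and Sp(t) under normality; u_t is the linear-predictor
    cutoff corresponding to risk threshold t."""
    m = np.sqrt(m2)
    u_t = np.log(t * (1 - y) / ((1 - t) * y))
    return norm.cdf(m / 2 - u_t / m), norm.cdf(m / 2 + u_t / m)

def delta_snb(m2_new: float, m2_old: float, t: float, y: float) -> float:
    # Relationship (4): the weights swap depending on whether y <= t
    se_n, sp_n = se_sp(m2_new, t, y)
    se_o, sp_o = se_sp(m2_old, t, y)
    d_se, d_sp = se_n - se_o, sp_n - sp_o
    if y <= t:
        return d_se + t * (1 - y) / (y * (1 - t)) * d_sp
    return y * (1 - t) / (t * (1 - y)) * d_se + d_sp
```

Note that at t = y the cutoff u_t is zero, so Se(y) = Sp(y) = Φ(M/2) and ΔSNB(y) reduces to NRI(y), consistent with the equivalence stated above.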

Next, we show that, under the assumption of normality, ΔSBS is also asymptotically a function of M². This relationship follows from SBS being asymptotically equivalent to the discrimination slope (DS) when the risk model is well calibrated and unbiased.24 Yates defined DS as the difference in the mean predicted probabilities of the event and nonevent groups.37 The name stems from DS being the slope estimate in a linear regression model treating the predicted probabilities as the continuous outcome and the binary event status as the predictor.36 The DS for one model can be written as

$$DS = E(p(X) \mid Y=1) - E(p(X) \mid Y=0) \quad \text{and, under normality,} \quad DS = E\!\left(\frac{1}{1 + re^{-L^*(X \mid Y=1)}} - \frac{1}{1 + re^{-L^*(X \mid Y=0)}}\right).$$

Therefore, we have the expression for ΔSBS (Table 1). Relationship (5) is a reasonable approximation since LDA produces well-calibrated and unbiased models. For this reason, we can also express ΔBS asymptotically as a function of M² and the event rate y (Relationship (6) in Table 1).
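Given the idi helper from the sketch following Relationships (1) to (3), Relationships (5) and (6) reduce to one-line functions; a sketch:

```python
def delta_sbs(m2_new: float, m2_old: float, y: float) -> float:
    # Relationship (5): asymptotically equal to IDI
    return idi(m2_new, m2_old, y)

def delta_bs(m2_new: float, m2_old: float, y: float) -> float:
    # Relationship (6): -y(1 - y) * IDI, negative when the new model improves
    return -y * (1 - y) * idi(m2_new, m2_old, y)
```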

3 | SIGNIFICANT IMPROVEMENT IN SCALED BRIER SCORE AND BRIER SCORE IS EQUIVALENT TO SIGNIFICANCE OF ADDED REGRESSION COEFFICIENTS

We note from the results of the previous section that ΔSBS is asymptotically equivalent to IDI under normality. This agrees with previous findings that this relationship holds in general for two well-calibrated risk models that are unbiased (ie, calibration-in-the-large).41,42

Relationship (5) in Table 1 is applicable to both nested and nonnested model comparisons. In the nested model case, the first integral is evaluated based on the expanded model and the second integral is evaluated based on the baseline model, which is nested within the expanded model. Furthermore, it has been shown previously that, for comparisons of nested risk prediction models, significant improvement in several of the metrics described in this paper, including ΔAUC and IDI, is equivalent to the significance of the added regression coefficients.13,28,43 In particular, Pencina et al and Pepe et al demonstrate that, in comparisons of nested risk models, testing the null hypothesis of no improvement in IDI (H0: IDI = 0) is equivalent to testing the null hypothesis that the regression coefficients of the added predictors are zero, even when the predictor variables are nonnormal.28,43 Since IDI is asymptotically equivalent to ΔSBS in general, testing for significant improvement in ΔSBS (H0: ΔSBS = 0) is also equivalent to testing whether the added regression coefficients are equal to zero. This also applies to testing for significant improvement in ΔBS (H0: ΔBS = 0).

Noting these equivalencies is rather important; they act as evidence that subjecting risk model performance measures to hypothesis testing is redundant and unnecessary. Instead, investigators should establish significance of added predictor variables in the new model and then focus on precise estimation of these model performance characteristics to quantify improvement in prognostic performance. This notion falls in line with a more general point in applied statistics: investigators should focus on efficient tests of parameters within well-chosen models, and the application of hypothesis testing to purely illustrative or graphical data is inefficient and unnecessary.

4 | PERFORMANCE OF FORMULAS UNDER NORMALITY VS EMPIRICAL ESTIMATORS

We compared the performance of the theoretical estimators presented in Table 1, using sample-specific estimates of M², to their empirical counterparts through simulation in order to assess their behavior under various scenarios. The empirical estimates were obtained using methods provided in the literature,14,18,30,35 by determining the relevant contribution for each clinical case and calculating the required averages or sums for the formulas summarized in Table A1 in the Appendix. We used 1000 iterations with a sample size of N = 100 000 and an assumed event rate of 10%, which corresponds closely to CVD rates observed in the Framingham Heart Study.44 We considered both nested and nonnested model comparisons. For the nested model comparisons, we assumed a baseline model with two strong, normally distributed predictors, x1 and x2, with effect sizes of 0.7 and 0.8, respectively, and an added-variable model including predictors x1 and x2 along with a third normally distributed variable x3 with a moderate effect size of 0.5. For the nonnested model comparison, we compared one model with two normally distributed predictors, x1 and x2, with effect sizes of 0.5 and 0.7, respectively, to a second model including normally distributed predictors x3 and x4, with strong effect sizes of 0.8 and 0.9, respectively. Table 2 summarizes means, standard deviations, and standardized mean differences from the theoretical estimator results vs the empirically estimated results.45 Values in this table were multiplied by 1000 for ease of reading.
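For readers wishing to reproduce the nested-case setup, the sketch below generates one simulated dataset. We read "effect size" as the standardized mean difference between events and nonevents with identity covariance; this is an assumption of ours, but it is consistent with the theoretical means in Table 2 (population values $M^2_{old}$ = 0.7² + 0.8² = 1.13 and $M^2_{new}$ = 1.38 give ΔAUC ≈ 0.023 and NRI(y) ≈ 0.038). The sketch reuses the mahalanobis_sq, delta_auc, and nri_at_event_rate helpers from the Section 2 sketches.

```python
import numpy as np

rng = np.random.default_rng(1)
n, event_rate = 100_000, 0.10

y = rng.binomial(1, event_rate, n)
# Nested case: x1, x2 (effect sizes 0.7, 0.8) plus x3 (0.5), where each
# effect size shifts the event-group mean of one predictor
delta = np.array([0.7, 0.8, 0.5])
X = rng.standard_normal((n, 3)) + np.outer(y, delta)

m2_new = mahalanobis_sq(X[y == 1], X[y == 0])          # x1 + x2 + x3
m2_old = mahalanobis_sq(X[y == 1, :2], X[y == 0, :2])  # x1 + x2 only

print(delta_auc(m2_new, m2_old), nri_at_event_rate(m2_new, m2_old))
```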

TABLE 2.

Means and standard deviations from simulations comparing theoretical estimators vs empirical estimators

Measure | Nested: Theoretical Mean (Std Dev) | Nested: Empirical Mean (Std Dev) | Nested: Standardized Mean Difference | Nonnested: Theoretical Mean (Std Dev) | Nonnested: Empirical Mean (Std Dev) | Nonnested: Standardized Mean Difference
ΔAUC 23.11 (0.93) 23.13 (1.06) −20.09 74.29 (2.84) 74.28 (3.07) 3.38
IDI 28.13 (1.13) 28.12 (1.19) 8.62 79.26 (3.08) 79.22 (3.09) 12.97
NRI(y) 38.22 (1.52) 38.12 (3.63) 35.94 120.05 (4.59) 120.03 (6.43) 3.58
ΔSNB(t = 0.05) 44.26 (1.79) 44.39 (5.61) −31.22 133.22 (5.02) 133.23 (9.03) −1.37
ΔSNB(t = 0.075) 42.86 (1.74) 42.93 (4.38) −21.00 133.93 (5.10) 134.07 (7.72) −21.39
ΔSNB(t = 0.20) 43.92 (1.77) 43.86 (4.15) 18.81 131.14 (4.94) 131.36 (6.72) −37.30
ΔSBS 28.13 (1.13) 28.11 (1.32) 16.28 79.26 (3.08) 79.21 (3.22) 15.87
ΔBS −2.53 (0.101) −2.53 (0.119) −18.12 −7.13 (0.277) −7.13 (0.290) −14.11

All values multiplied by 1000 for ease of reading.

ΔAUC = Change in area under the Receiver Operating Characteristic curve; IDI = Integrated Discrimination Improvement;

NRI(y) = Net Reclassification Index at event rate; ΔSNB(t) = Change in Standardized Net Benefit, where t is the assumed threshold of either 0.05, 0.075, or 0.20; ΔSBS = Change in scaled Brier score; ΔBS = Change in Brier score.

Nested model comparison:

Standard model includes predictors x1 + x2, where x1 has effect size of 0.7 and x2 has effect size of 0.8.

Added-variable model includes predictors x1 + x2 + x3, where x3 has effect size of 0.5.

Nonnested model comparison:

First model includes predictors x1 + x2, where x1 has effect size of 0.5 and x2 has effect size of 0.7.

Second model includes predictors x3 + x4, where x3 has effect size of 0.8 and x4 has effect size of 0.9.

Standardized mean difference was estimated as $\frac{\text{Mean(Theoretical)} - \text{Mean(Empirical)}}{\text{Pooled Standard Deviation}}$. This is also known as Cohen’s d.45 The standardized mean difference presented was also multiplied by 1000 after estimation.

Figure 1 presents a graphical representation of the simulated comparisons for ΔAUC, IDI, and NRI(y) assuming nested models, with corresponding R2 values of 0.79, 0.89, and 0.15, respectively. If the two estimators had perfect agreement, the points on the plot would fall along a 45-degree diagonal line. The theoretical estimator and the empirical estimator agree closely for ΔAUC (theoretical standard deviation (SD) = 0.9 vs empirical SD = 1.1) and IDI (theoretical SD = 1.1 vs empirical SD = 1.2). However, there is far less precision in the empirical estimates for NRI(y) as compared to the theoretical estimator (theoretical SD = 1.5 vs empirical SD = 3.6). It has been proposed that NRI is a noisier version of IDI since it replaces the difference in model probabilities with the indicator function.46 The increased variability in the empirical estimator of NRI(y) could lead to slower convergence to its true value. Due to this observation, we further examined the relationship between theoretical NRI(y) and empirical NRI(y) with various effect sizes for the added predictor, including 0.2, 0.7, 0.9, and 1.5 (Supplemental Figure 1). For all four choices of effect size, there are four distinct clusters of points indicating the correspondence of the theoretical estimate to the empirical estimate. This demonstrates that the assumed relationship appears consistent across a variety of effect sizes varying from weak to excessively strong. We also note the flattening of the cluster of values as the effect size decreases toward the null, indicating that weaker added predictors give more variability in the empirical estimates.

FIGURE 1. Theoretical estimators vs empirical logistic regression for ΔAUC, IDI, and NRI(y) assuming normal variables for nested models

Supplemental Figure 2 presents ΔAUC, IDI, and NRI(y) for the nonnested model comparison case. The R2 values for ΔAUC, IDI, and NRI(y) are 0.90, 0.95, and 0.55, respectively. Both versions of the ΔAUC and IDI estimators produce estimates with similar variability (ΔAUC: theoretical SD = 2.8 vs empirical SD = 3.1; IDI: theoretical SD = 3.1 vs empirical SD = 3.1). We again note that the correspondence between theoretical NRI(y) and empirical NRI(y) does not appear as strong visually as for the other measures (theoretical SD = 4.6 vs empirical SD = 6.4). Interestingly, the relationship appears to be stronger for nonnested models. Overall, the observed R2 values are nonnegligible, with evidence that there is much less variability in estimating theoretical NRI(y) as compared with estimating empirical NRI(y).

Figure 2 presents a graphical representation of the simulated comparisons for ΔSNB(t) assuming nested models with three choices for the threshold t: 0.05, 0.075, and 0.20. We opted for 0.05 and 0.20 as these are one-half and twice the assumed event rate.18 The third threshold of 0.075 reflects the ACC/AHA cut-off for 10-year atherosclerotic CVD risk used for recommending statin treatment.17 We opted not to present graphically the event rate of 0.10 as a threshold, as this would yield the same result as NRI(y). The three threshold choices of 0.05, 0.075, and 0.20 have corresponding R2 values of 0.12, 0.16, and 0.14, respectively. All three R2 values are quite low, indicating more variability in the empirical estimation of ΔSNB(t) than in the theoretical estimation. This mirrors the issue we noted with NRI(y). We again note larger differences in variability: theoretical SD = 1.8 vs empirical SD = 5.6 for the 0.05 threshold, theoretical SD = 1.7 vs empirical SD = 4.4 for the 0.075 threshold, and theoretical SD = 1.8 vs empirical SD = 4.1 for the 0.20 threshold. Supplemental Figure 3 demonstrates that, for a range of added effect sizes, the correspondence between theoretical ΔSNB(t) and empirical ΔSNB(t) holds, and that there is consistently less variability in the theoretical estimator compared to the empirical estimator. There are improvements in the R2 values for ΔSNB(t) in nonnested model comparisons (Supplemental Figure 4), with R2 ranging from 0.40 to 0.49 (threshold 0.05: theoretical SD = 5.0 vs empirical SD = 9.0; threshold 0.075: theoretical SD = 5.1 vs empirical SD = 7.7; threshold 0.20: theoretical SD = 4.9 vs empirical SD = 6.7). We also observed through simulations that R2 is maximized when the threshold equals the event rate, meaning variability is reduced as the threshold nears the event rate.

FIGURE 2. Theoretical estimators vs empirical logistic regression for change in Standardized Net Benefit assuming normal variables for nested models

Figure 3 and Supplemental Figure 5 summarize simulation results comparing the theoretical estimators for ΔSBS and ΔBS to their empirically estimated counterparts, for nested models and nonnested models, respectively. We note that ΔBS is negative here, as values of BS closer to zero indicate better predictions: since the new model exhibits better discrimination, taking the new model’s BS minus the old model’s BS results in a negative ΔBS, whereas a positive difference would indicate that the old model fits better. In the nested model case, the R2 is 0.71 for both ΔSBS and ΔBS, and in the nonnested model case, the R2 is 0.85 for both. Both sets of plots show that the theoretical estimates correspond well with the empirical estimates. ΔSBS is asymptotically equivalent to IDI and ΔBS is asymptotically equivalent to a rescaled IDI, which explains why the behavior of the ΔSBS and ΔBS formulas closely tracks that of the IDI formulas.

FIGURE 3. Theoretical estimators vs empirical logistic regression for change in scaled Brier score and change in Brier score assuming normal variables for nested models

Treating the empirical method as the standard method of estimation, the theoretical estimators appear to be unbiased for all statistics for both nested and nonnested model comparisons (Table 2). The theoretical estimators also have smaller standard deviations than their empirically estimated counterparts, indicating that the theoretical estimators produce less variable estimates when used correctly. We note the largest differences in variance for NRI(y) and ΔSNB(t). Based on the simulation results, we propose that investigators can choose either method of estimation for ΔAUC, IDI, ΔSBS, and ΔBS in situations where both the theoretical estimators and the empirical formulas are applicable. Depending on the nature of the data and available software, the empirical estimators may be easier to calculate or vice versa. However, in the case of NRI(y) and ΔSNB(t), estimation with the theoretical estimators assuming normality has less variability than empirical estimation, and we recommend that the theoretical estimators be used in scenarios with multivariate normal predictors or normality in the linear predictor.

Our simulations focused on multivariate normal predictors in the risk models, as described by the underlying assumptions of LDA. However, it is rare to have predictors with perfectly normal distributions in practical applications. In the next section, we apply the formulas to data from the Framingham Heart Study, using variables with departures from normality. We demonstrate correspondence between the theoretical estimators and empirical estimators, despite some departures from normality.

5 | PRACTICAL EXAMPLE USING NONNORMAL DATA FROM THE FRAMINGHAM HEART STUDY

The Framingham Heart Study (FHS) is a pioneer in risk prediction modeling of CVD. In 2008, a landmark paper focused on prediction of CVD events within 10 years with sex-specific prediction models, considering age, total cholesterol and HDL cholesterol, SBP, hypertension treatment, smoking status, and diabetes status.1 While many risk prediction functions were initially produced using LDA, logistic regression and Cox proportional hazards regression have gained popularity as the standard modeling techniques for estimating risk prediction models.31

We used a subset of variables from the general 10-year CVD risk models to illustrate the performance of our findings in a real-world example. Our sample consists of 1467 high-risk men, either on hypertension treatment, currently smoking, or having prevalent diabetes, from the Original cohort (Exam 11) and the Offspring cohort (Exam 1 or Exam 3). We observed 307 (21%) broadly defined CVD events over a 10-year span. Using this sample, we compared two separate, well-calibrated risk prediction models predicting 10-year CVD events from age alone (Model 1) and from age, SBP, total cholesterol, and HDL cholesterol (Model 2). None of the chosen predictor variables followed the normal distribution. We applied a natural logarithmic transformation to all four predictor variables, as was done in the original paper.1

We assessed the improvement in performance of Model 2 as compared to Model 1, through estimating ΔAUC, IDI, NRI(y), ΔSNB(t), ΔSBS, and ΔBS in the following ways.

  1. The formula-based method assuming normal theory, estimating M² from the sample and applying it to the formulas provided in Table 1.

  2. The empirical logistic regression approach, obtaining the predicted probabilities required for the empirical estimators provided in Appendix Table A1.

We set the cutoff for NRI(y) to the event rate. We set the threshold assumed by ΔSNB(t) to 0.075. As noted, we fitted the models assuming full follow-up with LDA and logistic regression. Since risk prediction models must be well calibrated prior to estimating any of the metrics described in this paper, we ensured model calibration before estimating the discrimination measures. LDA and logistic regression estimate predicted probabilities that are calibrated-in-the-large, which ensures that IDI and NRI are estimated for calibrated models. To obtain an estimate of variance, bootstrap resampling was performed with 1000 resamples.
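The following generic resampling sketch illustrates the bootstrap step (the interface is ours and purely illustrative); stat_fn must refit both models and recompute the chosen metric on each resample so that model-fitting variability is captured.

```python
import numpy as np

def bootstrap_se(stat_fn, X: np.ndarray, y: np.ndarray,
                 n_boot: int = 1000, seed: int = 0) -> float:
    """Bootstrap SE of a model-comparison statistic; stat_fn(X, y) must
    refit both models and recompute the metric on each resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample subjects with replacement
        stats[b] = stat_fn(X[idx], y[idx])
    return float(stats.std(ddof=1))
```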

Estimates and bootstrap standard errors for the two methods are summarized in the first two columns of Table 3. Overall, the values estimated by the two different methods are fairly close, with some notable yet expected differences. In the logistic regression case, empirical estimates of AUC, SNB(t), and BS are larger than those obtained with the parametric estimators. However, for ΔAUC, IDI, NRI(y), and ΔSNB(t), the metrics measuring change between the two models, the opposite occurs: empirical estimates of ΔAUC, IDI, NRI(y), and ΔSNB(t) are smaller than those obtained with the theoretical estimators. We note a small change in ΔAUC, ranging from 0.038 to 0.040 depending on the method used. Despite the modest magnitude, Model 2 does increase the AUC, from 0.683 (theoretical) and 0.688 (empirical) to 0.723 and 0.726, respectively. NRI(y) and ΔSNB(t) provide useful interpretations as well as insight into the benefits and costs associated with application of risk models.18,30,33 NRI(y) equals 0.061 and ΔSNB(t) equals 0.032 according to the theoretical estimators, with NRI(y) falling to 0.046 and ΔSNB(t) increasing to 0.044 under empirical logistic regression estimation. These could be regarded as large differences, with a 25% decrease in NRI(y) and a 37% increase in ΔSNB(t). We also note that the standard errors are larger in magnitude for the empirical logistic estimates for all metrics, with the exception of BS. The most notable increase in standard error occurs for NRI(y) and ΔSNB(t), as expected based on the simulation results. Finally, the values of ΔSBS and ΔBS are very close between the two methods, as expected.

TABLE 3.

Improvement in cardiovascular disease risk prediction after adding systolic blood pressure, total cholesterol, and high density lipoprotein cholesterol

Measure | Theoretical Estimator-Based After Log-Transformation: Estimate (SE) | Empirical Logistic After Log-Transformation: Estimate (SE) | Theoretical Estimator-Based After Box-Cox Transformation: Estimate (SE)
AUC1 0.683 (0.012) 0.688 (0.012) 0.683 (0.012)
AUC2 0.723 (0.012) 0.726 (0.012) 0.722 (0.012)
ΔAUC 0.040 (0.008) 0.038 (0.008) 0.039 (0.008)
IDI 0.039 (0.007) 0.037 (0.008) 0.038 (0.008)
NRI(y) 0.061 (0.012) 0.046 (0.022) 0.059 (0.012)
SNB(t = 0.075)1 0.018 (0.004) 0.041 (0.020) 0.019 (0.004)
SNB(t = 0.075)2 0.050 (0.009) 0.084 (0.025) 0.050 (0.009)
ΔSNB(t = 0.075) 0.032 (0.006) 0.044 (0.026) 0.031 (0.006)
SBS1 0.074 (0.009) 0.069 (0.027) 0.074 (0.010)
SBS2 0.113 (0.012) 0.105 (0.027) 0.112 (0.012)
ΔSBS 0.039 (0.008) 0.035 (0.008) 0.038 (0.008)
BS1 0.153 (0.004) 0.154 (0.004) 0.153 (0.005)
BS2 0.147 (0.004) 0.148 (0.004) 0.147 (0.004)
ΔBS −0.006 (0.001) −0.006 (0.001) −0.006 (0.001)

AUC = Area under the Receiver Operating Characteristic curve; IDI = Integrated Discrimination Improvement; NRI(y) = Net Reclassification Index at event rate; SNB(t = 0.075) = Standardized Net Benefit assuming threshold t = 0.075;

SBS = scaled Brier score; BS = Brier score.

Theoretical estimator-based refers to the sample estimator for the squared Mahalanobis distance used in place of M².

Empirical logistic refers to quantities calculated on the basis of estimated probabilities from a logistic regression model.

SE = standard error estimated using bootstrap resampling with 1000 resamples.

Subscript 1 indicates model predicting CVD in 10 years from age only, where age is transformed with the natural logarithmic function.

Subscript 2 indicates model predicting CVD in 10 years from age, systolic blood pressure, total cholesterol, and HDL cholesterol, where all predictors are transformed with the natural logarithmic function.

As evident in our FHS example, most data in practical applications are not normally distributed. While we log-transformed the predictors in the FHS data, this did not correct all departures from normality. For this reason, we applied the Box-Cox normalizing transformation in the theoretical estimator example to address any departure from normality in the linear predictor.47 The updated results are located in the last column of Table 3. We note only minor changes for all metrics from the results listed in the first column. This indicates that, even prior to the Box-Cox transformation, the theoretical estimators performed well despite some nonnormality, since their estimates did not change dramatically after the additional nonnormality correction from the Box-Cox transformation. Thus, we have some evidence that the formulas can be applied in situations with nonnormal predictor variables and still yield reliable results.
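The paper does not spell out the mechanics of the transformation step; one plausible sketch applies scipy.stats.boxcox to the fitted linear predictor, with a positivity shift that is our workaround rather than a step described in the paper.

```python
import numpy as np
from scipy.stats import boxcox

def boxcox_normalize(lp: np.ndarray):
    """Box-Cox normalization of a fitted linear predictor. The shift to
    enforce positivity is our workaround, not a step from the paper."""
    shifted = lp - lp.min() + 1.0
    transformed, lam = boxcox(shifted)   # lambda chosen by maximum likelihood
    return transformed, lam
```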

6 | CONCLUSION

In this paper, we took advantage of the existence of explicit solutions for the LDA coefficients under normality. This allowed us to demonstrate that, under normality, all six measures described in this paper are functions of the Mahalanobis distances of the two models being compared. It is geometrically trivial that these metrics must be a function of M; two scenarios with the same M can be transformed into one another by linear transformation of the predictor vectors, and the canonically transformed distributions will be multidimensional spheres with their centers located M units apart.29 Our demonstration that the nonnormal data from the Framingham Heart Study still yielded consistent estimates of the performance measures is an important observation; thus, the theoretical estimators seemingly can be applied in cases with departure from normality in the predictors. This shows not only how powerful M is as a metric on its own, but it also aids in providing further insight about the interpretation and clinical usefulness of all presented metrics. Since M as a measure of discrimination is quite robust to departures from normality,48 metrics dependent only on M may also be considered robust.

ΔAUC and NRI(y) under normality depend only on the Mahalanobis distances of the two models. Regarding additional assumptions, ΔAUC and NRI(y) are not affected by the prevalence or incidence of the modeled event, as demonstrated in their derivations, which allows for ease of comparing values across studies with different event rates or incidence rates. However, IDI, ΔSNB(t), ΔSBS, and ΔBS under normality include the event rate in their derivations. In order to verify which measures truly depend on event rate, we plotted the six metrics as functions of event rate (Supplemental Figure 6). As anticipated, ΔAUC and NRI(y) do not exhibit evidence of dependence on event rate. IDI and ΔSBS increase in value as the event rate increases. ΔSNB(t) and ΔBS seem to show a less prominent dependence. For ΔSNB(t), the nature of the correspondence may be affected by the relationship between the event rate and the choice of threshold assumed in the calculation, which should be applied to both models under consideration. For metrics dependent on the event rate, the range of values is specific to the study in which they are applied; event rate or incidence rate specific ranges exist. Any comparisons using these metrics should assume the same event rate or incidence rate in their calculation. Heavy dependence on the event rate may also make metrics susceptible to issues with calibration, which can occur in the application of a risk model to a validation cohort whose event rate or incidence rate differs from that of the derivation cohort. We recommend that risk models be properly calibrated before discrimination improvement is assessed. Issues with event rate dependence may also arise in situations where the sample event rate is distorted compared to the true event rate in the population, such as in case-cohort studies with artificially high event rates. Investigators have previously provided suggestions for adapting rate-dependent measures to realistic event rates.49,50

In simulation, we demonstrated that the performance of the theoretical estimators under normality is comparable or superior to that of the empirical estimation methods typically used by investigators. In the Framingham data example, we noted that the empirical estimator SE was minimally larger than the theoretical estimator SE using log-transformed predictor variables for ΔAUC (2.8%), IDI (1.8%), and ΔSBS (5.9%). For ΔBS, the theoretical estimator SE was 2.2% larger than the corresponding empirical estimator SE; this difference disappeared after application of the Box-Cox transformation. We observed rather large differences in SE for NRI(y) and ΔSNB(t), where the empirical estimator SEs are larger by 86% and 298%, respectively. These SE differences translate directly into differences in the widths of 95% confidence intervals for these data: the intervals from empirical estimation are minimally shorter for ΔBS, minimally longer for ΔAUC, IDI, and ΔSBS, and dramatically longer for NRI(y) and ΔSNB(t). Assuming both methods achieve the same coverage probability, this implies that estimation with the theoretical estimators is generally more precise, except for ΔBS. Coverage of confidence intervals for both methods will be researched in future work. While our practical example provided an application to some nonnormal data, further research is needed to better understand how sensitive the parametric estimators are to departures from normality of the predictor variables in the model. In a future study, we will explore the behavior of the theoretical estimators using models with various forms of nonnormal predictors in simulated scenarios in order to verify the interrelationships when normality does not hold.

By extending the theoretical formulas for ΔAUC, IDI, and NRI(y) from nested to nonnested model comparisons, we increased their usability in practice. These formulas, combined with the theoretical derivations for ΔSNB(t), ΔSBS, and ΔBS, provide additional estimation methods for investigators. Due to increased variability in the empirical estimation of NRI(y) and ΔSNB(t), the theoretical estimators for these particular metrics may be the superior estimation method. As seen in the practical example, nonnormality in the predictor variables does not drastically affect the estimation of the prognostic improvement measures; however, the investigator may consider using transformations if it is a biologically natural step to take in analysis.


ACKNOWLEDGEMENTS

Dr Danielle M. Enserro is a Senior Biostatistician for NRG Oncology through the Clinical Trials Development Division of the Department of Biostatistics and Bioinformatics at Roswell Park Comprehensive Cancer Center, Elm and Carlton Streets, Buffalo, NY 14263. While working on this paper, Dr Enserro was supported by the National Institutes of Health grants R01HL131015 and R01HL131029 from the National Heart, Lung, and Blood Institute (https://www.nhlbi.nih.gov), and she is currently supported by the NRG Oncology Statistical and Data Management Center grant U10CA180822 (https://www.nrgoncology.org). While working on this paper, Dr Olga V. Demler was supported by the National Institutes of Health grant R01HL113080 and by K01 award K01HL135342 from the National Heart, Lung, and Blood Institute (https://www.nhlbi.nih.gov).

Funding information

National Heart, Lung, and Blood Institute (National Institutes of Health), Grant/Award Number: R01HL131029, R01HL131015, R01HL113080, and K01HL135342; NRG Oncology Statistical and Data Management Center, Grant/Award Number: U10CA180822

APPENDIX

TABLE A1.

Empirical estimators for risk model validation measures

Measure | Empirical Estimator

ΔAUC: The difference in C-statistics, where the C-statistic is defined as
$$\frac{1}{n_0 n_1} \sum_{x_i \in Y_0} \sum_{x_j \in Y_1} I[c^T x_i, c^T x_j], \qquad I[c^T x_i, c^T x_j] = \begin{cases} 1, & \text{if } c^T x_i < c^T x_j \\ 0.5, & \text{if } c^T x_i = c^T x_j \\ 0, & \text{otherwise,} \end{cases}$$
where Y1 and Y0 are the sets of subjects with and without events, respectively, n1 and n0 are the sizes of those sets, and c is the vector of linear coefficient estimates.

IDI: $(\bar{\hat{p}}_{new,events} - \bar{\hat{p}}_{old,events}) - (\bar{\hat{p}}_{new,nonevents} - \bar{\hat{p}}_{old,nonevents})$, where $\bar{\hat{p}}_{events} = \sum_{i \in events} \hat{p}_i / \#events$ and $\bar{\hat{p}}_{nonevents} = \sum_{j \in nonevents} \hat{p}_j / \#nonevents$ are the means of the model-based predicted probabilities of the event for those who develop the event and those who do not, respectively.

NRI(y): $[\hat{p}_{up,Y=1} - \hat{p}_{down,Y=1}] + [\hat{p}_{down,Y=0} - \hat{p}_{up,Y=0}]$, where $\hat{p}_{up,Y=1} = \#\{\text{events moving up}\}/\#\{\text{events}\}$, $\hat{p}_{down,Y=1} = \#\{\text{events moving down}\}/\#\{\text{events}\}$, $\hat{p}_{up,Y=0} = \#\{\text{nonevents moving up}\}/\#\{\text{nonevents}\}$, and $\hat{p}_{down,Y=0} = \#\{\text{nonevents moving down}\}/\#\{\text{nonevents}\}$. The cutoff between the two risk categories is set to the event rate.

ΔSNB(t): $\Delta SNB(t) = \begin{cases} \Delta Se(t) + \frac{t(1-y)}{y(1-t)} \Delta Sp(t), & y \le t \\ \frac{y(1-t)}{t(1-y)} \Delta Se(t) + \Delta Sp(t), & y > t, \end{cases}$ where t is the assumed threshold, y is the event rate, ΔSe(t) is the difference in model sensitivity, and ΔSp(t) is the difference in model specificity.

ΔSBS: The difference in scaled Brier scores, where the scaled Brier score is defined as $1 - BS/BS_{max}$, BS is the Brier score (see below), and $BS_{max} = \bar{p}(1 - \bar{p})$, where $\bar{p}$ is the mean of the predicted probabilities.

ΔBS: The difference in Brier scores, where the Brier score is defined as $\frac{1}{N} \sum_{i=1}^{N} (y_i - p_i)^2$, the mean of the squared differences between actual binary outcomes and predicted probabilities.

ΔAUC = Change in area under the Receiver Operating Characteristic curve; IDI = Integrated Discrimination Improvement; NRI(y) = Net Reclassification Index at event rate;

ΔSNB(t) = Change in Standardized Net Benefit; ΔSBS = Change in scaled Brier score; ΔBS = Change in Brier score.
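For completeness, the following sketch implements several of the empirical estimators in Table A1 from arrays of predicted probabilities and binary outcomes. The function names are ours; the pairwise AUC computation uses O(n0 × n1) memory and is meant for moderate sample sizes.

```python
import numpy as np

def c_statistic(p: np.ndarray, y: np.ndarray) -> float:
    """Mann-Whitney estimator of the AUC (ties count one-half)."""
    p1, p0 = p[y == 1], p[y == 0]
    comp = p0[:, None] < p1[None, :]           # n0 x n1 pairwise comparisons
    ties = p0[:, None] == p1[None, :]
    return float(comp.mean() + 0.5 * ties.mean())

def idi_empirical(p_new, p_old, y) -> float:
    ev, ne = y == 1, y == 0
    return ((p_new[ev].mean() - p_old[ev].mean())
            - (p_new[ne].mean() - p_old[ne].mean()))

def nri_at_event_rate_empirical(p_new, p_old, y) -> float:
    cut = y.mean()                             # cutoff set to the event rate
    up = (p_new >= cut) & (p_old < cut)
    down = (p_new < cut) & (p_old >= cut)
    ev, ne = y == 1, y == 0
    return ((up[ev].mean() - down[ev].mean())
            + (down[ne].mean() - up[ne].mean()))

def brier_score(p, y) -> float:
    return float(np.mean((y - p) ** 2))
```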

Footnotes

DISCLOSURE

The content of this paper is solely the responsibility of the authors.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article.

REFERENCES

1. D’Agostino RB, Vasan RS, Pencina MJ, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117:743–753.
2. Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med. 2006;144:201–209.
3. D’Agostino RB, Griffith JL, Schmid CH, Terrin N. Measures for evaluating model performance. In: Proceedings of the Biometrics Section. Alexandria, VA: American Statistical Association; 1997:253–258.
4. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. New York, NY: Springer Science+Business Media; 2001.
5. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol. 1975;12:387–415.
6. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36.
7. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–845.
8. McClish D. Analyzing a portion of the ROC curve. Med Decis Making. 1989;9:190–195.
9. Thompson ML, Zucchini W. On the statistical analysis of ROC curves. Statist Med. 1989;8:1277–1290.
10. Dodd LE, Pepe MS. Partial AUC estimation and regression. Biometrics. 2003;59:614–623.
11. Tzoulaki I, Liberopoulos G, Ioannidis JPA. Assessment of claims of improved prediction beyond the Framingham risk score. JAMA. 2009;302(21):2345–2352.
12. Pencina MJ, D’Agostino RB, Massaro JM. Understanding increments in model performance metrics. Lifetime Data Anal. 2013;19(2):202–218.
13. Demler OV, Pencina MJ, D’Agostino RB. Equivalence of improvement in area under ROC curve and linear discriminant analysis coefficient under assumption of normality. Statist Med. 2011;30(12):1410–1418.
14. Pencina MJ, D’Agostino RB, D’Agostino RB, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statist Med. 2008;27:157–172.
15. Greenland P, O’Malley PG. When is a new prediction marker useful? A consideration of lipoprotein-associated phospholipase A2 and C-reactive protein for stroke risk. Arch Intern Med. 2005;165:2454–2456.
16. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–935.
17. Goff DC, Lloyd-Jones DM, Bennett G, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on practice guidelines. J Am Coll Cardiol. 2014;63(25 Pt B):2935–2959.
18. Pencina MJ, Steyerberg EW, D’Agostino RB. Net reclassification index at event rate: properties and relationships. Statist Med. 2017;36(28):4455–4467.
19. Kerr KF, McClelland RL, Brown ER, Lumley T. Evaluating the incremental value of new biomarkers with integrated discrimination improvement. Am J Epidemiol. 2011;174(3):364–374.
20. Kerr KF, Wang Z, Janes H, McClelland RL, Psaty BM, Pepe MS. Net reclassification indices for evaluating risk-prediction instruments: a critical review. Epidemiology. 2014;25:114–121.
21. Pepe MS. Problems with risk reclassification methods for evaluating prediction models. Am J Epidemiol. 2011;173(11):1327–1335.
22. Hilden J, Gerds TA. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index. Statist Med. 2014;33(19):3405–3414.
23. Gerds TA, Hilden J. Calibration of models is not sufficient to justify NRI. Statist Med. 2014;33(19):3419–3420.
24. Pencina MJ, Fine JP, D’Agostino RB. Discrimination slope and integrated discrimination improvement - properties, relationships and impact of calibration. Statist Med. 2017;36(28):4482–4490.
25. Pencina MJ, D’Agostino RB. Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Statist Med. 2004;23:2109–2123.
26. Chambless LE, Cummiskey CP, Cui G. Several methods to assess improvement in risk prediction models: extension to survival analysis. Statist Med. 2011;30:22–38.
27. Pencina MJ, D’Agostino RB, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Statist Med. 2011;30:11–21.
28. Pencina MJ, D’Agostino RB, Demler OV. Novel metrics for evaluating improvement in discrimination: net reclassification and integrated discrimination improvement for normal variables and nested models. Statist Med. 2012;31:101–113.
29. Mahalanobis PC. On the generalised distance in statistics. Proc Natl Inst Sci India. 1936;2:49–55.
30. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–574.
31. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, NY: Springer Science+Business Media; 2009.
32. Baker SG, Cook NR, Vickers AJ, Kramer BS. Using relative utility curves to evaluate risk prediction. J R Stat Soc Ser A Soc. 2009;172:729–748.
33. Van Calster B, Vickers AJ, Pencina MJ, Baker SG, Timmerman D, Steyerberg EW. Evaluation of markers and risk prediction models: overview of relationships between NRI and decision-analytic measures. Med Decis Making. 2013;33(4):490–501.
34. Vickers AJ, Cronin AM, Elkin EB, Gonen M. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. BMC Med Inform Decis Mak. 2008;8:53.
35. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3.
36. Schmid CH, Griffith JL. Multivariate classification rules: calibration and discrimination. In: Encyclopedia of Biostatistics. Hoboken, NJ: Wiley Online Library; 2005:1–6.
37. Yates JF. External correspondence: decompositions of the mean probability score. Organ Behav Hum Perform. 1982;30:132–156.
38. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen. 1936;7(2):179–188.
39. Efron B. The efficiency of logistic regression compared to normal discriminant analysis. J Am Stat Assoc. 1975;70(352):892–898.
40. Su JQ, Liu JS. Linear combinations of multiple diagnostic markers. J Am Stat Assoc. 1993;88(424):1350–1355.
41. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology. 2010;21(1):128–138.
42. Wu YC, Lee WC. Alternative performance measures for prediction models. PLOS One. 2014;9(3):e91249. doi:10.1371/journal.pone.0091249. Accessed March 7, 2016.
43. Pepe MS, Kerr KF, Longton G, Wang Z. Testing for improvement in prediction model performance. Statist Med. 2013;32(9):1467–1482.
44. Enserro DM, Vasan RS, Xanthakis V. Twenty-year trends in the American Heart Association cardiovascular health score and impact on subclinical and clinical cardiovascular disease: the Framingham Heart Study. J Am Heart Assoc. 2018;7(11):e008741. https://www.ahajournals.org/doi/10.1161/JAHA.118.008741. Accessed May 18, 2018.
45. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
46. Demler OV, Pencina MJ, Cook NR, D’Agostino RB. Asymptotic distribution of delta AUC, NRIs, and IDI based on theory of U-statistics. Statist Med. 2017;36(21):3334–3360.
47. Box GEP, Cox DR. An analysis of transformations. J R Stat Soc Series B Stat Methodol. 1964;26:211–246.
48. Tiku M, Islam M, Qumsiyeh S. Mahalanobis distance under non-normality. Statistics. 2010;44(3):275–290.
49. Gu W, Pepe M. Measures to summarize and compare the predictive capacity of markers. Int J Biostat. 2009;5(1):27. doi:10.2202/1557-4679.1188. Accessed April 1, 2016.
50. Bansal A, Pepe MS. Estimating improvement in prediction with matched case-control designs. Lifetime Data Anal. 2013;19(2):170–201.
