Abstract
Developing individualized prediction rules for disease risk and prognosis has played a key role in modern medicine. When new genomic or biological markers become available to assist in risk prediction, it is essential to assess the improvement in clinical usefulness of the new markers over existing routine variables. Net reclassification improvement (NRI) has been proposed to assess improvement in risk reclassification in the context of comparing two risk models and the concept has been quickly adopted in medical journals. We propose both nonparametric and semiparametric procedures for calculating NRI as a function of a future prediction time t with a censored failure time outcome. The proposed methods accommodate covariate-dependent censoring, therefore providing more robust and sometimes more efficient procedures compared with the existing nonparametric-based estimators. Simulation results indicate that the proposed procedures perform well in finite samples. We illustrate these procedures by evaluating a new risk model for predicting the onset of cardiovascular disease.
Keywords: Inverse probability weighted (IPW) estimator, Net reclassification improvement (NRI), Risk prediction, Survival analysis
1 Introduction
Developing individualized prediction rules for disease risk and prognosis is fundamental for successful disease prevention and treatment selection. For many diseases, risk prediction models have been developed and incorporated into clinical practice guidelines. For example, the Gail model was developed for predicting individual breast cancer risk (Gail et al. 1989), and a risk calculator based on that model can be used to assist physicians in making screening recommendations. For cardiovascular disease (CVD), prediction models such as the Framingham Risk Score (FRS) are used for stratifying patients into different levels of risk. However, much refinement is needed even for the best of these models because of their limited discriminatory accuracy. For example, the Framingham model, largely based on traditional clinical risk factors, has recognized limitations in its clinical utility (Hemann et al. 2007). A considerable fraction of patients who experienced CVD events had none of the identified risk factors, indicating a need to explore avenues beyond routine clinical measures for more accurate prediction (Khot et al. 2003). This fuels much of the current search for novel biologic markers and genetic factors that, when combined with routine clinical risk factors, may provide accurate prediction at the individual level.
When new genomic or biological markers become available to assist in risk prediction, it is essential to assess the clinical usefulness of these new markers compared to existing routine markers. Careful evaluation of the incremental value is particularly crucial when markers are either expensive or invasive to measure. To quantify the added clinical value of new markers over a conventional risk scoring system for predicting disease risk, one may calculate the difference in prediction measures between the existing conventional model and the new model, which includes information from the new markers. For example, the difference in areas under the receiver operating characteristic curve (AUC of ROC) is often used to quantify the improvement in discrimination attributable to added markers. Since a risk model is often used to stratify patients into proper risk categories, statistical summaries that depend on clinically meaningful risk thresholds may be more relevant (Cook 2007; Cui 2009; Lloyd-Jones 2010). As an alternative to measuring the difference between AUCs, net reclassification improvement (NRI) has also been proposed to assess improvement in risk reclassification in the context of comparing two risk models constructed with and without novel markers (Pencina et al. 2008). Using "up" and "down" to denote changes of one or more risk categories in the upward and downward directions, respectively, for a subject between their baseline and augmented risk values, the NRI is defined as

NRI = {P(up | D = 1) − P(down | D = 1)} + {P(down | D = 0) − P(up | D = 0)},

where D = 1 denotes a diseased subject and D = 0 a healthy subject.
Such a measure is appealing because it acknowledges both desirable risk reclassifications (up for diseased and down for healthy subjects) and undesirable risk reclassifications (down for diseased and up for healthy subjects). Due to its simplicity, NRI has been quickly adopted in medical journals. However, in contrast to many other measures of incremental value, the concept has received relatively little attention in the statistical literature.
Since a risk model is often used for predicting an individual's future outcome, it is essential to incorporate the additional dimension of time when assessing the performance of a risk model in a cohort study. Prospective cohort data are often used both for deriving and for evaluating risk models. In this setting a subject's health status at a future time t is sometimes unknown due to loss to follow-up, termination of the study, or the occurrence of a competing risk event. Such censoring poses additional challenges compared with the settings previously examined in the literature, which focus on incremental value calculation with a dichotomous outcome. Currently there is limited methodological development for estimating the incremental value of novel markers with censored failure time outcomes. Recently, Pencina and D'Agostino (2011) proposed a method for calculating time-dependent NRI based on nonparametric Kaplan–Meier (KM) estimators in order to account for censoring in cohort data. The asymptotic properties of a similar estimator are studied in detail in Uno et al. (2009). However, the validity of these estimators relies critically on the assumption that censoring is independent of the predictors used in the risk models. Furthermore, the nonparametric procedures underlying these estimators may lead to a loss of efficiency. A more flexible and more efficient estimating procedure is needed in practice.
In this manuscript, we propose quantitative procedures for calculating NRI as a function of a future prediction time t with a censored failure time outcome. Compared with existing nonparametric estimators, our procedures do not require the assumption that censoring is independent of the predictors, making the methods widely applicable in practice. We also consider procedures that aim to improve efficiency while maintaining robustness. This manuscript is organized as follows. In Sect. 2, we specify models and define NRI suitable for event time outcomes. In Sect. 3, we describe procedures for estimating time-dependent NRI using data obtained from a prospective cohort study with a failure time outcome. We comment on the theoretical properties of our proposed estimators in Sect. 4. We then describe simulation studies to evaluate the performance of the proposed estimators; the results are reported in Sect. 5. An application of our procedures for comparing two CVD risk models is presented in Sect. 6. Concluding remarks are in Sect. 7.
2 Measures of risk stratification and reclassification
Consider the situation in which a vector of predictors Y, measured at baseline, is used for predicting the time-to-event outcome T. Risk models can be built using sub-vectors of Y. Let Y(1), a function of Y, denote the vector of conventional predictor values in the existing model. Let Y(2), also a function of Y, denote the vector of predictors used in the new model, which contains Y(1) together with the new predictor values. Individual-level risk at a future time t can be derived as P = P(T ≤ t | Y(1)), based on the conventional model, and Q = P(T ≤ t | Y(2)), the corresponding risk based on the new model. Since, in practice, risk categories are often uncertain for many diseases, a more objective and flexible measure of improvement in risk prediction would be based on P or Q on their original continuous scales. Therefore, following the definition of Pencina and D'Agostino (2011), in this manuscript we focus on the time-dependent continuous NRI, a more general definition that does not rely on the existence of risk categories. In the time-dependent setting, we further denote an 'event' subject at time t as one with T ≤ t, and a 'nonevent' subject as one with T > t. Writing B = Q − P for the change in estimated risk, NRI(t) is equal to the sum of the 'event NRI' and the 'nonevent NRI', which are defined as

event NRIu(t) = P(B > u | T ≤ t) − P(B < −u | T ≤ t)

and

nonevent NRIv(t) = P(B < −v | T > t) − P(B > v | T > t).
Since NRIu,v(t) = event NRIu(t) + nonevent NRIv(t), it follows that, for continuous risk estimates, NRI0,0(t) = 2{P(B > 0 | T ≤ t) − P(B > 0 | T > t)}. In practice we may choose u and v such that the improvement in risk estimates is clinically meaningful (Uno et al. 2009). Setting u = v = 0 gives the 'continuous NRI' considered in Pencina and D'Agostino (2011). For ease of presentation, in the sequel we omit the subscripts u and v from our notation and assume u = v = 0, but note that our estimators can be constructed for arbitrary u and v. In the next section, we show how each component of NRI(t) can be estimated.
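To fix ideas, the following minimal sketch in R (the language used for all analyses in this paper) computes the continuous NRI(t) when event status at time t is fully observed, i.e., without censoring. The data, variable names and risk-generating mechanism here are entirely hypothetical.

```r
## Minimal sketch: continuous NRI(t) with fully observed event status
## at time t (no censoring); all data below are hypothetical.
set.seed(1)
n <- 1000
p <- runif(n)                                  # risk at t from the conventional model
q <- pmin(pmax(p + rnorm(n, 0, 0.1), 0), 1)    # risk at t from the new model
event <- rbinom(n, 1, q)                       # I(T <= t), generated from the new risks

up <- (q - p) > 0                              # B > 0: reclassified upward
event_nri    <- 2 * mean(up[event == 1]) - 1   # P(up | event) - P(down | event)
nonevent_nri <- 1 - 2 * mean(up[event == 0])   # P(down | nonevent) - P(up | nonevent)
nri <- event_nri + nonevent_nri                # = 2{P(B > 0 | T <= t) - P(B > 0 | T > t)}
```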
3 Estimation
Suppose we have a cohort of n individuals from the targeted population followed prospectively. Due to censoring, the observed data consist of n i.i.d. copies of the vector (Xi, δi, Yi), where Xi = min(Ti, Ci) and δi = I(Ti ≤ Ci), and Ti and Ci denote the failure time and censoring time, respectively. Yi are the predictors for individual i measured at time 0, including the subset Yi(1) used in the existing model (model 1) and Yi(2) in the new model (model 2), such that Yi(1) ⊂ Yi(2). For illustration, we first assume that P and Q both follow conventional Cox regression models. Specifically, at time t, we assume P(θ1) = 1 − exp{−Λ01(t)exp(β1′Y(1))} and Q(θ2) = 1 − exp{−Λ02(t)exp(β2′Y(2))}, where Λ0k is the baseline cumulative hazard function and βk is an unknown vector of parameters for model k = 1, 2, with θk = {Λ0k(·), βk′}. It is important to note that these models are most likely not correctly specified. Nevertheless, under a mild regularity condition, the standard maximum partial likelihood estimator for βk converges to a constant vector as n → ∞ (Hjort 1992). This provides the theoretical grounds for our asymptotic studies.
To estimate NRI(t), Pencina and D'Agostino (2011) first expressed the two key components as

P{B(θ) > 0 | T ≤ t} = P{B(θ) > 0}{1 − S1(t)} / {1 − S(t)}

and

P{B(θ) > 0 | T > t} = P{B(θ) > 0}S1(t) / S(t),

where B(θ) = Q(θ2) − P(θ1), S(t) = P(T > t) and S1(t) = P{T > t | B(θ) > 0}. To account for censoring, Pencina and D'Agostino (2011) proposed to use the KM estimator to estimate the survival function S(t) using data from all subjects, and S1(t) using only the subjects with B(θ) > 0. We refer to the resulting estimator as the 'KM estimator' hereafter.
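A sketch of this KM-based calculation, under our reading of the decomposition above, is as follows; the function and variable names (km_nri, X, delta, p, q, t0) are ours, not the paper's.

```r
library(survival)

## Sketch of the KM estimator of NRI(t): S(t) from all subjects,
## S1(t) from the subset with B = q - p > 0. Variable names are ours.
km_nri <- function(X, delta, p, q, t0) {
  up <- (q - p) > 0
  S_all <- summary(survfit(Surv(X, delta) ~ 1), times = t0, extend = TRUE)$surv
  S_up  <- summary(survfit(Surv(X[up], delta[up]) ~ 1), times = t0, extend = TRUE)$surv
  p_up  <- mean(up)                                 # P{B > 0}
  p_up_event    <- (1 - S_up) * p_up / (1 - S_all)  # P{B > 0 | T <= t}
  p_up_nonevent <- S_up * p_up / S_all              # P{B > 0 | T > t}
  2 * (p_up_event - p_up_nonevent)
}
```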
Uno et al. (2009) considered estimating NRI(t) based on an inverse-probability-of-censoring weighted (IPW) estimator (hereafter referred to as the ‘IPW estimator’), with its key components estimated as
P̂IPW{B(θ̂) > 0 | T ≤ t} = [Σi I{Bi(θ̂) > 0} I(Xi ≤ t) δi/Ĥ(Xi)] / [Σi I(Xi ≤ t) δi/Ĥ(Xi)],   (3.1)

P̂IPW{B(θ̂) > 0 | T > t} = [Σi I{Bi(θ̂) > 0} I(Xi > t)/Ĥ(t)] / [Σi I(Xi > t)/Ĥ(t)],   (3.2)
where θ̂ = (θ̂1, θ̂2) denotes the estimates from the two fitted models and Ĥ(·) is the KM estimator of H(·) = P(C > ·). Due to the equivalence between the KM estimator and the IPW estimator for marginal survival functions under independent censoring (Satten and Datta 2001), the two estimators are likely to have very similar robustness and efficiency. Both estimators are consistent under an independent censoring assumption regardless of the adequacy of the two fitted models, P(θ1) and Q(θ2). This is particularly appealing for model comparisons.
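The following sketch implements (3.1)–(3.2) under the independent censoring assumption; again, the names are ours rather than the paper's.

```r
library(survival)

## Sketch of the IPW estimator (3.1)-(3.2), with the censoring survival
## H(.) estimated by the KM method; variable names are ours.
ipw_nri <- function(X, delta, p, q, t0) {
  cfit <- survfit(Surv(X, 1 - delta) ~ 1)    # KM for the censoring distribution
  H <- stepfun(cfit$time, c(1, cfit$surv))   # H-hat(s) = P-hat(C > s)
  B  <- q - p
  w1 <- delta * (X <= t0) / H(X)             # event weights in (3.1)
  w2 <- as.numeric(X > t0)                   # H-hat(t0) cancels in the ratio (3.2)
  p_up_event    <- sum(w1 * (B > 0)) / sum(w1)
  p_up_nonevent <- sum(w2 * (B > 0)) / sum(w2)
  2 * (p_up_event - p_up_nonevent)
}
```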
One potential weakness of both estimators is that they can be biased if censoring depends on a subset of Y(2). On the other hand, when model 2 is correctly specified, such covariate-dependent censoring can be accommodated through the model, since C ⊥ T given Y(2) or given Q(θ2). This motivates us to propose a more robust alternative to the Uno et al. (2009) estimator by estimating censoring probabilities given Y(2) via kernel smoothing over Q(θ2). Let Bi(θ) = Qi(θ2) − Pi(θ1) and Δi(θ) = I{Bi(θ) > 0}. To estimate NRI(t), we propose to modify equations (3.1) and (3.2) by considering the following more robust IPW censoring weights

Ŵi(1)(t) = δi I(Xi ≤ t)/ĤQi(Xi) and Ŵi(2)(t) = I(Xi > t)/ĤQi(t),

where ĤQi(·) is a kernel-smoothed estimator of the conditional censoring survival function P{C > · | Q(θ2) = Qi(θ̂2)}, obtained by weighting subject j with Kh{Qj(θ̂2) − Qi(θ̂2)}, Kh(x) = K(x/h)/h, K is a symmetric kernel density function, and h = h(n) → 0 is the bandwidth. Note that the numerator sums in the estimators below are taken over the subset of individuals with Bi(θ̂) > 0, while the denominator sums are taken over the set of all individuals. Consequently, we can use these more robust kernel-smoothed weights in the IPW estimator to obtain the 'Smooth-IPW (S-IPW) estimators',
P̂SIPW{B(θ̂) > 0 | T ≤ t} = Σi Δi(θ̂)Ŵi(1)(t) / Σi Ŵi(1)(t),   (3.3)

P̂SIPW{B(θ̂) > 0 | T > t} = Σi Δi(θ̂)Ŵi(2)(t) / Σi Ŵi(2)(t).   (3.4)
The resulting estimator for NRI(t) is

NRÎSIPW(t) = 2[P̂SIPW{B(θ̂) > 0 | T ≤ t} − P̂SIPW{B(θ̂) > 0 | T > t}].
This estimator can be shown to possess a 'double robustness' property: it is consistent provided either that the risk model Q is correctly specified or that the independent censoring assumption holds.
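The conditional censoring survival estimates entering these weights can be computed with a Beran-type kernel-weighted Kaplan–Meier, sketched below; the Gaussian kernel is our concrete choice among the symmetric kernels allowed in the text, and all names are ours.

```r
## Sketch of a kernel-smoothed (Beran-type) estimate of
## P(C > s | Q = q0), smoothing over the fitted risk score q with a
## Gaussian kernel; kernel and bandwidth choices here are ours.
cond_cens_surv <- function(s, q0, X, delta, q, h) {
  k <- dnorm((q - q0) / h)                    # kernel weights K{(Q_j - q0)/h}
  u <- sort(unique(X[delta == 0 & X <= s]))   # observed censoring times up to s
  surv <- 1
  for (s_j in u) {
    d_j <- sum(k * (X == s_j) * (delta == 0)) # kernel-weighted censorings at s_j
    n_j <- sum(k * (X >= s_j))                # kernel-weighted number at risk
    if (n_j > 0) surv <- surv * (1 - d_j / n_j)
  }
  surv                                        # H-hat(s | q0)
}
```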
Additionally, to improve upon the efficiency of this class of nonparametric estimators, we propose a semiparametric estimator. Note that, when Q(θ2) = P(T ≤ t | Y(2)),

P{B(θ) > 0 | T ≤ t} = E[Δ(θ)Q(θ2)] / E[Q(θ2)] and P{B(θ) > 0 | T > t} = E[Δ(θ){1 − Q(θ2)}] / E[1 − Q(θ2)].
Therefore NRI(t) can be estimated semiparametrically as

NRÎSEM(t) = 2[P̂SEM{B(θ̂) > 0 | T ≤ t} − P̂SEM{B(θ̂) > 0 | T > t}],

with the 'SEM' estimators,
P̂SEM{B(θ̂) > 0 | T ≤ t} = Σi Δi(θ̂)Qi(θ̂2) / Σi Qi(θ̂2),   (3.5)

P̂SEM{B(θ̂) > 0 | T > t} = Σi Δi(θ̂){1 − Qi(θ̂2)} / Σi {1 − Qi(θ̂2)}.   (3.6)
Under a correctly specified model Q(θ2), the semiparametric estimator accommodates covariate-dependent censoring and is more efficient than the Smooth-IPW estimator. In practice, when estimating NRI(t), if estimates from such a semiparametric method agree well with those of the nonparametric methods, one may choose to report results based on the semiparametric method for the additional gain in efficiency. To automate this procedure, we suggest considering a combined estimator (hereafter referred to as the 'combined estimator'), which takes the form

NRÎC(t) = ŵ NRÎSEM(t) + (1 − ŵ) NRÎSIPW(t),

with ŵ ∈ [0, 1] being a weight that depends on the aptness of the semiparametric model. For example, ŵ can be taken to be the p-value from a consistent test of the proportional hazards assumption for a Cox regression model fit. Such an estimator provides a simple procedure that is robust over a wide variety of situations. In our numerical studies, we show that the combined estimator can be more efficient than the nonparametric estimators while maintaining the double robustness property.
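A sketch of the SEM estimators (3.5)–(3.6) and of the combined estimator follows, taking ŵ to be the global p-value from cox.zph, one proportional hazards test consistent with the suggestion above; the function names are ours.

```r
library(survival)

## Sketch of the SEM estimators (3.5)-(3.6) and of the combined
## estimator; the weight is the global p-value from a proportional
## hazards test on the new model fit (one suggested choice).
sem_nri <- function(p, q) {
  up <- (q - p) > 0
  p_up_event    <- sum(up * q) / sum(q)             # (3.5)
  p_up_nonevent <- sum(up * (1 - q)) / sum(1 - q)   # (3.6)
  2 * (p_up_event - p_up_nonevent)
}

combined_nri <- function(fit2, nri_sem, nri_sipw) {
  w <- cox.zph(fit2)$table["GLOBAL", "p"]           # aptness of the Cox model fit
  w * nri_sem + (1 - w) * nri_sipw
}
```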
We note that the proposed estimators can be easily generalized to NRI based on risk categories, as illustrated in the sketch below. Consider a situation where individuals are classified as low, intermediate or high risk: low risk if their estimated risk falls below r1, high risk if it exceeds r2, and intermediate risk otherwise. The reclassification accuracy of risk models in such a setting can be quantified with a 3-category NRI of the form

NRIcat(t) = P(up | T ≤ t) − P(down | T ≤ t) + P(down | T > t) − P(up | T > t),

where 'up' and 'down' now denote moves to a higher and lower risk category, respectively, under the new model. To estimate P(up | T ≤ t) and P(up | T > t), we may simply replace Δi(θ̂) with Δi,up(θ̂), the indicator that the risk category of Qi(θ̂2) is higher than that of Pi(θ̂1), in Eqs. 3.3 and 3.4, respectively. Similarly, to estimate P(down | T ≤ t) and P(down | T > t), one may replace Δi(θ̂) with the corresponding downward indicator Δi,down(θ̂) in Eqs. 3.3 and 3.4. In the same way, one may obtain a semiparametric estimator of NRIcat(t) by replacing Δi(θ̂) with Δi,up(θ̂) or Δi,down(θ̂) in Eqs. 3.5 and 3.6.
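For the category-based version, the only change is the indicator, as the following toy sketch illustrates; the risks and cutoffs shown are hypothetical.

```r
## Sketch: category-based "up"/"down" indicators replacing I{B > 0}
## for the 3-category NRI; the toy risks and cutoffs are hypothetical.
p  <- c(0.04, 0.12, 0.30); q <- c(0.08, 0.09, 0.35)
r1 <- 0.10; r2 <- 0.20
cat_p <- findInterval(p, c(r1, r2))   # 0 = low, 1 = intermediate, 2 = high
cat_q <- findInterval(q, c(r1, r2))
up   <- cat_q > cat_p                 # use in place of I{B > 0} for P(up | .)
down <- cat_q < cat_p                 # use in place of I{B > 0} for P(down | .)
```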
4 Inference
To make inference about NRI(t), we study the asymptotic properties of the proposed estimators. In the Appendix, we show that NRÎSIPW(t) is uniformly consistent for NRI(θ0, t), where θ0 = (θ01, θ02) with βk0 being the unique maximizer of the expected value of the corresponding partial likelihood. Furthermore, we show that the process n1/2{NRÎSIPW(t) − NRI(θ0, t)} is asymptotically equivalent to a sum of i.i.d terms, n−1/2 Σi εi(t), where εi(t) is defined in the Appendix. By a functional central limit theorem of Pollard (1990), the process converges weakly to a mean zero Gaussian process in t. We also show that NRÎSEM(t) is uniformly consistent for NRI(θ0, t), and that the process n1/2{NRÎSEM(t) − NRI(θ0, t)} is asymptotically equivalent to a sum of i.i.d terms, n−1/2 Σi ζi(t), where ζi(t) is defined in the Appendix. Again, by a functional central limit theorem, this process converges weakly to a mean zero Gaussian process in t. With weak convergence of both NRÎSIPW(t) and NRÎSEM(t), it follows that the combined estimator converges to a zero-mean process. Due to the variation in the weight ŵ, however, the limit of the combined estimator may not be a Gaussian process. We show in our simulations that, for inference, resampling procedures such as the bootstrap provide a valid approximation of the limit distribution. Specifically, at the bth bootstrap iteration, b = 1, . . . , B, we draw a random sample with replacement from the original dataset and fit the new and old risk models to the sampled dataset, yielding estimates θ̂1(b) and θ̂2(b). These estimates are then used to calculate NRÎSIPW(b)(t) and NRÎSEM(b)(t) from the bootstrapped sample. The procedure is repeated B times, and confidence intervals can be constructed either with the percentile method or via a normal approximation, where the standard error is taken to be the empirical standard deviation of the NRÎSIPW(b)(t) and NRÎSEM(b)(t), respectively. Inference for the combined estimator proceeds similarly, by recalculating the weight ŵ(b) from each bootstrap sample in addition to NRÎSIPW(b)(t) and NRÎSEM(b)(t).
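A sketch of this bootstrap scheme is given below; fit_models() and nri_hat() are hypothetical helpers standing in for the model-fitting and NRI-estimation steps described in Sect. 3.

```r
## Sketch of the bootstrap inference above; fit_models() (refits both
## Cox models and returns fitted risks p and q) and nri_hat() (computes
## a chosen NRI(t) estimate) are hypothetical helpers.
boot_nri <- function(d, t0, B = 200) {
  replicate(B, {
    db <- d[sample(nrow(d), replace = TRUE), ]
    r  <- fit_models(db)              # refit models on the bootstrap sample
    nri_hat(db, r$p, r$q, t0)
  })
}
## Percentile CI: quantile(bs, c(0.025, 0.975)); normal approximation:
## point estimate +/- 1.96 * sd(bs), where bs <- boot_nri(d, t0).
```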
In the absence of an independent validation set, the same dataset is often used in practice both for fitting a model with several predictors and for calculating a measure such as NRI(t). Such an 'apparent' summary may potentially suffer from the so-called 'overfitting' phenomenon: estimates of model performance tend to be more optimistic than the corresponding estimates obtained by applying the model to a new dataset. Several methods for correcting the bias of apparent estimates can be considered. The 0.632 bootstrap method (Efron and Tibshirani 1997) has been shown to perform better than a simple cross-validated approach. In our simulations the estimator was derived as follows: we first obtained a bootstrap training set by sampling the data with replacement. The training set is used to estimate the model parameters θ̂(b). The remaining subjects make up the validation set and are used to calculate the various estimates of NRI using the parameter values θ̂(b). This is repeated B times, and NRÎCV(t) is the mean across the repetitions. The 0.632 bootstrap estimate is

NRÎ0.632(t) = 0.368 NRÎapp(t) + 0.632 NRÎCV(t),

where NRÎapp(t) is the apparent estimate obtained without cross-validation. To construct a confidence interval based on NRÎ0.632(t), we follow the suggestions given in Tian et al. (2007) and Uno et al. (2007) and shift the confidence interval based on the apparent estimate by the estimated bias b̂ = NRÎapp(t) − NRÎ0.632(t). Specifically, if [L, R] is the confidence interval calculated based on the procedure described above, then the bias-corrected confidence interval is [L − b̂, R − b̂].
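A sketch of the 0.632 correction, with the same hypothetical helpers as above plus a predict_risks() step for applying the trained models to the validation set:

```r
## Sketch of the 0.632 bootstrap correction; fit_models(), nri_hat()
## and predict_risks() are hypothetical helpers, and `apparent` is the
## full-data (apparent) estimate.
nri_632 <- function(d, t0, B = 200, apparent) {
  cv <- replicate(B, {
    idx   <- sample(nrow(d), replace = TRUE)
    train <- d[idx, ]
    test  <- d[-unique(idx), ]          # out-of-bag subjects: validation set
    r <- predict_risks(fit_models(train), test)   # theta-hat from training set
    nri_hat(test, r$p, r$q, t0)
  })
  0.368 * apparent + 0.632 * mean(cv)   # Efron-Tibshirani (1997) weighting
}
```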
5 Simulation studies
To examine the performance of the various NRI(t) estimators, we conducted simulation studies under several different scenarios. Throughout, we chose n = 500 and used 200 bootstrap samples to calculate standard errors. Results for each setting were produced from 1,000 simulations. We calculated NRI(t) at t = 3 for comparing two risk models using the KM, IPW, Smooth-IPW, SEM and combined estimators described in Sect. 3. We fitted Cox regression models to calculate risks for both the new and existing models using the corresponding predictors.
For the first setting, presented in Table 1, two predictors Y1 and Y2 were simulated from a bivariate normal distribution with mean (0, 0.5), σY1 = σY2 = 1 and correlation ρ = 0.25. The relationship between the survival time T and Y followed a proportional hazards model with parameters β1 = log(3) and β2 = log(1.5). Censoring time was generated from a Uniform(0, a) distribution, where a was chosen to produce approximately 40% censoring; a sketch of this data-generating mechanism is given below Table 1. Note that in this setting model Q is correctly specified and the independent censoring assumption holds. We took the baseline model to consist of Y1 and the new model to include both predictors. As expected, all estimators shown in Table 1 provide unbiased estimates. The bootstrap-based variance estimators perform well, with coverage close to the 95% nominal level. Since the risk based on the new model is correctly specified, the semiparametric method is the most efficient. Improvement in efficiency over the nonparametric procedures is also observed with the combined estimator.
Table 1.
Method | Pr(Qi − Pi > 0 | Ti ≤ t) | Pr(Qi − Pi > 0 | Ti > t) | NRI(t)
---|---|---|---
True values | 0.592 | 0.358 | 0.468
KM | | |
Mean(Bias) | 0.003 | 0.001 | 0.002
Std. Dev. | 0.034 | 0.030 | 0.104
Mean(std error) | 0.034 | 0.030 | 0.103
95 % bootstrap CI cov. | 0.946 | 0.946 | 0.946
IPW | | |
Mean(Bias) | 0.002 | 0.002 | –0.001
Std. Dev. | 0.034 | 0.030 | 0.105
Mean(std error) | 0.034 | 0.031 | 0.104
95 % bootstrap CI cov. | 0.943 | 0.95 | 0.951
Smooth IPW | | |
Mean(Bias) | 0.001 | 0.003 | –0.003
Std. Dev. | 0.034 | 0.030 | 0.104
Mean(std error) | 0.034 | 0.030 | 0.103
95 % bootstrap CI cov. | 0.946 | 0.942 | 0.949
SEM | | |
Mean(Bias) | 0.001 | 0.003 | –0.003
Std. Dev. | 0.024 | 0.029 | 0.082
Mean(std error) | 0.025 | 0.028 | 0.080
95 % bootstrap CI cov. | 0.952 | 0.942 | 0.937
Combined | | |
Mean(Bias) | 0.002 | 0.003 | –0.002
Std. Dev. | 0.029 | 0.028 | 0.089
Mean(std error) | 0.031 | 0.029 | 0.095
95 % bootstrap CI cov. | 0.968 | 0.949 | 0.969
KM Kaplan–Meier estimator, IPW inverse probability weighted estimator, Smooth IPW smooth inverse probability weighted estimator, SEM semiparametric estimator, Combined combined estimator, as defined in the text
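For concreteness, a sketch of the data generation for this first setting follows. The unit-exponential baseline hazard and the censoring upper limit a = 2.8 are our assumptions, since the text fixes only the approximate censoring rate.

```r
## Sketch of the first simulation setting; baseline hazard and the
## censoring upper limit are our choices (only ~40% censoring is fixed
## by the text).
library(MASS)
set.seed(2)
n <- 500
Y <- mvrnorm(n, mu = c(0, 0.5), Sigma = matrix(c(1, 0.25, 0.25, 1), 2))
T <- rexp(n) / exp(log(3) * Y[, 1] + log(1.5) * Y[, 2])  # PH survival times
C <- runif(n, 0, 2.8)                 # upper limit tuned for ~40% censoring
X <- pmin(T, C); delta <- as.numeric(T <= C)
```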
Under this setting we also considered a null model with β2 = 0, i.e., the new marker has no incremental value and NRI(t) = 0. We found that in this situation all estimators tend to slightly overestimate NRI(t), and variance estimators based on the bootstrap tend to be conservative (see Table 2). We do not recommend calculating NRI(t) when the new marker does not independently predict the outcome in a model with the conventional predictors. Note that all theoretical results in the Appendix are derived under the assumption that β2 ≠ 0, and thus our proposed procedures are only valid under this assumption. In practice, if the null setting is a likely possibility, estimation should be treated with care.
Table 2.
Method | Pr(Qi − Pi > 0 | Ti ≤ t) | Pr(Qi − Pi > 0 | Ti > t) | NRI(t)
---|---|---|---
Null model: β2 = 0 | | |
True values | 0.5 | 0.5 | 0
KM | | |
Mean(Bias) | 0.01 | –0.02 | 0.061
Std. Dev. | 0.034 | 0.026 | 0.091
Mean(std error) | 0.043 | 0.033 | 0.118
95 % bootstrap CI cov. | 0.996 | 0.971 | 0.98
IPW | | |
Mean(Bias) | 0.01 | –0.019 | 0.058
Std. Dev. | 0.034 | 0.026 | 0.092
Mean(std error) | 0.044 | 0.033 | 0.119
95 % bootstrap CI cov. | 0.996 | 0.972 | 0.981
Smooth IPW | | |
Mean(Bias) | 0.009 | –0.019 | 0.055
Std. Dev. | 0.034 | 0.026 | 0.092
Mean(std error) | 0.044 | 0.033 | 0.118
95 % bootstrap CI cov. | 0.996 | 0.972 | 0.981
SEM | | |
Mean(Bias) | 0.009 | –0.019 | 0.057
Std. Dev. | 0.023 | 0.025 | 0.067
Mean(std error) | 0.029 | 0.031 | 0.081
95 % bootstrap CI cov. | 0.99 | 0.967 | 0.957
Combined | | |
Mean(Bias) | 0.008 | –0.019 | 0.055
Std. Dev. | 0.029 | 0.025 | 0.077
Mean(std error) | 0.039 | 0.032 | 0.104
95 % bootstrap CI cov. | 0.997 | 0.971 | 0.977
KM Kaplan–Meier estimator, IPW inverse probability weighted estimator, Smooth IPW smooth inverse probability weighted estimator, SEM semiparametric estimator, Combined combined estimator, as defined in the text
The second setting we considered was identical to the first, except that the censoring time depended on the marker values. Here, the censoring time was generated as a function of (U, X, B), where U was generated from a Uniform(0, a) distribution with a chosen to yield about 40% censoring, X was generated from a N(0, 1) distribution, and B was generated from a N(2·Y1, 1) distribution. Note that in this setting model Q is correctly specified but the independent censoring assumption fails. As seen in the results presented in Table 3, the KM estimator yields biased estimates of NRI(t) and of its two key components. The IPW estimator is biased for P{B(θ0) > 0 | T ≤ t} and NRI(t), whereas the smooth-IPW estimator substantially alleviates such biases. However, we observed larger variation in the nonparametric estimators of NRI(t) than in the semiparametric and combined estimators (Table 3).
Table 3.
Method | Pr(Qi − Pi > 0 | Ti ≤ t) | Pr(Qi − Pi > 0 | Ti > t) | NRI(t)
---|---|---|---
True values | 0.611 | 0.45 | 0.322
KM | | |
Mean(Bias) | 0.067 | –0.062 | 0.259
Std. Dev. | 0.040 | 0.040 | 0.126
Mean(std error) | 0.041 | 0.040 | 0.129
95 % bootstrap CI cov. | 0.615 | 0.659 | 0.483
IPW | | |
Mean(Bias) | –0.024 | 0.005 | –0.057
Std. Dev. | 0.034 | 0.045 | 0.131
Mean(std error) | 0.035 | 0.044 | 0.130
95 % bootstrap CI cov. | 0.897 | 0.944 | 0.918
Smooth IPW | | |
Mean(Bias) | –0.013 | 0.007 | –0.038
Std. Dev. | 0.041 | 0.041 | 0.133
Mean(std error) | 0.040 | 0.040 | 0.132
95 % bootstrap CI cov. | 0.937 | 0.939 | 0.941
SEM | | |
Mean(Bias) | 0 | –0.001 | 0.002
Std. Dev. | 0.025 | 0.039 | 0.098
Mean(std error) | 0.026 | 0.037 | 0.095
95 % bootstrap CI cov. | 0.951 | 0.932 | 0.938
Combined | | |
Mean(Bias) | –0.006 | 0.002 | –0.016
Std. Dev. | 0.031 | 0.039 | 0.109
Mean(std error) | 0.035 | 0.039 | 0.117
95 % bootstrap CI cov. | 0.975 | 0.951 | 0.971
KM Kaplan–Meier estimator, IPW inverse probability weighted estimator, Smooth IPW smooth inverse probability weighted estimator, SEM semiparametric estimator, Combined combined estimator, as defined in the text
The third setting we investigated considers the case where the survival time depends on four markers Y1, . . . , Y4, but only the first two are available for modeling. In particular, Y comes from a multivariate normal distribution with mean 0, unit variances, and pairwise correlations of 0.25. The survival time relates to the markers through a model with hazard function λ(t|Y) = 0.1*{3Y1 + 1.5Y2 + 2Y3 + 2.5Y4 + exp(3Y1)}. Note that in this setting model Q is misspecified, as it depends only on Y1 and Y2. Censoring time was generated as in the first setting and does not depend on T or Y. Since the SEM estimator misspecifies the relationship between T and Y as λ(t|Y) = λ0(t)exp(β1Y1 + β2Y2), it yields biased results. All other estimators are unbiased (Table 4). Across the three settings we considered, the combined estimator remained unbiased and more efficient than the other nonparametric estimators.
Table 4.
Method | Pr(Qi − Pi > 0 | Ti ≤ t) | Pr(Qi − Pi > 0 | Ti > t) | NRI(t)
---|---|---|---
True values | 0.646 | 0.395 | 0.504
KM | | |
Mean(Bias) | 0.007 | –0.002 | 0.016
Std. Dev. | 0.072 | 0.023 | 0.160
Mean(std error) | 0.074 | 0.024 | 0.164
95 % bootstrap CI cov. | 0.94 | 0.945 | 0.947
IPW | | |
Mean(Bias) | 0.004 | –0.001 | 0.008
Std. Dev. | 0.072 | 0.023 | 0.160
Mean(std error) | 0.074 | 0.024 | 0.165
95 % bootstrap CI cov. | 0.945 | 0.942 | 0.95
Smooth IPW | | |
Mean(Bias) | 0.003 | –0.001 | 0.007
Std. Dev. | 0.072 | 0.023 | 0.160
Mean(std error) | 0.074 | 0.024 | 0.164
95 % bootstrap CI cov. | 0.943 | 0.946 | 0.95
SEM | | |
Mean(Bias) | –0.046 | 0.003 | –0.099
Std. Dev. | 0.022 | 0.022 | 0.068
Mean(std error) | 0.022 | 0.023 | 0.068
95 % bootstrap CI cov. | 0.448 | 0.943 | 0.682
Combined | | |
Mean(Bias) | –0.009 | 0.000 | –0.020
Std. Dev. | 0.057 | 0.022 | 0.128
Mean(std error) | 0.062 | 0.023 | 0.139
95 % bootstrap CI cov. | 0.970 | 0.947 | 0.976
KM Kaplan–Meier estimator, IPW inverse probability weighted estimator, Smooth IPW smooth inverse probability weighted estimator, SEM semiparametric estimator, Combined combined estimator, as defined in the text
To evaluate the overfitting-correction procedures described above, we simulated 10 markers from a multivariate normal distribution with mean 0, unit variances, and pairwise correlations equal to 0.25. The number of parameters and the sample size were chosen to mimic the setting of our data example described in Sect. 6. We considered a Cox model for the failure time, with log hazard ratio parameters for the 10 markers specified as β = (log(2), log(0.77), 0, log(1.81), 0, 0, 0, log(0.5), 0, log(1.2)). The baseline model consists only of the first marker. To derive a new model based on the information from all 10 markers, for each simulation we first fit a model with all ten markers; the expanded model then consists of all markers whose coefficients differ from zero at the α = 0.05 level (a sketch of this step follows Table 5). We found that, in the case of estimating NRI under our simulated scenario, the apparent summaries are quite close to the true values in many cases. Since the bias is of the order g/n, where g is the number of predictors under consideration for risk model building, overfitting may be of more concern when large numbers of genetic markers are involved with a relatively small sample size. In the situations with a slight indication of overfitting, the 0.632 bootstrap procedure appears adequate for correcting the bias (see Table 5).
Table 5.
Estimator | Pr(Qi − Pi > 0 | Ti ≤ t) | Pr(Qi − Pi > 0 | Ti > t) | NRI(t)
---|---|---|---
True values | 0.684 | 0.275 | 0.817
Smooth IPW, apparent | | |
Mean(Bias) | 0.000 | 0.004 | –0.007
Std. Dev. | 0.036 | 0.028 | 0.108
CI coverage | 0.962 | 0.963 | 0.964
Smooth IPW, 0.632 bootstrap | | |
Mean(Bias) | –0.008 | 0.008 | –0.032
Std. Dev. | 0.034 | 0.027 | 0.102
CI coverage | 0.971 | 0.969 | 0.968
Smooth IPW, bootstrapped SE | | |
Mean(std error) | 0.039 | 0.030 | 0.114
SEM, apparent | | |
Mean(Bias) | 0.003 | –0.001 | 0.009
Std. Dev. | 0.023 | 0.025 | 0.072
CI coverage | 0.955 | 0.954 | 0.945
SEM, 0.632 bootstrap | | |
Mean(Bias) | 0.005 | –0.003 | 0.015
Std. Dev. | 0.022 | 0.024 | 0.072
CI coverage | 0.953 | 0.962 | 0.937
SEM, bootstrapped SE | | |
Mean(std error) | 0.024 | 0.025 | 0.071
Combined, apparent | | |
Mean(Bias) | 0.001 | 0.001 | 0.001
Std. Dev. | 0.028 | 0.026 | 0.087
CI coverage | 0.982 | 0.969 | 0.975
Combined, 0.632 bootstrap | | |
Mean(Bias) | –0.002 | 0.003 | –0.008
Std. Dev. | 0.027 | 0.025 | 0.085
CI coverage | 0.989 | 0.975 | 0.983
Combined, bootstrapped SE | | |
Mean(std error) | 0.035 | 0.028 | 0.102
Smooth IPW smooth inverse probability weighted estimator, SEM semiparametric estimator, Combined combined estimator, as defined in the text
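A sketch of the model-building step used in this setting follows; the data frame d and the marker column names M1–M10 are hypothetical, and we assume at least one marker is retained.

```r
library(survival)

## Sketch of the model-building step: fit all ten markers and keep
## those significant at the 0.05 level for the expanded model. The data
## frame `d` (columns X, delta, M1-M10) is hypothetical.
mnames <- paste0("M", 1:10)
rhs    <- paste(mnames, collapse = " + ")
full   <- coxph(as.formula(paste("Surv(X, delta) ~", rhs)), data = d)
keep   <- mnames[summary(full)$coefficients[, "Pr(>|z|)"] < 0.05]
expanded <- coxph(as.formula(paste("Surv(X, delta) ~",
                                   paste(keep, collapse = " + "))), data = d)
```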
6 Example
The Framingham risk model (FRM) has been used for population-wide CVD risk assessment. The model was developed based on several common clinical risk factors, including age, gender, total cholesterol level, high-density lipoprotein (HDL) cholesterol level, smoking, systolic blood pressure and high blood pressure treatment (Wilson et al. 1998). To improve the predictive capacity of the FRM, a new risk model has recently been developed using data from the Women's Health Study (Cook et al. 2006), based on the variables in the Framingham risk model together with an inflammation marker, C-reactive protein (CRP). Prior to adopting the new model in routine practice, it is important to quantify its prediction performance, especially relative to that of the FRM. We illustrate here how our proposed procedures can be used to evaluate and compare the clinical utility of the two risk models using an independent dataset from the Framingham Offspring Study (Kannel et al. 1979).
The Framingham Offspring Study was established in 1971 with 5,124 participants who were monitored prospectively for epidemiological and genetic risk factors for CVD. We consider here the 1,728 female participants who had CRP measured along with other clinical information at the second exam and were free of CVD at the time of that examination. The average age of this subset was about 44 years (standard deviation = 10). The outcome we consider is the time from the exam date to the first major CVD event, including CVD-related death. During the follow-up period, 269 participants were observed to experience at least one CVD event, and the 10-year event rate was about 4%. For illustration we chose t = 10 years, as in Wilson et al. (1998). For each individual, two risk scores were calculated: one based on the FRM (Model 1), combining information on age, systolic blood pressure, smoking status, high-density lipoprotein (HDL) cholesterol, total cholesterol, and medication for hypertension; the other based on an algorithm developed in Cook et al. (2006) (Model 2), which adds CRP concentration. We use Cox models to specify the relation between the time to CVD events and the model scores (linear predictors from the models).
Both models are well calibrated based on calibration plots (not shown). For comparison, we first give AUC results, using the bootstrap to obtain confidence intervals. The AUC at 10 years is 0.752 (95% CI: 0.721, 0.783) for Model 1 and 0.758 (95% CI: 0.729, 0.787) for Model 2. The difference between the two AUCs is not statistically significant: 0.006 (95% CI: –0.033, 0.046). We next investigate whether the new model reclassifies patients with respect to their risks and CVD outcomes at 10 years, considering NRI(t) at t = 10 years using the methods described in Sect. 3. Table 6 shows that the estimates from the three nonparametric estimators are quite consistent, all indicating that the new model does not add significant improvement as gauged by NRI. The semiparametric estimator, however, does indicate a significant incremental value, with NRI = 0.167 (SE = 0.067), and the combined estimator indicates a similar magnitude of improvement, though not significant (NRI = 0.132, SE = 0.137). Note that since we considered the continuous NRI with u = v = 0, an observed improvement of this magnitude may not be interpreted as clinically substantial. Since different conclusions could be reached depending on which estimation method is chosen, this analysis highlights the need to consider multiple robust approaches when calculating NRI.
Table 6.
Method | Pr(Qi − Pi > 0 | Ti ≤ t) | Pr(Qi − Pi > 0 | Ti > t) | NRI(t)
---|---|---|---
KM | |||
Est | 0.483 | 0.508 | –0.049 |
SE | 0.069 | 0.028 | 0.176 |
IPW | |||
Est | 0.478 | 0.508 | –0.059 |
SE | 0.070 | 0.028 | 0.178 |
Smooth IPW | |||
Est | 0.480 | 0.508 | –0.057 |
SE | 0.070 | 0.028 | 0.178 |
SEM | |||
Est | 0.587 | 0.503 | 0.167 |
SE | 0.015 | 0.026 | 0.067 |
Combined | |||
Est | 0.570 | 0.504 | 0.132 |
SE | 0.054 | 0.027 | 0.137 |
KM Kaplan–Meier estimator, IPW inverse probability weighted estimator, Smooth IPW smooth inverse probability weighted estimator, SEM semiparametric estimator, Combined combined estimator, as defined in the text
7 Discussion
NRI provides an alternative tool for evaluating risk prediction models (Pencina et al. 2008) beyond the traditional ROC curve framework. The concept has continued to gain popularity in the medical literature, yet its statistical properties have not been well studied to date, and existing methods for calculating NRI with failure time outcomes are limited. In this manuscript, we provide a more thorough investigation of a variety of estimation procedures. Our proposed nonparametric and semiparametric estimators improve upon existing methods in terms of both robustness and efficiency under a variety of practical situations. Such improvement is quite important, since we observe that, compared with other measures such as the AUC, NRI estimates are in general not very stable, with substantial variation across the estimators we have considered. The proposed procedures can be used for estimating both the continuous NRI and the NRI with pre-specified fixed categories. As illustrated in the example, the choice of estimation method can lead to different conclusions. In practice, the method chosen should depend on a number of important considerations, including the likelihood that the model has been correctly specified and that the assumptions concerning censoring are correct. In addition, in situations where the new marker may be expensive or difficult to ascertain, an approach that weighs both the risks and benefits of obtaining the marker should enter the decision-making process. We recommend that such measures be used in practice with caution. A thorough evaluation of a risk model should consider a wide spectrum of measures for assessing discrimination and calibration, and NRI may be best used as one of several summary measures complementing graphical displays of risk distributions (Gu and Pepe 2009). All analyses were performed in R. Code for implementing the proposed procedures is available upon request.
Acknowledgment
The Framingham Heart Study and the Framingham SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University. The Framingham SHARe data used for the analyses described in this manuscript were obtained through dbGaP (access number: phs000007.v3.p2). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or the NHLBI. The work is supported by grants U01-CA86368, P01-CA053996, R01-GM085047, R01-GM079330 awarded by the National Institutes of Health.
Appendix
Throughout, we assume that the joint density of (T, C, Y) is twice continuously differentiable, Y is bounded, and 1 > P(T > t) > 0, 1 > P(C > t) > 0. The kernel function K is a symmetric probability density function with compact support and bounded second derivative. The bandwidth h → 0 such that nh4 → 0. In addition, the estimator θ̂k converges to θ0k = {Λk0(·), βk0} for k = 1, 2 as n → ∞ (Hjort 1992), where βk0 is the unique maximizer of the expected value of the corresponding partial likelihood and Λk0 is the corresponding baseline cumulative hazard for k = 1, 2. We denote the parameter space for θk by Ωk and assume that Ωk is a compact set containing θ0k. Furthermore, we assume that β2 ≠ 0 and note that Q(θ02) and P(θ01) are the respective limits of Q(θ̂2) and P(θ̂1), for any given Y(2) and Y(1). The in-probability convergence of Q(θ̂2) and P(θ̂1) to Q(θ02) and P(θ01) is uniform in Y(2) and Y(1) due to the convergence of θ̂k.
Asymptotic Properties of NRÎSIPW(t)
From the same arguments as given in Cai et al. (2010) and Dabrowska (1997), we have the uniform consistency of the kernel-smoothed censoring survival estimator Ĥq(·) for Hq(·) = P{C > · | Q(θ02) = q}, uniformly in q. It then follows, using the uniform law of large numbers (Pollard 1990), that the weighted averages in (3.3) and (3.4), with sums taken both over the subset with Δi(θ) = 1 and over all individuals, converge uniformly to their limits. This, along with the convergence of θ̂ to θ0, implies that NRÎSIPW(t) is uniformly consistent for NRI(θ0, t).
Throughout, we will use the fact that the censoring weights have the correct limiting expectations if either C ⊥ (T, Y(2)), in which case the model may be misspecified, or Q(θ02) = P(T ≤ t | Y(2)), i.e. the Cox model is correctly specified while censoring may depend on covariates with C ⊥ T | Y(2) (double robustness). We first derive an i.i.d representation of NRÎSIPW(θ, t) for any fixed θ. Note that NRÎSIPW(θ, t) = 2[P̂SIPW{B(θ) > 0 | T ≤ t} − P̂SIPW{B(θ) > 0 | T > t}]. We first examine the component P̂SIPW{B(θ) > 0 | T ≤ t}, a ratio of weighted sums. By the uniform consistency of the IPW weights, the denominator converges uniformly to its limit, so it suffices to derive an expansion for the centered numerator.
Examining the numerator, we decompose it into three terms. Using a Taylor series expansion, Lemma A.3 of Bilias et al. (1997) and the asymptotic expansion for the conditional Kaplan–Meier estimator given in Du and Akritas (2002), followed by a change of variable, each term is asymptotically equivalent to a sum of i.i.d terms, with summands denoted U1i(t), U2i(t) and U3i(t) and limiting denominator D(t). The same arguments yield an asymptotic expansion for P̂SIPW{B(θ) > 0 | T > t}, with D(t)−, U−1i(t), U−2i(t), and U−3i(t) defined analogously to D(t), U1i(t), U2i(t), and U3i(t) with the event T ≤ t replaced by T > t.
Note that, regardless of correct model specification, n1/2(θ̂ − θ0) is asymptotically equivalent to n−1/2 Σi ψi, where the ψi are i.i.d mean zero random variables, by Lin and Wei (1989) and Uno et al. (2009). Using a Taylor series approximation together with the i.i.d representation of NRÎSIPW(θ, t) for any fixed θ, we can therefore write n1/2{NRÎSIPW(t) − NRI(θ0, t)} as a sum of i.i.d terms, n−1/2 Σi εi(t). By a functional central limit theorem of Pollard (1990), the process n−1/2 Σi εi(t) converges weakly to a mean zero Gaussian process in t.
Asymptotic Properties of NRÎSEM(t)
Recall that we assume the Cox model is correctly specified and thus Q(θ02) = P(T ≤ t | Y(2)). To derive asymptotic properties of NRÎSEM(t) we assume the same regularity conditions as in Andersen and Gill (1982). The uniform consistency of Q(θ̂2) for Q(θ02) in t and Y(2) follows directly from the uniform consistency of β̂2 and Λ̂02. It follows from the uniform law of large numbers (Pollard 1990) that NRÎSEM(t) is uniformly consistent for NRI(θ0, t). Andersen and Gill (1982) show that n1/2(β̂2 − β20) is asymptotically normal and that n1/2{Λ̂02(·) − Λ02(·)} converges to a Gaussian process. By the functional delta method it can be shown that n1/2{Q(θ̂2) − Q(θ02)} converges to a zero mean Gaussian process in t and Y(2) (Zheng et al. 2008). Similar to the derivation for NRÎSIPW(t), it can be shown that the process n1/2{NRÎSEM(t) − NRI(θ0, t)} is asymptotically equivalent to a sum of i.i.d terms, n−1/2 Σi ζi(t). Once again, using a functional central limit theorem, this implies that n1/2{NRÎSEM(t) − NRI(θ0, t)} converges to a Gaussian process with mean zero.
Contributor Information
Yingye Zheng, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA.
Layla Parast, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
Tianxi Cai, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
Marshall Brown, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA.
References
- Andersen P, Gill R. Cox's regression model for counting processes: a large sample study. Ann Stat. 1982;10:1100–1120.
- Bilias Y, Gu M, Ying Z. Towards a general asymptotic theory for Cox model with staggered entry. Ann Stat. 1997;25:662–682.
- Cai T, Tian L, Uno H, Solomon S, Wei L. Calibrating parametric subject-specific risk estimation. Biometrika. 2010;97:389–404. doi:10.1093/biomet/asq012.
- Cook N. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928. doi:10.1161/CIRCULATIONAHA.106.672402.
- Cook N, Buring J, Ridker P. The effect of including C-reactive protein in cardiovascular risk prediction models for women. Ann Intern Med. 2006;145:21. doi:10.7326/0003-4819-145-1-200607040-00128.
- Cui J. Overview of risk prediction models in cardiovascular disease research. Ann Epidemiol. 2009;19:711–717. doi:10.1016/j.annepidem.2009.05.005.
- Dabrowska D. Smoothed Cox regression. Ann Stat. 1997;25(4):1510–1540.
- Du Y, Akritas M. Uniform strong representation of the conditional Kaplan–Meier process. Math Methods Stat. 2002;11:152–182.
- Efron B, Tibshirani R. Improvements on cross-validation: the .632+ bootstrap method. J Am Stat Assoc. 1997;92(438):548–560.
- Gail M, Brinton L, Byar D, Corle D, Green S, Schairer C, Mulvihill J. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst. 1989;81:1879. doi:10.1093/jnci/81.24.1879.
- Gu W, Pepe M. Measures to summarize and compare the predictive capacity of markers. Int J Biostat. 2009;5:27. doi:10.2202/1557-4679.1188.
- Hemann B, Bimson W, Taylor A. The Framingham Risk Score: an appraisal of its benefits and limitations. Am Heart Hosp J. 2007;5:91–96. doi:10.1111/j.1541-9215.2007.06350.x.
- Hjort N. On inference in parametric survival data models. Int Stat Rev. 1992;60(3):355–387.
- Kannel W, Feinleib M, McNamara P, Garrison R, Castelli W. An investigation of coronary heart disease in families. Am J Epidemiol. 1979;110:281. doi:10.1093/oxfordjournals.aje.a112813.
- Khot U, Khot M, Bajzer C, Sapp S, Ohman E, Brener S, Ellis S, Lincoff A, Topol E. Prevalence of conventional risk factors in patients with coronary heart disease. JAMA. 2003;290:898–904. doi:10.1001/jama.290.7.898.
- Lin D, Wei L. The robust inference for the Cox proportional hazards model. J Am Stat Assoc. 1989;84:1074–1078.
- Lloyd-Jones D. Cardiovascular risk prediction. Circulation. 2010;121:1768–1777. doi:10.1161/CIRCULATIONAHA.109.849166.
- Pencina M, D'Agostino R Sr. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30:11–21. doi:10.1002/sim.4085.
- Pencina M, D'Agostino R Sr, D'Agostino R Jr. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27:157–172. doi:10.1002/sim.2929.
- Pollard D. Empirical processes: theory and applications. Institute of Mathematical Statistics, Hayward; 1990.
- Satten G, Datta S. The Kaplan–Meier estimator as an inverse-probability-of-censoring weighted average. Am Stat. 2001;55:207–210. doi:10.1198/000313001317098185.
- Tian L, Cai T, Goetghebeur E, Wei L. Model evaluation based on the sampling distribution of estimated absolute prediction error. Biometrika. 2007;94:297–311.
- Uno H, Cai T, Tian L, Wei L. Evaluating prediction rules for t-year survivors with censored regression models. J Am Stat Assoc. 2007;102:527–537.
- Uno H, Tian L, Cai T, Kohane I, Wei L. Comparing risk scoring systems beyond the ROC paradigm in survival analysis. Harvard University Biostatistics Working Paper Series. 2009;107.
- Wilson P, D'Agostino R, Levy D, Belanger A, Silbershatz H, Kannel W. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97:1837. doi:10.1161/01.cir.97.18.1837.
- Zheng Y, Cai T, Pepe M, Levy W. Time-dependent predictive values of prognostic biomarkers with failure time outcome. J Am Stat Assoc. 2008;103:362–368. doi:10.1198/016214507000001481.