Published in final edited form as: Stat Med. 2017 Jul 31;36(25):4041–4049. doi: 10.1002/sim.7414

On comparing two correlated C indices with censored survival data

Xiaoxia Han a, Yilong Zhang b, Yongzhao Shao a,*

Abstract

As new biomarkers and risk prediction procedures are in rapid development, it is of great interest to develop valid methods for comparing the predictive power of two biomarkers or risk score systems. Harrell’s C statistic has been routinely used as a global adequacy assessment of a risk score system, and the difference of two Harrell’s C statistics has been suggested in the recent literature as a test statistic for comparing the predictive power of two biomarkers for a censored outcome. In this study, we show that such a test can have severely inflated type I error, because the difference between the two Harrell’s C statistics does not generally converge to zero under the null hypothesis of equal predictive power as measured by concordance probabilities; we illustrate this with two counterexamples and corresponding numerical simulations. We further investigate a necessary and sufficient condition under which the difference of two Harrell’s C statistics converges to zero under the null hypothesis.

Keywords: C index, concordance probability, censored data, predictive models, discriminatory power

1. Introduction

As new biomarkers and diagnostic/prognostic procedures are in rapid development, methods to evaluate and compare the performance of two biomarkers or two given scoring algorithms are of great current interest. In particular, discriminatory accuracy for predicting binary outcomes is often compared using the receiver operating characteristic (ROC) curve, e.g., the area under the ROC curve (AUC), which is a measure of concordance probability [1, 2]. For testing the added value of new biomarkers over an existing predictive model, Pepe et al. [3] have shown that, under some regularity conditions, the hypothesis of no added value for nested models is equivalent to the hypothesis that the parameters corresponding to the added predictors equal zero. As a follow-up, many experts have advocated using the likelihood ratio test for the significance of the added predictors instead of directly testing performance measures such as differences between correlated AUCs. Clearly, the usefulness of this approach depends crucially on the assumption of correctly specified nested models. Importantly, correct model specification is frequently challenging in applications, e.g., for censored survival outcomes, where specifying a correct model for the unknown censoring distribution is often practically impossible. It is therefore of interest to study nonparametric procedures [4, 5, 6, 7]. Additionally, in many applications, it is important to compare two non-nested biomarkers or predictive models. For example, Wieand et al. [8] investigated a case-control study for pancreatic cancer to determine which of two serum biomarkers, the cancer antigen CA-125 and the carbohydrate antigen CA-19-9, better distinguishes cases from controls. Etzioni et al. [9] analyzed serum levels of prostate-specific antigen (PSA) for prostate cancer and compared the predictive performance of two different risk scores based on measures of PSA, namely total serum PSA and the ratio of free to total PSA. For Alzheimer’s disease (AD), patients tend to have higher values of cerebrospinal fluid (CSF) tau protein than normal controls [10, 11, 12], and recent studies have shown a significant elevation of plasma tau in AD patients [13, 14, 15]. Comparing the performance of the invasive and expensive CSF tau protein [16, 17] with that of the non-invasive plasma tau protein is therefore of great interest for diagnosing AD [18]. In short, it is important to investigate nonparametric procedures that compare the performance of two correlated, non-nested biomarkers or predictive models. For predicting a binary outcome, nonparametric comparison of the AUCs between the two ROCs of correlated risk scores X and Y can often be made using DeLong’s test [4]. However, nonparametric comparison of predictive accuracy for survival outcomes is much more complicated in the presence of censoring, as discussed next.

For a randomly selected subject, let T, D and X denote the event time, censoring time, and predictive score, respectively. Without much loss of generality, and for ease of exposition, we assume no tied observations in T, D and X. Let T̃ = min(T, D) denote the censored observed time and δ = I(T < D) the event indicator under right censoring. Harrell’s C statistic has been routinely used as a global adequacy assessment of a risk score system. More specifically, for independently observed data (T̃i, δi, Xi), i = 1, …, n, Harrell’s C statistic [19] is defined as

$$\hat{C}_{TX} = \frac{\sum_{i \neq j} \delta_i \, I(\tilde{T}_i < \tilde{T}_j) \, I(X_i > X_j)}{\sum_{i \neq j} \delta_i \, I(\tilde{T}_i < \tilde{T}_j)}. \tag{1}$$
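
For concreteness, equation (1) can be computed directly from its definition. The following is a minimal sketch in R (the language of the compareC package discussed below); the function name harrell_c and the simple O(n²) double loop are ours, for illustration only.

```r
# Minimal sketch of Harrell's C statistic, equation (1).
#   time:   observed times, i.e. min(T, D) for each subject
#   status: event indicators, delta = I(T < D)
#   score:  risk scores X (higher score taken to predict earlier events)
harrell_c <- function(time, status, score) {
  n <- length(time)
  num <- 0
  den <- 0
  for (i in seq_len(n)) {
    if (status[i] == 0) next            # a pair is usable only if subject i is an event
    for (j in seq_len(n)[-i]) {
      if (time[i] < time[j]) {          # comparable pair: i observed to fail first
        den <- den + 1
        num <- num + as.numeric(score[i] > score[j])
      }
    }
  }
  num / den
}
```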

Note that ĈTX is commonly regarded as an estimate for the following concordance probability [20]

$$C_{TX} = \operatorname{pr}(X_1 > X_2 \mid T_1 < T_2), \tag{2}$$

where (T1, X1) and (T2, X2) are bivariate observations from a pair of randomly selected independent subjects. The concordance index CTX is a useful global assessment of the predictive power of a risk score system X. Unfortunately, ĈTX is generally a biased estimator of CTX, and its bias depends on the unknown censoring distribution [5]. More precisely, instead of converging to the concordance probability CTX, Harrell’s C statistic ĈTX is known to converge to the following quantity BTX:

$$B_{TX} = \operatorname{pr}(X_1 > X_2 \mid T_1 < T_2,\; T_1 < D_1 \wedge D_2), \tag{3}$$

which clearly depends on the distributions of the censoring times D1 and D2. Note that without censoring we have D1 ∧ D2 = ∞, and thus CTX = BTX; that is, Harrell’s C statistic ĈTX converges to the concordance probability CTX in the absence of right censoring.

Let X and Y be two biomarkers or risk prediction procedures for a subject with event time T. A question of interest is to compare the predictive power of X and Y, i.e., to test the following null hypothesis:

$$H_0 : C_{TX} = C_{TY}. \tag{4}$$

For testing the above hypothesis, Antolini et al. [21] proposed a jackknife-based methodology to compare two correlated C indices. Uno et al. [5] and Liu et al. [6] independently investigated test procedures based on the inverse-probability-of-censoring weighted (IPCW) C statistic, defined as

$$\tilde{C}_{TX} = \frac{\sum_{i \neq j} \delta_i \{\hat{G}(\tilde{T}_i)\}^{-2} I(\tilde{T}_i < \tilde{T}_j) \, I(X_i > X_j)}{\sum_{i \neq j} \delta_i \{\hat{G}(\tilde{T}_i)\}^{-2} I(\tilde{T}_i < \tilde{T}_j)}, \tag{5}$$

where Ĝ(·) is the Kaplan-Meier estimator of the censoring distribution. The IPCW C statistic C̃TX is consistent for CTX when censoring is noninformative, i.e., when the survival time T is independent of the censoring time D [5, 6]. In practice, however, T is generally not independent of D; for example, subjects in a study are frequently lost to follow-up differentially according to the severity of the disease. Researchers therefore commonly adopt the conditional independent censoring assumption, under which the event time is independent of the censoring time conditional on covariates [22]. Moreover, the IPCW-based approach is computationally intensive, since it involves resampling-based evaluation of the variance of the test statistic as well as of p-values. Specifically, let Δ̃ = C̃TX − C̃TY denote the difference between the two IPCW-based C statistics; Uno et al. [5] and Liu et al. [6] both proposed Wald-type test statistics that require estimation of the standard error se(Δ̃). To calculate this standard error, Uno et al. [5] used a perturbation resampling method, while Liu et al. [6] provided an asymptotic variance formula and estimated it empirically.
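
To make the weighting in equation (5) concrete, here is a minimal R sketch, assuming Ĝ is the Kaplan-Meier estimator applied with the censoring indicators 1 − δ treated as events; the function name ipcw_c is ours, and edge cases (e.g. Ĝ reaching zero at late times, or the truncation time τ used by Uno et al. [5]) are ignored.

```r
library(survival)

# Minimal sketch of the IPCW C statistic, equation (5).
ipcw_c <- function(time, status, score) {
  # Kaplan-Meier estimate of the censoring survival function G:
  # treat censorings (1 - status) as the "events".
  km <- survfit(Surv(time, 1 - status) ~ 1)
  G  <- stepfun(km$time, c(1, km$surv))    # G-hat as a right-continuous step function
  n <- length(time)
  num <- 0
  den <- 0
  for (i in seq_len(n)) {
    if (status[i] == 0) next
    w <- 1 / G(time[i])^2                  # IPCW weight {G-hat(T-tilde_i)}^(-2)
    for (j in seq_len(n)[-i]) {
      if (time[i] < time[j]) {
        den <- den + w
        num <- num + w * as.numeric(score[i] > score[j])
      }
    }
  }
  num / den
}
```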

Liu et al. [6] also numerically compared the IPCW method to the test procedure based on the difference between two Harrell’s C statistics, i.e., ΔĈ = ĈTX − ĈTY. Interestingly, their simulation results indicated that the ΔĈ-based procedure produced negligible bias in estimating ΔC = CTX − CTY. Recently, for testing H0 : CTX = CTY, Kang et al. [7] proposed a formal nonparametric approach using ΔĈ as a test statistic and developed an analytical form of the standard error of ΔĈ. Compared with the resampling procedures, the testing procedure proposed by Kang et al. [7] is computationally efficient. In their simulation study, the two predictive scores X and Y were generated from a bivariate normal distribution under independent censoring; in this setting, their results indicated that ΔĈ led to negligible bias and had satisfactory performance in terms of Type I error. Kang et al. [7] also provided a publicly accessible R package, compareC, to implement their test procedure. However, Harrell’s C statistic ĈTX is well known to be a biased estimate of the concordance probability CTX, and in general ΔĈ = ĈTX − ĈTY may not have mean zero, even asymptotically, under the null hypothesis H0 : CTX = CTY.

In this paper, we address two important questions regarding the comparison of two correlated C indices: 1) Does the test procedure based on the difference of Harrell’s C statistics have valid Type I error in general? 2) If not, what is a necessary and sufficient condition under which ΔĈ = ĈTX − ĈTY converges to zero under the null hypothesis H0 : CTX = CTY? The rest of this paper is organized as follows. In Section 2, we present a change-point Cox proportional hazards (PH) model and a threshold model as two practical counterexamples in which the approach based on the difference of two Harrell’s C statistics (i.e., ΔĈ) can have severely inflated Type I error. We also compare Kang et al.’s method with the IPCW method. Moreover, we investigate a necessary and sufficient condition under which ΔĈ converges to zero under the null hypothesis H0 : CTX = CTY. The paper is concluded in Section 3.

2. Results

Under H0 : CTX = CTY, for the test procedure based on ΔĈ to be valid, it is necessary that ΔĈ converges to zero in probability. Since ĈTX converges to BTX and ĈTY converges to BTY, this is equivalent to BTX − BTY = 0. However, BTX − BTY = 0 does not always hold under H0 : CTX = CTY. In this section, we first discuss two counterexamples and then provide a necessary and sufficient condition for BTX − BTY = 0 under H0 : CTX = CTY.

2.1. Counterexamples

The Cox PH model is widely used in survival analysis; it assumes that the hazard ratio (HR) does not depend on time. In many applications, however, the HR changes at some time point during follow-up, for example after a major operation such as organ transplantation, or when a treatment does not exert its effect until after a time lag. Specifically, Liang et al. [23] developed a change-point variant of the Cox PH model to investigate relative risks varying with early versus late onset of disease while incorporating other covariates. Here we consider a similar change-point Cox PH model with two covariates X and Y: the hazard ratio of X does not change over time, while the hazard ratio of Y differs before and after a change point τ. We assume an exponential baseline hazard with constant rate λ0, so the hazard function is

$$\lambda(t \mid X, Y) = \lambda_0 \exp\{\beta_1 X + \beta_2(t) Y\}, \tag{6}$$

where

$$\beta_2(t) = \beta_{20}\, I(t \le \tau) + \beta_{21}\, I(t > \tau), \quad \text{with } \beta_{20} \neq \beta_{21}.$$

Under this setting, risk score X has effect β1, which does not change during follow-up, while risk score Y has effect β20 before time τ and β21 after time τ.

We conduct simulation studies to investigate the nonparametric test procedure proposed by Kang et al. [7] using their package compareC. For the change-point model (6), the covariates X and Y are both generated from Uniform[0, 1]. The parameters of the hazard function are λ0 = 0.01, β1 = 3, β20 = 0.1, and β21 = 15, and the change point τ = 8.59 is selected such that CTX = CTY = 0.655. We consider both exponential censoring exp(λ) and uniform censoring U(0, c2); the censoring parameters λ and c2 are chosen to achieve the desired censoring proportions. We repeat the simulation 5000 times and report the observed proportion of rejections. The simulation results are summarized in Table 1.
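
The sketch below generates one replicate of this simulation in R. Survival times are drawn by inverse-transform sampling from the piecewise-linear cumulative hazard implied by model (6); the call to compareC follows our reading of the package documentation (the argument order timeX, statusX, scoreY, scoreZ and the pval component should be verified against ?compareC), and the censoring rate shown is illustrative.

```r
# install.packages("compareC")   # Kang et al.'s package
library(compareC)

set.seed(2017)
n    <- 200
lam0 <- 0.01; b1 <- 3; b20 <- 0.1; b21 <- 15; tau <- 8.59

X <- runif(n)
Y <- runif(n)

# Inverse-transform sampling: solve Lambda(T) = E with E ~ Exp(1), where the
# cumulative hazard is piecewise linear:
#   Lambda(t) = h0 * t                          for t <= tau,
#   Lambda(t) = h0 * tau + h1 * (t - tau)       for t >  tau.
h0 <- lam0 * exp(b1 * X + b20 * Y)   # hazard before the change point
h1 <- lam0 * exp(b1 * X + b21 * Y)   # hazard after the change point
E  <- rexp(n)
T  <- ifelse(E <= h0 * tau, E / h0, tau + (E - h0 * tau) / h1)

D      <- rexp(n, rate = 0.05)       # exponential censoring; tune rate for % censored
time   <- pmin(T, D)
status <- as.numeric(T < D)

res <- compareC(time, status, X, Y)  # one-shot test of H0: C_TX = C_TY
res$pval                             # field name per package docs; check with str(res)
```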

Table 1.

Change-point Cox PH simulation results. Covariates X and Y are generated from Uniform[0, 1]. The hazard-function parameters are λ0 = 0.01, β1 = 3, β20 = 0.1, and β21 = 15. The change point τ = 8.59 is selected such that CTX = CTY = 0.655, based on 100,000 simulations. The parameters λ for exponential censoring and c2 for Uniform(0, c2) censoring are chosen to achieve the desired censoring proportions.

Censoring distribution / n    % censored    ĈTX    ĈTY    ĈTX − ĈTY    % Reject H0 at α = 5%    % Reject H0 at α = 10%
Exponential n=100 0% 0.655 0.655 0.000 5.6% 10.7%
10% 0.661 0.642 0.020 7.6% 13.3%
20% 0.669 0.627 0.042 12.7% 20.3%
33% 0.679 0.604 0.075 24.6% 35.4%
50% 0.694 0.575 0.119 41.8% 54.2%
n=200 0% 0.655 0.655 0.000 5.5% 10.3%
10% 0.661 0.642 0.019 8.9% 15.7%
20% 0.668 0.627 0.041 19.0% 27.9%
33% 0.678 0.604 0.075 41.5% 53.6%
50% 0.693 0.571 0.123 70.7% 80.9%
n=500 0% 0.655 0.655 0.000 4.7% 9.8%
10% 0.661 0.642 0.019 12.6% 20.8%
20% 0.668 0.627 0.041 37.1% 50.3%
33% 0.679 0.604 0.075 79.3% 87.6%
50% 0.694 0.569 0.125 97.7% 99.1%

Uniform n=100 10% 0.661 0.641 0.020 8.1% 13.8%
20% 0.669 0.625 0.044 13.1% 21.4%
33% 0.679 0.602 0.077 26.9% 37.5%
50% 0.697 0.568 0.129 51.5% 62.9%
n=200 10% 0.661 0.642 0.019 8.8% 15.3%
20% 0.669 0.626 0.043 20.7% 30.3%
33% 0.68 0.602 0.077 44.7% 56.6%
50% 0.698 0.563 0.135 80.1% 87.7%
n=500 10% 0.661 0.642 0.019 12.7% 21.1%
20% 0.669 0.626 0.042 40.7% 52.9%
33% 0.679 0.603 0.077 81.6% 89.1%
50% 0.697 0.562 0.135 99.3% 99.7%

As shown in Table 1, the test procedure based on ΔĈ works reasonably well without censoring: both ĈTX and ĈTY are close to the true value 0.655, and the proportions of rejecting H0 are close to the nominal values. When there is censoring, it is no surprise that both ĈTX and ĈTY are biased, since Harrell’s C statistic depends on the censoring distribution. Note that the biases of ĈTX and ĈTY are not the same, so ΔĈ is not close to zero under censoring: when the censoring proportion is 20%, the bias in ΔĈ exceeds 0.04, and when the censoring proportion increases to 50%, the bias is as large as 0.12. Moreover, the bias does not decrease as the sample size increases. In addition, for sample size n = 100, the rejection proportion at the 5% level is far from 5%, e.g., at 50% censoring it is 41.8% under exponential censoring and 51.5% under uniform censoring. As the sample size increases to n = 500, the rejection proportions at the 5% level approach 100% under both censoring distributions.

In addition to the change-point Cox PH model (6) above, a second counterexample is a threshold model, in which CTX and CTY are equal but BTX and BTY differ, because censoring affects ĈTX and ĈTY differently. Specifically, for some fixed time τ, we consider two predictive scores X and Y that, conditional on T, are generated from a mixture of bivariate normal distributions, i.e.,

$$(X, Y) \mid T \sim I(T > \tau)\, \mathcal{N}(\mu, \Sigma_1) + I(T < \tau)\, \mathcal{N}(\mu, \Sigma_2), \tag{7}$$

where

$$\mu = \big(\mu_0 \log(T),\; \mu_0 \log(T)\big), \quad \Sigma_1 = \begin{pmatrix} \sigma_1^2 & \rho \\ \rho & \sigma_2^2 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} \sigma_2^2 & \rho \\ \rho & \sigma_1^2 \end{pmatrix}.$$

The parameter ρ controls the correlation between X and Y conditional on T. When σ2² < σ1², biomarker X has higher prognostic accuracy for early events, while biomarker Y has higher prognostic accuracy for late events.

For the threshold model (7), the event time T is generated from the standard exponential distribution (i.e., with hazard rate 1). The two risk scores X and Y are generated from the mixture of bivariate normal distributions described in equation (7), with parameters μ0 = 2, ρ = 0, σ1² = 9 and σ2² = 0.25. The cut point τ = 0.718 is selected such that CTX = CTY = 0.803. We consider both exponential censoring exp(λ) and uniform censoring U(0, c2), with the censoring parameters λ and c2 chosen to achieve the desired censoring proportions. We repeat the simulation 5000 times and report the observed proportion of rejections. The simulation results, summarized in Table 2, are similar to those for the change-point model. These counterexamples show that Kang et al.’s [7] test procedure based on ΔĈ can, under right censoring, lead to severely inflated Type I error and non-negligible bias; the test procedure based on ΔĈ is therefore invalid in general.
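
The score-generation step can be sketched in R as follows, under the stated parameters. Since ρ = 0 here, each mixture component factors into independent normals, so X and Y can be drawn coordinate-wise given T; whether higher or lower scores are treated as the higher-risk direction depends on the concordance convention of the software used.

```r
set.seed(2017)
n   <- 200
mu0 <- 2; rho <- 0
s1  <- 3                       # sigma_1^2 = 9
s2  <- 0.5                     # sigma_2^2 = 0.25
tau <- 0.718

T  <- rexp(n)                  # event times, hazard rate 1
mu <- mu0 * log(T)             # common conditional mean of X and Y given T

# Variances swap at the threshold tau, per equation (7):
# X is precise for early events (T < tau), Y is precise for late events.
sdX <- ifelse(T > tau, s1, s2)
sdY <- ifelse(T > tau, s2, s1)
X <- rnorm(n, mean = mu, sd = sdX)
Y <- rnorm(n, mean = mu, sd = sdY)
```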

Table 2.

Threshold model simulation results. The event time T is generated from the exponential distribution with rate 1. The two risk scores X and Y are generated from the mixture of bivariate normal distributions described in equation (7), with parameters μ0 = 2, ρ = 0, σ1² = 9 and σ2² = 0.25. The cut point τ = 0.718 is selected such that CTX = CTY = 0.803, based on 100,000 simulations. The parameters λ for exponential censoring and c2 for Uniform(0, c2) censoring are chosen to achieve the desired censoring proportions.

Censoring distribution / n    % censored    ĈTX    ĈTY    ĈTX − ĈTY    % Reject H0 at α = 5%    % Reject H0 at α = 10%
Exponential n=100 0% 0.802 0.803 −0.001 5.5% 10.7%
10% 0.815 0.804 0.011 6.9% 13.1%
20% 0.830 0.806 0.024 10.3% 17.3%
33% 0.851 0.809 0.042 20.2% 29.0%
50% 0.883 0.818 0.065 39.2% 51.5%
n=200 0% 0.802 0.802 0.000 4.7% 10.0%
10% 0.815 0.803 0.012 8.0% 13.8%
20% 0.830 0.805 0.025 16.0% 25.4%
33% 0.851 0.809 0.043 34.7% 45.7%
50% 0.883 0.817 0.067 66.9% 76.8%
n=500 0% 0.802 0.803 −0.001 5.2% 10.0%
10% 0.816 0.804 0.012 10.9% 17.9%
20% 0.831 0.806 0.025 30.8% 42.5%
33% 0.852 0.809 0.043 69.1% 79.3%
50% 0.884 0.818 0.066 96.6% 98.5%

Uniform n=100 10% 0.815 0.803 0.012 7.0% 12.5%
20% 0.829 0.804 0.025 10.7% 18.1%
33% 0.847 0.806 0.041 19.3% 28.4%
50% 0.879 0.811 0.067 42.0% 54.0%
n=200 10% 0.815 0.803 0.012 7.9% 14.0%
20% 0.829 0.804 0.024 15.0% 23.4%
33% 0.847 0.806 0.041 31.7% 43.9%
50% 0.879 0.812 0.067 68.4% 79.1%
n=500 10% 0.815 0.803 0.012 11.1% 18.5%
20% 0.829 0.804 0.024 29.2% 40.4%
33% 0.847 0.806 0.041 64.4% 74.7%
50% 0.879 0.811 0.068 97.2% 98.7%

In addition, we conducted further simulation studies comparing the one-shot nonparametric approach [7] with the IPCW method [6] for the threshold model and the change-point model. The simulation parameters for each model are as described above. We consider the exponential censoring distribution, exp(λ), with values of the censoring parameter λ chosen to achieve the desired censoring proportions. We repeat the simulation 5000 times and report the observed proportion of rejections. Tables 3 and 4 summarize the simulation results for sample sizes n = 100, 200 and 400. Generally, the biases of both methods increase with the censoring proportion, although the IPCW method produces relatively smaller bias than the one-shot nonparametric approach. For the one-shot nonparametric approach, the Type I error is seriously inflated as the censoring proportion increases in both the threshold model and the change-point model. For the IPCW method, the Type I error first becomes conservative and then inflated as the censoring proportion increases in the threshold model, while it becomes increasingly conservative as the censoring proportion increases in the change-point model (6). Moreover, the simulation results across sample sizes indicate that the adverse impact of censoring on both methods cannot be reduced by increasing the sample size.

Table 3.

Comparison of the one-shot nonparametric method and the IPCW method. Change-point Cox PH model: covariates X and Y are generated from Uniform[0, 1]; the hazard-function parameters are λ0 = 0.01, β1 = 3, β20 = 0.1, and β21 = 15; the change point τ = 8.59 is selected such that CTX = CTY = 0.655, based on 100,000 simulations. Threshold model: the event time T is generated from the exponential distribution with rate λ = 1; the two risk scores X and Y are generated from the mixture of bivariate normal distributions described in equation (7), with parameters μ0 = 2, ρ = 0, σ1² = 9 and σ2² = 0.25; the cut point τ = 0.718 is selected such that CTX = CTY = 0.803, based on 100,000 simulations. The exponential censoring parameter λ is chosen to achieve the desired censoring proportions. Sample size n = 100.

Columns per row: model; % censored; then ĈTX, ĈTY, ĈTX − ĈTY, % Reject H0 at α = 5%, and % Reject H0 at α = 10% for the one-shot nonparametric method; then C̃TX, C̃TY, C̃TX − C̃TY, % Reject H0 at α = 5%, and % Reject H0 at α = 10% for the IPCW method.
n=100

Change point model 0% 0.657 0.660 −0.003 5.0% 9.8% 0.661 0.659 0.002 4.6% 9.1%
10% 0.663 0.647 0.016 6.4% 11.8% 0.661 0.659 0.002 4.2% 8.3%
20% 0.669 0.633 0.036 10.5% 18.0% 0.661 0.658 0.002 3.6% 7.3%
33% 0.68 0.612 0.068 21.9% 31.6% 0.661 0.659 0.002 2.7% 5.7%
50% 0.695 0.582 0.113 39.4% 51.6% 0.661 0.658 0.003 1.7% 4.1%
   65% 0.710 0.563 0.147 47.2% 59.5% 0.663 0.657 0.006 1.3% 3.1%
80% 0.720 0.565 0.156 36.3% 48.4% 0.682 0.656 0.027 1.2% 2.7%

Threshold model 0% 0.802 0.802 0.000 5.4% 10.2% 0.803 0.802 0.001 4.9% 9.8%
10% 0.816 0.803 0.013 7.2% 12.6% 0.803 0.802 0.001 4.5% 8.8%
20% 0.830 0.805 0.025 10.7% 18.0% 0.803 0.802 0.001 3.8% 7.5%
33% 0.852 0.808 0.044 20.9% 30.9% 0.805 0.802 0.003 3.3% 6.9%
50% 0.884 0.818 0.066 40.0% 52.4% 0.809 0.803 0.006 2.1% 5.2%
65% 0.918 0.834 0.083 57.1% 69.7% 0.821 0.803 0.018 4.7% 8.0%
80% 0.955 0.865 0.090 65.5% 77.8% 0.863 0.802 0.061 10.2% 17.6%

Table 4.

Comparison of the one-shot nonparametric method and the IPCW method. Change-point Cox PH model: covariates X and Y are generated from Uniform[0, 1]; the hazard-function parameters are λ0 = 0.01, β1 = 3, β20 = 0.1, and β21 = 15; the change point τ = 8.59 is selected such that CTX = CTY = 0.655, based on 100,000 simulations. Threshold model: the event time T is generated from the exponential distribution with rate λ = 1; the two risk scores X and Y are generated from the mixture of bivariate normal distributions described in equation (7), with parameters μ0 = 2, ρ = 0, σ1² = 9 and σ2² = 0.25; the cut point τ = 0.718 is selected such that CTX = CTY = 0.803, based on 100,000 simulations. The exponential censoring parameter λ is chosen to achieve the desired censoring proportions. Sample sizes n = 200 and 400.

Columns per row: model; % censored; then ĈTX, ĈTY, ĈTX − ĈTY, % Reject H0 at α = 5%, and % Reject H0 at α = 10% for the one-shot nonparametric method; then C̃TX, C̃TY, C̃TX − C̃TY, % Reject H0 at α = 5%, and % Reject H0 at α = 10% for the IPCW method.
n=200

Change point model 0% 0.656 0.659 −0.003 5.1% 9.9% 0.660 0.658 0.002 5.0% 9.9%
10% 0.662 0.646 0.016 6.5% 12.8% 0.660 0.658 0.002 4.8% 8.9%
20% 0.668 0.631 0.037 15.8% 25.7% 0.660 0.657 0.003 3.9% 7.6%
33% 0.679 0.609 0.070 37.6% 51.2% 0.660 0.658 0.002 3.5% 7.1%
50% 0.695 0.576 0.119 68.5% 78.0% 0.661 0.657 0.003 2.3% 5.5%
65% 0.709 0.551 0.158 80.0% 87.5% 0.661 0.656 0.005 2.0% 4.0%
80% 0.718 0.545 0.173 69.4% 80.5% 0.664 0.653 0.010 1.0% 2.1%

Threshold model 0% 0.803 0.802 0.001 4.8% 10.3% 0.803 0.802 0.001 4.5% 9.9%
10% 0.817 0.803 0.013 8.0% 14.2% 0.803 0.802 0.001 3.9% 8.9%
20% 0.831 0.805 0.026 16.3% 25.2% 0.804 0.802 0.002 3.4% 7.7%
33% 0.852 0.808 0.044 36.9% 48.8% 0.804 0.802 0.002 3.1% 6.5%
50% 0.885 0.817 0.068 68.9% 79.1% 0.807 0.803 0.004 2.6% 5.6%
65% 0.919 0.834 0.085 87.6% 92.7% 0.815 0.803 0.013 4.0% 7.0%
80% 0.956 0.866 0.090 94.3% 97.2% 0.849 0.799 0.050 12.6% 20.7%

n=400

Change point model 0% 0.656 0.656 0.000 4.8% 9.6% 0.660 0.655 0.005 5.9% 10.8%
10% 0.662 0.643 0.019 11.2% 18.4% 0.660 0.655 0.006 4.9% 9.5%
20% 0.668 0.628 0.041 30.4% 42.2% 0.660 0.655 0.005 4.2% 8.4%
33% 0.679 0.605 0.074 61.3% 71.3% 0.660 0.655 0.006 3.4% 7.5%
50% 0.694 0.570 0.124 54.8% 57.4% 0.660 0.654 0.006 3.6% 7.2%
65% 0.709 0.541 0.168 37.5% 38.3% 0.661 0.654 0.007 2.3% 5.7%
80% 0.719 0.532 0.187 55.3% 57.2% 0.662 0.653 0.010 0.8% 1.7%

Threshold model 0% 0.803 0.802 0.000 5.7% 10.1% 0.803 0.802 0.000 5.6% 9.6%
10% 0.816 0.803 0.013 10.3% 17.7% 0.803 0.803 0.000 5.0% 9.2%
20% 0.830 0.805 0.025 26.0% 37.5% 0.803 0.803 0.000 4.4% 8.6%
33% 0.852 0.809 0.044 60.2% 71.5% 0.803 0.803 0.001 2.9% 6.1%
50% 0.884 0.817 0.067 92.7% 95.8% 0.805 0.803 0.002 2.7% 5.4%
65% 0.919 0.834 0.085 99.3% 99.7% 0.811 0.803 0.007 3.0% 5.8%
80% 0.956 0.866 0.090 99.9% 100.0% 0.838 0.800 0.038 14.6% 20.8%

2.2. A necessary and sufficient condition

Having presented two counterexamples, we next investigate a necessary and sufficient condition under which ΔĈ converges to zero under the null hypothesis for censored survival outcomes.

Proposition 1

Under the null hypothesis H0 : CTX = CTY, suppose that pr(X1 > X2, T1 < T2) and pr(Y1 > Y2, T1 < T2) are both non-zero for a pair of independent multivariate observations (X1, Y1, T1, D1) and (X2, Y2, T2, D2). Then a necessary and sufficient condition for BTX = BTY is

$$\operatorname{pr}(T_1 < D_1 \wedge D_2 \mid X_1 > X_2,\, T_1 < T_2) = \operatorname{pr}(T_1 < D_1 \wedge D_2 \mid Y_1 > Y_2,\, T_1 < T_2). \tag{8}$$

Proof 1

Since CTX = CTY, i.e., pr(X1 > X2 | T1 < T2) = pr(Y1 > Y2 | T1 < T2), multiplying both sides by pr(T1 < T2) gives

$$\operatorname{pr}(X_1 > X_2,\, T_1 < T_2) = \operatorname{pr}(Y_1 > Y_2,\, T_1 < T_2). \tag{9}$$

Using Bayes’ theorem again, BTX = BTY, i.e., pr(X1 > X2 | T1 < T2, T1 < D1 ∧ D2) = pr(Y1 > Y2 | T1 < T2, T1 < D1 ∧ D2), holds if and only if

$$\operatorname{pr}(T_1 < D_1 \wedge D_2,\, X_1 > X_2,\, T_1 < T_2) = \operatorname{pr}(T_1 < D_1 \wedge D_2,\, Y_1 > Y_2,\, T_1 < T_2). \tag{10}$$

By Bayes’ theorem and equation (9), equation (8) holds if and only if equation (10) holds.
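
Although condition (8) involves the joint law of (X, Y, T, D) and cannot be checked from censored data, both of its sides can be approximated by Monte Carlo whenever the data-generating law is fully known, as in simulation studies. A minimal R sketch, where gen(B) is a hypothetical user-supplied generator returning B independent draws of (X, Y, T, D) as a data frame:

```r
# Monte Carlo approximation of both sides of condition (8).
# gen(B) is a placeholder: it must return a data.frame with columns
# X, Y, T, D holding B independent observations, e.g. from the
# change-point generator sketched earlier.
check_condition8 <- function(gen, B = 1e5) {
  a <- gen(B)                           # subject 1 of each pair
  b <- gen(B)                           # subject 2, independent of subject 1
  early <- a$T < b$T                    # event: T1 < T2
  obs   <- a$T < pmin(a$D, b$D)         # event: T1 < D1 ^ D2
  lhs <- mean(obs[early & a$X > b$X])   # pr(T1 < D1^D2 | X1 > X2, T1 < T2)
  rhs <- mean(obs[early & a$Y > b$Y])   # pr(T1 < D1^D2 | Y1 > Y2, T1 < T2)
  c(lhs = lhs, rhs = rhs)
}
```

For the two counterexamples above, the two estimated probabilities differ, which by Proposition 1 is exactly why BTX ≠ BTY there.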

Given the above counterexamples, applications of the test of Kang et al. [7] to real data cannot be properly justified unless a checkable sufficient condition for BTX = BTY under H0 : CTX = CTY is identified. The above necessary and sufficient condition for BTX = BTY under H0 : CTX = CTY is thus crucial when applying the test of Kang et al. [7]. Unfortunately, in practical applications it seems extremely hard to check this condition, i.e., the equality of the two conditional probabilities in equation (8) of Proposition 1; it can be verified by Monte Carlo, as in the sketch above, only when the data-generating law is fully known. Next, we provide a sufficient condition in Proposition 2; it is also not easy to check in practical applications, but it can be used to properly understand the simulation results of Liu et al. [6] and Kang et al. [7].

Proposition 2

If X and Y conditional on T and D have the same distribution, then

$$\operatorname{pr}(T_1 < D_1 \wedge D_2,\, X_1 > X_2,\, T_1 < T_2) = \operatorname{pr}(T_1 < D_1 \wedge D_2,\, Y_1 > Y_2,\, T_1 < T_2). \tag{11}$$

Proof 2

Let X | T, D and Y | T, D denote the conditional distributions of X and Y given T and D, respectively, and suppose these distributions are the same. For a pair of independent multivariate observations (X1, Y1, T1, D1) and (X2, Y2, T2, D2), it follows that (X1, T1, D1), (X2, T2, D2), (Y1, T1, D1) and (Y2, T2, D2) all follow the same distribution. By independence, the distributions of (X1, T1, D1, X2, T2, D2) and (Y1, T1, D1, Y2, T2, D2) are also the same, which yields (11).

Remark 1

Propositions 1 and 2 provide theoretical insight for a proper understanding of the simulation results of Liu et al. [6] and Kang et al. [7]. In both of their simulation studies, independent censoring was assumed, and (X, Y) | T was simulated from a bivariate normal distribution with mean μ and covariance matrix Σ, where

$$\mu = \log(T) \times (1, 1), \qquad \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

Obviously, the conditional distributions X | T, D and Y | T, D are both normal with mean log(T) and variance 1, so the condition in Proposition 2 is satisfied. Under the null hypothesis, equation (8) then holds by equation (9) and Bayes’ theorem. It is therefore not surprising that both simulation studies showed negligible bias for ΔĈ in Liu et al. [6] and Kang et al. [7]: the if-and-only-if condition in Proposition 1 is satisfied.
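
As a numerical illustration (our own sketch, not taken from either study), the exchangeable setting of Remark 1 is easy to reproduce: even under roughly 50% censoring, the two Harrell’s C statistics share a common limit, so their difference stays near zero up to Monte Carlo noise. The harrell_c function is the sketch from Section 1.

```r
set.seed(2017)
n   <- 2000
rho <- 0.5

T <- rexp(n)                         # event times
m <- log(T)                          # common conditional mean of X and Y
z <- rnorm(n)                        # shared component inducing correlation rho
X <- m + sqrt(rho) * z + sqrt(1 - rho) * rnorm(n)
Y <- m + sqrt(rho) * z + sqrt(1 - rho) * rnorm(n)

D <- rexp(n)                         # independent censoring, about 50% censored
time   <- pmin(T, D)
status <- as.numeric(T < D)

# Each statistic is biased for the concordance probability, but by
# Proposition 2 they share the same limit, so the difference is near zero.
harrell_c(time, status, X) - harrell_c(time, status, Y)
```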

Remark 2

Equation (8) does not necessarily hold in general, as shown by the two counterexamples, in which X | T and Y | T do not have the same distribution under the null hypothesis. Specifically, in the change-point model (6), it is easy to show that the conditional survival function is

$$\operatorname{pr}(T > t \mid X, Y) = \exp\!\left\{-\lambda_0 \int_0^t \exp\big(\beta_1 X + \beta_2(s) Y\big)\, ds\right\}. \tag{12}$$

Since the marginal distributions of X and Y are both Uniform[0, 1], we have

$$\operatorname{pr}(T > t \mid X) = \int_0^1 \operatorname{pr}(T > t \mid X, y)\, dy; \qquad \operatorname{pr}(T > t \mid Y) = \int_0^1 \operatorname{pr}(T > t \mid x, Y)\, dx. \tag{13}$$

It is clear from equation (12) that pr(T > t | X) does not equal pr(T > t | Y) in general. By Bayes’ theorem and independent censoring, X | T, D and Y | T, D therefore do not have the same distribution.

In the threshold model, from equation (7),

$$X \mid T \sim I(T > \tau)\, \mathcal{N}\!\big(\mu_0 \log(T),\, \sigma_1^2\big) + I(T < \tau)\, \mathcal{N}\!\big(\mu_0 \log(T),\, \sigma_2^2\big),$$
$$Y \mid T \sim I(T > \tau)\, \mathcal{N}\!\big(\mu_0 \log(T),\, \sigma_2^2\big) + I(T < \tau)\, \mathcal{N}\!\big(\mu_0 \log(T),\, \sigma_1^2\big).$$

Given σ1² > σ2², X | T, D and Y | T, D do not have the same distribution. Moreover, the simulation studies in the previous subsection show that BTX ≠ BTY in both examples.

3. Discussion

Prediction modeling is a mainstay of statistical practice, and it is of great current interest due to the promise of highly predictive biomarkers recently identified through imaging and genetic/genomic technologies. Recently, based on simulation results, Kang et al. [7] proposed a promising nonparametric test procedure based on the difference of two Harrell’s C statistics and developed an R package, compareC, to implement their approach. In general, however, the difference of two correlated Harrell’s C statistics may not converge to zero under the null hypothesis, which can lead to severely inflated type I error for the proposed test. The validity of this approach is therefore questionable in general, as indicated by the counterexamples provided in this paper. We further investigated a necessary and sufficient condition under which the difference of two Harrell’s C statistics converges to zero. This result provides theoretical insight into the simulation results of Liu et al. [6] and Kang et al. [7].

As a cautionary note, we emphasize that there is currently no practical strategy to verify the necessary and sufficient condition investigated in this short paper. Thus, given the counterexamples presented here, the use of the test proposed in Kang et al. [7] is not justifiable until practicable ways to verify such conditions are identified. An alternative to hypothesis testing is the use of confidence intervals for the difference of two correlated Harrell’s C indices based on ΔĈ. Because Harrell’s C statistic has an explicit formula and is very easy to compute, it is straightforward to calculate the standard error se(ΔĈ) and confidence intervals using the bootstrap, as in Uno et al. [5] and Liu et al. [6] for the difference of the modified C statistics Δ̃. However, both the standard error se(ΔĈ) and the corresponding confidence interval depend on the unknown censoring distributions, which frequently differ across studies. Thus, despite easy implementation using available statistical packages and algorithms, practical applications of such confidence intervals for the difference of Harrell’s C indices should be avoided or accompanied by warnings, because their validity and reliability are unclear and their interpretation is difficult.

Acknowledgments

This work was supported in part by NIH grants P30 AG0851, 2P30 CA16087 and 5P30 ES00260.

References

1. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36. doi: 10.1148/radiology.143.1.7063747.
2. Gönen M, Heller G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika. 2005;92(4):965–970.
3. Pepe MS, Kerr KF, Longton G, Wang Z. Testing for improvement in prediction model performance. Statistics in Medicine. 2013;32(9):1467–1482. doi: 10.1002/sim.5727.
4. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988:837–845.
5. Uno H, Cai T, Pencina MJ, D’Agostino RB, Wei L. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine. 2011;30(10):1105–1117. doi: 10.1002/sim.4154.
6. Liu X, Jin Z, Graziano JH. Comparing paired biomarkers in predicting quantitative health outcome subject to random censoring. Statistical Methods in Medical Research. 2012. doi: 10.1177/0962280212460434.
7. Kang L, Chen W, Petrick NA, Gallas BD. Comparing two correlated C indices with right-censored survival outcome: a one-shot nonparametric approach. Statistics in Medicine. 2015;34(4):685–703. doi: 10.1002/sim.6370.
8. Wieand S, Gail MH, James BR, James KL. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika. 1989;76(3):585–592.
9. Etzioni R, Pepe M, Longton G, Hu C, Goodman G. Incorporating the time dimension in receiver operating characteristic curves: a case study of prostate cancer. Medical Decision Making. 1999;19(3):242–251. doi: 10.1177/0272989X9901900303.
10. Hock C, Golombowski S, Naser W, Müller-Spahn F. Increased levels of tau protein in cerebrospinal fluid of patients with Alzheimer’s disease: correlation with degree of cognitive impairment. Annals of Neurology. 1995. doi: 10.1002/ana.410370325.
11. Sunderland T, Mirza N, Putnam KT, Linker G, Bhupali D, Durham R, Soares H, Kimmel L, Friedman D, Bergeson J, et al. Cerebrospinal fluid β-amyloid 1–42 and tau in control subjects at risk for Alzheimer’s disease: the effect of the APOE ε4 allele. Biological Psychiatry. 2004;56(9):670–676. doi: 10.1016/j.biopsych.2004.07.021.
12. Kester MI, Goos JD, Teunissen CE, Benedictus MR, Bouwman FH, Wattjes MP, Barkhof F, Scheltens P, van der Flier WM. Associations between cerebral small-vessel disease and Alzheimer disease pathology as measured by cerebrospinal fluid biomarkers. JAMA Neurology. 2014;71(7):855–862. doi: 10.1001/jamaneurol.2014.754.
13. Chiu MJ, Yang SY, Horng HE, Yang CC, Chen TF, Chieh JJ, Chen HH, Chen TC, Ho CS, Chang SF, et al. Combined plasma biomarkers for diagnosing mild cognitive impairment and Alzheimer’s disease. ACS Chemical Neuroscience. 2013;4(12):1530–1536. doi: 10.1021/cn400129p.
14. Zetterberg H, Wilson D, Andreasson U, Minthon L, Blennow K, Randall J, Hansson O. Plasma tau levels in Alzheimer’s disease. Alzheimer’s Research & Therapy. 2013;5(2):1. doi: 10.1186/alzrt163.
15. Chiu MJ, Chen YF, Chen TF, Yang SY, Yang FPG, Tseng TW, Chieh JJ, Chen JCR, Tzen KY, Hua MS, et al. Plasma tau as a window to the brain: negative associations with brain volume and memory function in mild cognitive impairment and early Alzheimer’s disease. Human Brain Mapping. 2014;35(7):3132–3142. doi: 10.1002/hbm.22390.
16. Amorim JA, de Barros MVG, Valença MM. Post-dural (post-lumbar) puncture headache: risk factors and clinical features. Cephalalgia. 2012. doi: 10.1177/0333102412453951.
17. Van Crevel H, Hijdra A, De Gans J. Lumbar puncture and the risk of herniation: when should we first perform CT? Journal of Neurology. 2002;249(2):129–137. doi: 10.1007/pl00007855.
18. Olsson B, Lautner R, Andreasson U, Öhrfelt A, Portelius E, Bjerke M, Hölttä M, Rosén C, Olsson C, Strobel G, et al. CSF and blood biomarkers for the diagnosis of Alzheimer’s disease: a systematic review and meta-analysis. The Lancet Neurology. 2016;15(7):673–684. doi: 10.1016/S1474-4422(16)00070-3.
19. Harrell FE, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Statistics in Medicine. 1984;3(2):143–152. doi: 10.1002/sim.4780030207.
20. Pencina MJ, D’Agostino RB. Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Statistics in Medicine. 2004;23(13):2109–2123. doi: 10.1002/sim.1802.
21. Antolini L, Nam BH, D’Agostino RB. Inference on correlated discrimination measures in survival analysis: a nonparametric approach. Communications in Statistics - Theory and Methods. 2004;33(9):2117–2135.
22. Tsiatis A. Semiparametric Theory and Missing Data. Springer Science & Business Media; 2007.
23. Liang KY, Self SG, Liu X. The Cox proportional hazards model with change point: an epidemiologic application. Biometrics. 1990:783–793.
