Author manuscript; available in PMC: 2021 Aug 15.
Published in final edited form as: Stat Med. 2021 Apr 29;40(18):4035–4052. doi: 10.1002/sim.9012

Biomarker Evaluation Under Imperfect Nested Case–control Design

Xuan Wang 1, Yingye Zheng 2, Majken Karoline Jensen 3, Zeling He 1, Tianxi Cai 1,4
PMCID: PMC8286316  NIHMSID: NIHMS1705660  PMID: 33915597

Summary

The nested case–control (NCC) design has been widely adopted as a cost-effective sampling design for biomarker research. Under the NCC design, markers are only measured for the NCC subcohort consisting of all cases and a fraction of the controls selected randomly from the matched risk sets of the cases. Robust methods for evaluating prediction performance of risk models have been derived under the inverse probability weighting (IPW) framework. The probabilities of samples being included in the NCC cohort can be calculated based on the study design [1] or estimated non-parametrically [2]. Neither strategy works well due to model mis-specification and the curse of dimensionality in practical settings where the sampling does not entirely follow the study design or depends on many factors. In this paper, we propose an alternative strategy to estimate the sampling probabilities based on a varying coefficient model, which attains a balance between robustness and the curse of dimensionality. The complex correlation structure induced by repeated finite risk set sampling makes the standard resampling procedure for variance estimation fail. We propose a perturbation resampling procedure that provides valid interval estimation for the proposed estimators. Simulation studies show that the proposed method performs well in finite samples. We apply the proposed method to the Nurses’ Health Study II to develop and evaluate prediction models using clinical biomarkers for cardiovascular risk.

Keywords: finite population sampling, inverse probability weighting, nonparametric smoothing, resampling, risk prediction

1 ∣. INTRODUCTION

Conducting rigorous biomarker validation studies is an important step in the translation of novel biomarkers into routine clinical practice for medical decision making. Such studies should follow design principles in sample selection to avoid bias [3]. Large prospective studies, such as the Women’s Health Initiative Study and the Nurses’ Health Study, with exposures captured and biologic samples collected and stored prior to disease onset, can serve as a platform for biomarker research [4,5]. However, measuring biomarkers for large prospective cohorts is highly resource consuming. To make efficient use of stored samples from a cohort, two-phase sampling designs, including nested case–control (NCC) and case–cohort (CCH) studies, are often adopted as resource-efficient sampling strategies, especially when the outcome of interest is rare [6,7,8].

Under the NCC design, new markers are measured for all cases and a subset of controls randomly sampled without replacement from the risk sets of the cases. Sometimes controls are also matched to cases on variables such as gender and age. Many well-known biomarker studies nested in large cohorts have employed the NCC design [9,10,11]. For example, in the Nurses’ Health Study II (NHSII), two novel biomarkers, apoA1 concentration in whole plasma (WPA1) and apoE concentration in whole plasma (apoE), were investigated for predicting the risk of myocardial infarction (MI) [11]. Due to limited resources and the low incidence of MI, the biomarkers were measured on a nested case–control set, which included all cases and controls sampled from the 1:1 matched risk sets of the cases, with matching variables including smoking status, fasting status, age, and timing of blood collection.

To analyze NCC data, the conditional logistic regression (CLR) model has traditionally been used when the focus is on the estimation of hazard ratio (HR) parameters, and sometimes the estimation of absolute risk parameters [12,13]. The CLR provides HR estimators under the Cox model for the full cohort, but it cannot be extended to other models. Nor can it be used for estimating other parameters, such as prediction accuracy parameters, which involve the distribution of the markers in the full cohort. For model parameter estimation, fully efficient maximum likelihood estimators (MLE) have also been proposed [14,15]. The MLE relies on the correct specification of the failure time model and requires that censoring is independent of the novel markers, as well as additional modeling assumptions when there are multiple novel biomarkers and routine clinical variables measured on the full cohort.

As a flexible alternative, the inverse probability weighting (IPW) approach has been developed [16]. Recently, IPW estimators have also been developed for fitting models beyond the Cox model as well as for prediction performance measures including the receiver operating characteristic (ROC) curve, positive predictive value (PPV) and negative predictive value (NPV) [17,18,19]. Most existing IPW estimators for NCC studies calculate the true IPW (TIPW) sampling weights according to the study design and are consistent provided that the sampling weights are correctly obtained. However, the TIPW estimators may be invalid if the sampling is not implemented exactly according to the design. Such a scenario arises, for example, when the matching criteria are applied only coarsely during the sampling due to practical concerns. Such an imperfect NCC design poses additional analytical challenges for estimating and evaluating risk prediction models. To overcome the bias, one may estimate the sampling weights non-parametrically via kernel smoothing as in Zheng et al. [2]. Obtaining such a non-parametric augmented IPW (NP) estimator, however, is not feasible unless the number of matching variables is very small, due to the curse of dimensionality.

In this paper, we propose a new semi-non-parametric AIPW estimator, where the selection probability is estimated based on a flexible varying coefficient model. The AIPW estimator can incorporate a larger number of matching variables while remaining robust to deviations from the intended sampling scheme. We derive the asymptotic properties of the proposed estimators and develop a resampling method to assess their variability.

The remainder of the paper is organized as follows. In Section 2, we provide model specification and describe the proposed point estimation procedures. Our proposed interval estimation procedure is given in Section 3. In Section 4, we report results of simulation studies to assess the finite sample performance of the proposed method. In Section 5, data from the NHS II are analyzed as an illustration. Concluding remarks are given in Section 6. Theoretical studies of the proposed estimators are provided in the Appendix.

2 ∣. ESTIMATING SAMPLING WEIGHTS VIA AIPW

Let $T$ denote the survival outcome of interest and $Y = (Y_{old}^{\top}, Y_{new}^{\top})^{\top}$ denote the vector of predictors for $T$, where $Y_{old}$ denotes the vector of routine markers and $Y_{new}$ denotes the vector of novel biomarkers. Due to censoring, $T$ is only observed through the bivariate vector $(X, \delta)$ with $X = T \wedge C$ and $\delta = I(T \le C)$, where $C$ is the censoring time. Under the NCC design, $Y_{new}$ is only measured if $V = 1$, where $V = \delta + (1 - \delta)V_0$ and $V_0$ is a binary indicator for whether a subject is sampled into the NCC subcohort as a control. We assume that the sampling of the controls is performed by matching to the cases according to a vector of matching variables $M$. Suppose that the underlying data for the full cohort consist of $N$ independent and identically distributed random vectors, $\mathcal{D} = \{D_i = (X_i, \delta_i, Y_i^{\top}, M_i^{\top})^{\top}, i = 1, \ldots, N\}$, while the observed data consist of $\mathcal{O} = \{O_i = (X_i, \delta_i, Y_{old,i}^{\top}, V_i Y_{new,i}^{\top}, M_i^{\top})^{\top}, i = 1, \ldots, N\}$. Let $\Omega = \{i : 1 \le i \le N\}$ and $\Omega_{ncc} = \{i : 1 \le i \le N, V_i = 1\}$ respectively denote the index sets for the full cohort and the NCC subcohort.

Under the matched NCC design, for a case with event time Xi and matching variables Mi, m controls are sampled from the matched risk set

$$\mathcal{R}_{W_i} = \{k : X_k \ge X_i, \ |M_k - M_i| \le a_0\},$$

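where $a_0$ is a predetermined range vector and $W_i = (\delta_i, X_i, M_i^{\top}, Y_{old,i})^{\top}$. Let $\bar\pi_i = P(V_i = 1 \mid \mathcal{O})$ and $\bar\pi_{0i} = P(V_i = 1 \mid \mathcal{O}, \delta_i = 0)$ denote the true sampling probabilities under possibly imperfect NCC sampling. If the NCC sampling were implemented exactly according to the design, then $\bar\pi_i = \delta_i + (1 - \delta_i)\bar\pi_{0i}$ could be calculated as $\tilde\pi_i = \delta_i + (1 - \delta_i)\tilde\pi_{0i}$, where $\tilde\pi_{0i} = \tilde\pi_0(W_i)$,

$$\tilde\pi_0(W_i) = 1 - \prod_{j : j \in \mathcal{R}_{W_i}} \left\{1 - \frac{m\delta_j}{|\mathcal{R}_{W_j}| - 1}\right\}$$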

and $|\mathcal{R}_{W_i}|$ is the size of $\mathcal{R}_{W_i}$ [16]. Under the perfect NCC design, TIPW estimators can be constructed by weighting observations with the true weights $\tilde\omega_i = V_i / \tilde\pi_i$. To improve efficiency and robustness over the TIPW estimators, Zheng et al. [2] proposed NP estimators using non-parametrically estimated weights $\hat\omega_i^{NP} = V_i / \hat\pi^{NP}(W_i)$, where $\hat\pi^{NP}(W_i) = \delta_i + (1 - \delta_i)\hat\pi_0^{NP}(W_i)$,

$$\hat\pi_0^{NP}(w) = \frac{\sum_{i=1}^{N} (1 - \delta_i) V_i K_b(W_i - w)}{\sum_{i=1}^{N} (1 - \delta_i) K_b(W_i - w)}$$

is a non-parametric estimate of $\pi_0(w) = P(V_i = 1 \mid W_i = w, \delta_i = 0)$, $K_b(w) = b^{-q}\prod_{j=1}^{q} K(w_j / b)$, $K(\cdot)$ is a symmetric density function, and $b > 0$ denotes the bandwidth.
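
As an illustration only, the kernel-smoothed sampling probability and the resulting NP weights can be sketched in a few lines of Python. The function names, the Gaussian choice for $K$, and the convention that each `W[i]` holds only the numeric smoothing covariates (with `delta` and `V` passed separately) are assumptions made for this sketch, not part of the original method description.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, one possible symmetric kernel K (assumed here)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def product_kernel(diff, b):
    """K_b(w) = b^{-q} * prod_j K(w_j / b) for a q-dimensional vector w."""
    value = b ** (-len(diff))
    for d in diff:
        value *= gaussian_kernel(d / b)
    return value


def np_control_prob(w, W, delta, V, b):
    """Kernel-smoothed fraction of controls near w that were sampled."""
    num = den = 0.0
    for W_i, delta_i, V_i in zip(W, delta, V):
        if delta_i == 1:  # cases are always sampled, so only controls enter
            continue
        k = product_kernel([a - c for a, c in zip(W_i, w)], b)
        num += V_i * k
        den += k
    return num / den


def np_weight(i, W, delta, V, b):
    """NP weight V_i / pi_hat(W_i), with pi_hat = delta + (1 - delta) * pi0_hat."""
    if V[i] == 0:
        return 0.0
    if delta[i] == 1:
        return 1.0
    return 1.0 / np_control_prob(W[i], W, delta, V, b)
```

Because the product kernel multiplies one factor per coordinate, the effective number of controls near any $w$ shrinks quickly as the dimension $q$ grows, which is the practical face of the curse of dimensionality discussed here.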

While the NP method can be used to incorporate imperfect NCC designs, it is infeasible when the dimension of $W$ is not small. To overcome the limitations of the TIPW and NP methods, we propose a semi-non-parametric AIPW method by approximating $\bar\pi_{0i}$ via a flexible varying coefficient model

$$\pi_{0i} = g\{\beta(\tilde\pi_{0i}, X_i)^{\top} Z_i\} \quad \text{with} \quad g(x) = \frac{e^{x}}{1 + e^{x}}, \qquad (2.1)$$

where $Z_i = (1, \Phi_1(Y_{old,i}), \Phi_2(M_i))$, $\Phi_1(\cdot)$ and $\Phi_2(\cdot)$ are basis functions that allow potential non-linear effects, and $\beta(\pi, x)$ is the unknown coefficient function. In practice, we find that the commonly used B-spline or natural spline basis with 3 degrees of freedom works well. Equally spaced knots that cover most of the domain of the data are also desirable. We find that our results are not overly sensitive to the choice of the basis functions provided that they are flexible enough to capture non-linear effects. Under perfect NCC sampling, $\beta(\tilde\pi_{0i}, X_i) = (g^{-1}(\tilde\pi_{0i}), 0^{\top})^{\top}$. On the other hand, when the sampling is imperfect, the flexible model provides an accurate approximation to the true sampling probabilities while overcoming the curse of dimensionality associated with NP procedures.

To estimate $\beta(\pi, x)$, we maximize a local logistic log-likelihood using observed data on $\{(V_i, Z_i, X_i, \tilde\pi_{0i}) : \delta_i = 0\}$. Specifically, for any given $(\pi, x)$, we estimate $\beta(\pi, x)$ as $\hat\beta(\pi, x)$, the solution to the estimating equation

$$\hat U_{\pi,x}(\beta) = N^{-1}\sum_{i=1}^{N} K_b(\tilde\pi_{0i} - \pi, X_i - x)(1 - \delta_i) Z_i \{V_i - g(\beta^{\top} Z_i)\}$$

where $K_b(\pi, x) = (b_1 b_2)^{-1} K(\pi / b_1) K(x / b_2)$, $K(\cdot)$ is a symmetric density function, and $b = (b_1, b_2)$ is the vector of bandwidth parameters, which tend to 0 as $N \to \infty$. With $\hat\beta(\pi, x)$, we estimate the sampling probability for the $i$th subject as

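$$\hat\pi_i = \delta_i + (1 - \delta_i)\hat\pi_{0i}, \quad \text{where} \quad \hat\pi_{0i} = g\{\hat\beta(\tilde\pi_{0i}, X_i)^{\top} Z_i\}. \qquad (2.2)$$

Then we construct our AIPW estimator using the augmented weights $\hat\omega_i = V_i / \hat\pi_i$. Under the correct specification of (2.1), we expect that $\max_{1 \le i \le N} |\hat\pi_i - \bar\pi_i| \to 0$ as $N \to \infty$.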
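
A minimal sketch of the local logistic fit is given below, assuming a Gaussian kernel and a plain Newton–Raphson iteration; neither choice is prescribed above, and the names `local_logistic` and `aipw_weight` are hypothetical. The factor $N^{-1}$ is dropped because it does not change the root of the estimating equation.

```python
import math


def expit(x):
    """Numerically stable logistic function g(x) = e^x / (1 + e^x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y_i] for row, y_i in zip(A, y)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def local_logistic(pi0, x0, controls, b1, b2, n_iter=25):
    """Root of the kernel-weighted logistic score at the point (pi0, x0).

    controls: list of (V_i, Z_i, X_i, pi_tilde_0i) for subjects with delta_i = 0.
    """
    p = len(controls[0][1])
    beta = [0.0] * p
    kern = [gauss((pt - pi0) / b1) * gauss((X - x0) / b2) / (b1 * b2)
            for (_, _, X, pt) in controls]
    for _ in range(n_iter):
        score = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for k, (V, Z, _, _) in zip(kern, controls):
            mu = expit(sum(bj * zj for bj, zj in zip(beta, Z)))
            for a in range(p):
                score[a] += k * Z[a] * (V - mu)
                for c in range(p):
                    info[a][c] += k * mu * (1.0 - mu) * Z[a] * Z[c]
        step = solve(info, score)  # Newton-Raphson update
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta


def aipw_weight(V, delta, Z, beta_hat):
    """omega_hat = V / pi_hat with pi_hat = delta + (1 - delta) * g(beta_hat' Z)."""
    if V == 0:
        return 0.0
    if delta == 1:
        return 1.0
    return 1.0 / expit(sum(b * z for b, z in zip(beta_hat, Z)))
```

In use, `local_logistic` would be evaluated at each control's own $(\tilde\pi_{0i}, X_i)$, and the resulting coefficient vector passed to `aipw_weight` together with that subject's $Z_i$. No safeguards against separation or tiny local sample sizes are included in this sketch.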

3 ∣. APPLICATION OF AIPW TO ROBUST RISK PREDICTION

In this section, we illustrate the application of the AIPW approach to developing and evaluating risk prediction models. Since one of the major goals of biomarker studies is to evaluate the predictive capacity of novel biomarkers, we consider quantifying the incremental value of Ynew in predicting T above and beyond routine markers Yold.

3.1 ∣. Calibrated Risk Estimate

To predict risks based on $Y = (Y_{old}^{\top}, Y_{new}^{\top})^{\top}$ and $Y_{old}$, we fit two proportional hazards (PH) models,

$$P(T \ge t \mid Y) = S_{all}(t)^{\exp(\gamma_{all}^{\top} Y)}, \qquad (2.3)$$
$$P(T \ge t \mid Y_{old}) = S_{old}(t)^{\exp(\gamma_{old}^{\top} Y_{old})}, \qquad (2.4)$$

where $S_{all}(\cdot)$ and $S_{old}(\cdot)$ are unknown baseline survival functions and $\gamma_{all}$ and $\gamma_{old}$ are the corresponding log hazard ratio parameters. To estimate $\gamma_{all}$ and $\gamma_{old}$, we note that $Y_{new}$ is only available for those in the NCC subcohort while $Y_{old}$ is observed for all subjects. Thus, we propose to estimate $\gamma_{all}$ by maximizing the weighted log partial likelihood with AIPW weights $\hat\omega_i$:

$$\hat\gamma_{all} = \arg\max_{\gamma} \sum_{i=1}^{N} \hat\omega_i \delta_i \left[\gamma^{\top} Y_i - \log\left\{\sum_{j=1}^{N} \hat\omega_j I(X_j \ge X_i)\exp(\gamma^{\top} Y_j)\right\}\right].$$

On the other hand, $\gamma_{old}$ can be estimated as the standard maximum partial likelihood estimator, denoted by $\hat\gamma_{old}$. It follows from Lin and Wei [20] and the consistency of the sampling probabilities that $\hat\gamma_{all}$ and $\hat\gamma_{old}$ respectively converge to deterministic vectors $\bar\gamma_{all}$ and $\bar\gamma_{old}$ as $N \to \infty$, regardless of the adequacy of the survival models (2.3) and (2.4). When models (2.3) and (2.4) hold, $\bar\gamma_{all} = \gamma_{all}$ and $\bar\gamma_{old} = \gamma_{old}$.
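
The weighted objective can be written down directly. The sketch below only evaluates the weighted log partial likelihood at a given coefficient vector; any general-purpose optimizer could then be used to maximize it. The function name and the convention that unsampled subjects carry weight 0 (so their unobserved $Y_{new}$ is never touched) are assumptions of this illustration.

```python
import math


def weighted_log_partial_likelihood(gamma, X, delta, Y, w):
    """Weighted Cox log partial likelihood at coefficient vector gamma.

    X: follow-up times; delta: event indicators; Y: covariate vectors;
    w: sampling weights (0 for subjects outside the NCC subcohort).
    """
    n = len(X)
    # Linear predictors are only needed for subjects with positive weight.
    eta = [sum(g * y for g, y in zip(gamma, Y[i])) if w[i] > 0 else 0.0
           for i in range(n)]
    total = 0.0
    for i in range(n):
        if delta[i] == 0 or w[i] == 0:
            continue
        # Weighted risk-set sum over subjects still under observation at X_i.
        risk = sum(w[j] * math.exp(eta[j]) for j in range(n)
                   if X[j] >= X[i] and w[j] > 0)
        total += w[i] * (eta[i] - math.log(risk))
    return total
```

Setting all weights to 1 recovers the ordinary log partial likelihood, which is the objective used for $\hat\gamma_{old}$.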

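To make a prediction for $t$-year survival, one may obtain a model-based estimate for $P(T \ge t \mid Y)$ and $P(T \ge t \mid Y_{old})$ under (2.3) and (2.4). However, such a risk estimate may not be accurate under model mis-specifications. Following the calibrated risk prediction strategies proposed in Cai et al. [21], we predict $t$-year survival risk given $Y$ and $Y_{old}$ based on

$$S_{all}(t \mid R_{all}) \equiv P(T > t \mid R_{all}) \quad \text{and} \quad S_{old}(t \mid R_{old}) \equiv P(T > t \mid R_{old}),$$

respectively, where $R_{all} = Y^{\top}\bar\gamma_{all}$ and $R_{old} = Y_{old}^{\top}\bar\gamma_{old}$ are the limiting risk scores. The calibrated survival risk functions $S_{all}(t \mid r)$ and $S_{old}(t \mid r)$ can be non-parametrically estimated as $\hat S_{all}(t \mid r) = \exp\{-\hat\Lambda_{all}(t \mid r)\}$ and $\hat S_{old}(t \mid r) = \exp\{-\hat\Lambda_{old}(t \mid r)\}$, where

$$\hat\Lambda_{all}(t \mid r) = \int_0^t \frac{\sum_i \hat\omega_i K_h(\hat R_{all,i} - r)\, dN_i(u)}{\sum_i \hat\omega_i K_h(\hat R_{all,i} - r) I(X_i \ge u)}, \quad \hat\Lambda_{old}(t \mid r) = \int_0^t \frac{\sum_i K_h(\hat R_{old,i} - r)\, dN_i(u)}{\sum_i K_h(\hat R_{old,i} - r) I(X_i \ge u)},$$

$\hat R_{all,i} = \hat\gamma_{all}^{\top} Y_i$, $\hat R_{old,i} = \hat\gamma_{old}^{\top} Y_{old,i}$ and $N_i(t) = I(X_i \le t)\delta_i$. The above calibrated risk prediction procedure essentially fits risk models (2.3) and (2.4) to summarize multi-variate risk markers into univariate risk scores, $R_{all}$ and $R_{old}$, and then non-parametrically estimates the $t$-year risk given the risk score.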
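
The calibrated survival estimate is a kernel-weighted Nelson–Aalen estimator in the risk score. A minimal sketch follows, assuming a Gaussian kernel with $K_h(x) = h^{-1}K(x/h)$ (the form of $K_h$ is an assumption of this illustration) and a hypothetical function name.

```python
import math


def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def calibrated_survival(t, r, X, delta, R, w, h):
    """exp{-Lambda_hat(t | r)}, a kernel-weighted Nelson-Aalen estimate.

    R: estimated risk scores; w: sampling weights (all 1 for the reduced model);
    h: bandwidth for smoothing over the risk score.
    """
    n = len(X)
    k = [w[i] * gauss((R[i] - r) / h) / h for i in range(n)]
    cum_hazard = 0.0
    for i in range(n):
        # Each observed event at or before t adds a jump to the hazard.
        if delta[i] == 1 and X[i] <= t and k[i] > 0:
            at_risk = sum(k[j] for j in range(n) if X[j] >= X[i])
            cum_hazard += k[i] / at_risk
    return math.exp(-cum_hazard)
```

When every subject has the same risk score and unit weight, the kernel factors cancel and the function reduces to the usual Nelson–Aalen based survival estimate, which gives a simple sanity check.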

3.2 ∣. Evaluating Prediction Performance

The accuracy of the risk prediction based on a given risk score $R$ can be summarized by commonly used time dependent accuracy measures including the true positive rate (TPR), false positive rate (FPR), the receiver operating characteristic (ROC) curve, the positive predictive value (PPV), and the negative predictive value (NPV). These prediction performance measures typically consider the accuracy of a binary classification rule $R \ge r$ in predicting the $t$-year survival status $D_t = I(T \le t)$. Specifically, the TPR and FPR of $R \ge r$ in predicting $D_t$ are respectively defined as

$$TPR(r \mid t) = P(R \ge r \mid T < t) \quad \text{and} \quad FPR(r \mid t) = P(R \ge r \mid T \ge t).$$

The ROC curve, $ROC(u \mid t) = TPR\{FPR^{-1}(u \mid t) \mid t\}$, summarizes the trade-off between the FPR and TPR as the cut-off value varies. The PPV and NPV of $R \ge r$ are defined as

$$PPV(t \mid r) = P(T < t \mid R \ge r) \quad \text{and} \quad NPV(t \mid r) = P(T \ge t \mid R < r).$$

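To estimate the prediction accuracy for $R_{all}$ and $R_{old}$, we note that all the aforementioned parameters are functionals of $S_{all}(t \mid r)$, $S_{old}(t \mid r)$, $F_{all}(r) = P(R_{all} \le r)$ and $F_{old}(r) = P(R_{old} \le r)$. For example, the TPR and FPR of $R_{all} \ge r$ can be respectively written as

$$TPR_{all}(r \mid t) = \frac{1 - F_{all}(r) - \int_r^{\infty} S_{all}(t \mid u)\, dF_{all}(u)}{1 - \int_{-\infty}^{\infty} S_{all}(t \mid u)\, dF_{all}(u)} \quad \text{and} \quad FPR_{all}(r \mid t) = \frac{\int_r^{\infty} S_{all}(t \mid u)\, dF_{all}(u)}{\int_{-\infty}^{\infty} S_{all}(t \mid u)\, dF_{all}(u)}.$$

The trade-off between $TPR_{all}(r \mid t)$ and $FPR_{all}(r \mid t)$ can be summarized based on the receiver operating characteristic (ROC) curve $ROC_{all}(u \mid t) = TPR_{all}\{FPR_{all}^{-1}(u \mid t) \mid t\}$, where $u$ is any specified FPR level of interest. The marginal distribution functions $F_{all}(r)$ and $F_{old}(r)$ can be respectively estimated as

$$\hat F_{all}(r) = \frac{\sum_{i=1}^{N} \hat\omega_i I(\hat R_{all,i} \le r)}{\sum_{i=1}^{N} \hat\omega_i} \quad \text{and} \quad \hat F_{old}(r) = N^{-1}\sum_{i=1}^{N} I(\hat R_{old,i} \le r).$$

Subsequently, we may construct plug-in estimators for $TPR_{all}(r \mid t)$ and $FPR_{all}(r \mid t)$ as

$$\widehat{TPR}_{all}(r \mid t) = \frac{1 - \hat F_{all}(r) - \int_r^{\infty} \hat S_{all}(t \mid u)\, d\hat F_{all}(u)}{1 - \int_{-\infty}^{\infty} \hat S_{all}(t \mid u)\, d\hat F_{all}(u)} \quad \text{and} \quad \widehat{FPR}_{all}(r \mid t) = \frac{\int_r^{\infty} \hat S_{all}(t \mid u)\, d\hat F_{all}(u)}{\int_{-\infty}^{\infty} \hat S_{all}(t \mid u)\, d\hat F_{all}(u)},$$

respectively. Similar plug-in estimators can be constructed for other accuracy parameters. We may quantify the incremental value (IncV) of $Y_{new}$ based on the difference between the accuracy of $R_{all}$ and $R_{old}$. For example, the IncV of $Y_{new}$ with respect to the ROC curve at FPR level of $u_0$ can be estimated as $\widehat{ROC}_{all}(u_0 \mid t) - \widehat{ROC}_{old}(u_0 \mid t)$, where $\widehat{ROC}_{all}(u_0 \mid t) = \widehat{TPR}_{all}\{\widehat{FPR}_{all}^{-1}(u_0 \mid t) \mid t\}$ and $\widehat{ROC}_{old}$ is the estimated ROC curve for $R_{old}$.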
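
Since the estimated score distribution is discrete, the plug-in integrals reduce to weighted sums over subjects. The sketch below assumes that the calibrated survival value $\hat S(t \mid \hat R_i)$ has already been computed for each subject and that the point $r$ itself is counted in the rule $R \ge r$; the function names are hypothetical.

```python
def tpr_fpr(r, R, S, w):
    """Plug-in TPR and FPR of the rule R >= r at a fixed horizon t.

    R: estimated risk scores; S: calibrated survival S_hat(t | R_i) per subject;
    w: sampling weights, so F_hat puts mass w_i / sum(w) on R_i.
    """
    tp = sum(w_i * (1.0 - s_i) for R_i, s_i, w_i in zip(R, S, w) if R_i >= r)
    fp = sum(w_i * s_i for R_i, s_i, w_i in zip(R, S, w) if R_i >= r)
    events = sum(w_i * (1.0 - s_i) for s_i, w_i in zip(S, w))
    non_events = sum(w_i * s_i for s_i, w_i in zip(S, w))
    return tp / events, fp / non_events


def roc_at(u0, R, S, w):
    """TPR at the smallest observed cut-off whose FPR does not exceed u0."""
    for r in sorted(set(R)):
        tpr, fpr = tpr_fpr(r, R, S, w)
        if fpr <= u0:
            return tpr
    return 0.0  # no observed cut-off reaches u0; a cut-off above all scores has TPR 0
```

The incremental value at level $u_0$ would then be the difference between `roc_at` evaluated with the full-model scores and weights and `roc_at` evaluated with the reduced-model scores and unit weights.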

3.3 ∣. Resampling Based Interval Estimation

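To estimate the asymptotic variance of the proposed AIPW estimators, we propose a perturbation resampling procedure. Specifically, let $\mathcal{I} = (I_1, \ldots, I_N)$ be a vector of independent and identically distributed non-negative random variables with mean 1 and variance 1. We first obtain the perturbed counterpart of $\hat\beta(\pi, x)$ as $\hat\beta^{*}(\pi, x)$, the solution to the estimating equation

$$\hat U^{*}_{\pi,x}(\beta) = N^{-1}\sum_{i=1}^{N} K_b(\tilde\pi_{0i} - \pi, X_i - x)(1 - \delta_i) Z_i \{V_i - g(\beta^{\top} Z_i)\} I_i.$$

Then we perturb the AIPW weights as

$$\hat\omega_i^{*} = \left\{\delta_i + \frac{(1 - \delta_i) V_{0i}}{\hat\pi^{*}_{0i}}\right\} I_i \quad \text{with} \quad \hat\pi^{*}_{0i} = g\{\hat\beta^{*}(\tilde\pi_{0i}, X_i)^{\top} Z_i\}.$$

Subsequently, we perturb all AIPW estimators by replacing $\hat\omega_i$ with $\hat\omega_i^{*}$. Specifically, we perturb $\hat\gamma_{all}$ as

$$\hat\gamma^{*}_{all} = \arg\max_{\gamma} \sum_{i=1}^{N} \hat\omega_i^{*} \delta_i \left[\gamma^{\top} Y_i - \log\left\{\sum_{j=1}^{N} \hat\omega_j^{*} I(X_j \ge X_i)\exp(\gamma^{\top} Y_j)\right\}\right],$$

and perturb $\hat S_{all}(t \mid r)$ as $\hat S^{*}_{all}(t \mid r) = \exp\{-\hat\Lambda^{*}_{all}(t \mid r)\}$, where

$$\hat\Lambda^{*}_{all}(t \mid r) = \int_0^t \frac{\sum_i \hat\omega_i^{*} K_h(\hat R^{*}_{all,i} - r)\, dN_i(u)}{\sum_i \hat\omega_i^{*} K_h(\hat R^{*}_{all,i} - r) I(X_i \ge u)} \quad \text{and} \quad \hat R^{*}_{all,i} = Y_i^{\top}\hat\gamma^{*}_{all}.$$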

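The accuracy parameters can be perturbed similarly. For example, we may obtain

$$\widehat{TPR}^{*}_{all}(r \mid t) = \frac{1 - \hat F^{*}_{all}(r) - \int_r^{\infty} \hat S^{*}_{all}(t \mid u)\, d\hat F^{*}_{all}(u)}{1 - \int_{-\infty}^{\infty} \hat S^{*}_{all}(t \mid u)\, d\hat F^{*}_{all}(u)},$$

where $\hat F^{*}_{all}(r) = \sum_{i=1}^{N} \hat\omega_i^{*} I(\hat R^{*}_{all,i} \le r) \big/ \sum_{i=1}^{N} \hat\omega_i^{*}$.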

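For IncV parameters, the estimation of model parameters related to the reduced model only involves full cohort data, and thus the perturbation only involves weighting observations by $\{I_i\}$. Specifically, $\hat\gamma_{old}$ is perturbed as

$$\hat\gamma^{*}_{old} = \arg\max_{\gamma} \sum_{i=1}^{N} I_i \delta_i \left[\gamma^{\top} Y_{old,i} - \log\left\{\sum_{j=1}^{N} I_j I(X_j \ge X_i)\exp(\gamma^{\top} Y_{old,j})\right\}\right],$$

and $\hat S^{*}_{old}(t \mid r) = \exp\{-\hat\Lambda^{*}_{old}(t \mid r)\}$, where

$$\hat\Lambda^{*}_{old}(t \mid r) = \int_0^t \frac{\sum_i I_i K_h(\hat R^{*}_{old,i} - r)\, dN_i(u)}{\sum_i I_i K_h(\hat R^{*}_{old,i} - r) I(X_i \ge u)} \quad \text{and} \quad \hat R^{*}_{old,i} = Y_{old,i}^{\top}\hat\gamma^{*}_{old}.$$

Similar strategies can be used for accuracy parameters such as $\widehat{TPR}^{*}_{old}(c \mid t)$ and $\widehat{FPR}^{*}_{old}(c \mid t)$.

To obtain variance estimators and construct confidence intervals, we may obtain a large number, say $B$, of realizations of $\mathcal{I}$. For each realization of $\mathcal{I}$, we obtain the above perturbed estimates. The empirical distribution of the $B$ sets of perturbed estimates can be used for inference. For example, the empirical variance of $\widehat{ROC}^{*}_{all}(u_0 \mid t) - \widehat{ROC}^{*}_{old}(u_0 \mid t)$ can be used to approximate the variance of $\widehat{ROC}_{all}(u_0 \mid t) - \widehat{ROC}_{old}(u_0 \mid t)$.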
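
The resampling loop itself is short. The sketch below is generic: it assumes the user supplies a function that maps one realization of the multipliers to a perturbed estimate, redoing every data-dependent step (sampling-probability fit, weights, model fit, accuracy summary). Unit exponential multipliers are one convenient choice with mean 1 and variance 1; that choice and the names used here are assumptions of this illustration.

```python
import math
import random


def perturbation_se(estimate, n, B=500, seed=0):
    """Standard error from B perturbed re-evaluations of an estimator.

    estimate: function mapping a list of n multipliers to a scalar estimate.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(B):
        # Exp(1) multipliers are non-negative with mean 1 and variance 1.
        multipliers = [rng.expovariate(1.0) for _ in range(n)]
        draws.append(estimate(multipliers))
    mean = sum(draws) / B
    return math.sqrt(sum((d - mean) ** 2 for d in draws) / (B - 1))


def weighted_cdf(r, R, w):
    """F_hat(r) = sum_i w_i I(R_i <= r) / sum_i w_i."""
    return sum(w_i for R_i, w_i in zip(R, w) if R_i <= r) / sum(w)
```

For example, with scores `R` and fixed weights `w`, `perturbation_se(lambda I: weighted_cdf(0.5, R, [a * b for a, b in zip(w, I)]), len(R))` perturbs only the distribution function; this toy use holds the weights fixed, whereas the procedure described above also re-estimates them within each perturbation.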

4 ∣. NUMERICAL STUDIES

We performed extensive simulations to evaluate the finite sample performance of the proposed estimators and to compare them with other estimators under the NCC design when the design is carried out perfectly or imperfectly. We generated $Y = (Y_{old}, Y_{new})$ from a bivariate normal distribution with zero mean, unit variance and correlation 0.5. Given $Y$, we generated $T$ from a PH model

$$P(T \ge t \mid Y) = \exp[-\exp\{\log(0.01t) + \log(2) Y_{new} + \log(3) Y_{old}\}].$$

The censoring time was generated under two settings: (I) $C \sim C_{IND} = \min(C_a, C_b)$, where $C_a \sim \mathrm{Uniform}(0.5, 2)$ and $C_b \sim \mathrm{Gamma}(\text{shape} = 2, \text{rate} = 2)$; (II) $C \sim C_{DEP} = \min\{C_a, C_b(Y)\}$, where $C_b(Y) = \exp\{(Y_{new} + Y_{old})/5\} + 0.5$. This leads to covariate independent censoring in (I) and covariate dependent censoring in (II). The censoring rate and event rate (proportion of cases) are around 15% and 5%, respectively. We let $N = 2000$, and selected the NCC cohort by including all the cases and $m = 1$ control per case. Under each configuration, results were summarized based on 500 simulated datasets. We obtain estimators for $\bar\gamma_{all} = (\bar\gamma_1, \bar\gamma_2)$ in model (2.3) and $TPR_{u_0}$, $PPV_{u_0}$, $NPV_{u_0}$ at FPR $= u_0$, with $u_0$ taken to be 0.05, 0.1, 0.2. We also compared the proposed approach with existing methods including the TIPW estimator of Cai and Zheng [19], the NP estimators of Zheng et al. [2] and the conditional logistic regression based estimator, denoted as CLR.
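
For readers who wish to experiment, the full-cohort data under independent censoring (I) can be generated roughly as follows. This is a sketch under a stated assumption: the PH model is read as $P(T \ge t \mid Y) = \exp[-0.01\,t\,\exp\{\log(2)Y_{new} + \log(3)Y_{old}\}]$, so that $T$ is exponential given $Y$. The function name is hypothetical, and the sketch covers neither the dependent censoring setting nor the NCC sampling step, and is not claimed to reproduce the reported censoring and event rates.

```python
import math
import random


def simulate_cohort(n, seed=0):
    """Simulate (X, delta, Y_old, Y_new) for n subjects under censoring (I)."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y_old = z1
        y_new = 0.5 * z1 + math.sqrt(0.75) * z2  # unit variance, correlation 0.5
        # Assumed reading of the PH model: T | Y is exponential with this rate.
        rate = 0.01 * math.exp(math.log(2) * y_new + math.log(3) * y_old)
        t = rng.expovariate(rate)
        c_a = rng.uniform(0.5, 2.0)
        c_b = rng.gammavariate(2.0, 0.5)  # shape 2, scale 1 / rate = 0.5
        c = min(c_a, c_b)
        cohort.append((min(t, c), int(t <= c), y_old, y_new))
    return cohort
```

Note that `random.gammavariate` is parameterized by shape and scale, so a rate of 2 corresponds to a scale of 0.5.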

We considered three settings. In setting (1), the matching covariate vector is $M = (M_1, M_2)$ with matching window $a_0 = (0, 0)$, where $M_1 = \sum_{l=1}^{2} I(Y_{old} \ge y_{q_l})$, $y_q$ is the $100q$th percentile of $Y_{old}$, $q_1 = 0.33$, $q_2 = 0.66$, and $M_2 \sim \mathrm{Bernoulli}(0.5)$. In setting (2), the matching vector is $M = (M_1, \ldots, M_5)$ with matching window $a_0 = (0, 2, 2, 5, 0)$, where $M_1$ is the same as in setting (1), $M_2 \sim 0.3e^{\mathcal{N}}$, $M_3 \sim 5\phi(Y_{old} + \mathcal{N})$, $M_4 \sim \lfloor \mathrm{Uniform}(0, 10) \rceil$, $M_5 \sim \mathrm{Bernoulli}(0.5)$, $\phi$ is a normal density function, and $\mathcal{N} \sim N(0, 1)$. In setting (3), the matching vector is $M = (M_1, M_2, M_3, M_4)$ with a varying window: we intend to match with window $a_0 = (0, 0, 0, 0)$, but when the number of subjects in the risk set is not sufficient for some cases, we relax the criterion to the matching window $a = (0, 0, 2, 2)$ and select controls from the new risk set. Here $M_1$ is the same as in setting (1), $M_2 = I(Y_{old} + \mathcal{N} > 0)$ with $\mathcal{N} \sim N(0, 1)$, $M_3 \sim 5\phi(Y_{old} + \mathcal{N})$ and $M_4 \sim 0.2e^{\mathcal{N}}$.

Results summarizing the performance of the proposed point and interval estimators across settings (1)–(3) are presented in Tables 1–3. The point estimators have negligible biases. The averages of the estimated standard errors (ASEs) are close to the corresponding empirical standard errors (SEs), and the empirical coverage probabilities (CP) of the 95% confidence intervals are close to the nominal level. These results confirm the validity of the proposed estimation procedures in finite samples.

TABLE 1.

The Bias, empirical standard error (SE) and relative efficiency (RE) of the TIPW estimator, the proposed AIPW estimator, the nonparametric method based estimator (NP) and the CLR based estimator (CLR). For the proposed AIPW estimator, we also calculated the average of the estimated standard error (ASE), empirical coverage probabilities (CP) of the 95% CIs (×100) for settings (1).

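Independent censoring (I)

| Parameter | True | Bias: TIPW | Bias: AIPW | Bias: NP | Bias: CLR | SE: TIPW | SE: AIPW | SE: NP | SE: CLR | RE: TIPW | RE: AIPW | RE: NP | RE: CLR | AIPW ASE | AIPW CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\bar\gamma_1$ | 0.693 | 0.010 | 0.022 | 0.005 | 0.146 | 0.096 | 0.096 | 0.096 | 0.166 | 1.000 | 0.955 | 1.020 | 0.191 | 0.090 | 93.0 |
| $\bar\gamma_2$ | 1.099 | −0.005 | 0.014 | −0.022 | 0.344 | 0.100 | 0.095 | 0.092 | 0.340 | 1.000 | 1.086 | 1.116 | 0.043 | 0.090 | 93.6 |
| TPR, $u_0 = 0.05$ | 0.460 | −0.011 | −0.010 | −0.025 | | 0.050 | 0.044 | 0.046 | | 1.000 | 1.296 | 0.942 | | 0.044 | 94.2 |
| PPV, $u_0 = 0.05$ | 0.543 | −0.012 | −0.005 | 0.002 | | 0.035 | 0.032 | 0.034 | | 1.000 | 1.285 | 1.132 | | 0.032 | 94.4 |
| NPV, $u_0 = 0.05$ | 0.932 | −0.001 | −0.002 | −0.008 | | 0.011 | 0.007 | 0.009 | | 1.000 | 2.220 | 0.863 | | 0.008 | 96.8 |
| TPR, $u_0 = 0.1$ | 0.596 | −0.010 | −0.009 | −0.027 | | 0.051 | 0.041 | 0.044 | | 1.000 | 1.492 | 1.010 | | 0.042 | 94.8 |
| PPV, $u_0 = 0.1$ | 0.435 | −0.009 | −0.003 | 0.005 | | 0.030 | 0.027 | 0.030 | | 1.000 | 1.325 | 1.056 | | 0.026 | 94.0 |
| NPV, $u_0 = 0.1$ | 0.945 | −0.001 | −0.001 | −0.008 | | 0.011 | 0.007 | 0.008 | | 1.000 | 2.586 | 1.007 | | 0.007 | 97.0 |
| TPR, $u_0 = 0.2$ | 0.748 | −0.006 | −0.005 | −0.022 | | 0.046 | 0.035 | 0.038 | | 1.000 | 1.747 | 1.163 | | 0.036 | 96.2 |
| PPV, $u_0 = 0.2$ | 0.326 | −0.005 | −0.000 | 0.009 | | 0.024 | 0.021 | 0.023 | | 1.000 | 1.429 | 0.972 | | 0.021 | 95.2 |
| NPV, $u_0 = 0.2$ | 0.961 | −0.001 | −0.001 | −0.006 | | 0.011 | 0.006 | 0.007 | | 1.000 | 2.986 | 1.229 | | 0.007 | 97.2 |

Dependent censoring (II)

| Parameter | True | Bias: TIPW | Bias: AIPW | Bias: NP | Bias: CLR | SE: TIPW | SE: AIPW | SE: NP | SE: CLR | RE: TIPW | RE: AIPW | RE: NP | RE: CLR | AIPW ASE | AIPW CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\bar\gamma_1$ | 0.693 | 0.018 | 0.044 | 0.007 | 0.086 | 0.116 | 0.111 | 0.115 | 0.169 | 1.000 | 0.952 | 1.027 | 0.382 | 0.106 | 91.9 |
| $\bar\gamma_2$ | 1.099 | −0.004 | −0.014 | −0.019 | 0.210 | 0.122 | 0.105 | 0.101 | 0.315 | 1.000 | 1.319 | 1.398 | 0.104 | 0.101 | 93.3 |
| TPR, $u_0 = 0.05$ | 0.460 | −0.010 | −0.005 | −0.025 | | 0.052 | 0.046 | 0.050 | | 1.000 | 1.303 | 0.891 | | 0.047 | 96.0 |
| PPV, $u_0 = 0.05$ | 0.543 | −0.010 | −0.002 | 0.005 | | 0.039 | 0.033 | 0.037 | | 1.000 | 1.483 | 1.150 | | 0.033 | 94.8 |
| NPV, $u_0 = 0.05$ | 0.932 | −0.001 | −0.002 | −0.009 | | 0.010 | 0.008 | 0.009 | | 1.000 | 1.623 | 0.575 | | 0.008 | 96.8 |
| TPR, $u_0 = 0.1$ | 0.596 | −0.008 | −0.004 | −0.026 | | 0.048 | 0.042 | 0.044 | | 1.000 | 1.308 | 0.886 | | 0.044 | 94.6 |
| PPV, $u_0 = 0.1$ | 0.435 | −0.006 | −0.000 | 0.008 | | 0.032 | 0.026 | 0.029 | | 1.000 | 1.533 | 1.137 | | 0.027 | 95.0 |
| NPV, $u_0 = 0.1$ | 0.945 | −0.001 | −0.001 | −0.008 | | 0.009 | 0.007 | 0.009 | | 1.000 | 1.578 | 0.606 | | 0.008 | 96.0 |
| TPR, $u_0 = 0.2$ | 0.748 | −0.008 | −0.004 | −0.025 | | 0.042 | 0.037 | 0.040 | | 1.000 | 1.310 | 0.833 | | 0.038 | 95.0 |
| PPV, $u_0 = 0.2$ | 0.326 | −0.005 | 0.000 | 0.010 | | 0.025 | 0.021 | 0.024 | | 1.000 | 1.560 | 0.987 | | 0.021 | 95.6 |
| NPV, $u_0 = 0.2$ | 0.961 | −0.001 | −0.001 | −0.007 | | 0.008 | 0.007 | 0.008 | | 1.000 | 1.580 | 0.602 | | 0.007 | 96.6 |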

TABLE 3.

The Bias, empirical standard error (SE) and relative efficiency (RE) of the TIPW estimator, the proposed AIPW estimator, the nonparametric method based estimator (NP) and the CLR based estimator (CLR). For the proposed AIPW estimator, we also calculated the average of the estimated standard error (ASE), empirical coverage probabilities (CP) of the 95% CIs (×100) for settings (3).

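Independent censoring (I)

| Parameter | True | Bias: TIPW | Bias: AIPW | Bias: NP | Bias: CLR | SE: TIPW | SE: AIPW | SE: NP | SE: CLR | RE: TIPW | RE: AIPW | RE: NP | RE: CLR | AIPW ASE | AIPW CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\bar\gamma_1$ | 0.693 | 0.028 | 0.024 | 0.007 | 0.152 | 0.129 | 0.101 | 0.099 | 0.169 | 1.000 | 1.614 | 1.758 | 0.335 | 0.105 | 94.6 |
| $\bar\gamma_2$ | 1.099 | −0.086 | −0.002 | −0.063 | 0.335 | 0.128 | 0.094 | 0.092 | 0.345 | 1.000 | 2.716 | 1.908 | 0.103 | 0.103 | 95.6 |
| TPR, $u_0 = 0.05$ | 0.460 | −0.039 | −0.009 | −0.036 | | 0.053 | 0.048 | 0.050 | | 1.000 | 1.777 | 1.130 | | 0.050 | 94.0 |
| PPV, $u_0 = 0.05$ | 0.543 | 0.015 | −0.002 | 0.020 | | 0.040 | 0.033 | 0.036 | | 1.000 | 1.671 | 1.056 | | 0.037 | 96.6 |
| NPV, $u_0 = 0.05$ | 0.932 | −0.017 | −0.003 | −0.017 | | 0.014 | 0.008 | 0.011 | | 1.000 | 6.487 | 1.131 | | 0.010 | 98.0 |
| TPR, $u_0 = 0.1$ | 0.596 | −0.043 | −0.006 | −0.038 | | 0.058 | 0.048 | 0.051 | | 1.000 | 2.266 | 1.293 | | 0.048 | 92.8 |
| PPV, $u_0 = 0.1$ | 0.435 | 0.020 | 0.001 | 0.025 | | 0.036 | 0.027 | 0.032 | | 1.000 | 2.251 | 1.040 | | 0.031 | 96.6 |
| NPV, $u_0 = 0.1$ | 0.945 | −0.015 | −0.001 | −0.015 | | 0.014 | 0.008 | 0.011 | | 1.000 | 6.411 | 1.182 | | 0.009 | 96.2 |
| TPR, $u_0 = 0.2$ | 0.748 | −0.045 | −0.001 | −0.035 | | 0.057 | 0.040 | 0.045 | | 1.000 | 3.244 | 1.621 | | 0.042 | 94.6 |
| PPV, $u_0 = 0.2$ | 0.326 | 0.022 | 0.003 | 0.027 | | 0.033 | 0.021 | 0.025 | | 1.000 | 3.568 | 1.099 | | 0.025 | 96.8 |
| NPV, $u_0 = 0.2$ | 0.961 | −0.014 | −0.001 | −0.013 | | 0.013 | 0.007 | 0.010 | | 1.000 | 6.815 | 1.307 | | 0.008 | 96.2 |

Dependent censoring (II)

| Parameter | True | Bias: TIPW | Bias: AIPW | Bias: NP | Bias: CLR | SE: TIPW | SE: AIPW | SE: NP | SE: CLR | RE: TIPW | RE: AIPW | RE: NP | RE: CLR | AIPW ASE | AIPW CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\bar\gamma_1$ | 0.693 | 0.019 | 0.032 | 0.003 | 0.074 | 0.138 | 0.115 | 0.116 | 0.159 | 1.000 | 1.372 | 1.443 | 0.629 | 0.109 | 92.1 |
| $\bar\gamma_2$ | 1.099 | −0.082 | −0.016 | −0.061 | 0.213 | 0.129 | 0.106 | 0.104 | 0.354 | 1.000 | 2.032 | 1.603 | 0.136 | 0.102 | 93.9 |
| TPR, $u_0 = 0.05$ | 0.460 | −0.032 | −0.001 | −0.034 | | 0.063 | 0.052 | 0.054 | | 1.000 | 1.816 | 1.206 | | 0.050 | 91.9 |
| PPV, $u_0 = 0.05$ | 0.543 | 0.022 | 0.006 | 0.027 | | 0.045 | 0.034 | 0.040 | | 1.000 | 2.154 | 1.116 | | 0.036 | 94.1 |
| NPV, $u_0 = 0.05$ | 0.932 | −0.017 | −0.003 | −0.019 | | 0.014 | 0.009 | 0.012 | | 1.000 | 5.339 | 1.012 | | 0.010 | 97.0 |
| TPR, $u_0 = 0.1$ | 0.596 | −0.036 | 0.001 | −0.035 | | 0.061 | 0.049 | 0.049 | | 1.000 | 2.134 | 1.427 | | 0.047 | 93.1 |
| PPV, $u_0 = 0.1$ | 0.435 | 0.026 | 0.008 | 0.032 | | 0.039 | 0.028 | 0.033 | | 1.000 | 2.692 | 1.042 | | 0.030 | 94.5 |
| NPV, $u_0 = 0.1$ | 0.945 | −0.015 | −0.002 | −0.016 | | 0.013 | 0.009 | 0.011 | | 1.000 | 5.235 | 1.097 | | 0.009 | 96.3 |
| TPR, $u_0 = 0.2$ | 0.748 | −0.038 | 0.003 | −0.035 | | 0.053 | 0.042 | 0.043 | | 1.000 | 2.340 | 1.379 | | 0.040 | 94.3 |
| PPV, $u_0 = 0.2$ | 0.326 | 0.026 | 0.008 | 0.033 | | 0.034 | 0.021 | 0.027 | | 1.000 | 3.593 | 1.009 | | 0.024 | 96.1 |
| NPV, $u_0 = 0.2$ | 0.961 | −0.014 | −0.001 | −0.014 | | 0.012 | 0.008 | 0.010 | | 1.000 | 4.634 | 1.091 | | 0.008 | 97.6 |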

In setting (1), sampling is correctly carried out and M is low dimensional, and hence all three methods (TIPW, AIPW, NP) are valid. As shown in Table 1, all three estimators have negligible biases, TIPW and NP have comparable efficiency with respect to mean squared error (MSE), and AIPW is a little more efficient than the TIPW and NP estimators. In setting (2), the sampling is carried out correctly but the matching variable is of a higher dimension, which leads to curse of dimensionality for the NP method while the TIPW remains valid. As shown in Table 2, the TIPW and AIPW both have negligible biases, while the NP exhibits higher biases. Setting (3) is a commonly encountered imperfect NCC sampling setting that is similar to the motivating example of the NHS II study. In this case, the TIPW estimator is biased as expected. There is also bias observed for the NP estimators due to the curse of dimensionality, whereas the AIPW estimator still maintains negligible bias. In addition, the AIPW estimator is substantially more efficient than both the TIPW and NP estimators with respect to MSE, with relative efficiency as high as 6 compared to the TIPW estimator and 5 compared to the NP estimator. In all the settings considered, the CLR estimator is either more biased or less efficient compared to other estimators.

TABLE 2.

The Bias, empirical standard error (SE) and relative efficiency (RE) of the TIPW estimator, the proposed AIPW estimator, the nonparametric method based estimator (NP) and the CLR based estimator (CLR). For the proposed AIPW estimator, we also calculated the average of the estimated standard error (ASE), empirical coverage probabilities (CP) of the 95% CIs (×100) for settings (2).

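Independent censoring (I)

| Parameter | True | Bias: TIPW | Bias: AIPW | Bias: NP | Bias: CLR | SE: TIPW | SE: AIPW | SE: NP | SE: CLR | RE: TIPW | RE: AIPW | RE: NP | RE: CLR | AIPW ASE | AIPW CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\bar\gamma_1$ | 0.693 | 0.014 | 0.020 | −0.009 | 0.139 | 0.110 | 0.105 | 0.118 | 0.171 | 1.000 | 1.078 | 0.880 | 0.254 | 0.096 | 91.9 |
| $\bar\gamma_2$ | 1.099 | −0.018 | 0.013 | −0.097 | 0.343 | 0.109 | 0.097 | 0.145 | 0.339 | 1.000 | 1.263 | 0.399 | 0.052 | 0.096 | 94.0 |
| TPR, $u_0 = 0.05$ | 0.460 | −0.015 | −0.006 | −0.052 | | 0.049 | 0.043 | 0.057 | | 1.000 | 1.383 | 0.446 | | 0.046 | 95.8 |
| PPV, $u_0 = 0.05$ | 0.543 | −0.003 | −0.001 | 0.054 | | 0.034 | 0.030 | 0.041 | | 1.000 | 1.309 | 0.261 | | 0.033 | 96.8 |
| NPV, $u_0 = 0.05$ | 0.932 | −0.004 | −0.002 | −0.036 | | 0.011 | 0.008 | 0.017 | | 1.000 | 2.124 | 0.085 | | 0.009 | 97.2 |
| TPR, $u_0 = 0.1$ | 0.596 | −0.016 | −0.006 | −0.064 | | 0.048 | 0.043 | 0.055 | | 1.000 | 1.391 | 0.360 | | 0.043 | 95.6 |
| PPV, $u_0 = 0.1$ | 0.435 | −0.001 | 0.000 | 0.057 | | 0.029 | 0.025 | 0.040 | | 1.000 | 1.302 | 0.176 | | 0.027 | 96.6 |
| NPV, $u_0 = 0.1$ | 0.945 | −0.004 | −0.002 | −0.033 | | 0.010 | 0.007 | 0.016 | | 1.000 | 2.101 | 0.086 | | 0.008 | 97.4 |
| TPR, $u_0 = 0.2$ | 0.748 | −0.013 | −0.004 | −0.072 | | 0.045 | 0.039 | 0.051 | | 1.000 | 1.480 | 0.287 | | 0.039 | 94.6 |
| PPV, $u_0 = 0.2$ | 0.326 | 0.002 | 0.002 | 0.056 | | 0.025 | 0.021 | 0.035 | | 1.000 | 1.444 | 0.142 | | 0.022 | 95.6 |
| NPV, $u_0 = 0.2$ | 0.961 | −0.003 | −0.001 | −0.031 | | 0.009 | 0.007 | 0.015 | | 1.000 | 2.091 | 0.085 | | 0.008 | 97.8 |

Dependent censoring (II)

| Parameter | True | Bias: TIPW | Bias: AIPW | Bias: NP | Bias: CLR | SE: TIPW | SE: AIPW | SE: NP | SE: CLR | RE: TIPW | RE: AIPW | RE: NP | RE: CLR | AIPW ASE | AIPW CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\bar\gamma_1$ | 0.693 | 0.024 | 0.044 | −0.001 | 0.088 | 0.121 | 0.115 | 0.131 | 0.172 | 1.000 | 1.008 | 0.892 | 0.408 | 0.106 | 91.7 |
| $\bar\gamma_2$ | 1.099 | −0.009 | −0.008 | −0.098 | 0.230 | 0.130 | 0.110 | 0.133 | 0.331 | 1.000 | 1.389 | 0.622 | 0.104 | 0.102 | 93.6 |
| TPR, $u_0 = 0.05$ | 0.460 | −0.010 | −0.002 | −0.049 | | 0.056 | 0.050 | 0.058 | | 1.000 | 1.302 | 0.565 | | 0.048 | 92.8 |
| PPV, $u_0 = 0.05$ | 0.543 | 0.001 | 0.004 | 0.064 | | 0.040 | 0.035 | 0.046 | | 1.000 | 1.311 | 0.263 | | 0.034 | 94.1 |
| NPV, $u_0 = 0.05$ | 0.932 | −0.004 | −0.003 | −0.038 | | 0.011 | 0.008 | 0.017 | | 1.000 | 2.181 | 0.085 | | 0.009 | 95.6 |
| TPR, $u_0 = 0.1$ | 0.596 | −0.012 | −0.003 | −0.067 | | 0.053 | 0.045 | 0.056 | | 1.000 | 1.426 | 0.385 | | 0.044 | 94.9 |
| PPV, $u_0 = 0.1$ | 0.435 | 0.002 | 0.005 | 0.066 | | 0.034 | 0.029 | 0.040 | | 1.000 | 1.335 | 0.190 | | 0.028 | 92.6 |
| NPV, $u_0 = 0.1$ | 0.945 | −0.004 | −0.002 | −0.036 | | 0.011 | 0.007 | 0.016 | | 1.000 | 2.360 | 0.080 | | 0.008 | 97.9 |
| TPR, $u_0 = 0.2$ | 0.748 | −0.011 | −0.000 | −0.072 | | 0.045 | 0.036 | 0.047 | | 1.000 | 1.631 | 0.289 | | 0.038 | 95.3 |
| PPV, $u_0 = 0.2$ | 0.326 | 0.004 | 0.006 | 0.065 | | 0.027 | 0.022 | 0.036 | | 1.000 | 1.506 | 0.137 | | 0.022 | 95.1 |
| NPV, $u_0 = 0.2$ | 0.961 | −0.003 | −0.001 | −0.033 | | 0.010 | 0.006 | 0.014 | | 1.000 | 2.594 | 0.079 | | 0.007 | 97.7 |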

To examine whether our proposed method performs well under settings with a very low event rate, we also generated data under a slight variation of the above PH model with a substantially lower baseline hazard, leading to an event rate of about 0.5%, and independent censoring. We sampled the NCC cohort under setting (3) and obtained estimates as above. As shown in Table 4, the proposed AIPW estimates have small biases and high relative efficiencies.

TABLE 4.

The Bias, empirical standard error (SE) and relative efficiency (RE) of the TIPW estimator, the proposed AIPW estimator, the nonparametric method based estimator (NP) and the CLR based estimator (CLR). For the proposed AIPW estimator, we also calculated the average of the estimated standard error (ASE), empirical coverage probabilities (CP) of the 95% CIs (×100).

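| Parameter | True | Bias: TIPW | Bias: AIPW | Bias: NP | Bias: CLR | SE: TIPW | SE: AIPW | SE: NP | SE: CLR | RE: TIPW | RE: AIPW | RE: NP | RE: CLR | AIPW ASE | AIPW CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\bar\gamma_1$ | 0.693 | 0.022 | 0.053 | 0.006 | 0.022 | 0.122 | 0.103 | 0.106 | 0.108 | 1.000 | 1.152 | 1.372 | 1.275 | 0.105 | 93.5 |
| $\bar\gamma_2$ | 1.099 | −0.062 | −0.008 | −0.027 | 0.043 | 0.116 | 0.087 | 0.085 | 0.193 | 1.000 | 2.245 | 2.146 | 0.440 | 0.093 | 94.9 |
| TPR, $u_0 = 0.05$ | 0.457 | −0.044 | −0.009 | −0.036 | | 0.047 | 0.038 | 0.043 | | 1.000 | 2.623 | 1.305 | | 0.038 | 91.5 |
| PPV, $u_0 = 0.05$ | 0.163 | 0.004 | −0.002 | 0.002 | | 0.022 | 0.016 | 0.020 | | 1.000 | 1.866 | 1.291 | | 0.016 | 94.9 |
| NPV, $u_0 = 0.05$ | 0.988 | −0.003 | −0.000 | −0.002 | | 0.002 | 0.001 | 0.002 | | 1.000 | 8.090 | 1.539 | | 0.001 | 95.9 |
| TPR, $u_0 = 0.1$ | 0.601 | −0.049 | −0.008 | −0.039 | | 0.045 | 0.034 | 0.038 | | 1.000 | 3.471 | 1.465 | | 0.036 | 93.3 |
| PPV, $u_0 = 0.1$ | 0.113 | 0.005 | −0.000 | 0.004 | | 0.014 | 0.010 | 0.012 | | 1.000 | 2.334 | 1.446 | | 0.010 | 95.9 |
| NPV, $u_0 = 0.1$ | 0.991 | −0.003 | −0.001 | −0.002 | | 0.002 | 0.001 | 0.001 | | 1.000 | 8.275 | 1.503 | | 0.001 | 95.1 |
| TPR, $u_0 = 0.2$ | 0.757 | −0.047 | −0.005 | −0.035 | | 0.037 | 0.028 | 0.031 | | 1.000 | 4.325 | 1.631 | | 0.031 | 96.1 |
| PPV, $u_0 = 0.2$ | 0.075 | 0.005 | −0.001 | 0.003 | | 0.008 | 0.006 | 0.007 | | 1.000 | 2.804 | 1.482 | | 0.006 | 96.7 |
| NPV, $u_0 = 0.2$ | 0.994 | −0.003 | −0.001 | −0.002 | | 0.001 | 0.001 | 0.001 | | 1.000 | 8.069 | 1.548 | | 0.001 | 93.3 |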

5 ∣. REAL DATA ANALYSIS

High-density lipoprotein (HDL) is a protein-lipid complex which carries a range of proteins. These proteins differ in size and structure, which determines the functional properties and metabolism of HDL [22]. The plasma total apoA-1 concentration (WPA1) is well known to be strongly and consistently predictive of cardiovascular risk [23]. In addition to apoA-1, HDL also contains other proteins including apoA2, apoC3 and apoE. ApoC3, present on 8-15% of HDL particles, has been shown to be associated with the risk of obesity and diabetes [24,25]. To assess the predictiveness of these lipoprotein markers for the risk of developing myocardial infarction (MI), an NCC biomarker study was performed within the NHS II blood cohort consisting of 29,240 registered nurses enrolled around 1989 [26]. Among participants who were free of diagnosed cardiovascular disease or cancer at blood draw, 144 women were identified in the cohort with incident MI between blood draw and January 2016. Using risk-set sampling, 144 controls were to be selected randomly and 1:1 matched on age, fasting (yes, no), smoking (never, past, current <15 cigarettes/day, current > 15 cigarettes/day, resulting in three dummy variables), and month of blood draw. However, due to the lack of samples satisfying the matching criteria and having sufficient stored plasma for biomarker quantification, the NCC design was not followed exactly during the control sampling process, yielding an imperfect NCC design. If the matching criteria are followed, the matching window should be a0 = (0, 0, 0, 0, 2, 2). But if, for some case, there is no control in its risk set, the matching criteria are relaxed in a way that is not recorded. For example, the age difference may be relaxed to 5 years so that there are controls to select from for this case.

The outcome of interest is the time from blood draw to diagnosis of MI. For an individual without an event, the failure time was censored at the earlier of the last contact date and January 2016. Routine risk factors included smoking, age, diabetes, high cholesterol, and medication for high blood pressure (HBP). These factors are available for the full cohort. Measurements of the new biomarkers, WPA1 and apoE, are available only for the NCC subcohort. To account for the subcohort sampling, we fitted a weighted Cox PH model with WPA1, apoE and the baseline clinical variables as covariates using the data from the NCC subcohort. Because the sampling depends on many levels of covariates, it was difficult to estimate the weights using the NP approach (existing packages for nonparametric estimation of the selection probability all failed). Because of the additional adjustment to the matching criteria, the ‘true’ weights were not retrievable. Therefore, we calculated the weights using the proposed AIPW technique. As presented in Table 5, heavier smoking (>15 cig/d), diabetes, high cholesterol, medication for high blood pressure, and high values of apoE are significantly associated with a higher risk of MI. In particular, apoE predicts the time to MI beyond clinical factors, with an HR of 1.427 (95% CI: 1.140, 1.786). We also fitted the Cox model using weights calculated strictly from the original protocol (the IPW method). For heavier smoking (>15 cig/d) versus never smoking, the HR estimated by the AIPW method is significantly above 1, whereas the IPW estimate is not; the latter is at odds with findings in the existing literature. This is as expected: the protocol-based weights do not accurately reflect the sampling procedure actually implemented, which may lead to biased estimates in the main regression model.
The results highlight the importance of robust procedures for calculating the sampling weights, though the difference between the IPW and AIPW estimators is less pronounced for the new markers. The CLR estimates differ somewhat from both the IPW and AIPW estimates.
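As a reading aid for the reported interval, the following sketch backs out the log-HR standard error and a Wald statistic from the AIPW estimate for apoE. It assumes the interval is symmetric on the log scale with critical value 1.96, which the text does not state; the function name is hypothetical.

```python
import math

def log_hr_summary(hr: float, lower: float, upper: float, z_crit: float = 1.96):
    """Recover log-HR, its standard error and a Wald z-statistic from an HR
    and its 95% CI, assuming the CI is symmetric on the log scale."""
    log_hr = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    return log_hr, se, log_hr / se

# AIPW estimate for apoE reported in the text: 1.427 (1.140, 1.786)
log_hr, se, z = log_hr_summary(1.427, 1.140, 1.786)

# Under the assumption, the CI midpoint on the log scale returns the HR
midpoint_hr = math.exp((math.log(1.140) + math.log(1.786)) / 2)
```

The midpoint check reproduces the reported HR to rounding error, which is consistent with (but does not prove) a log-scale Wald interval for this estimate.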

TABLE 5.

Hazard ratio (HR) estimates for MI risk using sampling weights based on the original protocol (IPW), the proposed AIPW method and CLR method.

covariate IPW (95% CI) AIPW (95% CI) CLR (95% CI)
smoking (past) 0.804 (0.151, 1.457) 1.054 (0.815, 1.363) 0.723 (0.431, 1.214)
smoking <15 cig/d 0.890 (0.240, 1.541) 1.222 (0.854, 1.748) 1.067 (0.838, 1.359)
smoking >15 cig/d 1.160 (0.847, 1.473) 1.253 (1.083, 1.449) NA
age 1.014 (0.502, 1.526) 1.156 (0.852, 1.569) 0.379 (0.077, 1.863)
diabetes 1.573 (1.346, 1.800) 1.359 (1.121, 1.649) 1.335 (1.015, 1.757)
high cholesterol 1.380 (1.073, 1.686) 1.369 (1.073, 1.747) 1.349 (1.049, 1.736)
medication for HBP 1.307 (1.063, 1.550) 1.430 (1.193, 1.714) 1.301 (1.059, 1.599)
WPA1 0.600 (0.101, 1.098) 0.730 (0.500, 1.066) 0.688 (0.496, 0.953)
apoE 1.420 (1.107, 1.733) 1.427 (1.140, 1.786) 1.214 (0.950, 1.550)

We then calculated the in-sample accuracy measures of the model scores for predicting risk of MI by 158 months (t = 158) using the proposed method. The estimates of TPR, PPV, NPV at FPR=0.05, 0.1, 0.2 and AUC for the Cox model with baseline covariates as well as WPA1 and apoE are listed in Table 6 along with the IncV of the corresponding accuracy measures compared to the performance of a Cox model without the biomarkers. Results show that adding WPA1 and apoE to the Cox model with baseline covariates leads to no significant improvement in the accuracy measures, though apoE has a significant association with time to MI.

TABLE 6.

Estimated accuracy measures for an MI risk model with clinical predictors and biomarkers WPA1 and apoE, and the incremental values (IncV) of WPA1 and apoE over a model with only clinical predictors.

measure FPR est (95% CI) IncV (95% CI)
TPR 0.050 0.265 (0.146, 0.385) −0.022 (−0.106, 0.061)
PPV 0.050 0.021 (0.009, 0.033) −0.001 (−0.009, 0.007)
NPV 0.050 0.997 (0.996, 0.998) 0.000 (−0.000, 0.000)
TPR 0.100 0.360 (0.234, 0.487) 0.001 (−0.081, 0.083)
PPV 0.100 0.014 (0.008, 0.021) −0.001 (−0.005, 0.004)
NPV 0.100 0.997 (0.996, 0.998) 0.000 (−0.000, 0.000)
TPR 0.200 0.503 (0.378, 0.627) 0.025 (−0.065, 0.115)
PPV 0.200 0.010 (0.007, 0.014) 0.000 (−0.002, 0.003)
NPV 0.200 0.998 (0.997, 0.998) 0.000 (−0.000, 0.001)
AUC 0.688 (0.610, 0.765) −0.005 (−0.057, 0.047)

6 ∣. DISCUSSION

Cost-effective two-phase sampling designs have been widely adopted in biomarker research in recent years. The nonrandom sampling of NCC designs introduces complex data structures, which must be handled carefully to avoid bias. One well-recognized barrier in the analysis of two-phase designs is that the control selection procedures are often complicated in practice: many matching factors are considered, and the window of selection for each variable might be adjusted in an ad hoc fashion over the course of the study, making it infeasible to retrieve the ‘true’ sampling weights. Robust nonparametric procedures can consistently recover the weights according to the actual sampling; however, they can handle only a few matching factors. When the total number of matching variables and routine markers $Y_{\mathrm{old}}$ exceeds 5, the NP method of Zheng et al.2 often becomes infeasible both theoretically and practically. In contrast, our proposed AIPW method uses the design-based sampling weights as a reasonable starting point and uses a sufficiently flexible model to estimate the effect on sampling of both the variables involved in control selection and other correlated variables. Compared to the NP approach, the proposed AIPW procedure can incorporate a larger number of variables to augment the weights while maintaining reasonable robustness and efficiency. It is important to note that matching on a large number of variables is generally not desirable, since it inherently increases the chance that the matched risk sets are empty. We therefore do not recommend matching on many variables in practice.

There are several limitations and directions for future research. The proposed approach can be readily extended to other types of two-phase sampling, such as covariate-stratified case–cohort studies. Flexible methods are also needed to account for other practical complications in two-phase sampling. Our methods assume an NCC study in which all cases are selected because the incidence rate is low. In practice, however, not all cases can be sampled because of limited case and sample availability27. This may complicate the inference procedure and warrants future research.

The R code for carrying out the proposed AIPW procedure is available upon request.

ACKNOWLEDGEMENT

This research was funded by U01CA86368 and R01CA236558 awarded by the National Institutes of Health.

APPENDIX

APPENDIX A.

Note that in the appendices, the derivations are with respect to the whole data and the proposed AIPW estimator, so we omit the subscript ‘all’ for notational convenience.

In this section, we show the asymptotic normality of the proposed AIPW estimator.

Assume C has a finite support $[0, \tau]$, $P(T > \tau) > 0$, and the markers Y are continuous and bounded. The limit of $\hat\gamma$, denoted $\bar\gamma$, is in the interior of a compact parameter space $\Omega_\gamma$. Suppose the regularity conditions in Andersen and Gill28 hold. Similarly to Du and Akritas29, we assume the kernel function K is a symmetric probability density function with finite support and bounded second derivative. In addition, we assume the joint density of $R = Y^{T}\bar\gamma$, T, and C has continuous derivatives.

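Denote $\beta_i = \beta(\tilde\pi_{0i}, X_i)$. We first obtain the asymptotic expression of $N^{1/2}(\hat\beta_i - \beta_i)$, which will be used in later derivations. Recall that

$$\hat U_{\tilde\pi_{0i},X_i}(\beta_i) = \frac{1}{N}\sum_{j=1}^{N}(1-\delta_j)\left[V_j - \frac{\exp(\beta_i^{T}Z_j)}{1+\exp(\beta_i^{T}Z_j)}\right] Z_j\, K_b\big((\tilde\pi_{0i},X_i)-(\tilde\pi_{0j},X_j)\big).$$

The derivative of $\hat U_{\tilde\pi_{0i},X_i}(\beta_i)$ with respect to $\beta_i$ is

$$\frac{\partial \hat U_{\tilde\pi_{0i},X_i}(\beta_i)}{\partial \beta_i} = -\frac{1}{N}\sum_{j=1}^{N}\frac{\exp(\beta_i^{T}Z_j)}{\{1+\exp(\beta_i^{T}Z_j)\}^{2}}\, Z_j Z_j^{T}(1-\delta_j)\, K_b\big((\tilde\pi_{0i},X_i)-(\tilde\pi_{0j},X_j)\big),$$

which converges to $-\Sigma_i = -\tilde\pi_{0i}(1-\tilde\pi_{0i})\,E[Z_jZ_j^{T}\mid \tilde\pi_{0j}=\tilde\pi_{0i},\ X_j=X_i]\,f(\tilde\pi_{0i},X_i)$, where $f(\cdot,\cdot)$ is the density function of $(\tilde\pi_{0i},X_i)$. It follows that

$$N^{1/2}(\hat\beta_i-\beta_i) = \Sigma_i^{-1}N^{-1/2}\sum_{j=1}^{N}\left[V_j-\frac{\exp(\beta_i^{T}Z_j)}{1+\exp(\beta_i^{T}Z_j)}\right] Z_j(1-\delta_j)\,K_b\big((\tilde\pi_{0i},X_i)-(\tilde\pi_{0j},X_j)\big)+o_p(1). \tag{A.1}$$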
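To make the kernel-weighted estimating equation concrete, here is a minimal sketch that solves it by Newton-Raphson at a single target point. It is an illustration under simplifying assumptions, not the authors' implementation: the smoothing variable is a scalar (standing in for the pair of design-based probability and follow-up time), $Z_j$ has two components (an intercept and one covariate), the kernel is Epanechnikov, and all names and data are hypothetical.

```python
import math

def expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def epanechnikov(u: float) -> float:
    """Symmetric density with finite support, as assumed for K."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def local_logistic(target, points, Z, V, delta, bandwidth, n_iter=25):
    """Solve sum_j (1-delta_j) {V_j - expit(beta'Z_j)} Z_j K_b(target - point_j) = 0
    for a two-dimensional beta by Newton-Raphson."""
    # Kernel weights; cases (delta_j = 1) receive weight zero
    k = [(1 - d) * epanechnikov((target - p) / bandwidth) / bandwidth
         for p, d in zip(points, delta)]
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for kj, (z0, z1), vj in zip(k, Z, V):
            p = expit(b0 * z0 + b1 * z1)
            r = kj * (vj - p)          # score contribution
            w = kj * p * (1.0 - p)     # information contribution
            s0 += r * z0
            s1 += r * z1
            h00 += w * z0 * z0
            h01 += w * z0 * z1
            h11 += w * z1 * z1
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        # Newton step: beta <- beta + H^{-1} s (2x2 inverse written out)
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1

# Toy data: eight controls near the target, three far away, one case
points = [0.2] * 8 + [0.9] * 3 + [0.2]
Z = [(1.0, 0.0)] * 4 + [(1.0, 1.0)] * 4 + [(1.0, 0.0)] * 3 + [(1.0, 1.0)]
V = [1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
delta = [0] * 11 + [1]

b0, b1 = local_logistic(0.2, points, Z, V, delta, bandwidth=0.1)
pi_hat = expit(b0 + b1)  # fitted selection probability at z = 1
```

In this toy example the far-away controls and the case get zero weight, so the fit reproduces the local selection fractions among nearby controls (1/4 for z = 0 and 3/4 for z = 1).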

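For the proposed AIPW estimators with a general form

$$\hat U = N^{-1/2}\sum_{i=1}^{N}\hat\omega_i U_i, \tag{A.2}$$

where $E(U_i)=0$, $\hat\omega_i = V_i/\hat\pi_i$ and $\hat\pi_i = \delta_i + (1-\delta_i)\hat\pi_{0i}$, we have

$$\hat U = N^{-1/2}\sum_{i=1}^{N}\hat\omega_i U_i = N^{-1/2}\sum_{i=1}^{N}U_i + N^{-1/2}\sum_{i=1}^{N}(\tilde\omega_i-1)U_i + N^{-1/2}\sum_{i=1}^{N}(\hat\omega_i-\tilde\omega_i)U_i \equiv I_1+I_2+I_3,$$

where

$$\begin{aligned}
I_3 &= N^{-1/2}\sum_{i=1}^{N}V_i\left(\frac{1}{\hat\pi_{0i}}-\frac{1}{\tilde\pi_{0i}}\right)U_i
= -N^{-1/2}\sum_{i=1}^{N}V_i\,\frac{\hat\pi_{0i}-\tilde\pi_{0i}}{\hat\pi_{0i}\tilde\pi_{0i}}\,U_i
= -N^{-1/2}\sum_{i=1}^{N}\tilde\omega_i\,\frac{\hat\pi_{0i}-\tilde\pi_{0i}}{\hat\pi_{0i}}\,U_i \\
&= -N^{-1/2}\sum_{i=1}^{N}\tilde\omega_i U_i\,\frac{\hat\pi_{0i}-\tilde\pi_{0i}}{\tilde\pi_{0i}}+o_p(1) \\
&= -N^{-1/2}\sum_{i=1}^{N}\tilde\omega_i U_i\,\frac{\dot g(\beta_i^{T}Z_i)}{g(\beta_i^{T}Z_i)}\,Z_i^{T}(\hat\beta_i-\beta_i)+o_p(1) \\
&= -N^{-1}\sum_{i=1}^{N}\tilde\omega_i U_i(1-\tilde\pi_{0i})Z_i^{T}\Sigma_i^{-1}\,N^{-1/2}\sum_{j=1}^{N}\left(V_j-\frac{\exp(\beta_i^{T}Z_j)}{1+\exp(\beta_i^{T}Z_j)}\right)Z_j(1-\delta_j)\,K_b\big((\tilde\pi_{0i},X_i)-(\tilde\pi_{0j},X_j)\big)+o_p(1) \\
&= -N^{-1/2}\sum_{j=1}^{N}E[U_iZ_i^{T}\mid\tilde\pi_{0i}=\tilde\pi_{0j},X_i=X_j]\,E[Z_iZ_i^{T}\mid\tilde\pi_{0i}=\tilde\pi_{0j},X_i=X_j]^{-1}\times(\tilde\omega_j-1)Z_j(1-\delta_j)+o_p(1) \\
&= -N^{-1/2}\sum_{j=1}^{N}(\tilde\omega_j-1)(1-\delta_j)\Pi_j+o_p(1),
\end{aligned}$$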

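where $\Pi_j = E[U_iZ_i^{T}\mid\tilde\pi_{0i}=\tilde\pi_{0j},\ X_i=X_j]\,E[Z_iZ_i^{T}\mid\tilde\pi_{0i}=\tilde\pi_{0j},\ X_i=X_j]^{-1}Z_j$, which can be regarded as a linear (conditional) projection of $U_j$ onto the space of $Z_j$ under the inner product $\langle X_i, Y_i\rangle = E(X_iY_i)$. Also note that $E[U_iZ_i^{T}\mid\tilde\pi_{0i}=\tilde\pi_{0j},X_i=X_j]\,E[Z_iZ_i^{T}\mid\tilde\pi_{0i}=\tilde\pi_{0j},X_i=X_j]^{-1}$ is the minimizer of

$$\frac{1}{N}\sum_{i=1}^{N}(U_i-\theta Z_i)^{2}\,K_b\big((\tilde\pi_{0i},X_i)-(\tilde\pi_{0j},X_j)\big)$$

with respect to $\theta$. So $E[Z_i(U_i-\Pi_i)\mid\tilde\pi_{0i},X_i]=0$. Since the first component of $Z_j$ is one, we have that $E[(U_i-\Pi_i)\mid\tilde\pi_{0i},X_i]=0$. So $\hat U$ can be rewritten as

$$\hat U = N^{-1/2}\sum_{i=1}^{N}U_i + N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)(\tilde\omega_i-1)(U_i-\Pi_i)+o_p(1). \tag{A.3}$$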
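One step the derivation leaves implicit is how $I_2$ and $I_3$ combine into the second term of (A.3). A sketch, assuming (as the definition of $\hat\pi_i$ and the all-cases-sampled design suggest) that every case has $V_i = 1$ and inclusion probability one, so that $\tilde\omega_i - 1 = 0$ whenever $\delta_i = 1$:

$$I_2 + I_3 = N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)(\tilde\omega_i-1)U_i \;-\; N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)(\tilde\omega_i-1)\Pi_i + o_p(1) = N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)(\tilde\omega_i-1)(U_i-\Pi_i)+o_p(1).$$

Adding $I_1 = N^{-1/2}\sum_{i=1}^{N}U_i$ then gives (A.3).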

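It follows from Cai and Zheng1 that $\hat U$ is asymptotically normal, with asymptotic variance

$$\Sigma_U = E(U_i^{2}) + E\left\{N^{-1}\sum_{i=1}^{N}(1-\delta_i)(\tilde\omega_i-1)^{2}(U_i-\Pi_i)^{2}\right\}+o_p(1) = E(U_i^{2}) + E\left[\left(\frac{1-\tilde\pi_{0i}}{\tilde\pi_{0i}}\right)(1-\delta_i)(U_i-\Pi_i)^{2}\right]+o_p(1).$$

This holds because the interaction term is

$$E\left[N^{-1}\sum_{i\neq j}^{N}(\tilde\omega_i-1)(\tilde\omega_j-1)(U_i-\Pi_i)(U_j-\Pi_j)\right] = (N-1)\,E\,\mathrm{Cov}\big(\tilde\omega_i\{U_i-\Pi_i\},\ \tilde\omega_j\{U_j-\Pi_j\}\mid\mathcal{D}\big) = -\frac{m(N-1)}{N}\int\eta(t,X_i,\delta_i)\,\eta(t,X_j,\delta_j)\,\frac{d\Lambda_{NCC}(t)}{P(X\ge t)} = 0,$$

where $\Lambda_{NCC}(t)=\int_0^{t}\frac{dA_{NCC}(u)}{P(X\ge u)}$, $A_{NCC}(t)=E\{N_i(t)\}$ and

$$\eta(t,X_i,\delta_i)\equiv E\left[\{U_i-\Pi_i\}\,I(X_i\ge t)\,\frac{(1-\tilde\pi_{0i})}{\tilde\pi_{0i}}\right] = E\left(E\left[\{U_i-\Pi_i\}\,I(X_i\ge t)\,\frac{(1-\tilde\pi_{0i})}{\tilde\pi_{0i}}\,\Big|\,\tilde\pi_{0i},X_i\right]\right)=0 \tag{A.4}$$

by the arguments before (A.3) and similar arguments to those of Samuelsen16.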

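From Cai and Zheng1, we know that the asymptotic variance of the TIPW estimator $\hat U = N^{-1/2}\sum_{i=1}^{N}\tilde\omega_iU_i$ is

$$\Sigma_{TIPW} = E(U_i^{2}) + E\left(U_i^{2}\,\frac{1-\tilde\pi_{0i}}{\tilde\pi_{0i}}\right) - m\int\eta_u(t,X_i,\delta_i)^{2}\,\frac{d\Lambda_{NCC}(t)}{P(X\ge t)}+o_p(1) = E\left(\frac{U_i^{2}}{\tilde\pi_{0i}}\right) - m\int\eta_u(t,X_i,\delta_i)^{2}\,\frac{d\Lambda_{NCC}(t)}{P(X\ge t)}+o_p(1),$$

where $\eta_u(t,X_i,\delta_i)=E\left[U_i\,I(X_i\ge t)\,\frac{(1-\tilde\pi_{0i})}{\tilde\pi_{0i}}\right]$.

Comparing these two asymptotic variances, we have

$$\begin{aligned}
\Sigma_{TIPW}-\Sigma_U &= E\left[(1-\delta_i)\left(\frac{1-\tilde\pi_{0i}}{\tilde\pi_{0i}}\right)\{U_i^{2}-(U_i-\Pi_i)^{2}\}\right] - m\int\eta_u(t,X_i,\delta_i)^{2}\,\frac{d\Lambda_{NCC}(t)}{P(X\ge t)} \\
&= E\left\{(1-\delta_i)\left(\frac{1-\tilde\pi_{0i}}{\tilde\pi_{0i}}\right)\Pi_i^{2}\right\} - m\int\eta_u(t,X_i,\delta_i)^{2}\,\frac{d\Lambda_{NCC}(t)}{P(X\ge t)} \\
&= \mathrm{var}\left\{N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)(\tilde\omega_i-1)\Pi_i\right\}\ge 0,
\end{aligned}$$

where the last equality holds similarly to (A.4). That is, $E\left[U_i\,I(X_i\ge t)\,\frac{(1-\tilde\pi_{0i})}{\tilde\pi_{0i}}\right]=E\left[\Pi_i\,I(X_i\ge t)\,\frac{(1-\tilde\pi_{0i})}{\tilde\pi_{0i}}\right]$. Therefore, the proposed AIPW estimators are more efficient than the true weight based TIPW estimators.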
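The replacement of $U_i^{2}-(U_i-\Pi_i)^{2}$ by $\Pi_i^{2}$ in the variance comparison rests on an orthogonality step that can be sketched as follows. Writing $\Pi_i=\theta(\tilde\pi_{0i},X_i)Z_i$ for the projection coefficient $\theta$ is notation introduced here for illustration, and the sketch assumes the conditional-moment identity stated before (A.3) continues to hold among controls ($\delta_i=0$):

$$U_i^{2}-(U_i-\Pi_i)^{2} = \Pi_i^{2} + 2\,\Pi_i(U_i-\Pi_i), \qquad E[\Pi_i(U_i-\Pi_i)\mid\tilde\pi_{0i},X_i] = \theta(\tilde\pi_{0i},X_i)\,E[Z_i(U_i-\Pi_i)\mid\tilde\pi_{0i},X_i] = 0.$$

Because the factor $(1-\tilde\pi_{0i})/\tilde\pi_{0i}$ is fixed given $(\tilde\pi_{0i},X_i)$, the cross term drops out in expectation and only the $\Pi_i^{2}$ term remains.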

APPENDIX B.

Now we derive the specific forms of Ui in the general form (A.2) for all the related estimators of interest. Then the asymptotic variances of these estimators can be obtained using the results in Appendix A.

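For $\hat\gamma$, similarly to Cai and Zheng1, we have that

$$N^{1/2}(\hat\gamma-\bar\gamma) = N^{-1/2}\sum_{i=1}^{N}\hat\omega_i U_{\bar\gamma i}+o_p(1),$$

where

$$\begin{aligned}
U_{\bar\gamma i} &= D(\bar\gamma)^{-1}\int\left\{Y_i-\frac{I^{(1)}(t)}{I^{(0)}(t)}\right\}dM_i(t), \\
D(\bar\gamma) &= N^{-1}\sum_{i=1}^{N}\delta_i\left\{\frac{I^{(2)}(X_i)}{I^{(0)}(X_i)}-\frac{I^{(1)}(X_i)^{2}}{I^{(0)}(X_i)^{2}}\right\}, \\
I^{(k)}(t,\gamma) &= N^{-1}\sum_{i=1}^{N}\hat\omega_i\,I(X_i\ge t)\exp(Y_i^{T}\gamma)\,Y_i^{k},\quad k=0,1,2, \\
I^{(k)}(t) &= N^{-1}\sum_{i=1}^{N}\hat\omega_i\,I(X_i\ge t)\exp(Y_i^{T}\bar\gamma)\,Y_i^{k},\quad k=0,1,2, \\
A_i(t) &= \int_0^{t}I(X_i\ge u)\exp(Y_i^{T}\bar\gamma)\,d\Lambda_0(u),\quad\text{and}\quad M_i(t)=N_i(t)-A_i(t).
\end{aligned}$$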

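For $\hat\Lambda(t\mid r)$, we have

$$\begin{aligned}
N^{1/2}\{\hat\Lambda(t\mid r)-\Lambda(t\mid r)\}
&= N^{1/2}\int_0^{t}\frac{\sum_{i=1}^{N}\hat\omega_iK_h(\hat\gamma^{T}Y_i-r)\,dN_i(u)}{\sum_{i=1}^{N}\hat\omega_iK_h(\hat\gamma^{T}Y_i-r)\,I(X_i\ge u)} - N^{1/2}\Lambda(t\mid r) \\
&= N^{1/2}\int_0^{t}\frac{\sum_{i=1}^{N}\hat\omega_iK_h(\hat\gamma^{T}Y_i-r)\,dM_i(u)}{\sum_{i=1}^{N}\hat\omega_iK_h(\hat\gamma^{T}Y_i-r)\,I(X_i\ge u)} \\
&= N^{1/2}\int_0^{t}\frac{\sum_{i=1}^{N}\hat\omega_iK_h(\hat\gamma^{T}Y_i-r)\,dM_i(u)}{\sum_{i=1}^{N}\hat\omega_iK_h(\hat\gamma^{T}Y_i-r)\,I(X_i\ge u)} - N^{1/2}\int_0^{t}\frac{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,dM_i(u)}{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,I(X_i\ge u)} \\
&\quad + N^{1/2}\int_0^{t}\frac{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,dM_i(u)}{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,I(X_i\ge u)} \\
&= \left[\int_0^{t}\frac{\sum_{i=1}^{N}\hat\omega_i\frac{\dot K_h(Y_i^{T}\bar\gamma-r)}{h}Y_i\,dM_i(u)}{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,I(X_i\ge u)} - \int_0^{t}\frac{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,dM_i(u)\left\{\sum_{l=1}^{N}\hat\omega_l\frac{\dot K_h(Y_l^{T}\bar\gamma-r)}{h}Y_l\,I(X_l\ge u)\right\}}{\left\{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,I(X_i\ge u)\right\}^{2}}\right]N^{1/2}(\hat\gamma-\bar\gamma) \\
&\quad + N^{1/2}\int_0^{t}\frac{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,dM_i(u)}{\sum_{i=1}^{N}\hat\omega_iK_h(Y_i^{T}\bar\gamma-r)\,I(X_i\ge u)}.
\end{aligned}$$

So the $U_i$ form in (A.2) for $\hat\Lambda(t\mid r)$ is

$$\begin{aligned}
U_{\Lambda i}(t\mid r) &= U_{\bar\gamma i}^{T}\left[\int_0^{t}\frac{N^{-1}\sum_{j=1}^{N}\hat\omega_j\frac{\dot K_h(Y_j^{T}\bar\gamma-r)}{h}Y_j\,dM_j(u)}{N^{-1}\sum_{j=1}^{N}\hat\omega_jK_h(Y_j^{T}\bar\gamma-r)\,I(X_j\ge u)} - \int_0^{t}\frac{\sum_{j=1}^{N}\hat\omega_jK_h(Y_j^{T}\bar\gamma-r)\,dM_j(u)\left\{\sum_{l=1}^{N}\hat\omega_l\frac{\dot K_h(Y_l^{T}\bar\gamma-r)}{h}Y_l\,I(X_l\ge u)\right\}}{\left\{\sum_{j=1}^{N}\hat\omega_jK_h(Y_j^{T}\bar\gamma-r)\,I(X_j\ge u)\right\}^{2}}\right] \\
&\quad + \int_0^{t}\frac{K_h(Y_i^{T}\bar\gamma-r)\,dM_i(u)}{N^{-1}\sum_{j=1}^{N}\hat\omega_jK_h(Y_j^{T}\bar\gamma-r)\,I(X_j\ge u)}.
\end{aligned}$$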

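Recalling that $\hat S(t\mid r)=\exp\{-\hat\Lambda(t\mid r)\}$, we have that the $U_i$ form in (A.2) for $\hat S(t\mid r)$ is

$$U_{Si}(t\mid r) = -S(t\mid r)\,U_{\Lambda i}(t\mid r).$$

Recalling that $\hat F(r)=\frac{\sum_{i=1}^{N}\hat\omega_i\,I(\hat R_{\mathrm{all},i}\le r)}{\sum_{i=1}^{N}\hat\omega_i}$, we get that the $U_i$ form in (A.2) for $\hat F(r)$ is

$$U_{Fi}(r) = I(R_i\le r) - F(r) + D_{\bar\gamma}(r)\,U_{\bar\gamma i},\quad\text{where}\quad D_{\bar\gamma}(r)=\left.\frac{\partial E[I(R_i\le r)]}{\partial\gamma}\right|_{\gamma=\bar\gamma}.$$

Recalling $\hat S(r,t)=\int_r^{\infty}\hat S(t\mid u)\,d\hat F(u)$, we have that the $U_i$ form in (A.2) for $\hat S(r,t)$ is

$$U_{Si}(t,r) = \int_r^{\infty}U_{Si}(t\mid u)\,dF(u) + \int_r^{\infty}S(t\mid u)\,dU_{Fi}(u).$$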

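It follows that the $U_i$ forms for the accuracy parameter estimators are

$$\begin{aligned}
U_{\mathrm{TPR}_t i}(r) &= \frac{\mathrm{TPR}_t(r)\,U_{Si}(t,r_l)-U_{Fi}(r)-U_{Si}(t,r)}{1-S(t)}, &
U_{\mathrm{FPR}_t i}(r) &= \frac{U_{Si}(t,r)-\mathrm{FPR}_t(r)\,U_{Si}(t,r_l)}{S(t)}, \\
U_{\mathrm{PPV}_t i}(r) &= \frac{\{\mathrm{PPV}_t(r)-1\}\,U_{Fi}(r)-U_{Si}(t,r)}{1-F(r)}, &
U_{\mathrm{NPV}_t i}(r) &= \frac{U_{Si}(t)-U_{Si}(t,r)-\mathrm{NPV}_t(r)\,U_{Fi}(r)}{F(r)}.
\end{aligned}$$

Thus, we get the forms of $U_i$ in (A.2) for the regression parameter estimator $\hat\gamma$ and the accuracy parameter estimators $\widehat{\mathrm{TPR}}_t(c)$, $\widehat{\mathrm{FPR}}_t(c)$, $\widehat{\mathrm{PPV}}_t(c)$, $\widehat{\mathrm{NPV}}_t(c)$.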
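The numerators and denominators of these $U_i$ forms correspond to writing each accuracy measure as a ratio of $F(r)$, $S(t,r)$ and $S(t)$. The sketch below spells out those ratios; the definitions are the ones implied by the influence-function forms rather than restated from the main text, so they should be checked against it, and the function name and numerical inputs are hypothetical.

```python
def accuracy_measures(F_r: float, S_tr: float, S_t: float) -> dict:
    """Accuracy measures at threshold r and horizon t, given
    F_r = P(R <= r), S_tr = P(T > t, R > r) and S_t = P(T > t)."""
    return {
        "TPR": (1 - F_r - S_tr) / (1 - S_t),  # P(R > r | T <= t)
        "FPR": S_tr / S_t,                    # P(R > r | T > t)
        "PPV": (1 - F_r - S_tr) / (1 - F_r),  # P(T <= t | R > r)
        "NPV": (S_t - S_tr) / F_r,            # P(T > t | R <= r)
    }

# Toy inputs: 2% event rate by t, 20% of scores above r,
# and half of the events occurring among those high scores
measures = accuracy_measures(F_r=0.80, S_tr=0.19, S_t=0.98)
```

Applying the quotient rule to each ratio, with $U_{Fi}$ and $U_{Si}$ standing in for the perturbations of $F$ and $S$, gives expressions of the same shape as the displayed $U_i$ forms.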

APPENDIX C.

In this section, we show the validity of the proposed resampling technique.

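The derivative of the perturbed estimating function $\hat U^{*}_{\tilde\pi_{0i},X_i}(\beta_i)$ with respect to $\beta_i$ is

$$\frac{\partial\hat U^{*}_{\tilde\pi_{0i},X_i}(\beta_i)}{\partial\beta_i} = -\frac{1}{N}\sum_{j=1}^{N}I_j\,\frac{\exp(\beta_i^{T}Z_j)}{\{1+\exp(\beta_i^{T}Z_j)\}^{2}}\,Z_jZ_j^{T}(1-\delta_j)\,K_b\big((\tilde\pi_{0i},X_i)-(\tilde\pi_{0j},X_j)\big)+o_p(1),$$

which also converges to $-\Sigma_i$. It follows that

$$N^{1/2}(\hat\beta_i^{*}-\beta_i) = \Sigma_i^{-1}N^{-1/2}\sum_{j=1}^{N}I_j\left(V_j-\frac{\exp(\beta_i^{T}Z_j)}{1+\exp(\beta_i^{T}Z_j)}\right)Z_j(1-\delta_j)\,K_b\big((\tilde\pi_{0i},X_i)-(\tilde\pi_{0j},X_j)\big)+o_p(1).$$

The perturbed form of (A.2) is

$$\begin{aligned}
\hat U^{*} &= N^{-1/2}\sum_{i=1}^{N}\left[\delta_iI_i+(1-\delta_i)\frac{V_{0i}I_i}{\hat\pi^{*}_{0i}}\right]U_i \\
&= N^{-1/2}\sum_{i=1}^{N}\left\{I_i\delta_i+(1-\delta_i)I_i+(1-\delta_i)\left(\frac{V_{0i}}{\tilde\pi_{0i}}-1\right)I_i+(1-\delta_i)\left[\frac{V_{0i}I_i}{\hat\pi^{*}_{0i}}-\frac{V_{0i}I_i}{\tilde\pi_{0i}}\right]\right\}U_i \\
&= N^{-1/2}\sum_{i=1}^{N}I_iU_i + N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)I_i\left(\frac{V_{0i}}{\tilde\pi_{0i}}-1\right)U_i - N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)\frac{V_{0i}I_i}{\tilde\pi_{0i}}\,\frac{\hat\pi^{*}_{0i}-\tilde\pi_{0i}}{\hat\pi^{*}_{0i}}\,U_i \\
&= N^{-1/2}\sum_{i=1}^{N}I_iU_i + N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)I_i(\tilde\omega_i-1)(U_i-\Pi_i)+o_p(1),
\end{aligned}$$

where the last equation follows similarly to the derivation of $I_3$ in Appendix A.

From (A.3), we know

$$\hat U = N^{-1/2}\sum_{i=1}^{N}U_i + N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)(\tilde\omega_i-1)(U_i-\Pi_i)+o_p(1).$$

It follows that

$$\hat U^{*}-\hat U = N^{-1/2}\sum_{i=1}^{N}(I_i-1)U_i + N^{-1/2}\sum_{i=1}^{N}(1-\delta_i)(I_i-1)(\tilde\omega_i-1)(U_i-\Pi_i)+o_p(1).$$

Therefore,

$$\mathrm{Var}(\hat U^{*}-\hat U\mid\mathcal{D}) = \mathrm{Var}(\hat U).$$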
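The idea behind the variance identity can be illustrated with a deliberately simplified sketch: a weighted mean is recomputed many times with random multipliers $I_i$, and the spread of the perturbed estimates is used as a standard error. Two assumptions are made here that are not taken from the text: the multipliers are drawn as Exp(1) (mean 1, variance 1), and the weights are held fixed, whereas the proposed procedure also perturbs the estimating equation for the weights. All names and data are hypothetical.

```python
import random
import statistics

def weighted_mean(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def perturbation_se(y, w, n_perturb=500, seed=0):
    """Standard error of a weighted mean from perturbation resampling with
    i.i.d. Exp(1) multipliers I_i; the weights w are held fixed."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_perturb):
        I = [rng.expovariate(1.0) for _ in y]
        # Perturbed estimator: each subject's weight is multiplied by I_i
        draws.append(weighted_mean(y, [Ii * wi for Ii, wi in zip(I, w)]))
    return statistics.stdev(draws)

# Toy data with unit weights, so the result can be compared with the
# usual standard error of a sample mean
y = [i % 10 for i in range(200)]
w = [1.0] * len(y)

estimate = weighted_mean(y, w)
se = perturbation_se(y, w)
```

With unit weights the perturbation standard error should be close to the sample standard deviation divided by the square root of the sample size, about 0.2 for these toy data.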

References

  • 1. Cai T, Zheng Y. Evaluating prognostic accuracy of biomarkers under nested case-control studies. Biostatistics 2012; 13(1): 89–100.
  • 2. Zheng Y, Brown M, Lok A, Cai T, others. Improving efficiency in biomarker incremental value evaluation under two-phase designs. The Annals of Applied Statistics 2017; 11(2): 638–654.
  • 3. Pepe M, Feng Z, Janes H, Bossuyt P, Potter J. Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. Journal of the National Cancer Institute 2008; 100(20): 1432–1438.
  • 4. Johnson SR, Anderson GL, Barad DH, Stefanick ML. The Women’s Health Initiative: rationale, design and progress report. British Menopause Society Journal 1999; 5(4): 155–159.
  • 5. Colditz GA, Manson JE, Hankinson SE. The Nurses’ Health Study: 20-year contribution to the understanding of health among women. Journal of Women’s Health 1997; 6(1): 49–62.
  • 6. Prentice RL, Breslow N. Retrospective studies and failure time models. Biometrika 1978: 153–158.
  • 7. Breslow NE, Day NE, others. Statistical Methods in Cancer Research. 1 International Agency for Research on Cancer; Lyon. 1980.
  • 8. Prentice R. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 1986; 73(1): 1.
  • 9. Martin LJ, Melnichouk O, Huszti E, et al. Serum lipids, lipoproteins, and risk of breast cancer: a nested case-control study using multiple time points. JNCI: Journal of the National Cancer Institute 2015; 107(5).
  • 10. Chambers JC, Loh M, Lehne B, et al. Epigenome-wide association of DNA methylation markers in peripheral blood from Indian Asians and Europeans with incident type 2 diabetes: a nested case-control study. The Lancet Diabetes & Endocrinology 2015; 3(7): 526–534.
  • 11. Jensen MK, Rimm EB, Furtado JD, Sacks FM. Apolipoprotein C-III as a potential modulator of the association between HDL-cholesterol and incident coronary heart disease. Journal of the American Heart Association 2012; 1(2): e000232.
  • 12. Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the Cox regression model. The Annals of Statistics 1992: 1903–1928.
  • 13. Langholz B, Borgan Ø. Estimation of absolute risk from nested case-control data. Biometrics 1997: 767–774.
  • 14. Scheike TH, Juul A. Maximum likelihood estimation for Cox’s regression model under nested case-control sampling. Biostatistics 2004; 5(2): 193–206.
  • 15. Zeng D, Lin D, Avery C, North K, Bray M. Efficient semiparametric estimation of haplotype-disease associations in case–cohort and nested case–control studies. Biostatistics 2006; 7(3): 486–502.
  • 16. Samuelsen SO. A pseudolikelihood approach to analysis of nested case-control studies. Biometrika 1997; 84(2): 379–394.
  • 17. Lu W, Liu M. On estimation of linear transformation models with nested case–control sampling. Lifetime Data Analysis 2012; 18(1): 80–93.
  • 18. Cai T, Zheng Y. Evaluating prognostic accuracy of biomarkers in nested case-control studies. Biostatistics 2011; 13(1): 89–100.
  • 19. Cai T, Zheng Y. Nonparametric evaluation of biomarker accuracy under nested case-control studies. Journal of the American Statistical Association 2011; 106(494): 569–580.
  • 20. Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association 1989; 84(408): 1074–1078.
  • 21. Cai T, Tian L, Uno H, Solomon SD, Wei L. Calibrating parametric subject-specific risk estimation. Biometrika 2010; 97(2): 389–404.
  • 22. Davidson WS, Silva RGD, Chantepie S, Lagor WR, Chapman MJ, Kontush A. Proteomic analysis of defined HDL subpopulations reveals particle-specific protein clusters: relevance to antioxidative function. Arteriosclerosis, Thrombosis, and Vascular Biology 2009; 29(6): 870–876.
  • 23. Andrikoula M, McDowell I. The contribution of ApoB and ApoA1 measurements to cardiovascular risk assessment. Diabetes, Obesity and Metabolism 2008; 10(4): 271–278.
  • 24. Movva R, Rader DJ. Laboratory assessment of HDL heterogeneity and function. Clinical Chemistry 2008; 54(5): 788–800.
  • 25. Kohan AB. ApoC-III: a potent modulator of hypertriglyceridemia and cardiovascular disease. Current Opinion in Endocrinology, Diabetes, and Obesity 2015; 22(2): 119.
  • 26. Colditz GA, Philpott SE, Hankinson SE. The impact of the Nurses’ Health Study on population health: prevention, translation, and control. American Journal of Public Health 2016; 106(9): 1540–1545.
  • 27. Kang S. Fitting semiparametric accelerated failure time models for nested case–control data. Journal of Statistical Computation and Simulation 2017; 87(4): 652–663.
  • 28. Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. The Annals of Statistics 1982: 1100–1120.
  • 29. Du Y, Akritas M. Uniform strong representation of the conditional Kaplan-Meier process. Mathematical Methods of Statistics 2002; 11(2): 152–182.
