Biomarker Evaluation Under Imperfect Nested Case–control Design

Xuan Wang; Yingye Zheng; Majken Karoline Jensen; Zeling He; Tianxi Cai

doi:10.1002/sim.9012

Author manuscript; available in PMC: 2021 Aug 15.

Published in final edited form as: Stat Med. 2021 Apr 29;40(18):4035–4052. doi: 10.1002/sim.9012

PMCID: PMC8286316 NIHMSID: NIHMS1705660 PMID: 33915597

Summary

The nested case–control (NCC) design has been widely adopted as a cost-effective sampling design for biomarker research. Under the NCC design, markers are only measured for the NCC subcohort consisting of all cases and a fraction of the controls selected randomly from the matched risk sets of the cases. Robust methods for evaluating prediction performance of risk models have been derived under the inverse probability weighting (IPW) framework. The probabilities of samples being included in the NCC cohort can be calculated based on the study design¹ or estimated non-parametrically². Neither strategy works well due to model mis-specification and the curse of dimensionality in practical settings where the sampling does not entirely follow the study design or depends on many factors. In this paper, we propose an alternative strategy to estimate the sampling probabilities based on a varying coefficient model, which attains a balance between robustness and the curse of dimensionality. The complex correlation structure induced by repeated finite risk set sampling makes the standard resampling procedure for variance estimation fail. We propose a perturbation resampling procedure that provides valid interval estimation for the proposed estimators. Simulation studies show that the proposed method performs well in finite samples. We apply the proposed method to the Nurses’ Health Study II to develop and evaluate prediction models using clinical biomarkers for cardiovascular risk.

Keywords: finite population sampling, inverse probability weighting, nonparametric smoothing, resampling, risk prediction

1 ∣. INTRODUCTION

Conducting rigorous biomarker validation studies is an important step in the translation of novel biomarkers into routine clinical practice for medical decision making. Such studies should follow design principles in sample selection to avoid bias³. Large prospective studies, such as the Women’s Health Initiative Study and the Nurses’ Health Study, with exposures captured and biologic samples collected and stored prior to disease onset, can serve as a platform for biomarker research^4,5. However, measuring biomarkers for large prospective cohorts is highly resource consuming. To make efficient use of stored samples from a cohort, two-phase sampling designs, including nested case–control (NCC) and case–cohort (CCH) studies, are often adopted as resource-efficient sampling strategies, especially when the outcome of interest is rare^6,7,8.

Under the NCC design, new markers are measured for all cases and a subset of controls randomly sampled without replacement from the risk sets of the cases. Sometimes controls are also matched to cases on variables such as gender and age. Many well-known biomarker studies nested in large cohorts have employed the NCC design (e.g.,^9,10,11). For example, in the Nurses’ Health Study II (NHS II), two novel biomarkers, apoA1 concentration in whole plasma (WPA1) and apoE concentration in whole plasma (apoE), were investigated for predicting the risk of myocardial infarction (MI)¹¹. Due to limited resources and the low incidence of MI, the biomarkers were measured on a nested case–control set, which included all cases and controls sampled from the 1:1 matched risk set of the cases, with matching variables including smoking status, fasting status, age, and timing of blood collection.

To analyze NCC data, the conditional logistic regression (CLR) model has traditionally been used when the focus is on the estimation of hazard ratio (HR) parameters, and sometimes the estimation of absolute risk parameters^12,13. The CLR provides HR estimators under the Cox model for the full cohort, but it cannot be extended to other models. Nor can it be used for estimating other parameters, such as prediction accuracy parameters, which involve the distribution of the markers in the full cohort. For model parameter estimation, fully efficient maximum likelihood estimators (MLE) have also been proposed^14,15. The MLE relies on the correct specification of the failure time model and requires that censoring be independent of the novel markers, as well as additional modeling assumptions when there are multiple novel biomarkers and routine clinical variables measured on the full cohort.

As a flexible alternative, the inverse probability weighting (IPW) approach has been developed¹⁶. Recently, IPW estimators have also been developed for fitting models beyond the Cox model as well as for prediction performance measures including the receiver operating characteristic (ROC) curve, positive predictive value (PPV) and negative predictive value (NPV)^17,18,19. Most existing IPW estimators for NCC studies calculate the true IPW (TIPW) sampling weights according to the study design and are consistent provided that the sampling weights are correctly obtained. However, the TIPW estimators may be invalid if the sampling is not implemented exactly according to the design. Such a scenario arises, for example, when the matching criteria are applied only coarsely during the implementation of the sampling scheme due to practical concerns. Such an imperfect NCC design poses additional analytical challenges for estimating and evaluating risk prediction models. To overcome the bias, one may estimate the sampling weights non-parametrically via kernel smoothing as in Zheng et al.². Obtaining such a non-parametric augmented IPW (NP) estimator, however, is not feasible when the number of matching variables is not very small, due to the curse of dimensionality.

In this paper, we propose a new semi-non-parametric AIPW estimator, where the selection probability is estimated based on a flexible varying coefficient model. The AIPW estimator can incorporate a larger number of matching variables while remaining robust to deviations from the intended sampling scheme. We derive the asymptotic properties of the proposed estimators and develop a resampling method to assess their variability.

The remainder of the paper is organized as follows. In Section 2, we provide model specification and describe the proposed point estimation procedures. Our proposed interval estimation procedure is given in Section 3. In Section 4, we report results of simulation studies to assess the finite sample performance of the proposed method. In Section 5, the data from NHS II is analyzed as illustration. Concluding remarks are given in Section 6. Theoretical studies of the proposed estimators are provided in the Appendix.

2 ∣. ESTIMATING SAMPLING WEIGHTS VIA AIPW

Let T denote the survival outcome of interest and $Y = (Y_{old}^{T}, Y_{new}^{T})^{T}$ denote the vector of predictors for T, where Y_old denotes the vector of routine markers and Y_new denotes the vector of novel biomarkers. Due to censoring, T is only observed through the bivariate vector (X, δ), where X = T ∧ C, δ = I(T ≤ C), and C is the censoring time. Under the NCC design, Y_new is only measured if V = 1, where V = δ + (1 − δ)V₀ and V₀ is a binary indicator for whether a subject is sampled into the NCC subcohort as a control. We assume that the sampling of the controls is performed by matching to the cases according to a vector of matching variables M. Suppose that the underlying data for the full cohort consist of N independent and identically distributed random vectors, $D = \{D_{i} = (X_{i}, δ_{i}, Y_{i}^{T}, M_{i}^{T})^{T}, i = 1, \dots, N\}$, while the observed data consist of $O = \{O_{i} = (X_{i}, δ_{i}, Y_{old, i}^{T}, V_{i} Y_{new, i}^{T}, M_{i}^{T})^{T}, i = 1, \dots, N\}$. Let Ω = {i : 1 ≤ i ≤ N} and Ω_ncc = {i : 1 ≤ i ≤ N, V_i = 1} respectively denote the index sets for the full cohort and the NCC subcohort.

Under the matched NCC design, for a case with event time X_i and matching variables M_i, m controls are sampled from the matched risk set

$$R_{W_{i}} = \{k : X_{k} \geq X_{i},\ |M_{k} - M_{i}| \leq a_{0}\},$$

where a₀ is a predetermined range vector and $W_{i} = (δ_{i}, X_{i}, M_{i}^{T}, Y_{old, i})^{T}$. Let ${\bar{π}}_{i} = P (V_{i} = 1 ∣ O)$ and ${\bar{π}}_{0 i} = P (V_{i} = 1 ∣ O, δ_{i} = 0)$ denote the true sampling probabilities under possibly imperfect NCC sampling. If the NCC sampling were implemented exactly according to the design, then ${\bar{π}}_{i} = δ_{i} + (1 - δ_{i}) {\bar{π}}_{0 i}$ could be calculated as ${\tilde{π}}_{i} = δ_{i} + (1 - δ_{i}) {\tilde{π}}_{0 i}$, where ${\tilde{π}}_{0 i} = {\tilde{π}}_{0} (W_{i})$,

$$\tilde{\pi}_{0}(W_{i}) = 1 - \prod_{j : j \in R_{W_{i}}} \left\{1 - \frac{m \delta_{j}}{|R_{W_{j}}| - 1}\right\}$$

and $∣ R_{W_{i}} ∣$ is the size of $R_{W_{i}}$ ¹⁶. Under the perfect NCC design, TIPW estimators can be constructed by weighting observations with the true weights ${\tilde{ω}}_{i} = V_{i} ∕ {\tilde{π}}_{i}$ . To improve efficiency and robustness over the TIPW estimators, Zheng et al.² proposed NP estimators using non-parametrically estimated weights ${\hat{ω}}_{i}^{NP} = V_{i} ∕ {\hat{π}}^{NP} (W_{i})$ , where ${\hat{π}}^{NP} (W_{i}) = δ_{i} + (1 - δ_{i}) {\hat{π}}_{0}^{NP} (W_{i})$ ,

$$\hat{\pi}_{0}^{NP}(w) = \frac{\sum_{i=1}^{N} (1 - \delta_{i}) V_{i} K_{b}(W_{i} - w)}{\sum_{i=1}^{N} (1 - \delta_{i}) K_{b}(W_{i} - w)}$$

is a non-parametric estimate of π₀(w) = P(V_i = 1 ∣ W_i = w, δ_i = 0), $K_{b} (w) = b^{- q} \prod_{j = 1}^{q} K (w_{j} ∕ b)$ , K(·) is a symmetric density function, and b > 0 denotes the bandwidth.
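As an illustration of the design-based probability ${\tilde{π}}_{0}$ defined above, the following is a minimal Python sketch rather than the authors' implementation. It assumes that the product is taken over the cases j whose matched risk set contains subject i (that is, X_i ≥ X_j and |M_i − M_j| ≤ a₀ componentwise), which is one reading of the index set in the display. The function name `design_prob_controls` and its argument layout are hypothetical.

```python
def design_prob_controls(X, delta, M, a0, m=1):
    """Design-based probability that each subject is ever sampled as a control.

    Assumption (not stated in this form in the text): the product runs over
    cases j whose matched risk set contains subject i.
    X: observed times, delta: event indicators, M: tuples of matching
    variables, a0: matching window per variable, m: controls per case.
    """
    n = len(X)

    def in_risk_set(k, j):
        # True when subject k belongs to the matched risk set of case j.
        return X[k] >= X[j] and all(
            abs(mk - mj) <= a for mk, mj, a in zip(M[k], M[j], a0)
        )

    # |R_j| for every case j; the case itself is counted in its own risk set.
    size = {
        j: sum(in_risk_set(k, j) for k in range(n))
        for j in range(n)
        if delta[j] == 1
    }

    probs = []
    for i in range(n):
        not_sampled = 1.0
        for j, r in size.items():
            if j != i and r > 1 and in_risk_set(i, j):
                # Chance of escaping selection among the r - 1 eligible controls.
                not_sampled *= 1.0 - min(1.0, m / (r - 1))
        probs.append(1.0 - not_sampled)
    return probs
```

The double loop costs O(N²) risk-set checks, which is acceptable for a sketch but would need sorting by X_i for large cohorts.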

While the NP method can be used to incorporate imperfect NCC designs, it is infeasible when the dimension of W is not small. To overcome the limitations of TIPW and NP methods, we propose a semi-non-parametric AIPW method by approximating ${\bar{π}}_{0 i}$ via a flexible varying coefficient model

$$\pi_{0i} = g\{\beta(\tilde{\pi}_{0i}, X_{i})^{T} Z_{i}\} \quad \text{with} \quad g(x) = \frac{e^{x}}{1 + e^{x}}, \tag{2.1}$$

where Z_i = (1, Φ₁(Y_{old, i})^⊤, Φ₂(M_i)^⊤)^⊤, Φ₁(·) and Φ₂(·) are basis functions that allow potential non-linear effects, and β(π, x) is the unknown coefficient function. In practice, we find that commonly used B-spline or natural spline bases with 3 degrees of freedom work well. Equally spaced knots that cover most of the domain of the data are also desirable. We find that our results are not overly sensitive to the choice of the basis functions, provided that they are flexible enough to capture non-linear effects. Under perfect NCC sampling, $β ({\tilde{π}}_{0 i}, X_{i}) = (g^{- 1} ({\tilde{π}}_{0 i}), 0^{T})^{T}$. On the other hand, when the sampling is imperfect, the flexible model provides an accurate approximation to the true sampling probabilities while overcoming the curse of dimensionality associated with NP procedures.

To estimate β(π, x), we maximize a local logistic log-likelihood using observed data on $\{(V_{i}, Z_{i}, X_{i}, {\tilde{π}}_{0 i}) : δ_{i} = 0\}$. Specifically, for any given (π, x), we estimate β(π, x) as $\hat{β} (π, x)$, the solution to the estimating equation

$$\hat{U}_{\pi, x}(\beta) = N^{-1} \sum_{i=1}^{N} K_{b}(\tilde{\pi}_{0i} - \pi, X_{i} - x)(1 - \delta_{i}) Z_{i} \{V_{i} - g(\beta^{T} Z_{i})\} = 0,$$

where K_b(π, x) = (b₁b₂)⁻¹K(π/b₁)K(x/b₂), K(·) is a symmetric density function, and b = (b₁, b₂)^⊤ is the vector of bandwidth parameters, which tend to 0 as N → ∞. With $\hat{β} (π, x)$, we estimate the sampling probability for the ith subject as

$$\hat{\pi}_{i} = \delta_{i} + (1 - \delta_{i}) \hat{\pi}_{0i}, \quad \text{where} \quad \hat{\pi}_{0i} = g\{\hat{\beta}(\tilde{\pi}_{0i}, X_{i})^{T} Z_{i}\}. \tag{2.2}$$

Then we construct our AIPW estimator using the augmented weights ${\hat{ω}}_{i} = V_{i} ∕ {\hat{π}}_{i}$ . Under the correct specification of (2.1), we expect that $\max_{1 \leq i \leq N} ∣ {\hat{π}}_{i} - {\bar{π}}_{i} ∣ \to 0$ as N → ∞.
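The local estimating equation for β(π, x) can be solved by Newton–Raphson iterations on a kernel-weighted logistic regression restricted to non-cases. The sketch below is illustrative only: it assumes a Gaussian kernel, a starting value of zero, and a small dense solver, and the names `local_logistic`, `gauss_kernel` and `solve` are hypothetical. The factor N⁻¹ is omitted because it does not change the root.

```python
import math

def expit(x):
    # Numerically stable logistic function g(x) = e^x / (1 + e^x).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gauss_kernel(u):
    # Standard normal density, used here as the symmetric kernel K.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(A[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (A[r][n] - tail) / A[r][r]
    return x

def local_logistic(pi0, x0, pi_tilde, X, Z, V, delta, b1, b2, iters=25):
    """Solve the kernel-weighted logistic score equation at the point (pi0, x0)."""
    p = len(Z[0])
    # Kernel weights K_b; subjects with delta == 1 (cases) get weight zero.
    w = [
        (1 - d) * gauss_kernel((pt - pi0) / b1) * gauss_kernel((x - x0) / b2) / (b1 * b2)
        for pt, x, d in zip(pi_tilde, X, delta)
    ]
    beta = [0.0] * p
    for _ in range(iters):
        score = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for wi, zi, vi in zip(w, Z, V):
            if wi == 0.0:
                continue
            mu = expit(sum(bk * zk for bk, zk in zip(beta, zi)))
            for a in range(p):
                score[a] += wi * zi[a] * (vi - mu)
                for c in range(p):
                    info[a][c] += wi * mu * (1.0 - mu) * zi[a] * zi[c]
        step = solve(info, score)
        beta = [bk + sk for bk, sk in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return beta
```

Evaluating `local_logistic` at each non-case's own (π̃₀ᵢ, Xᵢ) and applying `expit` to the fitted linear predictor gives the estimated probabilities in (2.2). In practice the fit can fail when the local data are separable, so some safeguard on the step size may be needed.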

3 ∣. APPLICATION OF AIPW TO ROBUST RISK PREDICTION

In this section, we illustrate the application of the AIPW approach to developing and evaluating risk prediction models. Since one of the major goals of biomarker studies is to evaluate the predictive capacity of novel biomarkers, we consider quantifying the incremental value of Y_new in predicting T above and beyond routine markers Y_old.

3.1 ∣. Calibrated Risk Estimate

To predict risks based on $Y = (Y_{old}^{T}, Y_{new}^{T})^{T}$ and Y_old, we fit two proportional hazards (PH) models,

$$P(T \geq t \mid Y) = S_{all}(t)^{\exp(\gamma_{all}^{T} Y)}, \tag{2.3}$$

$$P(T \geq t \mid Y_{old}) = S_{old}(t)^{\exp(\gamma_{old}^{T} Y_{old})}, \tag{2.4}$$

where S_all(·) and S_old(·) are unknown baseline survival functions and γ_all and γ_old are the corresponding log hazard ratio parameters. To estimate γ_all and γ_old, we note that Y_new is only available for those in the NCC subcohort while Y_old is observed for all subjects. Thus, we propose to estimate γ_all by maximizing the weighted log partial likelihood with AIPW weights ${\hat{ω}}_{i}$:

$$\hat{\gamma}_{all} = \operatorname{argmax}_{\gamma} \sum_{i=1}^{N} \hat{\omega}_{i} \delta_{i} \left[\gamma^{T} Y_{i} - \log\left\{\sum_{j=1}^{N} \hat{\omega}_{j} I(X_{j} \geq X_{i}) \exp(\gamma^{T} Y_{j})\right\}\right].$$

On the other hand, γ_old can be estimated as the standard maximum partial likelihood estimator, denoted by ${\hat{γ}}_{_{old}}$ . It follows from Lin and Wei²⁰ and the consistency of the sampling probabilities that ${\hat{γ}}_{_{all}}$ and ${\hat{γ}}_{_{old}}$ respectively converge to deterministic vectors ${\bar{γ}}_{_{all}}$ and ${\bar{γ}}_{_{old}}$ as N → ∞, regardless of the adequacy of the survival models (2.3) and (2.4). When models (2.3) and (2.4) hold, then ${\bar{γ}}_{_{all}} = γ_{_{all}}$ and ${\bar{γ}}_{_{old}} = γ_{_{old}}$ .
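To make the weighted partial likelihood concrete, the following Python sketch maximizes it by Newton–Raphson for a single scalar covariate. The restriction to one covariate, the starting value of zero and the function name `weighted_cox_scalar` are simplifying assumptions for illustration; the estimator in the text uses the full vector Y. Subjects outside the NCC subcohort enter with weight zero and therefore drop out of every sum.

```python
import math

def weighted_cox_scalar(X, delta, Y, w, iters=50):
    """Maximise the weighted log partial likelihood for one covariate.

    X: observed times, delta: event indicators, Y: scalar covariate,
    w: sampling weights (zero for subjects without marker measurements).
    """
    n = len(X)
    gamma = 0.0
    for _ in range(iters):
        score, info = 0.0, 0.0
        for i in range(n):
            if delta[i] == 0 or w[i] == 0.0:
                continue
            # Weighted risk-set sums at the event time X[i].
            s0 = s1 = s2 = 0.0
            for j in range(n):
                if X[j] >= X[i] and w[j] > 0.0:
                    e = w[j] * math.exp(gamma * Y[j])
                    s0 += e
                    s1 += e * Y[j]
                    s2 += e * Y[j] * Y[j]
            mean = s1 / s0
            score += w[i] * (Y[i] - mean)            # first derivative
            info += w[i] * (s2 / s0 - mean * mean)   # minus second derivative
        if info <= 0.0:
            break
        step = score / info
        gamma += step
        if abs(step) < 1e-10:
            break
    return gamma
```

Two properties follow directly from the form of the objective: multiplying all weights by a constant leaves the maximizer unchanged, and giving a subject weight 2 is equivalent to entering that subject twice with weight 1.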

To make a prediction for t-year survival, one may obtain a model-based estimate for P(T ≤ t ∣ Y) and P(T ≤ t ∣ Y_old) under (2.3) and (2.4). However, such a risk estimate may not be accurate under model mis-specifications. Following the calibrated risk prediction strategies proposed in Cai et al.²¹, we predict t-year survival risk given Y and Y_old based on

$$S_{all}(t \mid R_{all}) \equiv P(T > t \mid R_{all}) \quad \text{and} \quad S_{old}(t \mid R_{old}) \equiv P(T > t \mid R_{old}),$$

respectively, where $R_{_{all}} = Y^{T} {\bar{γ}}_{_{all}}$ and $R_{_{old}} = Y_{_{old}}^{T} {\bar{γ}}_{_{old}}$ are the limiting risk scores. The calibrated survival risk functions $S_{_{all}} (t ∣ r)$ and $S_{_{old}} (t ∣ r)$ can be non-parametrically estimated as ${\hat{S}}_{_{all}} (t ∣ r) = exp {- {\hat{Λ}}_{_{all}} (t ∣ r)}$ and ${\hat{S}}_{_{old}} (t ∣ r) = exp {- {\hat{Λ}}_{_{old}} (t ∣ r)}$ , where

$$\hat{\Lambda}_{all}(t \mid r) = \int_{0}^{t} \frac{\sum_{i} \hat{\omega}_{i} K_{h}(\hat{R}_{all,i} - r)\, dN_{i}(u)}{\sum_{i} \hat{\omega}_{i} K_{h}(\hat{R}_{all,i} - r) I(X_{i} \geq u)}, \quad \hat{\Lambda}_{old}(t \mid r) = \int_{0}^{t} \frac{\sum_{i} K_{h}(\hat{R}_{old,i} - r)\, dN_{i}(u)}{\sum_{i} K_{h}(\hat{R}_{old,i} - r) I(X_{i} \geq u)},$$

${\hat{R}}_{_{all, i}} = {\hat{γ}}_{_{all}}^{T} Y_{i}$ , ${\hat{R}}_{_{old, i}} = {\hat{γ}}_{_{old}}^{T} Y_{_{old, i}}$ and N_i(t) = I(X_i ≤ t)δ_i. The above calibrated risk prediction procedure essentially fits risk models (2.3) and (2.4) to summarize multi-variate risk markers into univariate risk scores, R_all and R_old, and then non-parametrically estimates the t-year risk given the risk score.
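The calibrated survival estimator is a kernel-smoothed, weighted Nelson–Aalen estimator, and the integral reduces to a sum over observed event times. The sketch below assumes a Gaussian kernel with bandwidth h; the function name `calibrated_survival` is hypothetical. Setting all weights to 1 gives the estimator for R_old.

```python
import math

def calibrated_survival(t, r, X, delta, R, w, h):
    """Kernel-smoothed, weighted Nelson-Aalen estimate of P(T > t | R = r).

    X: observed times, delta: event indicators, R: estimated risk scores,
    w: sampling weights, h: bandwidth of the (assumed Gaussian) kernel.
    """
    def kern(u):
        # K_h(u) = K(u / h) / h with K the standard normal density.
        return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

    # Combined sampling and kernel weight of each subject at score r.
    kw = [wi * kern(ri - r) for wi, ri in zip(w, R)]

    cum_hazard = 0.0
    for i in range(len(X)):
        if delta[i] == 1 and X[i] <= t and kw[i] > 0.0:
            # Weighted size of the risk set at the event time X[i].
            at_risk = sum(kw[j] for j in range(len(X)) if X[j] >= X[i])
            cum_hazard += kw[i] / at_risk
    return math.exp(-cum_hazard)
```

When every subject has the same risk score and unit weight, the kernel weights cancel and the function returns the ordinary Nelson–Aalen based survival estimate, which gives a quick sanity check.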

3.2 ∣. Evaluating Prediction Performance

The accuracy of the risk prediction based on a given risk score R can be summarized by commonly used time-dependent accuracy measures including the true positive rate (TPR), false positive rate (FPR), the receiver operating characteristic (ROC) curve, the positive predictive value (PPV), and the negative predictive value (NPV). These prediction performance measures typically consider the accuracy of a binary classification rule R ≥ r in predicting the t-year survival status D_t = I(T ≥ t). Specifically, the TPR and FPR of R ≥ r in predicting D_t are respectively defined as

$$\mathrm{TPR}(r \mid t) = P(R \geq r \mid T < t) \quad \text{and} \quad \mathrm{FPR}(r \mid t) = P(R \geq r \mid T \geq t).$$

The ROC curve, ROC(u∣t) = TPR{FPR⁻¹(u∣t)∣t}, summarizes the trade-off between the FPR and TPR as the cut-off value varies. The PPV and NPV of R ≥ r are defined as

$$\mathrm{PPV}(t \mid r) = P(T < t \mid R \geq r) \quad \text{and} \quad \mathrm{NPV}(t \mid r) = P(T \geq t \mid R < r).$$

To estimate the prediction accuracy for R_all and R_old, we note that all the aforementioned parameters are functionals of $S_{_{all}} (t ∣ r)$ , $S_{_{old}} (t ∣ r)$ , $F_{_{all}} (r) = P (R_{_{all}} \leq r)$ and $F_{_{old}} (r) = P (R_{_{old}} \leq r)$ . For example, the TPR and FPR of R_all ≥ r can be respectively written as

$$\mathrm{TPR}_{all}(r \mid t) = \frac{1 - F_{all}(r) - \int_{r}^{\infty} S_{all}(t \mid u)\, dF_{all}(u)}{1 - \int S_{all}(t \mid u)\, dF_{all}(u)} \quad \text{and} \quad \mathrm{FPR}_{all}(r \mid t) = \frac{\int_{r}^{\infty} S_{all}(t \mid u)\, dF_{all}(u)}{\int S_{all}(t \mid u)\, dF_{all}(u)}.$$

The trade-off between TPR_all(r∣t) and FPR_all(r∣t) can be summarized based on the receiver operating characteristic (ROC) curve $\mathrm{ROC}_{all}(u \mid t) = \mathrm{TPR}_{all}\{\mathrm{FPR}_{all}^{-1}(u \mid t) \mid t\}$, where u is any specified FPR level of interest. The marginal distribution functions $F_{all}(r)$ and $F_{old}(r)$ can be respectively estimated as

$$\hat{F}_{all}(r) = \frac{\sum_{i=1}^{N} \hat{\omega}_{i} I(\hat{R}_{all,i} \leq r)}{\sum_{i=1}^{N} \hat{\omega}_{i}} \quad \text{and} \quad \hat{F}_{old}(r) = N^{-1} \sum_{i=1}^{N} I(\hat{R}_{old,i} \leq r).$$

Subsequently, we may construct plug-in estimators for TPR_all(r∣t) and FPR_all(r∣t) as

$$\widehat{\mathrm{TPR}}_{all}(r \mid t) = \frac{1 - \hat{F}_{all}(r) - \int_{r}^{\infty} \hat{S}_{all}(t \mid u)\, d\hat{F}_{all}(u)}{1 - \int \hat{S}_{all}(t \mid u)\, d\hat{F}_{all}(u)} \quad \text{and} \quad \widehat{\mathrm{FPR}}_{all}(r \mid t) = \frac{\int_{r}^{\infty} \hat{S}_{all}(t \mid u)\, d\hat{F}_{all}(u)}{\int \hat{S}_{all}(t \mid u)\, d\hat{F}_{all}(u)},$$

respectively. Similar plug-in estimators can be constructed for other accuracy parameters. We may quantify the incremental value (IncV) of Y_new based on the difference between the accuracy of R_all and R_old. For example, the IncV of Y_new with respect to the ROC curve at FPR level of u₀ can be estimated as ${\hat{ROC}}_{_{all}} (u_{0} ∣ t) - {\hat{ROC}}_{_{old}} (u_{0} ∣ t)$ , where ${\hat{ROC}}_{_{all}} (u_{0} ∣ t) = {\hat{TPR}}_{_{all}} {{\hat{FPR}}_{_{all}}^{- 1} (u_{0} ∣ t) ∣ t}$ and ${\hat{ROC}}_{_{old}}$ is the estimated ROC curve for R_old.
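Because the estimated distribution function of the risk score is a weighted empirical distribution, every integral against it in the plug-in formulas reduces to a weighted sum over subjects. The sketch below illustrates this for the TPR and FPR. It assumes that the calibrated survival estimates at each subject's own score have already been computed and that the score is continuous, so that the rules R ≥ r and R > r coincide; the function name `tpr_fpr` is hypothetical.

```python
def tpr_fpr(r, R, S_t, w):
    """Plug-in TPR and FPR of the rule R >= r at a fixed horizon t.

    R:   estimated risk scores,
    S_t: calibrated survival estimate S(t | R_i) for each subject,
    w:   sampling weights (zero for subjects outside the NCC subcohort).
    """
    positive = [1.0 if ri >= r else 0.0 for ri in R]
    # Weighted mass of events by t (1 - S) and of survivors beyond t (S).
    event_mass = sum(wi * (1.0 - si) for wi, si in zip(w, S_t))
    surv_mass = sum(wi * si for wi, si in zip(w, S_t))
    tpr = sum(wi * pi_ * (1.0 - si) for wi, pi_, si in zip(w, positive, S_t)) / event_mass
    fpr = sum(wi * pi_ * si for wi, pi_, si in zip(w, positive, S_t)) / surv_mass
    return tpr, fpr
```

An estimate of the ROC curve at a target FPR level u₀ then follows by searching for the cut-off r whose estimated FPR is closest to u₀ and reporting the TPR at that cut-off.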

3.3 ∣. Resampling Based Interval Estimation

To estimate the asymptotic variance of the proposed AIPW estimators, we propose a perturbation resampling procedure. Specifically, let I = (I₁, …, I_N)^⊤ be a vector of independent and identically distributed non-negative random variables with mean 1 and variance 1. We first obtain the perturbed counterpart of $\hat{β} (π, x)$ as ${\hat{β}}^{*} (π, x)$, the solution to the estimating equation

$$\hat{U}_{\pi, x}^{*}(\beta) = N^{-1} \sum_{i=1}^{N} K_{b}(\tilde{\pi}_{0i} - \pi, X_{i} - x)(1 - \delta_{i}) Z_{i} \{V_{i} - g(\beta^{T} Z_{i})\} I_{i} = 0.$$

Then we perturb the AIPW weights as

$$\hat{\omega}_{i}^{*} = \left\{\delta_{i} + (1 - \delta_{i}) \frac{V_{0i}}{\hat{\pi}_{0i}^{*}}\right\} I_{i} \quad \text{with} \quad \hat{\pi}_{0i}^{*} = g\{\hat{\beta}^{*}(\tilde{\pi}_{0i}, X_{i})^{T} Z_{i}\}.$$

Subsequently, we perturb all AIPW estimators by replacing ${\hat{ω}}_{i}$ with ${\hat{ω}}_{i}^{*}$ . Specifically, we perturb ${\hat{γ}}_{_{all}}$ as

$$\hat{\gamma}_{all}^{*} = \operatorname{argmax}_{\gamma} \sum_{i=1}^{N} \hat{\omega}_{i}^{*} \delta_{i} \left[\gamma^{T} Y_{i} - \log\left\{\sum_{j=1}^{N} \hat{\omega}_{j}^{*} I(X_{j} \geq X_{i}) \exp(\gamma^{T} Y_{j})\right\}\right],$$

and perturb ${\hat{S}}_{_{all}} (t ∣ r)$ as ${\hat{S}}_{_{all}}^{*} (t ∣ r) = exp {- {\hat{Λ}}_{_{all}}^{*} (t ∣ r)}$ , where

$$\hat{\Lambda}_{all}^{*}(t \mid r) = \int_{0}^{t} \frac{\sum_{i} \hat{\omega}_{i}^{*} K_{h}(\hat{R}_{all,i}^{*} - r)\, dN_{i}(u)}{\sum_{i} \hat{\omega}_{i}^{*} K_{h}(\hat{R}_{all,i}^{*} - r) I(X_{i} \geq u)} \quad \text{and} \quad \hat{R}_{all,i}^{*} = Y_{i}^{T} \hat{\gamma}_{all}^{*}.$$

The accuracy parameters can be perturbed similarly. For example, we may obtain

$$\widehat{\mathrm{TPR}}_{all}^{*}(r \mid t) = \frac{1 - \hat{F}_{all}^{*}(r) - \int_{r}^{\infty} \hat{S}_{all}^{*}(t \mid u)\, d\hat{F}_{all}^{*}(u)}{1 - \int \hat{S}_{all}^{*}(t \mid u)\, d\hat{F}_{all}^{*}(u)},$$

where ${\hat{F}}_{_{all}}^{*} (r) = \sum_{i = 1}^{N} {\hat{ω}}_{i}^{*} I ({\hat{R}}_{_{all, i}}^{*} \leq r) ∕ (\sum_{i = 1}^{N} {\hat{ω}}_{i}^{*})$ .

For IncV parameters, the estimation of model parameters related to the reduced model only involves full cohort data, and thus the perturbation only involves weighting observations by {I_i}. Specifically, ${\hat{γ}}_{_{old}}$ is perturbed as

$$\hat{\gamma}_{old}^{*} = \operatorname{argmax}_{\gamma} \sum_{i=1}^{N} I_{i} \delta_{i} \left[\gamma^{T} Y_{old,i} - \log\left\{\sum_{j=1}^{N} I_{j} I(X_{j} \geq X_{i}) \exp(\gamma^{T} Y_{old,j})\right\}\right],$$

and ${\hat{S}}_{_{old}}^{*} (t ∣ r) = exp {- {\hat{Λ}}_{_{old}}^{*} (t ∣ r)}$ , where

$$\hat{\Lambda}_{old}^{*}(t \mid r) = \int_{0}^{t} \frac{\sum_{i} I_{i} K_{h}(\hat{R}_{old,i}^{*} - r)\, dN_{i}(u)}{\sum_{i} I_{i} K_{h}(\hat{R}_{old,i}^{*} - r) I(X_{i} \geq u)} \quad \text{and} \quad \hat{R}_{old,i}^{*} = Y_{old,i}^{T} \hat{\gamma}_{old}^{*}.$$

Similar strategies can be used for accuracy parameters such as ${\hat{TPR}}_{_{old}}^{*} (c ∣ t)$ and ${\hat{FPR}}_{_{old}}^{*} (c ∣ t)$ .

To obtain variance estimators and construct confidence intervals, we may obtain a large number, say B, of realizations of I. For each realization of I, we obtain the above perturbed estimates. The empirical distribution of the B sets of perturbed estimates can be used for inference. For example, the empirical variance of ${\hat{ROC}}_{_{all}}^{*} (u_{0} ∣ t) - {\hat{ROC}}_{_{old}}^{*} (u_{0} ∣ t)$ can be used to approximate the variance of ${\hat{ROC}}_{_{all}} (u_{0} ∣ t) - {\hat{ROC}}_{_{old}} (u_{0} ∣ t)$ .
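The resampling loop itself is short. The sketch below is a simplified illustration, not the full procedure: it draws I_i from the unit exponential distribution (one choice of a non-negative variable with mean 1 and variance 1), multiplies fixed weights by I_i, and recomputes a generic weighted estimator. In the procedure described above the sampling probabilities are also re-estimated in every replicate, which this sketch omits. The names `perturbation_se` and `weighted_mean` are hypothetical, and the weighted mean is only a stand-in for the accuracy estimators.

```python
import random
import statistics

def perturbation_se(estimator, data, weights, B=500, seed=0):
    """Standard error of a weighted estimator via perturbation resampling.

    Simplification: the weights are treated as fixed and are only multiplied
    by I_i; re-estimating the sampling probabilities is left out.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(B):
        # I_i ~ Exponential(1): non-negative, mean 1, variance 1.
        perturb = [rng.expovariate(1.0) for _ in data]
        perturbed_w = [wi * Ii for wi, Ii in zip(weights, perturb)]
        draws.append(estimator(data, perturbed_w))
    # Empirical standard deviation across the B perturbed estimates.
    return statistics.stdev(draws)

def weighted_mean(values, w):
    # Stand-in for any estimator that depends on the data through the weights.
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
```

For a difference such as the IncV in the ROC curve, the same realization of I should be used for both the full and the reduced model within each replicate, so that the empirical variance of the perturbed differences reflects their correlation.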

4 ∣. NUMERICAL STUDIES

We performed extensive simulations to evaluate the finite sample performance of the proposed estimators and to compare with other estimators under NCC design when the design is carried out perfectly or imperfectly. We generate Y = (Y_old, Y_new)^⊤ from a bivariate normal distribution with zero mean, unit variance and correlation 0.5. Given Y, we generate T from a PH model

$$P(T \geq t \mid Y) = \exp[-\exp\{\log(0.01 t) + \log(2) Y_{new} + \log(3) Y_{old}\}].$$

The censoring time was generated under two settings: (I) C ~ C_IND = min(C_a, C_b), where C_a ~ Uniform(0.5, 2) and C_b ~ Gamma(shape = 2, rate = 2); (II) $C \sim C_{DEP} = \min\{C_{a}, C_{b}^{'}(Y)\}$, where $C_{b}^{'}(Y) = \exp\{-(Y_{new} + Y_{old})/5\} + 0.5$. This leads to covariate-independent censoring in (I) and covariate-dependent censoring in (II). The censoring rate and event rate (proportion of cases) are around 15% and 5%, respectively. We let N = 2000 and selected the NCC cohort by including all the cases and m = 1 control per case. Under each configuration, results were summarized based on 500 simulated datasets. We obtain estimators for ${\bar{γ}}_{all} = ({\bar{γ}}_{1}, {\bar{γ}}_{2})$ in model (2.3) and TPR_u₀, PPV_u₀, NPV_u₀ at FPR = u₀, with u₀ taken to be 0.05, 0.1, 0.2. We also compared the proposed approach with existing methods, including the TIPW estimator of Cai and Zheng¹⁹, the NP estimator of Zheng et al.², and the conditional logistic regression based estimator, denoted as CLR.
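The data-generating mechanism can be sketched in a few lines of Python. Under the stated PH model, T given Y is exponential with rate 0.01 · 2^{Y_new} · 3^{Y_old}, which the sketch uses directly. The function name `simulate_cohort`, the dictionary layout and the seeding are assumptions for illustration, and the sketch makes no claim to reproduce the reported event and censoring rates.

```python
import math
import random

def simulate_cohort(N=2000, dependent_censoring=False, seed=0):
    """Generate one full cohort of (X, delta, Y_old, Y_new) records."""
    rng = random.Random(seed)
    rows = []
    for _ in range(N):
        # Bivariate normal: zero mean, unit variance, correlation 0.5.
        y_old = rng.gauss(0.0, 1.0)
        y_new = 0.5 * y_old + math.sqrt(1.0 - 0.5 ** 2) * rng.gauss(0.0, 1.0)
        # P(T >= t | Y) = exp(-0.01 t 2^{Y_new} 3^{Y_old}): exponential T.
        rate = 0.01 * math.exp(math.log(2.0) * y_new + math.log(3.0) * y_old)
        T = rng.expovariate(rate)
        c_a = rng.uniform(0.5, 2.0)
        if dependent_censoring:
            # Setting (II): censoring depends on the markers.
            c_b = math.exp(-(y_new + y_old) / 5.0) + 0.5
        else:
            # Setting (I): Gamma with shape 2 and rate 2, i.e. scale 1/2.
            c_b = rng.gammavariate(2.0, 0.5)
        C = min(c_a, c_b)
        rows.append(
            {"X": min(T, C), "delta": int(T <= C), "Y_old": y_old, "Y_new": y_new}
        )
    return rows
```

Note that `random.gammavariate` is parameterized by shape and scale, so a rate of 2 corresponds to a scale of 0.5.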

We considered three settings. In setting (1), the matching covariate vector is M = (M₁, M₂) with matching window a₀ = (0, 0), where $M_{1} = \sum_{l = 1}^{2} I (Y_{old} \leq y_{q_{l}})$, y_q is the 100q-th percentile of Y_old, q₁ = 0.33, q₂ = 0.66, and M₂ ~ Bernoulli(0.5). In setting (2), the matching vector is M = (M₁, …, M₅)^⊤ with matching window a₀ = (0, 2, 2, 5, 0), where M₁ is the same as in setting (1), $M_{2} \sim ⌊ 0.3 e^{N} ⌉$, $M_{3} \sim ⌊ 5 ϕ (Y_{old} + N) ⌉$, M₄ ~ ⌊Uniform(0, 10)⌉, M₅ ~ Bernoulli(0.5), ϕ is a normal density function, and $N \sim N (0, 1)$. In setting (3), the matching vector is M = (M₁, M₂, M₃, M₄)^⊤ with a varying window: we intend to match with window a₀ = (0, 0, 0, 0), but when the risk set of a case does not contain enough subjects, we relax the criterion to the matching window a = (0, 0, 2, 2) and select controls from the resulting risk set. Here M₁ is the same as in setting (1), $M_{2} = I (Y_{old} + N > 0)$ with $N \sim N (0, 1)$, $M_{3} \sim ⌊ 5 ϕ (Y_{old} + N) ⌉$ and $M_{4} \sim ⌊ 0.2 e^{N} ⌉$.

Results summarizing the performance of the proposed point and interval estimators across settings (1)–(3) are presented in Tables 1–3. The point estimators have negligible biases. The averages of the estimated standard errors (ASEs) are close to the corresponding empirical standard errors (SEs), and the empirical coverage probabilities (CP) of the 95% confidence intervals are close to the nominal level. These results confirm the validity of the proposed estimation procedures in finite samples.

TABLE 1.

Bias, empirical standard error (SE) and relative efficiency (RE) of the TIPW estimator, the proposed AIPW estimator, the nonparametric estimator (NP) and the CLR based estimator (CLR) for setting (1). For the proposed AIPW estimator, we also report the average of the estimated standard errors (ASE) and the empirical coverage probabilities (CP) of the 95% CIs (×100). Within each censoring panel, the three successive blocks of TPR, PPV and NPV rows correspond to u₀ = 0.05, 0.1 and 0.2, in that order.

Independent censoring (I)
		Bias				SE				RE				AIPW
	true	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	ASE	CP
${\bar{γ}}_{1}$	0.693	0.010	0.022	0.005	0.146	0.096	0.096	0.096	0.166	1.000	0.955	1.020	0.191	0.090	93.0
${\bar{γ}}_{2}$	1.099	−0.005	0.014	−0.022	0.344	0.100	0.095	0.092	0.340	1.000	1.086	1.116	0.043	0.090	93.6
TPR	0.460	−0.011	−0.010	−0.025		0.050	0.044	0.046		1.000	1.296	0.942		0.044	94.2
PPV	0.543	−0.012	−0.005	0.002		0.035	0.032	0.034		1.000	1.285	1.132		0.032	94.4
NPV	0.932	−0.001	−0.002	−0.008		0.011	0.007	0.009		1.000	2.220	0.863		0.008	96.8
TPR	0.596	−0.010	−0.009	−0.027		0.051	0.041	0.044		1.000	1.492	1.010		0.042	94.8
PPV	0.435	−0.009	−0.003	0.005		0.030	0.027	0.030		1.000	1.325	1.056		0.026	94.0
NPV	0.945	−0.001	−0.001	−0.008		0.011	0.007	0.008		1.000	2.586	1.007		0.007	97.0
TPR	0.748	−0.006	−0.005	−0.022		0.046	0.035	0.038		1.000	1.747	1.163		0.036	96.2
PPV	0.326	−0.005	−0.000	0.009		0.024	0.021	0.023		1.000	1.429	0.972		0.021	95.2
NPV	0.961	−0.001	−0.001	−0.006		0.011	0.006	0.007		1.000	2.986	1.229		0.007	97.2
Dependent censoring (II)
		Bias				SE				RE				AIPW
	true	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	ASE	CP
${\bar{γ}}_{1}$	0.693	0.018	0.044	0.007	0.086	0.116	0.111	0.115	0.169	1.000	0.952	1.027	0.382	0.106	91.9
${\bar{γ}}_{2}$	1.099	−0.004	−0.014	−0.019	0.210	0.122	0.105	0.101	0.315	1.000	1.319	1.398	0.104	0.101	93.3
TPR	0.460	−0.010	−0.005	−0.025		0.052	0.046	0.050		1.000	1.303	0.891		0.047	96.0
PPV	0.543	−0.010	−0.002	0.005		0.039	0.033	0.037		1.000	1.483	1.150		0.033	94.8
NPV	0.932	−0.001	−0.002	−0.009		0.010	0.008	0.009		1.000	1.623	0.575		0.008	96.8
TPR	0.596	−0.008	−0.004	−0.026		0.048	0.042	0.044		1.000	1.308	0.886		0.044	94.6
PPV	0.435	−0.006	−0.000	0.008		0.032	0.026	0.029		1.000	1.533	1.137		0.027	95.0
NPV	0.945	−0.001	−0.001	−0.008		0.009	0.007	0.009		1.000	1.578	0.606		0.008	96.0
TPR	0.748	−0.008	−0.004	−0.025		0.042	0.037	0.040		1.000	1.310	0.833		0.038	95.0
PPV	0.326	−0.005	0.000	0.010		0.025	0.021	0.024		1.000	1.560	0.987		0.021	95.6
NPV	0.961	−0.001	−0.001	−0.007		0.008	0.007	0.008		1.000	1.580	0.602		0.007	96.6

TABLE 3.

Simulation results for setting (3), reported in the same format as Table 1.

Independent censoring (I)
		Bias				SE				RE				AIPW
	true	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	ASE	CP
${\bar{γ}}_{1}$	0.693	0.028	0.024	0.007	0.152	0.129	0.101	0.099	0.169	1.000	1.614	1.758	0.335	0.105	94.6
${\bar{γ}}_{2}$	1.099	−0.086	−0.002	−0.063	0.335	0.128	0.094	0.092	0.345	1.000	2.716	1.908	0.103	0.103	95.6
TPR	0.460	−0.039	−0.009	−0.036		0.053	0.048	0.050		1.000	1.777	1.130		0.050	94.0
PPV	0.543	0.015	−0.002	0.020		0.040	0.033	0.036		1.000	1.671	1.056		0.037	96.6
NPV	0.932	−0.017	−0.003	−0.017		0.014	0.008	0.011		1.000	6.487	1.131		0.010	98.0
TPR	0.596	−0.043	−0.006	−0.038		0.058	0.048	0.051		1.000	2.266	1.293		0.048	92.8
PPV	0.435	0.020	0.001	0.025		0.036	0.027	0.032		1.000	2.251	1.040		0.031	96.6
NPV	0.945	−0.015	−0.001	−0.015		0.014	0.008	0.011		1.000	6.411	1.182		0.009	96.2
TPR	0.748	−0.045	−0.001	−0.035		0.057	0.040	0.045		1.000	3.244	1.621		0.042	94.6
PPV	0.326	0.022	0.003	0.027		0.033	0.021	0.025		1.000	3.568	1.099		0.025	96.8
NPV	0.961	−0.014	−0.001	−0.013		0.013	0.007	0.010		1.000	6.815	1.307		0.008	96.2
Dependent censoring (II)
		Bias				SE				RE				AIPW
	true	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	ASE	CP
${\bar{γ}}_{1}$	0.693	0.019	0.032	0.003	0.074	0.138	0.115	0.116	0.159	1.000	1.372	1.443	0.629	0.109	92.1
${\bar{γ}}_{2}$	1.099	−0.082	−0.016	−0.061	0.213	0.129	0.106	0.104	0.354	1.000	2.032	1.603	0.136	0.102	93.9
TPR	0.460	−0.032	−0.001	−0.034		0.063	0.052	0.054		1.000	1.816	1.206		0.050	91.9
PPV	0.543	0.022	0.006	0.027		0.045	0.034	0.040		1.000	2.154	1.116		0.036	94.1
NPV	0.932	−0.017	−0.003	−0.019		0.014	0.009	0.012		1.000	5.339	1.012		0.010	97.0
TPR	0.596	−0.036	0.001	−0.035		0.061	0.049	0.049		1.000	2.134	1.427		0.047	93.1
PPV	0.435	0.026	0.008	0.032		0.039	0.028	0.033		1.000	2.692	1.042		0.030	94.5
NPV	0.945	−0.015	−0.002	−0.016		0.013	0.009	0.011		1.000	5.235	1.097		0.009	96.3
TPR	0.748	−0.038	0.003	−0.035		0.053	0.042	0.043		1.000	2.340	1.379		0.040	94.3
PPV	0.326	0.026	0.008	0.033		0.034	0.021	0.027		1.000	3.593	1.009		0.024	96.1
NPV	0.961	−0.014	−0.001	−0.014		0.012	0.008	0.010		1.000	4.634	1.091		0.008	97.6

In setting (1), sampling is correctly carried out and M is low dimensional, and hence all three methods (TIPW, AIPW, NP) are valid. As shown in Table 1, all three estimators have negligible biases, TIPW and NP have comparable efficiency with respect to mean squared error (MSE), and AIPW is slightly more efficient than the TIPW and NP estimators. In setting (2), the sampling is carried out correctly but the matching variable is of a higher dimension, which leads to the curse of dimensionality for the NP method while the TIPW remains valid. As shown in Table 2, the TIPW and AIPW estimators both have negligible biases, while the NP estimator exhibits larger biases. Setting (3) is a commonly encountered imperfect NCC sampling setting that is similar to the motivating example of the NHS II study. In this case, the TIPW estimator is biased as expected. There is also bias observed for the NP estimators due to the curse of dimensionality, whereas the AIPW estimator still maintains negligible bias. In addition, the AIPW estimator is substantially more efficient than both the TIPW and NP estimators with respect to MSE, with relative efficiency as high as 6 compared to the TIPW estimator and 5 compared to the NP estimator. In all the settings considered, the CLR estimator is either more biased or less efficient than the other estimators.

TABLE 2.

Simulation results for setting (2), reported in the same format as Table 1.

Independent censoring (I)
		Bias				SE				RE				AIPW
	true	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	ASE	CP
${\bar{γ}}_{1}$	0.693	0.014	0.020	−0.009	0.139	0.110	0.105	0.118	0.171	1.000	1.078	0.880	0.254	0.096	91.9
${\bar{γ}}_{2}$	1.099	−0.018	0.013	−0.097	0.343	0.109	0.097	0.145	0.339	1.000	1.263	0.399	0.052	0.096	94.0
TPR	0.460	−0.015	−0.006	−0.052		0.049	0.043	0.057		1.000	1.383	0.446		0.046	95.8
PPV	0.543	−0.003	−0.001	0.054		0.034	0.030	0.041		1.000	1.309	0.261		0.033	96.8
NPV	0.932	−0.004	−0.002	−0.036		0.011	0.008	0.017		1.000	2.124	0.085		0.009	97.2
TPR	0.596	−0.016	−0.006	−0.064		0.048	0.043	0.055		1.000	1.391	0.360		0.043	95.6
PPV	0.435	−0.001	0.000	0.057		0.029	0.025	0.040		1.000	1.302	0.176		0.027	96.6
NPV	0.945	−0.004	−0.002	−0.033		0.010	0.007	0.016		1.000	2.101	0.086		0.008	97.4
TPR	0.748	−0.013	−0.004	−0.072		0.045	0.039	0.051		1.000	1.480	0.287		0.039	94.6
PPV	0.326	0.002	0.002	0.056		0.025	0.021	0.035		1.000	1.444	0.142		0.022	95.6
NPV	0.961	−0.003	−0.001	−0.031		0.009	0.007	0.015		1.000	2.091	0.085		0.008	97.8
Dependent censoring (II)
		Bias				SE				RE				AIPW
	true	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	ASE	CP
${\bar{γ}}_{1}$	0.693	0.024	0.044	−0.001	0.088	0.121	0.115	0.131	0.172	1.000	1.008	0.892	0.408	0.106	91.7
${\bar{γ}}_{2}$	1.099	−0.009	−0.008	−0.098	0.230	0.130	0.110	0.133	0.331	1.000	1.389	0.622	0.104	0.102	93.6
TPR	0.460	−0.010	−0.002	−0.049		0.056	0.050	0.058		1.000	1.302	0.565		0.048	92.8
PPV	0.543	0.001	0.004	0.064		0.040	0.035	0.046		1.000	1.311	0.263		0.034	94.1
NPV	0.932	−0.004	−0.003	−0.038		0.011	0.008	0.017		1.000	2.181	0.085		0.009	95.6
TPR	0.596	−0.012	−0.003	−0.067		0.053	0.045	0.056		1.000	1.426	0.385		0.044	94.9
PPV	0.435	0.002	0.005	0.066		0.034	0.029	0.040		1.000	1.335	0.190		0.028	92.6
NPV	0.945	−0.004	−0.002	−0.036		0.011	0.007	0.016		1.000	2.360	0.080		0.008	97.9
TPR	0.748	−0.011	−0.000	−0.072		0.045	0.036	0.047		1.000	1.631	0.289		0.038	95.3
PPV	0.326	0.004	0.006	0.065		0.027	0.022	0.036		1.000	1.506	0.137		0.022	95.1
NPV	0.961	−0.003	−0.001	−0.033		0.010	0.006	0.014		1.000	2.594	0.079		0.007	97.7
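
The RE columns can be reproduced, up to rounding, from the Bias and SE columns if relative efficiency is read as the ratio of the TIPW mean squared error to that of each estimator, with MSE = Bias² + SE². The minimal Python sketch below checks this reading on the $\bar{γ}_{1}$ row of the independent-censoring panel. The function names are illustrative, and small discrepancies from the tabulated RE are expected because the table entries are rounded to three decimals.

```python
def mse(bias, se):
    """Mean squared error from bias and empirical standard error."""
    return bias ** 2 + se ** 2


def relative_efficiency(bias_ref, se_ref, bias, se):
    """MSE of the reference estimator (TIPW) divided by MSE of the estimator."""
    return mse(bias_ref, se_ref) / mse(bias, se)


# (Bias, SE) pairs copied from the gamma_1 row, independent censoring (I).
row = {
    "TIPW": (0.014, 0.110),
    "AIPW": (0.020, 0.105),
    "NP": (-0.009, 0.118),
    "CLR": (0.139, 0.171),
}

re = {name: relative_efficiency(*row["TIPW"], *vals) for name, vals in row.items()}

for name, value in re.items():
    print(f"{name}: RE = {value:.3f}")
```

Values above 1 indicate an estimator with smaller MSE than TIPW.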


To examine whether our proposed method performs well under settings with a very low event rate, we also generated data under a slight variation of the above PH model with a substantially lower baseline hazard, leading to an event rate of about 0.5%, and with independent censoring. We sampled the NCC cohort under setting (3) and obtained estimates as above. As shown in Table 4, the proposed AIPW estimates have small biases and high relative efficiencies.

TABLE 4.

		Bias				SE				RE				AIPW
	true	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	TIPW	AIPW	NP	CLR	ASE	CP
${\bar{γ}}_{1}$	0.693	0.022	0.053	0.006	0.022	0.122	0.103	0.106	0.108	1.000	1.152	1.372	1.275	0.105	93.5
${\bar{γ}}_{2}$	1.099	−0.062	−0.008	−0.027	0.043	0.116	0.087	0.085	0.193	1.000	2.245	2.146	0.440	0.093	94.9
TPR	0.457	−0.044	−0.009	−0.036		0.047	0.038	0.043		1.000	2.623	1.305		0.038	91.5
PPV	0.163	0.004	−0.002	0.002		0.022	0.016	0.020		1.000	1.866	1.291		0.016	94.9
NPV	0.988	−0.003	−0.000	−0.002		0.002	0.001	0.002		1.000	8.090	1.539		0.001	95.9
TPR	0.601	−0.049	−0.008	−0.039		0.045	0.034	0.038		1.000	3.471	1.465		0.036	93.3
PPV	0.113	0.005	−0.000	0.004		0.014	0.010	0.012		1.000	2.334	1.446		0.010	95.9
NPV	0.991	−0.003	−0.001	−0.002		0.002	0.001	0.001		1.000	8.275	1.503		0.001	95.1
TPR	0.757	−0.047	−0.005	−0.035		0.037	0.028	0.031		1.000	4.325	1.631		0.031	96.1
PPV	0.075	0.005	−0.001	0.003		0.008	0.006	0.007		1.000	2.804	1.482		0.006	96.7
NPV	0.994	−0.003	−0.001	−0.002		0.001	0.001	0.001		1.000	8.069	1.548		0.001	93.3


5 ∣. REAL DATA ANALYSIS

High-density lipoprotein (HDL) is a protein-lipid complex which carries a range of proteins. These proteins differ in size and structure, which determines the functional properties and metabolism of HDL²². The plasma total apoA-1 concentration (WPA1) is well known to be strongly and consistently predictive of cardiovascular risk²³. In addition to apoA-1, HDL also contains other proteins including apoA2, apoC3 and apoE. ApoC3, present on 8-15% of HDL particles, has been shown to be associated with the risk of obesity and diabetes²⁴,²⁵. To assess the predictiveness of these lipoprotein markers for the risk of developing myocardial infarction (MI), an NCC biomarker study was performed within the NHS II blood cohort consisting of 29,240 registered nurses enrolled around 1989²⁶. Among participants who were free of diagnosed cardiovascular disease or cancer at blood draw, 144 women were identified in the cohort with incident MI between blood draw and January 2016. Using risk-set sampling, 144 controls were to be selected randomly and 1:1 matched on age, fasting (yes, no), smoking (never, past, current <15 cigarettes/day, current >15 cigarettes/day, resulting in three dummy variables), and month of blood draw. However, due to the lack of samples satisfying the matching criteria and having sufficient stored plasma for biomarker quantification, the NCC design was not followed exactly during the control sampling process, yielding an imperfect NCC design. If the matching criteria are followed, the matching window should be a₀ = (0, 0, 0, 0, 2, 2). However, if a case has no control in its risk set, the matching criteria are relaxed in a way that is not known. For example, the age difference may be relaxed to 5 years so that there are controls to select from for this case.
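
To make the role of the matching window concrete, the following Python sketch builds the matched risk set of each case and accumulates a design-based inclusion probability for every control, in the spirit of Samuelsen's¹⁶ product formula. It is an illustration under stated assumptions, not the study's sampling code: the function names, the single hypothetical matching variable (age), and the toy data are invented for the example, and the paper's own definition of the design weights appears in an earlier section.

```python
import numpy as np


def matched_risk_set(i, time, M, a0):
    """Indices j != i still at risk at time[i] with |M_j - M_i| <= a0 componentwise."""
    at_risk = time >= time[i]
    within = np.all(np.abs(M - M[i]) <= a0, axis=1)
    idx = np.flatnonzero(at_risk & within)
    return idx[idx != i]


def design_inclusion_prob(time, delta, M, a0, m=1):
    """Probability of ever being sampled when m controls are drawn per case."""
    stay_out = np.ones(len(time))  # running probability of never being selected
    for i in np.flatnonzero(delta == 1):
        rs = matched_risk_set(i, time, M, a0)
        if len(rs) == 0:
            continue  # empty matched risk set: the protocol cannot be followed
        stay_out[rs] *= 1.0 - min(m, len(rs)) / len(rs)
    p = 1.0 - stay_out
    p[delta == 1] = 1.0  # all cases are included
    return p


# Toy cohort: subject 0 is a case at time 1; age is the only matching variable.
time = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.array([1, 0, 0, 0])
M = np.array([[50.0], [51.0], [53.0], [56.0]])

p_strict = design_inclusion_prob(time, delta, M, a0=np.array([2.0]))
p_relaxed = design_inclusion_prob(time, delta, M, a0=np.array([5.0]))
print(p_strict, p_relaxed)
```

Widening the age window from 2 to 5 years changes which controls are eligible and therefore their inclusion probabilities, which is why weights computed from the nominal window can be wrong when the window was relaxed in practice.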

The outcome of interest is the time from blood draw to diagnosis of MI. For an individual without an event, failure time was censored at the earlier of the last contact date and January 2016. Routine risk factors included smoking, age, diabetes, high cholesterol, and medication for HBP. These factors are available from the full cohort. Measures of the new biomarkers, WPA1 and apoE, are only available for the NCC subcohort. To account for the subcohort sample, we fitted a weighted Cox PH model including WPA1 and apoE and other baseline clinical variables as covariates using the data from the NCC subset. Since the sampling depends on many levels of covariates, it was difficult to estimate the weights using the NP approach (existing packages for nonparametric estimation of the selection probability all failed). Due to the additional adjustment in matching criteria, the ‘true’ weights were not retrievable. Therefore, we calculated the weights using the proposed AIPW techniques. As presented in Table 5, more frequent smoking (>15 cig/d), having diabetes or high cholesterol, taking medication for high blood pressure, and high values of apoE are significantly associated with high risk of MI. In particular, apoE predicts the time to MI beyond clinical factors, with an HR of 1.427 (95% CI: 1.140, 1.786). We also considered fitting the Cox model using the weights calculated strictly from the original protocol (the IPW method). For more frequent smoking (>15 cig/d) versus never smoking, the estimated HR is significantly above 1 under the AIPW method but not under the IPW method; the latter does not reflect the findings in the existing literature. This is as expected, as the weights in this situation do not accurately account for the sampling procedures actually implemented, and this might potentially lead to biased estimates in the main regression model.
The results highlight the importance of robust procedures in the calculation of the sampling weights, though the difference between the IPW and AIPW estimators is less pronounced for new markers. The estimated effects of CLR are a little different from the IPW and AIPW estimators.

TABLE 5.

Hazard ratio (HR) estimates for MI risk using sampling weights based on the original protocol (IPW), the proposed AIPW method and CLR method.

covariate	IPW (95% CI)	AIPW (95% CI)	CLR (95% CI)
smoking (past)	0.804 (0.151, 1.457)	1.054 (0.815, 1.363)	0.723 (0.431, 1.214)
smoking <15 cig/d	0.890 (0.240, 1.541)	1.222 (0.854, 1.748)	1.067 (0.838, 1.359)
smoking >15 cig/d	1.160 (0.847, 1.473)	1.253 (1.083, 1.449)	NA
age	1.014 (0.502, 1.526)	1.156 (0.852, 1.569)	0.379 (0.077, 1.863)
diabetes	1.573 (1.346, 1.800)	1.359 (1.121, 1.649)	1.335 (1.015, 1.757)
high cholesterol	1.380 (1.073, 1.686)	1.369 (1.073, 1.747)	1.349 (1.049, 1.736)
medication for HBP	1.307 (1.063, 1.550)	1.430 (1.193, 1.714)	1.301 (1.059, 1.599)
WPA1	0.600 (0.101, 1.098)	0.730 (0.500, 1.066)	0.688 (0.496, 0.953)
apoE	1.420 (1.107, 1.733)	1.427 (1.140, 1.786)	1.214 (0.950, 1.550)


We then calculated the in-sample accuracy measures of the model scores for predicting risk of MI by 158 months (t = 158) using the proposed method. The estimates of TPR, PPV, NPV at FPR=0.05, 0.1, 0.2 and AUC for the Cox model with baseline covariates as well as WPA1 and apoE are listed in Table 6 along with the IncV of the corresponding accuracy measures compared to the performance of a Cox model without the biomarkers. Results show that adding WPA1 and apoE to the Cox model with baseline covariates leads to no significant improvement in the accuracy measures, though apoE has a significant association with time to MI.
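
The accuracy measures are linked through the event probability by time t. Assuming the usual cumulative/dynamic definitions (TPR and FPR condition on event status by t, PPV and NPV condition on the score exceeding or not exceeding the cutoff), Bayes' rule gives PPV and NPV from TPR, FPR and the prevalence P(T ≤ t). The short Python sketch below illustrates this relation; the prevalence of 0.004 is a hypothetical value chosen for illustration and is not a figure reported in this analysis, and the function name is likewise illustrative.

```python
def ppv_npv(tpr, fpr, prevalence):
    """PPV and NPV implied by TPR, FPR and the event probability by time t."""
    p = prevalence
    ppv = tpr * p / (tpr * p + fpr * (1.0 - p))
    npv = (1.0 - fpr) * (1.0 - p) / ((1.0 - fpr) * (1.0 - p) + (1.0 - tpr) * p)
    return ppv, npv


# TPR at FPR = 0.05 taken from the text; the prevalence is hypothetical.
ppv, npv = ppv_npv(tpr=0.265, fpr=0.05, prevalence=0.004)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

With a rare event, PPV stays small and NPV stays close to one even for a moderately discriminating score, so small incremental values in PPV and NPV are to be expected in this setting.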

TABLE 6.

Estimated accuracy measures for a MI risk model with clinical predictors and biomarkers WPA1 and apoE and the incremental values (incV) of WPA1 and apoE over a model with only clinical predictors.

measure	FPR	est (95% CI)	incv (95% CI)
TPR	0.050	0.265 (0.146, 0.385)	−0.022 (−0.106, 0.061)
PPV	0.050	0.021 (0.009, 0.033)	−0.001 (−0.009, 0.007)
NPV	0.050	0.997 (0.996, 0.998)	0.000 (−0.000, 0.000)
TPR	0.100	0.360 (0.234, 0.487)	0.001 (−0.081, 0.083)
PPV	0.100	0.014 (0.008, 0.021)	−0.001 (−0.005, 0.004)
NPV	0.100	0.997 (0.996, 0.998)	0.000 (−0.000, 0.000)
TPR	0.200	0.503 (0.378, 0.627)	0.025 (−0.065, 0.115)
PPV	0.200	0.010 (0.007, 0.014)	0.000 (−0.002, 0.003)
NPV	0.200	0.998 (0.997, 0.998)	0.000 (−0.000, 0.001)
AUC		0.688 (0.610, 0.765)	−0.005 (−0.057, 0.047)


6 ∣. DISCUSSION

Cost-effective two-phase sampling designs have been widely adopted in biomarker research in recent years. The nonrandom sampling of the NCC designs introduces complex data structures, which should be dealt with carefully to avoid bias. One well-recognized barrier in the analysis of two-phase designs is that the control selection procedures are often complicated in practical implementation: many matching factors are considered, and the window of selection for each variable might be adjusted in an ad hoc fashion over the course of the study, making it infeasible to retrieve the ‘true’ sampling weights. Robust nonparametric procedures for estimating the weights can consistently recover the weights according to the actual sampling; however, they are limited in handling more than a few matching factors. When the total number of matching variables and routine markers Y_old exceeds 5, the NP method of Zheng et al.² often becomes infeasible both theoretically and practically. On the other hand, our proposed AIPW method leverages the true sampling weights as a reasonable starting point and uses a sufficiently flexible model to estimate the effect on sampling of both variables involved in control selection and other correlated variables. Compared to the NP approach, the proposed AIPW procedure is able to incorporate a larger number of variables to augment the weights, while maintaining reasonable robustness and efficiency. It is important to note that matching on a large number of variables is generally not desirable since it inherently increases the chance of the matched risk sets being empty. We therefore do not recommend it in practice.

There are a couple of limitations and future directions in this line of research. The approach we proposed can easily be extended to other types of two-phase sampling such as covariate-stratified case–cohort studies. Flexible methods are also needed to account for other practical complications in two-phase sampling. Our methods here assume an NCC study where all cases will be selected due to a low incidence rate. However, in practice, due to case and sample availability, not all cases can be sampled²⁷. This may complicate the inference procedure and warrants future research.

The R code for carrying out the proposed AIPW procedure is available upon request.

ACKNOWLEDGEMENT

This research was funded by U01CA86368 and R01CA236558 awarded by the National Institutes of Health.

APPENDIX

APPENDIX A.

Note that in the appendices, the derivations are with respect to the whole data and the proposed AIPW estimator, so we omit the subscript ‘all’ for notational convenience.

In this section, we show the asymptotic normality of the proposed AIPW estimator.

Assume C has a finite support [0, τ], P(T > τ) > 0 and the markers Y are continuous and bounded. The limit of $\hat{γ}$ , which is $\bar{γ}$ , is in the interior of a compact parameter space Ω_γ. Suppose the regularity conditions in Andersen and Gill²⁸ hold. Similarly to Du and Akritas²⁹, we assume the kernel function K is a symmetric probability density function with finite support and bounded second derivative. In addition, we assume the joint density of $R = Y^{T} \bar{γ}$ , T, and C has continuous derivatives.

Let $β_{i} = β ({\tilde{π}}_{0 i}, X_{i})$ . We first derive the asymptotic expansion of $N^{1 ∕ 2} ({\hat{β}}_{i} - β_{i})$ , which will be used in later derivations. Recall that

{\hat{U}}_{{\tilde{π}}_{0 i}, X_{i}} (β_{i}) = \frac{1}{N} \sum_{j = 1}^{N} (1 - δ_{j}) [V_{j} - exp (β_{i}^{T} Z_{j}) ∕ {1 + exp (β_{i}^{T} Z_{j})}] Z_{j} K_{b} (({\tilde{π}}_{0 i}, X_{i}) - ({\tilde{π}}_{0 j}, X_{j})) .

The derivative of ${\hat{U}}_{{\tilde{π}}_{0 i}, X_{i}} (β_{i})$ with respect to β_i is

\frac{\partial {\hat{U}}_{{\tilde{π}}_{0 i}, X_{i}} (β_{i})}{\partial β_{i}} = - \frac{1}{N} \sum_{j = 1}^{N} \frac{exp (β_{i}^{T} Z_{j})}{{1 + exp (β_{i}^{T} Z_{j})}^{2}} Z_{j} Z_{j}^{T} (1 - δ_{j}) K_{b} (({\tilde{π}}_{0 i}, X_{i}) - ({\tilde{π}}_{0 j}, X_{j})),

which converges to $- Σ_{i} ≔ - {\tilde{π}}_{0 i} (1 - {\tilde{π}}_{0 i}) E [Z_{j} Z_{j}^{T} ∣ {\tilde{π}}_{0 j} = {\tilde{π}}_{0 i}, X_{j} = X_{i}] f ({\tilde{π}}_{0 i}, X_{i})$ , where f (·, ·) is the density function of $({\tilde{π}}_{0 i}, X_{i})$ . It follows that

N^{1 ∕ 2} ({\hat{β}}_{i} - β_{i}) = Σ_{i}^{- 1} N^{- 1 ∕ 2} \sum_{j = 1}^{N} [V_{j} - \frac{exp (β_{i}^{T} Z_{j})}{1 + exp (β_{i}^{T} Z_{j})}] Z_{j} (1 - δ_{j}) K_{b} (({\tilde{π}}_{0 i}, X_{i}) - ({\tilde{π}}_{0 j}, X_{j})) + o_{p} (1) .

(A.1)
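
The expansion (A.1) arises from solving the kernel-weighted logistic score equation at each target point $({\tilde{π}}_{0 i}, X_{i})$ . A minimal Python sketch of such a local fit by Newton–Raphson is given below, using the score and its derivative in the forms displayed above. The function names, the Epanechnikov product kernel, the common bandwidth b, and the simulated data are illustrative assumptions and not the authors' implementation; W stacks the pairs $({\tilde{π}}_{0 j}, X_{j})$ and the first column of Z is one.

```python
import numpy as np


def epanechnikov(u):
    """Symmetric density with finite support [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)


def local_logistic(target, W, Z, V, delta, b, n_iter=50, tol=1e-10):
    """Solve the kernel-weighted logistic score equation at `target`."""
    N = len(V)
    # Product kernel weights; cases (delta = 1) contribute nothing.
    k = np.prod(epanechnikov((W - target) / b) / b, axis=1) * (1 - delta)
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        score = Z.T @ (k * (V - p)) / N                       # estimating function
        info = (Z * (k * p * (1.0 - p))[:, None]).T @ Z / N   # minus its derivative
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated check with a coefficient that does not vary over W.
rng = np.random.default_rng(1)
N = 4000
W = rng.uniform(size=(N, 2))
Z = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.0])
V = rng.binomial(1, 1.0 / (1.0 + np.exp(-(Z @ beta_true))))
delta = np.zeros(N, dtype=int)

beta_hat = local_logistic(np.array([0.5, 0.5]), W, Z, V, delta, b=1.0)
print(beta_hat)
```

Repeating the fit at each control's own target point and plugging the result into the logistic link would give a fitted selection probability; a smaller bandwidth lets the coefficients vary more across targets at the cost of fewer effective observations per fit.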

For the proposed AIPW estimators with a general form

\hat{U} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} {\hat{ω}}_{i} U_{i},

(A.2)

where $E (U_{i}) = 0$ , ${\hat{ω}}_{i} = V_{i} ∕ {\hat{π}}_{i}$ and ${\hat{π}}_{i} = δ_{i} + (1 - δ_{i}) {\hat{π}}_{0 i}$ , we have

\hat{U} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} {\hat{ω}}_{i} U_{i} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} U_{i} + N^{- 1 ∕ 2} \sum_{i = 1}^{N} ({\tilde{ω}}_{i} - 1) U_{i} + N^{- 1 ∕ 2} \sum_{i = 1}^{N} ({\hat{ω}}_{i} - {\tilde{ω}}_{i}) U_{i} \equiv I_{1} + I_{2} + I_{3},

where I_{3} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} V_{i} (\frac{1}{{\hat{π}}_{0 i}} - \frac{1}{{\tilde{π}}_{0 i}}) U_{i} = - N^{- 1 ∕ 2} \sum_{i = 1}^{N} V_{i} \frac{{\hat{π}}_{0 i} - {\tilde{π}}_{0 i}}{{\hat{π}}_{0 i} {\tilde{π}}_{0 i}} U_{i} = - N^{- 1 ∕ 2} \sum_{i = 1}^{N} {\tilde{ω}}_{i} \frac{{\hat{π}}_{0 i} - {\tilde{π}}_{0 i}}{{\hat{π}}_{0 i}} U_{i} = - N^{- 1 ∕ 2} \sum_{i = 1}^{N} {\tilde{ω}}_{i} U_{i} \frac{{\hat{π}}_{0 i} - {\tilde{π}}_{0 i}}{{\tilde{π}}_{0 i}} + o_{p} (1) = - N^{- 1 ∕ 2} \sum_{i = 1}^{N} {\tilde{ω}}_{i} U_{i} \frac{\dot{g} (β_{i}^{T} Z_{i})}{g (β_{i}^{T} Z_{i})} Z_{i}^{T} ({\hat{β}}_{i} - β_{i}) + o_{p} (1) = - N^{- 1} \sum_{i = 1}^{N} {\tilde{ω}}_{i} U_{i} (1 - {\tilde{π}}_{0 i}) Z_{i}^{T} Σ_{i}^{- 1} N^{- 1 ∕ 2} \sum_{j = 1}^{N} (V_{j} - \frac{exp (β_{i}^{T} Z_{j})}{1 + exp (β_{i}^{T} Z_{j})}) Z_{j} (1 - δ_{j}) K_{b} (({\tilde{π}}_{0 i}, X_{i}) - ({\tilde{π}}_{0 j}, X_{j})) + o_{p} (1) = - N^{- 1 ∕ 2} \sum_{j = 1}^{N} E [U_{i} Z_{i}^{T} ∣ {\tilde{π}}_{0 i} = {\tilde{π}}_{0 j}, X_{i} = X_{j}] E [Z_{i} Z_{i}^{T} ∣ {\tilde{π}}_{0 i} = {\tilde{π}}_{0 j}, X_{i} = X_{j}]^{- 1} \times ({\tilde{ω}}_{j} - 1) Z_{j} (1 - δ_{j}) + o_{p} (1) = - N^{- 1 ∕ 2} \sum_{j = 1}^{N} ({\tilde{ω}}_{j} - 1) (1 - δ_{j}) Π_{j} + o_{p} (1),

where $Π_{j} = E [U_{i} Z_{i}^{T} ∣ {\tilde{π}}_{0 i} = {\tilde{π}}_{0 j}, X_{i} = X_{j}] E [Z_{i} Z_{i}^{T} ∣ {\tilde{π}}_{0 i} = {\tilde{π}}_{0 j}, X_{i} = X_{j}]^{- 1} Z_{j}$ , which can be regarded as a linear (conditional) projection of U_j onto the space of Z_j under the inner product ⟨X_i, Y_i⟩ = E(X_iY_i). Also note that $E [U_{i} Z_{i}^{T} ∣ {\tilde{π}}_{0 i} = {\tilde{π}}_{0 j}, X_{i} = X_{j}] E [Z_{i} Z_{i}^{T} ∣ {\tilde{π}}_{0 i} = {\tilde{π}}_{0 j}, X_{i} = X_{j}]^{- 1}$ is the minimizer of

\frac{1}{N} \sum_{i = 1}^{N} (U_{i} - θ Z_{i})^{2} K_{b} (({\tilde{π}}_{0 i}, X_{i}) - ({\tilde{π}}_{0 j}, X_{j}))

with respect to θ. So $E [Z_{i} (U_{i} - Π_{i}) ∣ {\tilde{π}}_{0 i}, X_{i}] = 0$ . Since the first component of Z_j is one, we have that $E [(U_{i} - Π_{i}) ∣ {\tilde{π}}_{0 i}, X_{i}] = 0$ . So $\hat{U}$ can be rewritten as

\hat{U} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} U_{i} + N^{- 1 ∕ 2} \sum_{i = 1}^{N} (1 - δ_{i}) ({\tilde{ω}}_{i} - 1) (U_{i} - Π_{i}) + o_{p} (1) .

(A.3)

It follows from Cai and Zheng¹ that $\hat{U}$ is asymptotically normal, with asymptotic variance

Σ_{U} = E (U_{i}^{2}) + E N^{- 1} \sum_{i = 1}^{N} (1 - δ_{i}) ({\tilde{ω}}_{i} - 1)^{2} (U_{i} - Π_{i})^{2} + o_{p} (1) = E (U_{i}^{2}) + E [(\frac{1 - {\tilde{π}}_{0 i}}{{\tilde{π}}_{0 i}}) (1 - δ_{i}) (U_{i} - Π_{i})^{2}] + o_{p} (1) .

The cross-product terms do not contribute to this variance, because the interaction term is

E [N^{- 1} \sum_{i \neq j}^{N} ({\tilde{ω}}_{i} - 1) ({\tilde{ω}}_{j} - 1) (U_{i} - Π_{i}) (U_{j} - Π_{j})] = (N - 1) E Cov ({\tilde{ω}}_{i} {U_{i} - Π_{i}}, {\tilde{ω}}_{j} {U_{j} - Π_{j}} ∣ D) = - m (N - 1) ∕ N \int η (t, X_{i}, δ_{i}) η (t, X_{j}, δ_{j}) \frac{d Λ_{N C C} (t)}{P (X \geq t)} = 0,

where $Λ_{N C C} (t) = \int_{0}^{t} d A_{N C C} (u) ∕ P (X \geq u), A_{N C C} (t) = E {N_{i} (t)}$ and

η (t, X_{i}, δ_{i}) = E [{U_{i} - Π_{i}} I (X_{i} \geq t) (1 - {\tilde{π}}_{0 i}) ∕ {\tilde{π}}_{0 i}] = E (E [{U_{i} - Π_{i}} I (X_{i} \geq t) (1 - {\tilde{π}}_{0 i}) ∕ {\tilde{π}}_{0 i} ∣ {\tilde{π}}_{0 i}, X_{i}]) = 0

(A.4)

by the arguments before (A.3) and similar arguments to those of Samuelsen¹⁶.

From Cai and Zheng¹, we know that the asymptotic variance of the TIPW estimator $\hat{U} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} {\tilde{ω}}_{i} U_{i}$ is

Σ^{T I P W} = E (U_{i}^{2}) + E (U_{i}^{2} \frac{1 - {\tilde{π}}_{0 i}}{{\tilde{π}}_{0 i}}) - m \int η_{u} (t, X_{i}, δ_{i})^{2} \frac{d Λ_{N C C} (t)}{P (X \geq t)} + o_{p} (1) = E (U_{i}^{2} ∕ {\tilde{π}}_{0 i}) - m \int η_{u} (t, X_{i}, δ_{i})^{2} \frac{d Λ_{N C C} (t)}{P (X \geq t)} + o_{p} (1),

where $η_{u} (t, X_{i}, δ_{i}) = E [U_{i} I (X_{i} \geq t) (1 - {\tilde{π}}_{0 i}) ∕ {\tilde{π}}_{0 i}]$ .

Comparing these two asymptotic variances, we have

Σ^{T I P W} - Σ_{U} = E [(1 - δ_{i}) (\frac{1 - {\tilde{π}}_{0 i}}{{\tilde{π}}_{0 i}}) {U_{i}^{2} - (U_{i} - Π_{i})^{2}}] - m \int η_{u} (t, X_{i}, δ_{i})^{2} \frac{d Λ_{N C C} (t)}{P (X \geq t)} = E {(1 - δ_{i}) (\frac{1 - {\tilde{π}}_{0 i}}{{\tilde{π}}_{0 i}}) Π_{i}^{2}} - m \int η_{u} (t, X_{i}, δ_{i})^{2} \frac{d Λ_{N C C} (t)}{P (X \geq t)} = v a r {N^{- 1 ∕ 2} \sum_{i = 1}^{N} (1 - δ_{i}) ({\tilde{ω}}_{i} - 1) Π_{i}} \geq 0,

where the last equality holds similarly to (A.4). That is, $E [U_{i} I (X_{i} \geq t) (1 - {\tilde{π}}_{0 i}) ∕ {\tilde{π}}_{0 i}] = E [Π_{i} I (X_{i} \geq t) (1 - {\tilde{π}}_{0 i}) ∕ {\tilde{π}}_{0 i}]$ . Therefore, the proposed AIPW estimators are more efficient than the true weight based TIPW estimators.
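
The step that replaces $U_{i}^{2} - (U_{i} - Π_{i})^{2}$ by $Π_{i}^{2}$ inside the expectation can be spelled out as follows. This is a sketch that relies on the orthogonality of $U_{i} - Π_{i}$ to $Z_{i}$ stated before (A.3) and assumes the factor $(1 - δ_{i}) (1 - {\tilde{π}}_{0 i}) ∕ {\tilde{π}}_{0 i}$ can be treated as fixed given the conditioning variables; since Π_i is linear in Z_i, the cross term then has conditional mean zero.

```latex
U_i^2 - (U_i - \Pi_i)^2
  = 2 U_i \Pi_i - \Pi_i^2
  = \Pi_i^2 + 2\,\Pi_i (U_i - \Pi_i),
\qquad
E\left[ \Pi_i (U_i - \Pi_i) \mid \tilde{\pi}_{0i}, X_i \right] = 0 .
```

Taking expectations therefore leaves only the $Π_{i}^{2}$ term, which is the second line of the display above.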

APPENDIX B.

Now we derive the specific forms of U_i in the general form (A.2) for all the related estimators of interest. Then the asymptotic variances of these estimators can be obtained using the results in Appendix A.

For $\hat{γ}$ , similarly to Cai and Zheng¹, we have that

N^{1 ∕ 2} (\hat{γ} - \bar{γ}) = N^{- 1 ∕ 2} \sum_{i = 1}^{N} {\hat{ω}}_{i} U_{\bar{γ} i} + o_{p} (1), where U_{\bar{γ} i} = D (\bar{γ})^{- 1} \int {Y_{i} - \frac{I^{(1)} (t)}{I^{(0)} (t)}} d M_{i} (t), D (\bar{γ}) = N^{- 1} \sum_{i = 1}^{N} δ_{i} {\frac{I^{(2)} (X_{i}) I^{(0)} (X_{i}) - I^{(1)} (X_{i})^{\otimes 2}}{I^{(0)} (X_{i})^{\otimes 2}}}, I^{(k)} (t, γ) = N^{- 1} \sum_{i = 1}^{N} {\hat{ω}}_{i} I (X_{i} \geq t) exp (Y_{i}^{T} γ) Y_{i}^{k}, k = 0, 1, 2, I^{(k)} (t) = N^{- 1} \sum_{i = 1}^{N} {\hat{ω}}_{i} I (X_{i} \geq t) exp (Y_{i}^{T} \bar{γ}) Y_{i}^{k}, k = 0, 1, 2, A_{i} (t) = \int_{0}^{t} I (X_{i} \geq u) exp (Y_{i}^{T} \bar{γ}) d Λ_{0} (u), and M_{i} (t) = N_{i} (t) - A_{i} (t) .

For $\hat{Λ} (t ∣ r)$ , we have

N^{1 ∕ 2} {\hat{Λ} (t ∣ r) - Λ (t ∣ r)} = N^{1 ∕ 2} \int_{0}^{t} \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} ({\hat{γ}}^{T} Y_{i} - r) d N_{i} (u)}{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} ({\hat{γ}}^{T} Y_{i} - r) I (X_{i} \geq u)} - N^{1 ∕ 2} Λ (t ∣ r) = N^{1 ∕ 2} \int_{0}^{t} \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} ({\hat{γ}}^{T} Y_{i} - r) d M_{i} (u)}{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} ({\hat{γ}}^{T} Y_{i} - r) I (X_{i} \geq u)} = N^{1 ∕ 2} \int_{0}^{t} \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} ({\hat{γ}}^{T} Y_{i} - r) d M_{i} (u)}{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} ({\hat{γ}}^{T} Y_{i} - r) I (X_{i} \geq u)} - N^{1 ∕ 2} \int_{0}^{t} \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) d M_{i} (u)}{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) I (X_{i} \geq u)} + N^{1 ∕ 2} \int_{0}^{t} \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) d M_{i} (u)}{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) I (X_{i} \geq u)} = [\int_{0}^{t} \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} {\dot{K}}_{h} (Y_{i}^{T} \bar{γ} - r) ∕ h Y_{i} d M_{i} (u)}{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) I (X_{i} \geq u)} - \int_{0}^{t} \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) d M_{i} (u) {\sum_{i = 1}^{N} {\hat{ω}}_{i} {\dot{K}}_{h} (Y_{i}^{T} \bar{γ} - r) ∕ h Y_{i} I (X_{i} \geq u)}}{{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) I (X_{i} \geq u)}^{2}}] N^{1 ∕ 2} (\hat{γ} - \bar{γ}) + N^{1 ∕ 2} \int_{0}^{t} \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) d M_{i} (u)}{\sum_{i = 1}^{N} {\hat{ω}}_{i} K_{h} (Y_{i}^{T} \bar{γ} - r) I (X_{i} \geq u)} .

So the U_i form in (A.2) for $\hat{Λ} (t ∣ r)$ is

U_{Λ i} (t ∣ r) = U_{\bar{γ} i}^{T} [\int_{0}^{t} \frac{N^{- 1} \sum_{j = 1}^{N} {\hat{ω}}_{j} {\dot{K}}_{h} (Y_{j}^{T} \bar{γ} - r) ∕ h Y_{j} d M_{j} (u)}{N^{- 1} \sum_{j = 1}^{N} {\hat{ω}}_{j} K_{h} (Y_{j}^{T} \bar{γ} - r) I (X_{j} \geq u)} - \int_{0}^{t} \frac{\sum_{j = 1}^{N} {\hat{ω}}_{j} K_{h} (Y_{j}^{T} \bar{γ} - r) d M_{j} (u) {\sum_{l = 1}^{N} {\hat{ω}}_{l} {\dot{K}}_{h} (Y_{j}^{T} \bar{γ} - r) ∕ h Y_{l} I (X_{l} \geq u)}}{{\sum_{j = 1}^{N} {\hat{ω}}_{j} K_{h} (Y_{j}^{T} \bar{γ} - r) I (X_{j} \geq u)}^{2}}] + \int_{0}^{t} \frac{K_{h} (Y_{j}^{T} \bar{γ} - r) d M_{j} (u)}{N^{- 1} \sum_{j = 1}^{N} {\hat{ω}}_{j} K_{h} (Y_{j}^{T} \bar{γ} - r) I (X_{j} \geq u)} .

Recalling that $\hat{S} (t ∣ r) = exp {- \hat{Λ} (t ∣ r)}$ , we have that the U_i form in (A.2) for $\hat{S} (t ∣ r)$ is

U_{S i} (t ∣ r) = - S (t ∣ r) U_{Λ i} (t ∣ r) .
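
This form follows from a first-order Taylor expansion of the map Λ ↦ exp(−Λ) around the true cumulative hazard, sketched below for completeness.

```latex
\hat{S}(t \mid r) - S(t \mid r)
  = \exp\{-\hat{\Lambda}(t \mid r)\} - \exp\{-\Lambda(t \mid r)\}
  = -S(t \mid r)\,\{\hat{\Lambda}(t \mid r) - \Lambda(t \mid r)\}
    + O_p\!\left( \{\hat{\Lambda}(t \mid r) - \Lambda(t \mid r)\}^2 \right).
```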

Recalling that $\hat{F} (r) = \frac{\sum_{i = 1}^{N} {\hat{ω}}_{i} I ({\hat{R}}_{all, i} \leq r)}{\sum_{i = 1}^{N} {\hat{ω}}_{i}}$ , we get that the U_i form in (A.2) for $\hat{F} (r)$ is

U_{F i} (r) = I (R_{i} \leq r) - F (r) + D_{\bar{γ}} (r) U_{\bar{γ} i}, where D_{\bar{γ}} (r) = \partial E [I (R_{i} \leq r)] ∕ \partial γ ∣_{γ = \bar{γ}} .

Recalling $\hat{S} (r, t) = \int_{r}^{\infty} \hat{S} (t ∣ u) d \hat{F} (u)$ , we have that the U_i form in (A.2) for $\hat{S} (r, t)$ is

U_{S i} (t, r) = \int_{r}^{\infty} U_{S i} (t ∣ u) d F (u) + \int_{r}^{\infty} S (t ∣ u) d U_{F_{i}} (u) .

It follows that the U_i forms for the accuracy parameter estimators are

U_{T P R_{t} i} (r) = \frac{T P R_{t} (r) U_{S i} (t, r_{l}) - U_{F i} (r) - U_{S i} (t, r)}{1 - S (t)}, U_{F P R_{t} i} (r) = \frac{U_{S i} (t, r) - F P R_{t} (r) U_{S i} (t, r_{l})}{S (t)}, U_{P P V_{t} i} (r) = \frac{{P P V_{t} (r) - 1} U_{F i} (r) - U_{S i} (t, r)}{1 - F (r)}, U_{N P V_{t} i} (r) = \frac{U_{S i} (t) - U_{S i} (t, r) - N P V_{t} (r) U_{F i} (r)}{F (r)} .

Thus, we get the forms of U_i in (A.2) for the regression parameter estimator $\hat{γ}$ and the accuracy parameter estimators $\hat{TPR} (c ∣ t)$ , $\hat{FPR} (c ∣ t)$ , $\hat{PPV} (c ∣ t)$ , $\hat{NPV} (c ∣ t)$ .

APPENDIX C.

In this section, we show the validity of the proposed resampling technique.

The derivative of ${\hat{U}}_{{\tilde{π}}_{0 i}, X_{i}}^{*} (β_{i}^{*})$ with respect to $β_{i}^{*}$ is

\frac{\partial {\hat{U}}_{{\tilde{π}}_{0 i}, X_{i}}^{*} (β_{i}^{*})}{\partial β_{i}^{*}} = - \frac{1}{N} \sum_{j = 1}^{N} I_{j} \frac{exp (β_{i}^{T} Z_{j})}{{1 + exp (β_{i}^{T} Z_{j})}^{2}} Z_{j} Z_{j}^{T} (1 - δ_{j}) K_{b} (({\tilde{π}}_{0 i}, X_{i}) - ({\tilde{π}}_{0 j}, X_{j})) + o_{p} (1),

which also converges to −Σ_i. It follows that

N^{1 ∕ 2} ({\hat{β}}_{i}^{*} - β_{i}) = Σ_{i}^{- 1} N^{- 1 ∕ 2} \sum_{j = 1}^{N} I_{j} (V_{j} - \frac{exp (β_{i}^{T} Z_{j})}{1 + exp (β_{i}^{T} Z_{j})}) Z_{j} (1 - δ_{j}) K_{b} (({\tilde{π}}_{0 i}, X_{i}) - ({\tilde{π}}_{0 j}, X_{j})) + o_{p} (1) .

The perturbed form of (A.2) is

{\hat{U}}^{*} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} [δ_{i} I_{i} + (1 - δ_{i}) V_{0 i} I_{i} ∕ {\hat{π}}_{0 i}^{*}] U_{i} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} {I_{i} δ_{i} + (1 - δ_{i}) I_{i} + (1 - δ_{i}) (\frac{V_{0 i}}{{\tilde{π}}_{0 i}} - 1) I_{i} + (1 - δ_{i}) [\frac{V_{0 i} I_{i}}{{\hat{π}}_{0 i}^{*}} - \frac{V_{0 i} I_{i}}{{\tilde{π}}_{0 i}}]} U_{i} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} I_{i} U_{i} + N^{- 1 ∕ 2} \sum_{i = 1}^{N} (1 - δ_{i}) I_{i} (\frac{V_{0 i}}{{\tilde{π}}_{0 i}} - 1) U_{i} - N^{- 1 ∕ 2} \sum_{i = 1}^{N} (1 - δ_{i}) \frac{V_{0 i} I_{i}}{{\tilde{π}}_{0 i}} \frac{{\hat{π}}_{0 i}^{*} - {\tilde{π}}_{0 i}}{{\hat{π}}_{0 i}^{*}} U_{i} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} I_{i} U_{i} + N^{- 1 ∕ 2} \sum_{i = 1}^{N} (1 - δ_{i}) I_{i} ({\tilde{ω}}_{i} - 1) (U_{i} - Π_{i}) + o_{p} (1),

where the last equation follows similarly to the derivation of I₃ in Appendix A.

From (A.3), we know

\hat{U} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} U_{i} + N^{- 1 ∕ 2} \sum_{i = 1}^{N} (1 - δ_{i}) ({\tilde{ω}}_{i} - 1) (U_{i} - Π_{i}) + o_{p} (1) .

It follows that

{\hat{U}}^{*} - \hat{U} = N^{- 1 ∕ 2} \sum_{i = 1}^{N} (I_{i} - 1) U_{i} + N^{- 1 ∕ 2} \sum_{i = 1}^{N} (1 - δ_{i}) (I_{i} - 1) ({\tilde{ω}}_{i} - 1) (U_{i} - Π_{i}) + o_{p} (1) .

Therefore,

Var ({\hat{U}}^{*} - \hat{U} ∣ D) = Var (\hat{U}) .
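
The variance identity can be checked numerically in a simplified setting. The Python sketch below considers only the first term of the expansion (all weights equal to one, so there is no second-phase sampling) and assumes the perturbation variables I_i are i.i.d. Exponential(1), which have mean one and variance one; this distributional choice, the sample sizes, and the variable names are assumptions made for the illustration. Conditional on the data, the variance of the perturbed difference should then be close to the empirical second moment of U_i.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 500, 4000

# Mean-zero influence terms U_i (simulated here purely for illustration).
U = 1.5 * rng.normal(size=N)

# Perturbation variables: positive, mean 1, variance 1.
I = rng.exponential(scale=1.0, size=(B, N))

# B realisations of U_hat^* - U_hat = N^{-1/2} * sum_i (I_i - 1) U_i, data held fixed.
diffs = (I - 1.0) @ U / np.sqrt(N)

var_perturb = diffs.var()     # Monte Carlo estimate of Var(U_hat^* - U_hat | D)
var_target = np.mean(U ** 2)  # empirical counterpart of E(U_i^2)

print(var_perturb, var_target)
```

In the full procedure the same perturbation variables also enter the estimation of the selection probabilities, which is what carries the second term of the expansion into the resampling distribution.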

References

1. Cai T, Zheng Y. Evaluating prognostic accuracy of biomarkers under nested case-control studies. Biostatistics 2012; 13(1): 89–100.
2. Zheng Y, Brown M, Lok A, Cai T, others. Improving efficiency in biomarker incremental value evaluation under two-phase designs. The Annals of Applied Statistics 2017; 11(2): 638–654.
3. Pepe M, Feng Z, Janes H, Bossuyt P, Potter J. Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. Journal of the National Cancer Institute 2008; 100(20): 1432–1438.
4. Johnson SR, Anderson GL, Barad DH, Stefanick ML. The Women’s Health Initiative: rationale, design and progress report. British Menopause Society Journal 1999; 5(4): 155–159.
5. Colditz GA, Manson JE, Hankinson SE. The Nurses’ Health Study: 20-year contribution to the understanding of health among women. Journal of Women’s Health 1997; 6(1): 49–62.
6. Prentice RL, Breslow N. Retrospective studies and failure time models. Biometrika 1978: 153–158.
7. Breslow NE, Day NE, others. Statistical Methods in Cancer Research. 1 International Agency for Research on Cancer; Lyon. 1980.
8. Prentice R. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 1986; 73(1): 1.
9. Martin LJ, Melnichouk O, Huszti E, et al. Serum lipids, lipoproteins, and risk of breast cancer: a nested case-control study using multiple time points. JNCI: Journal of the National Cancer Institute 2015; 107(5).
10. Chambers JC, Loh M, Lehne B, et al. Epigenome-wide association of DNA methylation markers in peripheral blood from Indian Asians and Europeans with incident type 2 diabetes: a nested case-control study. The Lancet Diabetes & Endocrinology 2015; 3(7): 526–534.
11. Jensen MK, Rimm EB, Furtado JD, Sacks FM. Apolipoprotein C-III as a potential modulator of the association between HDL-cholesterol and incident coronary heart disease. Journal of the American Heart Association 2012; 1(2): e000232.
12. Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the Cox regression model. The Annals of Statistics 1992: 1903–1928.
13. Langholz B, Borgan Ø. Estimation of absolute risk from nested case-control data. Biometrics 1997: 767–774.
14. Scheike TH, Juul A. Maximum likelihood estimation for Cox’s regression model under nested case-control sampling. Biostatistics 2004; 5(2): 193–206.
15. Zeng D, Lin D, Avery C, North K, Bray M. Efficient semiparametric estimation of haplotype-disease associations in case–cohort and nested case–control studies. Biostatistics 2006; 7(3): 486–502.
16. Samuelsen SO. A pseudolikelihood approach to analysis of nested case-control studies. Biometrika 1997; 84(2): 379–394.
17. Lu W, Liu M. On estimation of linear transformation models with nested case–control sampling. Lifetime Data Analysis 2012; 18(1): 80–93.
18. Cai T, Zheng Y. Evaluating prognostic accuracy of biomarkers in nested case-control studies. Biostatistics 2011; 13(1): 89–100.
19. Cai T, Zheng Y. Nonparametric evaluation of biomarker accuracy under nested case-control studies. Journal of the American Statistical Association 2011; 106(494): 569–580.
20. Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association 1989; 84(408): 1074–1078.
21. Cai T, Tian L, Uno H, Solomon SD, Wei L. Calibrating parametric subject-specific risk estimation. Biometrika 2010; 97(2): 389–404.
22. Davidson WS, Silva RGD, Chantepie S, Lagor WR, Chapman MJ, Kontush A. Proteomic analysis of defined HDL subpopulations reveals particle-specific protein clusters: relevance to antioxidative function. Arteriosclerosis, Thrombosis, and Vascular Biology 2009; 29(6): 870–876.
23. Andrikoula M, McDowell I. The contribution of ApoB and ApoA1 measurements to cardiovascular risk assessment. Diabetes, Obesity and Metabolism 2008; 10(4): 271–278.
24. Movva R, Rader DJ. Laboratory assessment of HDL heterogeneity and function. Clinical Chemistry 2008; 54(5): 788–800.
25. Kohan AB. ApoC-III: a potent modulator of hypertriglyceridemia and cardiovascular disease. Current Opinion in Endocrinology, Diabetes, and Obesity 2015; 22(2): 119.
26. Colditz GA, Philpott SE, Hankinson SE. The impact of the Nurses’ Health Study on population health: prevention, translation, and control. American Journal of Public Health 2016; 106(9): 1540–1545.
27. Kang S. Fitting semiparametric accelerated failure time models for nested case–control data. Journal of Statistical Computation and Simulation 2017; 87(4): 652–663.
28. Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. The Annals of Statistics 1982: 1100–1120.
29. Du Y, Akritas M. Uniform strong representation of the conditional Kaplan-Meier process. Mathematical Methods of Statistics 2002; 11(2): 152–182.
