Abstract
Nested case-control (NCC) is a sampling method widely used for developing and evaluating risk models with expensive biomarkers on large prospective cohort studies. In a typical NCC design, biomarker values are obtained on a sub-cohort, where cases consist of all the events (subjects who experience the event during the follow-up). However, when the number of events is not small, due to the cost and limited availability of bio-specimen, one may select only a subset of events as cases. We refer to such a variation as the untypical NCC. Unfortunately, existing inverse probability weighted (IPW) estimators for the untypical NCC are biased, and they only focus on relative risk parameters under the proportional hazards (PH) model. In this manuscript, we propose new weighting methods that produce consistent IPW estimators for not only relative risk parameters but also several metrics that evaluate a risk model’s predictive performance. We also provide the inference procedure via perturbation resampling, which captures all the variance and between-subject covariance induced by the sampling processes for both case and control selections. In addition, our methods are not limited to the PH model, and they can be applied to the time-specific generalized linear model. Under the typical NCC design, our new weights are equivalent to the weight proposed in Samuelsen; under the untypical NCC, the IPW estimators using our weights have smaller bias and variance than the existing methods. We will demonstrate this improved performance via both analytical and numerical investigations.
Keywords: between-subject covariance, inverse probability weighting, nested case-control, perturbation resampling, time-dependent accuracy measure
1. Introduction
Risk prediction using novel biomarkers plays a vital role in disease prevention and management. The development and evaluation of risk models require rich information from large-scale cohort studies where participants are followed prospectively to the clinical outcome of interest, and their clinical information is collected at baseline. Many cohorts also obtain biological specimens that are used later for investigating new biomarkers to improve the predictive capacity of the risk model. Ascertaining biomarkers on a large population is often costly and labor-intensive, and there is a need to preserve precious biological samples. Thus, sub-cohort sampling such as nested case-control (NCC) is often employed, where new biomarkers are measured on the sub-cohort instead of the entire cohort.
An NCC sub-cohort is constructed in two steps: first, cases are selected, and second, for each case, a number of controls are selected from a group of subjects who are event-free at the failure time of the case; this group is referred to as the risk set of the case. For a typical NCC study, all the events (subjects who encounter the event during the follow-up) are included as cases at the first step. However, when the number of events is not small, the practical constraints described above (cost and the need to preserve samples) can make it infeasible to include all the events as cases.
To address this challenge, some studies modified the typical NCC design by selecting only a subset of events as cases. We refer to this variation as the untypical NCC. For example, Jakszyn et al. (2006) investigated the association of Helicobacter pylori infection and vitamin C levels with the risk of gastric cancer. They included 229 out of 314 gastric cancer patients as cases, and for each case, two to four controls were selected; the same design was used in Jakszyn et al. (2012). Lü et al. (2018) conducted a multi-center study on fatality, hospitalization costs, and length of stay due to healthcare-associated infection (HAI). They selected 10 HAI events as cases from each of the 51 hospitals, and for each case, one control is selected and matched on several criteria. Besides these applications, Edelmann et al. (2020) and Graziano et al. (2021) compared the untypical NCC with other sub-cohort sampling methods, such as case-cohort, via simulation studies. Unlike these two papers, our manuscript focuses on just NCC, particularly the untypical variation.
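The two-step construction described above, including the option of keeping only a subset of events as cases, can be sketched as follows. This is a minimal illustration and not the implementation in the NCCIPW package: the function name `sample_ncc`, the parameter names `case_fraction` and `m`, the use of simple random sampling at both steps, the exclusion of the case from its own risk set, and the handling of tied times are all assumptions made for the sketch. Matching on additional variables is not included.

```python
import random

def sample_ncc(times, events, case_fraction=1.0, m=1, seed=0):
    """Draw a nested case-control sample (hypothetical helper).

    times[i]      : observed follow-up time of subject i
    events[i]     : 1 if subject i had the event during follow-up, else 0
    case_fraction : share of events kept as cases (1.0 gives the typical design)
    m             : number of controls drawn per selected case
    """
    rng = random.Random(seed)
    event_ids = [i for i, d in enumerate(events) if d == 1]
    n_cases = min(len(event_ids), max(1, round(case_fraction * len(event_ids))))
    # Step 1: simple random sample of events, without replacement.
    case_ids = sorted(rng.sample(event_ids, n_cases))

    controls_by_case = {}
    for j in case_ids:
        # Risk set of case j: everyone else still under observation at the
        # case's event time (ties are treated as still at risk: an assumption).
        risk_set = [i for i in range(len(times)) if i != j and times[i] >= times[j]]
        # Step 2: draw m controls from the risk set, without replacement.
        controls_by_case[j] = rng.sample(risk_set, min(m, len(risk_set)))
    return case_ids, controls_by_case

# Toy cohort: four events, half of them kept as cases, two controls per case.
times = [2.0, 5.0, 3.0, 7.0, 1.0, 6.0, 4.0, 8.0]
events = [1, 0, 1, 0, 1, 0, 1, 0]
cases, controls = sample_ncc(times, events, case_fraction=0.5, m=2, seed=1)
```

Note that an event that is not kept as a case at the first step can still enter the sub-cohort at the second step, because it belongs to the risk set of any case with an earlier event time.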
For analyzing typical NCC data, conditional logistic regression has been widely used to estimate hazard ratios under the proportional hazards (PH) model (Goldstein and Langholz 1992). Another popular approach is the inverse probability weighted (IPW) estimation (Samuelsen 1997, Cai and Zheng 2012; 2011, Zhou et al. 2015). Compared to conditional logistic regression, IPW estimation can be applied to a variety of models, including the PH model and time-specific generalized linear model (GLM) (Uno et al. 2007). Samuelsen (1997) provided the weight formula: the inverse of the selection probability. For the typical NCC, since all the events are selected as cases at the first step, their selection probability is 1, and thus, their weights are 1.
What about the untypical NCC? Conditional logistic regression would still be valid, but some studies misused it. For example, Jakszyn et al. (2006) fit an unconditional logistic regression but included the matching variables as the covariates in the regression. Edelmann et al. (2020) and Graziano et al. (2021) employed the IPW method. However, their weights for the events in the sub-cohort are different: Edelmann et al. (2020) still used 1, but Graziano et al. (2021) used , where is the proportion of the events that are selected as cases at the first step.
Unfortunately, our investigations show that these two existing IPW estimators are both biased. This is because they fail to recognize that, for the untypical NCC, some events are selected as cases at the first step, whereas others are selected only at the second step through the control selection. To differentiate these two scenarios, we refer to the former as event cases and the latter as event controls. Because of this difference, the weight must be designed carefully to avoid biased estimation and to improve estimation efficiency.
In this manuscript, we propose three weighting methods that all result in consistent IPW estimators. Our first two methods weight event cases and event controls differently based on how they are selected. In contrast, our third weight has the same formula for these two groups, which is the inverse of the overall selection probability: the probability of being ever selected to the sub-cohort as a case or a control. For this weighting method, we formulate the NCC design as a two-stage stratified sampling framework described in Breslow et al. (2009). Under this framework, our third weight follows the method of Horvitz and Thompson (1952). We will prove that the expectation of each weight given the data is either exactly or asymptotically 1, which is the condition for the consistency of the IPW estimator.
Statistical inference is another challenge for the IPW estimation under the untypical NCC. It is difficult to obtain an analytical variance estimate for the IPW estimator because the variance expression is complicated. Standard resampling procedures such as bootstrap fail to capture the correlation induced by the finite-population sampling (Gray 2009). Cai and Zheng (2013) proposed a perturbation resampling method for the typical NCC, and their approach accounts for only between-control correlations. However, for the untypical NCC, sampling processes are involved for both case and control selections. Thus, we propose a new perturbation procedure that captures both between-case and between-control correlations.
In this manuscript, we provide IPW estimators under both the PH model and time-specific GLM. The model parameters characterize relative risks, such as log hazard ratios for the PH model, which are the focus of all the existing works on the untypical NCC. Besides these parameters, we are also interested in several time-dependent accuracy summaries for evaluating the risk model’s predictive performance, such as the true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), negative predictive value (NPV), and area under a receiver operating characteristic (ROC) curve (AUC) (Heagerty and Zheng 2005, Cai and Zheng 2012; 2011, Zhou et al. 2015). To make our methods publicly available, we provide an R package NCCIPW at https://github.com/michellezhou2009/NCCIPW.git.
The remainder of the manuscript is organized as follows. In Section 2, we introduce our three new weights and present the IPW estimators for model and accuracy parameters. In Section 3, we derive and compare the asymptotic properties of the IPW estimators using our weights. We also describe our perturbation resampling procedure for drawing inferences on the parameters. We compare our methods with the existing approaches via simulation studies in Section 4 and a data example in Section 5. Concluding remarks are given in Section 6.
2. IPW Estimation with Three New Weights
In this section, we introduce our three new weights and their resulting IPW estimators for model and accuracy parameters. Under the typical NCC, the three weights are equivalent to the weight of Samuelsen (1997). Under the untypical NCC, their differences are displayed in Table 1, which lists the expressions of each weight for three groups of subjects in the sub-cohort: event cases, event controls, and non-event controls. We have defined event cases and event controls earlier: they have the same disease status but enter the sub-cohort in different ways. Non-event controls are subjects who are event-free during the follow-up and selected as controls. Before these descriptions, we first introduce notation that we will use throughout the manuscript.
Table 1:
Expressions of our three weights: for three groups of subjects in the sub-cohort under the untypical NCC, i.e., : (i) event cases, events that are selected as cases, (ii) event controls, events that are selected only through the control selection, (iii) non-event controls, non-events that are selected as controls. Note that some event cases might also be selected as controls for other cases, and thus, their could be 1 or 0.
| event cases | (1, 1, ) | 1 | ||
| event controls | (1, 0, 1) | 0 | ||
| non-event controls | (0, 0, 1) |
2.1. Notation
An NCC data set includes variables on survival outcome, markers, and sub-cohort sampling indicators. For survival outcome, let denote the time to the event of interest. For some subjects, might not be observed due to censoring; for instance, the subject is lost to follow-up, or the follow-up ends. Let denote a random censoring time, and let denote the follow-up duration. For each subject, we only observe and , where , and is an indicator function. Subjects who experience the event during the follow-up, i.e., with are referred to as events, and those with are referred to as non-events. Let denote a -dimensional vector of clinical markers and biomarkers.
We define two sub-cohort sampling indicator variables: if subject is selected as a case, and if subject is selected as a control. Let indicate whether subject is ever selected to the sub-cohort. In an NCC data set, and clinical markers are observed on the full cohort, but biomarkers are usually available for only subjects with .
Furthermore, we express the control indicator variable as
| (1) |
where is an indicator variable: if subject is a control of case subject , and denotes the risk set for subject . Specifically, consists of all the subjects who have not experienced the event before that subject's event time. For some studies, the controls are also matched on some variables. For these situations, the risk set is expressed as , where is a vector of matching variables, and denotes being less than or equal to component-wise.
In addition, as described earlier, is the proportion of events that are selected as cases. When , all the events are cases, and this study is a typical NCC; when , it is an untypical variation.
2.2. Three New Weights
Our first two methods assign different weights to event cases and event controls. Like Edelmann et al. (2020) and Graziano et al. (2021), we consider two different values: 1 or for event cases. To make the expectation of each weight equal to 1, we need to weight event controls correspondingly. Specifically, our first weight is defined as
| (2) |
and the second weight is
| (3) |
where
| (4) |
is the probability of subject being selected as a control, is the number of controls for each case, and is the size of the risk set for subject . By Equations (2) and (3), we have: for non-event controls, for event cases, but for event controls, but .
We want to point out that our formula of is different from Equation (2.6) in Samuelsen (1997), given as , which does not contain . We modify this formula because, under the untypical NCC, controls are selected only for selected cases, and not all the events are selected as cases. Thus, is needed. When whenever ; our formula (4) is equivalent to Equation (2.6) in Samuelsen (1997) under the typical NCC.
Remark 1 The probability is usually small for large cohort studies because is often very small relative to the size of the risk set. In addition, if an event, say subject , has a short event time, its probability can be close to zero since subject is eligible for the risk sets of very few cases.
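The behavior described in Remark 1 can be illustrated with a Samuelsen-type calculation of the probability of being drawn as a control at least once, in which only events that were actually selected as cases contribute. The sketch below is an assumption-laden illustration rather than a transcription of Equation (4): the function name `control_inclusion_prob`, the argument names, the convention that the risk set excludes the case itself, and the treatment of tied times are all hypothetical choices, and matching is ignored.

```python
def control_inclusion_prob(i, times, events, is_case, m):
    """Probability that subject i is drawn as a control for at least one case.

    is_case[j] is True when event j was selected as a case at the first step;
    m is the number of controls drawn per case (hypothetical argument names).
    """
    prob_never = 1.0
    for j in range(len(times)):
        # Controls are drawn only for events that were selected as cases.
        if j == i or events[j] != 1 or not is_case[j]:
            continue
        # Subject i is eligible only if still under observation at case j's time.
        if times[i] >= times[j]:
            n_eligible = sum(
                1 for k in range(len(times)) if k != j and times[k] >= times[j]
            )
            prob_never *= 1.0 - min(1.0, m / n_eligible)
    return 1.0 - prob_never

times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 0]

# Only the first event is a selected case: subject 3 can be its control.
p_late = control_inclusion_prob(3, times, events, [True, False, False, False], m=1)
# The earliest event is never eligible for another case's risk set.
p_early = control_inclusion_prob(0, times, events, [True, True, False, False], m=1)
```

In this toy cohort, the subject with the earliest event time has probability zero of being drawn as a control, which mirrors the second point of Remark 1.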
Remark 2 Using our notations, the weight in Edelmann et al. (2020) is expressed as
| (5) |
and the weight in Graziano et al. (2021) is expressed as
| (6) |
For non-event controls, ; for both event cases and event controls, but . In Appendix A.1 of Supplementary Material, we show that neither weight has an expectation of 1 when , and consequently, the resulting IPW estimators are biased for the untypical NCC.
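A toy calculation conveys the intuition behind this result. It is a deliberately simplified sketch and not the derivation in Appendix A.1: a single event is assumed to be selected as a case with probability `p` and, if missed as a case, to be drawn as a control with a fixed probability `q`, independently of the case draw. The names `p`, `q`, and `expected_weights` are hypothetical, and the second existing weight is assumed here to be the inverse of the case-selection proportion.

```python
def expected_weights(p, q):
    """Expected weight of one event under three weighting rules (toy setting)."""
    # Probability that the event enters the sub-cohort at all:
    # as a case (p), or missed as a case but drawn as a control ((1 - p) * q).
    pi = p + (1.0 - p) * q
    return {
        "weight_one": pi * 1.0,            # weight 1 for every sampled event
        "inverse_case_fraction": pi / p,   # weight 1/p for every sampled event
        "inverse_overall": pi / pi,        # weight 1/pi for every sampled event
    }

untypical = expected_weights(p=0.5, q=0.1)
typical = expected_weights(p=1.0, q=0.1)
```

With `p = 0.5` and `q = 0.1`, the first rule has expectation below 1 and the second above 1, whereas the inverse of the overall selection probability has expectation exactly 1. With `p = 1`, all three coincide.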
Unlike and , our third weight has the same formula for event cases and event controls. This method is inspired by Breslow et al. (2009), which formulated the case-cohort sampling as a two-stage stratified sampling. Following this idea, we render the NCC design as such a framework and, based on it, propose a Horvitz-Thompson’s type of weight.
NCC as a two-stage stratified sampling.
In the first stage, the subjects in the full cohort are classified into strata, where is the number of events, i.e., with being the size of the full cohort. Each stratum includes one event and his/her risk set. At the second stage, a two-step sampling is performed: first, select strata with , and second, within each selected stratum, select the case and subjects from the risk set of this case.
This framework differs from the one described in Breslow et al. (2009) in two aspects. First, the strata in Breslow et al. (2009) are disjoint, but the strata in NCC usually overlap because a subject can be eligible for the risk set of more than one case. Second, the sampling in the untypical NCC is not independent. The strata selection is a finite-population sampling; if one stratum is selected, the others have a lower chance of being selected.
Horvitz-Thompson’s weight.
Our third weight is defined as the inverse probability of being ever selected into the sub-cohort, given as
| (7) |
This expression indicates that for both event cases and event controls, ; for non-event controls, .
2.3. Comparison of Three Weights
When , all three weights have the same expression and are equivalent to the weight proposed in Samuelsen (1997): .
When , the three weights have the same expression for non-events: , but for events, they are completely different. Specifically,
for event cases, , and ; among them, ;
for event controls, , and ; among them, . We also want to point out that the weight completely discards this group of subjects for estimation.
In Appendix A, we derive each weight’s expectation, variance, and covariance. These properties determine the consistency and asymptotic variance of its resulting IPW estimator, which will be delineated in Section 3.
2.4. IPW estimation
We consider two models for characterizing the relationship between the event time and markers Z: (i) the PH model and (ii) the time-specific GLM. Under each model, we present the IPW estimators of the model parameters and accuracy parameters. We want to point out that these parameters are associated with the probability model in Equation (8) below, which describes the larger population from which the full cohort is drawn (Breslow et al. 2009).
Since the IPW estimators using the three weights have a similar expression except for the weight, we will present the estimators using , representing one of the three weights.
2.4.1. Model Parameters Estimation
Both the PH model and time-specific GLM can be expressed in the following form:
| (8) |
where is a link function, and , a -dimensional vector, consists of the relative risk parameters that characterize the effects of the markers on the risk .
PH model.
This model can be expressed as , where is the baseline cumulative hazard function. Based on Equation (8), , which can be estimated by the IPW Breslow’s estimator with the weight (Cai and Zheng 2012). The relative risk parameters are estimated via maximizing the IPW log partial likelihood function with the weight (Samuelsen 1997).
The PH model assumes the relative risks to be constant over time. However, in practice, the biomarkers may have strong effects on the short-term risk but weak effects on the long-term risk, or vice versa (Zhou et al. 2015). In these situations, the time-specific GLM allows the marker effects to vary over time .
Time-specific GLM.
This model can be expressed as , where both and are functions of . Given a , these parameters can be estimated via double IPW. Each observation is weighted by , where is the sampling weight accounting for the missing values of due to the sub-cohort sampling, and is the censoring weight accounting for the missing disease status due to censoring. The censoring weight is given as , where is a consistent estimator of , the survival function of the censoring time. If the censoring time is independent of both the event time and markers, can be the Kaplan-Meier estimator (Kaplan and Meier 1958). If the censoring time depends on the markers , a PH model can be fit to estimate .
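For illustration, one commonly used form of censoring weight for a binary status at a fixed time is sketched below, with the censoring survival function estimated by Kaplan-Meier. The manuscript's own expression is not reproduced here; the form of the weight, the function names `km_censoring_survival` and `censoring_weights`, and the argument `t` for the prediction time are assumptions made for this sketch, which also assumes that censoring is independent of the event time and markers and that there are no ties between event and censoring times.

```python
def km_censoring_survival(times, events):
    """Kaplan-Meier estimate of the censoring survival function, as a step function."""
    n = len(times)
    # For the censoring distribution, a censored observation plays the role of the "event".
    censor_times = sorted({times[i] for i in range(n) if events[i] == 0})
    steps = []  # (time, estimated survival just after that time)
    surv = 1.0
    for s in censor_times:
        at_risk = sum(1 for i in range(n) if times[i] >= s)
        censored = sum(1 for i in range(n) if times[i] == s and events[i] == 0)
        surv *= 1.0 - censored / at_risk
        steps.append((s, surv))

    def G(u):
        value = 1.0
        for s, g in steps:
            if s <= u:
                value = g
            else:
                break
        return value

    return G

def censoring_weights(times, events, t):
    """Weight each subject by the inverse probability that status at t is observed."""
    G = km_censoring_survival(times, events)
    weights = []
    for x, d in zip(times, events):
        if x <= t and d == 1:
            weights.append(1.0 / G(x))  # event observed by time t
        elif x > t:
            weights.append(1.0 / G(t))  # known to be event-free at time t
        else:
            weights.append(0.0)         # censored before t: status at t unknown
    return weights

times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 0, 1, 0, 1]
cens_w = censoring_weights(times, events, t=3.5)
```

Subjects censored before the prediction time receive weight 0, and the remaining subjects are up-weighted to represent them.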
Let and denote the IPW estimates of the model parameters under either the PH model or time-specific GLM. Let be the limiting values of as , and they can be regarded as the true values of the model parameters.
2.4.2. Accuracy Parameters Estimation
The probability in Equation (8) can be used as a risk score that classifies subjects into different risk categories. Given a cut-off value , subjects with are classified as the high-risk group, and the low-risk group consists of subjects with . We evaluate the above models’ predictive accuracy with the following time-dependent TPR, FPR, PPV, and NPV.
These accuracy parameters are based on the population risk score . They are defined as , , and . In addition, the time-dependent AUC is the area under the time-dependent ROC curve, which is a curve of versus over all possible values of . The time-dependent AUC can be expressed as , a conditional probability that, given a pair of an event and a non-event by time , the event has a higher risk score.
With the estimates , we can calculate the estimated risk . The time-dependent accuracy measures described above can be estimated by the double IPW estimators:
and
| (9) |
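A weighted version of these accuracy summaries can be sketched as follows. The sketch assumes the standard definitions (TPR and FPR condition on event status by the prediction time, PPV and NPV condition on the risk group), treats a risk score at or above the cut-off as high risk, and counts tied risk scores as one half in the AUC; these conventions, the function name `weighted_accuracy`, and its argument names are assumptions rather than the manuscript's exact estimators. The argument `weights` stands for the product of the sampling weight and the censoring weight, so subjects outside the sub-cohort or with unknown status carry weight 0.

```python
def weighted_accuracy(risk, times, events, weights, t, c):
    """Weighted TPR, FPR, PPV, NPV at cut-off c and AUC at prediction time t."""
    n = len(risk)
    is_event = [1.0 if (x <= t and d == 1) else 0.0 for x, d in zip(times, events)]
    is_nonevent = [1.0 if x > t else 0.0 for x in times]
    high = [1.0 if r >= c else 0.0 for r in risk]
    low = [1.0 - h for h in high]
    ones = [1.0] * n

    def wsum(a, b):
        return sum(w * ai * bi for w, ai, bi in zip(weights, a, b))

    tpr = wsum(is_event, high) / wsum(is_event, ones)
    fpr = wsum(is_nonevent, high) / wsum(is_nonevent, ones)
    ppv = wsum(is_event, high) / (wsum(is_event, high) + wsum(is_nonevent, high))
    npv = wsum(is_nonevent, low) / (wsum(is_event, low) + wsum(is_nonevent, low))

    # AUC: weighted share of (event, non-event) pairs in which the event
    # has the higher risk score; ties count one half (an assumption).
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            pair_w = weights[i] * is_event[i] * weights[j] * is_nonevent[j]
            if pair_w == 0.0:
                continue
            den += pair_w
            if risk[i] > risk[j]:
                num += pair_w
            elif risk[i] == risk[j]:
                num += 0.5 * pair_w
    return {"TPR": tpr, "FPR": fpr, "PPV": ppv, "NPV": npv, "AUC": num / den}

acc = weighted_accuracy(
    risk=[0.9, 0.4, 0.6, 0.2],
    times=[1.0, 2.0, 5.0, 6.0],
    events=[1, 1, 0, 0],
    weights=[1.0, 1.0, 1.0, 1.0],
    t=3.0,
    c=0.5,
)
```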
3. Asymptotic Properties of IPW estimators
In this section, we will present the consistency and asymptotic distribution of the IPW estimators using our three weights: , and .
3.1. Consistency of IPW estimators
In Appendix A.1, we prove that given the data , , and for any . Thus, the IPW estimator using each weight is consistent for both the typical and untypical NCC.
3.2. Asymptotic Distribution of IPW Estimators
In Appendix B, we show that the IPW estimator with each weight can be expressed as a weighted sum of independent zero-mean random variables that are functions of data , shown in Equation (B.1). Thus, they are asymptotically normally distributed, and the asymptotic variance is the sum of two components. The first component is the model-based variance of the estimator as if the full cohort data are available (Equation (B.3)). The second component is the sampling variance, i.e., variability from the sub-cohort sampling, including the variance and covariance of and given the data (Equations (B.4) and (B.5)). Breslow et al. (2009) and Edelmann et al. (2020) expressed their IPW estimator variance by a similar decomposition.
3.3. Comparison of IPW Estimators
Since the three IPW estimators are all consistent, we compare their efficiency. As pointed out in Section 2.3, when , the three weights have the same expression, and thus, their IPW estimators have the same asymptotic variance for the typical NCC.
Our comparison focuses on , i.e., the untypical NCC. The three IPW estimators have the same model-based variance since this part does not involve the sampling weight. Their sampling variance for non-events is also the same since the three weights have the same expression for this group. Therefore, the difference in total variance lies in the sampling variance for the events, which is mainly controlled by the variance of the sampling weight.
Equations (A.7) - (A.9) in Appendix A.2 express each weight’s variance for events: , and with , where . When , we have:
| (10) |
The first inequality is because the probability is usually very small (Remark 1), leading to , and thus, . The second inequality is because , and thus, . Again, because is small, these two variances are very close. The order in Equation (10) determines that the IPW estimator with is less efficient than the other two, which have similar variances. These analytical conclusions agree with the numerical results of the simulation study in Section 4.2 and the data example in Section 5.
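The role of the selection probability in comparisons of this kind can be seen from a generic fact about inverse probability weights, stated here independently of Equations (A.7) - (A.9). The symbols below are generic notation introduced for this illustration: a selection indicator $V$ with $P(V = 1) = \pi$ and the weight $w = V/\pi$.

```latex
\operatorname{E}(w) = \frac{\pi}{\pi} = 1, \qquad
\operatorname{Var}(w) = \frac{\operatorname{E}(V^{2})}{\pi^{2}} - 1
                      = \frac{1}{\pi} - 1
                      = \frac{1 - \pi}{\pi}.
```

The variance is decreasing in $\pi$, so a weight built on a larger selection probability has a smaller variance, and two selection probabilities that differ only slightly give nearly equal variances.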
3.4. Inference via Perturbation
To draw inferences on the parameters of interest, we need a variance estimate for the IPW estimator. Samuelsen (1997) provided an analytical variance estimate for the PH model parameters under the typical NCC. However, for the untypical NCC, the IPW estimator’s variance expression is more complicated, as shown above, and therefore, obtaining an analytical estimate is challenging. Resampling procedures, such as bootstrap, fail to estimate the variance accurately because they cannot emulate the between-subject correlations (Gray 2009, Cai and Zheng 2013).
Cai and Zheng (2013) proposed a perturbation resampling procedure for variance estimation under the typical NCC. Their method mimics the variance and covariance of via repeatedly perturbing these indicator variables. However, under the untypical NCC, there is also a sampling process for the case selection, so the perturbation of alone is not sufficient. Thus, we extend this procedure by perturbing both ’s and ’s to recover all the variances and covariances.
Like bootstrap, the perturbation method creates many perturbed counterparts for the estimator. We can calculate their empirical variance, which approximates the finite-sample variance of the estimator.
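The general mechanics of perturbation resampling can be sketched on a simple weighted mean. This is only a generic illustration under strong simplifications: each subject receives an independent standard exponential multiplier (one distribution with mean 1 and variance 1), and the sampling weights are held fixed, so the sketch does not reproduce the perturbation of the case and control indicators that is needed to capture the between-subject correlation under the NCC design. The function name `perturbation_variance` and its arguments are hypothetical.

```python
import random
import statistics

def perturbation_variance(values, weights, n_perturb=500, seed=0):
    """Empirical variance of perturbed weighted means (generic illustration)."""
    rng = random.Random(seed)
    perturbed = []
    for _ in range(n_perturb):
        # Independent positive multipliers with mean 1 and variance 1.
        xi = [rng.expovariate(1.0) for _ in values]
        num = sum(x * w * v for x, w, v in zip(xi, weights, values))
        den = sum(x * w for x, w in zip(xi, weights))
        perturbed.append(num / den)
    # The spread of the perturbed estimates approximates the estimator's variance.
    return statistics.variance(perturbed)

values = [i % 10 for i in range(200)]
var_hat = perturbation_variance(values, [1.0] * len(values))
```

For this unweighted toy example, the result can be compared with the usual variance of a sample mean, which is about 0.041 for these values.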
Perturbed counterpart of IPW estimator.
Here, we use to represent one of our weights. For the model parameters and , the perturbed counterparts of their IPW estimators, denoted by and , are obtained by replacing the sampling weight and censoring weight with their respective perturbed counterparts, and . We will explain how to perturb these two weights later.
For the double IPW estimators of accuracy parameters, their counterparts are obtained using the perturbed weight and the perturbed risk score . For example, the perturbed counterpart of in Equation (9) is given as
Perturbed censoring weight .
Let be independent and identically distributed random variables with mean 1 and variance 1. The perturbed censoring weight is given as , where is the estimate of with each subject weighted by .
Previously, we concluded that the two IPW estimators using and are more efficient than . Thus, we describe how to perturb these two weights.
Perturbed sampling weight .
This perturbed sampling weight is , obtained by replacing the indicator variables and as well as their probabilities and with their perturbed counterparts. Specifically, the perturbed counterpart of the case indicator is . The probability can be written as , and its perturbed counterpart is given as . By Equation (1), the perturbed counterpart of the control indicator variable is given as . The probability in Equation (4) can be written as where , and its perturbed counterpart is .
Perturbed sampling weight .
With the notations above, this perturbed weight is .
It is worth noting that our perturbation procedure is valid for both the typical and untypical NCC. When , and . The perturbed weights and are the same as those proposed in Cai and Zheng (2013) for the typical NCC studies.
4. Simulation Studies
We investigate the proposed methods via the following two simulation studies. The first one focuses on comparing the IPW estimators using our three new weights: , , and and the two weights of Edelmann et al. (2020) and Graziano et al. (2021): and in terms of their bias and variability. For ease of presentation, we use , and to denote the five IPW estimators using the five weights. The second study investigates the validity of the proposed perturbation resampling procedure for estimating the variance of the IPW estimator.
As defined earlier, the true values of the model and accuracy parameters are governed by the underlying data generating mechanism. Appendix C of Supplementary Material explains how to calculate the true parameter values based on the simulation scheme below. In both studies, we use the true parameter values to evaluate our proposed estimation and inference procedures as well as the existing estimators.
4.1. Simulation setting
For both studies, we use the same simulation setting. We consider one clinical marker that is measured on the full cohort and one biomarker that is measured only on the NCC sub-cohort. The marker is first generated from the standard normal distribution . In many situations, biomarkers are correlated with clinical markers, and thus, we simulate depending on , where . Given these two markers, the event time is obtained from
| (11) |
where is generated from an extreme value distribution with the cumulative distribution function . As described in Section 2.1, the event time might be censored by a random censoring time variable and a follow-up duration of years, where is generated from with . Thus, about 90% of the event times are censored by .
We consider two full cohort sizes: and 10,000. To construct the NCC sub-cohort, we set (the percentage of events that are selected as cases) to be 20%, 50%, and 80%. For each selected case, 1 or 3 controls are selected from either (i) the case’s risk set without matching, or (ii) the case’s risk set with exact matching on a variable and matching up to ±1 on another variable . The two matching variables and are generated as follows. Let where with , and let be the closest integer to where .
We set the prediction time with the event rate . To estimate the absolute risk , we fit the PH model and the time-specific GLM with a logit link. As explained in Section 2.4.1, both models can be expressed as the general form in Equation (8): . Since we consider only one value, we suppress from the parameters. We also want to point out that the true data generating mechanism in Equation (11) follows the PH assumption, and thus the GLM is a misspecified model.
Under each model, we obtain the IPW estimates of the relative risk parameters . For the PH model, as mentioned earlier, conditional logistic regression (clogit) is an alternative method to obtain the estimates of . Thus, we compare the five IPW estimators with the clogit estimator. In addition, each model is evaluated on the following time-dependent accuracy measures: TPR, PPV, and NPV at a cutoff value making FPR = 0.05, and AUC.
As described in Section 3.2, the total variance of the IPW estimator includes the model-based variance as if the full cohort data are available. To capture this part, we generate 1000 replicates of the full cohort data, and for each replication, an NCC sub-cohort is constructed following one of the sampling schemes described above.
4.2. Study I: Comparison of IPW Estimators using Five Weights
In the first study, we compare the empirical bias and standard deviation (SD) of the five IPW estimators. Tables 2 and 3 report these statistics for the relative risk parameters and under the PH model and time-specific GLM, respectively. The summary statistics for each model’s accuracy parameters are shown in Figures 1 and 2. All these results are for the settings where the controls are selected with matching, and those for the without-matching settings are included in Appendix D of Supplementary Material. We also report the root mean squared error (RMSE) in Tables 4 – 7 of Appendix D.
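The summary statistics reported in the tables can be computed from the replicated estimates along the following lines. This is a minimal sketch: the function name `relative_summary`, the use of the sample standard deviation, and the scaling of the RMSE by the true value are assumptions, since the text only states that bias and SD are reported relative to the true parameter value.

```python
import statistics

def relative_summary(estimates, true_value):
    """Bias, empirical SD, and RMSE as percentages of the true parameter value."""
    mean_est = statistics.fmean(estimates)
    sd = statistics.stdev(estimates)  # sample SD across replications
    rmse = statistics.fmean([(e - true_value) ** 2 for e in estimates]) ** 0.5
    return {
        "bias_pct": 100.0 * (mean_est - true_value) / true_value,
        "sd_pct": 100.0 * sd / abs(true_value),
        "rmse_pct": 100.0 * rmse / abs(true_value),
    }

summary = relative_summary([0.9, 1.1, 1.0, 1.2], true_value=1.0)
```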
Table 2:
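Simulation Results: Estimation of and under the PH model. For each parameter, the results include the bias and empirical SD (in parentheses), both expressed as a percentage of the true parameter value. The NCC sub-cohort is constructed with matching.
| Case:control ratio | IPW, first weight | IPW, second weight | IPW, third weight | IPW, Edelmann et al. (2020) weight | IPW, Graziano et al. (2021) weight | clogit |
|---|---|---|---|---|---|---|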
| 1:1 | 9.8 (99.4) | 0.5 (41.4) | 0.5 (41.4) | 19.4 (53.0) | 21.7 (55.9) | 1.0 (56.6) |
| 1:3 | 6.8 (66.9) | 0.6 (32.4) | 1.4 (31.6) | 20.2 (38.1) | 21.2 (40.0) | 1.2 (39.5) |
| 1:1 | 3.6 (53.7) | 2.0 (27.4) | 2.1 (27.2) | 11.2 (30.7) | 11.4 (30.4) | 1.6 (36.2) |
| 1:3 | 2.4 (38.9) | 1.1 (19.8) | 1.3 (19.4) | 10.1 (21.0) | 10.1 (21.1) | 0.1 (25.5) |
| 1:1 | 2.5 (33.1) | 2.1 (21.1) | 2.1 (21.0) | 5.2 (21.9) | 5.2 (21.8) | 2.2 (31.0) |
| 1:3 | 1.3 (22.7) | 1.2 (16.5) | 1.2 (16.3) | 4.1 (16.7) | 4.1 (16.6) | −1.2 (22.7) |
| 1:1 | 8.2 (80.3) | 0.0 (29.1) | 0.6 (29.0) | 18.7 (37.2) | 22.6 (39.4) | 0.3 (38.8) |
| 1:3 | 2.5 (54.0) | −0.5 (21.8) | 0.1 (20.7) | 18.6 (24.9) | 19.7 (26.8) | −0.9 (27.4) |
| 1:1 | −0.9 (47.1) | −0.6 (18.5) | −0.6 (18.3) | 8.2 (20.5) | 8.4 (20.4) | 0.2 (26.3) |
| 1:3 | 2.4 (31.4) | 0.0 (14.7) | 0.2 (14.3) | 9.0 (15.3) | 9.1 (15.3) | −0.9 (18.6) |
| 1:1 | 1.5 (24.1) | 0.4 (14.7) | 0.5 (14.6) | 3.5 (15.2) | 3.5 (15.1) | 0.1 (20.1) |
| 1:3 | −0.7 (18.8) | 0.2 (11.8) | 0.2 (11.6) | 3.1 (11.9) | 3.1 (11.9) | −0.8 (15.9) |
| 1:1 | 11.8 (76.3) | 2.7 (29.6) | 2.7 (29.5) | 23.8 (39.7) | 27.3 (41.8) | 12.1 (46.0) |
| 1:3 | 6.4 (49.9) | 1.4 (22.3) | 1.4 (21.8) | 22.1 (26.2) | 24.0 (28.0) | 5.3 (30.8) |
| 1:1 | 6.3 (39.6) | 1.2 (18.7) | 1.3 (18.7) | 10.9 (21.1) | 11.5 (21.0) | 6.4 (28.8) |
| 1:3 | 1.5 (29.2) | 0.5 (14.6) | 0.6 (14.4) | 10.4 (15.7) | 11.0 (15.7) | 1.8 (21.5) |
| 1:1 | 1.0 (24.7) | 0.0 (14.9) | 0.0 (14.9) | 3.2 (15.6) | 3.2 (15.5) | 4.3 (23.4) |
| 1:3 | 1.2 (16.7) | 0.1 (11.9) | 0.2 (11.8) | 3.4 (12.1) | 3.5 (12.1) | 0.5 (17.7) |
| 1:1 | 6.1 (58.7) | 0.7 (20.8) | 0.6 (20.6) | 19.3 (26.8) | 23.3 (28.1) | 7.3 (31.6) |
| 1:3 | 4.9 (38.2) | 0.1 (15.4) | 0.5 (14.9) | 20.4 (17.7) | 23.8 (19.0) | 3.1 (21.6) |
| 1:1 | 5.1 (34.2) | 1.6 (12.9) | 1.6 (12.8) | 10.9 (14.5) | 11.5 (14.4) | 5.9 (21.1) |
| 1:3 | 0.3 (24.6) | 0.4 (10.8) | 0.5 (10.5) | 10.2 (11.4) | 10.9 (11.2) | 0.9 (15.4) |
| 1:1 | 2.0 (17.8) | 0.8 (10.5) | 0.8 (10.5) | 4.0 (10.9) | 4.0 (10.9) | 4.3 (16.3) |
| 1:3 | 1.0 (13.3) | 0.5 (8.3) | 0.6 (8.1) | 3.8 (8.3) | 3.9 (8.3) | −0.3 (12.4) |
Table 3:
Simulation Results: Estimation of and under the time-specific GLM. For each parameter, the results include the bias and empirical SD (in parentheses), both expressed as a percentage of the true parameter value. The NCC sub-cohort is constructed with matching. In addition, for the IPW estimator using , its summary statistics are calculated using only the replications for which the GLM converged. The number of replications (out of the total 1000 replications) for which the GLM did not converge is reported as numbers labelled with *.
| Case:control ratio | IPW, first weight | IPW, second weight | IPW, third weight | IPW, Edelmann et al. (2020) weight | IPW, Graziano et al. (2021) weight |
|---|---|---|---|---|---|
| 1:1 | 1.8e+14 (9.5e+15); 34* | 2.2 (54.8) | 2.1 (54.2) | 19.5 (64.8) | 18.6 (69.1) |
| 1:3 | 1.3e+13 (8.7e+15); 8* | 3.2 (45.7) | 3.8 (43.0) | 19.7 (47.7) | 19.2 (56.7) |
| 1:1 | 1.8e+14 (4.6e+15); 2* | 3.5 (35.5) | 3.6 (35.2) | 11.7 (38.1) | 11.3 (38.0) |
| 1:3 | 5.8 (57.3); 0* | 2.4 (28.1) | 2.3 (27.1) | 9.3 (28.0) | 8.1 (28.6) |
| 1:1 | 5.2e+13 (1.6e+15); 2* | 4.0 (27.3) | 4.0 (27.3) | 6.7 (27.9) | 6.6 (27.9) |
| 1:3 | 3.6 (33.7); 0* | 2.8 (23.0) | 2.8 (22.8) | 5.0 (23.0) | 4.8 (23.0) |
| 1:1 | −8.3e+14 (1.4e+16); 30* | 1.4 (37.1) | 1.8 (36.3) | 19.0 (44.3) | 19.9 (46.8) |
| 1:3 | −3.2e+14 (7.6e+15); 7* | 0.7 (30.2) | 0.7 (28.6) | 16.6 (31.6) | 14.2 (38.8) |
| 1:1 | 3.6e+13 (1.1e+15); 5* | −0.6 (24.3) | −0.6 (24.1) | 7.2 (25.8) | 6.7 (25.9) |
| 1:3 | 5.5 (43.3); 1* | 0.5 (19.9) | 0.7 (19.1) | 7.8 (19.7) | 6.8 (20.2) |
| 1:1 | 3.1 (34.8); 0* | 1.1 (19.2) | 1.1 (19.1) | 3.7 (19.5) | 3.6 (19.5) |
| 1:3 | 0.0 (25.8); 0* | 0.8 (15.9) | 0.8 (15.7) | 3.1 (15.8) | 2.9 (15.8) |
| 1:1 | 5.3e+13 (6.7e+15); 34* | 4.0 (38.9) | 4.2 (38.3) | 23.4 (47.2) | 23.6 (50.2) |
| 1:3 | 2.5e+14 (6.4e+15); 8* | 3.4 (32.8) | 2.9 (30.6) | 20.3 (34.0) | 18.3 (40.8) |
| 1:1 | −7e+13 (2.4e+15); 2* | 1.8 (25.0) | 2.0 (24.9) | 10.3 (26.8) | 10.1 (26.9) |
| 1:3 | 6.3 (43.5); 0* | 1.9 (21.2) | 2.0 (20.4) | 9.7 (21.2) | 9.1 (21.4) |
| 1:1 | −2.7e+13 (8.5e+14); 2* | 0.7 (19.6) | 0.6 (19.5) | 3.3 (20.0) | 3.3 (20.0) |
| 1:3 | 3.3 (24.5); 0* | 1.1 (17.0) | 1.2 (16.9) | 3.7 (17.1) | 3.6 (17.1) |
| 1:1 | 5.3e+14 (9e+15); 30* | 2.1 (26.4) | 2.1 (25.8) | 19.9 (31.5) | 20.0 (33.7) |
| 1:3 | 4.3e+13 (3.2e+15); 7* | 1.8 (22.1) | 2.2 (21.3) | 19.2 (23.4) | 19.5 (28.7) |
| 1:1 | −2.4e+13 (7.6e+14); 5* | 3.2 (17.0) | 3.2 (17.0) | 11.3 (18.2) | 11.0 (18.3) |
| 1:3 | 3.4 (34.4); 1* | 2.2 (14.8) | 2.2 (14.0) | 9.8 (14.6) | 9.1 (14.7) |
| 1:1 | 4.5 (25.6); 0* | 2.0 (13.3) | 2.0 (13.3) | 4.7 (13.6) | 4.6 (13.6) |
| 1:3 | 2.9 (18.9); 0* | 1.9 (11.1) | 1.9 (10.9) | 4.4 (11.0) | 4.3 (11.0) |
Figure 1:

Simulation Results: Estimation of AUC and TPR, PPV, and NPV at the cut-off value which corresponds to FPR=0.05 under the PH model. For each parameter, the results include the absolute bias (the marker) and empirical SD (half of the bar length) relative to the true parameter value in 100%. The NCC sub-cohort is constructed with matching.
Figure 2:

Simulation Results: Estimation of AUC and TPR, PPV, and NPV at the cut-off value which corresponds to FPR=0.05 under the time-specific GLM. For each parameter, the results include the absolute bias (the marker) and empirical SD (half of the bar length) relative to the true parameter value in 100%. The NCC sub-cohort is constructed with matching. In addition, for the IPW estimator using , its summary statistics are calculated using only the replications for which the GLM converged.
When calculating these summary statistics, we take into account the magnitude of the parameter values: we compute the bias, SD, and RMSE relative to the true parameter value. Specifically, let denote the IPW estimate of a parameter based on the data generated in the -th replication, , and let denote the true value of . The above-mentioned tables and figures present the relative bias, calculated as rBias with , and the relative SD as . The RMSE is an overall metric accounting for both bias and variability. The relative RMSE is calculated as rRMSE . In addition, Figures 1 and 2 plot the absolute value of the relative bias to better compare its magnitude.
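The summary statistics described above can be sketched in a few lines of Python. This is only an illustration under assumed notation: with estimates $\hat\theta_r$ from replications $r = 1, \dots, R$ and true value $\theta_0$, it takes rBias as $(\bar\theta - \theta_0)/\theta_0$, rSD as the sample SD of the estimates divided by $|\theta_0|$, and rRMSE as $\sqrt{\text{rBias}^2 + \text{rSD}^2}$. The function name, the $n-1$ denominator in the SD, and the exact form of rRMSE are assumptions, not taken from the manuscript.

```python
import math
import statistics


def relative_summaries(estimates, true_value):
    """Return (rBias, rSD, rRMSE), each as a percentage of the true value.

    Assumed definitions (hypothetical, for illustration only):
      rBias = (mean of estimates - true value) / true value
      rSD   = sample SD of estimates / |true value|
      rRMSE = sqrt(rBias^2 + rSD^2)
    """
    mean_est = statistics.fmean(estimates)
    r_bias = (mean_est - true_value) / true_value
    r_sd = statistics.stdev(estimates) / abs(true_value)
    r_rmse = math.sqrt(r_bias ** 2 + r_sd ** 2)
    return 100 * r_bias, 100 * r_sd, 100 * r_rmse


# Toy input: four made-up estimates of a parameter whose true value is 1.0.
r_bias, r_sd, r_rmse = relative_summaries([0.9, 1.1, 1.0, 1.2], true_value=1.0)
```

Dividing by the true value puts parameters of different magnitudes on a common percentage scale, which is how the tables and figures report them.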
Results for PH model.
For estimating the relative risk parameters , we find that a larger full cohort size , a larger , or a larger number of controls per case tend to produce smaller bias and smaller variance. Among the five IPW estimators and the clogit estimator, and perform the best with the least bias and highest efficiency, and therefore, their RMSEs are the smallest. Between them, the variance of is slightly smaller, but their difference is minimal, especially when is close to 1. This observation agrees with our analytical conclusions in Section 3.3.
The IPW estimator has a smaller bias than and . However, its disadvantage is variability, which is the largest among all the estimators, and because of this, the RMSE of is also the largest. As explained in Section 3.3, its large variance is due to small values of . In addition, when decreases, we find that gets less efficient because there are fewer cases that a control can be selected for, which means drops even more.
The results clearly show that and are biased. As explained in Remark 2, the bias arises because the expectations of their weights are not equal to 1. In addition, as derived in Appendix A.1, when approaches 1, these expectations get closer to 1. Accordingly, we observe that their biases decrease when increases. When is small, for instance, , doubling the full cohort size did not significantly reduce the bias. In terms of variability, their variances are slightly smaller than that of the clogit estimator. Overall, their RMSEs are similar to that of the clogit estimator.
For the accuracy parameters, we observe similar results: and perform the best and similarly, has the largest variance, and and have the largest bias. When is closer to 1, these five IPW estimators are more alike because their weights are more similar to each other.
Results for time-specific GLM.
Most of the results are similar to those under the PH model. The main difference is that, besides being inefficient, did not converge in some replications. Table 3 lists the number of “nonconverged” replications for each scenario. The summary statistics for all the parameters are calculated using only the estimates from the converged replications. However, some of the converged replications still yield “abnormal” estimates, which inflate the relative bias and SD by many orders of magnitude (see the first column of Table 3).
We believe that this issue is caused by event controls that experienced the event early. As explained in Remark 1, for such subjects, their values are close to zero, leading to a huge weight , but their binary outcomes are 1. In addition, when decreases, as explained earlier, would be even smaller, and consequently, this issue gets worse. In comparison, other IPW estimators do not have this problem because their weights are smaller for these subjects.
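The mechanism described above can be illustrated with a toy calculation. The selection probabilities below are hypothetical numbers chosen only to show how a single near-zero probability lets one subject dominate an inverse probability weighted fit. They are not values from the simulation.

```python
def ipw_weights(selection_probs):
    """Inverse probability weights: one over each subject's selection probability."""
    return [1.0 / p for p in selection_probs]


# Hypothetical sub-cohort: nine subjects with moderate selection probabilities
# and one event control whose probability of being selected is close to zero.
probs = [0.5] * 9 + [0.001]
weights = ipw_weights(probs)

# Fraction of the total weight carried by the single most heavily weighted subject.
share_of_largest = max(weights) / sum(weights)
```

In this toy case the one subject with probability 0.001 carries more than 98% of the total weight, so a weighted GLM fit is driven almost entirely by that subject's outcome.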
4.3. Study II: Perturbation Resampling for Inference
The second study examines the validity of the perturbation resampling procedure for two IPW estimators and since they are the best among all the estimators. In the following, we use to denote one of these two estimators. In each replication, besides the estimate , we obtain 500 perturbed counterparts with generated from an exponential distribution with rate 1. The perturbed counterpart is denoted as where indexes the perturbation, and still indexes the replication. We calculate the SD of , which is the standard error (SE) associated with the estimate . This SE is denoted as and used to construct a level confidence interval (CI) for the parameter: , where is the quantile of a standard normal distribution.
We evaluate this procedure with two metrics, reported in Tables 4 and 5. The first metric compares the average SE, i.e., with the empirical SD of the estimates . Specifically, this metric is the ratio of the average SE over the empirical SD. The second metric is the empirical coverage probability of the 95% CI constructed with the perturbation-based SE. We observe that for both estimators, the average SE is close to the empirical SD, and the empirical coverage probability is close to the nominal level 95% for most of the scenarios. These results indicate that the proposed perturbation resampling can accurately capture the variability of the IPW estimator, and thus, the inference procedure is valid.
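The SE, the confidence interval, and the two evaluation metrics can be sketched as follows. This is a minimal illustration, not the manuscript's implementation: it assumes the perturbed estimates have already been computed (the perturbation step itself, which must account for the case and control sampling, is not reproduced here), it assumes a Wald-type interval of the form estimate ± quantile × SE, and all function names are hypothetical.

```python
import statistics
from statistics import NormalDist


def perturbation_ci(estimate, perturbed, level=0.95):
    """SE = sample SD of the perturbed estimates; CI = estimate +/- z * SE."""
    se = statistics.stdev(perturbed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # standard normal quantile
    return se, (estimate - z * se, estimate + z * se)


def evaluate(estimates, ses, cis, true_value):
    """Return (average SE / empirical SD, empirical coverage) over replications."""
    ratio = statistics.fmean(ses) / statistics.stdev(estimates)
    covered = [1.0 if lo <= true_value <= hi else 0.0 for lo, hi in cis]
    return ratio, statistics.fmean(covered)


# Toy inputs (made-up numbers): one estimate with three perturbed counterparts.
se, (lo, hi) = perturbation_ci(1.0, [0.8, 1.0, 1.2])

# Toy evaluation over two replications with a true value of 1.0.
ratio, coverage = evaluate(
    estimates=[0.9, 1.1],
    ses=[0.1, 0.2],
    cis=[(0.7, 1.1), (1.05, 1.5)],
    true_value=1.0,
)
```

A ratio near 1 indicates that the perturbation-based SE tracks the empirical SD across replications, and a coverage near the nominal level indicates that the resulting intervals are well calibrated.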
Table 4:
Simulation Results: the ratios of the average SE to empirical SD and empirical coverage probabilities (in the parentheses) of 95% CIs for relative risk parameters. Both the SEs and 95% CIs are obtained via perturbation resampling. The NCC sub-cohort is constructed with matching.
| PH Model | ||||
| 1:1 | 0.927 (92.8%) | 0.914 (92.5%) | 0.926 (92.5%) | 0.915 (92.5%) |
| 1:3 | 0.963 (94.2%) | 0.934 (93.7%) | 1.005 (95.2%) | 0.976 (94.9%) |
| 1:1 | 0.912 (92.0%) | 0.910 (92.4%) | 0.952 (94.0%) | 0.948 (93.7%) |
| 1:3 | 1.025 (94.5%) | 1.006 (94.8%) | 0.997 (94.5%) | 0.977 (94.1%) |
| 1:1 | 0.959 (94.2%) | 0.959 (94.7%) | 0.967 (94.3%) | 0.966 (94.4%) |
| 1:3 | 1.001 (95.2%) | 1.001 (94.8%) | 0.995 (94.0%) | 0.990 (94.4%) |
| 1:1 | 0.944 (93.6%) | 0.930 (93.6%) | 0.938 (94.1%) | 0.935 (94.1%) |
| 1:3 | 1.012 (94.9%) | 1.006 (95.4%) | 1.021 (94.7%) | 1.000 (94.6%) |
| 1:1 | 0.965 (94.3%) | 0.965 (94.3%) | 0.987 (94.8%) | 0.983 (94.2%) |
| 1:3 | 0.977 (94.4%) | 0.971 (94.1%) | 0.949 (94.2%) | 0.944 (94.1%) |
| 1:1 | 0.981 (94.5%) | 0.986 (94.0%) | 0.978 (94.3%) | 0.978 (94.5%) |
| 1:3 | 0.988 (94.3%) | 0.989 (94.4%) | 1.015 (94.4%) | 1.018 (94.7%) |
| Time-specific GLM | ||||
| 1:1 | 0.963 (94.8%) | 0.955 (94.6%) | 0.976 (95.2%) | 0.972 (94.7%) |
| 1:3 | 0.981 (94.8%) | 0.982 (94.8%) | 0.990 (95.2%) | 1.000 (94.6%) |
| 1:1 | 0.936 (93.5%) | 0.936 (93.0%) | 0.956 (94.5%) | 0.951 (94.0%) |
| 1:3 | 1.012 (94.7%) | 1.012 (95.3%) | 0.969 (94.3%) | 0.967 (94.5%) |
| 1:1 | 0.973 (93.5%) | 0.969 (93.8%) | 0.972 (94.4%) | 0.973 (94.1%) |
| 1:3 | 0.983 (94.6%) | 0.981 (94.9%) | 0.964 (94.3%) | 0.957 (94.3%) |
| 1:1 | 0.993 (94.4%) | 0.996 (94.4%) | 1.003 (95.7%) | 1.008 (95.5%) |
| 1:3 | 1.037 (95.2%) | 1.033 (95.2%) | 1.019 (95.5%) | 1.002 (95.6%) |
| 1:1 | 0.963 (93.9%) | 0.962 (93.4%) | 0.994 (94.6%) | 0.986 (94.0%) |
| 1:3 | 1.000 (94.8%) | 1.010 (95.8%) | 0.975 (94.7%) | 0.992 (95.0%) |
| 1:1 | 0.976 (93.6%) | 0.978 (94.0%) | 1.012 (94.8%) | 1.011 (95.1%) |
| 1:3 | 1.002 (94.7%) | 1.004 (94.8%) | 1.037 (96.2%) | 1.046 (96.1%) |
Table 5:
Simulation Results: the ratios of the average SE to empirical SD and empirical coverage probabilities (in the parentheses) of 95% CIs for accuracy parameters. Both the SEs and 95% CIs are obtained via perturbation resampling. The NCC sub-cohort is constructed with matching.
| AUC | TPR | PPV | NPV | |||||
|---|---|---|---|---|---|---|---|---|
| PH Model | ||||||||
| 1:1 | 0.922 (90.9%) | 0.914 (91.1%) | 1.006 (94.0%) | 1.008 (94.8%) | 1.107 (96.1%) | 1.093 (96.0%) | 0.928 (92.2%) | 0.925 (93.1%) |
| 1:3 | 0.973 (92.8%) | 0.967 (93.9%) | 1.027 (94.4%) | 1.027 (94.7%) | 1.036 (94.9%) | 1.031 (95.2%) | 0.913 (92.6%) | 0.929 (92.5%) |
| 1:1 | 0.949 (93.7%) | 0.949 (94.2%) | 1.011 (94.9%) | 1.007 (93.6%) | 1.027 (94.4%) | 1.036 (95.6%) | 0.911 (91.9%) | 0.908 (92.4%) |
| 1:3 | 0.958 (93.7%) | 0.954 (93.5%) | 1.015 (94.4%) | 1.002 (94.2%) | 0.979 (93.3%) | 0.973 (93.9%) | 0.896 (91.6%) | 0.891 (91.3%) |
| 1:1 | 0.987 (93.6%) | 0.981 (93.3%) | 1.012 (95.3%) | 1.013 (95.1%) | 1.005 (94.9%) | 1.007 (94.8%) | 0.941 (92.9%) | 0.943 (93.3%) |
| 1:3 | 1.024 (95.2%) | 1.019 (95.3%) | 1.011 (95.2%) | 1.014 (94.8%) | 0.973 (93.6%) | 0.986 (94.1%) | 0.860 (90.0%) | 0.857 (90.3%) |
| 1:1 | 0.967 (94.2%) | 0.961 (94.1%) | 1.022 (94.5%) | 1.018 (94.8%) | 1.063 (96.3%) | 1.060 (95.2%) | 0.953 (93.7%) | 0.948 (93.6%) |
| 1:3 | 0.968 (94.5%) | 0.949 (93.5%) | 0.991 (93.5%) | 0.988 (93.3%) | 0.967 (94.3%) | 0.972 (94.7%) | 0.915 (92.3%) | 0.919 (92.8%) |
| 1:1 | 0.957 (93.1%) | 0.955 (93.4%) | 1.016 (95.1%) | 1.021 (95.4%) | 1.034 (95.1%) | 1.046 (95.1%) | 0.912 (91.7%) | 0.911 (91.9%) |
| 1:3 | 0.970 (93.4%) | 0.974 (94.1%) | 0.997 (94.9%) | 1.002 (94.6%) | 0.976 (94.7%) | 0.994 (94.8%) | 0.876 (90.6%) | 0.883 (90.7%) |
| 1:1 | 0.964 (93.8%) | 0.967 (93.7%) | 1.006 (94.8%) | 1.007 (94.7%) | 1.026 (94.8%) | 1.029 (95.3%) | 0.897 (90.7%) | 0.898 (90.5%) |
| 1:3 | 0.982 (95.1%) | 0.985 (94.6%) | 1.031 (95.8%) | 1.050 (96.3%) | 1.008 (95.5%) | 1.024 (95.4%) | 0.833 (89.3%) | 0.838 (89.0%) |
| Time-specific GLM | ||||||||
| 1:1 | 0.916 (91.1%) | 0.908 (90.8%) | 1.029 (95.1%) | 1.023 (95.2%) | 1.107 (95.9%) | 1.109 (96.5%) | 0.938 (93.3%) | 0.925 (93.4%) |
| 1:3 | 0.958 (92.3%) | 0.954 (92.7%) | 1.050 (94.9%) | 1.059 (95.5%) | 1.039 (95.2%) | 1.055 (95.3%) | 0.917 (91.9%) | 0.937 (92.7%) |
| 1:1 | 0.947 (93.5%) | 0.946 (93.9%) | 1.022 (94.5%) | 1.021 (94.6%) | 1.038 (94.4%) | 1.046 (95.0%) | 0.910 (92.0%) | 0.906 (92.2%) |
| 1:3 | 0.956 (93.6%) | 0.952 (93.6%) | 1.014 (94.3%) | 1.003 (95.0%) | 0.974 (93.7%) | 0.972 (94.1%) | 0.894 (92.0%) | 0.891 (91.7%) |
| 1:1 | 0.987 (93.2%) | 0.982 (93.3%) | 1.027 (95.4%) | 1.017 (95.2%) | 1.012 (94.6%) | 1.010 (94.7%) | 0.943 (93.3%) | 0.941 (92.8%) |
| 1:3 | 1.020 (95.2%) | 1.017 (95.2%) | 1.012 (95.5%) | 1.009 (94.6%) | 0.976 (93.6%) | 0.981 (93.4%) | 0.859 (90.2%) | 0.858 (90.1%) |
| 1:1 | 0.962 (94.2%) | 0.956 (93.8%) | 1.027 (94.7%) | 1.036 (95.2%) | 1.058 (95.4%) | 1.072 (95.6%) | 0.954 (93.8%) | 0.955 (94.2%) |
| 1:3 | 0.966 (94.2%) | 0.945 (93.0%) | 1.000 (94.3%) | 1.016 (94.9%) | 0.969 (94.3%) | 0.989 (95.1%) | 0.922 (92.4%) | 0.931 (92.5%) |
| 1:1 | 0.956 (92.9%) | 0.955 (92.9%) | 1.029 (94.9%) | 1.030 (95.5%) | 1.055 (95.7%) | 1.054 (95.7%) | 0.912 (91.8%) | 0.912 (91.8%) |
| 1:3 | 0.968 (93.4%) | 0.973 (94.2%) | 1.004 (94.5%) | 1.011 (95.3%) | 0.981 (95.3%) | 1.005 (95.4%) | 0.875 (91.3%) | 0.880 (91.0%) |
| 1:1 | 0.959 (93.5%) | 0.963 (93.7%) | 1.014 (94.9%) | 1.019 (94.6%) | 1.036 (95.2%) | 1.044 (95.7%) | 0.900 (91.5%) | 0.901 (91.1%) |
| 1:3 | 0.979 (95.0%) | 0.982 (94.4%) | 1.036 (95.8%) | 1.059 (96.6%) | 1.010 (95.9%) | 1.022 (95.7%) | 0.835 (89.7%) | 0.847 (89.9%) |
5. Data Example: the Framingham Offspring Study
To further compare our three new weights with existing methods, we use the Offspring Cohort of the Framingham Heart Study (Wawrzyniak 2013) as an example, in which the outcome of interest is a cardiovascular disease (CVD) event. This cohort includes 1501 males and 1644 females; among them, 989 subjects experienced a CVD event during the follow-up of about 35 years.
To build a risk model for CVD, we consider two markers: the Framingham risk score (FRS) and a biomarker, C-reactive protein (CRP). The FRS was developed by Wilson et al. (1998) to estimate the 10-year CVD risk. This score is gender-specific and based on several risk factors including age, systolic blood pressure, diastolic blood pressure, total cholesterol, high-density lipoprotein cholesterol, current smoking status, and diabetes status. CRP is an inflammation biomarker that has been shown to improve prediction beyond the risk variables in the FRS (Ridker 2003, Cook et al. 2006).
The Offspring Cohort is the full cohort, and we construct NCC sub-cohorts from it and obtain the IPW estimates using our three weights: , and , and two existing weights and . To evaluate the accuracy of these IPW estimators, we use the estimates obtained from the full cohort as the reference values since both FRS and CRP are available for the full cohort. We want to point out that unlike the simulation studies, we cannot compute the true parameter values for this example because the underlying data generating mechanism is unknown. Thus, the full-cohort estimates are the best estimates we could have in this situation.
Specifically, we draw 100 sub-cohorts following each of the two sampling schemes: (i) selecting of the 989 CVD events as cases and 1 control for each case, and (ii) selecting of the 989 CVD events as cases and 3 controls for each case. These two schemes produce a similar sub-cohort size despite different : out of 100 samples, on average, about 904 subjects are selected based on Scheme (i) and 896 based on Scheme (ii).
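One way to picture these sampling schemes is the sketch below. It is a simplified illustration under stated assumptions: cases are drawn by simple random sampling from the events, controls are drawn without matching from the subjects still at risk at each case's event time, and the function name, the `case_fraction` argument, and the toy cohort are all hypothetical. The fraction of events selected as cases is left as a parameter because it is not fixed here.

```python
import random


def draw_untypical_ncc(times, events, case_fraction, m, rng):
    """Draw one untypical NCC sample; return (case_ids, controls_by_case).

    times[i]  : observed follow-up time of subject i
    events[i] : 1 if subject i had the event, 0 if censored
    """
    event_ids = [i for i, d in enumerate(events) if d == 1]
    n_cases = round(case_fraction * len(event_ids))
    # Assumption: cases are a simple random sample of the events.
    case_ids = rng.sample(event_ids, n_cases)
    controls = {}
    for c in case_ids:
        # Risk set: everyone else still under observation at the case's event
        # time, including subjects who have the event later (event controls).
        risk_set = [j for j in range(len(times)) if j != c and times[j] >= times[c]]
        controls[c] = rng.sample(risk_set, min(m, len(risk_set)))
    return case_ids, controls


# Toy cohort of 200 subjects with made-up follow-up times and event indicators.
rng = random.Random(0)
times = [rng.uniform(0, 35) for _ in range(200)]
events = [1 if rng.random() < 0.3 else 0 for _ in range(200)]
cases, controls = draw_untypical_ncc(times, events, case_fraction=0.5, m=3, rng=rng)

# The sub-cohort is the union of the selected cases and all sampled controls.
subcohort = set(cases) | {j for ids in controls.values() for j in ids}
```

Because the same subject can be sampled as a control for several cases, and a later event can serve as a control for an earlier case, the sub-cohort is usually smaller than the number of cases times (1 + m).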
We consider two prediction times, 15 and 30 years. Within the 15-year follow-up, 213 subjects experienced a CVD event and 6 subjects were lost to follow-up; the 15-year event risk is estimated to be about 7%. Within the 30-year follow-up, 805 subjects experienced a CVD event and 169 subjects were lost to follow-up; the 30-year event risk is estimated to be about 34%. We fit both the PH model and the time-specific GLM with the FRS and log-transformed CRP. Figures 3, 4, and 5 include the boxplots of the five IPW estimates for the relative risk and accuracy parameters. In these figures, the horizontal lines represent the full-cohort estimates. In addition, for the relative risk parameters of the PH model, we compare the five IPW estimators with the clogit estimator in Figure 3(a).
Figure 3:

Data Example: Boxplots of the IPW estimates using five weights: , and for the marker effects under the PH model and time-specific GLM based on 100 NCC sub-cohorts. For the PH model, the IPW estimates are also compared with the clogit estimates. The triangle inside each box represents the mean of the estimates; the dashed horizontal lines represent the full-cohort estimates.
Figure 5:

Data Example: Boxplots of the IPW estimates using five weights: , and for the accuracy parameters under the time-specific GLM based on 100 NCC sub-cohorts. The triangle inside each box represents the mean of the estimates; the dashed horizontal lines represent the full-cohort estimates.
We observe results similar to those of the simulation study. The two IPW estimators and perform the best: they are closest to the full-cohort estimates on average and have the smallest variability. In addition, these two estimators perform similarly. By contrast, and are farthest from the full-cohort estimates on average, especially for the accuracy parameters. For , although the GLM estimation converged for all the samples, it still has the largest variance.
From this data example, we also observe that the advantage of and over all the other methods is more obvious for predicting the 15-year risk than the 30-year risk. In addition, each estimator tends to have a larger variance for , compared to , although these sampling schemes lead to a similar sub-cohort size. This phenomenon is more apparent for the 15-year risk prediction. This indicates that if the sub-cohort size is fixed given a budget, selecting more cases, i.e., a larger , and a lower control-to-case ratio would lead to more efficient estimates.
6. Concluding Remarks
In this paper, we are interested in the untypical NCC design where a subset, not all, of the events are selected as cases. Such a design, although not generally considered in the statistical literature, is useful in practice for various reasons. In particular, for biomarker studies, samples from events may be more easily depleted and require careful preservation. To analyze the untypical NCC data with the IPW approach, we need to address two challenges. First, event cases and event controls are selected into the sub-cohort in different ways. Failing to account for this difference would lead to biased estimation, as is the case for the weights in Edelmann et al. (2020) and Graziano et al. (2021). In contrast, our two weights and weight these two groups differently based on how an event enters the sub-cohort. Although our third weight has the same formula for both groups, its selection probability for events counts both case and control selections. In addition, when all the events are selected as cases, all three weights are equivalent to Samuelsen’s weight.
The second challenge is statistical inference since the IPW estimator for the untypical NCC has a complicated variance structure, including both between-case and between-control correlations induced by the finite-population sampling. We provided a perturbation resampling procedure for drawing inferences on both model and accuracy parameters. Our simulation study has demonstrated that this procedure can well approximate the empirical variance of the IPW estimator, and consequently, the coverage probability of the perturbation-based CI is close to the nominal level.
Among our three IPW estimators, and have a similar performance, and they are more efficient than . We have provided the analytical derivation and numerical evidence. In addition, and perform better than all the existing IPW and clogit estimators. Thus, we recommend these two weighting methods for the IPW estimation under the untypical NCC.
Our proposed framework can be further extended to NCC studies where the cases are sampled via a more complex design, such as stratified sampling (Lü et al. 2018). In such a situation, we can design the following two weights: , where is the probability that subject is selected as a case, and . Together, our proposed work could open the door for more efficient and practical biomarker studies.
In this manuscript, all the weights are design-based because the selection probability is calculated based on the sampling scheme. An alternative approach is to augment the weight by either calibrating or estimating the selection probability from a model with auxiliary variables (Breslow et al. 2009). The model-based weighting method is not the focus of this manuscript, but it is worth future exploration for NCC designs since it has the potential to improve the estimation efficiency.
Supplementary Material
Figure 4:

Data Example: Boxplots of the IPW estimates using five weights: , and for the accuracy parameters under the PH model based on 100 NCC sub-cohorts. The triangle inside each box represents the mean of the estimates; the dashed horizontal lines represent the full-cohort estimates.
Acknowledgement
We thank the reviewers for their constructive suggestions, which led to improvements in our methodology and manuscript. In addition, we acknowledge that the Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or the NHLBI.
This work is supported by grants R01 HL089778, R01 CA236558, and U01 CA86368 from the National Institutes of Health.
Footnotes
Supplementary Material
The supplementary material consists of four parts. In Appendix A, we derive the expectation, variance, and covariance of our three new weights. Appendix B includes the derivation of the asymptotic variance for the IPW estimators and the justification of the perturbation resampling method for estimating the asymptotic variance. In Appendix C, we explain how to obtain the true values of the model and accuracy parameters based on the simulation setting in Section 4.1, and Appendix D includes additional results of the simulation studies.
References
- Breslow NE, Lumley T, Ballantyne CM, Chambless LE, and Kulich M (2009). Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Statistics in Biosciences, 1(1):32–49.
- Cai T and Zheng Y (2011). Nonparametric evaluation of biomarker accuracy under nested case-control studies. Journal of the American Statistical Association, 106:569–580.
- Cai T and Zheng Y (2012). Evaluating prognostic accuracy of biomarkers under nested case-control studies. Biostatistics, 13:89–100.
- Cai T and Zheng Y (2013). Resampling procedures for making inference under nested case–control studies. Journal of the American Statistical Association, 108(504):1532–1544.
- Cook NR, Buring JE, and Ridker PM (2006). The effect of including C-reactive protein in cardiovascular risk prediction models for women. Annals of Internal Medicine, 145(1):21–29.
- Edelmann D, Ohneberg K, Becker N, Benner A, and Schumacher M (2020). Which patients to sample in clinical cohort studies when the number of events is high and measurement of additional markers is constrained by limited resources. Cancer Medicine, 9(20):7398–7406.
- Goldstein L and Langholz B (1992). Asymptotic theory for nested case-control sampling in the Cox regression model. The Annals of Statistics, 20(4):1903–1928.
- Gray RJ (2009). Weighted analyses for cohort sampling designs. Lifetime Data Analysis, 15(1):24–40.
- Graziano F, Valsecchi MG, and Rebora P (2021). Sampling strategies to evaluate the prognostic value of a new biomarker on a time-to-event end-point. BMC Medical Research Methodology, 21(1):1–11.
- Heagerty P and Zheng Y (2005). Survival model predictive accuracy and ROC curves. Biometrics, 61(1):92–105.
- Horvitz DG and Thompson DJ (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685.
- Jakszyn P, Agudo A, Lujan-Barroso L, Bueno-de Mesquita HB, Jenab M, Navarro C, Palli D, Boeing H, Manjer J, Numans ME, et al. (2012). Dietary intake of heme iron and risk of gastric cancer in the European Prospective Investigation into Cancer and Nutrition study. International Journal of Cancer, 130(11):2654–2663.
- Jakszyn P, Bingham S, Pera G, Agudo A, Luben R, Welch A, Boeing H, Del Giudice G, Palli D, Saieva C, et al. (2006). Endogenous versus exogenous exposure to N-nitroso compounds and gastric cancer risk in the European Prospective Investigation into Cancer and Nutrition (EPIC-EURGAST) study. Carcinogenesis, 27(7):1497–1501.
- Kaplan EL and Meier P (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481.
- Lü Y, Cai MH, Cheng J, Zou K, Xiang Q, Wu JY, Wei DQ, Zhou ZH, Wang H, Wang C, et al. (2018). A multi-center nested case-control study on hospitalization costs and length of stay due to healthcare-associated infection. Antimicrobial Resistance & Infection Control, 7(1):99.
- Ridker PM (2003). Clinical application of C-reactive protein for cardiovascular disease detection and prevention. Circulation, 107(3):363–369.
- Samuelsen S (1997). A pseudolikelihood approach to analysis of nested case-control studies. Biometrika, 84(2):379–394.
- Uno H, Cai T, Tian L, and Wei L (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association, 102(478):527–537.
- Wawrzyniak AJ (2013). Framingham Heart Study, pages 811–814. Springer New York, New York, NY.
- Wilson PW, D’Agostino RB, Levy D, Belanger AM, Silbershatz H, and Kannel WB (1998). Prediction of coronary heart disease using risk factor categories. Circulation, 97(18):1837–1847.
- Zhou QM, Zheng Y, Chibnik LB, Karlson EW, and Cai T (2015). Assessing incremental value of biomarkers with multi-phase nested case-control studies. Biometrics, 71(4):1139–1149.
