Abstract
The nested case-control sampling design is a popular method for cohort studies in which events are rare. Controls are randomly selected, with or without matching on a variable that is fully observed across the cohort, to control for confounding factors. In this article, we propose a new nested case-control sampling design that incorporates both the extreme case-control design and a resampling technique. The new algorithm has two main advantages over the conventional nested case-control design. First, it inherits the strength of the extreme case-control design in that it does not require the risk set at each event time to be specified. Second, the target number of controls can be determined solely by budget and time constraints, and the resampling method allows an undersampling design in which the total number of sampled controls is smaller than the number of cases. A simulation study demonstrates that the proposed algorithm performs well even when the number of controls is smaller than the number of cases. The proposed sampling algorithm is applied to a public data set collected for the “Thorotrast Study.”
Keywords: cost-effective design, extreme case-control design, nested case-control studies, partial likelihood, resampling method
Introduction
In practice, large-scale studies are often limited by time and budget constraints. One popular cohort-based sampling method that reduces the cost of data collection is the nested case-control (NCC) design, which selects a set of controls from the risk sets defined in the cohort and matches them to the respective cases. For example, Grant et al1 used the NCC design to reduce the cost and time involved in covariate ascertainment for the Life Span Study and the Adult Health Study cohorts of the Radiation Effects Research Foundation in Japan. NCC sample data with a time-to-event outcome are commonly analyzed using the Cox proportional hazards model based on the partial likelihood. Many previous studies discuss more general estimation and the asymptotic properties of the parameter estimators obtained from the partial likelihood in the NCC design.2–7
In general, the NCC design is only applicable to retrospective studies because censored subjects can be selected as controls during the sampling process. However, researchers often want to add new covariates, such as genetic information and blood test results, to statistical models (eg, in epigenetic studies), as in a prospective study. In this case, the controls should be observable subjects in the sense that further information can be obtained from them. One solution is to sample the controls at the end-point of the observation period, because these subjects will generally be available for further data collection. For example, Yao et al8 considered a new retrospective cohort-based sampling design, called end-point sampling (EPS), to sample the observed controls from a cohort, and then applied the expectation-maximization (EM) method for parameter estimation. Similarly, Sboner et al9 considered an extreme case-control (ECC) design with a naive statistical method in which the cases were patients who had died from prostate cancer within 10 years of diagnosis and the controls were patients who survived at least 10 years. Although the EPS and ECC designs were initially implemented to improve design efficiency, the same sampling idea can be used to reduce the cost of data collection and to observe additional information from the selected samples.
In this article, we propose a new cost-effective extreme case-control (CECC) design that uses a resampling technique. First, the total number of controls is determined by the budget constraints. Second, we draw the controls at the end-point, that is, from the final risk set matched to the final event. The total number of controls is arbitrary: it can be smaller than the number of cases and need not be a multiple of the number of cases. Nevertheless, the resampling method keeps the number of matched controls the same for every case. To adjust for the length bias induced by the ECC sampling design, we apply Salim et al’s10 weighted partial likelihood for parameter estimation. Thus, our proposed sampling algorithm inherits the main advantage of the ECC design. The new sampling strategy also saves time and cost, because the total number of controls can be chosen freely within the constraints.
This manuscript is organized as follows. First, we describe the classical NCC sampling design and our proposed CECC sampling design. We then conduct simulation studies to illustrate the relative bias and efficiency of the proposed method in comparison with the classical NCC design. In the real data example section, we apply the new sampling algorithm to the “Thorotrast Study” data, which were used to investigate the relationship between liver cancer occurrence and the volume of injected Thorotrast.11,12 Finally, we provide concluding remarks and discuss the limitations of the proposed method.
Methods
Basic setup
Consider a cohort study of a given size. For each individual in the cohort, we observe a vector of variables consisting of a time-invariant covariate, an observed time, and an event indicator. The observed time is the smaller of the individual's event time and censoring time, and the event indicator equals 1 if the event time does not exceed the censoring time and 0 otherwise.
Now assume that some individuals are observed to have an event; we call them ‘cases’ and order them by their event times. The risk set at a given time consists of the individuals who have neither had an event nor been censored up to that time. Because each case is removed from the risk sets at all later event times, and some subjects may be censored between two consecutive event times, the risk sets have a monotone decreasing structure: each risk set contains the risk sets at all later event times.
Consequently, the sizes of the risk sets form a non-increasing sequence. In this article, we consider cost-effective ECC designs as the sub-sample selection method. From the ECC design, we first obtain a sub-sample that includes the cases and the controls drawn from the final risk set. In this ECC sample, we additionally observe an exposure variable of research interest, which was not collected in the cohort sample because of its higher measurement cost.
NCC design
Thomas2 initially suggested that the NCC design can be implemented by taking control samples from each risk set. The theoretical properties of estimators under this NCC design were provided by Goldstein and Langholz4 and Borgan et al5 under Cox’s13 proportional hazards model. A conventional NCC design proceeds as follows:
(A) Specify the risk set for a case at its event time.
(B) Randomly select a fixed number of controls from the risk set without replacement.
(C) Repeat (A) and (B) for all cases.
Note that some individuals can be selected repeatedly for different cases, because individuals who are neither censored nor observed as cases are included in all risk sets. Usually, only a small number of control subjects is selected at each event time point. Figure 1 illustrates the classical NCC design. Figure 1A specifies the risk set at each event time point, and Figure 1B presents the selected NCC sample with two controls for each case. Here, the two control subjects are randomly selected from each risk set.
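The sampling steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `ncc_sample`, its arguments, and the convention that a subject is at risk whenever its observed time is at least the case's event time are assumptions made here.

```python
import random


def ncc_sample(times, events, m, seed=0):
    """Classical NCC sampling sketch (hypothetical helper, not from the paper).

    times  : observed time (event or censoring) for each subject
    events : 1 if the subject is a case, 0 otherwise
    m      : number of controls drawn for each case
    Returns a list of (case_index, [control_indices]) pairs.
    """
    rng = random.Random(seed)
    # Order the cases by their event times.
    case_ids = sorted((i for i, d in enumerate(events) if d == 1),
                      key=lambda i: times[i])
    matched = []
    for i in case_ids:
        # (A) risk set: subjects still under observation at the case's event time
        risk_set = [j for j in range(len(times))
                    if j != i and times[j] >= times[i]]
        # (B) draw controls without replacement within this risk set
        controls = rng.sample(risk_set, min(m, len(risk_set)))
        matched.append((i, controls))
    # (C) the loop above repeats (A) and (B) for every case
    return matched
```

Because each risk set is sampled independently, the same subject can appear as a control for several cases, and a later case can serve as a control for an earlier one.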
Figure 1.
Sampling procedure of NCC design. (A) Nested case-control sampling design. (B) Nested case-control sample with two control subjects.
Solid dot: case; empty dot: subject in each risk set; x: censored subject; : stratum.
In NCC studies, controls can be individually matched to cases to adjust for confounding or background covariates at the control-sampling stage. Matching is used when researchers want all controls to share specific characteristics, such as age and gender, with the corresponding case. In a matched NCC design, the control subjects at each event time are randomly sampled from within strata defined by the matching covariates. Given a matching covariate, the risk set for a case is restricted to the individuals whose value of the covariate falls in the same category as that of the case.
Depending on the data type of the matching covariate, the matching procedure can be implemented in two ways: category matching and caliper matching.14 When the covariate is categorical, the controls are matched exactly to the cases on the value of the covariate. When the covariate is continuous or ordered categorical, the controls are selected so that the distance between the covariate values of the case and each control lies within a given tolerance. For a fixed tolerance level, steps (A) and (B) of the conventional NCC design are modified to incorporate the matching procedure:
(A) Specify the risk set for a case, restricted to the individuals whose matching covariate equals that of the case (category matching) or lies within the tolerance of it (caliper matching).
(B) Randomly select controls from the restricted risk set without replacement.
Under this sampling design, there is no variation in exposure status between the cases and their matched controls. The matching design provides efficiency gains relative to simple random sampling when the matching covariate is dichotomous and uncommon and the interaction is positive.15
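The two matching rules can be illustrated with a small helper that restricts a risk set before controls are drawn. This is a minimal sketch under assumptions: the name `matched_risk_set`, the argument `tol`, and the use of an absolute difference as the caliper distance are choices made here for illustration.

```python
def matched_risk_set(case, times, z, tol=None):
    """Restrict the risk set of `case` by a matching covariate (hypothetical helper).

    times : observed time for each subject
    z     : matching covariate for each subject
    tol   : None for category matching (exact equality),
            a number for caliper matching (absolute difference at most tol)
    """
    eligible = []
    for j in range(len(times)):
        if j == case or times[j] < times[case]:
            continue  # not at risk at the case's event time
        if tol is None:
            same_stratum = z[j] == z[case]              # category matching
        else:
            same_stratum = abs(z[j] - z[case]) <= tol   # caliper matching
        if same_stratum:
            eligible.append(j)
    return eligible
```

Controls would then be drawn at random, without replacement, from the returned list.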
CECC design using a resampling method
In this article, we propose a new cost-effective ECC design that uses a resampling technique to allow an arbitrary number of controls, whereas in the classical NCC design the total number of controls is a multiple of the total number of cases. The number of controls is an arbitrary integer determined by the budget and time constraints, and it can even be smaller than the number of cases. From the selected controls, we reconstruct the controls matched to each case using a resampling technique. Because all of the cases in the cohort are selected, the design efficiency is proportional to the number of selected controls. The proposed CECC design is presented in Algorithm 1. For convenience, the number of controls matched to each case is taken to be the smallest integer that is not less than the ratio of the number of selected controls to the number of cases.
Algorithm 1.
CECC design.
Require: the set of cases the final risk set
Input: size of controls
Output: matched case-control samples with
1: draw controls from the risk set
2: if randomly assign controls to cases.
3: else if
4:   set
5:   while do
6:     assign controls to cases randomly selected from
7:     set and
8:   end while
9:   randomly select controls from the controls and then randomly assign them to the remaining cases.
10: else if
11:   set as the set of controls and
12:   while do
13:     randomly select controls from and then randomly assign to cases.
14:     set and
15:   end while
16:   set as the size of final and
17:   assign the remaining controls to cases randomly selected from
18:   randomly select controls from the initial controls and then assign them to the remaining cases.
19: end if
* is the smallest integer that satisfies
The main difference from the classical NCC design is that the new algorithm uses only the final risk set, that is, all of the cases share the same risk set, whereas the classical NCC design specifies a separate risk set for each case (see Figure 2). Figure 2A mirrors the NCC design in Figure 1A, except that only the final risk set is used as the risk set in the new CECC design. Figure 2B shows the control samples selected from the final risk set and randomly assigned to the stratum of each case at its event time point. This new sampling design is similar to the conditional approach of the ECC proposed by Salim et al,10 apart from the resampling step.
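The idea of spreading an arbitrary number of controls over all cases can be illustrated as follows. This is not Algorithm 1 itself but a simplified stand-in written for illustration: it gives every case the same number of controls by always reusing the least-used controls first. The name `cecc_assign` and the tie-breaking rule are assumptions, and the number of controls per case is taken as the ceiling of the control-to-case ratio, in line with the ratios reported in the tables.

```python
import math
import random


def cecc_assign(case_ids, control_pool, seed=0):
    """Give each case the same number of controls from a shared pool.

    Simplified stand-in for the resampling step (hypothetical helper):
    controls drawn from the final risk set are reused across cases, and
    the least-used controls are picked first so that usage stays balanced.
    Assumes the ids in control_pool are unique.
    """
    rng = random.Random(seed)
    n_cases, n_controls = len(case_ids), len(control_pool)
    m = math.ceil(n_controls / n_cases)      # controls per case
    usage = {c: 0 for c in control_pool}     # how often each control was used
    cases = list(case_ids)
    rng.shuffle(cases)                       # random order of cases
    matched = {}
    for case in cases:
        # Sort by usage count, breaking ties at random.
        order = sorted(control_pool, key=lambda c: (usage[c], rng.random()))
        chosen = order[:m]                   # m distinct controls
        for c in chosen:
            usage[c] += 1
        matched[case] = chosen
    return m, matched
```

With 5 cases and 3 controls, for example, each case receives one control and every control is used once or twice, which corresponds to the undersampling situation described above.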
Figure 2.
Sampling procedure of CECC design. (A) Cost-effective extreme case-control sampling design. (B) Cost-effective extreme case-control sampling design with two control subjects.
Solid dot: case; empty dot: subject in the risk set; x: censored subject; : stratum.
This new CECC design has several advantages over the conventional NCC design: (i) it is more cost-effective because the number of selected controls can be smaller than the number of cases and does not have to be a multiple of the number of cases; (ii) it does not need to specify all the risk sets at the sampling stage; and (iii) it does not require censoring information. Advantage (ii) is useful in practice. Suppose that we are interested in gene-exposure interaction effects based on cohort samples. We want to apply the NCC design because a gene-association study is expensive, but some subjects have been lost and new data cannot be obtained from censored subjects. Because the proposed method takes candidate control subjects only from the final risk set, extra biomarkers or genetic information can still be obtained from the selected controls.
For the matching design, we need to partition a given risk set by the matching variable. We can then mimic the NCC sampling procedure with a resampling technique within the identified strata. An example algorithm for the matching design with a binary matching variable is presented in Algorithm 2.
Algorithm 2.
CECC design matching on
Require: the set of cases with
the final risk set
Input: control size and
Output: matched case-control samples
1: draw controls from the risk set
2: for do
3:   if randomly assign controls to cases.
4:   else if set
5:     while do
6:       assign controls to cases randomly selected from
7:       set and
8:     end while
9:     randomly select controls from the controls and then assign them to the remaining cases.
10:   end if
11: end for
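The stratified version can be sketched by applying a reuse rule separately within each level of a binary matching variable. As before, this is a simplified illustration rather than Algorithm 2: the function name `cecc_assign_matched`, the dictionary `z`, and the use of a cyclic pass through a shuffled control list are assumptions made here.

```python
import math
import random
from itertools import cycle


def cecc_assign_matched(cases, controls, z, seed=0):
    """Assign controls to cases within strata of a binary matching variable.

    cases, controls : lists of subject ids (hypothetical inputs)
    z               : dict mapping each id to its matching value (0 or 1)
    """
    rng = random.Random(seed)
    matched = {}
    for level in (0, 1):
        stratum_cases = [i for i in cases if z[i] == level]
        stratum_controls = [j for j in controls if z[j] == level]
        if not stratum_cases:
            continue
        if not stratum_controls:
            raise ValueError(f"no controls available in stratum {level}")
        # Controls per case within this stratum.
        m = math.ceil(len(stratum_controls) / len(stratum_cases))
        rng.shuffle(stratum_cases)
        rng.shuffle(stratum_controls)
        # Cycle through the shuffled controls so they are reused once exhausted.
        stream = cycle(stratum_controls)
        for case in stratum_cases:
            matched[case] = [next(stream) for _ in range(m)]
    return matched
```

Because the number of controls per case never exceeds the number of controls in the stratum, consecutive draws from the cycle are distinct within each matched set.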
Partial likelihood and parameter estimation
In this article, we implicitly adopt the same assumptions required for the classical NCC designs: (A1) the event is rare; (A2) the censoring time is independent of the event time; and (A3) the event times are independent of one another. However, these assumptions do not guarantee consistent parameter estimation under our sampling design. Thus, we use Salim et al’s10 partial likelihood to adjust for our length-biased samples.
Applying Salim et al’s10 partial likelihood, the likelihood function according to our length-biased sample is
| (1) |
where are the covariates of case and are the covariates of control matched to case and
are the adjusting term for sampling bias defined by the fraction of survival times, and between event time and the end-point of observation
Approximating the survival function by its Kaplan-Meier (KM) estimate,16 the weights are estimated by
| (2) |
where is the weighted mean of NCC samples and is the average of cohort samples (see Salim et al10 for more details).
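The KM estimate that feeds into the weights can be computed with the standard product-limit formula. The sketch below shows only this ingredient; how the KM values are combined into the weights of equation (2) is not reproduced here, and the function name `kaplan_meier` is an assumption.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : observed times
    events : 1 for an observed event, 0 for a censored observation
    Returns a list of (t, S(t)) pairs at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve
```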
Once the estimated weights are plugged into the likelihood function (1), we obtain the estimator by maximizing the estimated partial likelihood or, equivalently, its logarithm. For the matching design, the matching variable takes the same value for a case and all of its matched controls, so the term associated with it must be removed from the likelihood.
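The plug-in and maximization step can be illustrated for a single covariate. The sketch below assumes that the weights enter the conditional likelihood as multiplicative factors on each subject's relative risk, which is the usual form of a weighted conditional logistic likelihood; the exact expression of equation (1) is not reproduced here. The names `weighted_cond_loglik` and `maximize`, and the search interval, are hypothetical.

```python
import math


def weighted_cond_loglik(beta, strata):
    """Weighted conditional log-likelihood for one covariate.

    strata : list of matched sets; each set is a list of (z, w) pairs
             with the case listed first (z = covariate, w = weight).
    Assumes the weights act multiplicatively on exp(beta * z).
    """
    total = 0.0
    for matched_set in strata:
        z_case, w_case = matched_set[0]
        denom = sum(w * math.exp(beta * z) for z, w in matched_set)
        total += math.log(w_case) + beta * z_case - math.log(denom)
    return total


def maximize(f, lo=-5.0, hi=5.0, tol=1e-8):
    """Golden-section search for the maximizer of a concave function."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - g * (b - a)
        d = a + g * (b - a)
        if f(c) < f(d):
            a = c   # the maximum lies to the right of c
        else:
            b = d   # the maximum lies to the left of d
    return (a + b) / 2.0
```

With all weights equal to 1, this reduces to the ordinary conditional logistic likelihood, which corresponds to the unweighted analysis of matched sets.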
Results and Discussions
Simulation study
In this section, we first generated 2000 Monte Carlo (MC) cohorts, each of which consists of individuals of To generate survival outcomes for cohort samples, the Cox proportional hazards model equation (3) is applied with constant baseline hazard and discrete covariates which comes from the Bernoulli with a probability of The censoring time is randomly generated from an exponential distribution with mean of Here, is the main covariate of interest which is only available for the NCC samples, and is used as matching variable for the matching design. The considered hazards ratios for and are of and The end-point is determined by the observation time for obtaining or events. The Cox proportional hazards model for the cohort is defined as
| (3) |
where the baseline risk rate, is defined as in this simulation setting. This simulation setup is similar to that of Salim et al.10
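A cohort of this kind can be generated as sketched below. With a constant baseline hazard, the event time under a Cox model is exponential with rate equal to the baseline hazard times the relative risk. All numerical settings are left as function arguments because the values used in the simulation are not reproduced here; the function name `simulate_cohort` and every argument name are assumptions.

```python
import math
import random


def simulate_cohort(n, beta1, beta2, lambda0, p1, p2, cens_mean, n_events, seed=0):
    """Generate one cohort under a Cox model with constant baseline hazard.

    Returns (tau, cohort) where tau is the end-point (time of the
    n_events-th observed event) and cohort is a list of
    (z1, z2, observed_time, event_indicator) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1 = 1 if rng.random() < p1 else 0          # Bernoulli covariate
        z2 = 1 if rng.random() < p2 else 0          # Bernoulli matching variable
        rate = lambda0 * math.exp(beta1 * z1 + beta2 * z2)
        t = rng.expovariate(rate)                   # event time
        c = rng.expovariate(1.0 / cens_mean)        # censoring time
        rows.append((z1, z2, t, c))
    event_times = sorted(t for _, _, t, c in rows if t <= c)
    if len(event_times) < n_events:
        raise ValueError("not enough events to reach the requested end-point")
    tau = event_times[n_events - 1]
    cohort = []
    for z1, z2, t, c in rows:
        x = min(t, c, tau)                          # observed time
        delta = 1 if (t <= c and t <= tau) else 0   # event observed before tau
        cohort.append((z1, z2, x, delta))
    return tau, cohort
```

Subjects whose observed time equals the end-point and who have no event form the final risk set from which the CECC controls are drawn.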
In this simulation study, we consider three estimators:
NCC: estimators obtained from the classical NCC design samples;
CECC: estimators obtained from the CECC design samples with constant weights;
CECCW: estimators obtained from the CECC design samples with the weights estimated by equation (2).
The CECC estimator does not use the sampling-bias adjustment term, but it also does not require any censoring information for parameter estimation. Thus, we also discuss whether this naive CECC estimator without bias correction can be used in practice when the censoring time is unavailable or inaccurate. To fit the conditional logistic regression and to compute the KM estimates, we used the R functions clogit and ncc from the R packages ‘survival’ and ‘Epi.’
As the simulation outputs, we report the relative bias (R.Bias), standard error (SE), and mean squared error (MSE) of the estimators. Here, R.Bias is the bias of an estimator expressed as a percentage of the true parameter value.
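These summaries can be computed from a vector of Monte Carlo estimates as sketched below. The sketch assumes the usual definitions: relative bias as 100 times the difference between the Monte Carlo mean and the true value, divided by the true value; SE as the Monte Carlo standard deviation of the estimates; and MSE as the mean squared deviation from the true value. The function name `mc_summary` is hypothetical.

```python
import math


def mc_summary(estimates, true_value):
    """Relative bias (%), Monte Carlo standard error, and mean squared error."""
    k = len(estimates)
    mean = sum(estimates) / k
    r_bias = 100.0 * (mean - true_value) / true_value
    # Sample standard deviation of the estimates across replications.
    se = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (k - 1))
    mse = sum((e - true_value) ** 2 for e in estimates) / k
    return r_bias, se, mse
```

When the bias is negligible, the MSE is close to the square of the SE, which is the pattern seen in the reported tables.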
Table 1 presents the MC relative biases, standard errors, and mean squared errors of the NCC, CECC, and CECCW estimators for the scenario without matching. The “Controls” column gives the number of cases and the total number of controls selected in each design, followed by the number of controls matched to each case. For example, for the CECC, we first select 150, 250, or 400 controls for 250 cases and 300, 500, or 800 controls for 500 cases. The selected controls are then redistributed using the proposed CECC algorithm to construct the NCC samples. For simplicity, the number of matched controls for each case is set to 1 or 2 according to the number of initial controls.
Table 1.
Monte Carlo relative biases (R.Bias), standard errors (SE), and mean squared errors (MSE) of estimators from the NCC, CECC, and CECCW without matching.
| Methods | Controls | | R.Bias (%) | SE | MSE | R.Bias (%) | SE | MSE | R.Bias (%) | SE | MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NCC | 250:250 | 1:1 | −0.0453 | 0.1810 | 0.0328 | 1.2464 | 0.1867 | 0.0349 | 1.2540 | 0.1967 | 0.0388 |
| CECC | 250:150 | 1:1 | −0.1486 | 0.2172 | 0.0472 | 0.1626 | 0.2202 | 0.0485 | 1.4069 | 0.2239 | 0.0502 |
| | 250:250 | 1:1 | 1.7952 | 0.1818 | 0.0331 | 2.0897 | 0.1818 | 0.0331 | 2.3031 | 0.1909 | 0.0367 |
| | 250:400 | 1:2 | 2.9312 | 0.1649 | 0.0272 | 2.9360 | 0.1653 | 0.0275 | 2.2260 | 0.1707 | 0.0294 |
| CECCW | 250:150 | 1:1 | −1.4886 | 0.2141 | 0.0458 | −1.2853 | 0.2173 | 0.0473 | 0.1989 | 0.2206 | 0.0487 |
| | 250:250 | 1:1 | 0.3004 | 0.1791 | 0.0321 | 0.5998 | 0.1790 | 0.0321 | 0.8325 | 0.1878 | 0.0353 |
| | 250:400 | 1:2 | 1.4172 | 0.1624 | 0.0264 | 1.4408 | 0.1628 | 0.0265 | 0.7590 | 0.1680 | 0.0282 |
| NCC | 500:500 | 1:1 | 1.9980 | 0.1236 | 0.0153 | 0.7447 | 0.1287 | 0.0166 | 1.0367 | 0.1334 | 0.0179 |
| CECC | 500:300 | 1:1 | 4.3142 | 0.1504 | 0.0227 | 6.0110 | 0.1496 | 0.0230 | 4.9052 | 0.1601 | 0.0268 |
| | 500:500 | 1:1 | 5.0875 | 0.1239 | 0.0154 | 4.4117 | 0.1286 | 0.0169 | 3.8251 | 0.1353 | 0.0190 |
| | 500:800 | 1:2 | 3.4215 | 0.1163 | 0.0136 | 4.1937 | 0.1173 | 0.0141 | 3.9960 | 0.1202 | 0.0152 |
| CECCW | 500:300 | 1:1 | 0.5894 | 0.1443 | 0.0208 | 2.5592 | 0.1446 | 0.0210 | 1.2784 | 0.1517 | 0.0231 |
| | 500:500 | 1:1 | 1.3928 | 0.1194 | 0.0143 | 0.8520 | 0.1239 | 0.0154 | 0.4505 | 0.1304 | 0.0170 |
| | 500:800 | 1:2 | −0.2131 | 0.1119 | 0.0125 | 0.6494 | 0.1130 | 0.0128 | 0.6140 | 0.1158 | 0.0134 |
Abbreviation: NCC, nested case-control.
Table 1 shows that the CECCW estimators perform better than the CECC estimators, because the bias adjustment terms are included in the estimation. The CECC estimators produce a larger bias as the length bias grows when the number of cases increases from 250 to 500. However, the biases are small enough that the naive CECC estimators may be used in practice when the censoring time is unavailable or inaccurate. Among the three estimators, the CECCW estimators are the most efficient for the same number of controls. This is because the CECCW estimators incorporate the cohort information through the estimated survival function. We can also confirm that the proposed method performs well even when the number of sampled controls is smaller than the number of cases. Although some efficiency is lost because of the smaller number of controls, the biases are well controlled under this undersampling design. This is a desirable result for a cost-effective NCC design. The efficiency of the proposed estimators depends on the total number of controls selected for the CECC sample.
Table 2 presents the MC relative biases, standard errors, and mean squared errors of the NCC, CECC, and CECCW estimators for the matching scenario. The simulation results in Table 2 are similar to those for the no-matching case, which implies that the proposed method also works well when a matching variable is considered. In summary, all estimators produce nearly unbiased estimates in this simulation setup, but the CECCW is slightly more efficient than the other estimators.
Table 2.
Monte Carlo relative biases (R.Bias), standard errors (SE), and mean squared errors (MSE) of estimators from the NCC, CECC, and CECCW with matching.
| Methods | Controls | | R.Bias (%) | SE | MSE | R.Bias (%) | SE | MSE | R.Bias (%) | SE | MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NCC | 250:250 | 1:1 | 1.6775 | 0.1856 | 0.0345 | 1.0622 | 0.1898 | 0.0360 | 1.0010 | 0.2056 | 0.0423 |
| CECC | 250:150 | 1:1 | 2.8795 | 0.2167 | 0.0470 | 1.9887 | 0.2239 | 0.0502 | 1.4207 | 0.2340 | 0.0549 |
| | 250:250 | 1:1 | −1.4748 | 0.1881 | 0.0354 | 2.3676 | 0.1906 | 0.0364 | 2.8904 | 0.1999 | 0.0404 |
| | 250:400 | 1:2 | 1.6386 | 0.1796 | 0.0323 | 2.4780 | 0.1810 | 0.0328 | 2.4021 | 0.1937 | 0.0378 |
| CECCW | 250:150 | 1:1 | 1.4270 | 0.2135 | 0.0456 | 0.5853 | 0.2208 | 0.0488 | 0.0391 | 0.2306 | 0.0532 |
| | 250:250 | 1:1 | −2.8539 | 0.1855 | 0.0344 | 0.9647 | 0.1879 | 0.0353 | 1.5007 | 0.1971 | 0.0390 |
| | 250:400 | 1:2 | 0.2178 | 0.1770 | 0.0313 | 1.0722 | 0.1785 | 0.0319 | 1.0205 | 0.1911 | 0.0366 |
| NCC | 500:500 | 1:1 | 0.2045 | 0.1309 | 0.0171 | 1.3677 | 0.1331 | 0.0178 | 0.3275 | 0.1386 | 0.0192 |
| CECC | 500:300 | 1:1 | 5.6533 | 0.1546 | 0.0240 | 4.7117 | 0.1597 | 0.0259 | 4.0727 | 0.1616 | 0.0269 |
| | 500:500 | 1:1 | 4.2535 | 0.1293 | 0.0168 | 4.4172 | 0.1320 | 0.0177 | 3.9260 | 0.1386 | 0.0199 |
| | 500:800 | 1:2 | 5.0455 | 0.1263 | 0.0160 | 4.6792 | 0.1239 | 0.0157 | 4.1397 | 0.1317 | 0.0182 |
| CECCW | 500:300 | 1:1 | 2.3357 | 0.1498 | 0.0224 | 1.5596 | 0.1549 | 0.0240 | 1.0381 | 0.1567 | 0.0246 |
| | 500:500 | 1:1 | 1.0361 | 0.1253 | 0.0157 | 1.2788 | 0.1281 | 0.0164 | 0.9012 | 0.1344 | 0.0181 |
| | 500:800 | 1:2 | 1.8110 | 0.1225 | 0.0150 | 1.5437 | 0.1203 | 0.0145 | 1.1168 | 0.1278 | 0.0164 |
Abbreviation: NCC, nested case-control.
Real data example
Andersson et al11,12 studied the association between chronic α-particle irradiation from Thorotrast and liver cancer incidence. All of the study subjects underwent cerebral angiography, with Thorotrast from 1935 to 1947 or without Thorotrast from 1946 to 1963. The ‘thoro’ data are available from the R package ‘Epi.’ We use a modified version of the data for the NCC and the CECC designs, and consider both the CECC and the CECCW in this analysis. For simplicity, we consider a small number of variables and a simple model. The variables are as follows:
Sex: 0 for male and 1 for female.
Event: indicator of liver cancer diagnosis.
Exposure: injected volume of thorotrast in milliliter. Control patients have a 0 in this variable.
Censored: 0—not censored and 1—censored.
Incidence age: age of liver cancer diagnosis.
Exit age: age of exit year from the study including the incidence age.
Birth: birth cohort, coded 0 for a birth date earlier than 1920 and 1 for a birth date in 1920 or later.
Time = Exit age – Incidence age.
Tables 3 and 4 summarize the study population; the total number of observations is 2468. We used the final exit date, February 20, 1992, in this application. There are 130 liver cancer cases and 40 censored observations, and the numbers of males and females are 1291 and 1177, respectively. Birth dates range from January 7, 1868 to February 1, 1958. The number of subjects born before January 1, 1920 is 1816, and the remaining 652 were born later. There are 1479 non-exposed and 989 exposed subjects in this study. The median follow-up time is approximately 26 years and the median incidence age is 62.32 years. The median age at injection is 40.35 years.
Table 3.
Characteristics of study population for Thorotrast data 1.
| | | Number of obs. | % |
|---|---|---|---|
| Event | Yes | 130 | 5.3 |
| | No | 2338 | 94.73 |
| Censored | Yes | 40 | 1.62 |
| | No | 2428 | 98.38 |
| Sex | Female | 1177 | 47.69 |
| | Male | 1291 | 52.31 |
| Exposure | Yes | 989 | 40.07 |
| | No | 1479 | 59.93 |
| Birth | Before 1920 | 1816 | 73.58 |
| | 1920 or later | 652 | 26.42 |
Table 4.
Characteristics of study population for Thorotrast data 2.
| | Min | 1Q | Median | Mean | 3Q | Max |
|---|---|---|---|---|---|---|
| Age at injection (years) | 0.45 | 27.37 | 40.35 | 39.29 | 51.66 | 79.18 |
| Incidence age (years) | 5.36 | 55.51 | 62.32 | 61.81 | 68.63 | 88.16 |
| Follow-up time (years) | 0.0027 | 4.0171 | 22.0534 | 21.0373 | 35.4593 | 53.9877 |
| Exposure (mL) | 0.00 | 0.00 | 0.00 | 7.48 | 10.00 | 80.00 |
Table 5 provides the estimates of the exposure coefficient, their exponentials, and their standard errors for the cohort, one NCC sample, and three CECC and CECCW samples with case-to-control ratios of 130:130 (1:1), 130:150 (1:2), and 130:200 (1:2), with and without matching. Here, “Birth” is used as the matching variable. Because the sex variable is not significant in the cohort analysis, we omit it. For the cohort, “matching” means that the matching variable is included as a covariate in the cohort analysis. Table 5 shows that the CECCW estimates have smaller standard errors and less bias than the NCC and CECC estimates. We conjecture that this is because the follow-up time of this cohort is relatively long and the cohort size is small.
Table 5.
Comparison among the NCC data, and the CECC and the CECCW data from Thorotrast data.
| Methods | Controls | | Estimate (no matching) | exp(Estimate) (no matching) | SE (no matching) | Estimate (matching) | exp(Estimate) (matching) | SE (matching) |
|---|---|---|---|---|---|---|---|---|
| Cohort | | | 0.0758 | 1.0787 | 0.0046 | 0.0739 | 1.0767 | 0.0047 |
| NCC | 130:130 | 1:1 | 0.0918 | 1.0962 | 0.0172 | 0.0989 | 1.1039 | 0.0171 |
| CECC | 130:130 | 1:1 | 0.0958 | 1.1005 | 0.0167 | 0.1028 | 1.1083 | 0.0174 |
| | 130:150 | 1:2 | 0.1002 | 1.1054 | 0.0137 | 0.1017 | 1.1070 | 0.0135 |
| | 130:200 | 1:2 | 0.0973 | 1.1022 | 0.0131 | 0.0912 | 1.0955 | 0.0123 |
| CECCW | 130:130 | 1:1 | 0.0659 | 1.0681 | 0.0086 | 0.0706 | 1.0731 | 0.0079 |
| | 130:150 | 1:2 | 0.0732 | 1.0760 | 0.0067 | 0.0602 | 1.0620 | 0.0053 |
| | 130:200 | 1:2 | 0.0642 | 1.0663 | 0.0062 | 0.0631 | 1.0651 | 0.0058 |
Abbreviation: NCC, nested case-control.
Conclusions
In this article, we propose a new CECC design that uses a resampling idea to allow a flexible choice of the number of control samples. The new procedure provides two advantages over the classical NCC sampling design. It allows an undersampling design, in which the number of controls is smaller than the number of cases. It also does not need to specify all of the risk sets required for the classical NCC design. Given these advantages, the new algorithm can be applied in real data analyses where budget and time constraints limit the collection of new measurements. The CECCW estimators can be used when the censoring times are all available for the cohort, and the CECC estimators otherwise.
This study is not free from limitations. When the final risk set is very small relative to the cohort, that is, when most units in the cohort have an event or are censored during the observation time, the proposed algorithm may introduce unintended biases in the estimation. This phenomenon resembles a coverage problem, in the sense that a small final risk set can easily fail to cover the whole range of the risk sets from which candidate controls would be drawn. Thus, the proposed sampling technique works efficiently when events are rare and the observation time is relatively short, so that not too many units are censored.
Acknowledgments
The authors would like to thank the editor and reviewers for their careful readings and thoughtful comments.
Footnotes
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the National Research Foundation (NRF) of Korea, NRF-2016R1D1A1B03932212 for the author and NRF-2018R1D1A1B07045220 for the second author, respectively.
Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions: All authors contributed equally to this work.
ORCID iD: Young Min Kim
https://orcid.org/0000-0001-6295-1758
References
- 1. Grant EJ, Cologne J, Sharp GB, et al. Bioavailable serum estradiol may alter radiation risk of postmenopausal breast cancer: a nested case-control study. Int J Radiat Biol. 2018;94:97–105.
- 2. Thomas DC. Addendum to: “methods of cohort analysis: appraisal by application to asbestos mining” by Liddell FDK, McDonald JC, Thomas DC. J Roy Stat Soc Ser A. 1977;140:469–491.
- 3. Oakes D. Survival times: aspects of partial likelihood (with discussion). Int Stat Rev. 1981;49:235–252.
- 4. Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in Cox regression models. Ann Stat. 1992;20:1903–1928.
- 5. Borgan O, Goldstein L, Langholz B. Methods for the analysis of sampled cohort data in the Cox proportional hazards model. Ann Stat. 1995;23:1749–1778.
- 6. Langholz B, Goldstein L. Risk set sampling in epidemiologic cohort studies. Stat Sci. 1996;11:35–53.
- 7. Langholz B, Goldstein L. Conditional logistic analysis of case-control studies with complex sampling. Biostatistics. 2001;2:63–84.
- 8. Yao Y, Yu W, Chen K. End-point sampling. Stat Sin. 2017;27:415–435.
- 9. Sboner A, Demichelis F, Calza S, et al. Molecular sampling of prostate cancer: a dilemma for predicting disease progression. BMC Med Genom. 2010;3:8.
- 10. Salim A, Ma X, Fall K, Andrén O, Reilly M. Analysis of incidence and prognosis from “extreme” case-control designs. Stat Med. 2014;33:5388–5398.
- 11. Andersson M, Carstensen B, Storm HH. Mortality and cancer incidence after cerebral angiography. Radiat Res. 1995;142:305–320.
- 12. Andersson M, Vyberg M, Visfeldt J, Carstensen B, Storm HH. Primary liver tumours among Danish patients exposed to Thorotrast. Radiat Res. 1994;137:262–273.
- 13. Cox DR. Regression models and life tables. J Roy Stat Soc Ser B. 1972;34:187–220.
- 14. Cochran WG, Rubin DB. Controlling bias in observational studies: a review. Sankhyā. 1973;35:417–446.
- 15. Cologne J, Langholz B. Selecting controls for assessing interaction in nested case-control studies. J Epidemiol. 2003;13:193–202.
- 16. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457–481.


