Abstract
The nested case–control (NCC) design is a cost-effective sampling method to study the relationship between a disease and its risk factors in epidemiologic studies. NCC data are commonly analyzed using Thomas' partial likelihood approach under Cox's proportional hazards model with constant covariate effects. Here, we are interested in studying the potential time-varying effects of covariates in NCC studies and propose an estimation approach based on a kernel-weighted Thomas' partial likelihood. We establish asymptotic properties of the proposed estimator, propose a numerical approach to construct simultaneous confidence bands for time-varying coefficients, and develop a hypothesis testing procedure to detect time-varying coefficients. The proposed inference procedure is evaluated in simulations and applied to an NCC study of breast cancer in the New York University Women's Health Study.
Keywords: Kernel estimation, Martingale, Nested case–control study, Proportional hazards model, Risk-set sampling, Time-varying coefficient
1. INTRODUCTION
Epidemiologic cohort studies of rare diseases are usually expensive to conduct because a large number of individuals need to be followed up for a long time in order to obtain an adequate number of cases. Moreover, the cost of assembling exposure variables of interest and confounders for the entire cohort can be financially prohibitive. Therefore, the nested case–control (NCC) design (Thomas, 1977) has been widely used as a cost-effective alternative to the full-cohort design. For example, an NCC study was conducted in the New York University Women's Health Study (NYUWHS) to investigate the association between breast cancer risk and genetic variations in the nucleotide excision repair (NER) pathway (Shore and others, 2008). The NYUWHS is a prospective cohort study that enrolled 14 274 healthy women aged 35–65 between 1985 and 1991 at a breast cancer screening center and has followed these women since the enrollment for cancers and other health outcomes. The NER mechanism is important for cells to prevent unwanted mutations by removing DNA damage, and thus genes in the NER pathway are hypothesized to play a role in the development of cancers and other diseases involving DNA damage and genetic mutations. It would have been very costly to ascertain genetic information for the entire cohort. Thus, the NCC design was implemented, and for each of the 612 identified invasive breast cancer cases, one control was selected from the case's risk set. Genotypic information on 2 genes, XPC and ERCC2 in the NER pathway, was obtained for the cases and their matched controls. Covariate information on demographics, smoking status, and pregnancy history was collected from baseline and follow-up questionnaires. Effects of NER genes, environmental exposure of smoking that causes DNA damage, and their interactions on breast cancer risk were of primary interest.
Cox's proportional hazards model (Cox, 1972) has been commonly used to analyze NCC data with the successful implementation of the partial likelihood technique (Thomas, 1977; Oakes, 1981). Under the assumption of proportional hazards, that is, that the ratio of hazard functions at different covariate values remains constant over time, Thomas' partial likelihood function is equivalent to the conditional logistic likelihood for matched case–control studies. Theoretical properties of Thomas' maximum partial likelihood estimator have been formally established using counting process and martingale theory (Goldstein and Langholz, 1992; Borgan and others, 1995).
Due to the nature of long-term observation and complex relationships to be explored in large epidemiologic studies, the proportional hazards assumption may be violated. In fact, researchers have extended Cox's model to improve the modeling flexibility and one popular extension is to allow the coefficients to vary with time. Specifically, the Cox model with time-varying coefficients (Zucker and Karr, 1990; Hastie and Tibshirani, 1993) assumes that the individual hazard function has a multiplicative form
$$\lambda(t \mid Z) = \lambda_0(t)\exp\{\beta(t)^{\top} Z(t)\}, \qquad (1.1)$$
where $\lambda_0(t)$ denotes an unspecified baseline hazard function, $Z(t) = (Z_1(t), \ldots, Z_p(t))^{\top}$ denotes $p$ covariate processes, and $\beta(t) = (\beta_1(t), \ldots, \beta_p(t))^{\top}$ are $p$ processes characterizing the covariates' temporal effects. The estimation and inference for model (1.1) in cohort studies have been studied by many researchers using various techniques: for example, the penalized partial likelihood approach with smoothing splines (Tibshirani and Hastie, 1987; Zucker and Karr, 1990); the sieve maximum partial likelihood approach with histogram sieves (Murphy and Sen, 1991); the integrated Newton–Raphson equation for the cumulative coefficient functions (Martinussen and others, 2002); and the kernel-weighted partial likelihood approach (Cai and Sun, 2003; Tian and others, 2005). However, the use of model (1.1) in NCC studies remains limited.
In this article, we are interested in studying model (1.1) with NCC data. We show that the kernel-weighted local polynomial fitting technique (Fan and Gijbels, 1996) can be well coupled with Thomas' partial likelihood to study the time-varying coefficients in NCC studies. The rest of the article is organized as follows. In Section 2, we propose a kernel-weighted partial likelihood estimation approach and establish the asymptotic properties for the proposed estimator. Pertinent to making inference for time-varying coefficients, we develop numerical approaches to constructing simultaneous confidence bands and to testing hypotheses of existing time-invariant coefficients. Furthermore, we consider inference for an extension to incorporate both time-varying and time-invariant coefficients. In Section 3, we present numerical studies including simulations to evaluate the finite-sample performance of our proposed approaches and the analysis of the NYUWHS breast cancer data. We conclude with some remarks in Section 4. All proofs are relegated to the supplementary material available at Biostatistics online.
2. METHODS
Throughout the paper, let $(T, C, Z(\cdot))$ denote a random triplet of failure time, right-censoring time, and $p$-dimensional covariate processes. Consider a cohort of size $n$ and let the full-cohort data be $n$ independent realizations $\{X_i, \delta_i, Z_i(\cdot)\}$, $i = 1, \ldots, n$, where $X_i = \min(T_i, C_i)$ is the observed failure time and $\delta_i = I(T_i \le C_i)$ indicates the status of the observed event, taking the value 1 for an observed failure and 0 otherwise. At a specific time $t$, let $R(t) = \{i : X_i \ge t\}$ denote the risk set. When the NCC design is used to sample from the cohort, one identifies cases as subjects with $\delta_i = 1$ and, for each case, randomly samples $m - 1$ controls without replacement from the risk set at the case's failure time excluding the case itself. For a given case $i$, let $\tilde{R}_i^{c}$ denote the indices of the selected controls and define the sampled risk set $\tilde{R}_i = \{i\} \cup \tilde{R}_i^{c}$. The complete covariate information is ascertained for all the cases and selected controls.
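The risk-set sampling step described above can be sketched in a few lines. This is only an illustrative sketch, not the study's actual sampling code: the function name `sample_ncc`, the toy data, and the choice to return fewer controls when the risk set is too small are all assumptions made for the example.

```python
import random

def sample_ncc(times, events, n_controls, rng=None):
    """Risk-set sampling: for each case, draw controls without replacement
    from the subjects still at risk at the case's failure time."""
    rng = rng or random.Random(0)
    sampled = {}
    for i, (x_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # only cases get a sampled risk set
        # Risk set at the case's failure time, excluding the case itself.
        at_risk = [j for j, x_j in enumerate(times) if x_j >= x_i and j != i]
        # Assumption: if fewer subjects are at risk than requested, take them all.
        k = min(n_controls, len(at_risk))
        controls = rng.sample(at_risk, k)
        sampled[i] = [i] + controls  # case first, then its controls
    return sampled

# Toy cohort (hypothetical values): observed times and event indicators.
times = [2.0, 5.0, 3.5, 7.0, 1.0, 6.0]
events = [1, 0, 1, 0, 0, 1]
sets = sample_ncc(times, events, n_controls=2)
```

Note that a subject can serve as a control for an earlier case and later become a case itself, which is a defining feature of the NCC design.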
2.1. Kernel-weighted partial likelihood estimation approach
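Consider model (1.1) and assume the time-varying coefficients to be smooth functions with continuous first and second derivatives. Locally around a time point $t$, we approximate $\beta(u)$ by a linearization using the first-order Taylor expansion,
$$\beta(u) \approx \beta(t) + \beta'(t)(u - t).$$
Let $\boldsymbol{\beta} = (\beta(t)^{\top}, \beta'(t)^{\top})^{\top}$ and $\tilde{Z}_i(u, u - t) = (1, u - t)^{\top} \otimes Z_i(u)$, with $\otimes$ denoting the Kronecker product. Moreover, define a counting process representation for the observed failure event as $N_i(t) = I(X_i \le t, \delta_i = 1)$. Note that, for each case in the NCC data, $N_i(t)$ jumps from 0 to 1 at the case's failure time; for others, $N_i(t)$ remains at 0 for the entire follow-up period. To estimate $\beta(\cdot)$ locally around $t$, we consider the following kernel-weighted partial likelihood function of the $2p$-dimensional parameter $\boldsymbol{\beta}$, with $\tilde{R}_i$ the sampled risk set of case $i$ (the case together with its selected controls),
$$\ell(\boldsymbol{\beta}, t) = \sum_{i=1}^{n} \int_0^{\tau} K_h(u - t)\Big[\boldsymbol{\beta}^{\top}\tilde{Z}_i(u, u - t) - \log\Big\{\sum_{j \in \tilde{R}_i} \exp\big(\boldsymbol{\beta}^{\top}\tilde{Z}_j(u, u - t)\big)\Big\}\Big]\,dN_i(u), \qquad (2.1)$$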
where $K(\cdot)$ is a kernel function, $h$ denotes a bandwidth parameter, $K_h(\cdot) = K(\cdot/h)/h$ is the scaled kernel function, and $\tau$ is the upper bound of the follow-up period. We assume that $K(\cdot)$ is a symmetric density function with bounded support. This kernel function down-weights the contributions from subjects whose event times are far from $t$, and the bandwidth parameter controls the size of the local neighborhood. Thus, the local partial likelihood function (2.1) depends only on the case–control sets with case event times in the close vicinity of $t$.
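As a small illustration of the down-weighting, the sketch below evaluates a scaled kernel at a few event times. It assumes the Epanechnikov kernel (the one used later in the numerical studies) with support on $[-1, 1]$; the function names and the event times are hypothetical.

```python
def epanechnikov(x):
    """Epanechnikov kernel: a symmetric density supported on [-1, 1]."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0

def scaled_kernel(u, t, h):
    """K_h(u - t) = K((u - t) / h) / h."""
    return epanechnikov((u - t) / h) / h

# Hypothetical case event times; only those within h of t get positive weight.
event_times = [1.4, 1.8, 2.0, 2.3, 2.6]
weights = [scaled_kernel(u, t=2.0, h=0.5) for u in event_times]
```

With `t = 2.0` and `h = 0.5`, the case–control sets at times 1.4 and 2.6 fall outside the window and contribute nothing to the local likelihood.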
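The score function of (2.1) can be easily derived:
$$U(\boldsymbol{\beta}, t) = \sum_{i=1}^{n} \int_0^{\tau} K_h(u - t)\Big\{\tilde{Z}_i(u, u - t) - \frac{S^{(1)}_{\tilde{R}_i}(\boldsymbol{\beta}, u, u - t)}{S^{(0)}_{\tilde{R}_i}(\boldsymbol{\beta}, u, u - t)}\Big\}\,dN_i(u), \qquad (2.2)$$
where, for a set $R$ and the local covariate vector $\tilde{Z}_j(u, u - t)$ defined above,
$$S^{(k)}_{R}(\boldsymbol{\beta}, u, u - t) = \sum_{j \in R} \tilde{Z}_j(u, u - t)^{\otimes k}\exp\{\boldsymbol{\beta}^{\top}\tilde{Z}_j(u, u - t)\}.$$
For a vector $a$, let $a^{\otimes k} = 1$, $a$, and $aa^{\top}$ for $k = 0$, 1, and 2, respectively. Furthermore, the Hessian matrix of (2.1) equals
$$H(\boldsymbol{\beta}, t) = -\sum_{i=1}^{n} \int_0^{\tau} K_h(u - t)\Big[\frac{S^{(2)}_{\tilde{R}_i}(\boldsymbol{\beta}, u, u - t)}{S^{(0)}_{\tilde{R}_i}(\boldsymbol{\beta}, u, u - t)} - \Big\{\frac{S^{(1)}_{\tilde{R}_i}(\boldsymbol{\beta}, u, u - t)}{S^{(0)}_{\tilde{R}_i}(\boldsymbol{\beta}, u, u - t)}\Big\}^{\otimes 2}\Big]\,dN_i(u). \qquad (2.3)$$
It is evident that $H(\boldsymbol{\beta}, t)$ is negative semidefinite, so (2.1) is concave, which assures a unique maximum whenever $H(\boldsymbol{\beta}, t)$ is negative definite. The Newton–Raphson method or other gradient-based search algorithms can be used to find $\hat{\boldsymbol{\beta}}$ that maximizes (2.1). We denote the kernel-weighted maximum partial likelihood estimate of $\beta(t)$ by $\hat{\beta}(t)$, which consists of the first $p$ components of $\hat{\boldsymbol{\beta}}$.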
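To make the Newton–Raphson step concrete, here is a minimal sketch for the special case of a single time-fixed covariate ($p = 1$), so that the local parameter is the pair (level, slope). It is not the authors' implementation: the data layout `(event_time, z_case, z_controls)`, the function names, the Epanechnikov kernel, and the simulated demo with a constant true coefficient of 1 are all assumptions chosen for illustration.

```python
import math
import random

def epanechnikov(x):
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0

def fit_local_linear(sets, t, h, max_iter=50, tol=1e-8):
    """sets: list of (u, z_case, z_controls) with a scalar, time-fixed covariate.
    Returns (b0, b1): local estimates of beta(t) and of its slope beta'(t)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        s0 = s1 = 0.0          # components of the kernel-weighted score
        a00 = a01 = a11 = 0.0  # entries of minus the Hessian
        for u, z_case, z_controls in sets:
            d = u - t
            k = epanechnikov(d / h) / h
            if k == 0.0:
                continue  # this set lies outside the local window
            zs = [z_case] + list(z_controls)
            w = [math.exp(z * (b0 + b1 * d)) for z in zs]
            tot = sum(w)
            mean = sum(wi * z for wi, z in zip(w, zs)) / tot
            var = sum(wi * z * z for wi, z in zip(w, zs)) / tot - mean * mean
            resid = k * (z_case - mean)
            s0 += resid
            s1 += resid * d
            a00 += k * var
            a01 += k * var * d
            a11 += k * var * d * d
        det = a00 * a11 - a01 * a01
        if det <= 1e-12:
            raise ValueError("too little local information at this t and h")
        # Newton step: solve the 2x2 system (minus Hessian) * step = score.
        step0 = (a11 * s0 - a01 * s1) / det
        step1 = (a00 * s1 - a01 * s0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Demo on simulated matched sets (one case, two controls each); within a set,
# the case is chosen with probability proportional to exp(beta * z), beta = 1.
rng = random.Random(1)
demo_sets = []
for _ in range(2000):
    u = rng.uniform(0.0, 4.0)
    zs = [rng.randint(0, 1) for _ in range(3)]
    case = rng.choices(range(3), weights=[math.exp(1.0 * z) for z in zs])[0]
    controls = [z for j, z in enumerate(zs) if j != case]
    demo_sets.append((u, zs[case], controls))

beta_hat, slope_hat = fit_local_linear(demo_sets, t=2.0, h=1.0)
```

Because the covariate is scalar, the local covariate vector is $z \cdot (1, u - t)^{\top}$ and the 2×2 linear system can be solved in closed form; for general $p$ one would solve a $2p \times 2p$ system with a linear algebra library instead.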
2.2. Asymptotic properties of $\hat{\beta}(t)$
Because of the risk-set sampling mechanism of NCC design, even when the sample size increases, the size of each sampled risk set for case i is always m, rather than increasing to infinity as the size of risk sets in Cox's partial likelihood function for the full-cohort data. Thus, we adopt similar theoretical arguments for Thomas' maximum partial likelihood estimator used in Goldstein and Langholz (1992) and consider its kernel-weighted local version by using the kernel polynomial fitting technique (Fan and Gijbels, 1996). Let , , where and . Denote
![]() |
Indeed, is the local contribution to the asymptotic information matrix of Thomas' partial likelihood function (Goldstein and Langholz, 1992). Here, we state the main asymptotic results. Regularity conditions A.1–A.5, remarks on the technical device used in Goldstein and Langholz (1992) to make the NCC sampling process predictable, and proofs are given in the supplementary material available at Biostatistics online. Let and for .
PROPOSITION 2.1
As (i) under Conditions A.1–A.4,
(ii) under Conditions A.1–A.5, for ,
2.3. Point-wise confidence interval and simultaneous confidence band
From Proposition 2.1, it is evident that the optimal bandwidth that minimizes the mean squared error or mean integrated squared error is of order $n^{-1/5}$. This theoretically optimal bandwidth, however, leads to an asymptotically biased estimator, because the bias term does not vanish in the limiting distribution. In this paper, we prefer a bandwidth that shrinks at a slightly faster rate, so that the estimator is asymptotically unbiased. The main reason is to avoid complications due to the estimation bias in constructing point-wise confidence intervals and simultaneous confidence bands. We refer the reader to Härdle and Marron (1991) for more discussion of handling bias in constructing simultaneous confidence bands for general nonparametric curve estimation.
The variance of $\hat{\beta}(t)$ can be consistently estimated by the upper left $p \times p$ submatrix of
| (2.4) |
where the Hessian matrix is as specified in (2.3). Thus, the $100(1 - \alpha)\%$ point-wise confidence interval for the $j$th element of $\beta(t)$ at time $t$ can be constructed as $\hat{\beta}_j(t) \pm z_{1 - \alpha/2}\,\hat{\sigma}_{jj}(t)^{1/2}$, where $\hat{\sigma}_{jj}(t)$ denotes the $j$th diagonal element of the variance estimate in (2.4) and $z_{1 - \alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the standard normal distribution. However, to make inference for the underlying coefficient function over a specific time interval, it is desirable and more informative to consider the simultaneous confidence band than the point-wise confidence interval. In general, it is difficult to derive an analytic form for simultaneous confidence bands and they are usually estimated using numerical approaches that mimic the original data structure while assessing variability (Härdle and Marron, 1991; Tian and others, 2005). Here, we construct simultaneous confidence bands by approximating the distribution of
| (2.5) |
with the resampling method of Lin and others (1994), where the
weight function can be a data-related positive function that uniformly
converges to a deterministic function. In our numerical studies, we choose
to take the variability into
account. In the proof of Proposition 1, we show that, if and , the kernel-weighted score function (2.2) at the true
coefficient values is asymptotically equivalent to
where and consequently, under standard measurability assumptions
is a local martingale. Substituting
by
and by , where are independent standard normal random variables, we obtain
a randomly perturbed version of , denoted by
. At each specific t, the conditional limiting
distribution of
given the observed data is the
same as the unconditional limiting distribution of (Lin and others, 1994). Therefore, the
distribution of (2.5) can be numerically approximated by its randomly
perturbed counterpart
. Let denote the sample th percentile of a large number of realizations of
j's, and thus the simultaneous
confidence band for over can be constructed as
.
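The resampling idea can be sketched generically as follows. This is an illustrative sketch under stated assumptions rather than the procedure exactly as specified above: it assumes that per-set contributions to the estimation error on a time grid (the hypothetical `influence` array) and point-wise standard errors are already available, and it standardizes by the standard error as the weight function.

```python
import random

def simultaneous_band(est, se, influence, level=0.95, n_resample=2000, seed=0):
    """est, se: lists over a time grid. influence[i][g]: contribution of
    case-control set i to (estimate - truth) at grid point g (assumed input).
    Returns (critical value, lower band, upper band)."""
    rng = random.Random(seed)
    n_grid = len(est)
    sups = []
    for _ in range(n_resample):
        # Independent standard normal multipliers, one per case-control set.
        mult = [rng.gauss(0.0, 1.0) for _ in influence]
        sup = 0.0
        for g in range(n_grid):
            pert = sum(m * row[g] for m, row in zip(mult, influence))
            sup = max(sup, abs(pert) / se[g])  # standardized sup statistic
        sups.append(sup)
    sups.sort()
    crit = sups[min(len(sups) - 1, int(level * len(sups)))]
    lower = [e - crit * s for e, s in zip(est, se)]
    upper = [e + crit * s for e, s in zip(est, se)]
    return crit, lower, upper
```

The critical value returned here replaces the normal percentile used for point-wise intervals, so the resulting band is wider than the point-wise intervals at every grid point.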
2.4. Inference for mixed Cox model with time-varying and time-invariant coefficients
Model (1.1) is flexible in allowing all coefficients to be time-varying. But when there are indications of possible time-invariant coefficients for certain covariates, one may want to consider a mixed model with time-varying and time-invariant coefficients. Without loss of generality, we consider a mixed model with the first $q$ components of $\beta(t)$ being constants, that is, $\beta_{[q]}(t) = \gamma$, where $a_{[q]}$ denotes the first $q$ elements of a vector $a$ and $a_{[-q]}$ denotes the remaining elements. Based on the proposed kernel-weighted partial likelihood estimates $\hat{\beta}(t)$, we estimate the constant coefficients $\gamma$ by
| (2.6) |
where is a weight function converging to a deterministic function
and also satisfies as an identity matrix. As suggested in Tian and
others (2005), a natural choice for this weight function is the standardized
inverse covariance matrix of
[q](t), that is, , where is the inverse of the upper left submatrix of
(t) defined in (2.4). To make inference for
, by arguments similar to Tian and others
(2005), we can show that
converges weakly to a mean-zero normal distribution for
with , and the limiting covariance matrix can be consistently
estimated by
.
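For a single constant coefficient ($q = 1$), the weighted integration can be sketched numerically as below. The sketch assumes local estimates and variance estimates on a time grid, uses inverse-variance weights normalized to integrate to one, and approximates the integrals with the trapezoid rule; the function names are hypothetical.

```python
def trapezoid(y, x):
    """Trapezoid-rule approximation of the integral of y over the grid x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
               for i in range(len(x) - 1))

def integrated_estimate(grid, beta_hat, var_hat):
    """Weighted average of local estimates with standardized
    inverse-variance weights (scalar coefficient)."""
    inv_var = [1.0 / v for v in var_hat]
    norm = trapezoid(inv_var, grid)
    weights = [iv / norm for iv in inv_var]  # integrates to one over the grid
    gamma = trapezoid([w * b for w, b in zip(weights, beta_hat)], grid)
    return gamma, weights
```

Time points where the local estimate is noisy (large variance) receive little weight, which is the motivation for the inverse-variance choice.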
In practice, after we obtain the estimates $\hat{\beta}(t)$ and the simultaneous confidence bands, we can examine each plot for whether the confidence band encloses a horizontal line to check whether a constant coefficient is plausible. The drawbacks of this procedure include the slow convergence rate of the local estimates and its low sensitivity; it also does not take into account the uncertainty of the constant coefficient estimation. As shown in the paragraph above, we can consider the cumulative function of the time-varying coefficient estimates to achieve a better convergence rate (Martinussen and others, 2002). To check whether the
jth component of is independent of time, we consider the process
du, where
(t) and
are the local and integrated
estimates, respectively. Based on a similar resampling method described in Section 2.3, we
can obtain randomly perturbed realizations of
to evaluate this hypothesis graphically and numerically.
More specifically, define
as
(t) with the jth and the
th elements replaced by (
j, 0). When the hypothesis
is true, converges weakly to a mean-zero Gaussian process (Tian
and others, 2005). Its asymptotic distribution can be approximated by
the conditional distribution of the randomly perturbed
process
![]() |
A large number of resampled realizations of
can be plotted with to provide a visual examination for the constant
hypothesis. Furthermore, a statistical test can be constructed based on
, where the critical value can be obtained using the sample
quantile of the resampled counterparts
.
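A resampling-based test of this kind can be sketched as follows. The sketch assumes a supremum-type statistic built from the cumulative difference between the local and integrated estimates, and it takes the resampled process paths as given (hypothetical input `resampled`); the exact statistic and normalization used in the procedure above may differ.

```python
def cumulative(grid, values):
    """Running trapezoid-rule integral of values over the grid."""
    out = [0.0]
    for i in range(len(grid) - 1):
        out.append(out[-1] + 0.5 * (values[i] + values[i + 1])
                   * (grid[i + 1] - grid[i]))
    return out

def constancy_p_value(grid, beta_hat, gamma_hat, resampled):
    """Compare the observed sup statistic with its resampled counterparts."""
    observed = cumulative(grid, [b - gamma_hat for b in beta_hat])
    stat = max(abs(v) for v in observed)
    sups = [max(abs(v) for v in path) for path in resampled]
    p_value = sum(s >= stat for s in sups) / len(sups)
    return stat, p_value
```

Plotting the observed cumulative process against a handful of resampled paths gives the graphical check, while the proportion of resampled suprema exceeding the observed one gives the numerical p-value.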
2.5. Bandwidth selection
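In practice, we suggest using the cross-validation method (Hastie and others, 2008) to select the bandwidth parameter $h$. Specifically, we first randomly split the case–control sets into $K$ subsets. Following Tian and others (2005), we may use minus the logarithm of the partial likelihood function as a measure of the prediction error. For each $h$ and the $k$th data part $D_k$,
$$\mathrm{PE}_k(h) = -\sum_{i \in D_k}\Big[\hat{\beta}_{(k)}(X_i)^{\top} Z_i(X_i) - \log\Big\{\sum_{j \in \tilde{R}_i}\exp\big(\hat{\beta}_{(k)}(X_i)^{\top} Z_j(X_i)\big)\Big\}\Big],$$
where the sum is over the cases of the case–control sets in $D_k$, $\tilde{R}_i$ is the sampled risk set of case $i$, and $\hat{\beta}_{(k)}(t)$ is estimated using the data excluding the $k$th part with bandwidth $h$. Then the total prediction error with $h$ is $\mathrm{PE}(h) = \sum_{k=1}^{K}\mathrm{PE}_k(h)$, and we can choose the bandwidth parameter $h$ that minimizes $\mathrm{PE}(h)$.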
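The cross-validation loop can be sketched as follows for a scalar, time-fixed covariate. The estimator itself is passed in as a callable: `fit(train_sets, h)` is a hypothetical interface assumed to return a function that maps a time point to the estimated coefficient. The data layout `(event_time, z_case, z_controls)` is likewise an assumption for illustration.

```python
import math
import random

def neg_log_partial_lik(sets, beta_fn):
    """Minus the log of Thomas' partial likelihood on held-out sets,
    with the coefficient evaluated at each case's event time."""
    total = 0.0
    for u, z_case, z_controls in sets:
        b = beta_fn(u)
        denom = math.exp(b * z_case) + sum(math.exp(b * z) for z in z_controls)
        total -= b * z_case - math.log(denom)
    return total

def select_bandwidth(sets, bandwidths, fit, n_folds=5, seed=0):
    """K-fold cross-validation over candidate bandwidths."""
    idx = list(range(len(sets)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    errors = {}
    for h in bandwidths:
        pe = 0.0
        for k in range(n_folds):
            held = set(folds[k])
            train = [sets[i] for i in idx if i not in held]
            test = [sets[i] for i in folds[k]]
            pe += neg_log_partial_lik(test, fit(train, h))
        errors[h] = pe  # total prediction error for this bandwidth
    best = min(errors, key=errors.get)
    return best, errors
```

Splitting whole case–control sets, rather than individual subjects, keeps each case together with its matched controls, which the partial likelihood requires.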
3. NUMERICAL STUDIES
3.1. Simulations
Simulation studies were carried out to evaluate the performance of the proposed estimation and inference procedures under finite-sample sizes. The failure times were generated from a model with the hazard function of , where the baseline hazard function with to be specified later and covariates and were generated from the binomial distribution with success probability of 0.5. We considered 2 types of time-varying function for : polynomial, where and ; log-sinusoid, where and . We assumed to be a constant coefficient of . In addition, the censoring time was generated as , where was from a uniform distribution on . As NCC studies are commonly implemented when the disease incidence rate is low, our simulated incidence rates were all about 10–. We simulated the NCC data from full cohorts sized 1000 and 2000 and selected 2 controls for each case. We used the Epanechnikov kernel and ran 1000 simulations for each setting. The proposed simultaneous confidence band and testing procedure were carried out with 5000 resampling runs.
Table 1 presents the results regarding the
simultaneous confidence band for the time-varying coefficient . The bandwidth parameter h was set to
change from 0.6 to 1.4 by an increment of 0.2 and the simultaneous band was constructed
over . We found that the performance of simultaneous confidence
band depended on the bandwidth parameter h. Small h led
to small biases but large variance and thus higher coverage probabilities; while large
h led to reverse results. When the bias–variance balance was
reached, for example, when for and for , the simultaneous confidence band yielded satisfactory
coverage probability. The overall performance improved with increased sample size. For
example, when and , the resampling-based simultaneous coverage probabilities
matched the nominal level reasonably well and the empirical quantiles of observed
based on 1000 simulated data sets
were very close to the resampling-based threshold. Furthermore, when
was the quadratic polynomial function, the curvature of
this polynomial function, that is, for all t; while being the log-sinusoid function, its curvature ranged from
to 2.8. Therefore, by comparing the results between these 2
different types of time-varying coefficients, we found that the performance of
simultaneous confidence band for the polynomial coefficient was less sensitive to the
selection of h than with the log-sinusoid coefficient function. This
observation is not surprising because it is usually more difficult to characterize a more
variable function and the bias of local linear fitting depends on the magnitude of
.
Table 1.
Simulation results: simultaneous confidence band for the time-varying coefficient over [1, 4]
| a1(t) | h | N = 1000 | | | | N = 2000 | | | |
| | | S.CP† | E.S‡ | R.S§ | SD(R.S)¶ | S.CP† | E.S‡ | R.S§ | SD(R.S)¶ |
| Polynomial | 0.6 | 98.4 | 2.524 | 2.826 | 0.146 | 96.4 | 2.689 | 2.810 | 0.089 |
| | 0.8 | 97.9 | 2.526 | 2.755 | 0.126 | 95.5 | 2.712 | 2.740 | 0.082 |
| | 1.0 | 97.2 | 2.511 | 2.698 | 0.116 | 94.8 | 2.709 | 2.686 | 0.078 |
| | 1.2 | 96.0 | 2.529 | 2.661 | 0.113 | 93.8 | 2.724 | 2.653 | 0.078 |
| | 1.4 | 95.3 | 2.609 | 2.640 | 0.112 | 92.6 | 2.806 | 2.635 | 0.078 |
| Sinusoid | 0.6 | 97.8 | 2.494 | 3.078 | 9.185 | 96.0 | 2.728 | 2.803 | 0.127 |
| | 0.8 | 97.1 | 2.514 | 2.733 | 0.166 | 95.2 | 2.695 | 2.732 | 0.107 |
| | 1.0 | 96.5 | 2.568 | 2.682 | 0.150 | 94.5 | 2.689 | 2.673 | 0.099 |
| | 1.2 | 95.2 | 2.643 | 2.645 | 0.140 | 92.3 | 2.866 | 2.632 | 0.093 |
| | 1.4 | 92.7 | 2.739 | 2.622 | 0.132 | 87.3 | 3.083 | 2.607 | 0.089 |
†S.CP: simultaneous 95% coverage probability (%).
‡E.S: empirical 95% quantile of the statistic (2.5) in 1000 simulations.
§R.S: average of resampling-based thresholds.
¶SD(R.S): standard deviation of resampling-based thresholds.
Figures 1 and 2 show the estimated time-varying coefficient curves with sample size of 2000, the average 95% point-wise confidence intervals over 1000 simulations, and 95% confidence envelope constructed using the point-wise 2.5% and 97.5% quantiles of the estimated curves over 1000 simulations. We found that when h was small, the estimated curves were very close to the true curves but the point-wise confidence intervals were wide. As h increased, the estimated curves showed biases at “the valley” but the point-wise confidence intervals were narrower.
Fig. 1.
Estimated coefficient curves with different bandwidth parameters for the polynomial curve. 1: the true underlying coefficient; 2: the average of 1000 estimated curves; 3: the median of 1000 estimated curves; 4: the average of 95% point-wise confidence intervals; 5: confidence envelope of 2.5% and 97.5% quantiles of 1000 estimated curves.
Fig. 2.
Estimated coefficient curves with different bandwidth parameters for the log-sinusoid curve. 1: the true underlying coefficient; 2: the average of 1000 estimated curves; 3: the median of 1000 estimated curves; 4: the average of 95% point-wise confidence intervals; 5: confidence envelope of 2.5% and 97.5% quantiles of 1000 estimated curves.
We summarize the estimation results for the constant coefficient $\gamma$ in Table 2. We report the bias, the sample standard deviation (SD) of the estimates over 1000 simulations, the average standard error (SE) using the asymptotic approximation, and the 95% coverage probability (CP) of Wald-type confidence intervals. The biases were all small and the SDs and SEs decreased as the sample size increased. Overall, the SEs and SDs matched well and the 95% CPs were close to the nominal level. The performance of the integrated estimator $\hat{\gamma}$ for the constant coefficient was stable with respect to $h$. Such an observation confirms the theoretical result that the integrated estimator converges at the $n^{1/2}$ rate, which does not depend on $h$ (provided that the rate of $h$ lies in a certain range).
Table 2.
Simulation results: estimation of constant coefficient γ
| a(t) | h | N = 1000 | | | | N = 2000 | | | |
| | | Bias | SD† | SE‡ | CP§ | Bias | SD† | SE‡ | CP§ |
| Polynomial | 0.6 | 0.002 | 0.247 | 0.252 | 0.964 | −0.007 | 0.168 | 0.167 | 0.948 |
| | 0.8 | −0.002 | 0.244 | 0.248 | 0.967 | −0.006 | 0.167 | 0.167 | 0.949 |
| | 1.0 | −0.003 | 0.242 | 0.248 | 0.967 | −0.005 | 0.166 | 0.169 | 0.953 |
| | 1.2 | −0.002 | 0.241 | 0.251 | 0.966 | −0.004 | 0.165 | 0.172 | 0.960 |
| | 1.4 | −0.002 | 0.240 | 0.255 | 0.971 | −0.003 | 0.164 | 0.175 | 0.965 |
| Sinusoid | 0.6 | 0.005 | 0.262 | 0.276 | 0.969 | 0.001 | 0.183 | 0.180 | 0.951 |
| | 0.8 | −0.002 | 0.257 | 0.270 | 0.969 | −0.001 | 0.179 | 0.180 | 0.955 |
| | 1.0 | −0.004 | 0.253 | 0.270 | 0.969 | −0.002 | 0.177 | 0.182 | 0.959 |
| | 1.2 | −0.005 | 0.250 | 0.272 | 0.973 | −0.002 | 0.177 | 0.185 | 0.959 |
| | 1.4 | −0.005 | 0.249 | 0.276 | 0.976 | −0.002 | 0.177 | 0.188 | 0.963 |
†SD: sample standard deviation of the proposed estimates in 1000 simulations.
‡SE: average of standard error estimates.
§CP: coverage probability of the 95% Wald-type confidence interval.
Finally, we assessed the performance of our proposed procedure for testing whether a coefficient is time-varying. Table 3 reports the rejection rates at the 5% level when testing constancy of the truly constant coefficient and of the truly time-varying coefficient, respectively, with a sample size of 2000. The empirical threshold was estimated by the 95th percentile of the sample test statistics from 1000 simulation runs, and the average resampling-based threshold was defined as the mean of the 1000 resampling-based thresholds. For the constant coefficient, the proposed testing procedure showed good error rate control; for the time-varying coefficient, it yielded reasonable power. Similar to the observation in Table 1, the results differed between the two coefficient types: the power was higher for detecting the log-sinusoid coefficient.
Table 3.
Simulation results: identifying time-varying coefficients with N = 2000
| a(t) | h | Constant coefficient | | | | Time-varying coefficient | | | |
| | | 5% rate | E.T† | R.T‡ | SD(R.T)§ | 5% rate | E.T† | R.T‡ | SD(R.T)§ |
| Polynomial | 0.6 | 0.065 | 1.220 | 1.111 | 0.160 | 0.412 | 2.053 | 1.123 | 0.160 |
| | 0.8 | 0.058 | 1.099 | 1.038 | 0.140 | 0.401 | 1.797 | 1.050 | 0.148 |
| | 1.0 | 0.050 | 1.084 | 1.008 | 0.137 | 0.363 | 1.649 | 1.020 | 0.147 |
| | 1.2 | 0.042 | 1.043 | 0.996 | 0.136 | 0.314 | 1.605 | 1.008 | 0.146 |
| | 1.4 | 0.039 | 1.024 | 0.993 | 0.134 | 0.277 | 1.562 | 1.004 | 0.143 |
| Sinusoid | 0.6 | 0.097 | 1.614 | 1.337 | 0.182 | 0.897 | 4.558 | 1.358 | 0.182 |
| | 0.8 | 0.068 | 1.369 | 1.237 | 0.157 | 0.882 | 3.820 | 1.263 | 0.155 |
| | 1.0 | 0.061 | 1.259 | 1.171 | 0.146 | 0.850 | 3.403 | 1.208 | 0.149 |
| | 1.2 | 0.055 | 1.171 | 1.124 | 0.141 | 0.798 | 2.997 | 1.171 | 0.149 |
| | 1.4 | 0.048 | 1.086 | 1.093 | 0.139 | 0.732 | 2.653 | 1.146 | 0.151 |
†E.T: empirical 95% quantile of the test statistics in 1000 simulations.
‡R.T: average of resampling-based thresholds.
§SD(R.T): standard deviation of resampling-based thresholds.
3.2. Breast cancer study in the NYUWHS
The details of this NCC study have been reported in Shore and others (2008). As an illustration, we estimated the gene–environment interaction effects of gene XPC with smoking exposure, which can induce DNA damage. Based on the results of Shore and others (2008), we assumed a recessive model for gene XPC and fitted model (1.1) with 4 covariates: Ethnicity (1 for Caucasian; 0 otherwise), XPC-Smoking-10 (1 for XPC-PAT +/+ nonsmokers; 0 otherwise), XPC-Smoking-01 (1 for XPC-PAT −/− or +/− smokers; 0 otherwise), and XPC-Smoking-11 (1 for XPC-PAT +/+ smokers; 0 otherwise). Because there were very few incident cases before age 45, we focused our analysis on the NCC data with age at diagnosis between 45 and 75.
We fitted the kernel-weighted partial likelihood approach using the Epanechnikov kernel function with bandwidth parameter $h$. The estimated coefficient curves, the point-wise confidence intervals, and the simultaneous confidence bands are presented in the top panel of Figure 3. We found that the Caucasian group had lower risk at early ages than the non-Caucasian group, but then the risk increased and peaked around ages 60–65. Both the 90% point-wise confidence interval and the simultaneous confidence band excluded zero. There was no significantly increased risk in the XPC-PAT +/+ nonsmokers group (XPC-Smoking-10) compared with the reference group of XPC wild-type nonsmokers, as the point-wise confidence intervals and simultaneous confidence bands all included zero. In the XPC-Smoking-01 group, the risk seemed to be elevated at early ages and then diminished, but the overall effect was not significant according to the simultaneous confidence band. Lastly, the risk of breast cancer in the group of XPC-PAT +/+ smokers was uniformly elevated across all ages, and the effect was borderline significant, which agreed with the findings in Shore and others (2008).
Fig. 3.
Analysis results of the breast cancer NCC study. The solid curves are the estimated coefficients; the dashed lines are 90% point-wise confidence intervals; the dotted lines are 90% simultaneous confidence bands.
We next applied the proposed resampling testing procedure to examine whether each covariate effect can be sufficiently described by a constant. We plot the observed test process and 10 resampling realizations in the lower panel of Figure 3, and the p-values based on 5000 resampling runs are also presented in the plots. The constant coefficient assumption for the Ethnicity variable was rejected at the 0.05 level (p-value), but this assumption seemed to be reasonable for all other covariates. Therefore, we further estimated the constant coefficients using the proposed integrated estimator (2.6). Compared with XPC wild-type nonsmokers, there was no significantly increased risk in the XPC-Smoking-10 group (OR, 95% CI: 0.57–1.76, p) or in the XPC-Smoking-01 group (OR, 95% CI: 0.68–1.30, p), but there was a significant increase in the XPC-Smoking-11 group (OR, 95% CI: 1.15–3.25, p). Again, these observations confirmed the results of Shore and others (2008).
4. DISCUSSION
As a way of enhancing modeling flexibility and providing an alternative and diagnostic tool for Cox's proportional hazards model, we developed inference procedures for the Cox regression model with time-varying coefficients in NCC studies. The NCC design is a member of a general class of cohort risk-set sampling designs that include counter-matching design (Langholz and Borgan, 1995), quota-matching design, and many others. Borgan and others (1995) studied Cox's proportional hazards model in this general framework using the marked point process to characterize the failure time process and the sampling scheme simultaneously. As a referee pointed out, the proposed inference procedure for NCC studies may be generalized to accommodate other types of cohort risk-set sampling designs by a formulation based on marked point process theory (Brémaud, 1981).
In cohort studies, the data can often present not only right censoring but also left truncation. For example, left-truncated data occur when only subjects who are disease-free enter the study because the disease can only happen afterward. To incorporate the left truncation, controls need to be drawn from an adjusted risk set defined as $R(t) = \{i : L_i \le t \le X_i\}$, where $L_i$ denotes the left-truncation time for subject $i$. The extension to accommodate left-truncated and right-censored NCC data is certainly of interest and may be handled in a unified manner using the marked point process and martingale theory (Brémaud, 1981). We shall further investigate this extension elsewhere.
Our analysis of the breast cancer data in the NYUWHS has confirmed the finding of interaction between gene XPC in the NER pathway and smoking, a DNA-damaging agent (Shore and others, 2008). In addition, the finding that the Caucasian group has lower risk at earlier ages and then elevated risk at older ages compared to the non-Caucasian group is consistent with the well-described black-to-white ethnic crossover in breast cancer incidence rate (Gray and others, 1980; Joslyn and others, 2005; Anderson and others, 2008) and demonstrates the issue of race disparity in disease risk. In practice, it is also common to add covariate-by-time interaction terms to Cox's proportional hazards model to examine possible time-varying coefficients. We also considered this approach by adding an interaction term of ethnicity with time and fitting the model with Thomas' partial likelihood estimation approach. However, the interaction term did not reach statistical significance (p-value). One possible reason is that, as shown in Figure 3, the ethnic effect is U-shaped rather than linear or monotone, and the simple covariate–time interaction term may not be able to capture such a trend. In summary, the Cox model with time-varying coefficients can elucidate the effect of risk factors on the disease and provide an intuitive tool to visualize such effects.
SUPPLEMENTARY MATERIAL
Supplementary material is available at http://biostatistics.oxfordjournals.org.
FUNDING
National Cancer Institute (CA098661, CA091892, CA16087, CA140632); Department of Defense (DAMD17-01-1-0578); Susan G. Komen Breast Cancer Research Foundation (BCTR 2000 685).
Acknowledgments
The authors would like to thank the co-editor, an associate editor, and a referee for their valuable suggestions. Conflict of Interest: None declared.
References
- Anderson WF, Rosenberg PS, Menashe I, Mitani A, Pfeiffer RM. Age-related crossover in breast cancer incidence rates between black and white ethnic groups. Journal of National Cancer Institute. 2008;100:1804–1814. doi: 10.1093/jnci/djn411.
- Brémaud P. Point Processes and Queues: Martingale Dynamics. New York: Springer; 1981.
- Borgan Ø, Goldstein L, Langholz B. Methods for the analysis of sampled cohort data in the Cox proportional hazards model. Annals of Statistics. 1995;23:1749–1778.
- Cai ZW, Sun YQ. Local linear estimation for time-dependent coefficients in Cox's regression models. Scandinavian Journal of Statistics. 2003;30:93–111.
- Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. London: Chapman and Hall; 1996.
- Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the Cox regression model. Annals of Statistics. 1992;20:1903–1928.
- Gray GE, Henderson BE, Pike MC. Changing ratio of breast cancer incidence rates with age of black females compared with white females in the United States. Journal of National Cancer Institute. 1980;64:461–463.
- Härdle W, Marron JS. Bootstrap simultaneous error bars for nonparametric regression. Annals of Statistics. 1991;19:778–796.
- Hastie T, Tibshirani R. Varying-coefficient models. Journal of the Royal Statistical Society, Series B. 1993;55:757–796.
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2008.
- Joslyn SA, Foote ML, Nasseri K, Coughlin SS, Howe HL. Racial and ethnic disparities in breast cancer rates by age: NAACCR breast cancer project. Breast Cancer Research and Treatment. 2005;92:97–105. doi: 10.1007/s10549-005-2112-y.
- Langholz B, Borgan Ø. Counter-matching: a stratified nested case-control sampling method. Biometrika. 1995;82:69–79.
- Lin DY, Fleming TR, Wei LJ. Confidence bands for survival curves under the proportional hazards model. Biometrika. 1994;81:73–81.
- Martinussen T, Scheike TH, Skovgaard IM. Efficient estimation of fixed and time-varying covariate effects in multiplicative intensity models. Scandinavian Journal of Statistics. 2002;29:57–74.
- Murphy SA, Sen PK. Time-dependent coefficients in a Cox-type regression-model. Stochastic Processes and Their Applications. 1991;39:153–180.
- Oakes D. Survival times: aspects of partial likelihood. International Statistical Review. 1981;49:235–252.
- Shore RE, Zeleniuch-Jacquotte A, Currie D, Mohrenweiser H, Afanasyeva Y, Koenig KL, Arslan AA, Toniolo P, Wirgin I. Polymorphisms in XPC and ERCC2 genes, smoking and breast cancer risk. International Journal of Cancer. 2008;122:2101–2105. doi: 10.1002/ijc.23361.
- Thomas DC. Addendum to “Methods of Cohort Analysis—Appraisal by Application to Asbestos Mining” by Liddell, F.D.K., McDonald, J.C., and Thomas, D.C., J.R. Journal of the Royal Statistical Society, Series A. 1977;140:469–491.
- Tian L, Zucker D, Wei LJ. On the Cox model with time-varying regression coefficients. Journal of the American Statistical Association. 2005;100:172–183.
- Tibshirani R, Hastie T. Local likelihood estimation. Journal of the American Statistical Association. 1987;82:559–568.
- Zucker DM, Karr AF. Nonparametric survival analysis with time-dependent covariate effects—a penalized partial likelihood approach. Annals of Statistics. 1990;18:329–353.