Summary
Length-biased time-to-event data are commonly encountered in applications ranging from epidemiologic cohort studies or cancer prevention trials to studies of labor economy. A longstanding statistical problem is how to assess the association of risk factors with survival in the target population given the observed length-biased data. In this paper, we demonstrate how to estimate these effects under the semiparametric Cox proportional hazards model. The structure of the Cox model is changed under length-biased sampling in general. Although the existing partial likelihood approach for left-truncated data can be used to estimate covariate effects, it may not be efficient for analyzing length-biased data. We propose two estimating equation approaches for estimating the covariate coefficients under the Cox model. We use the modern stochastic process and martingale theory to develop the asymptotic properties of the estimators. We evaluate the empirical performance and efficiency of the two methods through extensive simulation studies. We use data from a dementia study to illustrate the proposed methodology, and demonstrate the computational algorithms for point estimates, which can be directly linked to the existing functions in S-PLUS or R.
Keywords: Cox model, Dependent censoring, Estimating equation, Length-biased
1. Introduction
In observational studies, such as studies of unemployment duration in the labor economy (Lancaster, 1979; de Uña-Álvarez, Otero-Giráldez, and Álvarez-Llorente, 2003), cancer screening trials (Zelen and Feinleib, 1969; Zelen, 2004), and HIV prevalent cohort studies (Lagakos, Barraj, and De Gruttola, 1988), one often encounters right-censored time-to-event data subject to length-biased sampling. Length-biased sampling is a special case of left truncation. Following the terminology in the literature, length-biased data are defined for left-truncated and right-censored data under the stationarity assumption, which assumes that the initiation times follow a stationary Poisson process. As a result, the probability of observing a failure time t is proportional to t itself.
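The proportionality claim can be checked numerically. The following is a minimal Python sketch, not part of the original analysis. It assumes unit-rate exponential durations and a uniform onset-to-sampling time on (0, tau); both are illustrative choices, as are all function and variable names. Under these assumptions the sampled durations should have mean E(T̃²)/E(T̃) = 2 rather than the population mean of 1.

```python
import random

def sample_length_biased(n, tau=50.0, seed=1):
    """Keep durations that are still ongoing at a uniformly placed sampling time."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        t = rng.expovariate(1.0)    # unbiased duration, population mean 1 (assumed)
        a = rng.uniform(0.0, tau)   # time from initiation to sampling (stationarity)
        if t > a:                   # only subjects who have not yet failed are sampled
            kept.append(t)
    return kept

biased = sample_length_biased(5000)
mean_biased = sum(biased) / len(biased)
print("mean of sampled durations:", round(mean_biased, 3))
```

Because a duration t is retained with probability roughly t/tau, the retained sample over-represents long durations, which is the bias the regression methods in this paper are designed to correct.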
Length-bias is one of the major biases that are difficult to remove by trial design and may confound the interpretation of disease-specific survival. In a randomized cancer screening trial, the observed survival benefit for individuals whose disease is detected by screening versus by symptoms can be confounded with length bias. This is because cancers with a longer duration of preclinical disease are more likely to be detected by screening examination, and are thus overrepresented among the screen-detected cases. Moreover, cancers with a longer preclinical duration (i.e., with slower growing tumors) are often associated with more favorable prognoses. Another example of length-biased data can be seen in a study of dementia among elderly people. In the Canadian Study of Health and Aging (CSHA), a total of 14,026 subjects who were 65 years or older were randomly selected throughout Canada, and 10,263 agreed to participate in this multicenter epidemiologic study (Wolfson et al., 2001). Among the participants, 1,132 were identified as having dementia and were followed until the end of the study. The investigators noted that patients who had experienced a longer duration of dementia symptoms at the time of recruitment to the CSHA tended to live longer (Wolfson, et al., 2001). That is, the sampled cases were subject to length-bias. It is of great interest to investigate how different types of dementia may impact long-term survival in a regression analysis after adjusting for length-biased sampling.
In the aforementioned example, the observed time from the diagnosis of the disease to the subsequent event (death) is subject to length-biased sampling. The outcome of interest is time from disease diagnosis to death. The data set we considered is a prevalent cohort consisting of subjects with the disease of interest at the examination time who were then followed for a subsequent terminal event (e.g., death). The data on each subject in the cohort include an initiating event (e.g., diagnosis of the disease) and a failure event (disease recurrence or death) for those subjects who have been accrued. Apparently, the failure times will be left truncated if the initiating event is not observed or the failure event occurs before sampling time, and the failure times can be right censored during follow-up. Length-biased sampling occurs in this setting because the “observed” time intervals from initiation to failure in the prevalent cohort tend to be longer than those in the target population.
Methodology development has focused on nonparametric estimation of the length-biased distribution in one-sample problems (Turnbull, 1976; Vardi, 1982, 1985, 1989; Gill, Vardi, and Wellner, 1988; Lagakos et al., 1988; Wang, 1989; Asgharian, M'Lan, and Wolfson, 2002; Asgharian and Wolfson, 2005). One complication in analyzing such data is the potential dependence between the failure time and the right-censoring time, measured from the initiating event (diagnosis) to the event of interest. The informative censoring induced by the sampling scheme has been avoided by prohibiting right censoring (Vardi, 1985; Wang, 1996) or by simply ignoring it. A second complication occurs when evaluating the covariate effects on the time interval measured from the diagnosis of the disease to the event of failure for the target population. This evaluation proves to be difficult because the model structure assumed for the target population is often different from the one for the observed length-biased data. Recently, Shen, Ning, and Qin (2009) proposed some methods for modeling covariate effects for length-biased data under transformation and accelerated failure time models. Bergeron, Asgharian, and Wolfson (2008) assessed the bias induced for covariate estimates under length-biased sampling using a full-likelihood approach with a parametric model.
Cox's proportional hazards model has been widely used to model the risk factors of a failure time in classical survival analyses (Cox, 1972). On the other hand, it has been noted in the literature that conventional regression methods, such as the standard Cox partial likelihood method, may produce biased estimators if right-censored data are subject to biased sampling. The partial likelihood approach proposed for left-truncated data can be applied to estimate the covariate effects for length-biased data under the Cox model (e.g., Kalbfleisch and Lawless, 1991; Keiding, 1992; Wang, Brookmeyer, and Jewell, 1993). However, the efficiency of the estimators may not be ideal because the important information pertaining to the stationary Poisson process is not utilized. Wang (1996) was the first to use the semiparametric proportional hazards model to estimate covariate effects when the observed failure times were length-biased; Wang used a bias-adjusted risk set for the construction of the pseudo-likelihood. However, a major restriction in her approach was the assumption that the length-biased data are not subject to right censoring. More recently, Ghosh (2008) proposed an estimating equation approach that allows right censoring of the length-biased data under a proportional hazards model. However, Ghosh assumed that the cross-sectional data did not have any follow-up. Therefore, Ghosh's proposed method may not be general enough or valid if there are follow-up data subject to right censoring.
Given the popularity and importance of the Cox regression model for analyzing survival data, the aims of this work are to propose two inference methods to assess the covariate effects under the semiparametric Cox model for length-biased data subject to right censoring, and to compare the proposed methods with the conditional approach for left-truncated data. The proposed methods are based on the generalized estimating equations. One major advantage of the proposed methods is computational simplicity. The estimation algorithms can be directly linked to existing S-PLUS, R, or SAS codes for the Cox model by adding appropriate weights for the linear predictor in the function. The remainder of this paper is organized as follows. In Section 2 we introduce the basic notation and the estimating equations and provide inference procedures and theoretical properties for the proposed estimators. In Section 3 we evaluate the performance of the proposed estimators and compare them with existing methods through simulation studies. We also illustrate the methods through application to the dementia data example and demonstrate the computational algorithms for point estimates, which can be directly linked to the existing functions in S-PLUS or R. We provide concluding remarks in Section 4 and proofs of the theorems in the Appendix.
2. Estimation Methods
2.1 Data and Model
Let T̃ be the duration from the initiating event (diagnosis or onset of the disease) to failure, A the duration from the initiating event to examination, V the duration from examination to failure, and C the duration from examination to censoring. Under length-biased sampling, one can only observe T among those T̃ > A. Let T = A + V be a positive lifetime random variable, where A is the truncation variable (or backward recurrence time), V is the residual survival time (or forward recurrence time), and X is the baseline covariate vector. It is reasonable to assume that C and (A, V) are independent, and that the censoring distribution is independent of covariate X.
For a random sample of n independent subjects, the observed data consist of {(Ai, Yi, δi, Xi), i = 1, … , n}, where Yi = min(Ti, Ai + Ci), Ti = Ai + Vi, and δi = I(Vi ⩽ Ci). Let f represent the unbiased density for T̃, and g represent the length-biased density (conditional on T̃ > A). Then, for the observed length-biased data T, its density function g is associated with the unbiased density f, as follows:
$$ g(t) = \frac{t f(t)}{\mu}, \qquad \mu = \int_0^\infty u f(u)\,du. $$
Given the covariates, X = x, the density of T can be expressed as
$$ g(t \mid x) = \frac{t f(t \mid x)}{\mu(x)}, \qquad \mu(x) = \int_0^\infty u f(u \mid x)\,du, $$
where g(t|x) and f(t|x) denote the covariate-specific length-biased sampling density and the population (unbiased) density. Assume that failure times in the target population (unbiased), T̃, follow the proportional hazards model
$$ \lambda(t \mid X) = \lambda_0(t)\exp(\beta_0^{\top} X), \tag{1} $$
where λ0(t) is an unspecified baseline hazards function and β0 is a vector-valued unknown regression coefficient for X.
Likelihood principle
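In order to better understand the structure of length-biased data, we start with the bivariate observation for A and T. Given the covariate X = x, the joint density of (A, T) can be decomposed as a product of the marginal distribution of A and the conditional distribution of T given A. Such a formulation has been utilized in analyzing left-truncated data (e.g., Andersen et al., 1993, pp 166-167; Wang, 1989):
$$ f_{A,T}(a, t \mid x) = \frac{f(t \mid x)}{\mu(x)}\, I(t > a) = \frac{S_U(a \mid x)}{\mu(x)} \times \frac{f(t \mid x)}{S_U(a \mid x)}\, I(t > a), $$
where SU(t|x) is the survival distribution for the unbiased failure time given x. Given truncation time A = a, the conditional likelihood of Y is proportional to
$$ L_C = \prod_{i=1}^{n} \left\{ \frac{f(y_i \mid x_i)}{S_U(a_i \mid x_i)} \right\}^{\delta_i} \left\{ \frac{S_U(y_i \mid x_i)}{S_U(a_i \mid x_i)} \right\}^{1-\delta_i}. \tag{2} $$
As described in detail by Wang et al. (1993), LC can be further expressed as the product of a partial likelihood and the residual likelihood:
$$ L_C = L_P(\beta) \times L_R(\beta, \lambda_0), $$
where
$$ L_P(\beta) = \prod_{i=1}^{n} \left\{ \frac{\exp(\beta^{\top} x_i)}{\sum_{j \in R(y_i)} \exp(\beta^{\top} x_j)} \right\}^{\delta_i}, \tag{3} $$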
and the residual likelihood, LR(β,λ0) is referred to as an “ancillary” term by Wang et al. (1993), which includes the baseline hazard function λ0 and β. Under the Cox model, LP has an expression similar to that of the partial likelihood function for traditional survival data (without left truncation) except for the definition of the risk sets R(y) = {j : aj ⩽ y ⩽ yj}. Intuitively, the ignored information for covariates (i.e., β) contained in the marginal distribution of A and the residual likelihood may lead to a loss of efficiency.
2.2 Estimating Equation Approaches
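We start with a special case in modeling covariate effects for length-biased data without right censoring. Let the marginal density function of covariate X be denoted as h(x). Then the conditional distribution of X given T = t follows:
$$ P(X = x \mid T = t) = \frac{f(t \mid x)\, h(x)/\mu(x)}{\int f(t \mid u)\, h(u)/\mu(u)\,du}. $$
Thus, under the proportional hazards model, the conditional expectation of X is
$$ E(X \mid T = t) = \frac{\int x\, \lambda_0(t) e^{\beta_0^{\top} x} S_U(t \mid x)\, h(x)/\mu(x)\,dx}{\int \lambda_0(t) e^{\beta_0^{\top} x} S_U(t \mid x)\, h(x)/\mu(x)\,dx} = \frac{\int x\, e^{\beta_0^{\top} x} S_U(t \mid x)\, h(x)/\mu(x)\,dx}{\int e^{\beta_0^{\top} x} S_U(t \mid x)\, h(x)/\mu(x)\,dx}. $$
The second equality holds because λ0(t) is canceled out. Using the fact that
$$ E\{T^{-1} I(T \ge t) \mid X = x\} = \int_t^\infty u^{-1}\, \frac{u f(u \mid x)}{\mu(x)}\,du = \frac{S_U(t \mid x)}{\mu(x)}, \tag{4} $$
we obtain
$$ E(X \mid T = t) = \frac{E\{X e^{\beta_0^{\top} X}\, T^{-1} I(T \ge t)\}}{E\{e^{\beta_0^{\top} X}\, T^{-1} I(T \ge t)\}}, $$
where the expectations on the right side are taken over a generic length-biased pair (T, X) with t held fixed. Therefore, we can construct the following unbiased estimating equation to estimate β:
$$ \sum_{i=1}^{n} \left[ X_i - \frac{\sum_{j=1}^{n} I(T_j \ge T_i)\, T_j^{-1} X_j e^{\beta^{\top} X_j}}{\sum_{j=1}^{n} I(T_j \ge T_i)\, T_j^{-1} e^{\beta^{\top} X_j}} \right] = 0. $$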
In fact, the above estimating equation is the same as the score equation derived from the pseudo-likelihood function by Wang (1996). By generalizing the above derivations to length-biased data with right censoring, we propose two estimating equation approaches.
Estimating Equation I
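When length-biased failure time T is subject to right censoring, a natural extension of the above estimating equation can be proposed as follows. Recall that the joint density distributions of (A, V) and (A, T) given covariate X = x have the same formula without censoring (Asgharian and Wolfson, 2005):
$$ f_{A,V}(a, v \mid x) = \frac{f(a + v \mid x)}{\mu(x)}, \qquad f_{A,T}(a, t \mid x) = \frac{f(t \mid x)}{\mu(x)}\, I(t > a). $$
With potential censoring, the probability of observing a pair of uncensored data is
$$ P(A = a, Y = y, \delta = 1 \mid X = x) = \frac{f(y \mid x)\, S_C(y - a)}{\mu(x)}, \qquad 0 < a < y, \tag{5} $$
where SC is the survival distribution for censoring variable C, assuming that the right-censoring variable C is independent of covariate X. Using a concept similar to (4), we have the following conditional expectation when there is right censoring:
$$ E\left\{ \frac{\delta\, I(Y \ge t)}{Y\, S_C(Y - A)} \,\Big|\, X = x \right\} = \frac{S_U(t \mid x)}{\mu(x)}. \tag{6} $$
In addition, utilizing equation (6), we can replace SU(y|X)/μ(X) in the following conditional expectation:
$$ E(X \mid Y = y, \delta = 1) = \frac{E\{X e^{\beta_0^{\top} X}\, S_U(y \mid X)/\mu(X)\}}{E\{e^{\beta_0^{\top} X}\, S_U(y \mid X)/\mu(X)\}}. \tag{7} $$
Combining (6) and (7) leads to the following estimating equation:
$$ \sum_{i=1}^{n} \delta_i \left[ X_i - \frac{\sum_{j=1}^{n} \delta_j I(Y_j \ge Y_i)\, X_j e^{\beta^{\top} X_j} \{Y_j S_C(Y_j - A_j)\}^{-1}}{\sum_{j=1}^{n} \delta_j I(Y_j \ge Y_i)\, e^{\beta^{\top} X_j} \{Y_j S_C(Y_j - A_j)\}^{-1}} \right] = 0. \tag{8} $$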
When SC is unknown, we can replace it with its consistent Kaplan-Meier estimator for residual censoring time, which leads to an asymptotic unbiased estimating equation we call EE-I.
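The plug-in step can be sketched in a few lines. The Python below is illustrative only: it assumes the EE-I weight for an observed failure has the inverse-probability form 1/{Y_i Ŝ_C(Y_i − A_i)}, treats censored subjects as the "events" when estimating the censoring distribution from the residual times Y_i − A_i, and uses a toy data set and helper names (`km_censoring`, `step_value`) that are not from the paper.

```python
def km_censoring(resid, delta):
    """Kaplan-Meier steps for S_C; censored subjects (delta == 0) count as events."""
    steps, surv = [], 1.0
    for t in sorted({r for r, d in zip(resid, delta) if d == 0}):
        at_risk = sum(1 for r in resid if r >= t)
        events = sum(1 for r, d in zip(resid, delta) if r == t and d == 0)
        surv *= 1.0 - events / at_risk
        steps.append((t, surv))          # (jump time, value just after the jump)
    return steps

def step_value(steps, u):
    """Evaluate the right-continuous step function at u (1 before the first jump)."""
    val = 1.0
    for t, s in steps:
        if t > u:
            break
        val = s
    return val

# Toy data (hypothetical): truncation times, observed times, failure indicators.
A = [1.0, 2.0, 1.0, 3.0, 2.0]
Y = [3.0, 5.0, 2.0, 4.0, 6.0]
delta = [1, 0, 1, 0, 1]
resid = [y - a for y, a in zip(Y, A)]    # residual follow-up times Y_i - A_i

steps = km_censoring(resid, delta)
# Assumed EE-I weights, computed for observed failures only.
# Note: the weight is undefined if the estimate of S_C reaches 0 at a failure's residual time.
W1 = [1.0 / (y * step_value(steps, r)) for y, r, d in zip(Y, resid, delta) if d == 1]
print(steps, W1)
```

Small values of Ŝ_C in the tail inflate these weights, which is one way to see why an inverse-survival weight can become unstable under heavy censoring.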
Estimating Equation II
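An alternative estimating equation approach with a different weight can be proposed. Given (5), we can express the probability of observing the length-biased failure time at y by integrating out a:
$$ P(Y = y, \delta = 1 \mid X = x) = \int_0^y \frac{f(y \mid x)\, S_C(y - a)}{\mu(x)}\,da = \frac{f(y \mid x)\, w_c(y)}{\mu(x)}, \tag{9} $$
where $w_c(y) = \int_0^y S_C(t)\,dt$. Based on (9), we have
$$ E\left\{ \frac{\delta\, I(Y \ge t)}{w_c(Y)} \,\Big|\, X = x \right\} = \frac{S_U(t \mid x)}{\mu(x)}. \tag{10} $$
Therefore, we can replace SU(y|X)/μ(X) in equation (7) for the conditional expectation of X with the corresponding observed data to construct the following unbiased estimating equation to estimate β:
$$ \sum_{i=1}^{n} \delta_i \left[ X_i - \frac{\sum_{j=1}^{n} \delta_j I(Y_j \ge Y_i)\, X_j e^{\beta^{\top} X_j} \{w_c(Y_j)\}^{-1}}{\sum_{j=1}^{n} \delta_j I(Y_j \ge Y_i)\, e^{\beta^{\top} X_j} \{w_c(Y_j)\}^{-1}} \right] = 0. \tag{11} $$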
By replacing wc(t) with its consistent estimator, we have an asymptotic unbiased estimating equation, which we call EE-II.
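One way to obtain such an estimator is to integrate a Kaplan-Meier step function for the censoring survival exactly. The sketch below is a hedged illustration in Python: it assumes w_c(y) is the integral of S_C from 0 to y, takes the step function as a given input in the form of (jump time, value after jump) pairs, and uses hypothetical names and toy numbers throughout.

```python
def integrate_step_survival(steps, y):
    """Integrate a right-continuous step survival function from 0 to y.

    steps: sorted (jump_time, value_after_jump) pairs; the function equals 1
    before the first jump.
    """
    total, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in steps:
        if t >= y:
            break
        total += prev_s * (t - prev_t)   # area of the flat piece ending at this jump
        prev_t, prev_s = t, s
    return total + prev_s * (y - prev_t)  # last, possibly partial, piece

# Hypothetical censoring survival estimate with jumps at 1 and 3.
steps = [(1.0, 0.8), (3.0, 0.4)]
failure_times = [3.0, 2.0, 6.0]          # observed failure times Y_i (toy values)
w_c = [integrate_step_survival(steps, y) for y in failure_times]
W2 = [1.0 / v for v in w_c]              # assumed EE-II weights
print(w_c, W2)
```

Because the integral keeps growing with y even after the survival estimate has dropped, this weight stays bounded away from the extremes for large failure times, in contrast to a weight built on the inverse of the survival function itself.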
Estimating Equation LT
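Clearly the above two estimating equations require the information of the distribution function for censoring variable C. In contrast, an approach proposed for delayed-entry/left-truncated data does not require estimating the survival function for the censoring variable. Conditional on X, one has
$$ E\{I(A \le y \le Y) \mid X = x\} = \int_0^y \frac{S_U(y \mid x)\, S_C(y - a)}{\mu(x)}\,da = w_c(y)\, \frac{S_U(y \mid x)}{\mu(x)}. $$
Therefore, when censoring variable C is independent of covariate X, wc(.) is canceled out on the right side of the following equation:
$$ E(X \mid Y = y, \delta = 1) = \frac{E\{X e^{\beta_0^{\top} X}\, I(A \le y \le Y)\}}{E\{e^{\beta_0^{\top} X}\, I(A \le y \le Y)\}}. $$
Similar to the constructions of the first two estimating equations, one can construct the estimating equation
$$ \sum_{i=1}^{n} \delta_i \left[ X_i - \frac{\sum_{j=1}^{n} I(A_j \le Y_i \le Y_j)\, X_j e^{\beta^{\top} X_j}}{\sum_{j=1}^{n} I(A_j \le Y_i \le Y_j)\, e^{\beta^{\top} X_j}} \right] = 0, \tag{12} $$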
which is the score equation of the partial likelihood of (3) for left-truncated data under the Cox model (Kalbfleisch and Lawless, 1991; Andersen et al., 1993; Wang et al., 1993). We call this equation EE-LT. Unlike EE-I and EE-II, the summations in the fraction terms of EE-LT can include both failure and censored times as long as the pair (aj, yj) satisfies the inequality condition. The large sample properties for the above estimating equation have been explored in the literature (Wang, 1989; Wang et al., 1993).
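The inequality condition on (a_j, y_j) is what separates the left-truncated risk set from the ordinary one. The short Python sketch below, with hypothetical helper names and toy values, lists both sets at a given time so the effect of delayed entry is visible.

```python
def risk_set_left_truncated(a, y, t):
    """Indices j with a_j <= t <= y_j: subjects who have entered and are still under observation."""
    return [j for j in range(len(y)) if a[j] <= t <= y[j]]

def risk_set_standard(y, t):
    """Indices j with y_j >= t: the usual risk set, which ignores delayed entry."""
    return [j for j in range(len(y)) if y[j] >= t]

# Toy truncation times and observed (failure or censoring) times.
a = [1.0, 2.0, 1.0, 3.0, 2.0]
y = [3.0, 5.0, 2.0, 4.0, 6.0]

lt = risk_set_left_truncated(a, y, 2.0)
std = risk_set_standard(y, 2.0)
print(lt, std)
```

In this toy example the subject with truncation time 3 is excluded from the left-truncated risk set at time 2 because that subject had not yet been sampled, although the ordinary risk set would include it.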
2.3 Asymptotic Properties
The consistency and weak convergence of β̂ can be established for estimating equations EE-I and EE-II under the regularity conditions stated in the Appendix. Using the counting process notation of Andersen et al. (1993), for the ith subject, define risk set Ri(t) = I{Yi ⩾ t}δi and Ni(t) = I{Yi ⩽ t, Ci ⩾ Yi − Ai}. Define
where k = 1 for EE-I and k = 2 for EE-II, l = 0, 1, 2, and
Also, let
ek(β, t) be the expectation of Ek(β, t), and let sk(l)(β, t) be the expectation of Sk(l)(β, t).
Estimating Equation I for β̂1
By generalizing the theoretical formulation of Wang (1996) to the setting with right censoring, we can construct
and prove it to be a mean zero stochastic process. Specifically, using equation (6),
Therefore, if SC is a known function, estimating equation (8) can be asymptotically represented by the following independent and identically-distributed (i.i.d.) summation of the mean zero process:
| (13) |
To obtain an estimator for β, we replace SC(t) with its consistent Kaplan-Meier estimator ŜC(t) for the censoring time in (13). We then have the estimating equation
| (14) |
where
In the Appendix and Section 1 of the Supplementary Materials, we show that under the regularity conditions there exists a unique solution to the equations Ũ1(β) = 0, and . Moreover, n−1/2Ũ1(β) converges weakly to a mean zero Gaussian process with a variance-covariance function ∑1. Let β̂1 be the solution to equation (14). By Taylor series expansion,
where β* is on the line segment between β̂1 and β0, and Γ1(β) = −n−1∂Ũ1(β)/∂β.
Given the estimated β̂1 for β0, a natural estimator for the cumulative baseline hazard function that is similar to Breslow's estimator can be proposed:
The variance-covariance of β1 can be consistently estimated by , where
, and Λ̂C(t) is the Nelson-Aalen estimator for residual survival time C.
Estimating Equation II for β̂2
Similarly, we can prove that the following stochastic process has a mean of zero:
Using equation (10)
Therefore, if SC is a known function, estimating equation (11) can be asymptotically represented by the following i.i.d. summation of the mean zero processes:
By replacing wc(t) with ŵc(t), we can solve for β using the following estimating equation:
where
In the Appendix and Section 2 of the Supplementary Materials, we prove that n−1/2Ũ2(β0) converges weakly to a p-vector mean-zero Gaussian process with a covariance matrix ∑2. In addition, the solution to the equation Ũ2(β0) = 0 is consistent and unique. Using the Taylor series expansion,
where β* is on the line segment between β̂2 and β0, Γ2(β) = −n−1∂Ũ2(β)/∂β, and ∑2 is the covariance matrix of limn→∞Ũ2(β0).
The covariance matrix of β̂ can be consistently estimated by , where
, and
Remark
The weight functions W1i and W2i in the two estimating equations play a similar role in adjusting for the dependent censoring distribution in length-biased data. Theoretically, both weight functions are valid choices. In the next section, we will investigate the empirical performance of the estimators solved from EE-I and EE-II under various scenarios.
3. Numerical Studies
3.1 Simulations
We carried out a series of simulation studies to assess the efficiencies of the two proposed estimators relative to the estimator from UL(β) and the estimator from the estimating equation by Ghosh (2008), which we refer to as EE-III and which is specified as follows:
| (15) |
Note that (15) seems to have the same components of EE-I in (8), but the term SC(Yi − Ai) in (8) is replaced by SC(Yi) and information for the left-truncation time A is not used.
We generated unbiased failure times T̃i from the proportional hazards model
where β = (α1, α2) = (1, 1) or (2, 2), the binary covariate X1 ~ Bernoulli(1,0.5), the continuous covariate X2 ~ uniform(−0.5,0.5), and the baseline hazard function is 2t. The left-truncation time Ai was independently generated from a uniform distribution (0, τ0), and the pairs (Ai, T̃i) with Ai < T̃i were kept. We chose τ0 larger than the upper bound of T̃i to ensure the stationarity assumption. When τ0 was large enough, and we let approximate to zero if τ0 → ∞,
The censoring variables measured from the examination time were independently generated from uniform distributions corresponding to various censoring percentages: 20%, 35%, and 50%. The censoring indicator was obtained by δi = I(Ti ⩽ Ci + Ai). For each scenario, we repeated the simulation 1000 times with cohorts of size n = 200 or n = 400.
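The data-generating steps described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' simulation code: the truncation bound `tau0`, the censoring bound `cmax`, the seed, and the function name are hypothetical choices. With baseline hazard 2t the cumulative hazard is t² exp(β'x), so failure times are drawn by inverting the survival function.

```python
import math
import random

def simulate_cohort(n, beta=(1.0, 1.0), tau0=4.0, cmax=2.0, seed=7):
    """Generate n length-biased, right-censored records (A, Y, delta, X1, X2)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        x1 = 1.0 if rng.random() < 0.5 else 0.0       # binary covariate
        x2 = rng.uniform(-0.5, 0.5)                   # continuous covariate
        risk = math.exp(beta[0] * x1 + beta[1] * x2)
        # S(t | x) = exp(-t^2 * risk); invert with U uniform on (0, 1].
        t = math.sqrt(-math.log(1.0 - rng.random()) / risk)
        a = rng.uniform(0.0, tau0)                    # left-truncation time
        if a >= t:
            continue                                  # failed before sampling: not observed
        c = rng.uniform(0.0, cmax)                    # censoring time from examination
        y = min(t, a + c)
        delta = 1 if t <= a + c else 0
        rows.append((a, y, delta, x1, x2))
    return rows

cohort = simulate_cohort(200)
censoring_rate = 1.0 - sum(r[2] for r in cohort) / len(cohort)
print("observed censoring rate:", round(censoring_rate, 3))
```

In such a sketch the upper bound of the censoring distribution would be tuned until the observed censoring rate matches the target percentage, and `tau0` is chosen large enough that failure times beyond it have negligible probability.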
The simulation results are summarized in Table 1. When the right-censoring rate varies from low to moderate (20 to 35%), both EE-I and EE-II have unbiased estimators and reasonable coverage probabilities. In contrast, the estimators from (15) led to severe bias and poor coverage probabilities in all the scenarios we investigated. When the censoring percentage is small, which is equivalent to the censoring variable C being subject to heavy censoring, the proposed variance estimators derived from the weak convergence of Ũ2 may slightly overestimate the true variance of β̂, which leads to an overestimated coverage probability. The reason for this is that the weight ŵc(t) may be slightly overestimated under the heavy censoring of variable C. The overestimation for the variance estimator disappears when the right-censoring proportion increases. In application, this concern can be eliminated by using the bootstrap variance estimator of β̂ if the censoring proportion is very small.
Table 1.
Empirical estimators and coverage probabilities of the 95% confidence interval under three estimating equations; C%: censoring %; EE: estimating equation; 95% CP: 95% coverage probability; RE: ratios of empirical variances (e.g., for α̂1, the ratio of the variance estimators from EE-II and EE-I)
| (α1, α2) | C% | EE | α̂1 (n = 200) | α̂2 (n = 200) | 95% CP α̂1 (n = 200) | 95% CP α̂2 (n = 200) | RE α̂1 (n = 200) | RE α̂2 (n = 200) | α̂1 (n = 400) | α̂2 (n = 400) | 95% CP α̂1 (n = 400) | 95% CP α̂2 (n = 400) | RE α̂1 (n = 400) | RE α̂2 (n = 400) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1,1) | 20 | I | 0.991 | 0.972 | .974 | .959 | .89 | .89 | 0.989 | 0.970 | .978 | .956 | .92 | .81 |
| (1,1) | 20 | II | 1.020 | 1.016 | .985 | .970 | 1.0 | 1.0 | 1.010 | 1.010 | .984 | .978 | 1.0 | 1.0 |
| (1,1) | 20 | LT | 1.008 | 1.015 | .934 | .948 | .56 | .67 | 1.003 | 1.004 | .953 | .946 | .58 | .65 |
| (1,1) | 20 | III | 0.346 | 0.362 | .078 | .381 | | | 0.345 | 0.349 | .001 | .071 | | |
| (1,1) | 35 | I | 0.936 | 0.920 | .930 | .912 | .79 | .78 | 0.934 | 0.901 | .920 | .895 | .73 | .59 |
| (1,1) | 35 | II | 1.024 | 1.026 | .974 | .958 | 1.0 | 1.0 | 1.011 | 0.999 | .973 | .973 | 1.0 | 1.0 |
| (1,1) | 35 | LT | 1.008 | 1.013 | .948 | .952 | .72 | .85 | 1.006 | 1.000 | .961 | .951 | .76 | .71 |
| (1,1) | 35 | III | 0.293 | 0.315 | .068 | .425 | | | 0.297 | 0.296 | .000 | .099 | | |
| (1,1) | 50 | I | 0.838 | 0.813 | .841 | .866 | .70 | .71 | 0.841 | 0.835 | .762 | .838 | .56 | .61 |
| (1,1) | 50 | II | 1.019 | 1.003 | .953 | .944 | 1.0 | 1.0 | 1.016 | 1.007 | .958 | .956 | 1.0 | 1.0 |
| (1,1) | 50 | LT | 1.003 | 1.021 | .961 | .945 | .92 | .90 | 1.002 | 1.001 | .950 | .945 | .96 | .77 |
| (1,1) | 50 | III | 0.201 | 0.230 | .072 | .423 | | | 0.218 | 0.240 | .000 | .136 | | |
| (2,2) | 20 | I | 1.989 | 1.951 | .968 | .961 | 1.0 | .90 | 2.017 | 1.960 | .976 | .962 | 1.0 | .95 |
| (2,2) | 20 | II | 2.013 | 2.001 | .972 | .971 | 1.0 | 1.0 | 2.044 | 2.018 | .974 | .968 | 1.0 | 1.0 |
| (2,2) | 20 | LT | 2.024 | 2.031 | .958 | .956 | .68 | .69 | 2.055 | 2.029 | .951 | .954 | .67 | .72 |
| (2,2) | 20 | III | 0.585 | 0.691 | .005 | .016 | | | 0.593 | 0.672 | .007 | .008 | | |
| (2,2) | 35 | I | 1.961 | 1.839 | .963 | .920 | .99 | .82 | 1.938 | 1.810 | .953 | .895 | .94 | .74 |
| (2,2) | 35 | II | 2.042 | 2.018 | .971 | .962 | 1.0 | 1.0 | 2.019 | 1.980 | .972 | .963 | 1.0 | 1.0 |
| (2,2) | 35 | LT | 2.058 | 2.052 | .945 | .947 | .63 | .79 | 2.022 | 1.998 | .950 | .966 | .63 | .79 |
| (2,2) | 35 | III | 0.539 | 0.569 | .006 | .009 | | | 0.538 | 0.556 | .003 | .008 | | |
| (2,2) | 50 | I | 1.833 | 1.674 | .894 | .837 | .90 | .75 | 1.858 | 1.664 | .892 | .826 | .90 | .73 |
| (2,2) | 50 | II | 2.016 | 1.992 | .962 | .955 | 1.0 | 1.0 | 2.045 | 1.987 | .956 | .961 | 1.0 | 1.0 |
| (2,2) | 50 | LT | 2.039 | 2.015 | .948 | .951 | .71 | .81 | 2.055 | 2.005 | .941 | .934 | .70 | .75 |
| (2,2) | 50 | III | 0.435 | 0.422 | .006 | .018 | | | 0.434 | 0.392 | .006 | .017 | | |
When the right-censoring percentage is high (50%), the estimator of β from EE-I can be biased due to the instability of weight function W1 in the denominator. This phenomenon remains when the total sample size increases from 200 to 400. In contrast, the estimators from EE-II are much more robust to various censoring percentages, its biases are small, and its coverage probabilities are close to the nominal level, especially when the sample size increases. The relative ratios for the variance estimators between the estimators from EE-II and from EE-I show a loss of efficiency between 1 and 44% for EE-I. The largest loss of efficiency for EE-I relative to EE-II occurs when there is heavy censoring.
As expected, the estimators obtained from EE-LT have larger variances than the estimators from EE-I and EE-II, especially with small to moderate censoring, because EE-LT ignores part of the information for β contained in the residual likelihood. The relative ratios for the variance estimators between the estimators from EE-LT and EE-II show a loss of efficiency of up to 42%. However, EE-LT has the advantage of not requiring an estimate of the censoring distribution of C, which leads to a more robust estimation procedure for different censoring distributions.
As suggested by a referee, we also performed a small simulation study to assess the bias in the proposed estimators when the stationarity assumption is violated. Because the proposed estimating equation approaches are derived for length-biased data, biases are expected in the estimators obtained from EE-I and EE-II when the data do not satisfy the stationarity assumption (shown in Table 2). For both EE-I and EE-II, the bias and the mean square error increase with the percent of censoring. In contrast, the less efficient EE-LT approach proposed for delayed-entry/left-truncated data does not rely on the stationarity assumption; therefore, it leads to unbiased estimators.
Table 2.
Simulation results without stationarity assumption with sample size 200
| (α1, α2) | C% | EE | α̂1 | α̂2 | 95% CP α̂1 | 95% CP α̂2 | MSE α̂1 | MSE α̂2 |
|---|---|---|---|---|---|---|---|---|
| (1,1) | 20 | I | 0.946 | 0.898 | 0.952 | 0.942 | 1.170 | 1.367 |
| (1,1) | 20 | II | 0.938 | 0.882 | 0.965 | 0.948 | 1.184 | 1.388 |
| (1,1) | 20 | LT | 1.016 | 1.013 | 0.948 | 0.930 | 1.045 | 1.163 |
| (1,1) | 35 | I | 0.894 | 0.839 | 0.884 | 0.869 | 1.305 | 1.587 |
| (1,1) | 35 | II | 0.920 | 0.853 | 0.934 | 0.932 | 1.238 | 1.505 |
| (1,1) | 35 | LT | 1.016 | 1.029 | 0.945 | 0.939 | 1.056 | 1.167 |
| (1,1) | 50 | I | 0.813 | 0.766 | 0.805 | 0.844 | 1.532 | 1.834 |
| (1,1) | 50 | II | 0.917 | 0.853 | 0.901 | 0.926 | 1.272 | 1.565 |
| (1,1) | 50 | LT | 1.009 | 1.002 | 0.943 | 0.942 | 1.090 | 1.279 |
3.2 Example: Dementia Study
The Canadian Study of Health and Aging was a multicenter epidemiologic study that has been described in the literature (Wolfson et al., 2001; Asgharian et al., 2002). In the first phase of the study, a total of 14,026 subjects who were 65 years or older were randomly selected from throughout Canada to receive an invitation to participate in a health survey. A total of 10,263 Canadians agreed to participate. The participants were screened for dementia in 1991. From that cohort, 1,132 participants were identified as having dementia. The dates of disease onset were ascertained from the participants' medical records, and their dates of death or right censoring were collected prospectively during the second phase until the end of the study in 1996. After excluding participants for whom the date of disease onset or the classification of dementia subtype was missing, there were 818 participants left. Their dementias were classified into the following three categories: probable Alzheimer's disease, n=393; possible Alzheimer's disease, n=252; and vascular dementia, n=173. At the end of the study, 638 participants had died, and the others were right censored. The purpose of this study was to assess whether the subtype of dementia at diagnosis had any effect on overall survival.
We first checked the correlation between the type of dementia and the censoring distribution C and found no statistically significant association. The stationarity assumption for the length-biased data was carefully validated by Asgharian, Wolfson, and Zhang (2006). To perform a semiparametric regression analysis, we used the category of probable Alzheimer's disease as the baseline and defined two indicator variables for possible Alzheimer's disease and vascular dementia under the Cox model. The estimated covariate coefficients for the three methods that adjust for length-biased sampling and the naive analysis that does not adjust for length-biased sampling are summarized in Table 3. The estimated standard errors for the covariate coefficients are obtained from the proposed consistent estimators for EE-I and EE-II. It is clear from the results that the subtype of dementia yields little difference in long-term survival under the methods that adjust for length-biased sampling. This finding is consistent with the nonparametric survival estimators provided by Wolfson et al. (2001). Note that the overestimated coefficient obtained from the naive Cox model suggests a marginally significant prolonged survival for the category of possible Alzheimer's disease compared with that for the category of probable Alzheimer's disease.
Table 3.
Estimates (standard errors) of regression coefficients for dementia data under Cox model, using “probable Alzheimer's disease” as baseline; EE-I, EE-II, and EE-LT are the length-bias adjusted analyses
| | EE-I | EE-II | EE-LT | Naive Cox model |
|---|---|---|---|---|
| Vascular dementia | 0.137 (.101) | 0.074 (.101) | 0.087 (.103) | 0.076 (.103) |
| Possible Alzheimer's disease | −0.109 (.093) | −0.134 (.091) | −0.037 (.093) | −0.182 (.093) |
3.3 Computation Algorithms
Estimating covariate effects under the Cox model for right-censored failure time data is easy for the end user with either S-PLUS (R), via the function “coxph,” or SAS. Our aim in this section is to illustrate how to use existing software for traditional right-censored data to analyze length-biased right-censored data under Cox's proportional hazards model. Specifically, we describe slightly modified commands in S-PLUS (R) for the two proposed estimating equations, EE-I and EE-II. For length-biased data, we can use the function “coxph,” but with the “offset” option to add a linear predictor in the Cox model with a known coefficient of one for the weight. For the estimating equation EE-I or EE-II, we use the logarithm of the estimated weight Ŵ1i or Ŵ2i, respectively, as the input for “offset” in “coxph.” To illustrate, using the previously described example, we define vectors XV and XP as indicators of vascular dementia and possible Alzheimer's disease, and then apply
coxph(Surv(futime, rep(1, m)) ~ XV + XP + offset(log(Ŵ1)), data = fdata)
where “futime” is the observed failure times, m is the total number of the observed failure times, “fdata” is the subset of the whole data matrix among subjects with observed failure times only, and Ŵ1 is a vector consisting of the estimated weight of each failure time for EE-I. Note that Ŵ1 is estimated by the Kaplan-Meier estimator of censoring variable C on all Yi − Ai and i = 1, ⋯, m,
Similarly, for EE-II we can use Ŵ2i = {ŵc(Yi)}−1 to give the input for the “offset” term in “coxph.”
The Newton-Raphson method is used to solve the partial likelihood equation for estimating β for conventional right-censored data in both S-PLUS and SAS. For the purpose of comparison, we also used the Newton-Raphson iterative algorithm to solve β from EE-I and EE-II. It is not surprising that the numerical solution obtained from the Newton-Raphson method for EE-I (or EE-II) is the same as the one obtained from the command “coxph” with the “offset” option using log(Ŵ1) (or log(Ŵ2)), because EE-I (or EE-II) is identical to the estimating equation of the ordinary Cox model treating Ŵ1 (or Ŵ2) as a fixed “offset” term and restricting to the failure times. Note that EE-I (8) can also be expressed as
$$ \sum_{i=1}^{n} \delta_i \left[ X_i - \frac{\sum_{j=1}^{n} \delta_j I(Y_j \ge Y_i)\, X_j \exp\{\beta^{\top} X_j + \log W_{1j}\}}{\sum_{j=1}^{n} \delta_j I(Y_j \ge Y_i) \exp\{\beta^{\top} X_j + \log W_{1j}\}} \right] = 0, \qquad W_{1j} = \{Y_j S_C(Y_j - A_j)\}^{-1}, $$
which is the same as the score equation used in the ordinary Cox model with a linear predictor log(W1), restricted to the observed failure times. The censoring distribution for C enters into the estimating equations (EE-I or EE-II) only through the estimated weights Ŵ1 or Ŵ2.
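The equivalence with an offset term can be made concrete with a small Newton-Raphson solver. The Python sketch below is an illustration under simplifying assumptions, not the authors' implementation: it handles a single scalar covariate, takes the observed failure times and their fixed positive weights as given, and uses hypothetical function names and toy data.

```python
import math

def score_and_information(beta, times, x, w):
    """Score and information of a Cox-type equation with offset log(w), failures only."""
    score, info = 0.0, 0.0
    for ti, xi in zip(times, x):
        s0 = s1 = s2 = 0.0
        for tj, xj, wj in zip(times, x, w):
            if tj >= ti:                               # subject j is in the risk set at ti
                e = math.exp(beta * xj + math.log(wj)) # linear predictor plus offset
                s0 += e
                s1 += e * xj
                s2 += e * xj * xj
        mean = s1 / s0                                 # weighted covariate mean in the risk set
        score += xi - mean
        info += s2 / s0 - mean * mean                  # weighted covariate variance
    return score, info

def solve_weighted_score(times, x, w, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations for the root of the weighted score."""
    for _ in range(max_iter):
        score, info = score_and_information(beta, times, x, w)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Toy input: observed failure times, one binary covariate, fixed weights.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
w = [0.5, 1.0, 0.8, 1.2, 0.7, 0.9]
beta_hat = solve_weighted_score(times, x, w)
print("beta_hat:", beta_hat)
```

Multiplying every weight by the same constant leaves the solution unchanged, since the constant cancels between the numerator and the denominator; only the relative sizes of the weights matter for the point estimate.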
4. Discussion
The methodology for estimating covariate effects under a proportional hazards model for classical survival data has been implemented in standard software and is widely used in S-PLUS, SAS, and other statistical packages. An advantage of our proposed inference methods for length-biased right-censored data is that we can use the existing functions under the Cox model in S-PLUS or SAS to carry out the point estimation by providing only an estimated weight. Specifically, we only need to estimate weight W1i from the Kaplan-Meier estimator for residual censoring times SC(t) and W2i, which is an integral of SC(t). We can then use the “offset” option in S-PLUS (R) to incorporate the weights for the regular “coxph” function. The consistent variance-covariance estimator of β̂k can be obtained, for k = 1, 2, as proposed in Section 2.3 for the estimating equations. Alternatively, one may use the bootstrap approach to obtain the corresponding standard errors using the existing functions “coxph” in S-PLUS (R) or “PROC PHREG” in SAS.
For the two proposed estimating equation approaches, the desired asymptotic properties were derived under mild regularity conditions. For both types of estimating equations, we proposed estimators of the cumulative baseline hazard function Λ0(t), which can lead to the prediction of covariate-specific survival. However, establishing the asymptotic properties of these Breslow-type estimators of the cumulative baseline hazard function is not a focus of this work. When the censoring distribution of C depends on the covariate X, we may use covariate-dependent weights based on ŜC(t|Xi), which may be estimated under a semiparametric model, under a parametric model, or fully nonparametrically by the Kaplan-Meier estimator for discretized covariates.
Of the three estimating equation approaches we investigated, we found that EE-II may be the most promising choice in general because it is robust to different censoring distributions and utilizes all of the available information. The estimators obtained from EE-I are comparable to those obtained from EE-II when the censoring percentages are small to moderate, but can be unstable when censoring is heavy. This is because SC(t) → 0 in the tail, so its inverse can be very large there, whereas the integral of SC(t) does not go to zero in the tail. The estimating equation from the left-truncation model has the advantages of not requiring an estimate of the censoring distribution and of working for general left-truncated data, which include length-biased data as a special case. However, the component ignored from the full likelihood causes a loss of information in the estimating equation EE-LT, which can lead to a loss of efficiency of up to 42% compared with EE-II in the scenarios that were investigated. Using the estimating equation of Ghosh (2008) may lead to a severely biased estimator for general right-censored length-biased data, because that method was proposed for length-biased data without follow-up. In this case, the informative censoring mechanism cannot be adequately accounted for. We hope that our results will facilitate applications of the most commonly used semiparametric regression model, Cox's proportional hazards model, in analyzing length-biased failure time data.
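The contrast in tail behavior between the inverse of SC(t) and the integral of SC(t) can be seen in a simple worked example. Suppose, purely for illustration (this is an assumption made here, not a model used in the paper), that the residual censoring time is exponential with rate λ > 0. Then:

```latex
S_C(t) = e^{-\lambda t}, \qquad
\frac{1}{S_C(t)} = e^{\lambda t} \to \infty \quad (t \to \infty), \qquad
\int_0^t S_C(u)\,du = \frac{1 - e^{-\lambda t}}{\lambda} \uparrow \frac{1}{\lambda}.
```

In this example the inverse of SC(t) grows without bound for late failure times, whereas the integral of SC(t) increases to the finite positive limit 1/λ, so its inverse remains bounded as t grows.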
ACKNOWLEDGEMENTS
The authors thank Professor David Zucker, the associate editor, and the referee for their constructive comments. This work was supported in part by grants CA79466 and CA016672 from the National Institutes of Health. We thank Professor Masoud Asgharian and the investigators from the Canadian Study of Health and Aging (CSHA) for providing us with the dementia data from the CSHA. The data reported in the example were collected as part of the CSHA. The CSHA study was funded by the Seniors' Independence Research Program through the National Health Research and Development Program (NHRDP) of Health Canada (Project no.6606-3954-MC(S)). Additional funding was provided by Pfizer Canada Incorporated through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP Project 6603-1417-302(R), Bayer Incorporated, and the British Columbia Health Research Foundation Projects 38 (93-2) and 34 (96-1). The study was coordinated through the University of Ottawa and the Division of Aging and Seniors, Health Canada.
APPENDIX
The regularity conditions for the large sample properties are:
(a) (Ai, Yi, δi, Xi) (i = 1, ⋯ , n) are independent and identically distributed;
(b) P(Ci + Ai ⩾ τ) > 0, where τ is a predetermined constant;
(c) is positive definite for k = 1 or 2, where β0 and Λ0(t) are the true underlying values of β and the baseline hazard function;
(d) 0 < wc(τ) < ∞ and , where SV(t) = pr(Y − A > t).
Consistency and uniqueness of β̂1
Consider a likelihood function
Note that ∂L1(β)/∂β = n^{-1}Ũ1(β) and
By the strong law of large numbers and under the specified regularity conditions, L1(β) converges almost surely to
for any β, and Γ̂1(β0) converges almost surely to Γ1, as n goes to infinity, where
is assumed to be positive definite. Therefore, L1(β) is concave in β, which leads to a unique solution of Ũ1(β) = 0. The consistency of β̂1 also follows.
Weak Convergence of Ũ1(β)
Ũ1(β) can be decomposed approximately into the following two components,
(16)
where
where , ΛC(t) is the hazard function for C, and π(t) = SC(t)SV(t). The second term in (16) is derived using the fact that the Kaplan-Meier estimator can be approximated by a sum of martingale integrals,
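For reference, the standard martingale representation of the Kaplan-Meier estimator alluded to here is commonly written as below (see, e.g., Andersen et al., 1993). The symbols N and M with superscript C are notation assumed for this sketch and may differ from the paper's own display; Λ_C is taken to be the cumulative hazard of C, and π(t) = SC(t)SV(t) as defined above.

```latex
\sqrt{n}\,\{\hat S_C(t) - S_C(t)\}
  = -S_C(t)\, n^{-1/2} \sum_{i=1}^{n} \int_0^t \frac{dM_i^C(u)}{\pi(u)} + o_p(1),
\qquad
M_i^C(t) = I(Y_i - A_i \le t,\ \delta_i = 0)
  - \int_0^t I(Y_i - A_i \ge u)\, d\Lambda_C(u).
```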
More details can be found in Section 1 of the Supplementary Materials. Under the regularity conditions (a)-(d), n^{-1/2}Ũ1(β) converges weakly to a Gaussian process with mean zero and variance-covariance matrix Σ1, where
Consistency and uniqueness of β̂2
Consider a likelihood function
The rest of the derivation is then similar to that for β̂1.
Weak Convergence of Ũ2(β)
Similar to the asymptotic representation for (8), the estimating equation (11), Ũ2(β), can be approximated by a sum of i.i.d. mean-zero stochastic processes,
(17)
where ,
and . The second term in (17) accounts for the uncertainty induced by ŵc(t). The last equality holds because all wc(t) and ŵc(t) cancel out inside E2 and Ê2, and (ŵc(Yk) − wc(Yk)) can be expressed as an i.i.d. sum of martingales (Pepe and Fleming, 1991),
More details can be found in Section 2 of the Supplementary Materials.
Footnotes
Supplementary Materials
Web Appendices referenced in Section 2.3 and the APPENDIX are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.
References
- Andersen PK, Borgan O, Gill RD, Keiding N. Statistical models based on counting processes. Springer-Verlag Inc.; 1993.
- Asgharian M, M'Lan CE, Wolfson DB. Length-biased sampling with right censoring: an unconditional approach. J. Am. Statist. Assoc. 2002;97:201–209.
- Asgharian M, Wolfson DB. Asymptotic behavior of the unconditional NPMLE of the length-biased survivor function from right censored prevalent cohort data. Ann. Statist. 2005;33:2109–2131.
- Asgharian M, Wolfson DB, Zhang X. Checking stationarity of the incidence rate using prevalent cohort survival data. Stat. Med. 2006;25:1751–1767. doi: 10.1002/sim.2326.
- Bergeron P-J, Asgharian M, Wolfson DB. Covariate bias induced by length-biased sampling of failure times. J. Am. Statist. Assoc. 2008;103:737–742.
- Cox DR. Regression models and life-tables (with discussion). J. R. Statist. Soc. B. 1972;34:187–220.
- de Una-Alvarez J, Otero-Giraldez MS, Alvarez-Llorente G. Estimation under length-bias and right-censoring: an application to unemployment duration analysis for married women. J. Applied Statist. 2003;30:283–291.
- Gill RD, Vardi Y, Wellner JA. Large-sample theory of empirical distributions in biased sampling models. Ann. Statist. 1988;16:1069–1112.
- Ghosh D. Proportional hazards regression for cancer studies. Biometrics. 2008;64:141–148. doi: 10.1111/j.1541-0420.2007.00830.x.
- Kalbfleisch JD, Lawless JF. Regression models for right truncated data with applications to AIDS incubation times and reporting lags. Statistica Sinica. 1991;1:19–32.
- Keiding N. Independent delayed entry. In: Klein JP, Goel PK, editors. Survival Analysis: State of the Art. Kluwer Academic Publishers Group; Norwell, Massachusetts: 1992. pp. 309–325.
- Lagakos SW, Barraj LM, De Gruttola V. Nonparametric analysis of truncated survival data, with applications to AIDS. Biometrika. 1988;75:515–523.
- Lancaster T. Econometric methods for the duration of unemployment. Econometrica. 1979;47:939–956.
- Pepe MS, Fleming TR. Weighted Kaplan-Meier statistics: large sample and optimality considerations. J. R. Statist. Soc. B. 1991;53:341–352.
- Shen Y, Ning J, Qin J. Analyzing length-biased data with semiparametric transformation and accelerated failure time models. J. Am. Statist. Assoc. 2009. doi: 10.1198/jasa.2009.tm08614. In press.
- Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Statist. Soc. B. 1976;38:290–295.
- Vardi Y. Nonparametric estimation in the presence of length bias. Ann. Statist. 1982;10:616–620.
- Vardi Y. Empirical distributions in selection bias models. Ann. Statist. 1985;13:178–203.
- Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density: Nonparametric estimation. Biometrika. 1989;76:751–761.
- Wang MC. A semiparametric model for randomly truncated data. J. Am. Statist. Assoc. 1989;84:742–748.
- Wang MC. Hazards regression analysis for length-biased data. Biometrika. 1996;83:343–354.
- Wang MC, Brookmeyer R, Jewell NP. Statistical models for prevalent cohort data. Biometrics. 1993;49:1–11.
- Wolfson C, Wolfson DB, Asgharian M, M'Lan CE, Ostbye T, Rockwood K, Hogan DB, for the Clinical Progression of Dementia Study Group. A reevaluation of the duration of survival after the onset of dementia. New Engl. J. Med. 2001;344:1111–1116. doi: 10.1056/NEJM200104123441501.
- Zelen M. Forward and backward recurrence times and length biased sampling: age specific models. Lifetime Data Anal. 2004;10:325–334. doi: 10.1007/s10985-004-4770-1.
- Zelen M, Feinleib M. On the theory of screening for chronic diseases. Biometrika. 1969;56:601–614.
