Abstract
We obtain a pseudo-partial likelihood for proportional hazards models with biased-sampling data by embedding the biased-sampling data into left-truncated data. The log pseudo-partial likelihood of the biased-sampling data is the expectation of the log partial likelihood of the left-truncated data conditioned on the observed data. In addition, asymptotic properties of the estimator that maximizes the pseudo-partial likelihood are derived. Applications to length-biased data, biased samples with right censoring and proportional hazards models with missing covariates are discussed.
Some key words: EM algorithm, Left truncation, Length-biased data, Missing covariate, Right censoring
1. Introduction
The partial likelihood function of Cox (1975) has been mainly used for proportional hazards models with censored data (Cox 1972). For more complicated incomplete data, no unified method exists to find a partial likelihood for inference on the parameters of the proportional hazards models. Dempster et al. (1977) developed the EM algorithm to obtain maximum likelihood estimators for incomplete data. Originally used for fully parametric models, the EM algorithm was subsequently extended successfully to many nonparametric problems. In survival analysis, there is substantial literature generalizing the EM algorithm to frailty models (Andersen et al. 1993, § 9), missing covariates (Paik & Tsai 1997; Qi et al. 2005) and interval-censored data (Betensky et al. 1999).
In semiparametric models, one usually obtains, through a conditioning argument, an objective function with finitely many parameters of interest to which the EM algorithm can be readily applied. The present paper gives an analogous pseudo-partial likelihood for proportional hazards models with biased-sampling data that can be used without intensive computation.
Under the proportional hazards models for biased-sampling data, the conditional probability density function of an observed nonnegative random variable T, given covariates z(t) and x, can be expressed as
$$h(t \mid x, z) = \frac{W(t, x)\, f(t \mid z)}{\alpha(x, z)}, \tag{1}$$
where W(t, x) is a completely known nonnegative weight function, z(t) = {z1(t), …, zp(t)}^T is a p-dimensional time-dependent covariate, x = (x1, …, xq)^T is a q-dimensional time-independent covariate, f(t ∣ z) denotes a population conditional density function given z(s) for s ⩽ t and α(x, z) is a normalization constant making h(· ∣ x, z) a genuine probability density function. Furthermore, we will assume a proportional hazards model with
$$\lambda(t \mid z) = \frac{f(t \mid z)}{S(t \mid z)} = \lambda_0(t) \exp\{\beta^{T} z(t)\},$$
where
$$S(t \mid z) = \int_t^\infty f(s \mid z)\, ds$$
is the conditional survival function.
Biased-sampling data arise naturally in complex surveys. For example, in large-scale population-based surveys with multi-stage sampling, the complex design results in a set of probability weights for each subject. The weight function W(Ti, xi) represents the probability that the ith observation (Ti, xi, zi) was sampled from the population. Binder (1992) and Lin (2000) have proposed and studied a method for estimating the parameters of proportional hazards models from such survey data. For W(t, x) = t, the data are referred to as length-biased data. Wang (1996) proposed statistical inference for length-biased data based on Cox's model. For W(t, x) = I(x ⩽ t), density (1) becomes a conditional probability density for left-truncated data. This problem has been extensively studied in the literature. Wang et al. (1986) used a classical approach to study the properties of the nonparametric maximum likelihood estimator. Keiding & Gill (1990) used counting process techniques to study the properties of the same estimator.
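To make the length-biased case W(t, x) = t concrete, the following is a minimal sketch, assuming an exponential population with unit rate (an illustrative choice, not one of the datasets below). The helper name `length_biased_exponential` is hypothetical. It relies on the fact that the density proportional to t exp(−t) is a gamma density with shape 2.

```python
import random


def length_biased_exponential(rate, rng):
    """Draw one length-biased observation when the population is Exp(rate).

    The biased density is proportional to t * rate * exp(-rate * t), which is
    a Gamma(2, rate) density, i.e. the sum of two independent Exp(rate) draws.
    """
    return rng.expovariate(rate) + rng.expovariate(rate)


rng = random.Random(2009)
sample = [length_biased_exponential(1.0, rng) for _ in range(20000)]
biased_mean = sum(sample) / len(sample)
# The population mean is 1, whereas the length-biased sample mean is near 2,
# which is why ignoring the sampling weight distorts inference.
```

The gap between the two means illustrates the size of the distortion that the weight function in (1) is meant to correct.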
The following four real datasets illustrate different types of biased-sampling data.
Example 1
Shrub data. Muttlak & McDonald (1990) presented widths of 46 shrubs. Wang (1996) assumed that the probability of observing a shrub is proportional to the shrub's width, so that the sampling is length-biased. Wang (1996) analyzed the data with a proportional hazards model.
Example 2
Channing House data. Channing House is a retirement centre in Palo Alto, California. Hyde (1980) reported ages at entry and at death of 462 retirees, 365 females and 97 males, who were in residence between January 1964 and July 1975. The individuals who left Channing House or were still in the centre at the end of the study were censored. The data can be viewed as left-truncated with right censoring since the individual's death age must be greater than the entry age. The entry age serves as the left-truncation time.
Example 3
Stanford heart transplant data. Crowley & Hu (1977) gave information on 103 potential heart transplant recipients who were enrolled in the Stanford heart transplant programme from October 1967 to April 1974. The data include age, waiting time to transplantation, survival or censoring time from acceptance to the programme, and three mismatch scores. Among the 103 potential heart transplant recipients, there were 69 patients who underwent the heart transplant operation. Later, Miller & Halpern (1982) updated the data by reporting the survival or censoring times and ages of 184 patients who were enrolled in the same programme and had received heart transplants from October 1967 to February 1980. If we are only interested in analyzing the transplant patients, then the data of Crowley & Hu are left-truncated and right-censored data, with transplant waiting time as a random left-truncation variable. However, because Miller & Halpern did not report the transplant waiting times, their data can be viewed as biased-sampling data with right censoring. The weight function is the distribution of the transplant waiting time random variable.
Example 4
Mouse leukaemia data. Kalbfleisch & Prentice (2002) reported the survival of 204 mice. The mice were followed up for two years for mortality due to thymic or nonthymic leukaemia. The two covariates of interest were the GPD1 phenotype and the level of endogenous murine leukaemia virus. There were 175 mice whose levels of endogenous murine leukaemia virus were recorded. The GPD1 phenotype was determined only on a subgroup of the 100 mice that survived 400 days; thus, the probability of missing covariates clearly depends on the follow-up time. The complete-case analysis, which uses only the mice with complete information, is clearly biased since the selection probability depends on the survival time outcome variable. In fact, the complete cases comprise biased-sampling data with the weight function equal to the probability of selecting complete cases.
2. Partial likelihood
2.1. General approach
Let χ be a sample space, and let x ∈ χ be a realization of the random vector X with density fX(x;ϕ) depending on a vector parameter ϕ = (β, η), in which β is of interest and η is a nuisance parameter. In some applications, the dimension of η may increase with the sample size and the application of maximum likelihood estimation may lead to spurious results. However, suppose that x = (c1, x1, …, cn, xn) and that the full likelihood factorizes into
$$f_X(x; \phi) = \prod_{i=1}^{n} f_{\phi}(c_i \mid d_i) \prod_{i=1}^{n} f_{\beta}(x_i \mid e_i), \tag{2}$$
where di = (c1, x1, …, ci−1, xi−1) and ei = (c1, x1, …, ci−1, xi−1, ci). The second product on the right-hand side of (2) is the partial likelihood of β based on (x1, …, xn) in the sequence (ci, xi)(i = 1, …, n). Cox (1975) argued that inference based only on the partial likelihood would be acceptable if the information about β, contained in the first factor, was small.
A complication that sometimes occurs is that one observes a function y = y(x), instead of observing x ∈ χ. Therefore, inferences about β must be based on y. The following algorithm is a simple generalization of the EM algorithm for partial likelihood. The EM algorithm applied to gamma frailty models discussed in Andersen et al. (1993, § 9) is a special case of this generalization.
For incomplete data, the EM algorithm finds maximum likelihood estimates for β and η through the iterative maximization of
$$E\{\log f_X(x; \beta, \eta) \mid y; \beta^{(c)}, \eta^{(c)}\},$$
where β(c) and η(c) denote the current estimates. Therefore,
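$$l_p(\beta, \eta, y) = \sum_{i=1}^{n} E_{\beta, \eta}\{\log f_{\beta}(x_i \mid e_i) \mid y\}, \qquad U_p(\beta, \eta, y) = \sum_{i=1}^{n} E_{\beta, \eta}\{U_i(\beta) \mid y\}$$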
are, respectively, a log pseudo-partial likelihood function and a pseudo-partial score function of β for the observed data y, where Ui(β) = ∂ log fβ(xi ∣ ei)/∂ β is the score function of the partial likelihood for complete data. Unfortunately, Up still involves the nuisance parameter η in many applications. For example, in frailty models Up is also a function of the frailty parameters and, therefore, cannot be used directly. However, Up is a function of β alone in some applications. In other situations, the maximization of lp(β, η, y) with respect to β and η has a simple solution. In § 3, we show two pseudo-partial likelihood functions for Cox's models with biased-sampling data in these two situations.
2.2. Partial likelihood for left-truncated and right-censored data
Let the lifetime T0i have distribution function Fi, and let the truncation time and censoring time (Vi, Ci) have joint distribution function Gi and joint probability density function gi. It is assumed that T0i and (Vi, Ci) are mutually independent. Moreover, we assume that there is a positive probability that T0i ⩾ Vi and Ci ⩾ Vi. We do not sample from the joint distribution, but from the conditional distribution given the event { T0 ⩾ V, C ⩾ V}. Let (Vi, T0i, Ci) (i = 1, …, n) be a sample of n independent triples from this conditional distribution. Then, our left-truncated and right-censored sample is (V1, T1, D1), …, (Vn, Tn, Dn), where Ti = min (T0i, Ci), Di = I(Ti = T0i) and I(·) is an indicator function. Note that Ti ⩾ Vi(i = 1, …, n). We let Ni(t) = I(Ti ⩽ t, Di = 1) and Yvi (t) = I(Vi ⩽ t)Yi(t) be, respectively, the indicator of whether or not the ith individual failed before time t and the indicator of whether or not the ith individual is at risk just before time t, where Yi(t) = I(Ti ⩾ t). Furthermore, for given time-dependent covariates zi(t) = { z1i(t), z2i(t), …, zpi(t)}T, we assume that T0i follows a proportional hazards model, i.e.
$$\lambda(t \mid z_i) = \lambda_0(t) \exp\{\beta^{T} z_i(t)\},$$
where λ(t ∣ zi) is the conditional hazard function of T0i given zi(t), β is a p × 1 vector of unknown regression coefficients and λ0(t) is the underlying or baseline hazard function. For convenience of notation, if it is unambiguous, the dependence of zi on t will be suppressed. As shown by Andersen et al. (1993, §§ 3.3 and 3.4), (N1, …, Nn) is a multivariate counting process that has an intensity process (λ(t ∣ z1)Yv1(t), …, λ(t ∣ zn)Yvn(t)) with respect to the filtration defined in Andersen et al. (1993, p. 153). Let
$$\Lambda_0(t) = \int_0^t \lambda_0(s)\, ds$$
be the cumulative underlying hazard function. The log partial likelihood, which is the conditional likelihood given V, can be written as
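$$l_{c1}(\beta, \Lambda_0) = \sum_{i=1}^{n} \int_0^\infty \left[\beta^{T} z_i(t) + \log\{\Delta\Lambda_0(t)\}\right] dN_i(t) - n \int_0^\infty S^{(0)}(\beta, t)\, d\Lambda_0(t); \tag{3}$$
see equations (3.3.3) and (7.2.2) of Andersen et al. (1993), where
$$S^{(0)}(\beta, t) = n^{-1} \sum_{i=1}^{n} Y_{vi}(t) \exp\{\beta^{T} z_i(t)\}$$
and ΔH(t) = H(t+) − H(t−) for any function H(t). The partial derivative of (3) with respect to ΔΛ0(t) is {ΔN(t)/ΔΛ0(t)} − nS(0)(β, t), where
$$N(t) = \sum_{i=1}^{n} N_i(t)$$
is the number of observed failures up to time t. Therefore, for a fixed value of β, we would estimate Λ0(t) by the Nelson–Aalen estimator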
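$$\hat\Lambda_0(t; \beta) = \int_0^t \frac{dN(s)}{n S^{(0)}(\beta, s)}. \tag{4}$$
Inserting (4) into (3), we obtain the log profile partial likelihood, which equals lc2(β) up to terms that do not depend on β. Here,
$$l_{c2}(\beta) = \sum_{i=1}^{n} \int_0^\infty \left[\beta^{T} z_i(t) - \log\{S^{(0)}(\beta, t)\}\right] dN_i(t) \tag{5}$$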
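is the generalized log partial likelihood, originally derived by Cox (1972, 1975) for the case of censored survival data. Thus, the score function of the generalized Cox partial likelihood is
$$\frac{\partial l_{c2}(\beta)}{\partial \beta} = \sum_{i=1}^{n} \int_0^\infty \left\{ z_i(t) - \frac{S^{(1)}(\beta, t)}{S^{(0)}(\beta, t)} \right\} dN_i(t),$$
where S(1)(β, t) = ∂S(0)(β, t)/∂β.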
We will treat (3) and (5), denoted by lc1 and lc2 respectively, as our working log partial likelihoods for the complete data. Two log pseudo-partial likelihoods for biased-sampling data, based respectively on lc1 and lc2, will be derived in the next section. The notation Ni(t), N(t) and Yi(t) will be used throughout with obvious adjustments for data with no censoring and/or no truncation. A parameter with subscript zero will denote the true parameter.
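As an illustration of the Nelson–Aalen-type estimator for a fixed β under left truncation, the sketch below computes the jumps ΔN(t)/Σj Yvj(t) exp(βzj) for a scalar, time-independent covariate. The function name `breslow_increments` and the toy data are hypothetical; the only modelling ingredient taken from the text is the risk indicator Yvi(t) = I(Vi ⩽ t)I(Ti ⩾ t).

```python
import math


def breslow_increments(entry, time, event, z, beta):
    """Return [(t, jump)] of the cumulative baseline hazard estimate.

    A subject is at risk at t only when entry <= t <= time, which is the
    left-truncation risk indicator I(V <= t) * I(T >= t).
    """
    jumps = []
    for t in sorted({s for s, d in zip(time, event) if d}):
        failures = sum(1 for s, d in zip(time, event) if d and s == t)
        denom = sum(math.exp(beta * zj)
                    for v, s, zj in zip(entry, time, z) if v <= t <= s)
        jumps.append((t, failures / denom))
    return jumps


# Toy data: the second subject enters at 2.5, so it is not at risk at t = 2.
entry = [0.0, 2.5, 0.0]
time = [2.0, 3.0, 4.0]
event = [1, 1, 0]
z = [0.0, 0.0, 0.0]
increments = breslow_increments(entry, time, event, z, beta=0.0)
```

With these toy values the risk set has two members at each failure time, so each jump equals one half; ignoring the entry times would wrongly place three subjects at risk at t = 2.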
3. Data from biased sampling
3.1. Embedding the data into the left-truncation model
First we assume that W(·, x) is a distribution function for every fixed x; this assumption will be relaxed later. Let V and T0 be nonnegative random variables with conditional distribution functions pr (V < t ∣ x) = W(t, x) and pr { T0 < t ∣ z(·)} = 1 − S(t ∣ z), respectively. We assume that, conditional on x and z, V and T0 are independent. We observe (V, x, T0, z) only if T0 ⩾ V. Therefore, the conditional density of observing (V, T0) given (x, z) is proportional to I(t ⩾ v)w(v, x)f(t ∣ z), where w(t, x) = ∂ W(t, x)/∂ t. Hence, the marginal density of observed T, given z and x, is proportional to
$$\int_0^t w(v, x)\, f(t \mid z)\, dv = W(t, x)\, f(t \mid z),$$
which is proportional to the conditional probability density function given in (1). Consequently, we may treat (V, x, T0, z) as the complete data vector and (x, T0, z) as the incomplete data with the truncation time V completely missing.
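The embedding can be checked numerically. The following is a small simulation sketch under assumed distributions that are not taken from the paper: V is uniform on (0, 1), so that W(t) = t there, and T0 is uniform on (0, 1), so that f(t) = 1. Keeping only the pairs with T0 ⩾ V should leave survival times with density proportional to W(t)f(t) = t, whose mean is 2/3.

```python
import random

rng = random.Random(1)
accepted = []
while len(accepted) < 20000:
    v = rng.random()   # truncation time V with distribution function W(t) = t
    t0 = rng.random()  # population survival time T0 with density f(t) = 1
    if t0 >= v:        # the pair is observed only when T0 >= V
        accepted.append(t0)

# The retained T0 values follow the density 2t on (0, 1), with mean 2/3,
# i.e. the biased-sampling density (1) with weight W(t) = t.
embedded_mean = sum(accepted) / len(accepted)
```

Discarding V after the acceptance step reproduces exactly the incomplete data (x, T0, z) described above.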
3.2. Pseudo-partial likelihoods
According to the argument in § 2.1 and from the Cox log partial likelihood lc2 of equation (5) in § 2.2, the following function, which is the conditional expectation of lc2 given the observed data, can be considered as a log pseudo-partial likelihood for the observed data (xi, T0i, zi) (i = 1, …, n):
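$$E\{l_{c2}(\beta) \mid x_j, T_{0j}, z_j,\ j = 1, \ldots, n\} = \sum_{i=1}^{n} \int_0^\infty \beta^{T} z_i(t)\, dN_i(t) - \sum_{i=1}^{n} \int_0^\infty E\left[\log\{S^{(0)}(\beta, t)\} \mid x_j, T_{0j}, z_j,\ j = 1, \ldots, n\right] dN_i(t). \tag{6}$$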
If we condition on T0 = t, X = x, then the random variable V has conditional distribution W{min (v, t), x}/ W(t, x). Therefore, the second term of (6) is a function of (T01, x1, z1, …, T0n, xn, zn) and β, which does not involve the nuisance parameter λ0(t). We may, therefore, use the Monte Carlo method to compute it. For example, let Vjk (j = 1, …, n, k = 1, …, m) be random samples from the distribution W{min (v, T0j), xj}/ W(T0j, xj). Then the second term of (6) can be approximated by
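$$-\sum_{i=1}^{n} \int_0^\infty \frac{1}{m} \sum_{k=1}^{m} \log\left[ n^{-1} \sum_{j=1}^{n} I(V_{jk} \leqslant t)\, Y_j(t) \exp\{\beta^{T} z_j(t)\} \right] dN_i(t). \tag{7}$$
We substitute (7) into the second term of (6) and obtain an approximate loglikelihood
$$\sum_{i=1}^{n} \int_0^\infty \beta^{T} z_i(t)\, dN_i(t) - \sum_{i=1}^{n} \int_0^\infty \frac{1}{m} \sum_{k=1}^{m} \log\left[ n^{-1} \sum_{j=1}^{n} I(V_{jk} \leqslant t)\, Y_j(t) \exp\{\beta^{T} z_j(t)\} \right] dN_i(t). \tag{8}$$
In particular, for length-biased data, i.e. W(t, x) = t, the maximum approximate loglikelihood estimator based on (8) is asymptotically equivalent to the improved estimator proposed by Wang (1996). In addition, for m = 1, the approximate loglikelihood (8) is identical to the log-pseudolikelihood described by Wang (1996).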
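The Monte Carlo step requires draws of the missing truncation time from the conditional distribution W{min(v, t), x}/W(t, x). A minimal sketch of one way to do this is inverse-transform sampling by bisection, assuming a weight function that is continuous and nondecreasing in t with W(0) = 0 and that does not depend on x. The helper `sample_truncation_time` is hypothetical and is not the author's implementation.

```python
import random


def sample_truncation_time(t, weight, rng, tol=1e-10):
    """Draw V from the distribution function W(v) / W(t) on [0, t].

    Inverse-transform sampling: solve W(v) = u * W(t) for v by bisection.
    Assumes weight is continuous, nondecreasing and weight(0) == 0.
    """
    target = rng.random() * weight(t)
    lo, hi = 0.0, t
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weight(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(7)
# Length-biased case W(v) = v: given T0 = 3, V is uniform on (0, 3).
draws = [sample_truncation_time(3.0, lambda v: v, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)
```

For W(v) = v the bisection is unnecessary, since V can be drawn directly as a uniform variate times T0; the general routine is shown only because the text allows arbitrary nondecreasing weights.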
The disadvantages of using lc2 as the working log partial likelihood for complete data are as follows: we must assume that W(t, x) is a nondecreasing function in t for every fixed x; the underlying cumulative hazard function must be estimated by other methods; intensive computation is required to obtain the log pseudo-partial likelihood (6); and the loglikelihood lc2 contains less information about the parameters than the loglikelihood lc1. Therefore, we may take lc1 as our working log partial likelihood and apply the same procedure to (3). The resulting log pseudo-partial likelihood can be written as
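$$E\{l_{c1}(\beta, \Lambda_0) \mid x_j, T_{0j}, z_j,\ j = 1, \ldots, n\} = \sum_{i=1}^{n} \int_0^\infty \left[\beta^{T} z_i(t) + \log\{\Delta\Lambda_0(t)\}\right] dN_i(t) - n \int_0^\infty S_W^{(0)}(\beta, t)\, d\Lambda_0(t), \tag{9}$$
where
$$S_W^{(0)}(\beta, t) = n^{-1} \sum_{i=1}^{n} Y_i(t)\, \frac{W(t, x_i)}{W(T_{0i}, x_i)} \exp\{\beta^{T} z_i(t)\}.$$
For a fixed value of β, maximization of (9) with respect to ΔΛ0(t) leads to
$$\Delta\Lambda_0(t) = \frac{\Delta N(t)}{n S_W^{(0)}(\beta, t)}.$$
Therefore, for a fixed value of β, we would estimate Λ0(t) by the Nelson–Aalen estimator
$$\hat\Lambda_0(t; \beta) = \int_0^t \frac{dN(s)}{n S_W^{(0)}(\beta, s)}. \tag{10}$$
Inserting (10) into (9), we obtain the following log profile pseudo-partial likelihood depending only on β:
$$l(\beta) = \sum_{i=1}^{n} \int_0^\infty \left[\beta^{T} z_i(t) - \log\{S_W^{(0)}(\beta, t)\}\right] dN_i(t), \tag{11}$$
which is also a generalized Cox log partial likelihood. The vector of score statistics is given as
$$U(\beta) = \sum_{i=1}^{n} \int_0^\infty \left\{ z_i(t) - \frac{S_W^{(1)}(\beta, t)}{S_W^{(0)}(\beta, t)} \right\} dN_i(t), \tag{12}$$
where
$$S_W^{(1)}(\beta, t) = \frac{\partial S_W^{(0)}(\beta, t)}{\partial \beta}.$$
We shall base our estimator of β on (12), and the value of β which maximizes (11) will be denoted by $\hat\beta$. As pr{W(T0i, xi) = 0} = 0, W(t, xi)/W(T0i, xi) is well defined.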
The log pseudo-partial likelihood (11) is identical to the usual log partial likelihood of the models λ(t ∣ zi, z*i) = λ0(t) exp{β^T zi + z*i(t)}, where z*i(t) = log{W(t, xi)/W(T0i, xi)}. In particular, when W(t, x) = W(t), the factor W(t) is common to every subject at risk at time t and cancels, so z*i(t) can be simplified to −log{W(T0i)}. Standard statistical software, such as SAS, can be used to obtain the estimate of β by setting the regression coefficient of z*i to be 1, using the option offset = z* in procedure PHREG. The robust sandwich covariance matrix estimator from SAS is consistent whereas the model-based covariance matrix is inconsistent because the score (12) is not a martingale. For computing estimates of parameters and variances for general weight functions, readers can use the author's R subroutine downloadable from www.columbia.edu/~wt5/.
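The offset formulation can also be coded directly. The sketch below is an illustration, not the author's R subroutine: it assumes a scalar time-independent covariate and a weight function W(t) free of x, solves the score equation by Newton–Raphson, and is tried on simulated length-biased data with a true log hazard ratio of 1. The names `score_and_information` and `fit_pseudo_partial` and the simulation settings are hypothetical.

```python
import math
import random


def score_and_information(beta, times, events, z, weight):
    """Score and observed information of the log profile pseudo-partial
    likelihood for a scalar covariate z and a weight function W(t)."""
    score, info = 0.0, 0.0
    for t_i, d_i, z_i in zip(times, events, z):
        if not d_i:
            continue  # censored observations contribute no event term
        s0 = s1 = s2 = 0.0
        for t_j, z_j in zip(times, z):
            if t_j >= t_i:  # subject j is still under observation at t_i
                # W(t_i) / W(T_j) replaces the unobserved I(V_j <= t_i)
                r = weight(t_i) / weight(t_j) * math.exp(beta * z_j)
                s0 += r
                s1 += r * z_j
                s2 += r * z_j * z_j
        mean = s1 / s0
        score += z_i - mean
        info += s2 / s0 - mean * mean
    return score, info


def fit_pseudo_partial(times, events, z, weight, beta=0.0, tol=1e-8, max_iter=50):
    """Newton-Raphson iterations on the scalar score equation."""
    for _ in range(max_iter):
        score, info = score_and_information(beta, times, events, z, weight)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Simulated length-biased data: exponential populations with hazards 1 and
# exp(true_beta); a length-biased exponential draw is a sum of two draws.
rng = random.Random(42)
true_beta = 1.0
times, events, z = [], [], []
for group in (0, 1):
    rate = math.exp(true_beta * group)
    for _ in range(150):
        times.append(rng.expovariate(rate) + rng.expovariate(rate))
        events.append(1)
        z.append(float(group))

beta_hat = fit_pseudo_partial(times, events, z, weight=lambda t: t)
```

The double loop is quadratic in the sample size, which is adequate for a few hundred observations; a production implementation would sort the data once and accumulate the risk-set sums. Standard errors are deliberately omitted, since the text notes that only the robust sandwich form is consistent.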
3.3. Censoring
We may assume that the data are subject not only to biased sampling, but also to right censoring. Let Vi, T0i and Ci be, respectively, truncation, survival time and censoring time. Recall that when we define the left-truncated and right-censored data, we assume that (Vi, Ci) and T0i are mutually independent. Hence, the joint probability density function of (T0i, Vi, Ci) can be expressed as fi(t)gi(v, c), where fi is the probability density function of T0i and gi is the joint probability density function of (Vi, Ci). We only identify two types of censoring mechanism based on different censoring and truncation mechanisms and assumptions about gi. However, there are possible applications to other types of censoring.
The first type of censoring assumes that the given covariates (xi, zi), original truncated time, survival time and censoring time are mutually independent. However, we observe the data (Vi, Ti, Di) only if Vi ⩽ Ti, where Ti = min (T0i, Ci) and Di = I(T0i ⩽ Ci). The biased censored data (Ti, Di) are obtained by applying the censoring mechanism to the data before the data are sampled with bias. In the embedding, we first apply the censoring mechanism to the survival time and then apply the truncation mechanism to the observed censored data. This type of censoring is equivalent to assuming that
where g2i(t) is the probability density function of the censoring time Ci. We may also say that Vi and Ci are quasi-independent in the region {(v, c) ∣ v ⩽ c}. The observed data (Vi, Ti, Di) comprise a special case of the standard left-truncated and right-censored data as defined in § 2.2. Hence, in the first type of censoring, the conditional probability density function of the observed data (Vi, Ti, Di), given (xi, zi) = (x, z), is
![]() |
where
is the survival function of the censoring time. Therefore,
. The procedures proposed in §§ 3.2 and 3.3 are still valid because E{ I(Vi ⩽ t)Yi(t) ∣ Ti, Di, xi, zi} still equals Yi(t)W(t, xi)/ W(Ti, xi). The formulae of the log pseudo-partial likelihood will be the same for this type of censoring with T0i replaced by Ti and Ni(t) defined by Ni(t) = I(Ti ⩽ t, Di = 1).
The second type of censoring comprises the censoring of residual lifetime after the data are sampled with bias. Let R0i = T0i − Vi and Rci = Ci − Vi be, respectively, the residual lifetime and the residual censoring time of the ith individual. Given covariates (xi, zi) and Ci ⩾ Vi, we assume that Rci and (R0i, Vi) are independent. We observe (Vi, Ti, Di) only if Vi ⩽ Ti, where Ti = Vi + Ri, Ri = min (R0i, Rci) and Di = I(R0i ⩽ Rci). The censoring time Ci and truncation time Vi are not independent in this type of censoring. Let g3i(t) be the probability density function of the residual censoring time Rci. The second type of censoring is equivalent to assuming that
where g3i(t) is the probability density function of the residual censoring time Rci. The conditional density function of the observed data (Vi, Ti, Di) given (xi, zi) = (x, z) is
![]() |
where
is the survival function of the residual censoring time. Hence,
![]() |
![]() |
Unfortunately, under the second type of censoring, the conditional expectation E{ I(Vi ⩽ t)Yi(t) ∣ data} is a function of the censoring distribution. If the truncation time V is observable, then we may use a Kaplan–Meier-type estimator of
. Consider the one-sample problem such that β = 0
and w(v, x) = w(v). The nonparametric maximum likelihood estimator of
is the Kaplan–Meier estimator based on the censored residual lifetime (Ri, Di); that is
![]() |
Hence, the maximum pseudo-partial likelihood estimator of S is
where
![]() |
The estimator
is not the nonparametric maximum conditional likelihood estimator proposed and studied by Tsai et al. (1987), nor is it the nonparametric maximum likelihood estimator. However, if there is no censoring,
is the nonparametric maximum likelihood estimator. It is straightforward to prove that
will converge to
and
will converge to
. As a result, under some regularity conditions,
will converge to the survival function S(t) of the survival time.
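For readers who want to experiment with the Kaplan–Meier step, here is a generic product-limit sketch applied to pairs of durations and indicators, such as the censored residual lifetimes (Ri, Di). It is a textbook implementation under the usual convention that failures precede censorings at tied times; the function name `kaplan_meier` and the toy data are hypothetical, and which indicator plays the role of the event depends on whose distribution is being estimated.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit form)."""
    surv, curve = 1.0, []
    for t in sorted({r for r, d in zip(durations, events) if d}):
        at_risk = sum(1 for r in durations if r >= t)
        failures = sum(1 for r, d in zip(durations, events) if d and r == t)
        surv *= 1.0 - failures / at_risk
        curve.append((t, surv))
    return curve


# Toy residual lifetimes: the second one is censored at 2.
curve = kaplan_meier([1.0, 2.0, 3.0], [1, 0, 1])
```

In the toy example the estimate drops to 2/3 at time 1 and to zero at time 3, the censored value at 2 only shrinking the risk set.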
If the censoring time depends on the covariates, we need to use a smooth type of Kaplan–Meier estimator of
. If the truncation time Vi cannot be observed, we may use the nonparametric maximum likelihood estimator of
. More research is needed in order to understand the properties of the proposed method.
The Stanford heart transplant dataset of Miller & Halpern (1982) is a prospective cohort study. The event time of interest is the survival time after entry. The censoring time is the duration between calendar entry date of the patients and February 1980. If we assume that there is no loss to follow-up, then C = February 1980 − E; and V = transplant calendar date − E = transplant waiting time, where C, V and E, respectively, are the censoring time, truncation time and calendar entry date of the patient. If V, the transplant waiting time, is independent of E, the calendar entry date, then C and V are independent for all patients in the cohort. However, the transplant waiting times of the patients who died or were censored before transplantation cannot be observed. Therefore, the censoring time C is quasi-independent of the truncation time V for the transplant patients in the Stanford heart transplant data. Consequently, the censoring is of the first type. The Channing House dataset is a retrospective cohort study. The event time of interest is the death age. The censoring time C for patients who did not leave the centre before July 1975 is the time between the birth date B and July 1975. The truncation time, i.e. entry age, V is E − B, where E is the calendar entry date. The residual censoring time Rc is C − V = July 1975 − E. If entry age is independent of calendar entry date, i.e. V and E are independent, then the residual censoring time Rc is independent of the truncation time V. This censoring is of the second type.
Most applications can be classified as either the first or second type of censoring. For example, the censoring mechanism of the proportional hazards model with missing covariates, see § 4.3, and the Stanford heart transplant data are of the first type; the censoring mechanisms of the renewal process (Vardi 1989), cross-sectional survival data (Wang 1991) and the Channing House data are of the second type.
The asymptotic properties of the maximum pseudo-partial likelihood estimators for censored biased-sampling data with general nonnegative weight functions are established and discussed in the Appendix. We assume that the censoring is of the first type for the rest of the paper.
4. Application
4.1. Length-biased data
The techniques developed in the previous sections can be applied to length-biased data by using W(t) = t. We illustrate the method with a simulation study and an analysis of the shrub dataset. Consider a two-sample proportional hazards model with covariate z = 0 representing Group 0 and z = 1 representing Group 1. We generate 100 length-biased samples from Group 0 with population density f0(t) = t exp(−t)I(t > 0) and 100 length-biased samples from Group 1 with population density f1(t) = t exp(β − e^β t)I(t > 0) for β = 0, 1, 2 in three different scenarios. The relative hazard between Group 0 and Group 1 is equal to e^β and, thus, the log hazard ratio is β = 0, 1, 2. We also calculate the estimator obtained by maximizing the approximate likelihood of § 3.2, which was also proposed by Wang (1996). The variance estimator proposed by Wang (1996) is identical to the estimator from SAS. We use equation (A1) in Theorem A2 provided in the Appendix to obtain the variance estimator for the maximum pseudo-partial likelihood estimator. For comparison we also include the inverse probability weighted estimator of Binder (1992) and Lin (2000); see also Horvitz & Thompson (1952) and Qi et al. (2005). For the definition of the inverse probability weighted estimator, see equation (13) in § 4.3. Table 1, based on 1000 replicates, shows that the maximum pseudo-partial likelihood estimator is more efficient than the other two estimators and that the variance estimator of the inverse probability weighted estimator underestimates its true sample variance.
Table 1.
Monte Carlo simulation for length-biased data. One hundred length-biased observations were generated from Group 0 with population density f0(t) = t exp(−t)I(t > 0) and 100 length-biased observations were generated from Group 1 with population density f1(t) = t exp(β − e^β t)I(t > 0) for β = 0, 1, 2. Within each group of three columns, PPL is the estimator that maximizes the loglikelihood l(β), Wang is the estimator that maximizes the approximate loglikelihood also proposed by Wang (1996), and IPW is the estimator proposed by Binder (1992) and Lin (2000). Estimates are based on 1000 replications
| β | Bias: PPL | Bias: Wang | Bias: IPW | Sample variance: PPL | Sample variance: Wang | Sample variance: IPW | Mean of estimated variance: PPL | Mean of estimated variance: Wang | Mean of estimated variance: IPW |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002 | 0.003 | −0.001 | 0.010 | 0.020 | 0.049 | 0.010 | 0.020 | 0.036 |
| 1 | 0.009 | 0.012 | 0.014 | 0.018 | 0.030 | 0.070 | 0.017 | 0.030 | 0.050 |
| 2 | 0.032 | 0.033 | 0.060 | 0.058 | 0.080 | 0.139 | 0.053 | 0.075 | 0.103 |
We denote the width of an observed shrub from the shrub dataset by Ti. Wang (1996) assumed that the probability of including Ti in the dataset is proportional to Ti itself. We use the proportional hazards model,
$$\lambda(t \mid z_1, z_2) = \lambda_0(t) \exp(\beta_1 z_1 + \beta_2 z_2),$$
where z1 = I(T belongs to transect I) and z2 = I(T belongs to transect II) are two indicator covariates. Use of SAS with offset − log(Ti) provides
with model-based standard error estimates {se
, se
and corr(
. The estimates are very similar to the results of Wang (1996), but the model-based standard errors overestimate the true standard errors of
. Use of equation (A1) in the Appendix, which is identical to the robust sandwich covariance estimate from sas, gives {se
, se
and corr(
. For comparison
with estimated standard errors (0.33, 0.31).
4.2. Biased samples with right censoring
Miller & Halpern (1982) compared four regression techniques on the updated Stanford heart transplant data without acknowledging that the transplant patient's survival time was sampled with bias. As mentioned in § 1, the survival times can be treated as a biased sample with a weight function equal to the distribution of the transplant waiting time. Miller & Halpern (1982) did not provide the transplant waiting times but Crowley & Hu (1977) did. The Weibull distribution fits the transplant waiting times very well. The R² of the fit of log[−log{Ŝ(t)}] to log(t) is 0.97, where Ŝ(t) is the product-limit estimate of the transplant waiting time survival function based on Crowley & Hu's (1977) 103 patients. The conditional maximum likelihood estimate for the Weibull survival function of transplant waiting time is exp(−0.027 t^0.925). Hence, the weight function, W(t) = 1 − exp(−0.027 t^0.925), will be used to obtain the parameters of the proportional hazards model. We assume that the hazard rate of the transplant patient's survival is proportional to exp
. Miller & Halpern (1982) deleted 27 patients lacking the T5 mismatch score and 5 patients with survival times less than 10 days from the total of 184 patients in one of their data analyses. Based on 152 Stanford heart transplant patients, the pseudo-partial likelihood estimate
is (−0.13, 0.0021) with {se
, se
and
= (−0.17, 0.0026) with se
. In calculating the variances of the estimates, we assume that the weight function is known without error. Both methods show a strong relationship between survival time and age.
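The fitted Weibull weight function and the corresponding offset −log{W(Ti)} of § 3.2 are simple to evaluate. The sketch below uses the constants 0.027 and 0.925 quoted in the text; the function names are hypothetical, and the time unit is whatever unit the transplant waiting times were recorded in, which this section does not restate.

```python
import math


def weibull_weight(t, scale=0.027, shape=0.925):
    """W(t) = 1 - exp(-scale * t**shape), the fitted waiting-time distribution."""
    return 1.0 - math.exp(-scale * t ** shape)


def offset(t_obs):
    """Offset -log W(T_i) used when the weight function does not depend on x."""
    return -math.log(weibull_weight(t_obs))


# Short survival times receive small weights and therefore large offsets.
short_offset = offset(10.0)
long_offset = offset(1000.0)
```

Because W(t) increases towards 1, the offset shrinks towards zero for long survival times, so the correction matters most for patients who died soon after acceptance.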
4.3. Missing covariates
It is assumed that, for each i, (Ti, Di, zi, Ri) are independent and identically distributed random vectors, where Ri = 1 if zi is fully observed and is zero otherwise. Let W(t, z) = pr(R = 1 ∣ T = t, Z = z) be the conditional probability of observing the full covariates data given the covariates z and follow-up time T = t. We assume that W(t, z) is either completely known or can be estimated by other methods; see Qi et al. (2005) and an unpublished Harvard School of Public Health technical report by M. Pugh, J. Robins, S. Lipsitz and D. Harrington. Copas & Farewell (2001), Qi et al. (2005) and Pugh et al.'s report proposed the following weighted complete-case pseudolikelihood score function for inference of the proportional hazards models with missing covariates:
![]() |
(13) |
where
and Wi = W{ Ti, zi(Ti)}. The inverse probability weighted estimator
is the solution of UIPW(β) = 0. This pseudo-score UIPW was also proposed and studied by Binder (1992) and Lin (2000) in the survey-sampling literature. We may treat the complete case, with Ri = 1, as a biased sample from the population with selection probability proportional to W(t,z). The conditional density of (T, D) given the covariates z and R = 1 is proportional to W(t, z)fD (t ∣ z)S(1− D) (t ∣ z). Since the censoring was applied to the data before the cases with missing covariates were dropped, the censoring is of the first type considered in § 3.3. The pseudo-partial score function becomes
where
and
Let
denote the solution of Umc(β) = 0. If W(t, z) = W(t), then
. Hence,
is a weighted mean of Umc1, …, Umcn, with weight proportional to 1/ Wi, while Umc is the unweighted mean. When some of the Wi are close to zero, the equation UIPW = 0 becomes unstable. Therefore, in order to prove the asymptotic normality of the estimator
, Pugh et al.'s report and Qi et al. (2005) had to assume that Wi > ε > 0 for some positive ε. We need a weaker assumption to prove the asymptotic properties of the estimator
; see the Appendix.
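The relationship between the two scores when W(t, z) = W(t) can be illustrated numerically. The sketch below assumes the usual Horvitz–Thompson form of the weighted score, in which each complete case and each risk-set member is weighted by 1/Wj; that form is an assumption here because equation (13) is not reproduced in this text. The function names and the two-observation toy data are hypothetical, and the covariate is scalar.

```python
import math


def weighted_cox_score(beta, times, events, z, subject_w, risk_w):
    """Cox-type score with separate subject weights and risk-set weights."""
    score = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue
        s0 = s1 = 0.0
        for t_j, z_j, w_j in zip(times, z, risk_w):
            if t_j >= t_i:
                r = w_j * math.exp(beta * z_j)
                s0 += r
                s1 += r * z_j
        score += subject_w[i] * (z[i] - s1 / s0)
    return score


def ipw_score(beta, times, events, z, w):
    """Assumed inverse probability weighted score: weights 1/W_i throughout."""
    inv = [1.0 / wi for wi in w]
    return weighted_cox_score(beta, times, events, z, inv, inv)


def pseudo_partial_score(beta, times, events, z, w):
    """Pseudo-partial score for W(t, z) = W(t): W(T_i) cancels in the ratio,
    so only the risk-set members carry the weights 1/W_j."""
    inv = [1.0 / wi for wi in w]
    return weighted_cox_score(beta, times, events, z, [1.0] * len(w), inv)


times, events, z, w = [1.0, 2.0], [1, 1], [1.0, 0.0], [0.5, 1.0]
u_ipw = ipw_score(0.0, times, events, z, w)
u_mc = pseudo_partial_score(0.0, times, events, z, w)
```

In the toy data the only nonzero summand is the first one, equal to 1/3; the weighted version multiplies it by 1/W1 = 2, which shows how a selection probability near zero can dominate the inverse probability weighted score while leaving the summands of the pseudo-partial score unchanged.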
We now analyze the mouse leukaemia data of Kalbfleisch & Prentice (2002) by the pseudo-partial likelihood method and the inverse probability weighted method. As in Kalbfleisch & Prentice (2002), we dichotomized virus level into a binary variable with zero representing values below 10^4 and one otherwise. There are two analyses, corresponding to the endpoints death by thymic leukaemia and death by thymic or nonthymic leukaemia. Logistic regression was used to estimate the observation probabilities, i.e. weight function, based on 156 mice that survived for at least 400 days; the observation probabilities were set to zero for the remaining 48 mice that died or were censored before 400 days. In order to make the analysis comparable with that of Wang & Chen (2001), we included only the survival time and its quadratic term as two predictors in our logistic regression model. Both analyses used 204 mice and treated both virus level and GPD1 phenotype as the missing covariates. Table 2 shows the results from applying the pseudo-partial likelihood method and the inverse probability weighted method. The pseudo-partial likelihood method shows that the virus level has a significant relationship with both endpoints, while GPD1 has a significant relationship with death from thymic or nonthymic leukaemia and a moderately significant relationship with death from thymic leukaemia. The inverse probability weighted method shows that GPD1 has a significant relationship with both endpoints and virus level has a moderately significant relationship with both endpoints. The inclusion of death by nonthymic leukaemia changes the estimates for GPD1 phenotype slightly, but moderately reduces the estimates for virus level. A similar phenomenon was also found by Qi et al. (2005). As in the Stanford heart transplant data analysis, the weight function is treated as known.
If the weight function were treated as unknown, then the variance estimator in Theorem A2 would overestimate the true sample variance of
.
Table 2.
Analysis of mouse leukaemia data using the Cox regression model with various methods. Entries are coefficient estimates with estimated standard errors (se) in parentheses, for the pseudo-partial likelihood estimator and the inverse probability weighted estimator
| Approach | Thymic leukaemia: GPD1 | Thymic leukaemia: Virus | Thymic and nonthymic leukaemia: GPD1 | Thymic and nonthymic leukaemia: Virus |
|---|---|---|---|---|
| Complete-case | −1.44 (0.60) | 1.44 (0.72) | −1.46 (0.57) | 1.22 (0.65) |
| Pseudo-partial likelihood | −1.15 (0.64) | 1.51 (0.71) | −1.19 (0.59) | 1.28 (0.62) |
| Inverse probability weighted | −1.47 (0.65) | 1.48 (0.75) | −1.41 (0.63) | 1.27 (0.67) |
se: estimated standard error.
If W(·) is a function of D, then the inverse probability weighted estimator is still a consistent estimator. Generally, however, under the same conditions, the solution of Umc(β) = 0 is not consistent, since E{Umc(β0)} ≠ 0. For generalization to the Cox model with missing covariates and a detailed simulation comparison of the two estimators, see Luo et al. (2009).
5. Discussion
The procedures developed in this paper can be easily extended to other types of incomplete data. The basis of our method is the conditional expectation of the partial score function given the observed data. For the proportional hazards model, if N(t) is completely known, the conditional expectation involves two terms; see equation (3). One is the conditional expectation of z, while the other is the conditional expectation of S^{(0)}(β, t). If z is completely known, we only have to consider the conditional expectation of S^{(0)}. In other types of incomplete data, we may also have to compute the conditional expectation of z or ΔN(t) or both. For example, in proportional hazards models with missing covariates or with covariates measured with error, the covariates z are only partially observed. Paik & Tsai (1997) applied a similar idea and proposed two estimators of β for proportional hazards models with missing covariates.
The estimators
and
proposed in § 3 are not nonparametric maximum likelihood estimators, so we do not generally expect them to be optimal. However, in the special case where W(t, x) = W(t) and β0 = 0, the estimator
is the nonparametric maximum likelihood estimator and is the most efficient estimator. Since
and
maximize the pseudo-partial likelihood, we expect the efficiency of these two estimators to be quite high. Empirical evidence from our limited simulation and real-data experiments suggests that
is more efficient than
.
Another special case is given by W(t, x) = t^x, f(t ∣ z) = f0(t) and p = pr(x = 0) = 1 − pr(x = 1). The nonparametric maximum likelihood estimator
of
was studied by Vardi (1982). We computed the asymptotic relative efficiency of
with respect to
when F0 is a uniform distribution on [0,1], for t and p ∈ {0.2, 0.4, 0.6, 0.8}. When p = 0 or 1, the product limit estimator based on
is identical to the nonparametric maximum likelihood estimator, so that the asymptotic relative efficiency equals 1 for p = 0 and p = 1. The lowest asymptotic relative efficiency we obtained was 0.985. Since the estimator
is much easier to calculate than the nonparametric maximum likelihood estimator,
is a preferred estimator.
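To make the sampling scheme in this special case concrete, here is a small simulation sketch. It reads the weight as W(t, x) = t^x, so that x = 0 gives an unbiased draw from F0 and x = 1 gives a length-biased draw, and it takes F0 to be uniform on [0,1], in which case the length-biased density is 2t and a draw can be obtained as the square root of a uniform variate. The function name `draw_sample`, the sample size and the seed are illustrative choices; the sketch does not compute either estimator or the asymptotic relative efficiency.

```python
import math
import random


def draw_sample(n, p, rng):
    """Return n pairs (t, x): x = 0 with probability p (unbiased draw from F0),
    x = 1 otherwise (length-biased draw with density 2t on [0, 1])."""
    sample = []
    for _ in range(n):
        if rng.random() < p:
            sample.append((rng.random(), 0))
        else:
            # Inverse-transform sampling: the length-biased CDF is t**2.
            sample.append((math.sqrt(rng.random()), 1))
    return sample


sample = draw_sample(20000, 0.4, random.Random(1))
unbiased = [t for t, x in sample if x == 0]
biased = [t for t, x in sample if x == 1]
mean_unbiased = sum(unbiased) / len(unbiased)  # theoretical mean 1/2
mean_biased = sum(biased) / len(biased)        # theoretical mean 2/3
print(len(unbiased), len(biased), mean_unbiased, mean_biased)
```

The two subsample means differ because the length-biased draws over-represent large values of t, which is the feature any estimator of F0 from the pooled sample has to correct for.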
Acknowledgments
The author wishes to thank an associate editor and reviewers for suggestions leading to an overall improvement. He also appreciates Dr. Bruce Levin's comments and input. The paper was partially completed during the author's visit to the Department of Statistics, National Cheng Kung University, Taiwan.
Appendix
Large sample properties
In deriving equation (12), we explicitly assume that the weight function W(t, x) is a distribution function for any given x. However, after the likelihood (12) is obtained, we only need a weaker assumption, that the weight be nonnegative, to prove the asymptotic properties. Here, we assume that the weight function W(t,x) satisfies Assumption 1.
Assumption 1
For every fixed x, there exists a constant a(x) such that { t ∣ W(t, x) > 0} = (a(x), ∞) or [a(x), ∞).
Assumption 1 does not require that W(·, x) be a nondecreasing function for any fixed x. The following four theorems describe the asymptotic properties of the estimators
and
.
Theorem A1
If Assumption 1 holds and the matrix I(β) is positive definite,
converges to β0 in probability as n → ∞, where
and
.
Proof
If Assumption 1 holds, we have
where
. Then
converges in probability to
Hence, U(β0) → 0 in probability as n → ∞ and, therefore, Theorem A1 holds by a standard argument.
For asymptotic normality, we need to introduce more notation. Define
where
, and F1(t) = pr { Ti < t, I(Di = 1)}. Furthermore, we define
Note that
is obtained by substituting s(0), s(1), F1(t) and β0 by
and
, respectively, in ξi.
Theorem A2
Under the same conditions as in Theorem A1,
converges weakly to a normal distribution with zero mean and covariance matrix
, where Ξ = E(ξ^{⊗2}).
Proof
The score function n^{−1/2} U(β0) can be expressed as
By Taylor series expansion, the second term in the above equation can be written as
Hence,
. By the multivariate central limit theorem and its corollary, n^{−1/2} U(β0) converges in distribution to a multivariate normal distribution, yielding Theorem A2.
Note that I(β0) can be consistently estimated by
and Ξ can be consistently estimated by
. Thus, the covariance matrix
can be consistently estimated by
| (A1) |
We now turn to a study of the large sample properties of the estimated baseline integrated hazard function
.
Theorem A3
Let M be a large positive number such that pr (T ⩾ M) is strictly positive. Under the same conditions as in Theorem A1, for t < M the process
converges weakly to a Gaussian process with zero mean and covariance function E{ξ0i(s)ξ0i(t)}, which can be consistently estimated by
, where
Theorem A4
Under the conditions of Theorem A1, the asymptotic covariance of
and
can be consistently estimated by
.
Proofs of Theorem A3 and Theorem A4. We may write
as
![]() |
Here, the last term can be shown to converge to zero in probability. By Theorem A2,
, and by Taylor series expansion around β0, the first term can be approximated by
![]() |
The second term can be expressed as
![]() |
Therefore, a simple application of the multivariate central limit theorem implies that the finite-dimensional distributions of {An(t), Bn(t), Cn(t)} are asymptotically multivariate normal. As in the proof in Tsiatis (1981), the sequence of distributions induced by An, Bn and Cn is tight.
References
- Andersen P. K., Borgan O., Gill R. D., Keiding N. Statistical Models Based on Counting Processes. New York: Springer; 1993.
- Betensky R. A., Lindsey J. C., Ryan L. M., Wand M. P. Local EM estimation of the hazard function for interval-censored data. Biometrics. 1999;55:238–45. doi: 10.1111/j.0006-341x.1999.00238.x.
- Binder D. A. Fitting Cox's proportional hazards models from survey data. Biometrika. 1992;79:139–47.
- Copas A. J., Farewell V. T. Incorporating retrospective data into an analysis of time to illness. Biostatistics. 2001;2:1–12. doi: 10.1093/biostatistics/2.1.1.
- Cox D. R. Regression models and life tables (with Discussion). J. R. Statist. Soc. B. 1972;34:187–220.
- Cox D. R. Partial likelihood. Biometrika. 1975;62:269–76.
- Crowley J., Hu M. Covariance analysis of heart transplant survival data. J. Am. Statist. Assoc. 1977;72:27–36.
- Dempster A. P., Laird N. M., Rubin D. B. Maximum likelihood estimation from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B. 1977;39:1–38.
- Horvitz D. G., Thompson D. J. A generalization of sampling without replacement from a finite universe. J. Am. Statist. Assoc. 1952;47:663–85.
- Hyde J. Survival analysis with incomplete observations. In: Miller R. G., Efron B., Brown B. W., Moses L. E., editors. Biostatistics Casebook. New York: John Wiley & Sons; 1980. pp. 31–46.
- Kalbfleisch J. D., Prentice R. L. The Statistical Analysis of Failure Time Data. 2nd ed. New York: Wiley; 2002.
- Keiding N., Gill R. D. Random truncation models and Markov processes. Ann. Statist. 1990;18:582–602.
- Lin D. Y. On fitting Cox's proportional hazards models to survey data. Biometrika. 2000;87:37–47.
- Luo X., Tsai W.-Y., Xu Q. Pseudo partial likelihood estimators for Cox regression with missing covariates. Biometrika. 2009;96 (forthcoming). doi: 10.1093/biomet/asp027.
- Miller R., Halpern J. Regression with censored data. Biometrika. 1982;69:521–31.
- Muttlak H. A., McDonald L. L. Ranked set sampling with size-biased probability of selection. Biometrics. 1990;46:435–46.
- Paik M. C., Tsai W.-Y. On using Cox proportional hazard models with missing covariate. Biometrika. 1997;84:579–93.
- Qi L., Wang C. Y., Prentice R. L. Weighted estimators for proportional hazards regression with missing covariates. J. Am. Statist. Assoc. 2005;100:1250–63.
- Tsai W.-Y., Jewell N. P., Wang M.-C. A note on the product limit estimate of a survival curve under right-censoring and left-truncation. Biometrika. 1987;74:883–6.
- Tsiatis A. A. A large sample study of Cox's regression model. Ann. Statist. 1981;9:91–108.
- Vardi Y. Nonparametric estimation in the presence of length bias. Ann. Statist. 1982;10:616–20.
- Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density. Biometrika. 1989;76:751–61.
- Wang C. Y., Chen H. Y. Augmented inverse probability weighted estimator for Cox missing covariate regression. Biometrics. 2001;57:414–9. doi: 10.1111/j.0006-341x.2001.00414.x.
- Wang M.-C. Nonparametric estimation from cross-sectional survival data. J. Am. Statist. Assoc. 1991;86:130–43.
- Wang M.-C. Hazards regression analysis for length-biased data. Biometrika. 1996;83:343–54.
- Wang M.-C., Jewell N. P., Tsai W.-Y. Asymptotic properties of the product limit estimate under random truncation. Ann. Statist. 1986;14:1597–605.