Biometrika. 2009 Jun 24;96(3):601–615. doi: 10.1093/biomet/asp026

Pseudo-partial likelihood for proportional hazards models with biased-sampling data

WEI YANN TSAI 1
PMCID: PMC3304552  PMID: 22422175

Abstract

We obtain a pseudo-partial likelihood for proportional hazards models with biased-sampling data by embedding the biased-sampling data into left-truncated data. The log pseudo-partial likelihood of the biased-sampling data is the expectation of the log partial likelihood of the left-truncated data conditioned on the observed data. In addition, asymptotic properties of the estimator that maximize the pseudo-partial likelihood are derived. Applications to length-biased data, biased samples with right censoring and proportional hazards models with missing covariates are discussed.

Some key words: EM algorithm, Left truncation, Length-biased data, Missing covariate, Right censoring

1. Introduction

The partial likelihood function of Cox (1975) has been mainly used for proportional hazards models with censored data (Cox 1972). For more complicated incomplete data, no unified method exists to find a partial likelihood for inference on the parameters of the proportional hazards models. Dempster et al. (1977) developed the EM algorithm to obtain maximum likelihood estimators for incomplete data. Originally used for fully parametric models, the EM algorithm was subsequently extended successfully to many nonparametric problems. In survival analysis, there is substantial literature generalizing the EM algorithm to frailty models (Andersen et al. 1993, § 9), missing covariates (Paik & Tsai 1997; Qi et al. 2005) and interval-censored data (Betensky et al. 1999).

In semiparametric models, one usually obtains, through a conditioning argument, an objective function with finitely many parameters of interest to which the EM algorithm can be readily applied. The present paper gives an analogous pseudo-partial likelihood for proportional hazards models with biased-sampling data that can be used without intensive computation.

Under the proportional hazards models for biased-sampling data, the conditional probability density function of an observed nonnegative random variable T, given covariates z(t) and x, can be expressed as

$$h(t \mid x, z) = \frac{W(t, x)\, f(t \mid z)}{\alpha(x, z)}, \tag{1}$$

where W(t, x) is a completely known nonnegative weight function, z(t) = {z1(t), …, zp(t)}T is a p-dimensional time-dependent covariate, x = (x1, …, xq)T is a q-dimensional time-independent covariate, f(t ∣ z) denotes a population conditional density function given z(s) for s ≤ t and α(x, z) is a normalization constant making h(· ∣ x, z) a genuine probability density function. Furthermore, we will assume a proportional hazards model with $\lambda(t \mid z) = f(t \mid z)/S(t \mid z) = \lambda_0(t)\exp\{\beta^{\mathrm T} z(t)\}$, where $S(t \mid z) = \int_t^\infty f(s \mid z)\,ds$ is the conditional survival function.

Biased-sampling data arise naturally in complex surveys. For example, in large-scale population-based surveys with multi-stage sampling, the complex design results in a set of probability weights for each subject. The weight function W(Ti, xi) represents the probability that the ith observation (Ti, xi, zi) was sampled from the population. Binder (1992) and Lin (2000) have proposed and studied a method for estimating the parameters of proportional hazards models from such survey data. For W(t, x) = t, the data are referred to as length-biased data. Wang (1996) proposed statistical inference for length-biased data based on Cox's model. For W(t, x) = I(x ≤ t), density (1) becomes a conditional probability density for left-truncated data. This problem has been extensively studied in the literature. Wang et al. (1986) used a classical approach to study the properties of the nonparametric maximum likelihood estimator. Keiding & Gill (1990) used counting process techniques to study the properties of the same estimator.

The following four real datasets illustrate different types of biased-sampling data.

Example 1

Shrub data. Muttlak & McDonald (1990) presented widths of 46 shrubs. Wang (1996) assumed that the probability of observing a shrub is proportional to the shrub's width, so that the sampling is length-biased. Wang (1996) analyzed the data with a proportional hazards model.

Example 2

Channing House data. Channing House is a retirement centre in Palo Alto, California. Hyde (1980) reported ages at entry and at death of 462 retirees, 365 females and 97 males, who were in residence between January 1964 and July 1975. The individuals who left Channing House or were still in the centre at the end of the study were censored. The data can be viewed as left-truncated with right censoring since the individual's death age must be greater than the entry age. The entry age serves as the left-truncation time.

Example 3

Stanford heart transplant data. Crowley & Hu (1977) gave information on 103 potential heart transplant recipients who were enrolled in the Stanford heart transplant programme from October 1967 to April 1974. The data include age, waiting time to transplantation, survival or censoring time from acceptance to the programme, and three mismatch scores. Among the 103 potential heart transplant recipients, there were 69 patients who underwent the heart transplant operation. Later, Miller & Halpern (1982) updated the data by reporting the survival or censoring times and ages of 184 patients who were enrolled in the same programme and had received heart transplants from October 1967 to February 1980. If we are only interested in analyzing the transplant patients, then the data of Crowley & Hu are left-truncated and right-censored data, with transplant waiting time as a random left-truncation variable. However, because Miller & Halpern did not report the transplant waiting times, their data can be viewed as biased-sampling data with right censoring. The weight function is the distribution of the transplant waiting time random variable.

Example 4

Mouse leukaemia data. Kalbfleisch & Prentice (2002) reported the survival of 204 mice. The mice were followed up for two years for mortality due to thymic or nonthymic leukaemia. The two covariates of interest were the GPD1 phenotype and the level of endogenous murine leukaemia virus. There were 175 mice whose levels of endogenous murine leukaemia virus were recorded. The GPD1 phenotype was determined only on a subgroup of the 100 mice that survived 400 days; thus, the probability of missing covariates clearly depends on the follow-up time. The complete-case analysis, which uses only the mice with complete information, is clearly biased since the selection probability depends on the survival time outcome variable. In fact, the complete cases comprise biased-sampling data with the weight function equal to the probability of selecting complete cases.

2. Partial likelihood

2.1. General approach

Let χ be a sample space, and let x ∈ χ be a realization of the random vector X with density fX(x;ϕ) depending on a vector parameter ϕ = (β, η), in which β is of interest and η is a nuisance parameter. In some applications, the dimension of η may increase with the sample size and the application of maximum likelihood estimation may lead to spurious results. However, suppose that x = (c1, x1, …, cn, xn) and that the full likelihood factorizes into

$$f_X(x; \phi) = \prod_{i=1}^{n} f_{\phi}(c_i \mid d_i) \prod_{i=1}^{n} f_{\beta}(x_i \mid e_i), \tag{2}$$

where di = (c1, x1, …, ci−1, xi−1) and ei = (c1, x1, …, ci−1, xi−1, ci). The second product on the right-hand side of (2) is the partial likelihood of β based on (x1, …, xn) in the sequence (ci, xi)(i = 1, …, n). Cox (1975) argued that inference based only on the partial likelihood would be acceptable if the information about β, contained in the first factor, was small.

A complication that sometimes occurs is that one observes a function y = y(x), instead of observing x ∈ χ. Therefore, inferences about β must be based on y. The following algorithm is a simple generalization of the EM algorithm for partial likelihood. The EM algorithm applied to gamma frailty models discussed in Andersen et al. (1993, § 9) is a special case of this generalization.

For incomplete data, the EM algorithm finds maximum likelihood estimates for β and η through the iterative maximization of

$$Q(\beta, \eta \mid \beta^{(c)}, \eta^{(c)}) = E\{\log f_X(X; \beta, \eta) \mid y; \beta^{(c)}, \eta^{(c)}\},$$

where β(c) and η(c) denote the current estimates. Therefore,

$$l_p(\beta, \eta, y) = \sum_{i=1}^{n} E_{\beta,\eta}\{\log f_{\beta}(x_i \mid e_i) \mid y\}, \qquad U_p(\beta, \eta, y) = \sum_{i=1}^{n} E_{\beta,\eta}\{U_i(\beta) \mid y\}$$

are, respectively, a log pseudo-partial likelihood function and a pseudo-partial score function of β for the observed data y, where Ui(β) = ∂ log fβ(xi ∣ ei)/∂β is the score function of the partial likelihood for complete data. Unfortunately, Up still involves the nuisance parameter η in many applications. For example, in frailty models Up is also a function of the frailty parameters and, therefore, cannot be used directly. However, Up is a function of β alone in some applications. In other situations, the maximization of lp(β, η, y) with respect to β and η has a simple solution. In § 3, we show two pseudo-partial likelihood functions for Cox's models with biased-sampling data in these two situations.

2.2. Partial likelihood for left-truncated and right-censored data

Let the lifetime T0i have distribution function Fi, and let the truncation time and censoring time (Vi, Ci) have joint distribution function Gi and joint probability density function gi. It is assumed that T0i and (Vi, Ci) are mutually independent. Moreover, we assume that there is a positive probability that T0i ≥ Vi and Ci ≥ Vi. We do not sample from the joint distribution, but from the conditional distribution given the event {T0 ≥ V, C ≥ V}. Let (Vi, T0i, Ci) (i = 1, …, n) be a sample of n independent triples from this conditional distribution. Then, our left-truncated and right-censored sample is (V1, T1, D1), …, (Vn, Tn, Dn), where Ti = min(T0i, Ci), Di = I(Ti = T0i) and I(·) is an indicator function. Note that Ti ≥ Vi (i = 1, …, n). We let Ni(t) = I(Ti ≤ t, Di = 1) and Yvi(t) = I(Vi ≤ t)Yi(t) be, respectively, the indicator of whether or not the ith individual failed by time t and the indicator of whether or not the ith individual is at risk just before time t, where Yi(t) = I(Ti ≥ t). Furthermore, for given time-dependent covariates zi(t) = {z1i(t), z2i(t), …, zpi(t)}T, we assume that T0i follows a proportional hazards model, i.e.

$$\lambda(t \mid z_i) = \lambda_0(t) \exp\{\beta^{\mathrm T} z_i(t)\},$$

where λ(t ∣ zi) is the conditional hazard function of T0i given zi(t), β is a p × 1 vector of unknown regression coefficients and λ0(t) is the underlying or baseline hazard function. For convenience of notation, if it is unambiguous, the dependence of zi on t will be suppressed. As shown by Andersen et al. (1993, §§ 3.3 and 3.4), (N1, …, Nn) is a multivariate counting process that has an intensity process (λ(t ∣ z1)Yv1(t), …, λ(t ∣ zn)Yvn(t)) with respect to the filtration defined in Andersen et al. (1993, p. 153). Let $\Lambda_0(t) = \int_0^t \lambda_0(s)\,ds$ be the cumulative underlying hazard function. The log partial likelihood, which is the conditional likelihood given V, can be written as

$$l_{c1}(\beta, \Lambda_0) = \sum_{i=1}^{n} \int_0^\infty \left[\beta^{\mathrm T} z_i(t) + \log\{\Delta\Lambda_0(t)\}\right] dN_i(t) - n \int_0^\infty S^{(0)}(\beta, t)\, d\Lambda_0(t); \tag{3}$$

see equations (3.3.3) and (7.2.2) of Andersen et al. (1993), where

$$S^{(0)}(\beta, t) = n^{-1} \sum_{i=1}^{n} Y_{vi}(t) \exp\{\beta^{\mathrm T} z_i(t)\}$$

and ΔH(t) = H(t+) − H(t−) for any function H(t). The partial derivative of (3) with respect to ΔΛ0(t) is {ΔN(t)/ΔΛ0(t)} − nS(0)(β, t), where $N(t) = \sum_{i=1}^{n} N_i(t)$ is the number of observed failures up to time t. Therefore, for a fixed value of β, we would estimate Λ0(t) by the Nelson–Aalen estimator

$$\hat\Lambda_0(\beta, t) = \int_0^t \frac{d\bar N(s)}{S^{(0)}(\beta, s)}, \tag{4}$$

where $\bar N(t) = n^{-1} N(t)$. Inserting (4) into (3), we obtain the log profile partial likelihood, which equals lc2(β) up to terms that do not depend on β. Here,

$$l_{c2}(\beta) = \sum_{i=1}^{n} \int_0^\infty \left[\beta^{\mathrm T} z_i(t) - \log\{S^{(0)}(\beta, t)\}\right] dN_i(t) \tag{5}$$

is the generalized log partial likelihood, originally derived by Cox (1972, 1975) for the case of censored survival data. Thus, the score function of the generalized Cox partial likelihood is $\partial l_{c2}(\beta)/\partial\beta = \sum_{i=1}^{n} \int_0^\infty \{z_i(t) - S^{(1)}(\beta, t)/S^{(0)}(\beta, t)\}\, dN_i(t)$, where S(1)(β, t) = ∂S(0)(β, t)/∂β.
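As a rough numerical illustration of (4) and (5), the following sketch evaluates the left-truncated risk set, the log partial likelihood and the Nelson–Aalen estimate. It assumes time-independent covariates stored in an array `Z`; the function names `risk_sum`, `log_partial_likelihood` and `nelson_aalen` are hypothetical helpers introduced here, not part of the paper or of any library.

```python
import numpy as np

def risk_sum(beta, t, V, T, Z):
    """n * S0(beta, t): sum of exp(beta'z_i) over subjects with V_i <= t <= T_i."""
    at_risk = (V <= t) & (T >= t)
    return np.sum(at_risk * np.exp(Z @ beta))

def log_partial_likelihood(beta, V, T, D, Z):
    """Sum over observed failures of beta'z_i - log S0(beta, T_i), with S0 including the 1/n factor."""
    n = len(T)
    total = 0.0
    for i in np.flatnonzero(D == 1):
        total += Z[i] @ beta - np.log(risk_sum(beta, T[i], V, T, Z) / n)
    return total

def nelson_aalen(beta, t, V, T, D, Z):
    """Cumulative baseline hazard estimate at time t for a fixed beta."""
    return sum(1.0 / risk_sum(beta, T[i], V, T, Z)
               for i in np.flatnonzero((D == 1) & (T <= t)))

# Toy left-truncated, right-censored sample (made-up numbers, for illustration only).
V = np.array([0.0, 1.5, 0.0, 0.5])   # truncation (entry) times
T = np.array([1.0, 2.0, 3.0, 2.5])   # observed times
D = np.array([1, 1, 0, 1])           # failure indicators
Z = np.array([[0.0], [1.0], [1.0], [0.0]])
beta = np.array([0.3])
print(log_partial_likelihood(beta, V, T, D, Z), nelson_aalen(beta, 2.5, V, T, D, Z))
```

The only change relative to the usual Cox risk set is the extra condition V_i <= t, which removes subjects who have not yet entered the study at time t.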

We will treat equations lc1 and lc2 as our working log partial likelihood for the complete data. Two log pseudo-partial likelihoods for biased-sampling data will be derived, respectively, based on lc1 and lc2 in the next section. The notation Ni(t), N(t) and Yi(t) will be used throughout with obvious adjustments for data with no censoring and/or no truncation. A parameter with subscript zero will denote the true parameter.

3. Data from biased sampling

3.1. Embedding the data into the left-truncation model

First we assume that W(·, x) is a distribution function for every fixed x; this assumption will be relaxed later. Let V and T0 be nonnegative random variables with conditional distribution functions pr(V < t ∣ x) = W(t, x) and pr{T0 < t ∣ z(·)} = 1 − S(t ∣ z), respectively. We assume that, conditional on x and z, V and T0 are independent. We observe (V, x, T0, z) only if T0 ≥ V. Therefore, the conditional density of observing (V, T0) = (v, t) given (x, z) is proportional to I(t ≥ v)w(v, x)f(t ∣ z), where w(t, x) = ∂W(t, x)/∂t. Hence, the marginal density of observed T, given z and x, is proportional to $f(t \mid z)\int_0^t w(v, x)\,dv = W(t, x) f(t \mid z)$, which is proportional to the conditional probability density function given in (1). Consequently, we may treat (V, x, T0, z) as the complete data vector and (x, T0, z) as the incomplete data with the truncation time V completely missing.
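The embedding can be checked by simulation. The sketch below makes illustrative choices that are not taken from the paper: no covariates, W(t) = 1 − exp(−t) and population density f(t) = exp(−t). Keeping T0 only when T0 ≥ V should then produce draws with density proportional to W(t)f(t) = {1 − exp(−t)}exp(−t), whose mean is 1.5. The function name `sample_biased` is hypothetical.

```python
import numpy as np

def sample_biased(n, rng):
    """Draw n observed lifetimes by left truncation: keep T0 only when T0 >= V."""
    kept = []
    while len(kept) < n:
        v = rng.exponential(1.0, size=n)    # V has distribution function W(t) = 1 - exp(-t)
        t0 = rng.exponential(1.0, size=n)   # T0 has population density f(t) = exp(-t)
        kept.extend(t0[t0 >= v])            # the truncation time V is then discarded
    return np.array(kept[:n])

rng = np.random.default_rng(0)
t_obs = sample_biased(200_000, rng)
# The sample mean should be close to the theoretical mean 1.5 of the biased density.
print(t_obs.mean())
```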

3.2. Pseudo-partial likelihoods

According to the argument in § 2.1 and from the Cox log partial likelihood lc2 of equation (5) in § 2.2, the following function, which is the conditional expectation of lc2 given the observed data, can be considered as a log pseudo-partial likelihood for the observed data (xi, T0i, zi) (i = 1, …, n):

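$$E\{l_{c2}(\beta) \mid \text{data}\} = \sum_{i=1}^{n} \int_0^\infty \beta^{\mathrm T} z_i(t)\, dN_i(t) - \sum_{i=1}^{n} \int_0^\infty E\left[\log\{S^{(0)}(\beta, t)\} \mid T_{01}, x_1, z_1, \ldots, T_{0n}, x_n, z_n\right] dN_i(t). \tag{6}$$

If we condition on T0 = t, X = x, then the random variable V has conditional distribution W{min(v, t), x}/W(t, x). Therefore, the second term of (6) is a function of (T01, x1, z1, …, T0n, xn, zn) and β, which does not involve the nuisance parameter λ0(t). We may, therefore, use the Monte Carlo method to compute it. For example, let Vjk (j = 1, …, n, k = 1, …, m) be random samples from the distribution W{min(v, T0j), xj}/W(T0j, xj). Then the second term of (6) can be approximated by

$$-\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{n} \int_0^\infty \log\{S_k^{(0)}(\beta, t)\}\, dN_i(t), \qquad S_k^{(0)}(\beta, t) = n^{-1} \sum_{j=1}^{n} I(V_{jk} \le t)\, Y_j(t) \exp\{\beta^{\mathrm T} z_j(t)\}. \tag{7}$$

We substitute (7) into the second term of (6) and obtain an approximate loglikelihood

$$\sum_{i=1}^{n} \int_0^\infty \beta^{\mathrm T} z_i(t)\, dN_i(t) - \frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{n} \int_0^\infty \log\{S_k^{(0)}(\beta, t)\}\, dN_i(t). \tag{8}$$

In particular, for length-biased data, i.e. W(t, x) = t, the maximum approximate loglikelihood estimator based on (8) is asymptotically equivalent to the improved estimator proposed by Wang (1996). In addition, for m = 1, the approximate loglikelihood (8) is identical to the log-pseudolikelihood described by Wang (1996).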
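The Monte Carlo step requires draws of V from the conditional distribution W{min(v, T0j), xj}/W(T0j, xj). A minimal sketch by inversion follows, assuming a weight function that does not depend on x and is continuous and strictly increasing. For concreteness it borrows the Weibull form W(t) = 1 − exp(−0.027 t^0.925) used for the heart transplant data in § 4.2; the names `W`, `W_inv` and `sample_truncation_times` are hypothetical.

```python
import numpy as np

A, B = 0.027, 0.925   # Weibull weight: W(t) = 1 - exp(-A * t**B)

def W(t):
    return 1.0 - np.exp(-A * np.power(t, B))

def W_inv(u):
    """Inverse of W on [0, 1)."""
    return np.power(-np.log1p(-u) / A, 1.0 / B)

def sample_truncation_times(t_obs, m, rng):
    """Draw m values of V per observed time t from W{min(v, t)}/W(t) by inversion."""
    u = rng.uniform(size=(len(t_obs), m))
    return W_inv(u * W(t_obs)[:, None])   # V = W^{-1}(U * W(t)) never exceeds t

rng = np.random.default_rng(0)
V = sample_truncation_times(np.array([50.0, 120.0]), m=5, rng=rng)
print(V)   # one row of simulated truncation times per observed time
```

For length-biased data, W(t) = t, the same construction reduces to V = U·T0j, a uniform draw on (0, T0j).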

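The disadvantages of using lc2 as the working log partial likelihood for complete data are as follows: we must assume that W(t, x) is a nondecreasing function in t for every fixed x; the underlying cumulative hazard function must be estimated by other methods; it requires intensive computation to obtain the log pseudo-partial likelihood (6); and the loglikelihood lc2 contains less information about the parameters than the loglikelihood lc1. Therefore, we may take lc1 as our working log partial likelihood and apply the same procedure to (3). The resulting log pseudo-partial likelihood can be written as

$$E\{l_{c1}(\beta, \Lambda_0) \mid \text{data}\} = \sum_{i=1}^{n} \int_0^\infty \left[\beta^{\mathrm T} z_i(t) + \log\{\Delta\Lambda_0(t)\}\right] dN_i(t) - n \int_0^\infty S_W^{(0)}(\beta, t)\, d\Lambda_0(t), \tag{9}$$

where

$$S_W^{(0)}(\beta, t) = n^{-1} \sum_{i=1}^{n} Y_i(t)\, \frac{W(t, x_i)}{W(T_{0i}, x_i)} \exp\{\beta^{\mathrm T} z_i(t)\}.$$

For a fixed value of β, maximization of (9) with respect to ΔΛ0(t) leads to $\Delta\Lambda_0(t) = \Delta N(t)/\{n S_W^{(0)}(\beta, t)\}$. Therefore, for a fixed value of β, we would estimate Λ0(t) by the Nelson–Aalen estimator

$$\hat\Lambda_0(\beta, t) = \int_0^t \frac{d\bar N(s)}{S_W^{(0)}(\beta, s)}, \tag{10}$$

where $\bar N(t) = n^{-1} \sum_{i=1}^{n} N_i(t)$. Inserting (10) into (9), we obtain the following log profile pseudo-partial likelihood depending only on β:

$$l(\beta) = \sum_{i=1}^{n} \int_0^\infty \left[\beta^{\mathrm T} z_i(t) - \log\{S_W^{(0)}(\beta, t)\}\right] dN_i(t), \tag{11}$$

which is also a generalized Cox log partial likelihood. The vector of score statistics is given as

$$U(\beta) = \sum_{i=1}^{n} \int_0^\infty \left\{z_i(t) - \frac{S_W^{(1)}(\beta, t)}{S_W^{(0)}(\beta, t)}\right\} dN_i(t), \tag{12}$$

where $S_W^{(1)}(\beta, t) = \partial S_W^{(0)}(\beta, t)/\partial\beta$. We shall base our estimator of β on (12), and the value of β which maximizes (11) will be denoted by $\hat\beta$. As pr{W(T0i, xi) = 0} = 0, W(t, xi)/W(T0i, xi) is well defined.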

The log pseudo-partial likelihood (11) is identical to the usual log partial likelihood of the models λ(t ∣ zi, z*i) = λ0(t) exp{βT zi + z*i(t)}, where z*i(t) = log{W(t, xi)/W(T0i, xi)}. In particular, when W(t, x) = W(t), z*i(t) can be simplified to −log{W(T0i)}, because the common term log{W(t)} is absorbed into the baseline hazard. Standard statistical software, such as SAS, can be used to obtain the estimate $\hat\beta$ of β by setting the regression coefficient of z*i to be 1, using the option offset = z* in procedure PHREG. The robust sandwich covariance matrix estimator from SAS is consistent whereas the model-based covariance matrix is inconsistent because the score (12) is not a martingale. For computing estimates of parameters and variances for general weight functions, readers can use the author's R subroutine downloadable from www.columbia.edu/~wt5/.
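For readers without SAS, the weighted risk-set form of the score described above can also be solved directly. The following is a minimal Newton–Raphson sketch, not the author's R subroutine. It assumes time-independent covariates, a weight function W(t) that does not depend on x, and either no censoring or censoring of the first type. The function name `fit_pseudo_partial` is hypothetical, and no variance estimate is computed.

```python
import numpy as np

def fit_pseudo_partial(T, D, Z, W, n_iter=25, tol=1e-10):
    """Maximize sum_i D_i [beta'z_i - log sum_j a_ij exp(beta'z_j)] by Newton-Raphson."""
    T, D, Z = np.asarray(T, float), np.asarray(D, int), np.asarray(Z, float)
    beta = np.zeros(Z.shape[1])
    events = np.flatnonzero(D == 1)
    w = W(T)
    # a[i, j] = I(T_j >= T_i) * W(T_i) / W(T_j): expected at-risk weight of subject j at time T_i
    a = (T[None, :] >= T[:, None]) * w[:, None] / w[None, :]
    for _ in range(n_iter):
        r = a[events] * np.exp(Z @ beta)[None, :]        # weighted relative risks, one row per failure
        s0 = r.sum(axis=1)
        zbar = (r @ Z) / s0[:, None]                     # weighted covariate means at failure times
        score = (Z[events] - zbar).sum(axis=0)
        zz = np.einsum('ej,jk,jl->ekl', r, Z, Z) / s0[:, None, None]
        info = (zz - zbar[:, :, None] * zbar[:, None, :]).sum(axis=0)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Usage with length-biased data, W(t) = t (simulated here purely for illustration).
rng = np.random.default_rng(0)
T = np.concatenate([rng.gamma(2.0, 1.0, 100), rng.gamma(2.0, np.exp(-1.0), 100)])
Z = np.concatenate([np.zeros(100), np.ones(100)])[:, None]
print(fit_pseudo_partial(T, np.ones(200, dtype=int), Z, lambda t: t))
```

The matrix `a` costs memory of order n², which is acceptable for samples of a few thousand observations; a sorted cumulative-sum implementation would be preferable for larger datasets.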

3.3. Censoring

We may assume that the data are subject not only to biased sampling, but also to right censoring. Let Vi, T0i and Ci be, respectively, truncation, survival time and censoring time. Recall that when we define the left-truncated and right-censored data, we assume that (Vi, Ci) and T0i are mutually independent. Hence, the joint probability density function of (T0i, Vi, Ci) can be expressed as fi(t)gi(v, c), where fi is the probability density function of T0i and gi is the joint probability density function of (Vi, Ci). We only identify two types of censoring mechanism based on different censoring and truncation mechanisms and assumptions about gi. However, there are possible applications to other types of censoring.

The first type of censoring assumes that, given covariates (xi, zi), the original truncation time, survival time and censoring time are mutually independent. However, we observe the data (Vi, Ti, Di) only if Vi ≤ Ti, where Ti = min(T0i, Ci) and Di = I(T0i ≤ Ci). The biased censored data (Ti, Di) are obtained by applying the censoring mechanism to the data before the data are sampled with bias. In the embedding, we first apply the censoring mechanism to the survival time and then apply the truncation mechanism to the observed censored data. This type of censoring is equivalent to assuming that

$$g_i(v, c) \propto w(v, x_i)\, g_{2i}(c), \qquad v \le c,$$

where g2i(t) is the probability density function of the censoring time Ci. We may also say that Vi and Ci are quasi-independent in the region {(v, c) ∣ v ≤ c}. The observed data (Vi, Ti, Di) comprise a special case of the standard left-truncated and right-censored data as defined in § 2.2. Hence, in the first type of censoring, the conditional probability density function of the observed data (Vi, Ti, Di) = (v, t, d), given (xi, zi) = (x, z), is proportional to

$$w(v, x)\, I(v \le t)\, \{f(t \mid z)\, \bar G_{2i}(t)\}^{d}\, \{S(t \mid z)\, g_{2i}(t)\}^{1-d},$$

where $\bar G_{2i}(t) = \int_t^\infty g_{2i}(s)\,ds$ is the survival function of the censoring time. Therefore, given (Ti, Di, xi, zi), the truncation time Vi has conditional distribution function W{min(v, Ti), xi}/W(Ti, xi). The procedures proposed in §§ 3.2 and 3.3 are still valid because E{I(Vi ≤ t)Yi(t) ∣ Ti, Di, xi, zi} still equals Yi(t)W(t, xi)/W(Ti, xi). The formulae of the log pseudo-partial likelihood will be the same for this type of censoring with T0i replaced by Ti and Ni(t) defined by Ni(t) = I(Ti ≤ t, Di = 1).

The second type of censoring comprises the censoring of residual lifetime after the data are sampled with bias. Let R0i = T0i − Vi and Rci = Ci − Vi be, respectively, the residual lifetime and the residual censoring time of the ith individual. Given covariates (xi, zi) and Ci ≥ Vi, we assume that Rci and (R0i, Vi) are independent. We observe (Vi, Ti, Di) only if Vi ≤ Ti, where Ti = Vi + Ri, Ri = min(R0i, Rci) and Di = I(R0i ≤ Rci). The censoring time Ci and truncation time Vi are not independent in this type of censoring. The second type of censoring is equivalent to assuming that

$$g_i(v, c) \propto w(v, x_i)\, g_{3i}(c - v), \qquad c \ge v,$$

where g3i(t) is the probability density function of the residual censoring time Rci. The conditional density function of the observed data (Vi, Ti, Di) = (v, t, d) given (xi, zi) = (x, z) is proportional to

$$w(v, x)\, I(v \le t)\, \{f(t \mid z)\, \bar G_{3i}(t - v)\}^{d}\, \{S(t \mid z)\, g_{3i}(t - v)\}^{1-d},$$

where $\bar G_{3i}(t) = \int_t^\infty g_{3i}(s)\,ds$ is the survival function of the residual censoring time. Hence,

3.3.
3.3.

Unfortunately, under the second type of censoring, the conditional expectation E{ I(Vit)Yi(t) ∣ data} is a function of the censoring distribution. If the truncation time V is observable, then we may use a Kaplan–Meier-type estimator of Inline graphic. Consider the one-sample problem such that β = 0 and w(v, x) = w(v). The nonparametric maximum likelihood estimator of Inline graphic is the Kaplan–Meier estimator based on the censored residual lifetime (Ri, Di); that is

3.3.

Hence, the maximum pseudo-partial likelihood estimator of S is

3.3.

where

3.3.

The estimator Inline graphic is not the nonparametric maximum conditional likelihood estimator proposed and studied by Tsai et al. (1987), nor is it the nonparametric maximum likelihood estimator. However, if there is no censoring, Inline graphic is the nonparametric maximum likelihood estimator. It is straightforward to prove that Inline graphic will converge to Inline graphic and Inline graphic will converge to Inline graphic. As a result, under some regularity conditions, Inline graphic will converge to the survival function S(t) of the survival time.

If the censoring time depends on the covariates, we need to use a smooth type of Kaplan–Meier estimator of Inline graphic. If the truncation time Vi cannot be observed, we may use the nonparametric maximum likelihood estimator of Inline graphic. More research is needed in order to understand the properties of the proposed method.

The Stanford heart transplant dataset of Miller & Halpern (1982) is a prospective cohort study. The event time of interest is the survival time after entry. The censoring time is the duration between calendar entry date of the patients and February 1980. If we assume that there is no loss to follow-up, then C = February 1980 − E; and V = transplant calendar date − E = transplant waiting time, where C, V and E, respectively, are the censoring time, truncation time and calendar entry date of the patient. If V, the transplant waiting time, is independent of E, the calendar entry date, then C and V are independent for all patients in the cohort; however, the transplant waiting times of the patients who died or were censored before transplantation cannot be observed. Therefore, the censoring time C is quasi-independent of the truncation time V for the transplant patients in the Stanford heart transplant data. Consequently, the censoring is of the first type. The Channing House dataset is a retrospective cohort study. The event time of interest is the death age. The censoring time C for patients who did not leave the centre before July 1975 is the time between the birth date B and July 1975. The truncation time, i.e. entry age, V is E − B, where E is the calendar entry date. The residual censoring time Rc is C − V = July 1975 − E. If entry age is independent of calendar entry date, i.e. V and E are independent, then the residual censoring time Rc is independent of the truncation time V. This censoring is of the second type.

Most applications can be classified as either the first or second type of censoring. For example, the censoring mechanism of the proportional hazards model with missing covariates, see § 4.3, and the Stanford heart transplant data are of the first type; the censoring mechanisms of the renewal process (Vardi 1989), cross-sectional survival data (Wang 1991) and the Channing House data are of the second type.

The asymptotic properties of the maximum pseudo-partial likelihood estimators Inline graphic and Inline graphic for censored biased-sampling data of general nonnegative weight functions are established and discussed in the Appendix. We assume that the censoring is of the first type for the rest of the paper.

4. Application

4.1. Length-biased data

The techniques developed in the previous sections can be applied to length-biased data by using W(t) = t. We illustrate the method with a simulation study and an analysis of the shrub dataset. Consider a two-sample proportional hazards model with covariate z = 0 representing Group 0 and z = 1 representing Group 1. We generate 100 length-biased samples from Group 0 with population density f0(t) = t exp(−t)I(t > 0) and 100 length-biased samples from Group 1 with population density f1(t) = t exp(β − e^β t)I(t > 0) for β = 0, 1, 2 in three different scenarios. The relative hazard between Group 0 and Group 1 is equal to e^β and, thus, the log hazard ratio is β = 0, 1, 2. We also calculate the estimator $\tilde\beta$ obtained by maximizing the approximate loglikelihood (8), which was also proposed by Wang (1996). The variance estimator proposed by Wang (1996) is identical to the estimator from SAS. We use equation (A1) in Theorem A2 provided in the Appendix to obtain the variance estimator for the maximum pseudo-partial likelihood estimator $\hat\beta$. For comparison we also include the inverse probability weighted estimator $\hat\beta_{IPW}$ of Binder (1992) and Lin (2000); see also Horvitz & Thompson (1952) and Qi et al. (2005). For the definition of the inverse probability weighted estimator, see equation (13) in § 4.3. Table 1, based on 1000 replicates, shows that $\hat\beta$ is more efficient than $\tilde\beta$ and $\hat\beta_{IPW}$ and that the variance estimator of $\hat\beta_{IPW}$ underestimates the true sample variance of $\hat\beta_{IPW}$.

Table 1. Monte Carlo simulation for length-biased data. One hundred length-biased observations were generated from Group 0 with population density f0(t) = t exp(−t)I(t > 0) and 100 length-biased observations were generated from Group 1 with population density f1(t) = t exp(β − e^β t)I(t > 0) for β = 0, 1, 2. Here β̂ maximizes the loglikelihood l(β), β̃ maximizes the approximate loglikelihood (8) and β̂IPW was proposed by Binder (1992) and Lin (2000). Estimates are based on 1000 replications

        Bias                       Sample variance            Mean of estimated variance
β       β̂       β̃       β̂IPW     β̂       β̃       β̂IPW     β̂       β̃       β̂IPW
0       0.002   0.003   −0.001   0.010   0.020   0.049    0.010   0.020   0.036
1       0.009   0.012   0.014    0.018   0.030   0.070    0.017   0.030   0.050
2       0.032   0.033   0.060    0.058   0.080   0.139    0.053   0.075   0.103
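The simulated design can be reproduced along the following lines. The sketch reads the stated densities as length-biased versions of exponential populations with hazards 1 and e^β, which is an interpretation rather than a statement from the paper: a density proportional to t·r·exp(−rt) is a gamma density with shape 2 and rate r. The function name `simulate_two_sample` is hypothetical.

```python
import numpy as np

def simulate_two_sample(beta, n_per_group, rng):
    """Length-biased draws from exponential populations with hazards 1 and exp(beta)."""
    t0 = rng.gamma(shape=2.0, scale=1.0, size=n_per_group)             # Group 0, z = 0
    t1 = rng.gamma(shape=2.0, scale=np.exp(-beta), size=n_per_group)   # Group 1, z = 1
    T = np.concatenate([t0, t1])
    z = np.concatenate([np.zeros(n_per_group), np.ones(n_per_group)])
    return T, z

rng = np.random.default_rng(0)
T, z = simulate_two_sample(beta=1.0, n_per_group=100, rng=rng)
print(T[z == 0].mean(), T[z == 1].mean())   # theoretical means are 2 and 2*exp(-1)
```

Repeating this draw 1000 times and applying each estimator to every replicate would give Monte Carlo summaries of the kind reported in Table 1.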

We denote the width of an observed shrub from the shrub dataset by Ti. Wang (1996) assumed that the probability of including Ti in the dataset is proportional to Ti itself. We use the proportional hazards model,

$$\lambda(t \mid z_1, z_2) = \lambda_0(t) \exp(\beta_1 z_1 + \beta_2 z_2),$$

where z1 = I(T belongs to transect I) and z2 = I(T belongs to transect II) are two indicator covariates. Use of sas with offset − log (Ti) provides Inline graphic with model-based standard error estimates {seInline graphic, seInline graphic and corr(Inline graphic. The estimates are very similar to the results of Wang (1996), but the model-based standard errors overestimate the true standard errors of Inline graphic. Use of equation (A1) in the Appendix, which is identical to the robust sandwich covariance estimate from sas, gives {seInline graphic, seInline graphic and corr(Inline graphic. For comparison Inline graphic with estimated standard errors (0.33, 0.31).

4.2. Biased samples with right censoring

Miller & Halpern (1982) compared four regression techniques on the updated Stanford heart transplant data without acknowledging that the transplant patient's survival time was sampled with bias. As mentioned in § 1, the survival times can be treated as a biased sample with a weight function equal to the distribution of the transplant waiting time. Miller & Halpern (1982) did not provide the transplant waiting times but Crowley & Hu (1977) did. The Weibull distribution fits the transplant waiting times very well. The R^2 of the fit of $\log[-\log\{\hat S(t)\}]$ to log(t) is 0.97, where $\hat S(t)$ is the product-limit estimate of the transplant waiting time survival function based on Crowley & Hu's (1977) 103 patients. The conditional maximum likelihood estimate for the Weibull survival function of transplant waiting time is exp(−0.027 t^0.925). Hence, the weight function, W(t) = 1 − exp(−0.027 t^0.925), will be used to obtain the parameters of the proportional hazards model. We assume that the hazard rate of the transplant patient's survival is proportional to expInline graphic. Miller & Halpern (1982) deleted 27 patients lacking the T5 mismatch score and 5 patients with survival times less than 10 days from the total of 184 patients in one of their data analyses. Based on 152 Stanford heart transplant patients, the pseudo-partial likelihood estimate Inline graphic is (−0.13, 0.0021) with {seInline graphic, seInline graphic and Inline graphic = (−0.17, 0.0026) with seInline graphic. In calculating the variances of the estimates, we assume that the weight function is known without error. Both methods show a strong relationship between survival time and age.

4.3. Missing covariates

It is assumed that, for each i, (Ti, Di, zi, Ri) are independent and identically distributed random vectors, where Ri = 1 if zi is fully observed and is zero otherwise. Let W(t, z) = pr(R = 1 ∣ T = t, Z = z) be the conditional probability of observing the full covariates data given the covariates z and follow-up time T = t. We assume that W(t, z) is either completely known or can be estimated from other methods; see Qi et al. (2005) and an unpublished Harvard School of Public Health technical report by M. Pugh, J. Robins, S. Lipsitz and D. Harrington. Copas & Farewell (2001), Qi et al. (2005) and Pugh et al.'s report proposed the following weighted complete-case pseudolikelihood score function for inference of the proportional hazards models with missing covariates:

$$U_{IPW}(\beta) = \sum_{i=1}^{n} \frac{R_i}{W_i} \int_0^\infty \left\{z_i(t) - \frac{S_{IPW}^{(1)}(\beta, t)}{S_{IPW}^{(0)}(\beta, t)}\right\} dN_i(t), \tag{13}$$

where $S_{IPW}^{(k)}(\beta, t) = n^{-1} \sum_{j=1}^{n} (R_j/W_j)\, Y_j(t)\, z_j(t)^{\otimes k} \exp\{\beta^{\mathrm T} z_j(t)\}$ (k = 0, 1) and Wi = W{Ti, zi(Ti)}. The inverse probability weighted estimator $\hat\beta_{IPW}$ is the solution of UIPW(β) = 0. This pseudo-score UIPW was also proposed and studied by Binder (1992) and Lin (2000) in the survey-sampling literature. We may treat the complete case, with Ri = 1, as a biased sample from the population with selection probability proportional to W(t, z). The conditional density of (T, D) given the covariates z and R = 1 is proportional to W(t, z)f^D(t ∣ z)S^(1−D)(t ∣ z). Since the censoring was applied to the data before the cases with missing covariates were dropped, the censoring is of the first type considered in § 3.3. The pseudo-partial score function becomes $U_{mc}(\beta) = \sum_{i=1}^{n} R_i U_{mci}(\beta)$, where $U_{mci}(\beta) = \int_0^\infty \{z_i(t) - S_{mc}^{(1)}(\beta, t)/S_{mc}^{(0)}(\beta, t)\}\, dN_i(t)$ and

$$S_{mc}^{(k)}(\beta, t) = n^{-1} \sum_{j=1}^{n} R_j\, Y_j(t)\, \frac{W\{t, z_j(t)\}}{W_j}\, z_j(t)^{\otimes k} \exp\{\beta^{\mathrm T} z_j(t)\} \qquad (k = 0, 1).$$

Let $\hat\beta_{mc}$ denote the solution of Umc(β) = 0. If W(t, z) = W(t), then $U_{IPW}(\beta) = \sum_{i=1}^{n} (R_i/W_i)\, U_{mci}(\beta)$. Hence, UIPW is a weighted mean of Umc1, …, Umcn, with weight proportional to 1/Wi, while Umc is the unweighted mean. When some of the Wi are close to zero, the equation UIPW = 0 becomes unstable. Therefore, in order to prove the asymptotic normality of the estimator $\hat\beta_{IPW}$, Pugh et al.'s report and Qi et al. (2005) had to assume that Wi > ε > 0 for some positive ε. We need a weaker assumption to prove the asymptotic properties of the estimator $\hat\beta_{mc}$; see the Appendix.

We now analyze the mouse leukaemia data of Kalbfleisch & Prentice (2002) by the pseudo-partial likelihood method and the inverse probability weighted method. As in Kalbfleisch & Prentice (2002), we dichotomized virus level into a binary variable, with zero representing values below 10^4 and one otherwise. There are two analyses, corresponding to the endpoints death by thymic leukaemia and death by thymic or nonthymic leukaemia. Logistic regression was used to estimate the observation probabilities, i.e. the weight function, based on the 156 mice that survived for at least 400 days; the observation probabilities were set to zero for the remaining 48 mice that died or were censored before 400 days. To make the analysis comparable with that of Wang & Chen (2001), we included only the survival time and its quadratic term as predictors in our logistic regression model. Both analyses used all 204 mice and treated both virus level and GPD1 phenotype as the missing covariates. Table 2 shows the results from applying the pseudo-partial likelihood method and the inverse probability weighted method. The pseudo-partial likelihood method shows that virus level has a significant relationship with both endpoints, while GPD1 has a significant relationship with death by thymic or nonthymic leukaemia and a moderately significant relationship with death by thymic leukaemia. The inverse probability weighted method shows that GPD1 has a significant relationship with both endpoints and virus level has a moderately significant relationship with both endpoints. The inclusion of death by nonthymic leukaemia changes the estimates for GPD1 phenotype slightly, but moderately reduces the estimates for virus level. A similar phenomenon was also found by Qi et al. (2005). As in the Stanford heart transplant data analysis, the weight function is treated as known. If the weight function were treated as unknown, then the variance estimator in Theorem A2 would overestimate the true sample variance of Inline graphic.
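As a rough sketch of how observation probabilities of this form could be evaluated, the Python fragment below sets the weight to zero before 400 days and to a logistic function of survival time and its square from 400 days onwards. The function name observation_probability and the coefficients b0, b1 and b2 are hypothetical placeholders chosen for illustration; they are not the fitted values from this analysis.

```python
import math

CUTOFF_DAYS = 400.0  # covariates are only observable for mice surviving this long


def observation_probability(t, b0, b1, b2, cutoff=CUTOFF_DAYS):
    """Weight W(t): zero before the cutoff, logistic in t and t^2 afterwards."""
    if t < cutoff:
        return 0.0
    eta = b0 + b1 * t + b2 * t * t  # linear predictor with a quadratic term
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, for illustration only.
b0, b1, b2 = -1.84, 0.005, -1e-6

for days in (250.0, 400.0, 600.0):
    print(days, observation_probability(days, b0, b1, b2))
```

In practice the coefficients would come from fitting the logistic model to the indicator of complete observation among the mice at risk at 400 days.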

Table 2. Analysis of mouse leukaemia data using the Cox regression model with various methods: complete-case analysis, the pseudo-partial-likelihood estimator and the inverse probability weighted estimator. Entries are coefficient estimates with estimated standard errors (se) in parentheses.

                                Thymic leukaemia               Thymic and nonthymic leukaemia
Approach                        GPD1           Virus           GPD1           Virus
Complete-case                   −1.44 (0.60)   1.44 (0.72)     −1.46 (0.57)   1.22 (0.65)
Pseudo-partial likelihood       −1.15 (0.64)   1.51 (0.71)     −1.19 (0.59)   1.28 (0.62)
Inverse probability weighted    −1.47 (0.65)   1.48 (0.75)     −1.41 (0.63)   1.27 (0.67)

se: estimated standard error.
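The significance statements can be checked against Table 2 by forming Wald statistics, that is, each estimate divided by its standard error, together with two-sided normal p-values. The Python sketch below does this; the dictionary layout and the labels are our own, and the assignment of the second and third rows to the pseudo-partial likelihood and inverse probability weighted methods is assumed from the order in the caption.

```python
import math

# (estimate, se) pairs transcribed from Table 2; "both" = thymic or nonthymic.
table2 = {
    "complete-case": {
        "thymic": {"GPD1": (-1.44, 0.60), "virus": (1.44, 0.72)},
        "both": {"GPD1": (-1.46, 0.57), "virus": (1.22, 0.65)},
    },
    "pseudo-partial": {
        "thymic": {"GPD1": (-1.15, 0.64), "virus": (1.51, 0.71)},
        "both": {"GPD1": (-1.19, 0.59), "virus": (1.28, 0.62)},
    },
    "ipw": {
        "thymic": {"GPD1": (-1.47, 0.65), "virus": (1.48, 0.75)},
        "both": {"GPD1": (-1.41, 0.63), "virus": (1.27, 0.67)},
    },
}


def wald(estimate, se):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


results = {
    (method, endpoint, covariate): wald(*table2[method][endpoint][covariate])
    for method in table2
    for endpoint in table2[method]
    for covariate in table2[method][endpoint]
}

for key, (z, p) in results.items():
    print(key, round(z, 2), round(p, 3))
```

Because the table entries are rounded to two decimals, statistics close to |z| = 2 sit near the conventional 5% boundary, so conclusions drawn from the rounded values should be read with some care.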

If W(·) is a function of D, then Inline graphic is still a consistent estimator. Generally, however, under the same conditions, Inline graphic is not consistent, since E{Umc(β0)} ≠ 0. For generalization to the Cox model with missing covariates and a detailed simulation comparison of Inline graphic and Inline graphic, see Luo et al. (2009).

5. Discussion

The procedures developed in this paper can be easily extended to other types of incomplete data. The basis of our method is the conditional expectation of the partial score function given the data. For the proportional hazards model, if N(t) is completely known, the conditional expectation involves two terms; see equation (3). One is the conditional expectation of z, while the other is the conditional expectation of S(0)(β, t). If z is completely known, we only have to consider the conditional expectation of S(0). In other types of incomplete data, we may also have to compute the conditional expectation of z or ΔN(t) or both. For example, in proportional hazards models with missing covariates or with covariates measured with error, the covariates z are partially missing. Paik & Tsai (1997) applied a similar idea and proposed two estimators of β for proportional hazards models with missing covariates.

The estimators Inline graphic and Inline graphic proposed in § 3 are not nonparametric maximum likelihood estimators and, therefore, we generally do not expect them to be optimal. However, in the special case W(t, x) = W(t) and β0 = 0, the estimator Inline graphic is the nonparametric maximum likelihood estimator and is the most efficient estimator. Since Inline graphic and Inline graphic maximize the pseudo-partial likelihood, we expect the efficiency of these two estimators to be quite high. Empirical evidence from our limited simulation and real-data experiments suggests that Inline graphic is more efficient than Inline graphic.

Another special case is given by W(t, x) = t^x, f(t | z) = f0(t) and p = pr(x = 0) = 1 − pr(x = 1). The nonparametric maximum likelihood estimator Inline graphic of Inline graphic was studied by Vardi (1982). We computed the asymptotic relative efficiency of Inline graphic with respect to Inline graphic when F0 is a uniform distribution on [0, 1], for t and p ∈ {0.2, 0.4, 0.6, 0.8}. When p = 0 or 1, the product limit estimator based on Inline graphic is identical to the nonparametric maximum likelihood estimator, so that the asymptotic relative efficiency equals 1 for p = 0 and p = 1. The lowest asymptotic relative efficiency we obtained was 0.985. Since the estimator Inline graphic is much easier to calculate than the nonparametric maximum likelihood estimator, Inline graphic is a preferred estimator.
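To make this special case concrete, note that a weight of the form t^x leaves observations with x = 0 unbiased and makes observations with x = 1 length-biased. The display below writes out the two sampling densities, assuming the usual biased-sampling convention that the observed density is proportional to the weight times f0, and then evaluates them for F0 uniform on [0, 1]. The symbols g_0, g_1 and μ are our own notation.

```latex
\begin{align*}
g_0(t) &= \frac{t^{0} f_0(t)}{\int_0^\infty s^{0} f_0(s)\,ds} = f_0(t), \\
g_1(t) &= \frac{t\, f_0(t)}{\mu}, \qquad \mu = \int_0^\infty s\, f_0(s)\,ds . \\
\intertext{For $F_0$ uniform on $[0,1]$ we have $f_0(t) = 1$ and $\mu = 1/2$, so}
g_0(t) &= 1, \qquad g_1(t) = 2t, \qquad 0 \le t \le 1 .
\end{align*}
```

Under this reading, p is the proportion of the sample drawn from g_0 and 1 − p the proportion drawn from the length-biased density g_1.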

Acknowledgments

The author wishes to thank an associate editor and reviewers for suggestions leading to an overall improvement. He also appreciates Dr. Bruce Levin's comments and inputs. The paper was partially completed during the author's visit to the Department of Statistics, National Cheng Kung University, Taiwan.

Appendix

Large sample properties

In deriving equation (12), we explicitly assume that the weight function W(t, x) is a distribution function for any given x. However, after the likelihood (12) is obtained, we only need a weaker assumption, that the weight be nonnegative, to prove the asymptotic properties. Here, we assume that the weight function W(t,x) satisfies Assumption 1.

Assumption 1

For every fixed x, there exists a constant a(x) such that {t : W(t, x) > 0} = (a(x), ∞) or [a(x), ∞).

Assumption 1 does not require that W(·, x) be a nondecreasing function for any fixed x. The following four theorems describe the asymptotic properties of the estimators Inline graphic and Inline graphic.
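A small numerical illustration may help here. The Python sketch below checks, on a finite grid only, whether the set where a weight is positive looks like an upper interval. The helper positivity_is_upper_interval and the two example weights are hypothetical: w_ok is zero before 400 and then decreasing, so it is not monotone yet is consistent with the assumption, while w_bad has a gap in its positivity set.

```python
import math


def positivity_is_upper_interval(w, grid):
    """True if, along the sorted grid, w(t) > 0 is never followed by w(t) <= 0.

    This is a finite-grid check only; it cannot prove the assumption.
    """
    seen_positive = False
    for t in sorted(grid):
        if w(t) > 0:
            seen_positive = True
        elif seen_positive:
            return False
    return True


def w_ok(t):
    # Zero before 400, then positive and decreasing: not monotone overall.
    return 0.0 if t < 400 else math.exp(-(t - 400) / 200.0)


def w_bad(t):
    # Positive on [0, 100], zero on (100, 200), positive again afterwards.
    return 0.0 if 100 < t < 200 else 1.0


grid = range(0, 1001, 10)
ok_result = positivity_is_upper_interval(w_ok, grid)
bad_result = positivity_is_upper_interval(w_bad, grid)
print(ok_result, bad_result)
```

For w_ok the positivity set is [400, ∞), so a(x) = 400 in the notation of Assumption 1.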

Theorem A1

If Assumption 1 holds and the matrix I(β) is positive definite, Inline graphic converges to β0 in probability as n → ∞, where Inline graphic and Inline graphic.

Proof

If Assumption 1 holds, we have

[displayed equation graphic]

where Inline graphic. Then Inline graphic converges in probability to

[displayed equation graphic]

Hence, U(β0) → 0 in probability as n → ∞ and, therefore, Theorem A1 holds by a standard argument.

For asymptotic normality, we need to introduce more notation. Define

[displayed equation graphic]

where Inline graphic, and F1(t) = pr { Ti < t, I(Di = 1)}. Furthermore, we define

[displayed equation graphic]

Note that Inline graphic is obtained by substituting s(0), s(1), F1(t) and β0 by Inline graphic and Inline graphic, respectively, in ξi.

Theorem A2

Under the same conditions as in Theorem A1, Inline graphic converges weakly to a normal distribution with zero mean and covariance matrix Inline graphic, where Ξ = E(ξ^{⊗2}).

Proof

The score function n^{−1/2}U(β0) can be expressed as

Proof

By Taylor series expansion, the second term in the above equation can be written as

Proof

Hence, Inline graphic. By the multivariate central limit theorem and its corollary, n^{−1/2}U(β0) converges to a multivariate normal distribution, yielding Theorem A2.

Note that I(β0) can be consistently estimated by Inline graphic and Ξ can be consistently estimated by Inline graphic. Thus, the covariance matrix Inline graphic can be consistently estimated by

graphic file with name asp026eqn14.jpg (A1)
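A plug-in covariance estimate of this kind can be sketched in code. The Python fragment below assumes that the limiting covariance has the usual sandwich form I^{−1} Ξ I^{−1}, which is what a central limit theorem for the score combined with a Taylor expansion typically gives; this form is our assumption, not a transcription of (A1). The function names and the two-parameter inputs info_hat and xi_hat are hypothetical.

```python
def outer(v):
    """Outer product v v^T of a vector with itself."""
    return [[a * b for b in v] for a in v]


def mat_mul(a, b):
    """Product of two small dense matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("information matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def sandwich_covariance(info_hat, xi_hat):
    """Assumed sandwich form: inv(I) * mean(xi_i xi_i^T) * inv(I)."""
    n = len(xi_hat)
    xi_mat = [[0.0, 0.0], [0.0, 0.0]]
    for xi in xi_hat:
        o = outer(xi)
        for r in range(2):
            for s in range(2):
                xi_mat[r][s] += o[r][s] / n
    info_inv = inv2(info_hat)
    return mat_mul(mat_mul(info_inv, xi_mat), info_inv)


# Hypothetical inputs for a two-parameter model.
info_hat = [[2.0, 0.0], [0.0, 4.0]]
xi_hat = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
cov = sandwich_covariance(info_hat, xi_hat)
print(cov)
```

Under this assumed form, cov estimates the covariance of the root-n scaled estimator, so approximate standard errors for the coefficients would be the square roots of the diagonal entries divided by n.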

We now turn to a study of the large sample properties of the estimated baseline integrated hazard function Inline graphic.

Theorem A3

Let M be a large positive number such that pr (TM) is strictly positive. Under the same conditions as in Theorem A1, for t < M the process Inline graphic converges weakly to a Gaussian process with zero-mean and covariance function E0i(s0i(t)}, which can be consistently estimated by Inline graphic, where

[displayed equation graphic]

Theorem A4

Under the conditions of Theorem A1, the asymptotic covariance of Inline graphic and Inline graphic can be consistently estimated by Inline graphic.

Proofs of Theorem A3 and Theorem A4. We may write Inline graphic as

graphic file with name asp026ueq24.jpg

Here, the last term can be shown to converge to zero in probability. By Theorem A2, Inline graphic, and by Taylor series expansion around β0, the first term can be approximated by

graphic file with name asp026ueq25.jpg

The second term can be expressed as

graphic file with name asp026ueq26.jpg

Therefore, a simple application of the multivariate central limit theorem implies that the finite-dimensional distributions of {An(t), Bn(t), Cn(t)} are asymptotically multivariate normal. As in the proof in Tsiatis (1981), the sequence of distributions induced by An, Bn and Cn is tight.

References

  1. Andersen P. K., Borgan O., Gill R. D., Keiding N. Statistical Models Based on Counting Processes. New York: Springer; 1993.
  2. Betensky R. A., Lindsey J. C., Ryan L. M., Wand M. P. Local EM estimation of the hazard function for interval-censored data. Biometrics. 1999;55:238–45. doi: 10.1111/j.0006-341x.1999.00238.x.
  3. Binder D. A. Fitting Cox's proportional hazards models from survey data. Biometrika. 1992;79:139–47.
  4. Copas A. J., Farewell V. T. Incorporating retrospective data into an analysis of time to illness. Biostatistics. 2001;2:1–12. doi: 10.1093/biostatistics/2.1.1.
  5. Cox D. R. Regression models and life tables (with Discussion). J. R. Statist. Soc. B. 1972;34:187–220.
  6. Cox D. R. Partial likelihood. Biometrika. 1975;62:269–76.
  7. Crowley J., Hu M. Covariance analysis of heart transplant survival data. J. Am. Statist. Assoc. 1977;72:27–36.
  8. Dempster A. P., Laird N. M., Rubin D. B. Maximum likelihood estimation from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B. 1977;39:1–38.
  9. Horvitz D. G., Thompson D. J. A generalization of sampling without replacement from a finite universe. J. Am. Statist. Assoc. 1952;47:663–85.
  10. Hyde J. Survival analysis with incomplete observations. In: Miller R. G., Efron B., Brown B. W., Moses L. E., editors. Biostatistics Casebook. New York: John Wiley & Sons; 1980. pp. 31–46.
  11. Kalbfleisch J. D., Prentice R. L. The Statistical Analysis of Failure Time Data. 2nd ed. New York: Wiley; 2002.
  12. Keiding N., Gill R. D. Random truncation models and Markov processes. Ann. Statist. 1990;18:582–602.
  13. Lin D. Y. On fitting Cox's proportional hazards models to survey data. Biometrika. 2000;87:37–47.
  14. Luo X., Tsai W.-Y., Xu Q. Pseudo partial likelihood estimators for Cox regression with missing covariates. Biometrika. 2009;96. doi: 10.1093/biomet/asp027. (forthcoming)
  15. Miller R., Halpern J. Regression with censored data. Biometrika. 1982;69:521–31.
  16. Muttlak H. A., McDonald L. L. Ranked set sampling with size-biased probability of selection. Biometrics. 1990;46:435–46.
  17. Paik M. C., Tsai W.-Y. On using Cox proportional hazard models with missing covariate. Biometrika. 1997;84:579–93.
  18. Qi L., Wang C. Y., Prentice R. L. Weighted estimators for proportional hazards regression with missing covariates. J. Am. Statist. Assoc. 2005;100:1250–63.
  19. Tsai W.-Y., Jewell N. P., Wang M.-C. A note on the product limit estimate of a survival curve under right-censoring and left-truncation. Biometrika. 1987;74:883–6.
  20. Tsiatis A. A. A large sample study of Cox's regression model. Ann. Statist. 1981;9:91–108.
  21. Vardi Y. Nonparametric estimation in the presence of length bias. Ann. Statist. 1982;10:616–20.
  22. Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density. Biometrika. 1989;76:751–61.
  23. Wang C. Y., Chen H. Y. Augmented inverse probability weighted estimator for Cox missing covariate regression. Biometrics. 2001;57:414–9. doi: 10.1111/j.0006-341x.2001.00414.x.
  24. Wang M.-C. Nonparametric estimation from cross-sectional survival data. J. Am. Statist. Assoc. 1991;86:130–43.
  25. Wang M.-C. Hazards regression analysis for length-biased data. Biometrika. 1996;83:343–54.
  26. Wang M.-C., Jewell N. P., Tsai W.-Y. Asymptotic properties of the product limit estimate under random truncation. Ann. Statist. 1986;14:1597–605.

Articles from Biometrika are provided here courtesy of Oxford University Press
