Published in final edited form as: Stat Med. 2013 Jul 3;32(28):4980–4994. doi: 10.1002/sim.5885

A General Semiparametric Hazards Regression Model: Efficient Estimation and Structure Selection

Xingwei Tong, Liang Zhu, Chenlei Leng*, Wendy Leisenring, Leslie L. Robison
PMCID: PMC3913752  NIHMSID: NIHMS545123  PMID: 23824784

Abstract

We consider a general semiparametric hazards regression model that encompasses Cox's proportional hazards model and the accelerated failure time model for survival analysis. To overcome the nonexistence of the maximum likelihood estimator, we derive a kernel-smoothed profile likelihood function, and prove that the resulting estimates of the regression parameters are consistent and achieve semiparametric efficiency. In addition, we develop penalized structure selection techniques to determine which covariates constitute the accelerated failure time model and which covariates constitute the proportional hazards model. The proposed method estimates the model structure consistently and the model parameters efficiently. Furthermore, variance estimation is straightforward. The proposed method performs well in simulation studies and is applied to the analysis of a real data set.

Keywords: Accelerated failure time model, Cox’s proportional hazards model, Efficiency, Kernel-smoothed profile likelihood function, Model selection, Penalized likelihood

1. Introduction

The proportional hazards model proposed by Cox ([1], [2]) and the accelerated failure time (AFT) model are the two major approaches to analyzing censored data in survival studies [3]. Cox's model specifies that the covariates act multiplicatively on the hazard function, while the AFT model specifies that the effect of the covariates is multiplicative on time. Clearly, the covariates in these two models provide different measures of the effect on survival.

Efficient estimation and inference for the proportional hazards model are based on the partial likelihood, which is easy to implement and has attractive statistical properties [3]. On the other hand, efficient inference for the AFT model with unspecified error distribution is much more difficult ([4], [5], [6]). Jin et al. [7] proposed a linear-programming implementation of rank-based estimation, which can be suboptimal in terms of efficiency. Zeng and Lin [8] proposed an approximate nonparametric maximum likelihood method via maximizing a kernel-smoothed profile likelihood function, and showed that the resulting estimator is efficient. Lu [9] further studied this model with a cure fraction.

In practice, when given a dataset, we need to choose between Cox's model and the AFT model. A usual approach is to fit one of the two models and then conduct model testing to assess the lack of fit. However, due to finite sample sizes and other data characteristics, the ability to check the model assumptions may be limited. In addition, it is not unusual that both Cox's model and the AFT model fit the data reasonably well, if appropriate time-dependent covariates are included [8].

Instead of deciding which model to fit, we consider in this paper a general semiparametric hazards model first studied by Chen and Jewell [10]. This model encompasses Cox's model and the AFT model as special cases, and is more flexible in modeling survival data. Chen and Jewell used an estimating equations method for inference, which is not efficient in general. Numerical challenges also arise from the non-smoothness of their estimating equations, and Chen and Jewell confined application of their model to low-dimensional cases, possibly due to the computational complexity. We, on the other hand, propose a kernel-smoothed profile likelihood function aimed at maximum likelihood estimation, motivated by Zeng and Lin [8]. The maximization can be easily implemented using any optimization algorithm developed for smooth objective functions, such as the Newton-Raphson algorithm. We show that the resulting estimators are asymptotically normal and attain the semiparametric efficiency bound. The limiting covariance matrix of the estimates can be readily computed as an output of the algorithm. More importantly, we develop penalized model selection techniques to determine which covariates constitute the accelerated failure time model and which covariates contribute to the proportional hazards model. The resulting estimators are shown to have the oracle property ([11], [12]). Namely, the final chosen model correctly distinguishes Cox covariates and AFT covariates with probability tending to one, and the resulting estimates are as efficient as if we had used the true model structure for estimation and inference. Zhang and Lu [12] studied variable selection for the Cox model, while Cai et al. [13] studied it for the AFT model; see also Lu and Zhang [14] and Johnson [15]. Our model selection has the added implication of distinguishing these two kinds of covariates.

The rest of the paper is organized as follows. We describe the general semiparametric hazards model in Section 2 and discuss a smoothed profile likelihood function for estimation. Section 3 gives the large sample properties of the proposed estimators. We study penalized model selection techniques in Section 4. Simulation studies are presented in Section 5, followed by an analysis of Childhood Cancer Survivor Study data in Section 6. We make some concluding remarks in Section 7. All proofs are relegated to the Appendix.

2. A general hazards model

Let T denote the failure time of interest and X be the p-dimensional covariate that is time invariant. If T follows the proportional hazards model, then the conditional hazards function of T given X has the form

$$\lambda_{T|X}(t) = \lambda_0(t)\exp(X'\beta), \qquad (1)$$

where β is the regression parameter and λ0(t) is an unspecified baseline hazard function. If T follows the accelerated failure time model, then T satisfies that

$$\log T = -X'\beta + \varepsilon,$$

where $\varepsilon$ is a random error independent of X. If we denote the cumulative hazards function of $\varepsilon$ by $\Lambda_\varepsilon$ and set $\Lambda_0(\cdot) = \Lambda_\varepsilon(\log(\cdot))$, the cumulative hazards function of T given X in the AFT model takes the form

$$\Lambda_{T|X}(t) = \Lambda_0\big(t\exp(X'\beta)\big).$$

Differentiating both sides of the above equality, one obtains the conditional hazards function of T given X

$$\lambda_{T|X}(t) = \lambda_0\big(t\exp(X'\beta)\big)\exp(X'\beta). \qquad (2)$$

In this paper, we explore a general hazards regression model which includes Cox’s model and the AFT model as special cases. We assume that the conditional hazards function of T given X takes the form

$$\lambda_{T|X}(t) = \lambda_0\big(t\exp(X'\beta_1)\big)\exp(X'\beta_1)\exp(X'\beta_2), \qquad (3)$$

where β1 and β2 are two regression parameters. Alternatively, this general hazards model implies that the failure time T follows the model

$$\log T = -X'\beta_1 + \varepsilon, \qquad (4)$$

where $e^\varepsilon$ follows the proportional hazards model with cumulative hazards function $\Lambda_0(t)\exp(X'\beta_2)$.
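To make the equivalence between (3) and (4) explicit, we add a one-line check here: since (4) gives $e^\varepsilon = T\exp(X'\beta_1)$, the conditional survival function is $P(T > t \mid X) = P\big(e^\varepsilon > t\exp(X'\beta_1) \mid X\big)$, so that

$$\Lambda_{T|X}(t) = \Lambda_0\big(t\exp(X'\beta_1)\big)\exp(X'\beta_2),$$

and differentiating with respect to t recovers (3).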

The two parameters, β1 and β2, measure two different effects of the covariate on the survival time. The first parameter, β1, measures the rate at which the covariate accelerates or decelerates an individual's progression along the time line. The second parameter, β2, identifies multiplicative effects of the covariate on the hazard function after adjusting for β1. The model reduces to the accelerated failure time model when β2 = 0, and it becomes the proportional hazards model when β1 = 0. For a covariate Xj: if β1j = β2j = 0, this variable has no effect on the survival time; if β1j = 0 and β2j ≠ 0, it is a Cox covariate; if β1j ≠ 0 and β2j = 0, it is an AFT covariate; and if β1j ≠ 0 and β2j ≠ 0, it is a mixed covariate that contributes to both models. Clearly, this general hazards model is broader than the two individual models and is more flexible. The structure of the model raises two important questions: one is whether efficient estimation of β1 and β2 is possible, and the other is which covariates enter Cox's model and which covariates enter the accelerated failure time model.

Let C be the censoring time and τ the largest study time. Suppose that we have a data set with n subjects. For the ith subject, we observe the censored event time Yi = min(Ci, Ti), the censoring indicator δi = I(Ti ≤ Ci) and covariate Xi, where I(·) is the indicator function. We assume that Ci is independent of Ti conditional on Xi and that its distribution does not depend on β = (β1′, β2′)′. Then, given i.i.d. observations Oi = (Yi, δi, Xi), i = 1, ···, n, from O = (Y, δ, X), the log-likelihood function of β and λ is given by

$$\frac{1}{n}\sum_{i=1}^{n}\Big[\delta_i X_i'(\beta_1+\beta_2) + \delta_i\log\lambda\big(Y_i\exp(X_i'\beta_1)\big) - \exp(X_i'\beta_2)\,\Lambda\big(Y_i\exp(X_i'\beta_1)\big)\Big]. \qquad (5)$$

In principle, the maximum likelihood estimator would be obtained by maximizing the above log-likelihood function. An argument similar to that in Zeng and Lin [8] shows, however, that the maximum of (5) does not exist, because the maximizing Λ is very non-smooth. To overcome this difficulty, we use a piecewise constant function to approximate λ. That is, we divide the interval that contains all possible values of Yi exp(Xi′β1) into Kn equally spaced intervals and then assume that λ is constant on each interval. More specifically, we write

$$\lambda(t) = \sum_{k=1}^{K_n}\alpha_k\, I\big(t\in[t_{k-1},t_k)\big),$$

where 0 = t0 < t1 < ··· < t_{Kn} = η with $\eta = \sup_{i,\,\beta_1\in\mathcal{B}_1} Y_i\exp(X_i'\beta_1)$. Here $\mathcal{B}_1$ is a compact subset of $\mathbb{R}^p$, and α1, ···, α_{Kn} are the corresponding unknown constants on the intervals. Then the cumulative hazards function Λ can be written as

$$\Lambda(t) = \sum_{k=1}^{K_n}\alpha_k\,(t-t_k)\, I\big(t\in[t_{k-1},t_k)\big) + \frac{\eta}{K_n}\sum_{k=1}^{K_n}\alpha_k\, I(t\ge t_{k-1}).$$
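As a quick sanity check (our addition, not in the original), the following Python snippet verifies numerically that this closed form integrates the piecewise constant λ; the grid size and coefficients are arbitrary choices.

```python
import numpy as np

Kn, eta = 5, 2.0
tgrid = np.linspace(0.0, eta, Kn + 1)          # 0 = t_0 < ... < t_{Kn} = eta
alpha = np.random.default_rng(0).uniform(0.5, 2.0, Kn)

def lam(t):
    """Piecewise constant hazard: alpha_k on [t_{k-1}, t_k)."""
    k = np.clip(np.searchsorted(tgrid, t, side="right") - 1, 0, Kn - 1)
    return alpha[k]

def Lam(t):
    """Closed-form cumulative hazard from the display above (0-based k)."""
    k = min(np.searchsorted(tgrid, t, side="right") - 1, Kn - 1)
    return alpha[k] * (t - tgrid[k + 1]) + (eta / Kn) * alpha[: k + 1].sum()

for t in (0.3, 1.1, 1.9):
    s = np.linspace(0.0, t, 200001)            # left Riemann sum of lam
    assert abs(np.sum(lam(s[:-1]) * np.diff(s)) - Lam(t)) < 1e-3
```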

Therefore the log-likelihood function (5) can be re-expressed as

$$\frac{1}{n}\sum_{i=1}^{n}\delta_i X_i'(\beta_1+\beta_2) + \frac{1}{n}\sum_{k=1}^{K_n}\log\alpha_k\Big[\sum_{i=1}^{n}\delta_i\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big)\Big] - \frac{1}{n}\sum_{k=1}^{K_n}\alpha_k\Big[\sum_{i=1}^{n}\exp(X_i'\beta_2)\big(R_i(\beta_1)-t_k\big)\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big) + \frac{\eta}{K_n}\sum_{i=1}^{n}\exp(X_i'\beta_2)\, I\big(R_i(\beta_1)\ge t_{k-1}\big)\Big], \qquad (6)$$

where $R_i(\beta_1) = Y_i\exp(X_i'\beta_1)$.

Differentiating (6) with respect to αk and setting the derivative to zero, one obtains

$$\alpha_k = \frac{\sum_{i=1}^{n}\delta_i\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big)}{\sum_{i=1}^{n}\exp(X_i'\beta_2)\big(R_i(\beta_1)-t_k\big)\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big) + \dfrac{\eta}{K_n}\sum_{i=1}^{n}\exp(X_i'\beta_2)\, I\big(R_i(\beta_1)\ge t_{k-1}\big)}.$$

Therefore, the profile log-likelihood function of β is

$$l_n^p(\beta) = \frac{1}{n}\sum_{i=1}^{n}\delta_i X_i'(\beta_1+\beta_2) + \sum_{k=1}^{K_n}\Big[\frac{1}{n}\sum_{i=1}^{n}\delta_i\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big)\Big]\log\Big[\frac{K_n}{n\eta}\sum_{j=1}^{n}\delta_j\, I\big(R_j(\beta_1)\in[t_{k-1},t_k)\big)\Big] - \sum_{k=1}^{K_n}\Big[\frac{1}{n}\sum_{i=1}^{n}\delta_i\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big)\Big]\log\Big[\frac{K_n}{n\eta}\sum_{i=1}^{n}\exp(X_i'\beta_2)\big(R_i(\beta_1)-t_k\big)\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big) + \frac{1}{n}\sum_{i=1}^{n}\exp(X_i'\beta_2)\, I\big(R_i(\beta_1)\ge t_{k-1}\big)\Big]. \qquad (7)$$

The profile log-likelihood function is not smooth because of the many indicator functions in (7); as a result, direct maximization is difficult. As n → ∞ with Kn → ∞ and Kn/n → 0, we have

$$l_n^p(\beta) \to l(\beta) \quad \text{uniformly for } \beta\in\mathcal{B},$$

where

$$l(\beta) = E\bigg[\delta X'(\beta_1+\beta_2) + \delta\log\bigg(\frac{dP\big(\delta=1,\, R(\beta_1)\le t\big)/dt}{E\big[\exp(X'\beta_2)\, I\big(R(\beta_1)\ge t\big)\big]}\bigg)\bigg|_{t=R(\beta_1)}\bigg].$$

This motivates us to use a kernel-smoothed approximation to l(β), given by

$$l_n^s(\beta) = \frac{1}{n}\sum_{i=1}^{n}\delta_i X_i'(\beta_1+\beta_2) + \frac{1}{n}\sum_{i=1}^{n}\delta_i\log\Big[\frac{1}{nh}\sum_{j=1}^{n}\delta_j\, K\big((R_j(\beta_1)-R_i(\beta_1))/h\big)\Big] - \frac{1}{n}\sum_{i=1}^{n}\delta_i\log\Big[\frac{1}{n}\sum_{j=1}^{n}\exp(X_j'\beta_2)\int_{-(R_j(\beta_1)-R_i(\beta_1))/h}^{\infty}K(s)\,ds\Big], \qquad (8)$$

where h = h(n) is a bandwidth parameter converging to zero as n → ∞.
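For concreteness, the following Python sketch (ours, not the authors' code) evaluates (8) with a Gaussian kernel, for which the inner integral is the standard normal distribution function, and maximizes it with a generic smooth optimizer; all function and variable names are our own.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def smoothed_loglik(beta, Y, delta, X, h):
    """Kernel-smoothed log-likelihood l_n^s(beta) of (8); Gaussian kernel,
    so the integral of K from -u to infinity equals Phi(u)."""
    p = X.shape[1]
    b1, b2 = beta[:p], beta[p:]
    R = Y * np.exp(X @ b1)                 # R_i(beta1) = Y_i exp(X_i' beta1)
    D = (R[None, :] - R[:, None]) / h      # D[i, j] = (R_j - R_i) / h
    term1 = np.mean(delta * (X @ (b1 + b2)))
    dens = (delta[None, :] * norm.pdf(D)).mean(axis=1) / h   # (1/nh) sum_j ...
    risk = (np.exp(X @ b2)[None, :] * norm.cdf(D)).mean(axis=1)
    logs = delta * (np.log(np.maximum(dens, 1e-300))
                    - np.log(np.maximum(risk, 1e-300)))
    return term1 + np.mean(logs)

def fit(Y, delta, X, h):
    """Maximize l_n^s over beta = (beta1', beta2')' from a zero start."""
    p = X.shape[1]
    negloglik = lambda b: -smoothed_loglik(b, Y, delta, X, h)
    return minimize(negloglik, np.zeros(2 * p), method="BFGS").x
```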

The proposed estimator β̂n = (β̂1n′, β̂2n′)′ is obtained by maximizing the kernel-smoothed likelihood function lns(β) over β ∈ $\mathcal{B}$ for a compact set $\mathcal{B}\subset\mathbb{R}^{2p}$. Given β̂n, the cumulative hazard function can be estimated as

$$\hat\Lambda_n(t) = \int_0^t \frac{\frac{1}{nh}\sum_{i=1}^{n}\delta_i\, K\big((R_i(\hat\beta_{1n})-s)/h\big)}{\frac{1}{n}\sum_{i=1}^{n}\exp(X_i'\hat\beta_{2n})\int_{-(R_i(\hat\beta_{1n})-s)/h}^{\infty}K(u)\,du}\, ds, \qquad (9)$$

and the corresponding baseline hazard function can be estimated by

$$\hat\lambda_n(t) = \frac{\frac{1}{nh}\sum_{i=1}^{n}\delta_i\, K\big((R_i(\hat\beta_{1n})-t)/h\big)}{\frac{1}{n}\sum_{i=1}^{n}\exp(X_i'\hat\beta_{2n})\int_{-(R_i(\hat\beta_{1n})-t)/h}^{\infty}K(u)\,du}.$$

These are kernel-smoothed estimates of the hazard functions motivated by the need to smooth the likelihood.

For bandwidth selection, we can use K-fold cross-validation. In the current setting, we divide the data into K equal-sized groups. Let Dk denote the kth subgroup of the data; then the kth prediction error is given by

$$PE_k(h) = \sum_{i\in D_k}\Big[N_i(\tau) - \int_0^\tau I(Y_i\ge t)\exp\big(X_i'\hat\beta_2^{(-k)}\big)\, d\hat\Lambda_n\big(t\exp(X_i'\hat\beta_1^{(-k)})\big)\Big]^2$$

for k = 1, …, K, where β̂1^{(−k)} and β̂2^{(−k)} are the estimators of β1 and β2 based on the data without subgroup Dk. The optimal bandwidth is the minimizer of the total prediction error $PE(h) = \sum_{k=1}^{K} PE_k(h)$.
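A sketch of this selector is given below (again our code, reusing the hypothetical smoothed_loglik/fit helpers from the previous snippet); it approximates dΛ̂n(t exp(X′β̂1)) = λ̂n(t exp(X′β̂1)) exp(X′β̂1) dt on a fixed quadrature grid.

```python
def hazard_hat(t, Y, delta, X, b1, b2, h):
    """Gaussian-kernel version of the smoothed baseline hazard estimate."""
    R = Y * np.exp(X @ b1)
    num = np.mean(delta * norm.pdf((R - t) / h)) / h
    den = np.mean(np.exp(X @ b2) * norm.cdf((R - t) / h))
    return num / max(den, 1e-300)

def cv_bandwidth(Y, delta, X, h_grid, K=5, tau=None, n_quad=200):
    """Pick the h in h_grid minimizing PE(h) = sum over folds of PE_k(h)."""
    n, p = X.shape
    tau = float(Y.max()) if tau is None else tau
    folds = np.array_split(np.random.default_rng(0).permutation(n), K)
    tgrid = np.linspace(tau / n_quad, tau, n_quad)
    dt = tgrid[1] - tgrid[0]
    pe = []
    for h in h_grid:
        total = 0.0
        for idx in folds:
            tr = np.setdiff1d(np.arange(n), idx)     # fit without fold k
            bhat = fit(Y[tr], delta[tr], X[tr], h)
            b1, b2 = bhat[:p], bhat[p:]
            for i in idx:
                e1, e2 = np.exp(X[i] @ b1), np.exp(X[i] @ b2)
                lam = np.array([hazard_hat(t * e1, Y[tr], delta[tr],
                                           X[tr], b1, b2, h) for t in tgrid])
                # integral of I(Y_i >= t) e^{X_i'b2} dLambda_hat(t e^{X_i'b1})
                cum = np.sum((tgrid <= Y[i]) * e2 * lam * e1) * dt
                total += (delta[i] * (Y[i] <= tau) - cum) ** 2
        pe.append(total)
    return h_grid[int(np.argmin(pe))]
```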

3. Large sample property

In this section, we study the asymptotic behavior of the proposed estimators. The regularity conditions and proofs are given in the Appendix. First, we present the consistency result. Denote the true value of β by β0 = (β10′, β20′)′.

Theorem 1

Suppose that Conditions A1–A6 in the Appendix hold. If h = O(n^v) with −1/2 < v < 0, then β̂n → β0 almost surely.

Theorem 1 states the strong consistency of the proposed estimator β̂n. Intuitively, this can be seen from the uniform convergence of lns(β) over β ∈ $\mathcal{B}$ and the fact that β0 uniquely maximizes the limiting function of lns(β). For the asymptotic normality of the estimates, we have the following result.

Theorem 2

Suppose that Conditions A1–A6 in the Appendix hold. If the bandwidth satisfies nh^{2m} → 0 and nh^6 → ∞, then √n(β̂n − β0) converges weakly to a normal random variable with mean zero and covariance attaining the semiparametric efficiency bound for β0. Here m is defined in Condition A6 in the Appendix.

The inverse of the negative second derivative of lns(β) at β̂n can be used to estimate the asymptotic covariance of √n(β̂n − β0), because β̂n is efficient. We note that the estimating-equations-based estimator in Chen and Jewell [10] is not efficient in general. From this theorem, we see that oversmoothing is needed to obtain asymptotic normality. This result is similar to that in Zeng and Lin [8]. The cross-validation choice of the bandwidth, aiming at minimizing the expected prediction error, tends to choose a bandwidth that optimizes the overall performance with respect to the kernel-smoothed hazard function and the parameters, and thus does not satisfy the rate specified in Theorem 2. However, in our simulation studies and the data analysis, we observe that the difference between a prespecified bandwidth and a cross-validation-based bandwidth is often small. Zeng and Lin [8] also reported that their result was not sensitive to the bandwidth. We now connect our estimates to those of Cox's model and the AFT model. In the proof of Theorem 2 in the Appendix, we show that the score function Un(β) of β takes the form

$$U_n(\beta) = \frac{1}{n}\sum_{i=1}^{n}\dot l_\beta^s(O_i) + o_p(n^{-1/2}),$$

where

$$\dot l_\beta^s(O) = \int\bigg(X_2 - \frac{E\big[X_2\exp(X'\beta_{20})\, I(Y\ge t)\big]}{E\big[\exp(X'\beta_{20})\, I(Y\ge t)\big]}\bigg)\, dM(t) + \int\bigg(X_2 - \frac{E\big[X_2\exp(X'\beta_{20})\, I(Y\ge t)\big]}{E\big[\exp(X'\beta_{20})\, I(Y\ge t)\big]}\bigg)\,\frac{t\exp(X'\beta_{10})\,\dot\lambda\big(t\exp(X'\beta_{10})\big)}{\lambda\big(t\exp(X'\beta_{10})\big)}\, dM(t) + \int\bigg(X_3 - \frac{E\big[X_3\exp(X'\beta_{20})\, I(Y\ge t)\big]}{E\big[\exp(X'\beta_{20})\, I(Y\ge t)\big]}\bigg)\, dM(t), \qquad (10)$$

for X2 = (X′, 0p′)′ and X3 = (0p′, X′)′, with 0p being the p-dimensional zero vector, and dM(t) = dN(t) − exp(X′β20) I(Y ≥ t) dΛ(t exp(X′β10)), where N(t) = δI(Y ≤ t). Here and in the sequel, ḟ = df(t)/dt denotes the first derivative of f, provided it exists.

From (10), if the AFT model holds, that is β2 = 0, then the third term in (10) disappears. This gives the same score function as the AFT model derived in Zeng and Lin [8]. When the proportional hazards model holds, that is β1 = 0, then the first two terms in (10) disappear. The resulting score function of β2 reduces to

$$\frac{1}{n}\sum_{i=1}^{n}\int\bigg(X_i - \frac{E\big[X\exp(X'\beta_{20})\, I(Y\ge t)\big]}{E\big[\exp(X'\beta_{20})\, I(Y\ge t)\big]}\bigg)\, dM_i(t),$$

which coincides with the score function derived from the partial likelihood function in the proportional hazards model. These arguments illustrate clearly that the efficient score functions used in Cox’s and the AFT models are special cases of those of the proposed general model.

To analyze the asymptotic properties of Λ̂n(·), we write R0 = Y exp(X′β10), G0(s) = E[exp(X′β20) I(R0 > s)], G1(s) = E[exp(X′β20) I(R0 > s) X2], G2(s) = G1(s)/G0(s), and G3(s) = E[exp(X′β20) I(R0 > s) X3]. Denote by f_{X|Y}(x) the conditional density function of X given Y.

Theorem 3

If the conditions in Theorem 2 hold, then √n(Λ̂n(t) − Λ0(t)) converges weakly to a zero-mean Gaussian process with covariance function E[H(O; t)H(O; s)] at (t, s) as n tends to infinity, where

$$H(O;t) = \int_0^t \frac{\delta f_{R_0}^{\delta,X}(s) - \lambda_0(s)\exp(X'\beta_{20})\, I(R_0>s)}{G_0(s)}\, ds - \int_0^t \frac{\big[G_1(s)+G_3(s)\big]\lambda_0(s) + s\, G_1(s)\,\dot\lambda_0(s)}{G_0(s)}\, ds \times\Big[E\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\Big]^{-1}\dot l_\beta^s(O).$$

4. Model selection

Model selection is important in regression analysis, and many procedures have been developed. In this section, we apply the model selection idea to determine the model structure: the goal is to decide which covariates play a role in Cox's model and which in the accelerated failure time model, thereby elucidating the roles of the covariates. To this end, we propose to maximize the penalized smoothed profile likelihood function

$$Q(\beta) = l_n^s(\beta) - \sum_{m=1}^{2}\sum_{j=1}^{p} p_{w_{mj}}\big(|\beta_{mj}|\big),$$

where p(·) is a penalty function and wmj, m = 1, 2; j = 1, ···, p are the tuning parameters.

Denote by β̃n the maximizer of Q(β) over β ∈ $\mathcal{B}$. Let V(β0) = E[l̇βs(O) l̇βs(O)′] = (Vij), evaluated at β = β0, be the information matrix. To derive the asymptotic properties of β̃n, denote the zero index set by B0 = {mj : β_{mj}^{(0)} = 0} and the nonzero index set by B1 = {mj : β_{mj}^{(0)} ≠ 0}, where β_{mj}^{(0)} is the jth component of β_{m0}. For a subset B of {11, ···, 1p, 21, ···, 2p}, let βB = {β_{mj}, mj ∈ B} be a sub-vector of β. We have the following theorem, which states the structure selection consistency, also known as the oracle property [11]. Since the conditions of Theorem 1 in [16] all hold, one obtains the following oracle properties.

Theorem 4

Suppose that the conditions of Theorem 2 and Condition A7 hold and that, for m = 1, 2 and j = 1, ···, p, wmj → 0 and √n·wmj → ∞ as n → ∞. Then we have

  1. Selection consistency: β̃nB0 = 0 with probability tending to one;

  2. Asymptotic normality: √n(β̃nB1 − β0B1) → N(0, {V(β0B1)}^{−1}) in distribution, where V(β0B1) is the Fisher information matrix for the nonzero set B1.

Condition A7 puts requirements on the penalty functions that can be used in Q(β). Here we use the smoothly clipped absolute deviation (SCAD) penalty function, whose first derivative is given by

$$\dot p_w(\theta) = w\Big\{I(\theta\le w) + \frac{(aw-\theta)_+}{(a-1)w}\, I(\theta>w)\Big\}, \qquad (11)$$

where a = 3.7 is a constant and c+ = max(c, 0) for any real scalar c. The SCAD penalty satisfies Condition A7 in the Appendix and enjoys the favorable properties outlined in Fan and Li [11]. Alternatively, we may use the adaptive Lasso penalty [17], an extension of the Lasso [18]. Setting wmj = w, we choose w to minimize the Bayesian information criterion (BIC)

$$BIC(w) = -l_n^s(\tilde\beta_w) + \#\tilde\beta_w\cdot\log(n)/n,$$

where #β̃w denotes the number of nonzero coefficients in β̃w. Following Wang, Li and Tsai [19], we can show that the resulting penalized estimators are model selection consistent. The asymptotic covariance matrix of β̃nB1 can thus be consistently estimated by n^{−1}{V(β̃nB1)}^{−1}. Since the proof of this theorem is now standard [12], we omit it to save space.
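As an illustration (our sketch, not the authors' implementation), the SCAD derivative (11) and the BIC criterion can be coded directly; fit_penalized is an assumed helper that maximizes Q(β) for a given w, and smoothed_loglik is the helper sketched in Section 2.

```python
import numpy as np

def scad_deriv(theta, w, a=3.7):
    """First derivative (11) of the SCAD penalty, evaluated at |theta|."""
    t = np.abs(theta)
    return w * ((t <= w).astype(float)
                + np.maximum(a * w - t, 0.0) / ((a - 1) * w) * (t > w))

def bic(w, Y, delta, X, h, fit_penalized):
    """BIC(w) = -l_n^s(beta_w) + (#nonzero) * log(n) / n."""
    beta_w = fit_penalized(Y, delta, X, h, w)     # assumed penalized fitter
    k = int(np.sum(np.abs(beta_w) > 1e-8))        # count nonzero coefficients
    return -smoothed_loglik(beta_w, Y, delta, X, h) + k * np.log(len(Y)) / len(Y)
```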

5. Simulation

We conduct simulation studies in this section to evaluate the finite sample performance of the proposed method, comparing it to the estimating equations method of Chen and Jewell [10] as well as to Cox's model and the AFT model. We also assess the accuracy of the inference procedure and the performance in model structure determination.

Example 1

In the first study, we compare the performance of our proposed estimator with the one proposed by Chen and Jewell [10]. For that purpose, we simulate data motivated by the simulation setup in Chen and Jewell [10]. In particular, we consider the one-covariate case where Xi follows the Bernoulli distribution with success probability 0.5. The censoring time Ci follows the uniform distribution U(0, τ) with τ = 9. Given Xi, we generate the failure time Ti under model (3) with λ0(t) = 1/(bt + 1) and different values of β1 and β2. For this simulation, we use four configurations for (β1, β2)′: (1, −1)′, (1, 0)′, (0, 1)′, and (0, 0)′. When β10 = 0, Cox's model should fit the data very well, and when β20 = 0, the AFT model should fit the data very well. When both β10 and β20 are 0, both the AFT model and Cox's model fit the data well. Two sample sizes, n = 100 and n = 200, are used. Each simulation setup is repeated 1000 times.
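The failure times in this setup can be drawn by inverting the conditional cumulative hazard exp(X′β2)Λ0(t exp(X′β1)), with Λ0(u) = b^{−1} log(1 + bu), at a standard exponential draw; a minimal sketch (our code) is:

```python
import numpy as np

def generate(n, b1, b2, b=1.0, tau=9.0, seed=1):
    """Simulate (Y, delta, X) under model (3) with lambda_0(t) = 1/(b t + 1)."""
    rng = np.random.default_rng(seed)
    X = rng.binomial(1, 0.5, (n, 1)).astype(float)   # Bernoulli(0.5) covariate
    xb1, xb2 = X[:, 0] * b1, X[:, 0] * b2
    E = rng.exponential(1.0, n)          # Lambda(T | X) is Exp(1) distributed
    v = E * np.exp(-xb2)                 # so Lambda0(T exp(xb1)) = v
    T = np.exp(-xb1) * np.expm1(b * v) / b
    C = rng.uniform(0.0, tau, n)         # censoring time ~ U(0, tau)
    return np.minimum(T, C), (T <= C).astype(int), X
```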

The results for the proposed method, Cox's model and the AFT model (the latter using the efficient estimation approach of Zeng and Lin [8]) are reported in Table 1. In our simulation studies, we take h = 0.85n^{−1/7}. For our proposed method, we report the average bias (Bias), the sample standard deviation of the estimates across 1000 replications (SSD), the average of the estimated standard errors (ESE), and the 95% empirical coverage probability (CP). For the other methods, we provide only the Bias and SSD. These results indicate that the proposed estimation procedure gives almost unbiased estimates. The ESEs are all reasonably close to the SSDs, and the coverage probability is close to the nominal 95%, indicating that the proposed estimation and inference procedures work well for finite samples. The Cox model works well when β10 = 0, but has significant bias when β10 is not 0. Similarly, the AFT model works well only when β20 = 0, and has significant bias when β20 is not 0. One reviewer asked why the bias and SSD of the proposed estimator differ from those of the Cox (AFT) model when β10 = 0 (β20 = 0). The reason is that, in that situation, the Cox model estimates only β2 while the proposed method still estimates both parameters. If one applies variable selection in the next step, then β1 will be set to 0 with high probability, and the estimates for β2 should be the same as those from the Cox model. Compared with the estimators proposed by Chen and Jewell [10], the biases of both estimators are very small, but our estimators generally have smaller SSDs. This implies that our estimators are more efficient, in agreement with Theorem 2. We also ran the simulation with b = 0.5 and obtained similar results (not shown). To evaluate the performance in estimating Λ0, we plot the estimated curves in Figure 1 for b = 0.5, 1 and n = 100, 200. Figure 1 shows that the estimated curves are very close to the true curve of Λ0(t). In terms of bandwidth selection, we also tried K-fold cross-validation to choose the optimal bandwidth, and the results were similar.

Table 1.

Summary of simulation studies in Example 1 with hn = 0.85n^{−1/7}.

| β | True | Proposed Bias | Proposed SSD | Proposed ESE | Proposed CP | CJ Bias | CJ SSD | Cox Bias | Cox SSD | AFT Bias | AFT SSD |
|----|------|------|------|------|------|------|------|------|------|------|------|
| n = 100 | | | | | | | | | | | |
| β1 | 1 | 0.051 | 0.536 | 0.534 | 0.965 | −0.016 | 0.572 | - | - | −0.714 | 0.930 |
| β2 | −1 | 0.007 | 0.359 | 0.391 | 0.974 | 0.011 | 0.506 | 0.517 | 0.255 | - | - |
| β1 | −1 | −0.018 | 0.547 | 0.538 | 0.97 | −0.104 | 0.606 | - | - | −0.058 | 0.411 |
| β2 | 0 | 0.007 | 0.345 | 0.347 | 0.968 | −0.025 | 0.508 | −0.652 | 0.254 | - | - |
| β1 | 0 | −0.069 | 0.550 | 0.538 | 0.953 | −0.102 | 0.488 | - | - | 0.992 | 0.843 |
| β2 | 1 | −0.006 | 0.398 | 0.433 | 0.970 | 0.043 | 0.465 | 0.017 | 0.244 | - | - |
| β1 | 0 | 0.039 | 0.535 | 0.501 | 0.943 | −0.093 | 0.512 | - | - | 0.024 | 0.396 |
| β2 | 0 | −0.016 | 0.352 | 0.378 | 0.968 | 0.024 | 0.440 | 0.060 | 0.236 | - | - |
| n = 200 | | | | | | | | | | | |
| β1 | 1 | −0.001 | 0.456 | 0.402 | 0.939 | −0.009 | 0.509 | - | - | −1.114 | 0.828 |
| β2 | −1 | 0.030 | 0.289 | 0.284 | 0.941 | −0.003 | 0.412 | 0.516 | 0.178 | - | - |
| β1 | −1 | 0.029 | 0.444 | 0.408 | 0.950 | −0.058 | 0.552 | - | - | −0.045 | 0.301 |
| β2 | 0 | −0.050 | 0.311 | 0.314 | 0.954 | 0.076 | 0.439 | −0.653 | 0.179 | - | - |
| β1 | 0 | 0.032 | 0.438 | 0.405 | 0.942 | −0.097 | 0.417 | - | - | 1.271 | 0.389 |
| β2 | 1 | −0.061 | 0.309 | 0.316 | 0.946 | 0.046 | 0.39 | 0.004 | 0.171 | - | - |
| β1 | 0 | 0.014 | 0.425 | 0.376 | 0.915 | −0.067 | 0.393 | - | - | 0.006 | 0.295 |
| β2 | 0 | −0.012 | 0.279 | 0.276 | 0.956 | 0.031 | 0.343 | −0.005 | 0.169 | - | - |

CJ's method refers to the estimator proposed by Chen and Jewell (2001) with the function G(t; Z, β), required for defining the estimating equations, specified as tZ.

Figure 1.


Estimated curves of Λ0(t) = b^{−1} log(1 + bt) with b = 0.5 and b = 1 for n = 100 (left panel) and n = 200 (right panel). Solid line: true curve Λ0(t) with b = 0.5 (upper) and b = 1 (lower); dashed line: the proposed estimated curve Λ̂n(t) with b = 0.5 (upper) and b = 1 (lower).

Example 2

In the second study, we consider a setup similar to the first except that there are six covariates, generated from a multivariate normal distribution with mean zero and covariance between Xi and Xj equal to ρ^{|i−j|}. For this example, we take ρ = 0, 0.2, 0.5, 0.8. The true coefficients are β10 = (1, 1, 1, 0, 0, 0)′ and β20 = (0, 0, 0, 1, 1, 1)′. The largest follow-up time is taken to be τ = 6, and the censoring time is exponentially distributed with rate 4, which gives a censoring rate of around 45%. We apply the variable selection method with the SCAD penalty and compare the result with the estimate without penalization and with the estimate assuming the true structure is known. We also tried other penalties, such as the LASSO and the hard thresholding penalty (HARD), and obtained similar results. Table 2 reports the relative error, defined as the median ratio of the MSE to the MSE of the method without variable selection. The average numbers of zero coefficients are also reported in Table 2, in which the column labeled "Correct" presents the average number of true zero coefficients correctly estimated as 0, and the column labeled "Incorrect" presents the average number of nonzero coefficients erroneously estimated as 0.

Table 2.

Summary of simulation studies in Example 2, based on 1000 replications with sample sizes n = 100 and 200. Setting I: β0 = (1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1)′; Setting II: β0 = (1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)′. "Correct" and "Incorrect" give the average numbers of zero and nonzero coefficients estimated as 0, respectively.

| n | ρ | Method | Rel. Error (I) | Correct (I) | Incorrect (I) | Rel. Error (II) | Correct (II) | Incorrect (II) |
|---|---|---|---|---|---|---|---|---|
| 100 | 0 | PROPOSED | 0.422 | 5.76 | 0.68 | 0.861 | 7.56 | 0.54 |
| | | ORACLE | 0.403 | 6 | 0 | 0.404 | 8 | 0 |
| | 0.2 | PROPOSED | 0.507 | 5.72 | 0.90 | 0.883 | 7.48 | 0.57 |
| | | ORACLE | 0.403 | 6 | 0 | 0.431 | 8 | 0 |
| | 0.5 | PROPOSED | 0.549 | 5.71 | 0.87 | 0.861 | 7.56 | 0.54 |
| | | ORACLE | 0.355 | 6 | 0 | 0.404 | 8 | 0 |
| | 0.8 | PROPOSED | 0.460 | 5.80 | 0.76 | 0.883 | 7.48 | 0.57 |
| | | ORACLE | 0.442 | 6 | 0 | 0.431 | 8 | 0 |
| 200 | 0 | PROPOSED | 0.351 | 5.82 | 0.77 | 0.737 | 7.68 | 0.55 |
| | | ORACLE | 0.321 | 6 | 0 | 0.327 | 8 | 0 |
| | 0.2 | PROPOSED | 0.365 | 5.84 | 0.69 | 0.819 | 7.63 | 0.72 |
| | | ORACLE | 0.354 | 6 | 0 | 0.297 | 8 | 0 |
| | 0.5 | PROPOSED | 0.484 | 5.87 | 0.80 | 0.761 | 7.69 | 0.58 |
| | | ORACLE | 0.385 | 6 | 0 | 0.290 | 8 | 0 |
| | 0.8 | PROPOSED | 0.322 | 5.86 | 0.88 | 0.841 | 7.52 | 0.72 |
| | | ORACLE | 0.295 | 6 | 0 | 0.305 | 8 | 0 |

We can see from Table 2 that the proposed method outperforms the method without penalization. This improvement is more pronounced when the sample size is large. The proposed method also performs well in structure selection, as the number of correctly estimated zeros is close to the truth and the number of incorrectly estimated zeros is close to 0. In addition, the proposed method gives a relative error very close to the oracle relative error, especially when the sample size is large. This empirically confirms the theoretical results in Theorem 4. We also provide simulation results for the case where some covariates appear in both the AFT and Cox parts. Specifically, we set β10 = (1, 1, 1, 1, 0, 0)′ and β20 = (0, 0, 1, 1, 1, 1)′. From Table 2, we can see that the performance in terms of relative errors and variable selection is still satisfactory, although the relative errors are worse than when no covariates overlap between the AFT and Cox models.

6. Application

We apply the proposed methods to a subset of data from the Childhood Cancer Survivor Study (CCSS). This study followed around 14000 survivors who were diagnosed with childhood cancers and survived at least 5 years after the primary diagnosis, to investigate late effects of cancer and cancer treatments ([20], [21]). The diagnoses include leukemia, central nervous system (CNS) malignancy, Hodgkin lymphoma, non-Hodgkin lymphoma, Wilms tumor, neuroblastoma, soft-tissue sarcoma and bone tumors. During the follow-up, detailed demographic and treatment information was collected via questionnaires, and the date of death was obtained from the death certificate if a patient died during the study.

Among all diagnoses, patients with Hodgkin lymphoma (HL) had the highest death rate during the follow-up (16.8%), so we focus on the subgroup of survivors who were diagnosed with HL during their childhood and study the risk factors for death. In total there were 1559 HL survivors. For survivors who were alive at the end of the study, the dates of last contact were used as censoring times. The covariates of interest are age at diagnosis (X1, years), gender (X2: 0 = male, 1 = female), radiation dose in the cancer treatment (X3: 0 if < 2500 cGy, 1 if ≥ 2500 cGy) and cumulative anthracycline exposure (X4, g/m2). We use model (3) to fit the data. Table 3 shows the estimation results without penalty and with the SCAD penalty. The SCAD penalty here is $\dot p_{w_j}(|\beta_j|) = w_j\{I(|\beta_j|\le w_j) + \frac{(a w_j - |\beta_j|)_+}{(a-1)w_j} I(|\beta_j| > w_j)\}$, with wj = w/|β̃j|, a = 3.7, w = 0.3 and β̃j being the jth component of the estimator without penalty. Using K-fold cross-validation, the bandwidth is chosen to be 0.2. From Table 3, anthracycline exposure plays an important role in the AFT part of the model, since it is the only significant covariate in β1. On the other hand, radiation dose is important in the Cox part, since the other covariates in β2 are insignificant. More specifically, higher cumulative anthracycline dose and higher radiation exposure are likely to increase the risk of death in HL survivors, whereas age at diagnosis and gender do not have a significant influence. We also note that the standard errors in the penalized model are markedly smaller than those in the unpenalized model.

Table 3.

Estimation results for the Childhood Cancer Survivor Study with standard errors in parentheses.

| Parameter | Covariate | Without penalty | With penalty |
|---|---|---|---|
| β1 | Age | −0.041 (0.020) | - |
| | Gender | −0.529 (0.139) | - |
| | Radiation dose | −0.888 (0.256) | - |
| | Anthracycline exposure | 3.244 (0.627) | 4.016 (0.342) |
| β2 | Age | 0.125 (0.034) | - |
| | Gender | 0.591 (0.249) | - |
| | Radiation dose | 1.824 (0.354) | 1.028 (0.192) |
| | Anthracycline exposure | −0.344 (1.053) | - |

Conditional on the covariate information X, one can estimate the conditional survival function by exp{−Λ̂n(t exp(X′β̂1n)) exp(X′β̂2n)}. We can then estimate the marginal survival function for a subgroup by averaging the conditional survival function estimates, as sketched below. In Figure 2, we plot the estimated marginal survival functions for two groups: one with X4 = 0, corresponding to those without exposure to anthracycline, and the other with X4 > 0, corresponding to those with exposure to anthracycline. For comparison, we plot the nonparametric Kaplan-Meier estimates, our proposed estimates and the estimates of Chen and Jewell [10] for the marginal survival functions of these two subgroups. From Figure 2, one can see that our proposed estimates agree better with the Kaplan-Meier estimates. The large discrepancy between Chen and Jewell's estimates and the Kaplan-Meier estimates is probably due to numerical instability in implementing their algorithm.
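A minimal sketch of this averaging (our code; Lambda_hat stands for a vectorized version of the estimate in (9), and b1, b2 for the fitted coefficients) is:

```python
import numpy as np

def marginal_survival(tgrid, Xg, b1, b2, Lambda_hat):
    """Average exp(-Lambda_hat(t exp(X'b1)) exp(X'b2)) over subgroup rows Xg."""
    e1, e2 = np.exp(Xg @ b1), np.exp(Xg @ b2)
    S = np.array([np.exp(-Lambda_hat(tgrid * e1[i]) * e2[i])
                  for i in range(len(Xg))])
    return S.mean(axis=0)                # marginal survival on the time grid
```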

Figure 2.


Estimated marginal survival functions for X4 > 0 (left panel) and X4 = 0 (right panel). Solid line: Kaplan-Meier estimates; Dashed line: Chen and Jewell’s estimates; Dotted lines: Our proposed estimates.

7. Conclusion

We have proposed an efficient procedure for estimation and inference in a general semiparametric hazards model and studied penalized likelihood methods for model structure selection. Compared to the estimation approach in Chen and Jewell [10], our proposal is semiparametric efficient and enjoys better numerical properties. In addition, the inference procedure is much simpler than the approach taken by Chen and Jewell [10]. Our method is potentially useful, especially at the initial model building stage, for elucidating the roles of the covariates in time-to-event analysis.

In this paper, we have focused on univariate failure times. An extension to multivariate failure time models would be useful. Another interesting topic worth pursuing is to include a cure fraction [9], which is more natural for many problems.

Appendix

To derive the asymptotic results for the proposed estimator, we need the following regularity conditions. For a function h, let ḣ be its first derivative and ḧ its second derivative. Throughout the paper, we assume that the identifiability condition outlined in Chen and Jewell ([10], Proposition 1) holds.

  • A1

    The true value β0 = (β10′, β20′)′ belongs to the interior of a known compact set $\mathcal{B}$ in R2p.

  • A2

    Conditional on X, C is independent of T and ε, and X is bounded.

  • A3

    The censoring time C has a positive and twice continuously differentiable density in [0, τ] and inf P (Ci = τ|Xi ) > 0.

  • A4

    For t ≥ 0, λ0(t) is positive and thrice continuously differentiable and λ̇0(0) > 0.

  • A5

    If there exists a vector β and one deterministic function G, such that Xβ = G(ε) with probability one, then β = 0 and G = 0.

  • A6

    The kernel function K(·) is thrice continuously differentiable and its rth (r = 0, 1, 2, 3) derivatives have bounded variation. In addition, the first (m − 1) moments of the mth derivative of K(·) are 0 for some m > 3.

  • A7

$$\liminf_{n\to\infty}\,\liminf_{\theta\to 0+}\,\dot p_{w_{jn}}(\theta)/w_{jn} > 0, \qquad \max_{j\in B_1}\,\dot p_{w_{jn}}\big(|\beta_j^{(0)}|\big) = O(n^{-1/2}), \qquad \max_{j\in B_1}\,\ddot p_{w_{jn}}\big(|\beta_j^{(0)}|\big) = o(1).$$

For i = 1, ···, n, define the counting processes Ni(t) = δi I(Yi ≤ t) and the corresponding martingales

$$M_i(t) = N_i(t) - \int_0^t I(Y_i\ge s)\exp(X_i'\beta_{20})\, d\Lambda_0\big(s\exp(X_i'\beta_{10})\big)$$

for t ∈ [0, τ].

Proof of Theorem 1

The uniform convergence of lns(β) to l(β) over β ∈ $\mathcal{B}$ follows from the following two results

$$\sup_{\beta_1\in\mathcal{B}_1,\, s\in[0,\tau]}\Big|\frac{1}{nh}\sum_{i=1}^{n}\delta_i\, K\big((R_i(\beta_1)-s)/h\big) - dP\big(\delta=1,\, R_i(\beta_1)\le s\big)/ds\Big| \to 0, \quad a.s.$$

and

$$\sup_{\beta\in\mathcal{B},\, s\in[0,\tau]}\Big|\frac{1}{n}\sum_{j=1}^{n}\exp(X_j'\beta_2)\int_{-(R_j(\beta_1)-s)/h}^{\infty}K(t)\,dt - E\big[\exp(X'\beta_2)\, I\big(R(\beta_1)\ge s\big)\big]\Big| \to 0, \quad a.s.$$

Now we show that β0 is the unique maximizer of l(β). To this end, define

$$\Lambda(t;\beta) = \int_0^t \frac{dP\big(\delta=1,\, R(\beta_1)\le s\big)/ds}{E\big[\exp(X'\beta_2)\, I\big(R(\beta_1)>s\big)\big]}\, ds.$$

Note that

$$\Lambda(t;\beta_0) = \int_0^t \frac{E\, dN\big(s\exp(-X'\beta_{10})\big)}{E\big[\exp(X'\beta_{20})\, I\big(R(\beta_{10})>s\big)\big]} = \Lambda_0(t)$$

and

$$E\big[\exp(X'\beta_2)\,\Lambda\big(R(\beta_1);\beta\big)\big] = P(\delta=1).$$

Suppose that β maximizes l(β), that is

$$\delta X'(\beta_1+\beta_2) + \delta\log\lambda\big(R(\beta_1);\beta\big) - \exp(X'\beta_2)\,\Lambda\big(R(\beta_1);\beta\big) \ \ge\ \delta X'(\beta_{10}+\beta_{20}) + \delta\log\lambda_0\big(R(\beta_{10})\big) - \exp(X'\beta_{20})\,\Lambda_0\big(R(\beta_{10})\big),$$

where λ(t; β) = dΛ(t; β)/dt.

Taking expectation on both sides and then using the nonnegativity of Kullback-Leibler information, one obtains that

$$\exp\big[\delta X'(\beta_1+\beta_2) + \delta\log\lambda\big(R(\beta_1);\beta\big) - \exp(X'\beta_2)\,\Lambda\big(R(\beta_1);\beta\big)\big] = \exp\big[\delta X'(\beta_{10}+\beta_{20}) + \delta\log\lambda_0\big(R(\beta_{10})\big) - \exp(X'\beta_{20})\,\Lambda_0\big(R(\beta_{10})\big)\big].$$

First, letting δ = 0 and Y = τ, we obtain exp(−exp(X′β2)Λ(τ exp(X′β1); β)) = exp(−exp(X′β20)Λ0(τ exp(X′β10))). Letting δ = 1 and integrating over Y from y to τ, one obtains that

$$\exp\big(-\exp(X'\beta_2)\,\Lambda\big(y\exp(X'\beta_1);\beta\big)\big) - \exp\big(-\exp(X'\beta_2)\,\Lambda\big(\tau\exp(X'\beta_1);\beta\big)\big) = \exp\big(-\exp(X'\beta_{20})\,\Lambda_0\big(y\exp(X'\beta_{10})\big)\big) - \exp\big(-\exp(X'\beta_{20})\,\Lambda_0\big(\tau\exp(X'\beta_{10})\big)\big).$$

These two equalities yield that

$$\exp(X'\beta_2)\,\Lambda\big(y\exp(X'\beta_1);\beta\big) = \exp(X'\beta_{20})\,\Lambda_0\big(y\exp(X'\beta_{10})\big).$$

Then there exists an increasing and differentiable function H0, such that

$$T\exp(X'\beta_1) = H_0\Big(\exp\big(X'(\beta_{20}-\beta_2)\big)\,\Lambda_0\big(\exp(\varepsilon)\big)\Big). \qquad (A.1)$$

From the definitions of T and ε, one obtains that T = exp(−X′β10) exp(ε), and thus X′(β1 − β10) + ε = H1(X′(β20 − β2) + H2(ε)), where H1(·) = log H0(exp(·)) and H2(·) = log(Λ0(exp(·))). Taking derivatives on both sides with respect to ε, one obtains that X′(β20 − β2) = H3(ε) for some function H3, which yields β2 = β20 by assumption A5. Using assumption A5 again, together with (A.1), this result gives X′(β1 − β10) = H1(H2(ε)) − ε and hence β1 = β10. This completes the proof.

Proof of Theorem 2

Let X1 = (X′, X′)′, X2 = (X′, 0p′)′ and X3 = (0p′, X′)′, where 0p denotes the p-dimensional zero vector. Let Pn and P represent the empirical measure and the probability measure, respectively. Then the score function of β is

$$U_n(\beta) = \mathbb{P}_n\Big[\delta X_1 + \delta\,\frac{A_{3n}(Y,X;\beta)}{A_{1n}(Y,X;\beta)} - \delta\,\frac{A_{4n}(Y,X;\beta)}{A_{2n}(Y,X;\beta)}\Big],$$

where Akn(y, x; β) = PnAki (y, x; β), k = 1, 2, 3, 4 and

$$A_{1i}(y,x;\beta) = \frac{1}{h}\,\delta_i\, K\big((R_i(\beta_1)-R(y,x;\beta_1))/h\big), \qquad R(y,x;\beta_1) = y\exp(x'\beta_1),$$
$$A_{2i}(y,x;\beta) = \exp(X_i'\beta_2)\int_{-(R_i(\beta_1)-R(y,x;\beta_1))/h}^{\infty}K(s)\,ds,$$
$$A_{3i}(y,x;\beta) = \delta_i\,\dot K\Big(\frac{R_i(\beta_1)-R(y,x;\beta_1)}{h}\Big)\,\frac{R_i(\beta_1)\, X_{2i} - R(y,x;\beta_1)\, x_2}{h^2},$$
$$A_{4i}(y,x;\beta) = \exp(X_i'\beta_2)\, X_{3i}\int_{-(R_i(\beta_1)-R(y,x;\beta_1))/h}^{\infty}K(s)\,ds + \exp(X_i'\beta_2)\, K\Big(\frac{R_i(\beta_1)-R(y,x;\beta_1)}{h}\Big)\,\frac{R_i(\beta_1)\, X_{2i} - R(y,x;\beta_1)\, x_2}{h}.$$

Let R0 = R(Y, X; β0), R0(y, x ) = R(y, x ; β0). After some calculation, one obtains that

$$A_{jn}(y,x;\beta_0) \to E\, B_j(O;y,x)$$

for j = 1, 2, 3, 4, where

$$B_1(O;y,x) = I\big(R_0\ge R_0(y,x)\big)\,\lambda_0\big(R_0(y,x)\big)\exp(X'\beta_{20}),$$
$$B_2(O;y,x) = I\big(R_0\ge R_0(y,x)\big)\exp(X'\beta_{20}),$$
$$B_3(O;y,x) = f_{R_0}^{\delta,X}\big(R_0(y,x)\big)\,\lambda_0\big(R_0(y,x)\big)\, R_0(y,x)\,(X_2-x_2) - I\big(R_0\ge R_0(y,x)\big)\,\dot\lambda_0\big(R_0(y,x)\big)\exp(X'\beta_{20})\, R_0(y,x)\,(X_2-x_2) - I\big(R_0\ge R_0(y,x)\big)\,\lambda_0\big(R_0(y,x)\big)\exp(X'\beta_{20})\, X_2,$$
$$B_4(O;y,x) = I\big(R_0\ge R_0(y,x)\big)\exp(X'\beta_{20})\, X_3 + f_{R_0}^{\delta,X}\big(R_0(y,x)\big)\exp(X'\beta_{20})\, R_0(y,x)\,(X_2-x_2).$$

Let Õ = (, δ̃, ) be i.i.d copy of O, and the notation EOh(O) is the expectation of some function h(·) of O with respect to the joint density of O. Using the consistency of the estimator β̂n and conditions A3–A6, it follows from the proof of Theorem 2 in Zeng et al. ([8]) and Theorem 2.11.23 in van der Vaart and Wellner [22] that

$$\sqrt n\,\mathbb{P}_n\,\delta\Big\{\frac{A_{3n}(Y,X;\hat\beta_n)}{A_{1n}(Y,X;\hat\beta_n)} - \frac{A_{4n}(Y,X;\hat\beta_n)}{A_{2n}(Y,X;\hat\beta_n)} - \frac{\mathbb{P}_n B_3(O;Y,X)}{\mathbb{P}_n B_1(O;Y,X)} + \frac{\mathbb{P}_n B_4(O;Y,X)}{\mathbb{P}_n B_2(O;Y,X)}\Big\} = o_p(1).$$

Since Un(β̂n) = 0, one obtains that

$$0 = \sqrt n\, U_n(\hat\beta_n) = \sqrt n\,\mathbb{P}_n\Big\{\delta X_1 + \delta\frac{A_{3n}(Y,X;\hat\beta_n)}{A_{1n}(Y,X;\hat\beta_n)} - \delta\frac{A_{4n}(Y,X;\hat\beta_n)}{A_{2n}(Y,X;\hat\beta_n)}\Big\}$$
$$= \sqrt n\,\mathbb{P}_n\Big\{\delta X_1 + \delta\frac{\mathbb{P}_n B_3(O;Y,X)}{\mathbb{P}_n B_1(O;Y,X)} - \delta\frac{\mathbb{P}_n B_4(O;Y,X)}{\mathbb{P}_n B_2(O;Y,X)}\Big\} + o_p(1)$$
$$= \sqrt n\,\mathbb{P}_n\int\Big\{X_1 + \frac{\mathbb{P}_n B_3(O;t,X)}{\mathbb{P}_n B_1(O;t,X)} - \frac{\mathbb{P}_n B_4(O;t,X)}{\mathbb{P}_n B_2(O;t,X)}\Big\}\, dM(t) + \sqrt n\,\mathbb{P}_n\int\Big\{X_1 + \frac{\mathbb{P}_n B_3(O;t,X)}{\mathbb{P}_n B_1(O;t,X)} - \frac{\mathbb{P}_n B_4(O;t,X)}{\mathbb{P}_n B_2(O;t,X)}\Big\}\, I\big(R_0\ge R_0(t,X)\big)\,\lambda_0\big(R_0(t,X)\big)\exp(X'\beta_{20})\, dR_0(t,X) + o_p(1)$$
$$= \sqrt n\,\mathbb{P}_n\int\Big\{X_1 + \frac{E_O B_3(O;t,X)}{E_O B_1(O;t,X)} - \frac{E_O B_4(O;t,X)}{E_O B_2(O;t,X)}\Big\}\, dM(t) + o_p(1).$$

By calculating the expectation of Bj, we can simplify the expressions as

$$0 = \sqrt n\,\mathbb{P}_n\,\dot l_\beta^s(O) + o_p(1),$$

where

$$\dot l_\beta^s(O) = \int\Big\{X_1 - \frac{E_O\big[X_1\exp(X'\beta_{20})\, I\big(R_0\ge R_0(t,X)\big)\big]}{E_O\big[\exp(X'\beta_{20})\, I\big(R_0\ge R_0(t,X)\big)\big]}\Big\}\, dM(t) + \int\Big\{X_2 - \frac{E_O\big[X_2\exp(X'\beta_{20})\, I\big(R_0\ge R_0(t,X)\big)\big]}{E_O\big[\exp(X'\beta_{20})\, I\big(R_0\ge R_0(t,X)\big)\big]}\Big\}\,\frac{\dot\lambda_0\big(R_0(t,X)\big)\, R_0(t,X)}{\lambda_0\big(R_0(t,X)\big)}\, dM(t).$$

Here l̇βs is easily shown to be the efficient score function for β0, as given in Bickel et al. [23]. Write

$$H(t) = -\frac{E_O\big[X_1\exp(X'\beta_{20})\, I(R_0\ge t)\big]}{E_O\big[\exp(X'\beta_{20})\, I(R_0\ge t)\big]} - \frac{E_O\big[X_2\exp(X'\beta_{20})\, I(R_0\ge t)\big]}{E_O\big[\exp(X'\beta_{20})\, I(R_0\ge t)\big]}\,\frac{\dot\lambda_0(t)\, t}{\lambda_0(t)}.$$

Applying the definition of M(t), the score function can be rewritten as

$$\dot l_\beta^s(O) = X_1\delta - \int X_1\, I(R_0\ge t)\,\lambda_0(t)\exp(X'\beta_{20})\, dt + \delta X_2\,\frac{\dot\lambda_0(R_0)\, R_0}{\lambda_0(R_0)} - \int X_2\,\frac{\dot\lambda_0(t)\, t}{\lambda_0(t)}\, I(R_0\ge t)\,\lambda_0(t)\exp(X'\beta_{20})\, dt + \int_0^{R_0} H(t)\,\lambda_0(t)\exp(X'\beta_{20})\, dt + \delta H(R_0)$$
$$= X_2\delta + X_2 R_0\Big\{\delta\,\frac{\dot\lambda_0(R_0)}{\lambda_0(R_0)} - \lambda_0(R_0)\Big\} + \delta H(R_0) + X_3\delta - X_3\int_0^{R_0}\lambda_0(t)\exp(X'\beta_{20})\, dt - \int_0^{R_0} H(t)\,\lambda_0(t)\exp(X'\beta_{20})\, dt.$$

Now we prove that E[l̇βs(l̇βs)′] is nonsingular. Otherwise, there exists a nonzero vector α such that α′E[l̇βs(l̇βs)′]α = 0, which yields α′l̇βs = 0. After multiplying both sides by d exp(−Λ0(R0(t, X))) and integrating over t, we obtain X2′α = A1(ε) and X3′α = A2(ε) for two differentiable functions A1 and A2, by solving the two equations obtained by letting δ = 1 and δ = 0. Let α = (α1′, α2′)′, where α1 is the first p-dimensional sub-vector of α and α2 is the last p-dimensional sub-vector. By the definitions of X2 and X3, one obtains that X′α1 = A1(ε) and X′α2 = A2(ε) with probability one. Finally, Condition A5 entails that α1 = α2 = 0, and thus α = 0. This completes the proof.

Proof of Theorem 3

Denote $\mathbb{G}_n = \sqrt n(\mathbb{P}_n - P)$ and R̂ = Y exp(X′β̂1n). Note that

$$\sqrt n\,\hat\Lambda_n(t) = \sqrt n\int_0^t \frac{\frac{1}{nh}\sum_{i=1}^{n}\delta_i\, K\big((Y_i e^{X_i'\hat\beta_{1n}}-s)/h\big)}{\frac{1}{n}\sum_{i=1}^{n} e^{X_i'\hat\beta_{2n}}\int_{-(Y_i e^{X_i'\hat\beta_{1n}}-s)/h}^{\infty}K(u)\,du}\, ds$$
$$= \mathbb{G}_n\int_0^t \frac{\delta\, K\big((R_0-s)/h\big)}{G_0(s)}\, ds + \mathbb{G}_n\int_0^t \frac{\delta}{h}\Big[K\Big(\frac{\hat R-s}{h}\Big) - K\Big(\frac{R_0-s}{h}\Big)\Big]\frac{ds}{G_0(s)}$$
$$\quad - \mathbb{G}_n\int_0^t \frac{E_O\big[\delta f_{R_0}^{\delta}(s)\big]}{G_0(s)^2}\, e^{X'\beta_{20}}\int_{-(R_0-s)/h}^{\infty}K(u)\,du\, ds \times X_3'(\hat\beta_n-\beta_0)$$
$$\quad - \mathbb{G}_n\int_0^t \frac{E_O\big[\delta f_{R_0}^{\delta}(s)\big]}{G_0(s)^2}\Big[e^{X'\beta_{20}}\int_{-(R_0-s)/h}^{\infty}K(u)\,du + \frac{1}{h}\, e^{X'\beta_{20}}\, K\Big(\frac{R_0-s}{h}\Big) R_0\Big]\, ds \times X_2'(\hat\beta_n-\beta_0)$$
$$\quad + \sqrt n\int_0^t \frac{E_O\big[\delta f_{R_0}^{\delta}(s)\big]}{G_0(s)}\, ds + o_p(1).$$

By the asymptotic normality of β̂n, one obtains that

$$\mathbb{G}_n\int_0^t \frac{\delta}{h}\Big[K\Big(\frac{\hat R-s}{h}\Big) - K\Big(\frac{R_0-s}{h}\Big)\Big]\frac{ds}{G_0(s)} = \mathbb{G}_n\int_0^t \frac{\delta}{h^2}\,\dot K\Big(\frac{R_0-s}{h}\Big)\,\frac{R_0}{G_0(s)}\, X_2'(\hat\beta_n-\beta_0)\, ds + o_p(1)$$
$$= E_O\int_0^t \frac{\delta}{h^2}\,\dot K\Big(\frac{R_0-s}{h}\Big)\,\frac{R_0}{G_0(s)}\, X_2\, ds \times\big[E_O\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\big]^{-1}\,\mathbb{G}_n\,\dot l_\beta^s(O) + o_p(1)$$
$$= \int_0^t ds\int \frac{\dot K\big((y-s)/h\big)\, y\, G_1(y)\,\lambda_0(y)}{h^2\, G_0(s)}\, dy \times\big[E_O\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\big]^{-1}\,\mathbb{G}_n\,\dot l_\beta^s(O) + o_p(1)$$
$$= -\int_0^t \frac{\big[G_1(s)+s\,\dot G_1(s)\big]\lambda_0(s) + s\, G_1(s)\,\dot\lambda_0(s)}{G_0(s)}\, ds \times\big[E_O\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\big]^{-1}\,\mathbb{G}_n\,\dot l_\beta^s(O) + o_p(1).$$

In addition, by observing that

$$E_O\big[\delta f_{R_0}^{\delta}(s)\big] = E\int \frac{1}{h}\, K\Big(\frac{y-s}{h}\Big)\lambda_0(y)\, I(R_0\ge y)\, e^{X'\beta_{20}}\, dy = \int \frac{1}{h}\, K\Big(\frac{y-s}{h}\Big)\lambda_0(y)\, G_0(y)\, dy \ \to\ \lambda_0(s)\, G_0(s),$$

$\int_0^t E_O\big[\delta f_{R_0}^{\delta}(s)\big]/G_0(s)\, ds = \Lambda_0(t)$, and

$$E\Big[\frac{1}{h}\, e^{X'\beta_{20}}\, K\Big(\frac{R_0-s}{h}\Big)\, R_0\, X_2\Big] \ \to\ -s\,\dot G_1(s),$$

one obtains that

$$\sqrt n\big(\hat\Lambda_n(t)-\Lambda_0(t)\big) = \mathbb{G}_n\int_0^t \frac{\delta f_{R_0}^{\delta,X}(s) - \lambda_0(s)\, e^{X'\beta_{20}}\, I(R_0>s)}{G_0(s)}\, ds - \int_0^t \frac{\big[G_1(s)+G_3(s)\big]\lambda_0(s) + s\, G_1(s)\,\dot\lambda_0(s)}{G_0(s)}\, ds \times\big[E_O\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\big]^{-1}\,\mathbb{G}_n\,\dot l_\beta^s(O) + o_p(1).$$

The proof is completed.

References

  • 1. Cox DR. Regression models and life tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  • 2. Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
  • 3. Kalbfleisch J, Prentice R. The Statistical Analysis of Failure Time Data. Wiley; New York: 2002.
  • 4. Ritov Y. Estimation in a linear regression model with censored data. Annals of Statistics. 1990;18:303–328.
  • 5. Wei LJ, Ying Z, Lin DY. Linear regression analysis of censored survival data based on rank tests. Biometrika. 1990;77:845–851.
  • 6. Lai TL, Ying Z. Rank regression methods for left-truncated and right-censored data. Annals of Statistics. 1991;19:531–556.
  • 7. Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika. 2003;90:341–353.
  • 8. Zeng D, Lin D. Efficient estimation for the accelerated failure time model. Journal of the American Statistical Association. 2007;102:1387–1396.
  • 9. Lu W. Efficient estimation for accelerated failure time model with a cure fraction. Statistica Sinica. 2010;20:661–674.
  • 10. Chen YQ, Jewell NP. On a general class of semiparametric hazards regression models. Biometrika. 2001;88:687–702.
  • 11. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  • 12. Zhang HH, Lu W. Adaptive Lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
  • 13. Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics. 2009;65:394–404. doi: 10.1111/j.1541-0420.2008.01074.x.
  • 14. Lu W, Zhang HH. Variable selection for proportional odds model. Statistics in Medicine. 2007;26:3771–3781. doi: 10.1002/sim.2833.
  • 15. Johnson BA. Variable selection in semiparametric linear regression with censored data. Journal of the Royal Statistical Society, Series B. 2008;70:351–370.
  • 16. Johnson B, Lin DY, Zeng D. Penalized estimating functions and variable selection in semiparametric regression models. Journal of the American Statistical Association. 2008;103(482):672–680. doi: 10.1198/016214508000000184.
  • 17. Zou H. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
  • 18. Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  • 19. Wang H, Li R, Tsai C. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
  • 20. Robison LL, Mertens AC, Boice JD, et al. Study design and cohort characteristics of the Childhood Cancer Survivor Study: a multiinstitutional collaborative project. Medical and Pediatric Oncology. 2002;38(4):229–239. doi: 10.1002/mpo.1316.
  • 21. Robison LL, Armstrong G, Boice JD, et al. The Childhood Cancer Survivor Study: a National Cancer Institute-supported resource for outcome and intervention research. Journal of Clinical Oncology. 2009;27(14):2308–2318. doi: 10.1200/JCO.2009.22.3339.
  • 22. Van der Vaart AW, Wellner J. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 1996.
  • 23. Bickel PJ, Klaassen C, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer; New York: 1993.
