Published in final edited form as: Stat Med. 2013 Jul 3;32(28):4980–4994. doi: 10.1002/sim.5885

A General Semiparametric Hazards Regression Model: Efficient Estimation and Structure Selection

Xingwei Tong, Liang Zhu, Chenlei Leng*, Wendy Leisenring, Leslie L. Robison
PMCID: PMC3913752  NIHMSID: NIHMS545123  PMID: 23824784

Abstract

We consider a general semiparametric hazards regression model that encompasses Cox's proportional hazards model and the accelerated failure time model for survival analysis. To overcome the nonexistence of the maximum likelihood estimator, we derive a kernel-smoothed profile likelihood function, and prove that the resulting estimates of the regression parameters are consistent and achieve semiparametric efficiency. In addition, we develop penalized structure selection techniques to determine which covariates constitute the accelerated failure time model and which covariates constitute the proportional hazards model. The proposed method estimates the model structure consistently and the model parameters efficiently. Furthermore, variance estimation is straightforward. The proposed method performs well in simulation studies and is applied to the analysis of a real data set.

Keywords: Accelerated failure time model, Cox’s proportional hazards model, Efficiency, Kernel-smoothed profile likelihood function, Model selection, Penalized likelihood

1. Introduction

The proportional hazards model proposed by Cox ([1], [2]) and the accelerated failure time (AFT) model are the two major approaches to analyzing censored data in survival studies [3]. Cox's model specifies that the covariates act multiplicatively on the hazard function, while the AFT model specifies that the effect of the covariates is multiplicative on time. Clearly, the covariates in these two models provide different measures of the effect on survival.

Efficient estimation and inference for the proportional hazards model are based on the partial likelihood, which is easy to implement and has attractive statistical properties [3]. On the other hand, efficient inference for the AFT model with unspecified error distribution is much more difficult ([4], [5], [6]). Jin et al. [7] proposed a linear-programming implementation of rank-based estimation, which can be suboptimal in terms of efficiency. Zeng and Lin [8] proposed an approximate nonparametric maximum likelihood method via maximizing a kernel-smoothed profile likelihood function, and showed that the resulting estimator is efficient. Lu [9] further studied this model with a cure fraction.

In practice, when given a dataset, we need to choose between Cox's model and the AFT model. A usual approach is to fit one of the two models and then conduct model testing to assess the lack of fit. However, due to finite sample sizes and other data characteristics, the ability to check the model assumptions may be limited. In addition, it is not unusual that both Cox's model and the AFT model fit the data reasonably well, if appropriate time-dependent covariates are included [8].

Instead of deciding which model to fit, we consider in this paper a general semiparametric hazards model first studied by Chen and Jewell [10]. This model encompasses Cox's model and the AFT model as special cases, and is more flexible in modeling survival data. Chen and Jewell used an estimating equations method for inference, which is not efficient in general. Numerical challenges also arise from the non-smoothness of their estimating equations, and Chen and Jewell confined application of their model to low-dimensional cases, possibly due to the computational complexity. We, on the other hand, propose a kernel-smoothed profile likelihood function aimed at maximum likelihood estimation, motivated by Zeng and Lin [8]. The maximization can be easily implemented using any optimization algorithm developed for smooth objective functions, such as the Newton-Raphson algorithm. We show that the resulting estimators are asymptotically normal and attain the semiparametric efficiency bound. The limiting covariance matrix of the estimates can be readily computed as an output of the algorithm. More importantly, we develop penalized model selection techniques to determine which covariates constitute the accelerated failure time model and which covariates contribute to the proportional hazards model. The resulting estimators are shown to have the oracle property ([11], [12]). Namely, the final chosen model correctly distinguishes Cox covariates and AFT covariates with probability tending to one, and the resulting estimates are as efficient as if we had used the true model structure for estimation and inference. Zhang and Lu [12] studied variable selection for the Cox model, while Cai et al. [13] studied it for the AFT model; see also Lu and Zhang [14] and Johnson [15]. Our model selection has the added implication of distinguishing these two kinds of covariates.

The rest of the paper is organized as follows. We describe the general semiparametric hazards model in Section 2 and discuss a smoothed profile likelihood function for estimation. Section 3 gives the large sample properties of the proposed estimators. We study penalized model selection techniques in Section 4. Simulation studies are presented in Section 5, followed by an analysis of Childhood Cancer Survivor Study data in Section 6. We make some concluding remarks in Section 7. All proofs are relegated to the Appendix.

2. A general hazards model

Let T denote the failure time of interest and X be the p-dimensional covariate that is time invariant. If T follows the proportional hazards model, then the conditional hazards function of T given X has the form

$$\lambda_{T|X}(t) = \lambda_0(t)\exp(X'\beta), \qquad (1)$$

where β is the regression parameter and λ0(t) is an unspecified baseline hazard function. If T follows the accelerated failure time model, then T satisfies that

$$\log T = -X'\beta + \varepsilon,$$

where $\varepsilon$ is a random error independent of X. If we denote the cumulative hazards function of $\varepsilon$ by $\Lambda_\varepsilon$ and set $\Lambda_0(\cdot) = \Lambda_\varepsilon(\log(\cdot))$, the cumulative hazards function of T given X in the AFT model takes the form

$$\Lambda_{T|X}(t) = \Lambda_0\big(t\exp(X'\beta)\big).$$

Differentiating both sides of the above equality, one obtains the conditional hazards function of T given X

$$\lambda_{T|X}(t) = \lambda_0\big(t\exp(X'\beta)\big)\exp(X'\beta). \qquad (2)$$

In this paper, we explore a general hazards regression model which includes Cox’s model and the AFT model as special cases. We assume that the conditional hazards function of T given X takes the form

$$\lambda_{T|X}(t) = \lambda_0\big(t\exp(X'\beta_1)\big)\exp(X'\beta_1)\exp(X'\beta_2), \qquad (3)$$

where β1 and β2 are two regression parameters. Alternatively, this general hazards model implies that the failure time T follows the model

$$\log T = -X'\beta_1 + \varepsilon, \qquad (4)$$

where $e^\varepsilon$ follows the proportional hazards model with cumulative hazards function $\Lambda_0(t)\exp(X'\beta_2)$.
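To make the equivalence between (3) and (4) explicit, we add a one-line check here: since (4) gives $e^\varepsilon = T\exp(X'\beta_1)$, the conditional survival function is $P(T > t \mid X) = P\big(e^\varepsilon > t\exp(X'\beta_1) \mid X\big)$, so that

$$\Lambda_{T|X}(t) = \Lambda_0\big(t\exp(X'\beta_1)\big)\exp(X'\beta_2),$$

and differentiating with respect to t recovers (3).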

The two parameters, β1 and β2, measure two different effects of the covariate on the survival time. The first parameter, β1, measures the rate at which the covariate accelerates or decelerates an individual's progression along the time line. The second parameter, β2, identifies multiplicative effects of the covariate on the hazard function after adjusting for β1. The model reduces to the accelerated failure time model when β2 = 0, and it becomes the proportional hazards model when β1 = 0. For a covariate Xj: if β1j = β2j = 0, this variable has no effect on the survival time; if β1j = 0 and β2j ≠ 0, it is a Cox covariate; if β1j ≠ 0 and β2j = 0, it is an AFT covariate; and if β1j ≠ 0 and β2j ≠ 0, it is a mixed covariate that contributes to both models. Clearly, this general hazards model is broader than the two individual models and is more flexible. The structure of the model raises two important questions: one is whether efficient estimation of β1 and β2 is possible, and the other is which covariates enter Cox's model and which covariates enter the accelerated failure time model.

Let C be the censoring time and τ the largest study time. Suppose that we have a data set with n subjects. For the ith subject, we observe the censored event time Yi = min(Ci, Ti), the censoring indicator δi = I(Ti ≤ Ci) and covariate Xi, where I(·) is the indicator function. We assume that Ci is independent of Ti conditional on Xi and that its distribution does not depend on β = (β1′, β2′)′. Then, given i.i.d. observations Oi = (Yi, δi, Xi), i = 1, ···, n, from O = (Y, δ, X), the log-likelihood function of β and λ is given by

$$\frac{1}{n}\sum_{i=1}^{n}\Big[\delta_i X_i'(\beta_1+\beta_2) + \delta_i\log\lambda\big(Y_i\exp(X_i'\beta_1)\big) - \exp(X_i'\beta_2)\,\Lambda\big(Y_i\exp(X_i'\beta_1)\big)\Big]. \qquad (5)$$

In principle, the maximum likelihood estimator would be obtained by maximizing the above log-likelihood function. An argument similar to that in Zeng and Lin [8] shows, however, that the maximum of (5) does not exist, because the maximizing Λ is very non-smooth. To overcome this difficulty, we use a piecewise constant function to approximate λ. That is, we divide the interval that contains all possible values of Yi exp(Xi′β1) into Kn equally spaced intervals and then assume that λ is constant on each interval. More specifically, we write

$$\lambda(t) = \sum_{k=1}^{K_n}\alpha_k\, I\big(t\in[t_{k-1},t_k)\big),$$

where 0 = t0 < t1 < ··· < t_{Kn} = η with $\eta = \sup_{i,\,\beta_1\in\mathcal{B}_1} Y_i\exp(X_i'\beta_1)$. Here $\mathcal{B}_1$ is a compact subset of $\mathbb{R}^p$, and α1, ···, α_{Kn} are the corresponding unknown constants on the intervals. Then the cumulative hazards function Λ can be written as

$$\Lambda(t) = \sum_{k=1}^{K_n}\alpha_k\,(t-t_k)\, I\big(t\in[t_{k-1},t_k)\big) + \frac{\eta}{K_n}\sum_{k=1}^{K_n}\alpha_k\, I(t\ge t_{k-1}).$$
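As a quick sanity check (our addition, not in the original), the following Python snippet verifies numerically that this closed form integrates the piecewise constant λ; the grid size and coefficients are arbitrary choices.

```python
import numpy as np

Kn, eta = 5, 2.0
tgrid = np.linspace(0.0, eta, Kn + 1)          # 0 = t_0 < ... < t_{Kn} = eta
alpha = np.random.default_rng(0).uniform(0.5, 2.0, Kn)

def lam(t):
    """Piecewise constant hazard: alpha_k on [t_{k-1}, t_k)."""
    k = np.clip(np.searchsorted(tgrid, t, side="right") - 1, 0, Kn - 1)
    return alpha[k]

def Lam(t):
    """Closed-form cumulative hazard from the display above (0-based k)."""
    k = min(np.searchsorted(tgrid, t, side="right") - 1, Kn - 1)
    return alpha[k] * (t - tgrid[k + 1]) + (eta / Kn) * alpha[: k + 1].sum()

for t in (0.3, 1.1, 1.9):
    s = np.linspace(0.0, t, 200001)            # left Riemann sum of lam
    assert abs(np.sum(lam(s[:-1]) * np.diff(s)) - Lam(t)) < 1e-3
```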

Therefore the log-likelihood function (5) can be re-expressed as

$$\frac{1}{n}\sum_{i=1}^{n}\delta_i X_i'(\beta_1+\beta_2) + \frac{1}{n}\sum_{k=1}^{K_n}\log\alpha_k\Big[\sum_{i=1}^{n}\delta_i\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big)\Big] - \frac{1}{n}\sum_{k=1}^{K_n}\alpha_k\Big[\sum_{i=1}^{n}\exp(X_i'\beta_2)\big(R_i(\beta_1)-t_k\big)\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big) + \frac{\eta}{K_n}\sum_{i=1}^{n}\exp(X_i'\beta_2)\, I\big(R_i(\beta_1)\ge t_{k-1}\big)\Big], \qquad (6)$$

where $R_i(\beta_1) = Y_i\exp(X_i'\beta_1)$.

Differentiating (6) with respect to αk and setting the derivative to zero, one obtains

$$\alpha_k = \frac{\sum_{i=1}^{n}\delta_i\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big)}{\sum_{i=1}^{n}\exp(X_i'\beta_2)\big(R_i(\beta_1)-t_k\big)\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big) + \dfrac{\eta}{K_n}\sum_{i=1}^{n}\exp(X_i'\beta_2)\, I\big(R_i(\beta_1)\ge t_{k-1}\big)}.$$

Therefore, the profile log-likelihood function of β is

$$l_n^p(\beta) = \frac{1}{n}\sum_{i=1}^{n}\delta_i X_i'(\beta_1+\beta_2) + \sum_{k=1}^{K_n}\Big[\frac{1}{n}\sum_{i=1}^{n}\delta_i\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big)\Big]\log\Big[\frac{K_n}{n\eta}\sum_{j=1}^{n}\delta_j\, I\big(R_j(\beta_1)\in[t_{k-1},t_k)\big)\Big] - \sum_{k=1}^{K_n}\Big[\frac{1}{n}\sum_{i=1}^{n}\delta_i\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big)\Big]\log\Big[\frac{K_n}{n\eta}\sum_{i=1}^{n}\exp(X_i'\beta_2)\big(R_i(\beta_1)-t_k\big)\, I\big(R_i(\beta_1)\in[t_{k-1},t_k)\big) + \frac{1}{n}\sum_{i=1}^{n}\exp(X_i'\beta_2)\, I\big(R_i(\beta_1)\ge t_{k-1}\big)\Big]. \qquad (7)$$

The profile log-likelihood function is not smooth because of the many indicator functions in (7); as a result, direct maximization is difficult. As n → ∞ with Kn → ∞ and Kn/n → 0, we have

$$l_n^p(\beta) \to l(\beta) \quad \text{uniformly for } \beta\in\mathcal{B},$$

where

$$l(\beta) = E\bigg[\delta X'(\beta_1+\beta_2) + \delta\log\bigg(\frac{dP\big(\delta=1,\, R(\beta_1)\le t\big)/dt}{E\big[\exp(X'\beta_2)\, I\big(R(\beta_1)\ge t\big)\big]}\bigg)\bigg|_{t=R(\beta_1)}\bigg].$$

This motivates us to use a kernel-smoothed approximation to l(β), given by

$$l_n^s(\beta) = \frac{1}{n}\sum_{i=1}^{n}\delta_i X_i'(\beta_1+\beta_2) + \frac{1}{n}\sum_{i=1}^{n}\delta_i\log\Big[\frac{1}{nh}\sum_{j=1}^{n}\delta_j\, K\big((R_j(\beta_1)-R_i(\beta_1))/h\big)\Big] - \frac{1}{n}\sum_{i=1}^{n}\delta_i\log\Big[\frac{1}{n}\sum_{j=1}^{n}\exp(X_j'\beta_2)\int_{-(R_j(\beta_1)-R_i(\beta_1))/h}^{\infty}K(s)\,ds\Big], \qquad (8)$$

where h = h(n) is a bandwidth parameter converging to zero as n → ∞.
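For concreteness, the following Python sketch (ours, not the authors' code) evaluates (8) with a Gaussian kernel, for which the inner integral is the standard normal distribution function, and maximizes it with a generic smooth optimizer; all function and variable names are our own.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def smoothed_loglik(beta, Y, delta, X, h):
    """Kernel-smoothed log-likelihood l_n^s(beta) of (8); Gaussian kernel,
    so the integral of K from -u to infinity equals Phi(u)."""
    p = X.shape[1]
    b1, b2 = beta[:p], beta[p:]
    R = Y * np.exp(X @ b1)                 # R_i(beta1) = Y_i exp(X_i' beta1)
    D = (R[None, :] - R[:, None]) / h      # D[i, j] = (R_j - R_i) / h
    term1 = np.mean(delta * (X @ (b1 + b2)))
    dens = (delta[None, :] * norm.pdf(D)).mean(axis=1) / h   # (1/nh) sum_j ...
    risk = (np.exp(X @ b2)[None, :] * norm.cdf(D)).mean(axis=1)
    logs = delta * (np.log(np.maximum(dens, 1e-300))
                    - np.log(np.maximum(risk, 1e-300)))
    return term1 + np.mean(logs)

def fit(Y, delta, X, h):
    """Maximize l_n^s over beta = (beta1', beta2')' from a zero start."""
    p = X.shape[1]
    negloglik = lambda b: -smoothed_loglik(b, Y, delta, X, h)
    return minimize(negloglik, np.zeros(2 * p), method="BFGS").x
```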

The proposed estimator β̂n = (β̂1n′, β̂2n′)′ is obtained by maximizing the kernel-smoothed likelihood function lns(β) over β ∈ $\mathcal{B}$ for a compact set $\mathcal{B}\subset\mathbb{R}^{2p}$. Given β̂n, the cumulative hazard function can be estimated as

$$\hat\Lambda_n(t) = \int_0^t \frac{\frac{1}{nh}\sum_{i=1}^{n}\delta_i\, K\big((R_i(\hat\beta_{1n})-s)/h\big)}{\frac{1}{n}\sum_{i=1}^{n}\exp(X_i'\hat\beta_{2n})\int_{-(R_i(\hat\beta_{1n})-s)/h}^{\infty}K(u)\,du}\, ds, \qquad (9)$$

and the corresponding baseline hazard function can be estimated by

$$\hat\lambda_n(t) = \frac{\frac{1}{nh}\sum_{i=1}^{n}\delta_i\, K\big((R_i(\hat\beta_{1n})-t)/h\big)}{\frac{1}{n}\sum_{i=1}^{n}\exp(X_i'\hat\beta_{2n})\int_{-(R_i(\hat\beta_{1n})-t)/h}^{\infty}K(u)\,du}.$$

These are kernel-smoothed estimates of the hazard functions motivated by the need to smooth the likelihood.

For bandwidth selection, we can use K-fold cross-validation. In the current setting, we divide the data into K equal-sized groups. Let Dk denote the kth subgroup of the data; then the kth prediction error is given by

$$PE_k(h) = \sum_{i\in D_k}\Big[N_i(\tau) - \int_0^\tau I(Y_i\ge t)\exp\big(X_i'\hat\beta_2^{(-k)}\big)\, d\hat\Lambda_n\big(t\exp(X_i'\hat\beta_1^{(-k)})\big)\Big]^2$$

for k = 1, …, K, where β̂1^{(−k)} and β̂2^{(−k)} are the estimators of β1 and β2 based on the data without subgroup Dk. The optimal bandwidth is the minimizer of the total prediction error $PE(h) = \sum_{k=1}^{K} PE_k(h)$.
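A sketch of this selector is given below (again our code, reusing the hypothetical smoothed_loglik/fit helpers from the previous snippet); it approximates dΛ̂n(t exp(X′β̂1)) = λ̂n(t exp(X′β̂1)) exp(X′β̂1) dt on a fixed quadrature grid.

```python
def hazard_hat(t, Y, delta, X, b1, b2, h):
    """Gaussian-kernel version of the smoothed baseline hazard estimate."""
    R = Y * np.exp(X @ b1)
    num = np.mean(delta * norm.pdf((R - t) / h)) / h
    den = np.mean(np.exp(X @ b2) * norm.cdf((R - t) / h))
    return num / max(den, 1e-300)

def cv_bandwidth(Y, delta, X, h_grid, K=5, tau=None, n_quad=200):
    """Pick the h in h_grid minimizing PE(h) = sum over folds of PE_k(h)."""
    n, p = X.shape
    tau = float(Y.max()) if tau is None else tau
    folds = np.array_split(np.random.default_rng(0).permutation(n), K)
    tgrid = np.linspace(tau / n_quad, tau, n_quad)
    dt = tgrid[1] - tgrid[0]
    pe = []
    for h in h_grid:
        total = 0.0
        for idx in folds:
            tr = np.setdiff1d(np.arange(n), idx)     # fit without fold k
            bhat = fit(Y[tr], delta[tr], X[tr], h)
            b1, b2 = bhat[:p], bhat[p:]
            for i in idx:
                e1, e2 = np.exp(X[i] @ b1), np.exp(X[i] @ b2)
                lam = np.array([hazard_hat(t * e1, Y[tr], delta[tr],
                                           X[tr], b1, b2, h) for t in tgrid])
                # integral of I(Y_i >= t) e^{X_i'b2} dLambda_hat(t e^{X_i'b1})
                cum = np.sum((tgrid <= Y[i]) * e2 * lam * e1) * dt
                total += (delta[i] * (Y[i] <= tau) - cum) ** 2
        pe.append(total)
    return h_grid[int(np.argmin(pe))]
```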

3. Large sample property

In this section, we study the asymptotic behavior of the proposed estimators. The regularity conditions and proofs are given in the Appendix. First, we present the consistency result. Denote the true value of β by β0 = (β10′, β20′)′.

Theorem 1

Suppose that Conditions A1–A6 in the Appendix hold. If h = O(n^v) with −1/2 < v < 0, then β̂n → β0 almost surely.

Theorem 1 states the strong consistency of the proposed estimator β̂n. Intuitively, this can be seen from the uniform convergence of lns(β) over β ∈ $\mathcal{B}$ and the fact that β0 uniquely maximizes the limiting function of lns(β). For the asymptotic normality of the estimates, we have the following result.

Theorem 2

Suppose that Conditions A1–A6 in the Appendix hold. If the bandwidth satisfies nh^{2m} → 0 and nh^6 → ∞, then √n(β̂n − β0) converges weakly to a normal random variable with mean zero and covariance attaining the semiparametric efficiency bound for β0. Here m is defined in Condition A6 in the Appendix.

The inverse of the negative second derivative of lns(β) at β̂n can be used to estimate the asymptotic covariance of √n(β̂n − β0), because β̂n is efficient. We note that the estimating-equations-based estimator in Chen and Jewell [10] is not efficient in general. From this theorem, we see that oversmoothing is needed to obtain asymptotic normality. This result is similar to that in Zeng and Lin [8]. The cross-validation choice of the bandwidth, aiming at minimizing the expected prediction error, tends to choose a bandwidth that optimizes the overall performance with respect to the kernel-smoothed hazard function and the parameters, and thus does not satisfy the rate specified in Theorem 2. However, in our simulation studies and the data analysis, we observe that the difference between a prespecified bandwidth and a cross-validation-based bandwidth is often small. Zeng and Lin [8] also reported that their result was not sensitive to the bandwidth. We now connect our estimates to those of Cox's model and the AFT model. In the proof of Theorem 2 in the Appendix, we show that the score function Un(β) of β takes the form

$$U_n(\beta) = \frac{1}{n}\sum_{i=1}^{n}\dot l_\beta^s(O_i) + o_p(n^{-1/2}),$$

where

$$\dot l_\beta^s(O) = \int\bigg(X_2 - \frac{E\big[X_2\exp(X'\beta_{20})\, I(Y\ge t)\big]}{E\big[\exp(X'\beta_{20})\, I(Y\ge t)\big]}\bigg)\, dM(t) + \int\bigg(X_2 - \frac{E\big[X_2\exp(X'\beta_{20})\, I(Y\ge t)\big]}{E\big[\exp(X'\beta_{20})\, I(Y\ge t)\big]}\bigg)\,\frac{t\exp(X'\beta_{10})\,\dot\lambda\big(t\exp(X'\beta_{10})\big)}{\lambda\big(t\exp(X'\beta_{10})\big)}\, dM(t) + \int\bigg(X_3 - \frac{E\big[X_3\exp(X'\beta_{20})\, I(Y\ge t)\big]}{E\big[\exp(X'\beta_{20})\, I(Y\ge t)\big]}\bigg)\, dM(t), \qquad (10)$$

for X2 = (X′, 0p′)′ and X3 = (0p′, X′)′, with 0p being the p-dimensional zero vector, and dM(t) = dN(t) − exp(X′β20) I(Y ≥ t) dΛ(t exp(X′β10)), where N(t) = δI(Y ≤ t). Here and in the sequel, ḟ = df(t)/dt denotes the first derivative of f, provided it exists.

From (10), if the AFT model holds, that is β2 = 0, then the third term in (10) disappears. This gives the same score function as the AFT model derived in Zeng and Lin [8]. When the proportional hazards model holds, that is β1 = 0, then the first two terms in (10) disappear. The resulting score function of β2 reduces to

$$\frac{1}{n}\sum_{i=1}^{n}\int\bigg(X_i - \frac{E\big[X\exp(X'\beta_{20})\, I(Y\ge t)\big]}{E\big[\exp(X'\beta_{20})\, I(Y\ge t)\big]}\bigg)\, dM_i(t),$$

which coincides with the score function derived from the partial likelihood function in the proportional hazards model. These arguments illustrate clearly that the efficient score functions used in Cox’s and the AFT models are special cases of those of the proposed general model.

To analyze the asymptotic properties of Λ̂n(·), we write R0 = Y exp(X′β10), G0(s) = E[exp(X′β20) I(R0 > s)], G1(s) = E[exp(X′β20) I(R0 > s) X2], G2(s) = G1(s)/G0(s), and G3(s) = E[exp(X′β20) I(R0 > s) X3]. Denote by f_{X|Y}(x) the conditional density function of X given Y.

Theorem 3

If the conditions in Theorem 2 hold, then √n(Λ̂n(t) − Λ0(t)) converges weakly to a zero-mean Gaussian process with covariance function E[H(O; t)H(O; s)] at (t, s) as n tends to infinity, where

$$H(O;t) = \int_0^t \frac{\delta f_{R_0}^{\delta,X}(s) - \lambda_0(s)\exp(X'\beta_{20})\, I(R_0>s)}{G_0(s)}\, ds - \int_0^t \frac{\big[G_1(s)+G_3(s)\big]\lambda_0(s) + s\, G_1(s)\,\dot\lambda_0(s)}{G_0(s)}\, ds \times\Big[E\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\Big]^{-1}\dot l_\beta^s(O).$$

4. Model selection

Model selection is important in regression analysis, and many procedures have been developed. In this section, we apply the model selection idea to determine the model structure: the goal is to decide which covariates play a role in Cox's model and which in the accelerated failure time model, thereby elucidating the roles of the covariates. To this end, we propose to maximize the penalized smoothed profile likelihood function

$$Q(\beta) = l_n^s(\beta) - \sum_{m=1}^{2}\sum_{j=1}^{p} p_{w_{mj}}\big(|\beta_{mj}|\big),$$

where p(·) is a penalty function and wmj, m = 1, 2; j = 1, ···, p are the tuning parameters.

Denote by β̃n the maximizer of Q(β) over β ∈ $\mathcal{B}$. Let V(β0) = E[l̇βs(O) l̇βs(O)′] = (Vij), evaluated at β = β0, be the information matrix. To derive the asymptotic properties of β̃n, denote the zero index set by B0 = {mj : β_{mj}^{(0)} = 0} and the nonzero index set by B1 = {mj : β_{mj}^{(0)} ≠ 0}, where β_{mj}^{(0)} is the jth component of β_{m0}. For a subset B of {11, ···, 1p, 21, ···, 2p}, let βB = {β_{mj}, mj ∈ B} be a sub-vector of β. We have the following theorem, which states the structure selection consistency, also known as the oracle property [11]. Since the conditions of Theorem 1 in [16] all hold, one obtains the following oracle properties.

Theorem 4

Suppose that the conditions of Theorem 2 and Condition A7 hold and that, for m = 1, 2 and j = 1, ···, p, wmj → 0 and √n·wmj → ∞ as n → ∞. Then we have

  1. Selection consistency: β̃nB0 = 0 with probability tending to one;

  2. Asymptotic normality: √n(β̃nB1 − β0B1) → N(0, {V(β0B1)}^{−1}) in distribution, where V(β0B1) is the Fisher information matrix for the nonzero set B1.

Condition A7 puts requirements on the penalty functions that can be used in Q(β). Here we use the smoothly clipped absolute deviation (SCAD) penalty function, whose first derivative is given by

$$\dot p_w(\theta) = w\Big\{I(\theta\le w) + \frac{(aw-\theta)_+}{(a-1)w}\, I(\theta>w)\Big\}, \qquad (11)$$

where a = 3.7 is a constant and c+ = max(c, 0) for any real scalar c. The SCAD penalty satisfies Condition A7 in the Appendix and enjoys the favorable properties outlined in Fan and Li [11]. Alternatively, we may use the adaptive Lasso penalty [17], an extension of the Lasso [18]. Setting wmj = w, we choose w to minimize the Bayesian information criterion (BIC)

$$BIC(w) = -l_n^s(\tilde\beta_w) + \#\tilde\beta_w\cdot\log(n)/n,$$

where #β̃w denotes the number of nonzero coefficients in β̃w. Following Wang, Li and Tsai [19], we can show that the resulting penalized estimators are model selection consistent. The asymptotic covariance matrix of β̃nB1 can thus be consistently estimated by n^{−1}{V(β̃nB1)}^{−1}. Since the proof of this theorem is now standard [12], we omit it to save space.
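As an illustration (our sketch, not the authors' implementation), the SCAD derivative (11) and the BIC criterion can be coded directly; fit_penalized is an assumed helper that maximizes Q(β) for a given w, and smoothed_loglik is the helper sketched in Section 2.

```python
import numpy as np

def scad_deriv(theta, w, a=3.7):
    """First derivative (11) of the SCAD penalty, evaluated at |theta|."""
    t = np.abs(theta)
    return w * ((t <= w).astype(float)
                + np.maximum(a * w - t, 0.0) / ((a - 1) * w) * (t > w))

def bic(w, Y, delta, X, h, fit_penalized):
    """BIC(w) = -l_n^s(beta_w) + (#nonzero) * log(n) / n."""
    beta_w = fit_penalized(Y, delta, X, h, w)     # assumed penalized fitter
    k = int(np.sum(np.abs(beta_w) > 1e-8))        # count nonzero coefficients
    return -smoothed_loglik(beta_w, Y, delta, X, h) + k * np.log(len(Y)) / len(Y)
```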

5. Simulation

We conduct simulation studies in this section to evaluate the finite sample performance of the proposed method, comparing it to the estimating equations method of Chen and Jewell [10] as well as to Cox's model and the AFT model. We also assess the accuracy of the inference procedure and the performance in model structure determination.

Example 1

In the first study, we compare the performance of our proposed estimator with the one proposed by Chen and Jewell [10]. For that purpose, we simulate data motivated by the simulation setup in Chen and Jewell [10]. In particular, we consider the one-covariate case where Xi follows the Bernoulli distribution with success probability 0.5. The censoring time Ci follows the uniform distribution U(0, τ) with τ = 9. Given Xi, we generate the failure time Ti under model (3) with λ0(t) = 1/(bt + 1) and different values of β1 and β2. For this simulation, we use four configurations for (β1, β2)′: (1, −1)′, (1, 0)′, (0, 1)′, and (0, 0)′. When β10 = 0, Cox's model should fit the data very well, and when β20 = 0, the AFT model should fit the data very well. When both β10 and β20 are 0, both the AFT model and Cox's model fit the data well. Two sample sizes, n = 100 and n = 200, are used. Each simulation setup is repeated 1000 times.
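The failure times in this setup can be drawn by inverting the conditional cumulative hazard exp(X′β2)Λ0(t exp(X′β1)), with Λ0(u) = b^{−1} log(1 + bu), at a standard exponential draw; a minimal sketch (our code) is:

```python
import numpy as np

def generate(n, b1, b2, b=1.0, tau=9.0, seed=1):
    """Simulate (Y, delta, X) under model (3) with lambda_0(t) = 1/(b t + 1)."""
    rng = np.random.default_rng(seed)
    X = rng.binomial(1, 0.5, (n, 1)).astype(float)   # Bernoulli(0.5) covariate
    xb1, xb2 = X[:, 0] * b1, X[:, 0] * b2
    E = rng.exponential(1.0, n)          # Lambda(T | X) is Exp(1) distributed
    v = E * np.exp(-xb2)                 # so Lambda0(T exp(xb1)) = v
    T = np.exp(-xb1) * np.expm1(b * v) / b
    C = rng.uniform(0.0, tau, n)         # censoring time ~ U(0, tau)
    return np.minimum(T, C), (T <= C).astype(int), X
```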

The results for the proposed method, Cox's model and the AFT model (the latter using the efficient estimation approach of Zeng and Lin [8]) are reported in Table 1. In our simulation studies, we take h = 0.85n^{−1/7}. For our proposed method, we report the average bias (Bias), the sample standard deviation of the estimates across 1000 replications (SSD), the average of the estimated standard errors (ESE), and the 95% empirical coverage probability (CP). For the other methods, we provide only the Bias and SSD. These results indicate that the proposed estimation procedure gives almost unbiased estimates. The ESEs are all reasonably close to the SSDs, and the coverage probability is close to the nominal 95%, indicating that the proposed estimation and inference procedures work well for finite samples. The Cox model works well when β10 = 0, but has significant bias when β10 is not 0. Similarly, the AFT model works well only when β20 = 0, and has significant bias when β20 is not 0. One reviewer asked why the bias and SSD of the proposed estimator differ from those of the Cox (AFT) model when β10 = 0 (β20 = 0). The reason is that, in that situation, the Cox model estimates only β2 while the proposed method still estimates both parameters. If one applies variable selection in the next step, then β1 will be set to 0 with high probability, and the estimates for β2 should be the same as those from the Cox model. Compared with the estimators proposed by Chen and Jewell [10], the biases of both estimators are very small, but our estimators generally have smaller SSDs. This implies that our estimators are more efficient, in agreement with Theorem 2. We also ran the simulation with b = 0.5 and obtained similar results (not shown). To evaluate the performance in estimating Λ0, we plot the estimated curves in Figure 1 for b = 0.5, 1 and n = 100, 200. Figure 1 shows that the estimated curves are very close to the true curve of Λ0(t). In terms of bandwidth selection, we also tried K-fold cross-validation to choose the optimal bandwidth, and the results were similar.

Table 1.

Summary of simulation studies in Example 1 with hn = 0.85n^{−1/7}.

| β | True | Proposed Bias | Proposed SSD | Proposed ESE | Proposed CP | CJ Bias | CJ SSD | Cox Bias | Cox SSD | AFT Bias | AFT SSD |
|----|------|------|------|------|------|------|------|------|------|------|------|
| n = 100 | | | | | | | | | | | |
| β1 | 1 | 0.051 | 0.536 | 0.534 | 0.965 | −0.016 | 0.572 | - | - | −0.714 | 0.930 |
| β2 | −1 | 0.007 | 0.359 | 0.391 | 0.974 | 0.011 | 0.506 | 0.517 | 0.255 | - | - |
| β1 | −1 | −0.018 | 0.547 | 0.538 | 0.97 | −0.104 | 0.606 | - | - | −0.058 | 0.411 |
| β2 | 0 | 0.007 | 0.345 | 0.347 | 0.968 | −0.025 | 0.508 | −0.652 | 0.254 | - | - |
| β1 | 0 | −0.069 | 0.550 | 0.538 | 0.953 | −0.102 | 0.488 | - | - | 0.992 | 0.843 |
| β2 | 1 | −0.006 | 0.398 | 0.433 | 0.970 | 0.043 | 0.465 | 0.017 | 0.244 | - | - |
| β1 | 0 | 0.039 | 0.535 | 0.501 | 0.943 | −0.093 | 0.512 | - | - | 0.024 | 0.396 |
| β2 | 0 | −0.016 | 0.352 | 0.378 | 0.968 | 0.024 | 0.440 | 0.060 | 0.236 | - | - |
| n = 200 | | | | | | | | | | | |
| β1 | 1 | −0.001 | 0.456 | 0.402 | 0.939 | −0.009 | 0.509 | - | - | −1.114 | 0.828 |
| β2 | −1 | 0.030 | 0.289 | 0.284 | 0.941 | −0.003 | 0.412 | 0.516 | 0.178 | - | - |
| β1 | −1 | 0.029 | 0.444 | 0.408 | 0.950 | −0.058 | 0.552 | - | - | −0.045 | 0.301 |
| β2 | 0 | −0.050 | 0.311 | 0.314 | 0.954 | 0.076 | 0.439 | −0.653 | 0.179 | - | - |
| β1 | 0 | 0.032 | 0.438 | 0.405 | 0.942 | −0.097 | 0.417 | - | - | 1.271 | 0.389 |
| β2 | 1 | −0.061 | 0.309 | 0.316 | 0.946 | 0.046 | 0.39 | 0.004 | 0.171 | - | - |
| β1 | 0 | 0.014 | 0.425 | 0.376 | 0.915 | −0.067 | 0.393 | - | - | 0.006 | 0.295 |
| β2 | 0 | −0.012 | 0.279 | 0.276 | 0.956 | 0.031 | 0.343 | −0.005 | 0.169 | - | - |

CJ's method refers to the estimator proposed by Chen and Jewell (2001) with the function G(t; Z, β), required for defining the estimating equations, specified as tZ.

Figure 1.


Estimated curves of Λ0(t) = b^{−1} log(1 + bt) with b = 0.5 and b = 1 for n = 100 (left panel) and n = 200 (right panel). Solid line: true curve Λ0(t) with b = 0.5 (upper) and b = 1 (lower); dashed line: the proposed estimated curve Λ̂n(t) with b = 0.5 (upper) and b = 1 (lower).

Example 2

In the second study, we consider a setup similar to the first except that there are six covariates, generated from a multivariate normal distribution with mean zero and covariance between Xi and Xj equal to ρ^{|i−j|}. For this example, we take ρ = 0, 0.2, 0.5, 0.8. The true coefficients are β10 = (1, 1, 1, 0, 0, 0)′ and β20 = (0, 0, 0, 1, 1, 1)′. The largest follow-up time is taken to be τ = 6, and the censoring time is exponentially distributed with rate 4, which gives a censoring rate of around 45%. We apply the variable selection method with the SCAD penalty and compare the result with the estimate without penalization and with the estimate assuming the true structure is known. We also tried other penalties, such as the LASSO and the hard thresholding penalty (HARD), and obtained similar results. Table 2 reports the relative error, defined as the median ratio of the MSE to the MSE of the method without variable selection. The average numbers of zero coefficients are also reported in Table 2, in which the column labeled "Correct" presents the average number of true zero coefficients correctly estimated as 0, and the column labeled "Incorrect" presents the average number of nonzero coefficients erroneously estimated as 0.

Table 2.

Summary of simulation studies in Example 2, based on 1000 replications with sample sizes n = 100 and 200. Setting I: β0 = (1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1)′; Setting II: β0 = (1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)′. "Correct" and "Incorrect" give the average numbers of zero and nonzero coefficients estimated as 0, respectively.

| n | ρ | Method | Rel. Error (I) | Correct (I) | Incorrect (I) | Rel. Error (II) | Correct (II) | Incorrect (II) |
|---|---|---|---|---|---|---|---|---|
| 100 | 0 | PROPOSED | 0.422 | 5.76 | 0.68 | 0.861 | 7.56 | 0.54 |
| | | ORACLE | 0.403 | 6 | 0 | 0.404 | 8 | 0 |
| | 0.2 | PROPOSED | 0.507 | 5.72 | 0.90 | 0.883 | 7.48 | 0.57 |
| | | ORACLE | 0.403 | 6 | 0 | 0.431 | 8 | 0 |
| | 0.5 | PROPOSED | 0.549 | 5.71 | 0.87 | 0.861 | 7.56 | 0.54 |
| | | ORACLE | 0.355 | 6 | 0 | 0.404 | 8 | 0 |
| | 0.8 | PROPOSED | 0.460 | 5.80 | 0.76 | 0.883 | 7.48 | 0.57 |
| | | ORACLE | 0.442 | 6 | 0 | 0.431 | 8 | 0 |
| 200 | 0 | PROPOSED | 0.351 | 5.82 | 0.77 | 0.737 | 7.68 | 0.55 |
| | | ORACLE | 0.321 | 6 | 0 | 0.327 | 8 | 0 |
| | 0.2 | PROPOSED | 0.365 | 5.84 | 0.69 | 0.819 | 7.63 | 0.72 |
| | | ORACLE | 0.354 | 6 | 0 | 0.297 | 8 | 0 |
| | 0.5 | PROPOSED | 0.484 | 5.87 | 0.80 | 0.761 | 7.69 | 0.58 |
| | | ORACLE | 0.385 | 6 | 0 | 0.290 | 8 | 0 |
| | 0.8 | PROPOSED | 0.322 | 5.86 | 0.88 | 0.841 | 7.52 | 0.72 |
| | | ORACLE | 0.295 | 6 | 0 | 0.305 | 8 | 0 |

We can see from Table 2 that the proposed method outperforms the method without penalization. This improvement is more pronounced when the sample size is large. The proposed method also performs well in structure selection, as the number of correctly estimated zeros is close to the truth and the number of incorrectly estimated zeros is close to 0. In addition, the proposed method gives a relative error very close to the oracle relative error, especially when the sample size is large. This empirically confirms the theoretical results in Theorem 4. We also provide simulation results for the case where some covariates appear in both the AFT and Cox parts. Specifically, we set β10 = (1, 1, 1, 1, 0, 0)′ and β20 = (0, 0, 1, 1, 1, 1)′. From Table 2, we can see that the performance in terms of relative errors and variable selection is still satisfactory, although the relative errors are worse than when no covariates overlap between the AFT and Cox models.

6. Application

We apply the proposed methods to a subset of data from the Childhood Cancer Survivor Study (CCSS). This study followed around 14000 survivors who were diagnosed with childhood cancers and survived at least 5 years after the primary diagnosis, to investigate late effects of cancer and cancer treatments ([20], [21]). The diagnoses include leukemia, central nervous system (CNS) malignancy, Hodgkin lymphoma, non-Hodgkin lymphoma, Wilms tumor, neuroblastoma, soft-tissue sarcoma and bone tumors. During the follow-up, detailed demographic and treatment information was collected via questionnaires, and the date of death was obtained from the death certificate if a patient died during the study.

Among all diagnoses, patients with Hodgkin lymphoma (HL) had the highest death rate during the follow-up (16.8%), so we focus on the subgroup of survivors who were diagnosed with HL during their childhood and study the risk factors for death. In total there were 1559 HL survivors. For survivors who were alive at the end of the study, the dates of last contact were used as censoring times. The covariates of interest are age at diagnosis (X1, years), gender (X2: 0 = male, 1 = female), radiation dose in the cancer treatment (X3: 0 if < 2500 cGy, 1 if ≥ 2500 cGy) and cumulative anthracycline exposure (X4, g/m2). We use model (3) to fit the data. Table 3 shows the estimation results without penalty and with the SCAD penalty. The SCAD penalty here is $\dot p_{w_j}(|\beta_j|) = w_j\{I(|\beta_j|\le w_j) + \frac{(a w_j - |\beta_j|)_+}{(a-1)w_j} I(|\beta_j| > w_j)\}$, with wj = w/|β̃j|, a = 3.7, w = 0.3 and β̃j being the jth component of the estimator without penalty. Using K-fold cross-validation, the bandwidth is chosen to be 0.2. From Table 3, anthracycline exposure plays an important role in the AFT part of the model, since it is the only significant covariate in β1. On the other hand, radiation dose is important in the Cox part, since the other covariates in β2 are insignificant. More specifically, higher cumulative anthracycline dose and higher radiation exposure are likely to increase the risk of death in HL survivors, whereas age at diagnosis and gender do not have a significant influence. We also note that the standard errors in the penalized model are markedly smaller than those in the unpenalized model.

Table 3.

Estimation results for the Childhood Cancer Survivor Study with standard errors in parentheses.

| Parameter | Covariate | Without penalty | With penalty |
|---|---|---|---|
| β1 | Age | −0.041 (0.020) | - |
| | Gender | −0.529 (0.139) | - |
| | Radiation dose | −0.888 (0.256) | - |
| | Anthracycline exposure | 3.244 (0.627) | 4.016 (0.342) |
| β2 | Age | 0.125 (0.034) | - |
| | Gender | 0.591 (0.249) | - |
| | Radiation dose | 1.824 (0.354) | 1.028 (0.192) |
| | Anthracycline exposure | −0.344 (1.053) | - |

Conditional on the covariate information X, one can estimate the conditional survival function by exp{−Λ̂n(t exp(X′β̂1n)) exp(X′β̂2n)}. We can then estimate the marginal survival function for a subgroup by averaging the conditional survival function estimates, as sketched below. In Figure 2, we plot the estimated marginal survival functions for two groups: one with X4 = 0, corresponding to those without exposure to anthracycline, and the other with X4 > 0, corresponding to those with exposure to anthracycline. For comparison, we plot the nonparametric Kaplan-Meier estimates, our proposed estimates and the estimates of Chen and Jewell [10] for the marginal survival functions of these two subgroups. From Figure 2, one can see that our proposed estimates agree better with the Kaplan-Meier estimates. The large discrepancy between Chen and Jewell's estimates and the Kaplan-Meier estimates is probably due to numerical instability in implementing their algorithm.
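A minimal sketch of this averaging (our code; Lambda_hat stands for a vectorized version of the estimate in (9), and b1, b2 for the fitted coefficients) is:

```python
import numpy as np

def marginal_survival(tgrid, Xg, b1, b2, Lambda_hat):
    """Average exp(-Lambda_hat(t exp(X'b1)) exp(X'b2)) over subgroup rows Xg."""
    e1, e2 = np.exp(Xg @ b1), np.exp(Xg @ b2)
    S = np.array([np.exp(-Lambda_hat(tgrid * e1[i]) * e2[i])
                  for i in range(len(Xg))])
    return S.mean(axis=0)                # marginal survival on the time grid
```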

Figure 2.


Estimated marginal survival functions for X4 > 0 (left panel) and X4 = 0 (right panel). Solid line: Kaplan-Meier estimates; Dashed line: Chen and Jewell’s estimates; Dotted lines: Our proposed estimates.

7. Conclusion

We have proposed an efficient procedure for estimation and inference in a general semiparametric hazards model and studied penalized likelihood methods for model structure selection. Compared to the estimation approach in Chen and Jewell [10], our proposal is semiparametric efficient and enjoys better numerical properties. In addition, the inference procedure is much simpler than the approach taken by Chen and Jewell [10]. Our method is potentially useful, especially at the initial model building stage, for elucidating the roles of the covariates in time-to-event analysis.

In this paper, we have focused on univariate failure times. An extension to multivariate failure time models would be useful. Another interesting topic worth pursuing is to include a cure fraction [9], which is more natural for many problems.

Appendix

To derive the asymptotic results for the proposed estimator, we need the following regularity conditions. For a function h, let ḣ be its first derivative and ḧ its second derivative. Throughout the paper, we assume that the identifiability condition outlined in Chen and Jewell ([10], Proposition 1) holds.

  • A1

    The true value β0 = (β10′, β20′)′ belongs to the interior of a known compact set $\mathcal{B}$ in R2p.

  • A2

    Conditional on X, C is independent of T and ε, and X is bounded.

  • A3

    The censoring time C has a positive and twice continuously differentiable density in [0, τ] and inf P (Ci = τ|Xi ) > 0.

  • A4

    For t ≥ 0, λ0(t) is positive and thrice continuously differentiable and λ̇0(0) > 0.

  • A5

    If there exists a vector β and one deterministic function G, such that Xβ = G(ε) with probability one, then β = 0 and G = 0.

  • A6

    The kernel function K(·) is thrice continuously differentiable and its rth (r = 0, 1, 2, 3) derivatives have bounded variation. In addition, the first (m − 1) moments of the mth derivative of K(·) are 0 for some m > 3.

  • A7

$$\liminf_{n\to\infty}\,\liminf_{\theta\to 0+}\,\dot p_{w_{jn}}(\theta)/w_{jn} > 0, \qquad \max_{j\in B_1}\,\dot p_{w_{jn}}\big(|\beta_j^{(0)}|\big) = O(n^{-1/2}), \qquad \max_{j\in B_1}\,\ddot p_{w_{jn}}\big(|\beta_j^{(0)}|\big) = o(1).$$

For i = 1, ···, n, define the counting processes Ni(t) = δi I(Yi ≤ t) and the corresponding martingales

$$M_i(t) = N_i(t) - \int_0^t I(Y_i\ge s)\exp(X_i'\beta_{20})\, d\Lambda_0\big(s\exp(X_i'\beta_{10})\big)$$

for t ∈ [0, τ].

Proof of Theorem 1

The uniform convergence of lns(β) to l(β) over β ∈ $\mathcal{B}$ follows from the following two results

$$\sup_{\beta_1\in\mathcal{B}_1,\, s\in[0,\tau]}\Big|\frac{1}{nh}\sum_{i=1}^{n}\delta_i\, K\big((R_i(\beta_1)-s)/h\big) - dP\big(\delta=1,\, R_i(\beta_1)\le s\big)/ds\Big| \to 0, \quad a.s.$$

and

$$\sup_{\beta\in\mathcal{B},\, s\in[0,\tau]}\Big|\frac{1}{n}\sum_{j=1}^{n}\exp(X_j'\beta_2)\int_{-(R_j(\beta_1)-s)/h}^{\infty}K(t)\,dt - E\big[\exp(X'\beta_2)\, I\big(R(\beta_1)\ge s\big)\big]\Big| \to 0, \quad a.s.$$

Now we show that β0 is the unique maximizer of l(β). To this end, define

$$\Lambda(t;\beta) = \int_0^t \frac{dP\big(\delta=1,\, R(\beta_1)\le s\big)/ds}{E\big[\exp(X'\beta_2)\, I\big(R(\beta_1)>s\big)\big]}\, ds.$$

Note that

$$\Lambda(t;\beta_0) = \int_0^t \frac{E\, dN\big(s\exp(-X'\beta_{10})\big)}{E\big[\exp(X'\beta_{20})\, I\big(R(\beta_{10})>s\big)\big]} = \Lambda_0(t)$$

and

$$E\big[\exp(X'\beta_2)\,\Lambda\big(R(\beta_1);\beta\big)\big] = P(\delta=1).$$

Suppose that β maximizes l(β), that is

$$\delta X'(\beta_1+\beta_2) + \delta\log\lambda\big(R(\beta_1);\beta\big) - \exp(X'\beta_2)\,\Lambda\big(R(\beta_1);\beta\big) \ \ge\ \delta X'(\beta_{10}+\beta_{20}) + \delta\log\lambda_0\big(R(\beta_{10})\big) - \exp(X'\beta_{20})\,\Lambda_0\big(R(\beta_{10})\big),$$

where λ(t; β) = dΛ(t; β)/dt.

Taking expectation on both sides and then using the nonnegativity of Kullback-Leibler information, one obtains that

$$\exp\big[\delta X'(\beta_1+\beta_2) + \delta\log\lambda\big(R(\beta_1);\beta\big) - \exp(X'\beta_2)\,\Lambda\big(R(\beta_1);\beta\big)\big] = \exp\big[\delta X'(\beta_{10}+\beta_{20}) + \delta\log\lambda_0\big(R(\beta_{10})\big) - \exp(X'\beta_{20})\,\Lambda_0\big(R(\beta_{10})\big)\big].$$

First, letting δ = 0 and Y = τ, we obtain exp(−exp(X′β2)Λ(τ exp(X′β1); β)) = exp(−exp(X′β20)Λ0(τ exp(X′β10))). Letting δ = 1 and integrating over Y from y to τ, one obtains that

$$\exp\big(-\exp(X'\beta_2)\,\Lambda\big(y\exp(X'\beta_1);\beta\big)\big) - \exp\big(-\exp(X'\beta_2)\,\Lambda\big(\tau\exp(X'\beta_1);\beta\big)\big) = \exp\big(-\exp(X'\beta_{20})\,\Lambda_0\big(y\exp(X'\beta_{10})\big)\big) - \exp\big(-\exp(X'\beta_{20})\,\Lambda_0\big(\tau\exp(X'\beta_{10})\big)\big).$$

These two equalities yield that

$$\exp(X'\beta_2)\,\Lambda\big(y\exp(X'\beta_1);\beta\big) = \exp(X'\beta_{20})\,\Lambda_0\big(y\exp(X'\beta_{10})\big).$$

Then there exists an increasing and differentiable function H0, such that

$$T\exp(X'\beta_1) = H_0\Big(\exp\big(X'(\beta_{20}-\beta_2)\big)\,\Lambda_0\big(\exp(\varepsilon)\big)\Big). \qquad (A.1)$$

From the definitions of T and ε, one obtains that T = exp(−X′β10) exp(ε), and thus X′(β1 − β10) + ε = H1(X′(β20 − β2) + H2(ε)), where H1(·) = log H0(exp(·)) and H2(·) = log(Λ0(exp(·))). Taking derivatives on both sides with respect to ε, one obtains that X′(β20 − β2) = H3(ε) for some function H3, which yields β2 = β20 by assumption A5. Using assumption A5 again, together with (A.1), this result gives X′(β1 − β10) = H1(H2(ε)) − ε and hence β1 = β10. This completes the proof.

Proof of Theorem 2

Let X1 = (X′, X′)′, X2 = (X′, 0p′)′ and X3 = (0p′, X′)′, where 0p denotes the p-dimensional zero vector. Let Pn and P represent the empirical measure and the probability measure, respectively. Then the score function of β is

$$U_n(\beta) = \mathbb{P}_n\Big[\delta X_1 + \delta\,\frac{A_{3n}(Y,X;\beta)}{A_{1n}(Y,X;\beta)} - \delta\,\frac{A_{4n}(Y,X;\beta)}{A_{2n}(Y,X;\beta)}\Big],$$

where Akn(y, x; β) = PnAki (y, x; β), k = 1, 2, 3, 4 and

$$A_{1i}(y,x;\beta) = \frac{1}{h}\,\delta_i\, K\big((R_i(\beta_1)-R(y,x;\beta_1))/h\big), \qquad R(y,x;\beta_1) = y\exp(x'\beta_1),$$
$$A_{2i}(y,x;\beta) = \exp(X_i'\beta_2)\int_{-(R_i(\beta_1)-R(y,x;\beta_1))/h}^{\infty}K(s)\,ds,$$
$$A_{3i}(y,x;\beta) = \delta_i\,\dot K\Big(\frac{R_i(\beta_1)-R(y,x;\beta_1)}{h}\Big)\,\frac{R_i(\beta_1)\, X_{2i} - R(y,x;\beta_1)\, x_2}{h^2},$$
$$A_{4i}(y,x;\beta) = \exp(X_i'\beta_2)\, X_{3i}\int_{-(R_i(\beta_1)-R(y,x;\beta_1))/h}^{\infty}K(s)\,ds + \exp(X_i'\beta_2)\, K\Big(\frac{R_i(\beta_1)-R(y,x;\beta_1)}{h}\Big)\,\frac{R_i(\beta_1)\, X_{2i} - R(y,x;\beta_1)\, x_2}{h}.$$

Let R0 = R(Y, X; β0), R0(y, x ) = R(y, x ; β0). After some calculation, one obtains that

$$A_{jn}(y,x;\beta_0) \to E\, B_j(O;y,x)$$

for j = 1, 2, 3, 4, where

$$B_1(O;y,x) = I\big(R_0\ge R_0(y,x)\big)\,\lambda_0\big(R_0(y,x)\big)\exp(X'\beta_{20}),$$
$$B_2(O;y,x) = I\big(R_0\ge R_0(y,x)\big)\exp(X'\beta_{20}),$$
$$B_3(O;y,x) = f_{R_0}^{\delta,X}\big(R_0(y,x)\big)\,\lambda_0\big(R_0(y,x)\big)\, R_0(y,x)\,(X_2-x_2) - I\big(R_0\ge R_0(y,x)\big)\,\dot\lambda_0\big(R_0(y,x)\big)\exp(X'\beta_{20})\, R_0(y,x)\,(X_2-x_2) - I\big(R_0\ge R_0(y,x)\big)\,\lambda_0\big(R_0(y,x)\big)\exp(X'\beta_{20})\, X_2,$$
$$B_4(O;y,x) = I\big(R_0\ge R_0(y,x)\big)\exp(X'\beta_{20})\, X_3 + f_{R_0}^{\delta,X}\big(R_0(y,x)\big)\exp(X'\beta_{20})\, R_0(y,x)\,(X_2-x_2).$$

Let Õ = (, δ̃, ) be i.i.d copy of O, and the notation EOh(O) is the expectation of some function h(·) of O with respect to the joint density of O. Using the consistency of the estimator β̂n and conditions A3–A6, it follows from the proof of Theorem 2 in Zeng et al. ([8]) and Theorem 2.11.23 in van der Vaart and Wellner [22] that

$$\sqrt n\,\mathbb{P}_n\,\delta\Big\{\frac{A_{3n}(Y,X;\hat\beta_n)}{A_{1n}(Y,X;\hat\beta_n)} - \frac{A_{4n}(Y,X;\hat\beta_n)}{A_{2n}(Y,X;\hat\beta_n)} - \frac{\mathbb{P}_n B_3(O;Y,X)}{\mathbb{P}_n B_1(O;Y,X)} + \frac{\mathbb{P}_n B_4(O;Y,X)}{\mathbb{P}_n B_2(O;Y,X)}\Big\} = o_p(1).$$

Since Un(β̂n) = 0, one obtains that

$$0 = \sqrt n\, U_n(\hat\beta_n) = \sqrt n\,\mathbb{P}_n\Big\{\delta X_1 + \delta\frac{A_{3n}(Y,X;\hat\beta_n)}{A_{1n}(Y,X;\hat\beta_n)} - \delta\frac{A_{4n}(Y,X;\hat\beta_n)}{A_{2n}(Y,X;\hat\beta_n)}\Big\}$$
$$= \sqrt n\,\mathbb{P}_n\Big\{\delta X_1 + \delta\frac{\mathbb{P}_n B_3(O;Y,X)}{\mathbb{P}_n B_1(O;Y,X)} - \delta\frac{\mathbb{P}_n B_4(O;Y,X)}{\mathbb{P}_n B_2(O;Y,X)}\Big\} + o_p(1)$$
$$= \sqrt n\,\mathbb{P}_n\int\Big\{X_1 + \frac{\mathbb{P}_n B_3(O;t,X)}{\mathbb{P}_n B_1(O;t,X)} - \frac{\mathbb{P}_n B_4(O;t,X)}{\mathbb{P}_n B_2(O;t,X)}\Big\}\, dM(t) + \sqrt n\,\mathbb{P}_n\int\Big\{X_1 + \frac{\mathbb{P}_n B_3(O;t,X)}{\mathbb{P}_n B_1(O;t,X)} - \frac{\mathbb{P}_n B_4(O;t,X)}{\mathbb{P}_n B_2(O;t,X)}\Big\}\, I\big(R_0\ge R_0(t,X)\big)\,\lambda_0\big(R_0(t,X)\big)\exp(X'\beta_{20})\, dR_0(t,X) + o_p(1)$$
$$= \sqrt n\,\mathbb{P}_n\int\Big\{X_1 + \frac{E_O B_3(O;t,X)}{E_O B_1(O;t,X)} - \frac{E_O B_4(O;t,X)}{E_O B_2(O;t,X)}\Big\}\, dM(t) + o_p(1).$$

By calculating the expectation of Bj, we can simplify the expressions as

$$0 = \sqrt n\,\mathbb{P}_n\,\dot l_\beta^s(O) + o_p(1),$$

where

$$\dot l_\beta^s(O) = \int\Big\{X_1 - \frac{E_O\big[X_1\exp(X'\beta_{20})\, I\big(R_0\ge R_0(t,X)\big)\big]}{E_O\big[\exp(X'\beta_{20})\, I\big(R_0\ge R_0(t,X)\big)\big]}\Big\}\, dM(t) + \int\Big\{X_2 - \frac{E_O\big[X_2\exp(X'\beta_{20})\, I\big(R_0\ge R_0(t,X)\big)\big]}{E_O\big[\exp(X'\beta_{20})\, I\big(R_0\ge R_0(t,X)\big)\big]}\Big\}\,\frac{\dot\lambda_0\big(R_0(t,X)\big)\, R_0(t,X)}{\lambda_0\big(R_0(t,X)\big)}\, dM(t).$$

Here l̇βs is easily shown to be the efficient score function for β0, as given in Bickel et al. [23]. Write

$$H(t) = -\frac{E_O\big[X_1\exp(X'\beta_{20})\, I(R_0\ge t)\big]}{E_O\big[\exp(X'\beta_{20})\, I(R_0\ge t)\big]} - \frac{E_O\big[X_2\exp(X'\beta_{20})\, I(R_0\ge t)\big]}{E_O\big[\exp(X'\beta_{20})\, I(R_0\ge t)\big]}\,\frac{\dot\lambda_0(t)\, t}{\lambda_0(t)}.$$

Applying the definition of M(t), the score function can be rewritten as

$$\dot l_\beta^s(O) = X_1\delta - \int X_1\, I(R_0\ge t)\,\lambda_0(t)\exp(X'\beta_{20})\, dt + \delta X_2\,\frac{\dot\lambda_0(R_0)\, R_0}{\lambda_0(R_0)} - \int X_2\,\frac{\dot\lambda_0(t)\, t}{\lambda_0(t)}\, I(R_0\ge t)\,\lambda_0(t)\exp(X'\beta_{20})\, dt + \int_0^{R_0} H(t)\,\lambda_0(t)\exp(X'\beta_{20})\, dt + \delta H(R_0)$$
$$= X_2\delta + X_2 R_0\Big\{\delta\,\frac{\dot\lambda_0(R_0)}{\lambda_0(R_0)} - \lambda_0(R_0)\Big\} + \delta H(R_0) + X_3\delta - X_3\int_0^{R_0}\lambda_0(t)\exp(X'\beta_{20})\, dt - \int_0^{R_0} H(t)\,\lambda_0(t)\exp(X'\beta_{20})\, dt.$$

Now we prove that E[l̇βs(l̇βs)′] is nonsingular. Otherwise, there exists a nonzero vector α such that α′E[l̇βs(l̇βs)′]α = 0, which yields α′l̇βs = 0. After multiplying both sides by d exp(−Λ0(R0(t, X))) and integrating over t, we obtain X2′α = A1(ε) and X3′α = A2(ε) for two differentiable functions A1 and A2, by solving the two equations obtained by letting δ = 1 and δ = 0. Let α = (α1′, α2′)′, where α1 is the first p-dimensional sub-vector of α and α2 is the last p-dimensional sub-vector. By the definitions of X2 and X3, one obtains that X′α1 = A1(ε) and X′α2 = A2(ε) with probability one. Finally, Condition A5 entails that α1 = α2 = 0, and thus α = 0. This completes the proof.

Proof of Theorem 3

Denote $\mathbb{G}_n = \sqrt n(\mathbb{P}_n - P)$ and R̂ = Y exp(X′β̂1n). Note that

$$\sqrt n\,\hat\Lambda_n(t) = \sqrt n\int_0^t \frac{\frac{1}{nh}\sum_{i=1}^{n}\delta_i\, K\big((Y_i e^{X_i'\hat\beta_{1n}}-s)/h\big)}{\frac{1}{n}\sum_{i=1}^{n} e^{X_i'\hat\beta_{2n}}\int_{-(Y_i e^{X_i'\hat\beta_{1n}}-s)/h}^{\infty}K(u)\,du}\, ds$$
$$= \mathbb{G}_n\int_0^t \frac{\delta\, K\big((R_0-s)/h\big)}{G_0(s)}\, ds + \mathbb{G}_n\int_0^t \frac{\delta}{h}\Big[K\Big(\frac{\hat R-s}{h}\Big) - K\Big(\frac{R_0-s}{h}\Big)\Big]\frac{ds}{G_0(s)}$$
$$\quad - \mathbb{G}_n\int_0^t \frac{E_O\big[\delta f_{R_0}^{\delta}(s)\big]}{G_0(s)^2}\, e^{X'\beta_{20}}\int_{-(R_0-s)/h}^{\infty}K(u)\,du\, ds \times X_3'(\hat\beta_n-\beta_0)$$
$$\quad - \mathbb{G}_n\int_0^t \frac{E_O\big[\delta f_{R_0}^{\delta}(s)\big]}{G_0(s)^2}\Big[e^{X'\beta_{20}}\int_{-(R_0-s)/h}^{\infty}K(u)\,du + \frac{1}{h}\, e^{X'\beta_{20}}\, K\Big(\frac{R_0-s}{h}\Big) R_0\Big]\, ds \times X_2'(\hat\beta_n-\beta_0)$$
$$\quad + \sqrt n\int_0^t \frac{E_O\big[\delta f_{R_0}^{\delta}(s)\big]}{G_0(s)}\, ds + o_p(1).$$

By the asymptotic normality of β̂n, one obtains that

$$\mathbb{G}_n\int_0^t \frac{\delta}{h}\Big[K\Big(\frac{\hat R-s}{h}\Big) - K\Big(\frac{R_0-s}{h}\Big)\Big]\frac{ds}{G_0(s)} = \mathbb{G}_n\int_0^t \frac{\delta}{h^2}\,\dot K\Big(\frac{R_0-s}{h}\Big)\,\frac{R_0}{G_0(s)}\, X_2'(\hat\beta_n-\beta_0)\, ds + o_p(1)$$
$$= E_O\int_0^t \frac{\delta}{h^2}\,\dot K\Big(\frac{R_0-s}{h}\Big)\,\frac{R_0}{G_0(s)}\, X_2\, ds \times\big[E_O\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\big]^{-1}\,\mathbb{G}_n\,\dot l_\beta^s(O) + o_p(1)$$
$$= \int_0^t ds\int \frac{\dot K\big((y-s)/h\big)\, y\, G_1(y)\,\lambda_0(y)}{h^2\, G_0(s)}\, dy \times\big[E_O\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\big]^{-1}\,\mathbb{G}_n\,\dot l_\beta^s(O) + o_p(1)$$
$$= -\int_0^t \frac{\big[G_1(s)+s\,\dot G_1(s)\big]\lambda_0(s) + s\, G_1(s)\,\dot\lambda_0(s)}{G_0(s)}\, ds \times\big[E_O\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\big]^{-1}\,\mathbb{G}_n\,\dot l_\beta^s(O) + o_p(1).$$

In addition, by observing that

$$E_O\big[\delta f_{R_0}^{\delta}(s)\big] = E\int \frac{1}{h}\, K\Big(\frac{y-s}{h}\Big)\lambda_0(y)\, I(R_0\ge y)\, e^{X'\beta_{20}}\, dy = \int \frac{1}{h}\, K\Big(\frac{y-s}{h}\Big)\lambda_0(y)\, G_0(y)\, dy \ \to\ \lambda_0(s)\, G_0(s),$$

$\int_0^t E_O\big[\delta f_{R_0}^{\delta}(s)\big]/G_0(s)\, ds = \Lambda_0(t)$, and

$$E\Big[\frac{1}{h}\, e^{X'\beta_{20}}\, K\Big(\frac{R_0-s}{h}\Big)\, R_0\, X_2\Big] \ \to\ -s\,\dot G_1(s),$$

one obtains that

$$\sqrt n\big(\hat\Lambda_n(t)-\Lambda_0(t)\big) = \mathbb{G}_n\int_0^t \frac{\delta f_{R_0}^{\delta,X}(s) - \lambda_0(s)\, e^{X'\beta_{20}}\, I(R_0>s)}{G_0(s)}\, ds - \int_0^t \frac{\big[G_1(s)+G_3(s)\big]\lambda_0(s) + s\, G_1(s)\,\dot\lambda_0(s)}{G_0(s)}\, ds \times\big[E_O\,\dot l_\beta^s(O)\,\dot l_\beta^s(O)'\big]^{-1}\,\mathbb{G}_n\,\dot l_\beta^s(O) + o_p(1).$$

The proof is completed.

References

  • 1. Cox DR. Regression models and life tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  • 2. Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
  • 3. Kalbfleisch J, Prentice R. The Statistical Analysis of Failure Time Data. Wiley; New York: 2002.
  • 4. Ritov Y. Estimation in a linear regression model with censored data. Annals of Statistics. 1990;18:303–328.
  • 5. Wei LJ, Ying Z, Lin DY. Linear regression analysis of censored survival data based on rank tests. Biometrika. 1990;77:845–851.
  • 6. Lai TL, Ying Z. Rank regression methods for left-truncated and right-censored data. Annals of Statistics. 1991;19:531–556.
  • 7. Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika. 2003;90:341–353.
  • 8. Zeng D, Lin D. Efficient estimation for the accelerated failure time model. Journal of the American Statistical Association. 2007;102:1387–1396.
  • 9. Lu W. Efficient estimation for accelerated failure time model with a cure fraction. Statistica Sinica. 2010;20:661–674.
  • 10. Chen YQ, Jewell NP. On a general class of semiparametric hazards regression models. Biometrika. 2001;88:687–702.
  • 11. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  • 12. Zhang HH, Lu W. Adaptive Lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
  • 13. Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics. 2009;65:394–404. doi: 10.1111/j.1541-0420.2008.01074.x.
  • 14. Lu W, Zhang HH. Variable selection for proportional odds model. Statistics in Medicine. 2007;26:3771–3781. doi: 10.1002/sim.2833.
  • 15. Johnson BA. Variable selection in semiparametric linear regression with censored data. Journal of the Royal Statistical Society, Series B. 2008;70:351–370.
  • 16. Johnson B, Lin DY, Zeng D. Penalized estimating functions and variable selection in semiparametric regression models. Journal of the American Statistical Association. 2008;103(482):672–680. doi: 10.1198/016214508000000184.
  • 17. Zou H. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
  • 18. Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  • 19. Wang H, Li R, Tsai C. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
  • 20. Robison LL, Mertens AC, Boice JD, et al. Study design and cohort characteristics of the Childhood Cancer Survivor Study: a multiinstitutional collaborative project. Medical and Pediatric Oncology. 2002;38(4):229–239. doi: 10.1002/mpo.1316.
  • 21. Robison LL, Armstrong G, Boice JD, et al. The Childhood Cancer Survivor Study: a National Cancer Institute-supported resource for outcome and intervention research. Journal of Clinical Oncology. 2009;27(14):2308–2318. doi: 10.1200/JCO.2009.22.3339.
  • 22. Van der Vaart AW, Wellner J. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 1996.
  • 23. Bickel PJ, Klaassen C, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer; New York: 1993.
