Author manuscript; available in PMC 2013 Sep 4.
Published in final edited form as: Stat Comput. 2010 Apr 1;20(2):165–176. doi:10.1007/s11222-009-9126-y

Rank-based variable selection with censored data

Jinfeng Xu 1, Chenlei Leng 2, Zhiliang Ying 3
PMCID: PMC3762511  NIHMSID: NIHMS491952  PMID: 24013588

Abstract

A rank-based variable selection procedure is developed for the semiparametric accelerated failure time model with censored observations where the penalized likelihood (partial likelihood) method is not directly applicable.

The new method penalizes the rank-based Gehan-type loss function with the $\ell_1$ penalty. To choose the tuning parameters correctly, a novel likelihood-based $\chi^2$-type criterion is proposed. Desirable properties of the estimator, such as the oracle properties, are established through a local quadratic expansion of the Gehan loss function.

In particular, our method can be easily implemented with standard linear programming packages and hence is numerically convenient. Extensions to marginal models for multivariate failure time data are also considered. The performance of the new procedure is assessed through extensive simulation studies and illustrated with two real examples.

Keywords: Accelerated failure time model, Adaptive Lasso, BIC, Gehan-type loss function, Lasso, Variable selection

1 Introduction

As an attractive alternative to the proportional hazards model (Cox 1972), the accelerated failure time model directly specifies a regression model for the log-transformed survival time on a set of covariates,

$$\log T_i=\beta_0^T X_i+\varepsilon_i,\qquad i=1,\ldots,n,$$

where Ti is the survival time, Xi is a p-dimensional covariate, β0 is a p-vector of unknown regression parameters and εi (i = 1, …, n) are independent error terms with a common, but completely unspecified, distribution. Because of this direct relationship between survival time and covariates, the accelerated failure time model is physically interpretable and in many ways more appealing than the proportional hazards model (Kalbfleisch and Prentice 2002, Chap. 7).

A common phenomenon in survival analysis is that data are subject to right censoring. Due to the censoring, the observed data are $(\tilde T_i, \Delta_i, X_i)$, $i=1,\ldots,n$, where $C_i$ is the censoring time, $\tilde T_i = T_i\wedge C_i$ is the observed time and $\Delta_i = 1\{T_i\le C_i\}$ is the censoring indicator. Define $e_i(\beta)=\log\tilde T_i-\beta^T X_i$, $N_i(\beta;t)=\Delta_i\,1\{e_i(\beta)\le t\}$ and $Y_i(\beta;t)=1\{e_i(\beta)\ge t\}$. Note that $N_i$ and $Y_i$ are the counting process and at-risk process on the time scale of the residual. Write

$$S^{(0)}(\beta;t)=n^{-1}\sum_{i=1}^n Y_i(\beta;t),\qquad S^{(1)}(\beta;t)=n^{-1}\sum_{i=1}^n Y_i(\beta;t)X_i.$$

The weighted log-rank estimating function for β0 takes the form

$$U_\phi(\beta)=\sum_{i=1}^n\Delta_i\,\phi\{\beta;e_i(\beta)\}\bigl[X_i-\bar X\{\beta;e_i(\beta)\}\bigr],\quad\text{or}\quad U_\phi(\beta)=\sum_{i=1}^n\int\phi(\beta;t)\{X_i-\bar X(\beta;t)\}\,dN_i(\beta;t),$$

where $\bar X(\beta;t)=S^{(1)}(\beta;t)/S^{(0)}(\beta;t)$ and $\phi$ is a possibly data-dependent weight function; the choice $\phi=S^{(0)}$ leads to the Gehan statistic. In this case, $U_\phi$ can be written as

$$U_G(\beta)=\sum_{i=1}^n\Delta_i\,S^{(0)}\{\beta;e_i(\beta)\}\bigl[X_i-\bar X\{\beta;e_i(\beta)\}\bigr],$$

or

$$U_G(\beta)=n^{-1}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\,(X_i-X_j)\,1\{e_i(\beta)\le e_j(\beta)\},$$

which is the gradient of the convex function

$$L_G(\beta):=n^{-1}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\{e_i(\beta)-e_j(\beta)\}^-,$$

where $a^-=|a|\,1\{a<0\}$. Based on this fact, Jin et al. (2003) provided simple and reliable methods for obtaining the rank estimators. They showed that the rank estimator with the Gehan-type (Gehan 1965) weight function can be readily obtained by minimizing a convex objective function through a standard linear programming technique.
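To make the pairwise forms concrete, the following is a minimal R sketch (our illustration, not code from the paper) of $L_G$ and its gradient $U_G$; `logT` holds the observed $\log\tilde T_i$, `delta` the censoring indicators and `X` the covariate matrix.

```r
gehan_loss <- function(beta, logT, delta, X) {
  e <- logT - as.vector(X %*% beta)      # residuals e_i(beta)
  d <- outer(e, e, "-")                  # d[i, j] = e_i(beta) - e_j(beta)
  sum(delta * pmax(-d, 0)) / length(e)   # n^{-1} sum_{i,j} Delta_i {e_i - e_j}^-
}

gehan_grad <- function(beta, logT, delta, X) {
  e <- logT - as.vector(X %*% beta)
  U <- numeric(ncol(X))
  for (i in which(delta == 1)) {         # only uncensored i contribute
    R <- e >= e[i]                       # risk set {j : e_j(beta) >= e_i(beta)}
    U <- U + sum(R) * X[i, ] - colSums(X[R, , drop = FALSE])
  }
  U / length(e)                          # n^{-1} sum_{i,j} Delta_i (X_i - X_j) 1{e_i <= e_j}
}
```

Since $L_G$ is convex and piecewise linear, any convex solver could in principle be used, but the linear programming route described in Sect. 3 solves the minimization exactly.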

An important objective of survival analysis is to identify a subset of significant variables from a large number of covariates, which are often collected to reduce possible modeling bias. Variable selection is fundamental to statistical modeling, and recently a number of approaches based on the penalized partial likelihood have been applied to the Cox proportional hazards model and gained increasing popularity; see, for example, the Lasso (Tibshirani 1996, 1997), SCAD (Fan and Li 2001, 2002), the adaptive Lasso (Zou 2006, 2008; Zhang and Lu 2007; Lu and Zhang 2007; Wang et al. 2007b), and LSA (Wang and Leng 2007). Wang et al. (2007a) discussed penalized least absolute deviation estimation without considering censoring. In the accelerated failure time model above, we are particularly interested in estimating the unknown vector $\beta_0$ and identifying its nonzero components. Zhang and Lu (2007) proposed the penalized partial likelihood method for variable selection in the Cox model. In the accelerated failure time model, however, the partial likelihood is not available, and semiparametric estimation of the regression coefficient vector relies on rank-based methods. The estimates are usually obtained by minimizing a non-smooth objective function or by solving estimating equations which are step functions with potentially multiple roots (Jin et al. 2003). Johnson (2008) and Johnson et al. (2008) proposed a general variable selection procedure that penalizes estimating equations, with broad applications including in particular the censored accelerated failure time model. For uncensored data, Johnson and Peng (2008) developed a rank-based variable selection procedure in the linear model and established desirable properties such as robustness and the oracle properties. In this article, we propose the $\ell_1$-regularized Gehan estimator for simultaneous estimation and variable selection, which yields advantages on two fronts. First, the shrinkage property of the $\ell_1$ penalty and a proper choice of tuning parameters build sparse models without sacrificing accuracy. Second, because both components of the criterion function are of $\ell_1$ type, the minimization reduces numerically to a linear programming problem, making the resulting methodology extremely easy to implement.

The rest of the paper is organized as follows. Section 2 introduces the $\ell_1$-regularized Gehan estimator and gives its asymptotic properties. A novel $\chi^2$-type criterion for choosing the tuning parameters is proposed in Sect. 3. Extensions to multivariate failure time data are considered in Sect. 4. The proposed methodology is illustrated with applications to two datasets in Sect. 5. In Sect. 6, simulations are conducted to assess the finite-sample performance of the proposed methods. Section 7 concludes with a general discussion. All proofs are relegated to the Appendix.

2 The $\ell_1$-regularized Gehan estimator

Define the $\ell_1$-regularized Gehan loss function as

$$L_P(\beta):=n^{-1}L_G(\beta)+\sum_{j=1}^p\lambda_{nj}|\beta_j|=n^{-2}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\{e_i(\beta)-e_j(\beta)\}^-+\sum_{j=1}^p\lambda_{nj}|\beta_j|, \qquad (1)$$

where $\lambda_{nj}$, $j=1,\ldots,p$, are tuning parameters and $\beta_j$ is the $j$th component of the vector $\beta$. The regularized estimate of $\beta_0$ is a minimizer of $L_P(\beta)$, denoted by $\hat\beta$.

Let $\beta_0=\{(\beta_{01})^T,(\beta_{02})^T\}^T$, where $\beta_{01}$ is an $s$-vector and $\beta_{02}$ is a $(p-s)$-vector. Without loss of generality, assume that $\beta_{02}$ is the zero vector and that all components of $\beta_{01}$ are nonzero. Suppose further that the $\{\lambda_{nj}\}$ satisfy the following condition:

$$(\mathrm C)\qquad \sqrt n\,\lambda_{nj}\to 0\ \text{ for }1\le j\le s\qquad\text{and}\qquad \sqrt n\,\lambda_{nj}\to\infty\ \text{ for }s+1\le j\le p.$$

This condition states that the penalties applied to the zero entries of $\beta$ dominate those on the nonzero entries. Intuitively, the requirement that $\sqrt n\,\lambda_{nj}\to 0$ for $1\le j\le s$ enables the resulting estimates of $\beta_{01}$ to be $\sqrt n$-consistent, and the condition that $\sqrt n\,\lambda_{nj}\to\infty$ for $s+1\le j\le p$ shrinks the estimates of $\beta_{02}$ to zero. This observation is made rigorous by the following theorem.

Theorem 1

(Oracle properties) Under conditions 1–4 of Ying (1993) and condition (C), with probability tending to one, the penalized estimator $\hat\beta=\{(\hat\beta_1)^T,(\hat\beta_2)^T\}^T$ has the following properties:

  1. β̂2 = 0;

  2. $n^{1/2}(\hat\beta_1-\beta_{01})\to N(0,V)$, where $V=A_{G1}^{-1}B_{G1}A_{G1}^{-1}$, and $A_{G1}$ and $B_{G1}$ are defined in the Appendix.

Remark 1

Theorem 1 states that the proposed estimator estimates the coefficients of the important variables as if the true model were known in advance, which is referred to as the oracle property by Fan and Li (2001). However, as noted by an anonymous referee, the model selection consistency of the adaptive Lasso is obtained through large sample theory; for finite sample sizes and small signal-to-noise ratios, it can perform rather poorly, sometimes even worse than the ordinary Lasso. Furthermore, Leeb and Pötscher (2008) showed that the maximal risk of any sparse estimator goes to infinity as the sample size increases. Hence minimax efficiency and model selection consistency seem to be two irreconcilable properties, and in practice one should be cautious about which of them to prioritize when choosing an estimator.

The limiting covariance matrix $V$ involves the unknown hazard function, so it is difficult to estimate $V$ analytically. Here we apply the random perturbation method (Rao and Zhao 1992; Parzen et al. 1994; Jin et al. 2001) to approximate the distribution of $\hat\beta$. To be specific, define $\hat\beta^*$ as a minimizer of the perturbed $\ell_1$-regularized Gehan loss function

$$L_P^*(\beta):=n^{-2}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\{e_i(\beta)-e_j(\beta)\}^- Z_i+\sum_{j=1}^p\lambda_{nj}|\beta_j|, \qquad (2)$$

where the $Z_i$ are independent random variables with $E(Z_i)=1$ and $\mathrm{Var}(Z_i)=1$; in the data analysis and simulation studies, the standard exponential distribution is used. The following theorem justifies the use of the random perturbation method for approximating the distribution of the estimator.

Theorem 2

(Asymptotic variance) Under (C) and conditions 1–4 of Ying (1993, p. 80), with probability tending to one, conditional on the data $(\tilde T_i,\Delta_i,X_i)$ $(i=1,\ldots,n)$, $\hat\beta^*=\{(\hat\beta_1^*)^T,(\hat\beta_2^*)^T\}^T$ has the following properties: (a) $\hat\beta_2^*=0$; (b) $n^{1/2}(\hat\beta_1^*-\hat\beta_1)\to N(0,V)$.

In practice, we fix the tuning parameters at values satisfying condition (C) when implementing the perturbation method.
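As an illustration, the resampling scheme takes only a few lines; the sketch below assumes a routine `penalized_gehan(logT, delta, X, lambda_nj, weights)` that minimizes the weighted penalized loss, such as the linear programming sketch given later in Sect. 3.

```r
perturb_se <- function(logT, delta, X, lambda_nj, B = 1000) {
  draws <- replicate(B, {
    Z <- rexp(length(logT))              # Z_i ~ Exp(1): E(Z_i) = Var(Z_i) = 1
    penalized_gehan(logT, delta, X, lambda_nj, weights = Z)
  })
  apply(draws, 1, sd)                    # componentwise standard errors
}
```

Standard errors for $\hat\beta_1$ are then read off the coordinates estimated as nonzero.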

3 Computation and tuning parameter selection

As pointed out in Jin et al. (2003), the minimization of $L_P(\beta)$ can be carried out by linear programming and is equivalent to the minimization of

$$n^{-2}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\,|e_i(\beta)-e_j(\beta)|+\Bigl|M-\beta^T\sum_{k=1}^n\sum_{l=1}^n\Delta_k(X_l-X_k)\Bigr|+\sum_{j=1}^p\lambda_{nj}|\beta_j|,$$

where $M$ is a large constant. An implementation of the algorithm is discussed by Koenker and D'Orey (1987), with code available in S-PLUS, R and other software; the minimization of $L_P^*(\beta)$ can be implemented similarly. Condition (C) suggests pre-selecting the $\{\lambda_{nj}\}$ based on a preliminary estimate $\tilde\beta$ of $\beta$, and in this work we take

$$\lambda_{nj}=\lambda\,|\tilde\beta_j|^{-\tau},\qquad j=1,\ldots,p, \qquad (3)$$

for $\tau>0$, where the $\tilde\beta_j$ are $\sqrt n$-consistent. This choice is discussed by Zou (2006), Zhang and Lu (2007) and Wang et al. (2007b). For such $\lambda_{nj}$'s, we have $a_n=\max\{\lambda_{nj},\,j=1,\ldots,s\}=\lambda\,O_p(1)$ and $b_n=\min\{\lambda_{nj},\,j=s+1,\ldots,p\}=\lambda\,O_p(n^{\tau/2})$. It is then easy to see that once $\lambda$ satisfies

$$\sqrt n\,a_n\to 0\qquad\text{and}\qquad\sqrt n\,b_n\to\infty,$$

Theorems 1 and 2 hold. This simplification is attractive from a computational viewpoint, since we only need to choose one tuning parameter λ instead of p tuning parameters λnj, j = 1, …, p. For later exposition, we shall fix {λnj} according to (3) with τ = 1.
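Putting the pieces together, the linear program above can be handed to quantreg directly. The sketch below (our illustration under the stated assumptions, not the authors' code) creates one $L_1$ "observation" per absolute-value term: a weighted row for each pair $(i,j)$ with $\Delta_i=1$, one big-$M$ row carrying the linear part of the Gehan loss, and one pseudo-row per penalty term. The factor 2 on the penalty rows comes from the identity $a^-=(|a|-a)/2$, so that the program minimizes $2L_P(\beta)$ up to an additive constant; the $O(n^2)$ pairwise rows make this a small-$n$ sketch only.

```r
library(quantreg)

penalized_gehan <- function(logT, delta, X, lambda_nj, M = 1e6,
                            weights = rep(1, length(logT))) {
  n <- length(logT); p <- ncol(X)
  i <- rep(seq_len(n), each = n); j <- rep(seq_len(n), times = n)
  keep <- delta[i] == 1                        # pairs with Delta_i = 1
  w  <- weights[i][keep] / n^2                 # weights carry the Z_i of Sect. 2
  y  <- (logT[i] - logT[j])[keep] * w          # w * (log T_i - log T_j)
  Xd <- (X[i[keep], , drop = FALSE] - X[j[keep], , drop = FALSE]) * w
  ## big-M row: |M - beta'v| ~ M - beta'v for large M, supplying the linear
  ## part of the Gehan loss, with v = n^{-2} sum_i sum_j Delta_i (X_j - X_i)
  v  <- colSums((X[j, , drop = FALSE] - X[i, , drop = FALSE]) *
                  (delta[i] * weights[i])) / n^2
  y  <- c(y, M);         Xd <- rbind(Xd, v)
  ## one pseudo-row per penalty term: |0 - 2 lambda_nj beta_j| = 2 lambda_nj |beta_j|
  y  <- c(y, rep(0, p)); Xd <- rbind(Xd, diag(2 * lambda_nj, nrow = p))
  rq.fit(Xd, y, tau = 0.5)$coefficients        # median regression, no intercept
}
```

Calling the routine with `lambda_nj = rep(0, p)` returns the unpenalized Gehan estimator $\tilde\beta$, whose components are then plugged into (3) to form the adaptive weights.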

In order to study the dependence of the selected model on the tuning parameter $\lambda$, we denote the model corresponding to $\hat\beta_\lambda$ by $S_\lambda=\{j:\hat\beta_{\lambda,j}\ne 0\}$. Write the derivative of the Gehan loss function as

$$U_G(\beta)=n^{-1}\sum_{i=1}^n\Delta_i\,S^{(0)}\{\beta;e_i(\beta)\}\bigl[X_i-\bar X\{\beta;e_i(\beta)\}\bigr].$$

Then $\sqrt n\,U_G(\beta_0)$ is asymptotically normal with mean zero and variance estimated by

$$B_{Gn}=n^{-1}\sum_{i=1}^n\Delta_i\bigl[S^{(0)}\{\beta;e_i(\beta)\}\bigr]^2\bigl[X_i-\bar X\{\beta;e_i(\beta)\}\bigr]^{\otimes 2},$$
where $a^{\otimes 2}=aa^T$.

Consider the following $\chi^2$-type statistic:

$$T_\lambda=n\,U_G(\hat\beta_\lambda)^T B_{Gn}^{-1}(\hat\beta_G)\,U_G(\hat\beta_\lambda),$$

where $B_{Gn}$ is evaluated at the unpenalized Gehan estimator $\hat\beta_G$. By arguments similar to those in Wei et al. (1990), when $S_\lambda\supseteq\{1,\ldots,s\}$, i.e., when a correct model is identified (not necessarily the true model), $T_\lambda$ asymptotically follows the $\chi^2$ distribution with degrees of freedom $q$ equal to the number of zero components of $\hat\beta_\lambda$. $T_\lambda$ is scale-invariant and has a likelihood interpretation, and hence can be used for tuning parameter selection. An attractive property of $T_\lambda$ is that $B_{Gn}$ does not require density estimation and thus can be easily computed. Based on $T_\lambda$, we propose the Bayesian information criterion (BIC)

$$\mathrm{BIC}_\lambda=T_\lambda+\log n\cdot df_\lambda,$$

where dfλ is the number of nonzero components in β̂λ. Similarly, the Akaike information criterion (AIC) can be defined as

$$\mathrm{AIC}_\lambda=T_\lambda+2\cdot df_\lambda.$$

Replacing the $\ell_2$ loss function in generalized cross-validation (GCV) with the Gehan loss function, GCV can be defined as

$$\mathrm{GCV}_\lambda=\frac{L_G(\hat\beta_\lambda)}{(1-df_\lambda/n)^2}.$$
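For concreteness, the statistic and the three criteria can be computed as follows, reusing `gehan_loss` and `gehan_grad` from Sect. 1 (a sketch with our own names; note that $U_G$ in this section carries an extra factor $n^{-1}$ relative to Sect. 1, and that $B_{Gn}$ is evaluated at the unpenalized Gehan estimator $\hat\beta_G$ as in the display above).

```r
tuning_criteria <- function(beta_lambda, beta_G, logT, delta, X) {
  n  <- length(logT); p <- ncol(X)
  U  <- gehan_grad(beta_lambda, logT, delta, X) / n  # U_G(beta_lambda), this section's scaling
  eG <- logT - as.vector(X %*% beta_G)               # residuals at the Gehan estimate
  B  <- matrix(0, p, p)                              # B_Gn(beta_G), outer-product form
  for (i in which(delta == 1)) {
    R  <- eG >= eG[i]
    s0 <- mean(R)                                    # S^(0){beta_G; e_i(beta_G)}
    xb <- X[i, ] - colMeans(X[R, , drop = FALSE])    # X_i - Xbar{beta_G; e_i(beta_G)}
    B  <- B + s0^2 * tcrossprod(xb)
  }
  B    <- B / n
  Tlam <- n * drop(t(U) %*% solve(B, U))             # the chi-square type statistic
  df   <- sum(beta_lambda != 0)                      # number of nonzero components
  c(BIC = Tlam + log(n) * df,
    AIC = Tlam + 2 * df,
    GCV = gehan_loss(beta_lambda, logT, delta, X) / (1 - df / n)^2)
}
```

In practice one evaluates these criteria over a grid of $\lambda$ values and picks the minimizer.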

We define $R_0=\{\lambda\ge 0: S_\lambda=\{1,\ldots,s\}\}$ as the set of $\lambda$'s for which the true model is identified. In addition, we define a reference tuning parameter sequence $\{\lambda_n^*=1/[\sqrt n\,\log(n)]\}_{n=1}^\infty$. By Theorem 1, it follows that with probability tending to one, $S_{\lambda_n^*}=\{1,\ldots,s\}$. We have the following consistency theorem for the BIC method.

Theorem 3

Under the assumptions of Theorem 1, $P\bigl(\inf_{\lambda\notin R_0}\mathrm{BIC}_\lambda>\mathrm{BIC}_{\lambda_n^*}\bigr)\to 1$.

This theorem states that for any $\lambda$ which fails to select the true model, the associated BIC value is asymptotically larger than that of the reference sequence. Therefore, the optimal $\lambda$ which minimizes BIC must correspond to the true model. The proof is similar to that of Theorem 4 in Wang and Leng (2007) and is therefore omitted. Note that neither AIC nor GCV yields consistent model selection when a true sparse model exists (Wang et al. 2007c).

4 Extensions to multivariate failure time data

Following Jin et al. (2006), in this section we extend the $\ell_1$-regularized method to multivariate failure time data. Oracle properties for the estimators defined in this section, similar to those in Theorems 1–3, can be established using similar techniques and are therefore omitted. Moreover, computing the estimates relies on the same linear programming technique and thus is easy to implement.

4.1 Multiple events data

Multiple events data arise when a subject can potentially experience several types of event or failure. For k = 1, …, K and i = 1, …, n, let Tki be the time to the kth failure of the ith subject, let Cki be the censoring time on Tki, and let Xki be the corresponding pk-vector of covariates. We assume that (T1i, …, TKi) is independent of (C1i, …, CKi) conditional on (X1i, …, XKi). The marginal accelerated failure time models take the form

$$\log T_{ki}=X_{ki}^T\beta_k+\varepsilon_{ki}\qquad(k=1,\ldots,K;\ i=1,\ldots,n),$$

where $\beta_k$ is a $p_k$-vector of unknown regression parameters, and $(\varepsilon_{1i},\ldots,\varepsilon_{Ki})$, $i=1,\ldots,n$, are independent random vectors from an unspecified joint distribution with marginal distribution functions $F_1,\ldots,F_K$. The data consist of $(\tilde T_{ki},\delta_{ki},X_{ki})$, $k=1,\ldots,K$; $i=1,\ldots,n$, where $\tilde T_{ki}=\min(T_{ki},C_{ki})$ and $\delta_{ki}=1\{T_{ki}\le C_{ki}\}$.

Let $e_{ki}(\beta)=\log\tilde T_{ki}-\beta^T X_{ki}$; the Gehan-type loss function for $\beta_k$ is then

$$L_{k,G}(\beta)=n^{-1}\sum_{i=1}^n\sum_{j=1}^n\delta_{ki}\{e_{ki}(\beta)-e_{kj}(\beta)\}^-.$$

For each $k=1,\ldots,K$, the corresponding $\ell_1$-regularized loss function is

$$n^{-1}\sum_{i=1}^n\sum_{j=1}^n\delta_{ki}\{e_{ki}(\beta)-e_{kj}(\beta)\}^-+\sum_{j=1}^{p_k}\lambda_{nj}|\beta_{kj}|.$$

Similarly, the minimization problem can be reduced to K standard linear programming problems and the distribution of the estimator can be approximated by the perturbation method.
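Since the $K$ marginal criteria share no parameters, a sketch of the multiple-events fit is simply a loop over event types, reusing the `penalized_gehan` routine sketched in Sect. 3 (the input layout is our own choice):

```r
fit_multiple_events <- function(logT, delta, X, lambda_nj) {
  ## logT, delta: n x K matrices; X and lambda_nj: lists of length K
  lapply(seq_len(ncol(logT)), function(k)
    penalized_gehan(logT[, k], delta[, k], X[[k]], lambda_nj[[k]]))
}
```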

4.2 Clustered failure time data

Clustered failure time data arise when we have a random sample of $n$ clusters with $K_i$ members in the $i$th cluster. Let $T_{ik}$ and $C_{ik}$ be the failure time and censoring time for the $k$th member of the $i$th cluster, and let $X_{ik}$ be the corresponding $p\times 1$ vector of covariates. We assume that $(T_{i1},\ldots,T_{iK_i})$ and $(C_{i1},\ldots,C_{iK_i})$ are independent conditional on $(X_{i1},\ldots,X_{iK_i})$. The data consist of $(\tilde T_{ik},\delta_{ik},X_{ik})$, $k=1,\ldots,K_i$; $i=1,\ldots,n$, where $\tilde T_{ik}=\min(T_{ik},C_{ik})$ and $\delta_{ik}=1\{T_{ik}\le C_{ik}\}$.

Suppose that the marginal distributions of the $T_{ik}$ satisfy the accelerated failure time model

$$\log T_{ik}=X_{ik}^T\beta_0+\varepsilon_{ik}\qquad(k=1,\ldots,K_i;\ i=1,\ldots,n),$$

where $\beta_0$ is a $p$-vector and $(\varepsilon_{i1},\ldots,\varepsilon_{iK_i})$, $i=1,\ldots,n$, are independent random vectors. Defining $e_{ik}(\beta)=\log\tilde T_{ik}-X_{ik}^T\beta$, the Gehan-type loss function for $\beta$ is

$$L_G(\beta)=n^{-1}\sum_{i=1}^n\sum_{k=1}^{K_i}\sum_{j=1}^n\sum_{\ell=1}^{K_j}\delta_{ik}\{e_{ik}(\beta)-e_{j\ell}(\beta)\}^-.$$

The $\ell_1$-regularized loss function is

$$n^{-1}\sum_{i=1}^n\sum_{k=1}^{K_i}\sum_{j=1}^n\sum_{\ell=1}^{K_j}\delta_{ik}\{e_{ik}(\beta)-e_{j\ell}(\beta)\}^-+\sum_{j=1}^p\lambda_{nj}|\beta_j|.$$

4.3 Recurrent events data

With a random sample of $n$ subjects, let $T_{ki}$ be the time to the $k$th recurrent event on the $i$th subject, and let $C_i$ and $X_i$ be the censoring time and the $p\times 1$ vector of covariates for the $i$th subject. Assume that $C_i$ is independent of the $T_{ki}$ $(k=1,2,\ldots)$ conditional on $X_i$. Let

$$N_i(t)=\sum_{k=1}^\infty 1(T_{ki}\le t).$$

We specify the following accelerated time model for the mean frequency function:

$$E\{N_i(t)\mid X_i\}=\mu_0\bigl(t\,e^{\beta_0^T X_i}\bigr),$$

where $\beta_0$ is a $p\times 1$ vector and $\mu_0(\cdot)$ is an unspecified baseline mean function. The Gehan-type loss function for $\beta$ is

$$L_G(\beta)=n^{-1}\sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^\infty 1(T_{ki}\le C_i)\,\{\log T_{ki}-\log C_j-\beta^T(X_i-X_j)\}^-.$$

The $\ell_1$-regularized Gehan loss function is

$$n^{-1}\sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^\infty 1(T_{ki}\le C_i)\,\{\log T_{ki}-\log C_j-\beta^T(X_i-X_j)\}^-+\sum_{j=1}^p\lambda_{nj}|\beta_j|.$$

5 Two real examples

In this section, we apply our method to two well-known datasets.

5.1 Primary biliary cirrhosis data

The primary biliary cirrhosis (PBC) data, provided in Therneau and Grambsch (2001), came from the Mayo Clinic trial in primary biliary cirrhosis of the liver conducted between 1974 and 1984. The data contain information about the survival time and 17 prognostic variables for 424 PBC patients who met the eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. We considered the 276 patients with complete information on all 17 variables and used the accelerated failure time model to study the relationship between the survival time and the prognostic variables. In our analysis, the 17 variables are drug, age, sex, ascites, hepatomegaly, spiders, edema, bilirubin, cholesterol, albumin, urine copper, alkaline phosphatase, SGOT, triglycerides, platelets, prothrombin time and histologic stage of disease. Albumin, alkaline phosphatase, bilirubin and prothrombin time were transformed on the natural logarithmic scale. The variables were also standardized to have mean zero and unit variance. The R package quantreg was used in both the data analysis and the simulation studies. Table 1 summarizes the estimated coefficients of the Gehan estimate, the Lasso estimate and the adaptive Lasso estimate, with the tuning parameters chosen by the AIC, BIC or GCV criterion. For this dataset, GCV and AIC yield the same estimates for the same penalty (Lasso or adaptive Lasso), so only the estimates obtained via AIC are reported. The standard errors were computed via 1000 random perturbations, with the $Z_i$ drawn from the standard exponential distribution; they are reported only for coefficients identified as nonzero. To appreciate the relationship between the various estimates and the tuning parameters, we calculated, for each shrinkage estimator, the shrinkage parameter

$$s=\frac{\sum_{j=1}^p|\hat\beta_j|}{\sum_{j=1}^p|\hat\beta_{G,j}|},$$

where $\hat\beta_G$ is the unpenalized Gehan estimator and $\hat\beta$ is either the Lasso ($\hat\beta_L$) or the adaptive Lasso ($\hat\beta_A$) estimator.
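As a small illustration (names are ours), the x-axis of Figs. 1 and 2 is obtained by mapping each fitted coefficient vector along the $\lambda$ grid to its shrinkage fraction:

```r
shrinkage_path <- function(beta_G, fits) {
  ## fits: a list of coefficient vectors, one per lambda on the grid
  sapply(fits, function(b) sum(abs(b)) / sum(abs(beta_G)))
}
```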

Table 1.

Primary biliary cirrhosis data.

β̂G β̂L(AIC) β̂L(BIC) β̂A(AIC) β̂A(BIC)
Drug −0.068(0.058) −0.018(0.039) 0(−) 0(−) 0(−)
Age −0.227(0.060) −0.180(0.064) −0.157(0.062) −0.210(0.066) −0.185(0.062)
Sex 0.108(0.045) 0.073(0.045) 0.042(0.040) 0.060(0.045) 0(−)
Asc −0.097(0.077) −0.115(0.080) −0.113(0.086) −0.083(0.071) 0(−)
Hep 0.053(0.064) 0(−) 0(−) 0(−) 0(−)
Spid −0.136(0.070) −0.099(0.065) −0.073(0.061) −0.101(0.066) 0(−)
Ede −0.245(0.087) −0.241(0.088) −0.247(0.095) −0.278(0.085) −0.336(0.085)
Logbil −0.461(0.073) −0.448(0.069) −0.437(0.067) −0.484(0.070) −0.541(0.072)
Chol 0.046(0.059) 0.020(0.040) 0(−) 0(−) 0(−)
Logalb 0.143(0.072) 0.120(0.071) 0.103(0.072) 0.106(0.072) 0.039(0.056)
Cop −0.001(0.065) 0(−) 0(−) 0(−) 0(−)
Logalk 0.037(0.055) 0(−) 0(−) 0(−) 0(−)
Sgot 0.117(0.056) 0.058(0.048) 0.021(0.039) 0.061(0.050) 0(−)
Trig 0.062(0.059) 0(−) 0(−) 0(−) 0(−)
Plat −0.057(0.065) 0(−) 0(−) 0(−) 0(−)
Logprot −0.067(0.061) −0.025(0.042) 0(−) 0(−) 0(−)
Stage −0.203(0.077) −0.174(0.060) −0.161(0.058) −0.192(0.062) −0.180(0.064)

Estimated coefficients of the Gehan estimate: β̂G, the Lasso estimates: β̂L, and the adaptive Lasso estimates: β̂A

In Figs. 1 and 2, both the coefficients and the criterion function (AIC/BIC/GCV) are plotted against the shrinkage parameter for Lasso and adaptive Lasso estimates respectively.

Fig. 1 The left panel displays the Lasso estimates as a function of s. The right panel shows the AIC, BIC and GCV curves plotted against s

Fig. 2 The left panel displays the adaptive Lasso estimates as a function of s. The right panel shows the AIC, BIC and GCV curves plotted against s

A few observations can be made from Table 1. First, GCV and AIC perform similarly for the Lasso and adaptive Lasso penalties. Second, BIC yields smaller models than AIC and GCV, and the combination of the adaptive Lasso and BIC gives the model with the fewest nonzero coefficients. From Figs. 1 and 2, we see that BIC tends to shrink more than AIC and GCV and hence gives sparser models.

As noted by an anonymous referee, BIC usually has less favorable out-of-sample prediction performance than cross-validation. Here we use a tenfold cross-validation scheme to compare the out-of-sample performance of AIC, BIC, GCV and cross-validation. To be specific, we leave out one tenth of the data and use the remaining data to obtain the Lasso and adaptive Lasso estimates, choosing the tuning parameters via AIC, BIC, GCV or tenfold cross-validation. We then estimate the prediction error by evaluating the Gehan loss function on the left-out tenth. Table 2 shows the results for the PBC data. BIC performs slightly worse than the other criteria, but the loss in prediction accuracy is small.
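A sketch of the scheme follows (our names; `fit_and_tune` is a hypothetical wrapper that fits the penalized estimator on the training folds with the tuning parameter chosen by the given criterion, and `gehan_loss` is the routine from Sect. 1):

```r
cv_prediction_error <- function(logT, delta, X, criterion, K = 10) {
  n <- length(logT)
  fold <- sample(rep(seq_len(K), length.out = n))     # random fold labels
  errs <- sapply(seq_len(K), function(k) {
    tr <- fold != k                                   # train on nine tenths
    b  <- fit_and_tune(logT[tr], delta[tr], X[tr, , drop = FALSE], criterion)
    gehan_loss(b, logT[!tr], delta[!tr], X[!tr, , drop = FALSE])  # score the rest
  })
  sum(errs)                                           # total held-out Gehan loss
}
```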

Table 2.

Predictive performance of AIC, BIC, GCV and cross-validation

Prediction error Lasso Adaptive Lasso
AIC 30.50 31.64
BIC 30.60 31.69
GCV 30.31 31.47
Cross-validation 30.28 31.42

5.2 Framingham heart data

The Framingham Heart Study (Dawber 1980) was initiated in 1948, with 2336 men and 2873 women aged between 30 and 62 years at their baseline examination. Individuals were examined every two years in a 30-year follow-up study. Multiple events, e.g., the times to coronary heart disease (CHD), denoted by $T_1$, and to cerebrovascular accident (CVA), denoted by $T_2$, were observed on the same individual. The data used here included the participants who had an examination at age 44 or 45 and were disease-free at that examination, in the sense that they had no history of hypertension or glucose intolerance and no previous CHD or CVA. The original dataset contains a total of 1571 disease-free individuals. The risk factors of interest were age, $x_1$; body mass index, $x_2$; cholesterol level, $x_3$; systolic blood pressure, $x_4$; cigarette smoking, $x_5$; and gender, $x_6$. Since modeling biases can possibly be reduced by introducing interactions, we consider the marginal bivariate accelerated failure time model using both the main effects and the first-order interactions. The variables were standardized to have zero mean and unit variance. Table 3 summarizes the estimated coefficients of the Gehan estimate, the Lasso estimate and the adaptive Lasso estimate for both CHD and CVA, with the tuning parameters chosen by BIC. It shows that many interactions among the risk factors are selected for both CHD and CVA and, as expected, the adaptive Lasso yields sparser models than the Lasso.

Table 3.

Framingham heart data.

CHD
CVA
β̂G β̂L β̂A β̂G β̂L β̂A
x1 0.035 0.025 0.028 −0.131 −0.052 −0.045
x2 0.047 0.037 0.037 −0.356 −0.091 −0.081
x3 −0.052 −0.045 −0.046 0.071 0.044 0.024
x4 −0.048 −0.041 −0.041 0.098 0.022 0.017
x5 −0.046 −0.038 −0.039 −0.694 −0.160 −0.140
x6 0.023 0.019 0.018 0.012 0 0
x1 * x2 0.003 0 0 −0.039 −0.028 −0.007
x1 * x3 −0.016 −0.013 −0.013 −0.013 0 0
x1 * x4 −0.033 −0.025 −0.024 −0.006 0.007 0
x1 * x5 −0.082 −0.071 −0.074 0.273 0.080 0.116
x1 * x6 0.015 0.011 0.007 −0.044 −0.033 −0.005
x2 * x3 0.037 0.029 0.024 −0.041 0 0
x2 * x4 0.008 0.008 0 0.050 0.006 0
x2 * x5 −0.006 0 0 0.415 0.066 0.097
x2 * x6 −0.026 −0.023 −0.021 0.004 −0.030 0
x3 * x4 0.022 0.015 0.010 0.100 0.060 0.029
x3 * x5 −0.010 −0.011 −0.005 −0.217 −0.118 −0.083
x3 * x6 −0.003 0 0 −0.019 0.010 0
x4 * x5 −0.004 −0.002 0 −0.249 −0.104 −0.096
x4 * x6 0.012 0.006 0 0.073 0.043 0.021
x5 * x6 0.041 0.032 0.031 −0.006 0 0

Estimated coefficients of the Gehan estimate: β̂G, the Lasso estimates: β̂L, and the adaptive Lasso estimates: β̂A

6 Simulations

We conducted extensive simulation studies for univariate and multivariate failure time data in this section. All simulations used the R package quantreg.

6.1 Univariate failure time data

We simulated datasets consisting of 100 observations from the accelerated failure time model

$$\log T_i=\beta^T X_i+\varepsilon_i,$$

where $\beta=(3,1.5,0,0,2,0,0,0)^T$, the covariates $x_i$ were marginally standard normal and the correlation between $x_i$ and $x_j$ was $\rho^{|i-j|}$ with $\rho=0.5$. The censoring times were generated from the $\mathrm{Unif}(0,\tau)$ distribution, where $\tau$ was chosen to be 142 or 50, yielding a censoring level of 30% or 50% respectively. The distribution of $\varepsilon$ was set to $N(0,1)$, $t_3$ or $0.5N(0,1)+0.5N(0,9)$ to assess the robustness of the proposed method. For an estimator $\hat\beta$, performance is gauged by the model error, defined as

$$E(\hat\beta-\beta_0)^T X^T X(\hat\beta-\beta_0).$$

The ideal oracle estimator, which knows the true nonzero coefficients but not their exact values, applies the rank-based estimation procedure using only the covariates $x_1$, $x_2$ and $x_5$. The Lasso and adaptive Lasso were used to penalize the Gehan loss function, and the tuning parameters were chosen by AIC, BIC or GCV as defined in Sect. 3. For each method, the relative model error compared to that of the oracle estimator was computed.
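A sketch of the data-generating scheme just described (function and argument names are ours):

```r
simulate_aft <- function(n = 100, beta = c(3, 1.5, 0, 0, 2, 0, 0, 0),
                         rho = 0.5, tau_c = 142, rerr = rnorm) {
  p <- length(beta)
  ## AR(1)-type design: cor(x_i, x_j) = rho^|i - j|, standard normal margins
  Rt <- chol(rho^abs(outer(seq_len(p), seq_len(p), "-")))
  X  <- matrix(rnorm(n * p), n, p) %*% Rt
  logT <- as.vector(X %*% beta) + rerr(n)   # accelerated failure time model
  C    <- runif(n, 0, tau_c)                # Unif(0, tau) censoring times
  list(logT = pmin(logT, log(C)),           # observed log time
       delta = as.numeric(logT <= log(C)),  # censoring indicator
       X = X)
}
```

The $t_3$ and mixture errors correspond to `rerr = function(n) rt(n, 3)` and `rerr = function(n) rnorm(n, 0, sample(c(1, 3), n, replace = TRUE))` respectively.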

In Table 4, the median relative model errors (MRME) based on 1000 simulated datasets, as well as the average numbers of correctly selected (C) and incorrectly selected (IC) variables, are summarized for the three error distributions. The adaptive Lasso with BIC yields sparser models and more accurate estimates. Furthermore, the adaptive Lasso outperforms the Lasso in terms of both variable selection and model error, and the proposed estimator is robust to the heavy-tailed $t_3$ distribution and the contaminated normal distribution $0.5N(0,1)+0.5N(0,9)$.

Table 4.

Simulation results for the univariate failure time data. Mix denotes the mixture error distribution 0.5N (0,1) + 0.5N (0,9). Sample size is 100

Error Censoring Method MRME C IC
N (0,1) 30% Lasso(AIC) 2.32(2.45) 3(0) 2.43(1.38)
Lasso(BIC) 2.37(2.75) 3(0) 1.46(1.15)
Lasso(GCV) 2.28(2.80) 3(0) 1.72(1.26)
ALasso(AIC) 1.48(1.55) 3(0) 1.00(1.12)
ALasso(BIC) 1.13(0.77) 3(0) 0.27(0.57)
ALasso(GCV) 1.27(0.97) 3(0) 0.61(0.86)
50% Lasso(AIC) 2.27(2.53) 3(0) 2.62(1.39)
Lasso(BIC) 2.61(2.46) 3(0) 1.63(1.25)
Lasso(GCV) 2.41(2.37) 3(0) 1.84(1.35)
ALasso(AIC) 1.46(1.58) 3(0) 1.04(1.21)
ALasso(BIC) 1.16(0.84) 3(0) 0.24(0.51)
ALasso(GCV) 1.22(1.11) 3(0) 0.63(0.91)

t3 30% Lasso(AIC) 2.41(2.07) 3(0) 2.39(1.43)
Lasso(BIC) 2.55(2.45) 3(0) 1.39(1.05)
Lasso(GCV) 2.60(2.37) 3(0) 1.46(1.08)
ALasso(AIC) 1.53(1.39) 3(0) 0.98(1.07)
ALasso(BIC) 1.30(0.93) 3(0) 0.28(0.65)
ALasso(GCV) 1.24(0.79) 3(0) 0.35(0.67)
50% Lasso(AIC) 2.61(2.27) 3(0) 2.40(1.38)
Lasso(BIC) 2.78(2.47) 3(0) 1.56(1.06)
Lasso(GCV) 2.68(2.31) 3(0) 1.58(1.04)
ALasso(AIC) 1.75(1.70) 3(0) 1.02(1.06)
ALasso(BIC) 1.35(1.04) 3(0) 0.27(0.58)
ALasso(GCV) 1.38(1.08) 3(0) 0.41(0.73)

Mix 30% Lasso(AIC) 2.30(2.70) 3(0) 2.30(1.37)
Lasso(BIC) 2.37(2.57) 3(0) 1.52(1.26)
Lasso(GCV) 2.38(2.75) 3(0) 1.52(1.28)
ALasso(AIC) 1.64(1.71) 3(0) 1.06(1.15)
ALasso(BIC) 1.46(1.47) 3(0) 0.47(0.82)
ALasso(GCV) 1.50(1.47) 3(0) 0.56(0.88)
50% Lasso(AIC) 2.15(2.61) 3(0) 2.34(1.30)
Lasso(BIC) 2.16(2.99) 3(0) 1.55(1.21)
Lasso(GCV) 2.38(2.95) 3(0) 1.68(1.27)
ALasso(AIC) 1.82(1.77) 3(0) 1.25(1.20)
ALasso(BIC) 1.44(1.56) 3(0) 0.49(0.80)
ALasso(GCV) 1.59(1.68) 3(0) 0.65(0.88)

To investigate the performance of the adaptive Lasso and the Lasso when the signal-to-noise ratio is small, following Leeb and Pötscher (2008) we multiply the regression coefficient vector by $1/\sqrt n$. With sample size 100 and 30% censoring, the results are summarized in Table 5. The adaptive Lasso performs very poorly in this regime, even somewhat worse than the Lasso.

Table 5.

Simulation results when the signal-to-noise ratio is small

Method MRME C IC
Lasso(AIC) 1.92(3.71) 2.26(0.76) 1.79(1.41)
Lasso(BIC) 2.43(4.15) 1.44(0.94) 0.58(0.98)
Lasso(GCV) 2.03(3.58) 1.82(0.87) 0.82(1.03)
ALasso(AIC) 2.27(3.30) 2.01(0.69) 1.32(1.18)
ALasso(BIC) 2.44(3.78) 1.46(0.77) 0.52(0.80)
ALasso(GCV) 2.28(3.50) 1.71(0.73) 0.66(0.84)

6.2 Multivariate failure time data

For multiple events and clustered data, the two failure times $T_1$ and $T_2$ were generated from Gumbel's (1960) bivariate distribution:

$$F(t_1,t_2)=F_1(t_1)F_2(t_2)\bigl[1+\theta\{1-F_1(t_1)\}\{1-F_2(t_2)\}\bigr],$$

where $-1\le\theta\le 1$. The correlation between $T_1$ and $T_2$ is $\theta/4$. The two marginal distributions $F_k(t_k)$, $k=1,2$, were exponential with hazard rate $\lambda_k=e^{\beta_k^T X_k}$. We simulated 100 datasets, each consisting of 100 observations from the model, where $X_k=(x_{1k},\ldots,x_{pk})$, $p=8$, the $x_{ik}$ were marginally standard normal and the correlation between $x_{ik}$ and $x_{jk}$ was $\rho^{|i-j|}$ with $\rho=0.5$. The censoring times were generated from the $\mathrm{Unif}(0,\tau)$ distribution with $\tau=1.5$, yielding a censoring level of 50%. For multiple events, we set $\beta_{10}=(3,1.5,0,0,2,0,0,0)^T$, $\beta_{20}=(0,0,2,0,0,1.5,3,0)^T$ and $X_k=X$. For clustered data, $X_1$ and $X_2$ were generated independently and $\beta_{k0}=\beta_0=(3,1.5,0,0,2,0,0,0)^T$. For recurrent events, we set $\beta_0=(3,1.5,0,0,2,0,0,0)^T$ and the covariates were generated in the same manner as for multiple events. The gap times between successive events were generated from the aforementioned Gumbel bivariate exponential distribution. The resulting recurrent event process is Poisson under $\theta=0$ and non-Poisson under $\theta\ne 0$. The follow-up time was an independent $\mathrm{Unif}(0,2.5)$ random variable, which on average yielded approximately 2.60 and 2.86 events per subject for the Poisson and non-Poisson cases respectively.
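Gumbel's (1960) distribution above is the Farlie–Gumbel–Morgenstern copula with exponential margins, and it can be sampled by conditional inversion: given $U_1=u_1$, the conditional distribution function of $U_2$ is quadratic and can be inverted in closed form. A sketch (our code, not the authors'):

```r
rgumbel_biexp <- function(n, theta, rate1, rate2) {
  u1 <- runif(n); v <- runif(n)
  A  <- theta * (1 - 2 * u1)            # coefficient in P(U2 <= u | U1 = u1)
  u2 <- ifelse(abs(A) < 1e-10, v,       # A ~ 0: conditional independence
               ((1 + A) - sqrt((1 + A)^2 - 4 * A * v)) / (2 * A))
  cbind(T1 = qexp(u1, rate = rate1),    # exponential margins F_1, F_2
        T2 = qexp(u2, rate = rate2))
}
```

As a quick check, the off-diagonal of `cor(rgumbel_biexp(1e5, theta, 1, 1))` should be close to $\theta/4$.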

For multiple events, the performance of the estimator is gauged by the model error which is defined as

$$\sum_{k=1}^2 E(\hat\beta_k-\beta_{0k})^T X^T X(\hat\beta_k-\beta_{0k}).$$

The ideal oracle estimator, which knows the true nonzero coefficients but not their exact values, applies the rank-based estimation procedure using only the covariates $x_1$, $x_2$ and $x_5$ for $T_1$, and $x_3$, $x_6$ and $x_7$ for $T_2$. For clustered data, the performance of the estimator is gauged by the model error

$$\sum_{k=1}^2 E(\hat\beta-\beta_0)^T X_k^T X_k(\hat\beta-\beta_0).$$

For recurrent events, the performance of the estimator is gauged by the model error

$$E(\hat\beta-\beta_0)^T X^T X(\hat\beta-\beta_0).$$

For clustered data and recurrent events, the ideal oracle estimator just considers variables x1, x2 and x5.

The Lasso and adaptive Lasso were used to penalize the Gehan loss function, and the tuning parameters were chosen by AIC, BIC or GCV as defined in Sect. 3. For each method, the relative model error compared to that of the oracle estimator was computed.

In Table 6, the median relative model errors (MRME) based on 100 simulated datasets, as well as the average numbers of correctly selected (C) and incorrectly selected (IC) variables, are summarized for multiple events, clustered and recurrent events data respectively. We see again that the adaptive Lasso with BIC performs best.

Table 6.

Simulation results for the multivariate failure time data. Sample size is 100. Censoring level is 50%

θ Method MRME C IC
Multiple events 0 Lasso(AIC) 2.36(1.58) 6(0) 5.83(1.86)
Lasso(BIC) 2.79(1.65) 6(0) 3.92(1.74)
Lasso(GCV) 2.43(1.61) 6(0) 5.29(2.08)
ALasso(AIC) 1.77(1.38) 6(0) 2.67(1.83)
ALasso(BIC) 1.41(0.67) 6(0) 0.64(0.89)
ALasso(GCV) 1.68(1.18) 6(0) 2.45(1.82)

1 Lasso(AIC) 2.42(1.78) 6(0) 5.63(1.86)
Lasso(BIC) 2.69(1.95) 6(0) 3.89(1.75)
Lasso(GCV) 2.44(1.79) 6(0) 5.32(2.02)
ALasso(AIC) 1.81(1.44) 6(0) 2.68(1.80)
ALasso(BIC) 1.41(0.70) 6(0) 0.68(0.89)
ALasso(GCV) 1.74(1.55) 6(0) 2.46(1.90)

Clustered data 0 Lasso(AIC) 2.23(2.42) 3(0) 2.57(1.36)
Lasso(BIC) 2.56(2.53) 3(0) 1.55(1.17)
Lasso(GCV) 2.36(2.32) 3(0) 1.79(1.30)
ALasso(AIC) 1.28(1.45) 3(0) 0.94(1.27)
ALasso(BIC) 1.04(0.86) 3(0) 0.21(0.40)
ALasso(GCV) 1.08(1.10) 3(0) 0.54(0.84)

1 Lasso(AIC) 2.33(2.02) 3(0) 2.40(1.32)
Lasso(BIC) 2.70(3.70) 3(0) 1.24(1.10)
Lasso(GCV) 2.54(2.46) 3(0) 1.68(1.14)
ALasso(AIC) 1.12(1.36) 3(0) 0.72(0.96)
ALasso(BIC) 1.00(0.53) 3(0) 0.11(0.25)
ALasso(GCV) 1.03(0.80) 3(0) 0.32(0.78)

Recurrent events 0 Lasso(AIC) 2.63(2.29) 3(0) 2.42(1.39)
Lasso(BIC) 2.76(2.48) 3(0) 1.58(1.04)
Lasso(GCV) 2.64(2.32) 3(0) 1.61(1.06)
ALasso(AIC) 1.74(1.68) 3(0) 1.04(1.09)
ALasso(BIC) 1.38(1.02) 3(0) 0.29(0.56)
ALasso(GCV) 1.42(1.10) 3(0) 0.43(0.76)

1 Lasso(AIC) 2.36(2.54) 3(0) 2.76(1.34)
Lasso(BIC) 2.80(3.90) 3(0) 1.26(1.08)
Lasso(GCV) 2.50(2.76) 3(0) 1.78(1.36)
ALasso(AIC) 1.48(1.60) 3(0) 1.16(1.26)
ALasso(BIC) 1.26(0.86) 3(0) 0.18(0.42)
ALasso(GCV) 1.32(1.26) 3(0) 0.50(0.69)

7 Discussion

We propose in this paper an $\ell_1$-regularized procedure for variable selection in the accelerated failure time model. The resulting estimates possess the oracle properties when the adaptive Lasso penalty is used for penalization and BIC is used for tuning parameter selection. Additionally, we extend the $\ell_1$-regularized procedure to multivariate failure time models, including multiple events data, clustered survival data and recurrent events data. Extensive simulation studies and two real data analyses illustrate the usefulness of our approach for both variable selection and coefficient estimation. Although we only considered the Gehan-statistic-based loss function, it is rather straightforward to extend our approach to the other weighting schemes discussed in Jin et al. (2003). Our current implementation via quantreg uses a grid of tuning parameter values; alternatively, one could implement the path-following algorithm detailed in Li and Zhu (2008), which, as noted by an anonymous referee, has recently been investigated in Cai et al. (2009). We modeled multivariate survival data via the marginal approach; it would be interesting to investigate how to conduct variable selection while accounting for the correlations among multiple failure times.

Acknowledgments

This research was supported by grants from the U.S. National Institutes of Health, the U.S. National Science Foundation and the National University of Singapore (R-155-000-075-112 and R-155-000-080-112). We are very grateful to the editor, the associate editor and the referees for their helpful comments which have greatly improved the paper.

Appendix

Proofs of Theorems 1 and 2

The proof of Theorem 1 proceeds by first establishing a local quadratic property of the Gehan loss function (Proposition 1) and an inequality (Proposition 2) relating the minimizer of the penalized loss function to the minimizer of a penalized quadratic function. The $\sqrt n$-consistency of the penalized estimator and the oracle properties then follow by arguments similar to those of Fan and Li (2001).

Define

$$A_G=\lim_{n\to\infty}n^{-1}\sum_{i=1}^n\int S^{(0)}(\beta_0;t)\{X_i-\bar X(\beta_0;t)\}^{\otimes 2}\{\dot\lambda(t)/\lambda(t)\}\,dN_i(\beta_0;t),$$
$$B_G=\lim_{n\to\infty}n^{-1}\sum_{i=1}^n\int\bigl[S^{(0)}(\beta_0;t)\bigr]^2\{X_i-\bar X(\beta_0;t)\}^{\otimes 2}\,dN_i(\beta_0;t),$$
where $\lambda(\cdot)$ denotes the common hazard function of the error terms.

Proposition 1

Under conditions 1–4 of Ying (1993, p. 80), for every sequence dn > 0 with dn → 0 a.s., we have

$$n^{-1}L_G(\beta)-n^{-1}L_G(\beta_0)=n^{-1}U_G^T(\beta_0)(\beta-\beta_0)+(\beta-\beta_0)^T A_G(\beta-\beta_0)/2+o(\|\beta-\beta_0\|^2+n^{-1}) \qquad (4)$$

holds uniformly inββ0 ∥ ≤ dn.

Proof

It follows from Theorem 2 of Ying (1993) that almost surely, uniformly in ∥ββ0∥ ≤ dn, we have

$$U_G(\beta)=U_G(\beta_0)+nA_G(\beta-\beta_0)+o(n^{1/2}+n\|\beta-\beta_0\|). \qquad (5)$$

Write $U_G=(U_{G1},\ldots,U_{Gp})^T$, $A_G=(a_{ij})$, $1\le i,j\le p$, $\beta=(\beta_1,\ldots,\beta_p)^T$ and $\beta_0=(\beta_{01},\ldots,\beta_{0p})^T$, and define the interpolating vectors $\beta^j=(\beta_1,\ldots,\beta_{p-j},\beta_{0(p-j+1)},\ldots,\beta_{0p})^T$, $1\le j\le p$, with $\beta^0=\beta$. Noticing that $\beta^p=\beta_0$, we have

$$L_G(\beta)-L_G(\beta_0)=L_G(\beta^0)-L_G(\beta^p)=\sum_{k=1}^p\bigl[L_G(\beta^{k-1})-L_G(\beta^k)\bigr]=\sum_{k=1}^p\int_{\beta_{0(p-k+1)}}^{\beta_{p-k+1}}U_{G(p-k+1)}(\beta_1,\ldots,\beta_{p-k},s_{p-k+1},\beta_{0(p-k+2)},\ldots,\beta_{0p})\,ds_{p-k+1}.$$

By (5),

$$U_{G(p-k+1)}(\beta_1,\ldots,\beta_{p-k},s_{p-k+1},\beta_{0(p-k+2)},\ldots,\beta_{0p})=U_{G(p-k+1)}(\beta_0)+n\sum_{l=1}^{p-k}a_{(p-k+1)l}(\beta_l-\beta_{0l})+n\,a_{(p-k+1)(p-k+1)}(s_{p-k+1}-\beta_{0(p-k+1)})+o(n^{1/2}+n\|\beta-\beta_0\|);$$

then

$$\begin{aligned}L_G(\beta)-L_G(\beta_0)&=\sum_{k=1}^p\Bigl[U_{G(p-k+1)}(\beta_0)(\beta_{p-k+1}-\beta_{0(p-k+1)})+\frac n2\,a_{(p-k+1)(p-k+1)}(\beta_{p-k+1}-\beta_{0(p-k+1)})^2\Bigr]\\&\quad+n\sum_{k=1}^p\sum_{l=1}^{p-k}a_{(p-k+1)l}(\beta_l-\beta_{0l})(\beta_{p-k+1}-\beta_{0(p-k+1)})+o(n^{1/2}\|\beta-\beta_0\|+n\|\beta-\beta_0\|^2)\\&=U_G^T(\beta_0)(\beta-\beta_0)+n(\beta-\beta_0)^T A_G(\beta-\beta_0)/2+o(n^{1/2}\|\beta-\beta_0\|+n\|\beta-\beta_0\|^2).\end{aligned}$$

Hence

$$\begin{aligned}n^{-1}L_G(\beta)-n^{-1}L_G(\beta_0)&=n^{-1}U_G^T(\beta_0)(\beta-\beta_0)+(\beta-\beta_0)^T A_G(\beta-\beta_0)/2+o(n^{-1/2}\|\beta-\beta_0\|+\|\beta-\beta_0\|^2)\\&=n^{-1}U_G^T(\beta_0)(\beta-\beta_0)+(\beta-\beta_0)^T A_G(\beta-\beta_0)/2+o(\|\beta-\beta_0\|^2+n^{-1}).\end{aligned}$$

This completes the proof.

Consider the objective function
$$C(u)=u^T Du/2-a^T u+\sum_{j=1}^s\lambda_j u_j+\sum_{j=s+1}^p\lambda_j|u_j|,$$
where $u\in\mathbb R^p$, $D$ is a positive definite matrix, $\lambda_1,\ldots,\lambda_s$ are constants and $\lambda_{s+1},\ldots,\lambda_p$ are nonnegative constants. Suppose that $\hat u$ is a minimizer of $C(u)$; we have the following proposition.

Proposition 2

For any $u$, we have $C(u)-C(\hat u)\ge(u-\hat u)^T D(u-\hat u)/2$.

Suppose that ûn is a minimizer of the objective function

$$B_n(u)=n^{-1/2}U_G^T(\beta_0)u+u^T A_G u/2+\sum_{i=1}^s\sqrt n\,\lambda_{ni}\,\mathrm{sgn}(\beta_{0i})u_i+\sum_{i=s+1}^p\sqrt n\,\lambda_{ni}|u_i|. \qquad (6)$$

By Propositions 1 and 2, it can be shown that $n^{1/2}(\hat\beta-\beta_0)$ and $\hat u_n$ have the same asymptotic distribution. The $\sqrt n$-consistency of the penalized estimator and the oracle properties then follow from the same arguments as in Fan and Li (2001).

Theorem 2 can be established similarly by examining the perturbed penalized loss function and applying conditional arguments as in Jin et al. (2003); the details are thus omitted.

Contributor Information

Jinfeng Xu, Email: staxj@nus.edu.sg, Department of Statistics and Applied Probability, Risk Management Institute, National University of Singapore, 117546 Singapore, Singapore.

Chenlei Leng, Email: stalc@nus.edu.sg, Department of Statistics and Applied Probability, Risk Management Institute, National University of Singapore, 117546 Singapore, Singapore.

Zhiliang Ying, Email: zying@stat.columbia.edu, Department of Statistics, Columbia University, New York, NY 10027, USA.

References

  1. Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics. 2009. doi:10.1111/j.1541-0420.2008.01074.x, to appear.
  2. Cox DR. Regression models and life-tables (with discussion). J R Stat Soc B. 1972;34:187–220.
  3. Dawber TR. The Framingham Study: The Epidemiology of Atherosclerotic Disease. Harvard University Press; Cambridge: 1980.
  4. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–1360.
  5. Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Stat. 2002;30:74–99.
  6. Gehan EA. A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika. 1965;52:203–223.
  7. Gumbel EJ. Bivariate exponential distributions. J Am Stat Assoc. 1960;55:698–707.
  8. Jin Z, Ying Z, Wei LJ. A simple resampling method by perturbing the minimand. Biometrika. 2001;88:381–390.
  9. Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika. 2003;90:341–353.
  10. Jin Z, Lin DY, Ying Z. Rank regression analysis of multivariate failure time data based on marginal linear models. Scand J Stat. 2006;33:1–23.
  11. Johnson BA. Variable selection in semiparametric linear regression with censored data. J R Stat Soc B. 2008;70:351–370.
  12. Johnson BA, Peng LM. Rank-based variable selection. J Nonparametr Stat. 2008;20:241–252.
  13. Johnson BA, Lin DY, Zeng D. Penalized estimating functions and variable selection in semiparametric regression models. J Am Stat Assoc. 2008;103:672–680. doi:10.1198/016214508000000184.
  14. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. Wiley; New York: 2002.
  15. Koenker R, D’Orey V. Computing regression quantiles. Appl Stat. 1987;36:383–393.
  16. Leeb H, Pötscher BM. Sparse estimators and the oracle property, or the return of Hodges’ estimator. J Econom. 2008;142:201–211.
  17. Li Y, Zhu J. L1-norm quantile regression. J Comput Graph Stat. 2008;17:163–185.
  18. Lu W, Zhang HH. Variable selection for proportional odds model. Stat Med. 2007;26:3771–3781. doi:10.1002/sim.2833.
  19. Parzen MI, Wei LJ, Ying Z. A resampling method based on pivotal estimating functions. Biometrika. 1994;81:341–350.
  20. Rao CR, Zhao LC. Approximation to the distribution of M-estimates in linear models by randomly weighted bootstrap. Sankhyā A. 1992;54:323–331.
  21. Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. Springer; New York: 2001.
  22. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc B. 1996;58:267–288.
  23. Tibshirani R. The Lasso method for variable selection in the Cox model. Stat Med. 1997;16:385–395.
  24. Wang H, Leng C. Unified Lasso estimation by least squares approximation. J Am Stat Assoc. 2007;102:1039–1048.
  25. Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the LAD-Lasso. J Bus Econ Stat. 2007a;25:347–355.
  26. Wang H, Li G, Tsai CL. Regression coefficient and autoregressive order shrinkage and selection via the Lasso. J R Stat Soc B. 2007b;69:63–78.
  27. Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007c;94:553–568. doi:10.1093/biomet/asm053.
  28. Wei LJ, Ying Z, Lin DY. Linear regression analysis of censored survival data based on rank tests. Biometrika. 1990;77:845–851.
  29. Ying Z. A large sample study of rank estimation for censored regression data. Ann Stat. 1993;21:76–99.
  30. Zhang HH, Lu W. Adaptive Lasso for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
  31. Zou H. The adaptive Lasso and its oracle properties. J Am Stat Assoc. 2006;101:1418–1429.
  32. Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95:241–247.
