Author manuscript; available in PMC 2013 Sep 4.
Published in final edited form as: Stat Comput. 2010 Apr 1;20(2):165–176. doi:10.1007/s11222-009-9126-y

Rank-based variable selection with censored data

Jinfeng Xu 1, Chenlei Leng 2, Zhiliang Ying 3
PMCID: PMC3762511  NIHMSID: NIHMS491952  PMID: 24013588

Abstract

A rank-based variable selection procedure is developed for the semiparametric accelerated failure time model with censored observations where the penalized likelihood (partial likelihood) method is not directly applicable.

The new method penalizes the rank-based Gehan-type loss function with the $\ell_1$ penalty. To choose the tuning parameters correctly, a novel likelihood-based $\chi^2$-type criterion is proposed. Desirable properties of the estimator, such as the oracle properties, are established through a local quadratic expansion of the Gehan loss function.

In particular, our method can be easily implemented with standard linear programming packages and hence is numerically convenient. Extensions to marginal models for multivariate failure time data are also considered. The performance of the new procedure is assessed through extensive simulation studies and illustrated with two real examples.

Keywords: Accelerated failure time model, Adaptive Lasso, BIC, Gehan-type loss function, Lasso, Variable selection

1 Introduction

As an attractive alternative to the proportional hazards model (Cox 1972), the accelerated failure time model directly specifies a regression model for the log-transformed survival time on a set of covariates,

$$\log T_i=\beta_0^T X_i+\varepsilon_i,\qquad i=1,\ldots,n,$$

where Ti is the survival time, Xi is a p-dimensional covariate, β0 is a p-vector of unknown regression parameters and εi (i = 1, …, n) are independent error terms with a common, but completely unspecified, distribution. Because of this direct relationship between survival time and covariates, the accelerated failure time model is physically interpretable and in many ways more appealing than the proportional hazards model (Kalbfleisch and Prentice 2002, Chap. 7).

A common phenomenon in survival analysis is that data are subject to right censoring. Due to the censoring, the observed data are $(\tilde T_i, \Delta_i, X_i)$, $i=1,\ldots,n$, where $C_i$ is the censoring time, $\tilde T_i = T_i\wedge C_i$ is the observed time and $\Delta_i = 1\{T_i\le C_i\}$ is the censoring indicator. Define $e_i(\beta)=\log\tilde T_i-\beta^T X_i$, $N_i(\beta;t)=\Delta_i\,1\{e_i(\beta)\le t\}$ and $Y_i(\beta;t)=1\{e_i(\beta)\ge t\}$. Note that $N_i$ and $Y_i$ are the counting process and at-risk process on the time scale of the residual. Write

$$S^{(0)}(\beta;t)=n^{-1}\sum_{i=1}^n Y_i(\beta;t),\qquad S^{(1)}(\beta;t)=n^{-1}\sum_{i=1}^n Y_i(\beta;t)X_i.$$

The weighted log-rank estimating function for β0 takes the form

$$U_\phi(\beta)=\sum_{i=1}^n\Delta_i\,\phi\{\beta;e_i(\beta)\}\bigl[X_i-\bar X\{\beta;e_i(\beta)\}\bigr],\quad\text{or}\quad U_\phi(\beta)=\sum_{i=1}^n\int\phi(\beta;t)\{X_i-\bar X(\beta;t)\}\,dN_i(\beta;t),$$

where $\bar X(\beta;t)=S^{(1)}(\beta;t)/S^{(0)}(\beta;t)$ and $\phi$ is a possibly data-dependent weight function; the choice $\phi=S^{(0)}$ leads to the Gehan statistic. In this case, $U_\phi$ can be written as

$$U_G(\beta)=\sum_{i=1}^n\Delta_i\,S^{(0)}\{\beta;e_i(\beta)\}\bigl[X_i-\bar X\{\beta;e_i(\beta)\}\bigr],$$

or

$$U_G(\beta)=n^{-1}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\,(X_i-X_j)\,1\{e_i(\beta)\le e_j(\beta)\},$$

which is the gradient of the convex function

$$L_G(\beta):=n^{-1}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\{e_i(\beta)-e_j(\beta)\}^-,$$

where $a^-=|a|\,1\{a<0\}$. Based on this fact, Jin et al. (2003) provided simple and reliable methods for obtaining the rank estimators. They showed that the rank estimator with the Gehan-type (Gehan 1965) weight function can be readily obtained by minimizing a convex objective function through a standard linear programming technique.
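To make the pairwise forms concrete, the following is a minimal R sketch (our illustration, not code from the paper) of $L_G$ and its gradient $U_G$; `logT` holds the observed $\log\tilde T_i$, `delta` the censoring indicators and `X` the covariate matrix.

```r
gehan_loss <- function(beta, logT, delta, X) {
  e <- logT - as.vector(X %*% beta)      # residuals e_i(beta)
  d <- outer(e, e, "-")                  # d[i, j] = e_i(beta) - e_j(beta)
  sum(delta * pmax(-d, 0)) / length(e)   # n^{-1} sum_{i,j} Delta_i {e_i - e_j}^-
}

gehan_grad <- function(beta, logT, delta, X) {
  e <- logT - as.vector(X %*% beta)
  U <- numeric(ncol(X))
  for (i in which(delta == 1)) {         # only uncensored i contribute
    R <- e >= e[i]                       # risk set {j : e_j(beta) >= e_i(beta)}
    U <- U + sum(R) * X[i, ] - colSums(X[R, , drop = FALSE])
  }
  U / length(e)                          # n^{-1} sum_{i,j} Delta_i (X_i - X_j) 1{e_i <= e_j}
}
```

Since $L_G$ is convex and piecewise linear, any convex solver could in principle be used, but the linear programming route described in Sect. 3 solves the minimization exactly.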

An important objective of survival analysis is to identify a subset of significant variables from a large number of covariates, which are often collected to reduce possible modeling bias. Variable selection is fundamental to statistical modeling, and recently a number of approaches based on the penalized partial likelihood have been applied to the Cox proportional hazards model and gained increasing popularity; see, for example, the Lasso (Tibshirani 1996, 1997), SCAD (Fan and Li 2001, 2002), the adaptive Lasso (Zou 2006, 2008; Zhang and Lu 2007; Lu and Zhang 2007; Wang et al. 2007b), and LSA (Wang and Leng 2007). Wang et al. (2007a) discussed penalized least absolute deviation estimation without considering censoring. In the accelerated failure time model above, we are particularly interested in estimating the unknown vector $\beta_0$ and identifying its nonzero components. Zhang and Lu (2007) proposed the penalized partial likelihood method for variable selection in the Cox model. In the accelerated failure time model, however, the partial likelihood is not available, and semiparametric estimation of the regression coefficient vector relies on rank-based methods. The estimates are usually obtained by minimizing a non-smooth objective function or by solving estimating equations which are step functions with potentially multiple roots (Jin et al. 2003). Johnson (2008) and Johnson et al. (2008) proposed a general variable selection procedure that penalizes estimating equations, with broad applications including in particular the censored accelerated failure time model. For uncensored data, Johnson and Peng (2008) developed a rank-based variable selection procedure in the linear model and established desirable properties such as robustness and the oracle properties. In this article, we propose the $\ell_1$-regularized Gehan estimator for simultaneous estimation and variable selection, which yields advantages on two fronts. First, the shrinkage property of the $\ell_1$ penalty and a proper choice of tuning parameters build sparse models without sacrificing accuracy. Second, because both components of the criterion function are of $\ell_1$ type, the minimization reduces numerically to a linear programming problem, making the resulting methodology extremely easy to implement.

The rest of the paper is organized as follows. Section 2 introduces the $\ell_1$-regularized Gehan estimator and gives its asymptotic properties. A novel $\chi^2$-type criterion for choosing the tuning parameters is proposed in Sect. 3. Extensions to multivariate failure time data are considered in Sect. 4. The proposed methodology is illustrated with applications to two datasets in Sect. 5. In Sect. 6, simulations are conducted to assess the finite-sample performance of the proposed methods. Section 7 concludes with a general discussion. All proofs are relegated to the Appendix.

2 The $\ell_1$-regularized Gehan estimator

Define the $\ell_1$-regularized Gehan loss function as

$$L_P(\beta):=n^{-1}L_G(\beta)+\sum_{j=1}^p\lambda_{nj}|\beta_j|=n^{-2}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\{e_i(\beta)-e_j(\beta)\}^-+\sum_{j=1}^p\lambda_{nj}|\beta_j|, \qquad (1)$$

where $\lambda_{nj}$, $j=1,\ldots,p$, are tuning parameters and $\beta_j$ is the $j$th component of the vector $\beta$. The regularized estimate of $\beta_0$ is a minimizer of $L_P(\beta)$, denoted by $\hat\beta$.

Let $\beta_0=\{(\beta_{01})^T,(\beta_{02})^T\}^T$, where $\beta_{01}$ is an $s$-vector and $\beta_{02}$ is a $(p-s)$-vector. Without loss of generality, assume that $\beta_{02}$ is the zero vector and that all components of $\beta_{01}$ are nonzero. Suppose further that the $\{\lambda_{nj}\}$ satisfy the following condition:

$$(\mathrm C)\qquad \sqrt n\,\lambda_{nj}\to 0\ \text{ for }1\le j\le s\qquad\text{and}\qquad \sqrt n\,\lambda_{nj}\to\infty\ \text{ for }s+1\le j\le p.$$

This condition states that the penalties applied to the zero entries of $\beta$ dominate those on the nonzero entries. Intuitively, the requirement that $\sqrt n\,\lambda_{nj}\to 0$ for $1\le j\le s$ enables the resulting estimates of $\beta_{01}$ to be $\sqrt n$-consistent, and the condition that $\sqrt n\,\lambda_{nj}\to\infty$ for $s+1\le j\le p$ shrinks the estimates of $\beta_{02}$ to zero. This observation is made rigorous by the following theorem.

Theorem 1

(Oracle properties) Under conditions 1–4 of Ying (1993) and condition (C), with probability tending to one, the penalized estimator $\hat\beta=\{(\hat\beta_1)^T,(\hat\beta_2)^T\}^T$ has the following properties:

  1. β̂2 = 0;

  2. $n^{1/2}(\hat\beta_1-\beta_{01})\to N(0,V)$, where $V=A_{G1}^{-1}B_{G1}A_{G1}^{-1}$, and $A_{G1}$ and $B_{G1}$ are defined in the Appendix.

Remark 1

Theorem 1 states that the proposed estimator estimates the coefficients of the important variables as if the true model were known in advance, which is referred to as the oracle property by Fan and Li (2001). However, as noted by an anonymous referee, the model selection consistency of the adaptive Lasso is obtained through large sample theory; for finite sample sizes and small signal-to-noise ratios, it can perform rather poorly, sometimes even worse than the ordinary Lasso. Furthermore, Leeb and Pötscher (2008) showed that the maximal risk of any sparse estimator goes to infinity as the sample size increases. Hence minimax efficiency and model selection consistency seem to be two irreconcilable properties, and in practice one should be cautious about which of them to prioritize when choosing an estimator.

The limiting covariance matrix $V$ involves the unknown hazard function, so it is difficult to estimate $V$ analytically. Here we apply the random perturbation method (Rao and Zhao 1992; Parzen et al. 1994; Jin et al. 2001) to approximate the distribution of $\hat\beta$. To be specific, define $\hat\beta^*$ as a minimizer of the perturbed $\ell_1$-regularized Gehan loss function

$$L_P^*(\beta):=n^{-2}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\{e_i(\beta)-e_j(\beta)\}^- Z_i+\sum_{j=1}^p\lambda_{nj}|\beta_j|, \qquad (2)$$

where the $Z_i$ are independent random variables with $E(Z_i)=1$ and $\mathrm{Var}(Z_i)=1$; in the data analysis and simulation studies, the standard exponential distribution is used. The following theorem justifies the use of the random perturbation method for approximating the distribution of the estimator.

Theorem 2

(Asymptotic variance) Under (C) and conditions 1–4 of Ying (1993, p. 80), with probability tending to one, conditional on the data $(\tilde T_i,\Delta_i,X_i)$ $(i=1,\ldots,n)$, $\hat\beta^*=\{(\hat\beta_1^*)^T,(\hat\beta_2^*)^T\}^T$ has the following properties: (a) $\hat\beta_2^*=0$; (b) $n^{1/2}(\hat\beta_1^*-\hat\beta_1)\to N(0,V)$.

In practice, we fix the tuning parameters at values satisfying condition (C) when implementing the perturbation method.
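As an illustration, the resampling scheme takes only a few lines; the sketch below assumes a routine `penalized_gehan(logT, delta, X, lambda_nj, weights)` that minimizes the weighted penalized loss, such as the linear programming sketch given later in Sect. 3.

```r
perturb_se <- function(logT, delta, X, lambda_nj, B = 1000) {
  draws <- replicate(B, {
    Z <- rexp(length(logT))              # Z_i ~ Exp(1): E(Z_i) = Var(Z_i) = 1
    penalized_gehan(logT, delta, X, lambda_nj, weights = Z)
  })
  apply(draws, 1, sd)                    # componentwise standard errors
}
```

Standard errors for $\hat\beta_1$ are then read off the coordinates estimated as nonzero.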

3 Computation and tuning parameter selection

As pointed out in Jin et al. (2003), the minimization of $L_P(\beta)$ can be carried out by linear programming and is equivalent to the minimization of

$$n^{-2}\sum_{i=1}^n\sum_{j=1}^n\Delta_i\,|e_i(\beta)-e_j(\beta)|+\Bigl|M-\beta^T\sum_{k=1}^n\sum_{l=1}^n\Delta_k(X_l-X_k)\Bigr|+\sum_{j=1}^p\lambda_{nj}|\beta_j|,$$

where $M$ is a large constant. An implementation of the algorithm is discussed by Koenker and D'Orey (1987), with code available in S-PLUS, R and other software; the minimization of $L_P^*(\beta)$ can be implemented similarly. Condition (C) suggests pre-selecting the $\{\lambda_{nj}\}$ based on a preliminary estimate $\tilde\beta$ of $\beta$, and in this work we take

$$\lambda_{nj}=\lambda\,|\tilde\beta_j|^{-\tau},\qquad j=1,\ldots,p, \qquad (3)$$

for $\tau>0$, where the $\tilde\beta_j$ are $\sqrt n$-consistent. This choice is discussed by Zou (2006), Zhang and Lu (2007) and Wang et al. (2007b). For such $\lambda_{nj}$'s, we have $a_n=\max\{\lambda_{nj},\,j=1,\ldots,s\}=\lambda\,O_p(1)$ and $b_n=\min\{\lambda_{nj},\,j=s+1,\ldots,p\}=\lambda\,O_p(n^{\tau/2})$. It is then easy to see that once $\lambda$ satisfies

$$\sqrt n\,a_n\to 0\qquad\text{and}\qquad\sqrt n\,b_n\to\infty,$$

Theorems 1 and 2 hold. This simplification is attractive from a computational viewpoint, since we only need to choose one tuning parameter λ instead of p tuning parameters λnj, j = 1, …, p. For later exposition, we shall fix {λnj} according to (3) with τ = 1.
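Putting the pieces together, the linear program above can be handed to quantreg directly. The sketch below (our illustration under the stated assumptions, not the authors' code) creates one $L_1$ "observation" per absolute-value term: a weighted row for each pair $(i,j)$ with $\Delta_i=1$, one big-$M$ row carrying the linear part of the Gehan loss, and one pseudo-row per penalty term. The factor 2 on the penalty rows comes from the identity $a^-=(|a|-a)/2$, so that the program minimizes $2L_P(\beta)$ up to an additive constant; the $O(n^2)$ pairwise rows make this a small-$n$ sketch only.

```r
library(quantreg)

penalized_gehan <- function(logT, delta, X, lambda_nj, M = 1e6,
                            weights = rep(1, length(logT))) {
  n <- length(logT); p <- ncol(X)
  i <- rep(seq_len(n), each = n); j <- rep(seq_len(n), times = n)
  keep <- delta[i] == 1                        # pairs with Delta_i = 1
  w  <- weights[i][keep] / n^2                 # weights carry the Z_i of Sect. 2
  y  <- (logT[i] - logT[j])[keep] * w          # w * (log T_i - log T_j)
  Xd <- (X[i[keep], , drop = FALSE] - X[j[keep], , drop = FALSE]) * w
  ## big-M row: |M - beta'v| ~ M - beta'v for large M, supplying the linear
  ## part of the Gehan loss, with v = n^{-2} sum_i sum_j Delta_i (X_j - X_i)
  v  <- colSums((X[j, , drop = FALSE] - X[i, , drop = FALSE]) *
                  (delta[i] * weights[i])) / n^2
  y  <- c(y, M);         Xd <- rbind(Xd, v)
  ## one pseudo-row per penalty term: |0 - 2 lambda_nj beta_j| = 2 lambda_nj |beta_j|
  y  <- c(y, rep(0, p)); Xd <- rbind(Xd, diag(2 * lambda_nj, nrow = p))
  rq.fit(Xd, y, tau = 0.5)$coefficients        # median regression, no intercept
}
```

Calling the routine with `lambda_nj = rep(0, p)` returns the unpenalized Gehan estimator $\tilde\beta$, whose components are then plugged into (3) to form the adaptive weights.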

In order to study the dependence of the selected model on the tuning parameter $\lambda$, we denote the model corresponding to $\hat\beta_\lambda$ by $S_\lambda=\{j:\hat\beta_{\lambda,j}\ne 0\}$. Write the derivative of the Gehan loss function as

$$U_G(\beta)=n^{-1}\sum_{i=1}^n\Delta_i\,S^{(0)}\{\beta;e_i(\beta)\}\bigl[X_i-\bar X\{\beta;e_i(\beta)\}\bigr].$$

Then $\sqrt n\,U_G(\beta_0)$ is asymptotically normal with mean zero and variance estimated by

$$B_{Gn}=n^{-1}\sum_{i=1}^n\Delta_i\bigl[S^{(0)}\{\beta;e_i(\beta)\}\bigr]^2\bigl[X_i-\bar X\{\beta;e_i(\beta)\}\bigr]^{\otimes 2},$$
where $a^{\otimes 2}=aa^T$.

Consider the following $\chi^2$-type statistic:

$$T_\lambda=n\,U_G(\hat\beta_\lambda)^T B_{Gn}^{-1}(\hat\beta_G)\,U_G(\hat\beta_\lambda),$$

where $B_{Gn}$ is evaluated at the unpenalized Gehan estimator $\hat\beta_G$. By arguments similar to those in Wei et al. (1990), when $S_\lambda\supseteq\{1,\ldots,s\}$, i.e., when a correct model is identified (not necessarily the true model), $T_\lambda$ asymptotically follows the $\chi^2$ distribution with degrees of freedom $q$ equal to the number of zero components of $\hat\beta_\lambda$. $T_\lambda$ is scale-invariant and has a likelihood interpretation, and hence can be used for tuning parameter selection. An attractive property of $T_\lambda$ is that $B_{Gn}$ does not require density estimation and thus can be easily computed. Based on $T_\lambda$, we propose the Bayesian information criterion (BIC)

$$\mathrm{BIC}_\lambda=T_\lambda+\log n\cdot df_\lambda,$$

where dfλ is the number of nonzero components in β̂λ. Similarly, the Akaike information criterion (AIC) can be defined as

$$\mathrm{AIC}_\lambda=T_\lambda+2\cdot df_\lambda.$$

Replacing the $\ell_2$ loss function in generalized cross-validation (GCV) with the Gehan loss function, GCV can be defined as

$$\mathrm{GCV}_\lambda=\frac{L_G(\hat\beta_\lambda)}{(1-df_\lambda/n)^2}.$$
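For concreteness, the statistic and the three criteria can be computed as follows, reusing `gehan_loss` and `gehan_grad` from Sect. 1 (a sketch with our own names; note that $U_G$ in this section carries an extra factor $n^{-1}$ relative to Sect. 1, and that $B_{Gn}$ is evaluated at the unpenalized Gehan estimator $\hat\beta_G$ as in the display above).

```r
tuning_criteria <- function(beta_lambda, beta_G, logT, delta, X) {
  n  <- length(logT); p <- ncol(X)
  U  <- gehan_grad(beta_lambda, logT, delta, X) / n  # U_G(beta_lambda), this section's scaling
  eG <- logT - as.vector(X %*% beta_G)               # residuals at the Gehan estimate
  B  <- matrix(0, p, p)                              # B_Gn(beta_G), outer-product form
  for (i in which(delta == 1)) {
    R  <- eG >= eG[i]
    s0 <- mean(R)                                    # S^(0){beta_G; e_i(beta_G)}
    xb <- X[i, ] - colMeans(X[R, , drop = FALSE])    # X_i - Xbar{beta_G; e_i(beta_G)}
    B  <- B + s0^2 * tcrossprod(xb)
  }
  B    <- B / n
  Tlam <- n * drop(t(U) %*% solve(B, U))             # the chi-square type statistic
  df   <- sum(beta_lambda != 0)                      # number of nonzero components
  c(BIC = Tlam + log(n) * df,
    AIC = Tlam + 2 * df,
    GCV = gehan_loss(beta_lambda, logT, delta, X) / (1 - df / n)^2)
}
```

In practice one evaluates these criteria over a grid of $\lambda$ values and picks the minimizer.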

We define $R_0=\{\lambda\ge 0: S_\lambda=\{1,\ldots,s\}\}$ as the set of $\lambda$'s for which the true model is identified. In addition, we define a reference tuning parameter sequence $\{\lambda_n^*=1/[\sqrt n\,\log(n)]\}_{n=1}^\infty$. By Theorem 1, it follows that with probability tending to one, $S_{\lambda_n^*}=\{1,\ldots,s\}$. We have the following consistency theorem for the BIC method.

Theorem 3

Under the assumptions of Theorem 1, $P\bigl(\inf_{\lambda\notin R_0}\mathrm{BIC}_\lambda>\mathrm{BIC}_{\lambda_n^*}\bigr)\to 1$.

This theorem states that for any $\lambda$ which fails to select the true model, the associated BIC value is asymptotically larger than that of the reference sequence. Therefore, the optimal $\lambda$ which minimizes BIC must correspond to the true model. The proof is similar to that of Theorem 4 in Wang and Leng (2007) and is therefore omitted. Note that neither AIC nor GCV yields consistent model selection when a true sparse model exists (Wang et al. 2007c).

4 Extensions to multivariate failure time data

Following Jin et al. (2006), in this section we extend the $\ell_1$-regularized method to multivariate failure time data. Oracle properties for the estimators defined in this section, similar to those in Theorems 1–3, can be established using similar techniques and are therefore omitted. Moreover, computing the estimates relies on the same linear programming technique and thus is easy to implement.

4.1 Multiple events data

Multiple events data arise when a subject can potentially experience several types of event or failure. For k = 1, …, K and i = 1, …, n, let Tki be the time to the kth failure of the ith subject, let Cki be the censoring time on Tki, and let Xki be the corresponding pk-vector of covariates. We assume that (T1i, …, TKi) is independent of (C1i, …, CKi) conditional on (X1i, …, XKi). The marginal accelerated failure time models take the form

$$\log T_{ki}=X_{ki}^T\beta_k+\varepsilon_{ki}\qquad(k=1,\ldots,K;\ i=1,\ldots,n),$$

where $\beta_k$ is a $p_k$-vector of unknown regression parameters, and $(\varepsilon_{1i},\ldots,\varepsilon_{Ki})$, $i=1,\ldots,n$, are independent random vectors from an unspecified joint distribution with marginal distribution functions $F_1,\ldots,F_K$. The data consist of $(\tilde T_{ki},\delta_{ki},X_{ki})$, $k=1,\ldots,K$; $i=1,\ldots,n$, where $\tilde T_{ki}=\min(T_{ki},C_{ki})$ and $\delta_{ki}=1\{T_{ki}\le C_{ki}\}$.

Let $e_{ki}(\beta)=\log\tilde T_{ki}-\beta^T X_{ki}$; the Gehan-type loss function for $\beta_k$ is then

$$L_{k,G}(\beta)=n^{-1}\sum_{i=1}^n\sum_{j=1}^n\delta_{ki}\{e_{ki}(\beta)-e_{kj}(\beta)\}^-.$$

For each $k=1,\ldots,K$, the corresponding $\ell_1$-regularized loss function is

$$n^{-1}\sum_{i=1}^n\sum_{j=1}^n\delta_{ki}\{e_{ki}(\beta)-e_{kj}(\beta)\}^-+\sum_{j=1}^{p_k}\lambda_{nj}|\beta_{kj}|.$$

Similarly, the minimization problem can be reduced to K standard linear programming problems and the distribution of the estimator can be approximated by the perturbation method.
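Since the $K$ marginal criteria share no parameters, a sketch of the multiple-events fit is simply a loop over event types, reusing the `penalized_gehan` routine sketched in Sect. 3 (the input layout is our own choice):

```r
fit_multiple_events <- function(logT, delta, X, lambda_nj) {
  ## logT, delta: n x K matrices; X and lambda_nj: lists of length K
  lapply(seq_len(ncol(logT)), function(k)
    penalized_gehan(logT[, k], delta[, k], X[[k]], lambda_nj[[k]]))
}
```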

4.2 Clustered failure time data

Clustered failure time data arise when we have a random sample of $n$ clusters with $K_i$ members in the $i$th cluster. Let $T_{ik}$ and $C_{ik}$ be the failure time and censoring time for the $k$th member of the $i$th cluster, and let $X_{ik}$ be the corresponding $p\times 1$ vector of covariates. We assume that $(T_{i1},\ldots,T_{iK_i})$ and $(C_{i1},\ldots,C_{iK_i})$ are independent conditional on $(X_{i1},\ldots,X_{iK_i})$. The data consist of $(\tilde T_{ik},\delta_{ik},X_{ik})$, $k=1,\ldots,K_i$; $i=1,\ldots,n$, where $\tilde T_{ik}=\min(T_{ik},C_{ik})$ and $\delta_{ik}=1\{T_{ik}\le C_{ik}\}$.

Suppose that the marginal distributions of the $T_{ik}$ satisfy the accelerated failure time model

$$\log T_{ik}=X_{ik}^T\beta_0+\varepsilon_{ik}\qquad(k=1,\ldots,K_i;\ i=1,\ldots,n),$$

where $\beta_0$ is a $p$-vector and $(\varepsilon_{i1},\ldots,\varepsilon_{iK_i})$, $i=1,\ldots,n$, are independent random vectors. Defining $e_{ik}(\beta)=\log\tilde T_{ik}-X_{ik}^T\beta$, the Gehan-type loss function for $\beta$ is

$$L_G(\beta)=n^{-1}\sum_{i=1}^n\sum_{k=1}^{K_i}\sum_{j=1}^n\sum_{\ell=1}^{K_j}\delta_{ik}\{e_{ik}(\beta)-e_{j\ell}(\beta)\}^-.$$

The $\ell_1$-regularized loss function is

$$n^{-1}\sum_{i=1}^n\sum_{k=1}^{K_i}\sum_{j=1}^n\sum_{\ell=1}^{K_j}\delta_{ik}\{e_{ik}(\beta)-e_{j\ell}(\beta)\}^-+\sum_{j=1}^p\lambda_{nj}|\beta_j|.$$

4.3 Recurrent events data

With a random sample of $n$ subjects, let $T_{ki}$ be the time to the $k$th recurrent event on the $i$th subject, and let $C_i$ and $X_i$ be the censoring time and the $p\times 1$ vector of covariates for the $i$th subject. Assume that $C_i$ is independent of the $T_{ki}$ $(k=1,2,\ldots)$ conditional on $X_i$. Let

$$N_i(t)=\sum_{k=1}^\infty 1(T_{ki}\le t).$$

We specify the following accelerated time model for the mean frequency function:

$$E\{N_i(t)\mid X_i\}=\mu_0\bigl(t\,e^{\beta_0^T X_i}\bigr),$$

where $\beta_0$ is a $p\times 1$ vector and $\mu_0(\cdot)$ is an unspecified baseline mean function. The Gehan-type loss function for $\beta$ is

$$L_G(\beta)=n^{-1}\sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^\infty 1(T_{ki}\le C_i)\,\{\log T_{ki}-\log C_j-\beta^T(X_i-X_j)\}^-.$$

The $\ell_1$-regularized Gehan loss function is

$$n^{-1}\sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^\infty 1(T_{ki}\le C_i)\,\{\log T_{ki}-\log C_j-\beta^T(X_i-X_j)\}^-+\sum_{j=1}^p\lambda_{nj}|\beta_j|.$$

5 Two real examples

In this section, we apply our method to two well-known datasets.

5.1 Primary biliary cirrhosis data

The primary biliary cirrhosis (PBC) data, provided in Therneau and Grambsch (2001), came from the Mayo Clinic trial in primary biliary cirrhosis of the liver conducted between 1974 and 1984. The data contain information about the survival time and 17 prognostic variables for 424 PBC patients who met the eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. We considered the 276 patients with complete information on all 17 variables and used the accelerated failure time model to study the relationship between the survival time and the prognostic variables. In our analysis, the 17 variables are drug, age, sex, ascites, hepatomegaly, spiders, edema, bilirubin, cholesterol, albumin, urine copper, alkaline phosphatase, SGOT, triglycerides, platelets, prothrombin time and histologic stage of disease. Albumin, alkaline phosphatase, bilirubin and prothrombin time were transformed on the natural logarithmic scale. The variables were also standardized to have mean zero and unit variance. The R package quantreg was used in both the data analysis and the simulation studies. Table 1 summarizes the estimated coefficients of the Gehan estimate, the Lasso estimate and the adaptive Lasso estimate, with the tuning parameters chosen by the AIC, BIC or GCV criterion. For this dataset, GCV and AIC yield the same estimates for the same penalty (Lasso or adaptive Lasso), so only the estimates obtained via AIC are reported. The standard errors were computed via 1000 random perturbations, with the $Z_i$ drawn from the standard exponential distribution; they are reported only for coefficients identified as nonzero. To appreciate the relationship between the various estimates and the tuning parameters, we calculated, for each shrinkage estimator, the shrinkage parameter

$$s=\frac{\sum_{j=1}^p|\hat\beta_j|}{\sum_{j=1}^p|\hat\beta_{G,j}|},$$

where $\hat\beta_G$ is the unpenalized Gehan estimator and $\hat\beta$ is either the Lasso ($\hat\beta_L$) or the adaptive Lasso ($\hat\beta_A$) estimator.
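As a small illustration (names are ours), the x-axis of Figs. 1 and 2 is obtained by mapping each fitted coefficient vector along the $\lambda$ grid to its shrinkage fraction:

```r
shrinkage_path <- function(beta_G, fits) {
  ## fits: a list of coefficient vectors, one per lambda on the grid
  sapply(fits, function(b) sum(abs(b)) / sum(abs(beta_G)))
}
```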

Table 1.

Primary biliary cirrhosis data.

β̂G β̂L(AIC) β̂L(BIC) β̂A(AIC) β̂A(BIC)
Drug −0.068(0.058) −0.018(0.039) 0(−) 0(−) 0(−)
Age −0.227(0.060) −0.180(0.064) −0.157(0.062) −0.210(0.066) −0.185(0.062)
Sex 0.108(0.045) 0.073(0.045) 0.042(0.040) 0.060(0.045) 0(−)
Asc −0.097(0.077) −0.115(0.080) −0.113(0.086) −0.083(0.071) 0(−)
Hep 0.053(0.064) 0(−) 0(−) 0(−) 0(−)
Spid −0.136(0.070) −0.099(0.065) −0.073(0.061) −0.101(0.066) 0(−)
Ede −0.245(0.087) −0.241(0.088) −0.247(0.095) −0.278(0.085) −0.336(0.085)
Logbil −0.461(0.073) −0.448(0.069) −0.437(0.067) −0.484(0.070) −0.541(0.072)
Chol 0.046(0.059) 0.020(0.040) 0(−) 0(−) 0(−)
Logalb 0.143(0.072) 0.120(0.071) 0.103(0.072) 0.106(0.072) 0.039(0.056)
Cop −0.001(0.065) 0(−) 0(−) 0(−) 0(−)
Logalk 0.037(0.055) 0(−) 0(−) 0(−) 0(−)
Sgot 0.117(0.056) 0.058(0.048) 0.021(0.039) 0.061(0.050) 0(−)
Trig 0.062(0.059) 0(−) 0(−) 0(−) 0(−)
Plat −0.057(0.065) 0(−) 0(−) 0(−) 0(−)
Logprot −0.067(0.061) −0.025(0.042) 0(−) 0(−) 0(−)
Stage −0.203(0.077) −0.174(0.060) −0.161(0.058) −0.192(0.062) −0.180(0.064)

Estimated coefficients of the Gehan estimate: β̂G, the Lasso estimates: β̂L, and the adaptive Lasso estimates: β̂A

In Figs. 1 and 2, both the coefficients and the criterion function (AIC/BIC/GCV) are plotted against the shrinkage parameter for Lasso and adaptive Lasso estimates respectively.

Fig. 1 The left panel displays the Lasso estimates as a function of s. The right panel shows the AIC, BIC and GCV curves plotted against s

Fig. 2 The left panel displays the adaptive Lasso estimates as a function of s. The right panel shows the AIC, BIC and GCV curves plotted against s

A few observations can be made from Table 1. First, GCV and AIC perform similarly for the Lasso and adaptive Lasso penalties. Second, BIC yields smaller models than AIC and GCV, and the combination of the adaptive Lasso and BIC gives the model with the fewest nonzero coefficients. From Figs. 1 and 2, we see that BIC tends to shrink more than AIC and GCV and hence gives sparser models.

As noted by an anonymous referee, BIC usually has less favorable out-of-sample prediction performance than cross-validation. Here we use a tenfold cross-validation scheme to compare the out-of-sample performance of AIC, BIC, GCV and cross-validation. To be specific, we leave out one tenth of the data and use the remaining data to obtain the Lasso and adaptive Lasso estimates, choosing the tuning parameters via AIC, BIC, GCV or tenfold cross-validation. We then estimate the prediction error by evaluating the Gehan loss function on the left-out tenth. Table 2 shows the results for the PBC data. BIC performs slightly worse than the other criteria, but the loss in prediction accuracy is small.
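A sketch of the scheme follows (our names; `fit_and_tune` is a hypothetical wrapper that fits the penalized estimator on the training folds with the tuning parameter chosen by the given criterion, and `gehan_loss` is the routine from Sect. 1):

```r
cv_prediction_error <- function(logT, delta, X, criterion, K = 10) {
  n <- length(logT)
  fold <- sample(rep(seq_len(K), length.out = n))     # random fold labels
  errs <- sapply(seq_len(K), function(k) {
    tr <- fold != k                                   # train on nine tenths
    b  <- fit_and_tune(logT[tr], delta[tr], X[tr, , drop = FALSE], criterion)
    gehan_loss(b, logT[!tr], delta[!tr], X[!tr, , drop = FALSE])  # score the rest
  })
  sum(errs)                                           # total held-out Gehan loss
}
```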

Table 2.

Predictive performance of AIC, BIC, GCV and cross-validation

Prediction error Lasso Adaptive Lasso
AIC 30.50 31.64
BIC 30.60 31.69
GCV 30.31 31.47
Cross-validation 30.28 31.42

5.2 Framingham heart data

The Framingham Heart Study (Dawber 1980) was initiated in 1948, with 2336 men and 2873 women aged between 30 and 62 years at their baseline examination. Individuals were examined every two years in a 30-year follow-up study. Multiple events, e.g., the times to coronary heart disease (CHD), denoted by $T_1$, and to cerebrovascular accident (CVA), denoted by $T_2$, were observed on the same individual. The data used here included the participants who had an examination at age 44 or 45 and were disease-free at that examination, in the sense that they had no history of hypertension or glucose intolerance and no previous CHD or CVA. The original dataset contains a total of 1571 disease-free individuals. The risk factors of interest were age, $x_1$; body mass index, $x_2$; cholesterol level, $x_3$; systolic blood pressure, $x_4$; cigarette smoking, $x_5$; and gender, $x_6$. Since modeling biases can possibly be reduced by introducing interactions, we consider the marginal bivariate accelerated failure time model using both the main effects and the first-order interactions. The variables were standardized to have zero mean and unit variance. Table 3 summarizes the estimated coefficients of the Gehan estimate, the Lasso estimate and the adaptive Lasso estimate for both CHD and CVA, with the tuning parameters chosen by BIC. It shows that many interactions among the risk factors are selected for both CHD and CVA and, as expected, the adaptive Lasso yields sparser models than the Lasso.

Table 3.

Framingham heart data.

CHD
CVA
β̂G β̂L β̂A β̂G β̂L β̂A
x1 0.035 0.025 0.028 −0.131 −0.052 −0.045
x2 0.047 0.037 0.037 −0.356 −0.091 −0.081
x3 −0.052 −0.045 −0.046 0.071 0.044 0.024
x4 −0.048 −0.041 −0.041 0.098 0.022 0.017
x5 −0.046 −0.038 −0.039 −0.694 −0.160 −0.140
x6 0.023 0.019 0.018 0.012 0 0
x1 * x2 0.003 0 0 −0.039 −0.028 −0.007
x1 * x3 −0.016 −0.013 −0.013 −0.013 0 0
x1 * x4 −0.033 −0.025 −0.024 −0.006 0.007 0
x1 * x5 −0.082 −0.071 −0.074 0.273 0.080 0.116
x1 * x6 0.015 0.011 0.007 −0.044 −0.033 −0.005
x2 * x3 0.037 0.029 0.024 −0.041 0 0
x2 * x4 0.008 0.008 0 0.050 0.006 0
x2 * x5 −0.006 0 0 0.415 0.066 0.097
x2 * x6 −0.026 −0.023 −0.021 0.004 −0.030 0
x3 * x4 0.022 0.015 0.010 0.100 0.060 0.029
x3 * x5 −0.010 −0.011 −0.005 −0.217 −0.118 −0.083
x3 * x6 −0.003 0 0 −0.019 0.010 0
x4 * x5 −0.004 −0.002 0 −0.249 −0.104 −0.096
x4 * x6 0.012 0.006 0 0.073 0.043 0.021
x5 * x6 0.041 0.032 0.031 −0.006 0 0

Estimated coefficients of the Gehan estimate: β̂G, the Lasso estimates: β̂L, and the adaptive Lasso estimates: β̂A

6 Simulations

We conducted extensive simulation studies for univariate and multivariate failure time data in this section. All simulations used the R package quantreg.

6.1 Univariate failure time data

We simulated datasets consisting of 100 observations from the accelerated failure time model

$$\log T_i=\beta^T X_i+\varepsilon_i,$$

where $\beta=(3,1.5,0,0,2,0,0,0)^T$, the covariates $x_i$ were marginally standard normal and the correlation between $x_i$ and $x_j$ was $\rho^{|i-j|}$ with $\rho=0.5$. The censoring times were generated from the $\mathrm{Unif}(0,\tau)$ distribution, where $\tau$ was chosen to be 142 or 50, yielding a censoring level of 30% or 50% respectively. The distribution of $\varepsilon$ was set to $N(0,1)$, $t_3$ or $0.5N(0,1)+0.5N(0,9)$ to assess the robustness of the proposed method. For an estimator $\hat\beta$, performance is gauged by the model error, defined as

$$E(\hat\beta-\beta_0)^T X^T X(\hat\beta-\beta_0).$$

The ideal oracle estimator, which knows the true nonzero coefficients but not their exact values, applies the rank-based estimation procedure using only the covariates $x_1$, $x_2$ and $x_5$. The Lasso and adaptive Lasso were used to penalize the Gehan loss function, and the tuning parameters were chosen by AIC, BIC or GCV as defined in Sect. 3. For each method, the relative model error compared to that of the oracle estimator was computed.
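A sketch of the data-generating scheme just described (function and argument names are ours):

```r
simulate_aft <- function(n = 100, beta = c(3, 1.5, 0, 0, 2, 0, 0, 0),
                         rho = 0.5, tau_c = 142, rerr = rnorm) {
  p <- length(beta)
  ## AR(1)-type design: cor(x_i, x_j) = rho^|i - j|, standard normal margins
  Rt <- chol(rho^abs(outer(seq_len(p), seq_len(p), "-")))
  X  <- matrix(rnorm(n * p), n, p) %*% Rt
  logT <- as.vector(X %*% beta) + rerr(n)   # accelerated failure time model
  C    <- runif(n, 0, tau_c)                # Unif(0, tau) censoring times
  list(logT = pmin(logT, log(C)),           # observed log time
       delta = as.numeric(logT <= log(C)),  # censoring indicator
       X = X)
}
```

The $t_3$ and mixture errors correspond to `rerr = function(n) rt(n, 3)` and `rerr = function(n) rnorm(n, 0, sample(c(1, 3), n, replace = TRUE))` respectively.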

In Table 4, the median relative model errors (MRME) based on 1000 simulated datasets, as well as the average numbers of correctly selected (C) and incorrectly selected (IC) variables, are summarized for the three error distributions. The adaptive Lasso with BIC yields sparser models and more accurate estimates. Furthermore, the adaptive Lasso outperforms the Lasso in terms of both variable selection and model error, and the proposed estimator is robust to the heavy-tailed $t_3$ distribution and the contaminated normal distribution $0.5N(0,1)+0.5N(0,9)$.

Table 4.

Simulation results for the univariate failure time data. Mix denotes the mixture error distribution 0.5N (0,1) + 0.5N (0,9). Sample size is 100

Error Censoring Method MRME C IC
N (0,1) 30% Lasso(AIC) 2.32(2.45) 3(0) 2.43(1.38)
Lasso(BIC) 2.37(2.75) 3(0) 1.46(1.15)
Lasso(GCV) 2.28(2.80) 3(0) 1.72(1.26)
ALasso(AIC) 1.48(1.55) 3(0) 1.00(1.12)
ALasso(BIC) 1.13(0.77) 3(0) 0.27(0.57)
ALasso(GCV) 1.27(0.97) 3(0) 0.61(0.86)
50% Lasso(AIC) 2.27(2.53) 3(0) 2.62(1.39)
Lasso(BIC) 2.61(2.46) 3(0) 1.63(1.25)
Lasso(GCV) 2.41(2.37) 3(0) 1.84(1.35)
ALasso(AIC) 1.46(1.58) 3(0) 1.04(1.21)
ALasso(BIC) 1.16(0.84) 3(0) 0.24(0.51)
ALasso(GCV) 1.22(1.11) 3(0) 0.63(0.91)

t3 30% Lasso(AIC) 2.41(2.07) 3(0) 2.39(1.43)
Lasso(BIC) 2.55(2.45) 3(0) 1.39(1.05)
Lasso(GCV) 2.60(2.37) 3(0) 1.46(1.08)
ALasso(AIC) 1.53(1.39) 3(0) 0.98(1.07)
ALasso(BIC) 1.30(0.93) 3(0) 0.28(0.65)
ALasso(GCV) 1.24(0.79) 3(0) 0.35(0.67)
50% Lasso(AIC) 2.61(2.27) 3(0) 2.40(1.38)
Lasso(BIC) 2.78(2.47) 3(0) 1.56(1.06)
Lasso(GCV) 2.68(2.31) 3(0) 1.58(1.04)
ALasso(AIC) 1.75(1.70) 3(0) 1.02(1.06)
ALasso(BIC) 1.35(1.04) 3(0) 0.27(0.58)
ALasso(GCV) 1.38(1.08) 3(0) 0.41(0.73)

Mix 30% Lasso(AIC) 2.30(2.70) 3(0) 2.30(1.37)
Lasso(BIC) 2.37(2.57) 3(0) 1.52(1.26)
Lasso(GCV) 2.38(2.75) 3(0) 1.52(1.28)
ALasso(AIC) 1.64(1.71) 3(0) 1.06(1.15)
ALasso(BIC) 1.46(1.47) 3(0) 0.47(0.82)
ALasso(GCV) 1.50(1.47) 3(0) 0.56(0.88)
50% Lasso(AIC) 2.15(2.61) 3(0) 2.34(1.30)
Lasso(BIC) 2.16(2.99) 3(0) 1.55(1.21)
Lasso(GCV) 2.38(2.95) 3(0) 1.68(1.27)
ALasso(AIC) 1.82(1.77) 3(0) 1.25(1.20)
ALasso(BIC) 1.44(1.56) 3(0) 0.49(0.80)
ALasso(GCV) 1.59(1.68) 3(0) 0.65(0.88)

To investigate the performance of the adaptive Lasso and the Lasso when the signal-to-noise ratio is small, following Leeb and Pötscher (2008) we multiply the regression coefficient vector by $1/\sqrt n$. With sample size 100 and 30% censoring, the results are summarized in Table 5. The adaptive Lasso performs very poorly in this regime, even somewhat worse than the Lasso.

Table 5.

Simulation results when the signal-to-noise ratio is small

Method MRME C IC
Lasso(AIC) 1.92(3.71) 2.26(0.76) 1.79(1.41)
Lasso(BIC) 2.43(4.15) 1.44(0.94) 0.58(0.98)
Lasso(GCV) 2.03(3.58) 1.82(0.87) 0.82(1.03)
ALasso(AIC) 2.27(3.30) 2.01(0.69) 1.32(1.18)
ALasso(BIC) 2.44(3.78) 1.46(0.77) 0.52(0.80)
ALasso(GCV) 2.28(3.50) 1.71(0.73) 0.66(0.84)

6.2 Multivariate failure time data

For multiple events and clustered data, the two failure times $T_1$ and $T_2$ were generated from Gumbel's (1960) bivariate distribution:

$$F(t_1,t_2)=F_1(t_1)F_2(t_2)\bigl[1+\theta\{1-F_1(t_1)\}\{1-F_2(t_2)\}\bigr],$$

where $-1\le\theta\le 1$. The correlation between $T_1$ and $T_2$ is $\theta/4$. The two marginal distributions $F_k(t_k)$, $k=1,2$, were exponential with hazard rate $\lambda_k=e^{\beta_k^T X_k}$. We simulated 100 datasets, each consisting of 100 observations from the model, where $X_k=(x_{1k},\ldots,x_{pk})$, $p=8$, the $x_{ik}$ were marginally standard normal and the correlation between $x_{ik}$ and $x_{jk}$ was $\rho^{|i-j|}$ with $\rho=0.5$. The censoring times were generated from the $\mathrm{Unif}(0,\tau)$ distribution with $\tau=1.5$, yielding a censoring level of 50%. For multiple events, we set $\beta_{10}=(3,1.5,0,0,2,0,0,0)^T$, $\beta_{20}=(0,0,2,0,0,1.5,3,0)^T$ and $X_k=X$. For clustered data, $X_1$ and $X_2$ were generated independently and $\beta_{k0}=\beta_0=(3,1.5,0,0,2,0,0,0)^T$. For recurrent events, we set $\beta_0=(3,1.5,0,0,2,0,0,0)^T$ and the covariates were generated in the same manner as for multiple events. The gap times between successive events were generated from the aforementioned Gumbel bivariate exponential distribution. The resulting recurrent event process is Poisson under $\theta=0$ and non-Poisson under $\theta\ne 0$. The follow-up time was an independent $\mathrm{Unif}(0,2.5)$ random variable, which on average yielded approximately 2.60 and 2.86 events per subject for the Poisson and non-Poisson cases respectively.
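Gumbel's (1960) distribution above is the Farlie–Gumbel–Morgenstern copula with exponential margins, and it can be sampled by conditional inversion: given $U_1=u_1$, the conditional distribution function of $U_2$ is quadratic and can be inverted in closed form. A sketch (our code, not the authors'):

```r
rgumbel_biexp <- function(n, theta, rate1, rate2) {
  u1 <- runif(n); v <- runif(n)
  A  <- theta * (1 - 2 * u1)            # coefficient in P(U2 <= u | U1 = u1)
  u2 <- ifelse(abs(A) < 1e-10, v,       # A ~ 0: conditional independence
               ((1 + A) - sqrt((1 + A)^2 - 4 * A * v)) / (2 * A))
  cbind(T1 = qexp(u1, rate = rate1),    # exponential margins F_1, F_2
        T2 = qexp(u2, rate = rate2))
}
```

As a quick check, the off-diagonal of `cor(rgumbel_biexp(1e5, theta, 1, 1))` should be close to $\theta/4$.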

For multiple events, the performance of the estimator is gauged by the model error which is defined as

$$\sum_{k=1}^2 E(\hat\beta_k-\beta_{0k})^T X^T X(\hat\beta_k-\beta_{0k}).$$

The ideal oracle estimator, which knows the true nonzero coefficients but not their exact values, applies the rank-based estimation procedure using only the covariates $x_1$, $x_2$ and $x_5$ for $T_1$, and $x_3$, $x_6$ and $x_7$ for $T_2$. For clustered data, the performance of the estimator is gauged by the model error

$$\sum_{k=1}^2 E(\hat\beta-\beta_0)^T X_k^T X_k(\hat\beta-\beta_0).$$

For recurrent events, the performance of the estimator is gauged by the model error

$$E(\hat\beta-\beta_0)^T X^T X(\hat\beta-\beta_0).$$

For clustered data and recurrent events, the ideal oracle estimator just considers variables x1, x2 and x5.

The Lasso and adaptive Lasso were used to penalize the Gehan loss function, and the tuning parameters were chosen by AIC, BIC or GCV as defined in Sect. 3. For each method, the relative model error compared to that of the oracle estimator was computed.

In Table 6, the median relative model errors (MRME) based on 100 simulated datasets, as well as the average numbers of correctly selected (C) and incorrectly selected (IC) variables, are summarized for multiple events, clustered and recurrent events data respectively. We see again that the adaptive Lasso with BIC performs best.

Table 6.

Simulation results for the multivariate failure time data. Sample size is 100. Censoring level is 50%

θ Method MRME C IC
Multiple events 0 Lasso(AIC) 2.36(1.58) 6(0) 5.83(1.86)
Lasso(BIC) 2.79(1.65) 6(0) 3.92(1.74)
Lasso(GCV) 2.43(1.61) 6(0) 5.29(2.08)
ALasso(AIC) 1.77(1.38) 6(0) 2.67(1.83)
ALasso(BIC) 1.41(0.67) 6(0) 0.64(0.89)
ALasso(GCV) 1.68(1.18) 6(0) 2.45(1.82)

1 Lasso(AIC) 2.42(1.78) 6(0) 5.63(1.86)
Lasso(BIC) 2.69(1.95) 6(0) 3.89(1.75)
Lasso(GCV) 2.44(1.79) 6(0) 5.32(2.02)
ALasso(AIC) 1.81(1.44) 6(0) 2.68(1.80)
ALasso(BIC) 1.41(0.70) 6(0) 0.68(0.89)
ALasso(GCV) 1.74(1.55) 6(0) 2.46(1.90)

Clustered data 0 Lasso(AIC) 2.23(2.42) 3(0) 2.57(1.36)
Lasso(BIC) 2.56(2.53) 3(0) 1.55(1.17)
Lasso(GCV) 2.36(2.32) 3(0) 1.79(1.30)
ALasso(AIC) 1.28(1.45) 3(0) 0.94(1.27)
ALasso(BIC) 1.04(0.86) 3(0) 0.21(0.40)
ALasso(GCV) 1.08(1.10) 3(0) 0.54(0.84)

1 Lasso(AIC) 2.33(2.02) 3(0) 2.40(1.32)
Lasso(BIC) 2.70(3.70) 3(0) 1.24(1.10)
Lasso(GCV) 2.54(2.46) 3(0) 1.68(1.14)
ALasso(AIC) 1.12(1.36) 3(0) 0.72(0.96)
ALasso(BIC) 1.00(0.53) 3(0) 0.11(0.25)
ALasso(GCV) 1.03(0.80) 3(0) 0.32(0.78)

Recurrent events 0 Lasso(AIC) 2.63(2.29) 3(0) 2.42(1.39)
Lasso(BIC) 2.76(2.48) 3(0) 1.58(1.04)
Lasso(GCV) 2.64(2.32) 3(0) 1.61(1.06)
ALasso(AIC) 1.74(1.68) 3(0) 1.04(1.09)
ALasso(BIC) 1.38(1.02) 3(0) 0.29(0.56)
ALasso(GCV) 1.42(1.10) 3(0) 0.43(0.76)

1 Lasso(AIC) 2.36(2.54) 3(0) 2.76(1.34)
Lasso(BIC) 2.80(3.90) 3(0) 1.26(1.08)
Lasso(GCV) 2.50(2.76) 3(0) 1.78(1.36)
ALasso(AIC) 1.48(1.60) 3(0) 1.16(1.26)
ALasso(BIC) 1.26(0.86) 3(0) 0.18(0.42)
ALasso(GCV) 1.32(1.26) 3(0) 0.50(0.69)

7 Discussion

We propose in this paper an $\ell_1$-regularized procedure for variable selection in the accelerated failure time model. The resulting estimates possess the oracle properties when the adaptive Lasso penalty is used for penalization and BIC is used for tuning parameter selection. Additionally, we extend the $\ell_1$-regularized procedure to multivariate failure time models, including multiple events data, clustered survival data and recurrent events data. Extensive simulation studies and two real data analyses illustrate the usefulness of our approach for both variable selection and coefficient estimation. Although we only considered the Gehan-statistic-based loss function, it is rather straightforward to extend our approach to the other weighting schemes discussed in Jin et al. (2003). Our current implementation via quantreg uses a grid of tuning parameter values; alternatively, one could implement the path-following algorithm detailed in Li and Zhu (2008), which, as noted by an anonymous referee, has recently been investigated in Cai et al. (2009). We modeled multivariate survival data via the marginal approach; it would be interesting to investigate how to conduct variable selection while accounting for the correlations among multiple failure times.

Acknowledgments

This research was supported by grants from the U.S. National Institutes of Health, the U.S. National Science Foundation and the National University of Singapore (R-155-000-075-112 and R-155-000-080-112). We are very grateful to the editor, the associate editor and the referees for their helpful comments which have greatly improved the paper.

Appendix

Proofs of Theorems 1 and 2

The proof of Theorem 1 proceeds by first establishing a local quadratic property of the Gehan loss function (Proposition 1) and an inequality (Proposition 2) relating the minimizer of the penalized loss function to the minimizer of a penalized quadratic function. The $\sqrt n$-consistency of the penalized estimator and the oracle properties then follow by arguments similar to those of Fan and Li (2001).

Define

$$A_G=\lim_{n\to\infty}n^{-1}\sum_{i=1}^n\int S^{(0)}(\beta_0;t)\{X_i-\bar X(\beta_0;t)\}^{\otimes 2}\{\dot\lambda(t)/\lambda(t)\}\,dN_i(\beta_0;t),$$
$$B_G=\lim_{n\to\infty}n^{-1}\sum_{i=1}^n\int\bigl[S^{(0)}(\beta_0;t)\bigr]^2\{X_i-\bar X(\beta_0;t)\}^{\otimes 2}\,dN_i(\beta_0;t),$$
where $\lambda(\cdot)$ denotes the common hazard function of the error terms.

Proposition 1

Under conditions 1–4 of Ying (1993, p. 80), for every sequence dn > 0 with dn → 0 a.s., we have

$$n^{-1}L_G(\beta)-n^{-1}L_G(\beta_0)=n^{-1}U_G^T(\beta_0)(\beta-\beta_0)+(\beta-\beta_0)^T A_G(\beta-\beta_0)/2+o(\|\beta-\beta_0\|^2+n^{-1}) \qquad (4)$$

holds uniformly inββ0 ∥ ≤ dn.

Proof

It follows from Theorem 2 of Ying (1993) that almost surely, uniformly in ∥ββ0∥ ≤ dn, we have

$$U_G(\beta)=U_G(\beta_0)+nA_G(\beta-\beta_0)+o(n^{1/2}+n\|\beta-\beta_0\|). \qquad (5)$$

Write $U_G=(U_{G1},\ldots,U_{Gp})^T$, $A_G=(a_{ij})$, $1\le i,j\le p$, $\beta=(\beta_1,\ldots,\beta_p)^T$ and $\beta_0=(\beta_{01},\ldots,\beta_{0p})^T$, and define the interpolating vectors $\beta^j=(\beta_1,\ldots,\beta_{p-j},\beta_{0(p-j+1)},\ldots,\beta_{0p})^T$, $1\le j\le p$, with $\beta^0=\beta$. Noticing that $\beta^p=\beta_0$, we have

$$L_G(\beta)-L_G(\beta_0)=L_G(\beta^0)-L_G(\beta^p)=\sum_{k=1}^p\bigl[L_G(\beta^{k-1})-L_G(\beta^k)\bigr]=\sum_{k=1}^p\int_{\beta_{0(p-k+1)}}^{\beta_{p-k+1}}U_{G(p-k+1)}(\beta_1,\ldots,\beta_{p-k},s_{p-k+1},\beta_{0(p-k+2)},\ldots,\beta_{0p})\,ds_{p-k+1}.$$

By (5),

$$U_{G(p-k+1)}(\beta_1,\ldots,\beta_{p-k},s_{p-k+1},\beta_{0(p-k+2)},\ldots,\beta_{0p})=U_{G(p-k+1)}(\beta_0)+n\sum_{l=1}^{p-k}a_{(p-k+1)l}(\beta_l-\beta_{0l})+n\,a_{(p-k+1)(p-k+1)}(s_{p-k+1}-\beta_{0(p-k+1)})+o(n^{1/2}+n\|\beta-\beta_0\|);$$

then

$$\begin{aligned}L_G(\beta)-L_G(\beta_0)&=\sum_{k=1}^p\Bigl[U_{G(p-k+1)}(\beta_0)(\beta_{p-k+1}-\beta_{0(p-k+1)})+\frac n2\,a_{(p-k+1)(p-k+1)}(\beta_{p-k+1}-\beta_{0(p-k+1)})^2\Bigr]\\&\quad+n\sum_{k=1}^p\sum_{l=1}^{p-k}a_{(p-k+1)l}(\beta_l-\beta_{0l})(\beta_{p-k+1}-\beta_{0(p-k+1)})+o(n^{1/2}\|\beta-\beta_0\|+n\|\beta-\beta_0\|^2)\\&=U_G^T(\beta_0)(\beta-\beta_0)+n(\beta-\beta_0)^T A_G(\beta-\beta_0)/2+o(n^{1/2}\|\beta-\beta_0\|+n\|\beta-\beta_0\|^2).\end{aligned}$$

Hence

$$\begin{aligned}n^{-1}L_G(\beta)-n^{-1}L_G(\beta_0)&=n^{-1}U_G^T(\beta_0)(\beta-\beta_0)+(\beta-\beta_0)^T A_G(\beta-\beta_0)/2+o(n^{-1/2}\|\beta-\beta_0\|+\|\beta-\beta_0\|^2)\\&=n^{-1}U_G^T(\beta_0)(\beta-\beta_0)+(\beta-\beta_0)^T A_G(\beta-\beta_0)/2+o(\|\beta-\beta_0\|^2+n^{-1}).\end{aligned}$$

This completes the proof.

Consider the objective function
$$C(u)=u^T Du/2-a^T u+\sum_{j=1}^s\lambda_j u_j+\sum_{j=s+1}^p\lambda_j|u_j|,$$
where $u\in\mathbb R^p$, $D$ is a positive definite matrix, $\lambda_1,\ldots,\lambda_s$ are constants and $\lambda_{s+1},\ldots,\lambda_p$ are nonnegative constants. Suppose that $\hat u$ is a minimizer of $C(u)$; we have the following proposition.

Proposition 2

For any $u$, we have $C(u)-C(\hat u)\ge(u-\hat u)^T D(u-\hat u)/2$.

Suppose that ûn is a minimizer of the objective function

$$B_n(u)=n^{-1/2}U_G^T(\beta_0)u+u^T A_G u/2+\sum_{i=1}^s\sqrt n\,\lambda_{ni}\,\mathrm{sgn}(\beta_{0i})u_i+\sum_{i=s+1}^p\sqrt n\,\lambda_{ni}|u_i|. \qquad (6)$$

By Propositions 1 and 2, it can be shown that $n^{1/2}(\hat\beta-\beta_0)$ and $\hat u_n$ have the same asymptotic distribution. The $\sqrt n$-consistency of the penalized estimator and the oracle properties then follow from the same arguments as in Fan and Li (2001).

Theorem 2 can be established similarly by examining the perturbed penalized loss function and applying conditional arguments as in Jin et al. (2003); the details are thus omitted.

Contributor Information

Jinfeng Xu, Email: staxj@nus.edu.sg, Department of Statistics and Applied Probability, Risk Management Institute, National University of Singapore, 117546 Singapore, Singapore.

Chenlei Leng, Email: stalc@nus.edu.sg, Department of Statistics and Applied Probability, Risk Management Institute, National University of Singapore, 117546 Singapore, Singapore.

Zhiliang Ying, Email: zying@stat.columbia.edu, Department of Statistics, Columbia University, New York, NY 10027, USA.

References

  1. Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics. 2009. doi:10.1111/j.1541-0420.2008.01074.x, to appear.
  2. Cox DR. Regression models and life-tables (with discussion). J R Stat Soc B. 1972;34:187–220.
  3. Dawber TR. The Framingham Study: The Epidemiology of Atherosclerotic Disease. Harvard University Press; Cambridge: 1980.
  4. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–1360.
  5. Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Stat. 2002;30:74–99.
  6. Gehan EA. A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika. 1965;52:203–223.
  7. Gumbel EJ. Bivariate exponential distributions. J Am Stat Assoc. 1960;55:698–707.
  8. Jin Z, Ying Z, Wei LJ. A simple resampling method by perturbing the minimand. Biometrika. 2001;88:381–390.
  9. Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika. 2003;90:341–353.
  10. Jin Z, Lin DY, Ying Z. Rank regression analysis of multivariate failure time data based on marginal linear models. Scand J Stat. 2006;33:1–23.
  11. Johnson BA. Variable selection in semiparametric linear regression with censored data. J R Stat Soc B. 2008;70:351–370.
  12. Johnson BA, Peng LM. Rank-based variable selection. J Nonparametr Stat. 2008;20:241–252.
  13. Johnson BA, Lin DY, Zeng D. Penalized estimating functions and variable selection in semiparametric regression models. J Am Stat Assoc. 2008;103:672–680. doi:10.1198/016214508000000184.
  14. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. Wiley; New York: 2002.
  15. Koenker R, D’Orey V. Computing regression quantiles. Appl Stat. 1987;36:383–393.
  16. Leeb H, Pötscher BM. Sparse estimators and the oracle property, or the return of Hodges’ estimator. J Econom. 2008;142:201–211.
  17. Li Y, Zhu J. L1-norm quantile regression. J Comput Graph Stat. 2008;17:163–185.
  18. Lu W, Zhang HH. Variable selection for proportional odds model. Stat Med. 2007;26:3771–3781. doi:10.1002/sim.2833.
  19. Parzen MI, Wei LJ, Ying Z. A resampling method based on pivotal estimating functions. Biometrika. 1994;81:341–350.
  20. Rao CR, Zhao LC. Approximation to the distribution of M-estimates in linear models by randomly weighted bootstrap. Sankhyā A. 1992;54:323–331.
  21. Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. Springer; New York: 2001.
  22. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc B. 1996;58:267–288.
  23. Tibshirani R. The Lasso method for variable selection in the Cox model. Stat Med. 1997;16:385–395.
  24. Wang H, Leng C. Unified Lasso estimation by least squares approximation. J Am Stat Assoc. 2007;102:1039–1048.
  25. Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the LAD-Lasso. J Bus Econ Stat. 2007a;25:347–355.
  26. Wang H, Li G, Tsai CL. Regression coefficient and autoregressive order shrinkage and selection via the Lasso. J R Stat Soc B. 2007b;69:63–78.
  27. Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007c;94:553–568. doi:10.1093/biomet/asm053.
  28. Wei LJ, Ying Z, Lin DY. Linear regression analysis of censored survival data based on rank tests. Biometrika. 1990;77:845–851.
  29. Ying Z. A large sample study of rank estimation for censored regression data. Ann Stat. 1993;21:76–99.
  30. Zhang HH, Lu W. Adaptive Lasso for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
  31. Zou H. The adaptive Lasso and its oracle properties. J Am Stat Assoc. 2006;101:1418–1429.
  32. Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95:241–247.
