Published in final edited form as: Stat Probab Lett. 2014 Jul 5;93:126–133. doi: 10.1016/j.spl.2014.06.024

A Modified Adaptive Lasso for Identifying Interactions in the Cox Model with the Heredity Constraint

Lu Wang a,*, Jincheng Shen a, Peter F Thall b
PMCID: PMC4111275  NIHMSID: NIHMS612537  PMID: 25071299

Abstract

In many biomedical studies, identifying effects of covariate interactions on survival is a major goal. Important examples are treatment-subgroup interactions in clinical trials, and gene-gene or gene-environment interactions in genomic studies. A common problem when implementing a variable selection algorithm in such settings is the requirement that the model must satisfy the strong heredity constraint, wherein an interaction may be included in the model only if the interaction’s component variables are included as main effects. We propose a modified Lasso method for the Cox regression model that adaptively selects important single covariates and pairwise interactions while enforcing the strong heredity constraint. The proposed method is based on a modified log partial likelihood including two adaptively weighted penalties, one for main effects and one for interactions. A two-dimensional tuning parameter for the penalties is determined by generalized cross-validation. Asymptotic properties are established, including consistency and rate of convergence, and it is shown that the proposed selection procedure has oracle properties, given proper choice of regularization parameters. Simulations illustrate that the proposed method performs reliably across a range of different scenarios.

Keywords: Modified adaptive Lasso, Oracle property, Penalized partial likelihood, Regularization, Variable selection

1. Introduction

The Cox proportional hazards regression model (Cox, 1972, 1975) is the most widely used statistical model for evaluating relationships between an event time, T, and baseline covariates, X = (X1, ⋯, Xp). The Cox model is characterized by the hazard function

$$h(t \mid X, \beta) = h_0(t)\exp\{g(X, \beta)\}, \qquad t > 0, \tag{1}$$

for a subject with covariates X, where h_0(t) is an unspecified baseline hazard and the linear component g(X, β) = Σ_{j=1}^p β_j X_j is characterized by a vector β = (β_1, ⋯, β_p)^T of unknown regression coefficients. In many applications, this simple form of g(X, β) does not adequately describe the relationship between T and X due to interactions between elements of X. For example, an anti-cancer agent tailored to attack a certain biological target typically has effects hypothesized to differ between the subgroups of patients who do and do not have the target, identified by a binary “biomarker” covariate. In a randomized trial of the targeted agent versus standard therapy, such a differential effect is characterized as a treatment-biomarker interaction which, if found to be sufficiently large, may lead to regulatory approval of the agent for patients who are biomarker positive. Identifying such treatment-biomarker interactions thus is a key step in developing personalized treatments. The presence of other covariates that may or may not be associated with T, and that also may interact with treatment, complicates identification and estimation of biomarker effects. Fitting the Cox model with interactions is challenging since, even with a moderate number of covariates, the number of interaction terms may be large, and not all covariates and interactions may have meaningful effects on h(t|X, β).

There is a large literature on variable selection methods for survival models. A family of penalized partial likelihood methods have been proposed for the Cox proportional hazard model, including the Lasso (Tibshirani et al., 1997) and the smoothly clipped absolute deviation method (Fan and Li, 2002). By shrinking some regression coefficients to zero, these methods simultaneously select important variables and estimate the regression model parameters. Zhang and Lu (2007) proposed an adaptive Lasso estimator for variable selection in the Cox model, with an adaptively weighted L1 penalty on the regression coefficients that has a convex form. They showed that this method enjoys the oracle properties, global optima exist, and it can be implemented efficiently using standard numerical algorithms (Boyd and Vandenberghe, 2004).

All the above variable selection methods for the Cox model treat the candidate variables equally. However, when interactions are included, there is a natural hierarchical ordering among the variables in the model (Chipman, 1996; Joseph, 2006; Yuan et al., 2007). This motivates the strong heredity requirement that an interaction can be included in the model only if the interaction’s component variables are included as main effects (Hamada and Wu, 1992), since models that violate this property are difficult to interpret. For linear and generalized linear regression models, Yuan et al. (2009) proposed non-negative garrote methods that naturally incorporate a general hierarchical structure among predictors. Along this line, Choi et al. (2010) extended the Lasso to identify interaction terms while obeying the strong heredity constraint, which is achieved by reparameterizing the coefficients of the interaction terms. Bien et al. (2013) investigated a Lasso for hierarchical interactions, and Radchenko and James (2010) considered a more general case with nonlinear interactions. To our knowledge, none of these papers studied the setting with time-to-event data, and none of the variable selection methods for the Cox model noted above satisfy the strong heredity constraint when interactions are included. This paper aims at filling this gap.

We propose a modified Lasso procedure for the Cox model to adaptively select covariates and interactions while automatically enforcing the strong heredity constraint. The main challenges compared to linear/generalized linear regression are that the Cox model is semiparametric and involves right-censored data. We carry out estimation and variable selection by optimizing a modified log partial likelihood that includes two adaptively weighted penalty terms, one for main effects and one for reparameterized interactions, with each penalty multiplied by a tuning parameter. The main reason that we chose the reparameterization approach is that this method can handle all p main effects simultaneously in one iteration, which has tremendous numerical advantage, especially for survival analysis when no closed form can be found. The two-dimensional tuning parameter is determined by generalized cross-validation. The proposed method is computationally convenient, has good convergence properties, and implementation is straightforward. We establish asymptotic properties, including selection consistency, rate of convergence, and the oracle property (Fan and Li, 2001; Fan and Peng, 2004) that it performs as well as if the correct underlying model were known in advance. These theoretical properties and algorithms have not been studied previously for variable selection for Cox models with interaction terms subject to the strong hierarchy constraint.

2. Adaptive Lasso with Strong Heredity Constraint Using Penalized Partial Likelihood

Let T_i denote the failure time, C_i the censoring time, and X_i = (X_{i1}, ⋯, X_{ip}) the covariate vector of the ith subject, for i = 1, ⋯, n, with observed time T̃_i = min(T_i, C_i) and event indicator δ_i = I(T_i ≤ C_i). Suppose that (X_1, T_1, C_1), ⋯, (X_n, T_n, C_n) are independent and identically distributed. We further assume non-informative censoring, T_i ⫫ C_i | X_i, i.e., T_i is conditionally independent of C_i given X_i. Although our methods can be generalized to handle higher order interactions, for ease of exposition we consider the Cox model whose linear component includes all possible two-way interactions. That is, in model (1)

$$g(X, \beta, \alpha) = \sum_{j=1}^{p} \beta_j X_j + \sum_{1 \le j < j' \le p} \alpha_{j,j'} X_j X_{j'}, \tag{2}$$

where α = (α_{1,2}, ⋯, α_{p−1,p})^T. Our goals are to provide a method that determines which terms in g(X, β, α) have important effects on the hazard, and to develop a corresponding computational algorithm and parameter estimators having desirable properties. The existing variable selection methods for the Cox model do not guarantee the strong heredity constraint, as they treat all elements of (β, α) equally and do not distinguish between elements of β and α.
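To make the bookkeeping concrete, the following sketch (ours, not part of the paper; all names are hypothetical) builds the design matrix containing the p main effects and the p(p − 1)/2 pairwise interaction columns that enter (2):

```python
import numpy as np
from itertools import combinations

def expand_pairwise(X):
    """Append all pairwise interaction columns X_j * X_j' (j < j') to X.

    X : (n, p) array of main-effect covariates.
    Returns the (n, p + p(p-1)/2) expanded design matrix and the list
    of (j, j') index pairs, in the order the columns were appended.
    """
    pairs = list(combinations(range(X.shape[1]), 2))
    inter = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    return np.hstack([X, inter]), pairs

# Example: p = 5 covariates yield 10 interaction columns, 15 candidate terms in all.
X = np.random.default_rng(0).normal(size=(200, 5))
Z, pairs = expand_pairwise(X)
assert Z.shape == (200, 15)
```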

We first re-parameterize the coefficients of the interaction terms in (2) as α_{j,j′} = γ_{j,j′} β_j β_{j′}, so that the linear term becomes

$$g(X, \theta) = \sum_{j=1}^{p} \beta_j X_j + \sum_{1 \le j < j' \le p} \gamma_{j,j'} \beta_j \beta_{j'} X_j X_{j'}, \tag{3}$$

and the parameter vector is θ = (β^T, γ^T)^T = (β_1, ⋯, β_p, γ_{1,2}, ⋯, γ_{p−1,p})^T. With this reparameterization, the coefficient of an interaction term X_j X_{j′} must be 0 if either of its two main effects X_j or X_{j′} has coefficient 0. Conversely, if γ_{j,j′} β_j β_{j′} ≠ 0 then both β_j ≠ 0 and β_{j′} ≠ 0, which guarantees the strong heredity constraint. With the reparameterization (3), the log partial likelihood is

$$l_n(\theta) = \sum_{i=1}^{n} \delta_i \left( g(X_i, \theta) - \log\left[ \sum_{r=1}^{n} I(\tilde T_r \ge \tilde T_i) \exp\{ g(X_r, \theta) \} \right] \right), \tag{4}$$

where I(A) denotes the indicator of the event A. For the variable selection problem at hand, we will minimize the adaptive penalized negative log partial likelihood

$$Q_n(\theta, \lambda_\beta, \lambda_\gamma) = -l_n(\theta) + n\lambda_\beta \sum_{j=1}^{p} w_j^\beta |\beta_j| + n\lambda_\gamma \sum_{1 \le j < j' \le p} w_{j,j'}^\gamma |\gamma_{j,j'}|, \tag{5}$$

where {w_j^β} and {w_{j,j′}^γ} are prespecified weights, and λ_β, λ_γ are tuning parameters. Following Breiman (1995) and Zou (2006), we compute the weights in (5) using

$$w_j^\beta = \frac{\log(n)}{n} \frac{1}{|\tilde\beta_j|}, \qquad w_{j,j'}^\gamma = \frac{\log(n)}{n} \frac{|\tilde\beta_j \tilde\beta_{j'}|}{|\tilde\alpha_{j,j'}|},$$

where the β̃_j’s and α̃_{j,j′}’s are the estimates from a usual, unpenalized Cox model fit with linear component (2). The multiplier log(n)/n is included to satisfy the convergence rate and the asymptotic properties introduced in Section 3, below. The objective function becomes

$$Q_n(\theta, \lambda_\beta, \lambda_\gamma) = -l_n(\theta) + \log(n)\,\lambda_\beta \sum_{j=1}^{p} \left| \frac{\beta_j}{\tilde\beta_j} \right| + \log(n)\,\lambda_\gamma \sum_{1 \le j < j' \le p} \left| \frac{\gamma_{j,j'} \tilde\beta_j \tilde\beta_{j'}}{\tilde\alpha_{j,j'}} \right|. \tag{6}$$

The two tuning parameters in (6) control the coefficient estimates at different levels. The first tuning parameter λ_β controls main effect estimates. If β_j is shrunk to zero, all terms involving X_j, including β_j X_j and the interactions γ_{j,j′} β_j β_{j′} X_j X_{j′} for any j′, are removed from the model. The second tuning parameter λ_γ controls the estimates only at the interaction effect level. Even if both β_j ≠ 0 and β_{j′} ≠ 0, it is possible that γ_{j,j′} = 0 if X_j and X_{j′} do not interact. The penalty term controlled by λ_γ thus provides the flexibility of selecting only the main effects of X_j and X_{j′} but not their interaction.
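The following toy calculation (ours; the numerical values are hypothetical) illustrates the two levels of control: zeroing a main effect wipes out every interaction that contains it, while γ_{j,j′} = 0 removes an interaction even when both of its main effects remain in the model.

```python
import numpy as np

# Hypothetical coefficients; beta[3] = 0 means the fourth covariate X4 is excluded.
beta = np.array([0.5, -0.8, -0.8, 0.0, 0.3])
gamma = {(1, 2): 1.25, (2, 3): 2.0, (0, 4): 0.0}    # gamma_{j,j'} on 0-based pairs

# Interaction coefficients implied by the reparameterization alpha = gamma * beta_j * beta_j'.
alpha = {(j, k): g * beta[j] * beta[k] for (j, k), g in gamma.items()}
print(alpha)
# (1, 2): 0.8  -> the X2 X3 interaction is retained
# (2, 3): 0.0  -> forced to zero because beta[3] = 0 (strong heredity)
# (0, 4): 0.0  -> removed by gamma = 0, although both main effects are nonzero
```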

The weights act on the objective function, Q_n(θ, λ_β, λ_γ), as follows. If the initial estimate β̃_j is close to 0, then w_j^β will be large and hence, as can be seen from (5), the coefficient β_j of X_j will be heavily penalized. Similarly, if α̃_{j,j′} is small relative to β̃_j β̃_{j′}, then w_{j,j′}^γ will be large and the coefficient γ_{j,j′} of the interaction term X_j X_{j′} will be heavily penalized.
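As a concrete reference for the pieces above, here is a minimal numpy sketch (ours; ties in the observed times are handled naively, and all names are our own) of the log partial likelihood (4) and the penalized objective (5):

```python
import numpy as np

def neg_log_partial_lik(theta, Z, time, event):
    """Negative log partial likelihood (4) for linear predictor Z @ theta.

    Z holds the main-effect and interaction columns; event is the
    indicator delta_i = I(T_i <= C_i). Ties in `time` are broken arbitrarily.
    """
    eta = Z @ theta
    order = np.argsort(time)
    eta_s, event_s = eta[order], event[order]
    # log of sum_{r: T~_r >= T~_i} exp(eta_r): reversed cumulative log-sum-exp
    log_risk = np.logaddexp.accumulate(eta_s[::-1])[::-1]
    return -np.sum(event_s * (eta_s - log_risk))

def penalized_objective(beta, gamma, Z, time, event, lam_b, lam_g, w_b, w_g, pairs):
    """Objective (5): -l_n(theta) plus the two adaptively weighted L1 penalties.

    gamma is aligned with `pairs`; interaction coefficients are
    alpha_{j,j'} = gamma_{j,j'} * beta_j * beta_j' as in (3).
    """
    n = len(time)
    alpha = np.array([g * beta[j] * beta[k] for g, (j, k) in zip(gamma, pairs)])
    theta = np.concatenate([beta, alpha])
    pen = n * lam_b * np.sum(w_b * np.abs(beta)) + n * lam_g * np.sum(w_g * np.abs(gamma))
    return neg_log_partial_lik(theta, Z, time, event) + pen
```

With the weights w_j^β and w_{j,j′}^γ computed from an unpenalized Cox fit as above, this is exactly the criterion (6).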

3. Theoretical Properties of the Estimator

In this section, we study the asymptotic properties of our proposed variable selection procedure and the corresponding estimator. As n → ∞, our estimator possesses the oracle property under certain regularity conditions, that is, it performs as well as if the true model were known in advance (Fan and Li, 2001). The regularity conditions that we need throughout the development are given in Appendix 1 in the web supplementary materials.

Let θ_0 = (β_0^T, γ_0^T)^T denote the true parameter vector, where

$$\gamma_{0j,j'} = \begin{cases} \dfrac{\alpha_{0j,j'}}{\beta_{0j}\beta_{0j'}}, & \text{if } \beta_{0j} \neq 0 \text{ and } \beta_{0j'} \neq 0, \\[4pt] 0, & \text{otherwise.} \end{cases} \tag{7}$$

This guarantees that the true model obeys the strong heredity constraint, that is, α_{0j,j′} = γ_{0j,j′} = 0 if either β_{0j} = 0 or β_{0j′} = 0.

Define the covariate-specific tuning parameters λ_{j,n}^β = λ_β w_j^β = n^{−1} log(n) λ_β / |β̃_j| for j = 1, ⋯, p, and the interaction-specific tuning parameters λ_{j,j′,n}^γ = λ_γ w_{j,j′}^γ = n^{−1} log(n) λ_γ / |γ̃_{j,j′}| for 1 ≤ j < j′ ≤ p, where γ̃_{j,j′} = α̃_{j,j′}/(β̃_j β̃_{j′}). Our proposed estimator is

$$\hat\theta = \arg\min_\theta \left\{ -l_n(\theta) + n \sum_{j=1}^{p} \lambda_{j,n}^\beta |\beta_j| + n \sum_{1 \le j < j' \le p} \lambda_{j,j',n}^\gamma |\gamma_{j,j'}| \right\}. \tag{8}$$

Note that the criterion in (8) is the same as Qn(θ, λβ, λγ) in (6) with different notation.

Without loss of generality, we denote β_0 = (β_{a0}^T, β_{b0}^T)^T, where β_{a0} consists of all nonzero components and β_{b0} consists of the remaining zero components of β_0. Similarly, for the true interaction coefficients, write γ_0 = (γ_{a0}^T, γ_{b0}^T, γ_{c0}^T)^T, where γ_{a0} contains all nonzero components of γ_0, γ_{b0} contains the zero components of γ_0 whose corresponding main effects are both nonzero, and γ_{c0} contains the remaining zero components of γ_0 for which at least one corresponding main effect is zero. Correspondingly, we denote the minimizer of (8) as θ̂_n = (β̂_{an}^T, β̂_{bn}^T, γ̂_{an}^T, γ̂_{bn}^T, γ̂_{cn}^T)^T, the covariate-specific tuning parameters as {(λ_{an}^β)^T, (λ_{bn}^β)^T}, and the interaction-specific tuning parameters as {(λ_{an}^γ)^T, (λ_{bn}^γ)^T, (λ_{cn}^γ)^T}.

Let

$$\xi_n = \max\left\{ (\lambda_{an}^\beta)^T, (\lambda_{an}^\gamma)^T \right\}, \qquad \zeta_n = \min\left\{ (\lambda_{bn}^\beta)^T, (\lambda_{bn}^\gamma)^T \right\}.$$

Then ξ_n is the maximum of the covariate-specific and interaction-specific tuning parameters that correspond to nonzero coefficients in the model. For ζ_n, by contrast, we consider only the tuning parameters corresponding to zero main effects and to the zero interaction coefficients in γ_{b0}. That is, we consider all zero main effects but, among the interactions, only those zero terms whose corresponding main effects are both nonzero. We refer to such terms in the definition of ζ_n as non-trivial zero terms.

For the proposed variable selection procedure to perform properly, as n → ∞ the covariate-specific and interaction-specific penalties for the terms whose true coefficients are nonzero should converge to 0, and the penalties for the non-trivial zero terms should be large enough that the estimates shrink to 0. In fact, if the tuning parameters in (8) that correspond to β_{a0} and γ_{a0} converge to 0, and those corresponding to β_{b0} and γ_{b0} are sufficiently large, then our proposed estimation procedure has the so-called oracle property (Fan and Li, 2001). This is guaranteed when √n-consistent estimates of θ_0 are used in the definitions of λ_{j,n}^β and λ_{j,j′,n}^γ, in which case one can easily show that √n ξ_n → 0 and √n ζ_n → ∞.
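To see why √n-consistent initial estimates deliver these two rates simultaneously, a short calculation (ours, directly from the definitions above; the interaction-specific λ_{j,j′,n}^γ behaves analogously with γ̃_{j,j′} in place of β̃_j):

```latex
\sqrt{n}\,\lambda_{j,n}^{\beta}
  = \frac{\log(n)\,\lambda_\beta}{\sqrt{n}\,|\tilde\beta_j|}
  \;\longrightarrow\;
  \begin{cases}
    0, & \beta_{0j} \neq 0,\ \text{since } |\tilde\beta_j| \xrightarrow{\,p\,} |\beta_{0j}| > 0,\\[4pt]
    \infty, & \beta_{0j} = 0,\ \text{since } \sqrt{n}\,|\tilde\beta_j| = O_p(1).
  \end{cases}
```

Taking the maximum over the nonzero-coefficient terms gives √n ξ_n → 0, and taking the minimum over the non-trivial zero terms gives √n ζ_n → ∞.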

Denote the initial estimates as β̃ = (β̃_1, ⋯, β̃_p) and γ̃ = (γ̃_{1,2}, ⋯, γ̃_{p−1,p}). We summarize the above results in Theorem 1; a sketch of the proof of Theorem 1 is given in Appendix 2.

Theorem 1. When √n-consistent initial estimates, √n(β̃ − β_0) = O_p(1) and √n(γ̃ − γ_0) = O_p(1), are used in (8) to calculate λ_{j,n}^β and λ_{j,j′,n}^γ, we have √n ξ_n → 0 and √n ζ_n → ∞ as n → ∞. Under the regularity conditions (1)-(3) in Appendix 1, there exists a local minimizer θ̂_n of Q_n(θ) such that

  1. (Sparsity) P(β̂_bn = 0) → 1, P(γ̂_bn = 0) → 1, and P(γ̂_cn = 0) → 1 as n → ∞.

  2. (Asymptotic normality)

$$\sqrt{n}\left\{ \begin{pmatrix} \hat\beta_{an} \\ \hat\gamma_{an} \end{pmatrix} - \begin{pmatrix} \beta_{a0} \\ \gamma_{a0} \end{pmatrix} \right\} \xrightarrow{d} N\left\{ 0,\; I_a^{-1}(\beta_{a0}, \gamma_{a0}) \right\},$$

where I_a(β_{a0}, γ_{a0}) is the Fisher information matrix evaluated at β_{a0} and γ_{a0}, assuming that β_{b0} = 0, γ_{b0} = 0, and γ_{c0} = 0 is known in advance.

Part (i) of Theorem 1 presents the sparsity property and shows that our proposed method consistently removes the zero-effect terms with probability tending to 1. This implies that, with a sufficiently large sample, in practice our method can select the underlying true model with high probability. In part (ii) of Theorem 1, we establish that the estimates of the nonzero elements of θ_0 are √n-consistent and asymptotically normal. The asymptotic distribution is the same as what would be obtained if it were known in advance which elements of θ_0 are 0 and which are not, the so-called oracle property.

Theorem 2, below, establishes the asymptotic behavior of the proposed method when p is very large, in particular when p → ∞ as n → ∞. This is an important case in many biomedical studies, given the advance of new high-throughput technologies. A proof of Theorem 2 is given in Appendix 3 in the web supplementary materials.

When the number of predictors may increase with the sample size n, we denote p as pn, which allows the possibility that pn → ∞ as n → ∞. Including all main effects and pairwise interactions, the total number of parameters is qn = (pn + 1)pn/2. Similarly, when appropriate we add a subscript n to other notation, and we let dn denote the number of non-zero coefficients in the underlying true model. Then we have

Theorem 2. Under the regularity conditions (4)-(6) in Appendix 1, if p_n = o(n^{1/5}), √(n/q_n) ξ_n → 0, and √(n/q_n) ζ_n → ∞ as n → ∞, then there exists a local minimizer θ̂_n of Q_n(θ) such that

  1. (Sparsity) P(β̂_bn = 0) → 1, P(γ̂_bn = 0) → 1, and P(γ̂_cn = 0) → 1 as n → ∞.

  2. (Asymptotic Normality)

$$\sqrt{n}\,\Omega_n I_{an}^{1/2}(\beta_{a0}, \gamma_{a0}) \left\{ \begin{pmatrix} \hat\beta_{an} \\ \hat\gamma_{an} \end{pmatrix} - \begin{pmatrix} \beta_{a0} \\ \gamma_{a0} \end{pmatrix} \right\} \xrightarrow{d} N\{0, \Sigma\},$$

where Ω_n is an arbitrary d × d_n matrix with finite d such that Ω_n Ω_n^T → Σ, Σ is a d × d positive semidefinite symmetric matrix, and I_{an}(β_{a0}, γ_{a0}) is the d_n × d_n Fisher information matrix evaluated at (β_{a0}, γ_{a0}), assuming that β_{b0} = 0, γ_{b0} = 0, and γ_{c0} = 0 is known in advance.

The reason we consider an arbitrary linear combination Ω_n(β̂_{an}^T, γ̂_{an}^T)^T in Theorem 2, instead of (β̂_{an}^T, γ̂_{an}^T)^T as in Theorem 1, is that the latter has dimension d_n → ∞ as n → ∞ in the current setup, while the former has finite dimension d.

4. Computational Algorithm

Expanding l_n(θ) in the objective function (6), Q_n(θ, λ_β, λ_γ) becomes

$$\begin{aligned} & -\sum_{i=1}^{n} \delta_i \Bigg[ \Bigg( \sum_{j=1}^{p} \beta_j X_{ij} + \sum_{1 \le j < j' \le p} \gamma_{j,j'}\beta_j\beta_{j'} X_{ij}X_{ij'} \Bigg) - \log\Bigg\{ \sum_{r=1}^{n} I(\tilde T_r \ge \tilde T_i) \exp\Bigg( \sum_{j=1}^{p} \beta_j X_{rj} + \sum_{1 \le j < j' \le p} \gamma_{j,j'}\beta_j\beta_{j'} X_{rj}X_{rj'} \Bigg) \Bigg\} \Bigg] \\ & \qquad + \log(n)\,\lambda_\beta \sum_{j=1}^{p} \left| \frac{\beta_j}{\tilde\beta_j} \right| + \log(n)\,\lambda_\gamma \sum_{1 \le j < j' \le p} \left| \frac{\gamma_{j,j'}\tilde\beta_j\tilde\beta_{j'}}{\tilde\alpha_{j,j'}} \right|. \end{aligned}$$

Our estimator minimizes Q_n(θ, λ_β, λ_γ), that is, θ̂_n = argmin_θ Q_n(θ, λ_β, λ_γ). To carry out the optimization, we apply a modified version of the iteratively reweighted least squares algorithm (Green, 1984) with weighted L1 penalties. Denote the gradient vector of the log partial likelihood by l̇(θ) = ∂l_n(θ)/∂θ and the Hessian matrix by l̈(θ) = ∂²l_n(θ)/∂θ∂θ^T. Using the Cholesky decomposition −l̈(θ) = MM^T, where M is an invertible lower triangular matrix, we define the pseudo-response vector Y = M^{−1}{l̇(θ) − l̈(θ)θ}. By the usual second-order Taylor expansion, −l_n(θ) in (6) can be approximated, up to a constant, by the quadratic form (Y − M^Tθ)^T(Y − M^Tθ)/2, and at each penalized iteratively reweighted least squares iteration we minimize

$$\frac{1}{2}(Y - M^T\theta)^T(Y - M^T\theta) + \log(n)\,\lambda_\beta \sum_{j=1}^{p} \left| \frac{\beta_j}{\tilde\beta_j} \right| + \log(n)\,\lambda_\gamma \sum_{1 \le j < j' \le p} \left| \frac{\gamma_{j,j'}\tilde\beta_j\tilde\beta_{j'}}{\tilde\alpha_{j,j'}} \right|. \tag{9}$$
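A minimal sketch (ours) of the pseudo-response construction, assuming the gradient and Hessian of the log partial likelihood at the current iterate are available as arrays:

```python
import numpy as np

def pseudo_response(theta, grad, hess):
    """Pseudo-response Y and Cholesky factor M for the local quadratic
    approximation -l_n(theta') ~ (1/2)||Y - M^T theta'||^2 + const.

    grad = l-dot(theta), hess = l-ddot(theta): gradient and Hessian of the
    LOG partial likelihood, so -hess is positive definite near the optimum.
    """
    M = np.linalg.cholesky(-hess)                # -l-ddot(theta) = M M^T
    Y = np.linalg.solve(M, grad - hess @ theta)  # Y = M^{-1}{l-dot - l-ddot theta}
    return Y, M
```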

Because the main effect coefficients β and the interaction coefficients γ are controlled at different levels, in each step we also iterate between these two sets, first fixing β to estimate γ, then fixing γ to estimate β, and iterating until convergence. This algorithm is guaranteed to converge, since the objective function decreases at each step. When β is fixed, the optimization in γ becomes a Lasso problem, hence one can use either the LARS/Lasso algorithm (Efron et al., 2004) or quadratic programming to solve for γ efficiently. When γ is fixed, we solve for β_1, ⋯, β_p sequentially. For each j = 1, ⋯, p, we fix γ and β_{[−j]} = (β_1, ⋯, β_{j−1}, β_{j+1}, ⋯, β_p), and the optimization becomes a simple Lasso problem with only one parameter, β_j. This is similar to the shooting algorithm (Fu, 1998; Zhang and Lu, 2007; Friedman et al., 2007). The tuning parameters are selected by minimizing the generalized cross-validation statistic,

$$C_{\text{pseudo-GCV}} = \frac{-l(\hat\theta)}{n\left(1 - df/n\right)^2},$$

over a reasonable range of λβ and λγ, where df is the number of nonzero parameters in the fitted model.
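For instance, the two-dimensional search could be organized as below (our sketch; fit_penalized is a hypothetical routine returning the fitted θ̂, its number of nonzero parameters df, and the attained negative log partial likelihood):

```python
import numpy as np
from itertools import product

def select_tuning(fit_penalized, Z, time, event, lam_b_grid, lam_g_grid):
    """Grid search for (lambda_beta, lambda_gamma) minimizing C_pseudo-GCV."""
    n = len(time)
    best_pair, best_gcv = None, np.inf
    for lam_b, lam_g in product(lam_b_grid, lam_g_grid):
        theta_hat, df, neg_loglik = fit_penalized(Z, time, event, lam_b, lam_g)
        gcv = neg_loglik / (n * (1.0 - df / n) ** 2)
        if gcv < best_gcv:
            best_pair, best_gcv = (lam_b, lam_g), gcv
    return best_pair
```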

For a fixed λβ and λγ, the optimization algorithm proceeds as follows:

Step 1. Center and normalize each term Xj, XjXj′, j < j′, and j, j′ = 1, …, p.

Step 2. Start with plausible initial values β̂_j^(0) and γ̂_{j,j′}^(0), j < j′, and j, j′ = 1, …, p, such as the conventional Cox regression parameter estimates. Set m = 1.

Step 3. Compute Y and M based on the current value θ̂^(m−1), and define

$$\tilde l(\theta) = \frac{1}{2}(Y - M^T\theta)^T(Y - M^T\theta).$$

Step 4. To update γ̂, let

$$\hat\gamma^{(m)} = \arg\min_{\gamma} \left\{ \tilde l(\hat\beta^{(m-1)}, \gamma) + n\lambda_\gamma \sum_{j<j'} w_{j,j'}^\gamma |\gamma_{j,j'}| \right\}.$$

Step 5. To update β̂, for each j = 1, …, p in sequence, let

$$\hat\beta_j^{(m)} = \arg\min_{\beta_j} \left\{ \tilde l(\hat\beta_{[-j]}^{(m-1)}, \beta_j, \hat\gamma^{(m)}) + n\lambda_\beta w_j^\beta |\beta_j| \right\}.$$

Step 6. If the relative difference between Q_n(θ̂^(m−1)) and Q_n(θ̂^(m)),

$$\Delta^{(m)} = \frac{Q_n(\hat\theta^{(m-1)}) - Q_n(\hat\theta^{(m)})}{Q_n(\hat\theta^{(m-1)})},$$

is small enough, then stop. Otherwise, increment m to m+1 and return to Step 3.

Since γ_{j,j′} = α_{j,j′}/(β_j β_{j′}), in Step 4 the minimization is actually over {α_{j,j′}} with β = β̂^(m−1). This algorithm gives exact zeros for some coefficients and guarantees that the coefficient of an interaction is set to 0 whenever either of its corresponding main effects is shrunk to 0. Based on our empirical experience, the algorithm converges quickly.
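Each one-parameter Lasso subproblem in Steps 4-5 has a closed-form soft-thresholding solution under the quadratic approximation. A generic sketch (ours; here A plays the role of M^T with the coordinates not being updated held fixed):

```python
import numpy as np

def soft_threshold(z, t):
    """Return sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_update(j, b, A, Y, lam):
    """Minimize (1/2)||Y - A b||^2 + lam |b_j| over b_j, other coordinates fixed."""
    r = Y - A @ b + A[:, j] * b[j]              # partial residual excluding column j
    return soft_threshold(A[:, j] @ r, lam) / (A[:, j] @ A[:, j])
```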

5. Numerical Results

In this section, we report results of a simulation study comparing our proposed method with the Lasso and adaptive Lasso, two popular variable selection methods for the Cox model, neither of which guarantees the strong heredity constraint. We follow the simulation setup in Zhang and Lu (2007), but also include two-way interaction terms as candidates for the variable selection. We consider sample size n = 200, and assume that there are p = 5 covariates of interest in each simulated dataset, denoted by X_1, X_2, ⋯, X_5. Thus, the number of all possible two-way interaction terms is p(p − 1)/2 = 10, and there are 15 candidate terms in total. We assume each covariate follows a standard normal distribution, and consider two scenarios: (i) the covariates are independent, and (ii) the covariates have pairwise correlations Corr(X_j, X_{j′}) = ρ^{|j−j′|}, with ρ = 0.5. We generate the censoring times from a Uniform distribution with support (0, τ), where τ is chosen to obtain a specified censoring rate of either 25% or 40%.

Failure times are generated from a Cox model with constant baseline hazard λ_0 = 0.1, where the coefficients of X_2, X_3, and the interaction term X_2X_3 are nonzero and the other 12 coefficients are zero. We consider the following two models (a data-generation sketch follows the model definitions below):

Model 1: β_{02} = −0.8, β_{03} = −0.8, and α_{02,3} = −0.8, corresponding to large effects.

Model 2: β_{02} = −0.3, β_{03} = −0.3, and α_{02,3} = −0.3, corresponding to small effects.
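For reference, one replicate of this design can be generated by inverse-transform sampling, since the baseline hazard is constant; a sketch (ours; the value of τ shown is hypothetical, and in the study τ is tuned to the target censoring rate):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, rho, lam0 = 200, 5, 0.5, 0.1

# Scenario (ii): Corr(X_j, X_j') = rho^{|j - j'|}; use rho = 0 for scenario (i).
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Model 1 linear predictor (1-based): beta_2 = beta_3 = alpha_{2,3} = -0.8
eta = -0.8 * X[:, 1] - 0.8 * X[:, 2] - 0.8 * X[:, 1] * X[:, 2]

# Constant baseline hazard lam0 gives T = -log(U) / (lam0 * exp(eta))
T = -np.log(rng.uniform(size=n)) / (lam0 * np.exp(eta))
C = rng.uniform(0, 60.0, size=n)                 # tau = 60 is a hypothetical choice
time, event = np.minimum(T, C), (T <= C).astype(int)
```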

We simulate each case 100 times and apply all three methods to each simulated dataset. For all methods, the tuning parameters are selected using the generalized cross-validation criterion C_pseudo-GCV.

Table 1 summarizes the percentage of fully correct variable selection among 100 replications for each method under each scenario. For all cases, the proposed method correctly selects the true model more frequently than the regular Lasso and adaptive Lasso. Specifically, for all scenarios under Model 2, where important variables and interactions have small effects, the Lasso and adaptive Lasso perform about the same, with the latter slightly better, while our proposed method performs much better than these two methods. In Model 1, where important variables and interactions have large effects, the advantage of our proposed method is still substantial when the censoring percentage is high, 40%. When censoring is moderate, 25%, the adaptive Lasso becomes competitive, but still is worse than the proposed method. The performance of the Lasso is always the worst among the three methods.

Table 1.

Percentages of correctly selected models among 100 replications.

Model     Censoring Percentage   ρ for Correlation   Lasso   Adaptive Lasso   Proposed Method
Model 1   25%                    0                   11%     70%              86%
          25%                    0.5                 6%      66%              79%
          40%                    0                   3%      58%              80%
          40%                    0.5                 3%      54%              82%

Model 2   25%                    0                   43%     49%              74%
          25%                    0.5                 26%     31%              63%
          40%                    0                   28%     32%              74%
          40%                    0.5                 22%     33%              62%

Table 2 summarizes the frequency with which each individual main effect and interaction term is selected into the model. Due to space limitations, we present results only for 25% censoring under both models when the covariates are independent (ρ = 0); similar results are observed for the 40% censoring case and the correlated covariates case. As shown in Table 2, the proposed variable selection procedure always chooses the important variables and interaction terms much more often than the other two methods. Under Model 1 with large effects, the adaptive Lasso and the proposed method generally select the unimportant terms less often than the Lasso, while for Model 2 with small effects, the proposed method actually selects some of the unimportant terms more often. However, as illustrated in Table 1, the proposed method simultaneously selects all the important terms and removes the unimportant ones more frequently.

Table 2.

Term-specific selection percentages for each main effect and interaction (25% censoring and ρ = 0 for independent covariates).

Method             X1    X2     X3     X4    X5    X1X2   X1X3   X1X4
Model 1
  Lasso            22%   100%   100%   18%   25%   15%    12%    15%
  Adaptive Lasso   6%    100%   100%   3%    7%    2%     4%     1%
  Proposed method  5%    100%   100%   4%    6%    5%     5%     0%
Model 2
  Lasso            6%    91%    91%    6%    5%    2%     6%     3%
  Adaptive Lasso   2%    87%    85%    4%    4%    1%     2%     1%
  Proposed method  10%   98%    99%    11%   8%    10%    10%    1%

Method             X1X5   X2X3   X2X4   X2X5   X3X4   X3X5   X4X5
Model 1
  Lasso            15%    100%   13%    13%    11%    13%    15%
  Adaptive Lasso   2%     100%   3%     2%     2%     2%     0%
  Proposed method  1%     100%   4%     6%     4%     6%     0%
Model 2
  Lasso            3%     89%    4%     3%     9%     5%     2%
  Adaptive Lasso   2%     82%    1%     2%     1%     0%     0%
  Proposed method  1%     97%    10%    8%     11%    8%     2%

A simple, ad hoc alternative often employed in practice for variable selection involving interaction terms is to run either the Lasso or the adaptive Lasso and then, depending on which terms were selected, manually add back any main effects that were not selected but are components of a selected interaction term. This ensures the strong heredity constraint. As seen in our simulation, however, this practice does not help much in terms of how often the true model is selected correctly, while the false positive rate is greatly increased. For example, as one can see in Table 2 for Model 2, since the proposed method selected the true interaction term X2X3 more often than the other two methods, this ad hoc approach following the Lasso or the adaptive Lasso still cannot beat the proposed method. For Model 1 with large effects in Table 2, the frequency of X2, X3, and X2X3 being correctly selected already reaches 100%, so the ad hoc approach does not help at all.

To measure estimation accuracy, following Tibshirani et al. (1997) and Zhang and Lu (2007) we average the mean squared error MSE = (θ̂ − θ)^T Γ (θ̂ − θ) over 100 replications, where Γ is the population variance-covariance matrix of the covariates. Standard errors are given in parentheses. For all scenarios, the proposed method has the smallest mean squared error (Table 3), and thus outperforms the other two competitors in terms of prediction accuracy.
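The criterion itself is a simple quadratic form; a minimal sketch (ours):

```python
import numpy as np

def model_error(theta_hat, theta_true, Gamma):
    """(theta_hat - theta)^T Gamma (theta_hat - theta), with Gamma the
    population covariance matrix of the 15 candidate terms."""
    d = np.asarray(theta_hat) - np.asarray(theta_true)
    return float(d @ Gamma @ d)
```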

Table 3.

Mean squared error, with standard errors in parentheses.

Model     Censoring Percentage   ρ for Correlation   Lasso           Adaptive Lasso   Proposed Method
Model 1   25%                    0                   0.213 (0.124)   0.072 (0.068)    0.063 (0.065)
          25%                    0.5                 0.219 (0.136)   0.071 (0.068)    0.069 (0.065)
          40%                    0                   0.337 (0.209)   0.168 (0.144)    0.117 (0.219)
          40%                    0.5                 0.351 (0.215)   0.162 (0.172)    0.118 (0.216)

Model 2   25%                    0                   0.118 (0.053)   0.105 (0.055)    0.042 (0.041)
          25%                    0.5                 0.131 (0.059)   0.130 (0.059)    0.054 (0.048)
          40%                    0                   0.117 (0.053)   0.122 (0.058)    0.061 (0.062)
          40%                    0.5                 0.137 (0.066)   0.142 (0.068)    0.082 (0.078)

6. Discussion

We have extended the adaptive Lasso method to accommodate the Cox proportional hazards model with interaction terms while ensuring that the strong heredity constraint is satisfied in the selected model. Hamada and Wu (1992) considered other constraints, such as weak heredity, under which only one of the two main effects is required to be in the model when an interaction term is included. Although this situation is not our main interest, our method can easily be modified to handle it by employing a different reparameterization.

As discussed in Zhang and Lu (2007), the adaptive choice of weights may become problematic when some elements of θ are not estimable. This may occur, for example, when strong collinearity exists among covariates, or when the number of covariates p is much larger than the sample size n in high-dimensional data. In such settings, one cannot obtain the initial estimates needed to determine the adaptive weights, and a more robust estimator of θ, such as ridge regression, may be used instead.
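As one concrete possibility (our illustration, assuming the CoxPHFitter interface of the Python lifelines package; the data frame here is synthetic), a ridge-penalized Cox fit can supply stabilized initial estimates for the weights:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Synthetic stand-in data frame with survival time, event indicator, and covariates.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{j}" for j in range(1, 6)])
df["time"] = rng.exponential(size=200)
df["event"] = rng.integers(0, 2, size=200)

cph = CoxPHFitter(penalizer=0.1, l1_ratio=0.0)   # pure L2 (ridge) penalty
cph.fit(df, duration_col="time", event_col="event")
beta_tilde = cph.params_.to_numpy()              # stabilized initial estimates
```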

Our work was motivated by the desire to identify key treatment-biomarker interactions for developing personalized treatments, where the number of candidate biomarkers is usually fixed or grows slowly with n. Even with a moderate number of candidate biomarkers, our methodology can have an important impact on physicians' actual behavior if clinically meaningful treatment-biomarker effects are identified. The condition p = o(n^{1/5}) that we used may be relaxed, though. There have been recent theoretical developments on high-dimensional survival models, such as those in Bradic et al. (2011) and Lin and Lv (2013), but their underlying theory cannot be applied directly to our setting, which uses a reparameterization approach for selecting important interactions under the heredity constraint. Relaxing our condition p = o(n^{1/5}) would require modifying the above theory to our setting, which would entail a substantial amount of work. While this is beyond the scope of the current paper, it would certainly be an interesting future research project.

Supplementary Material

The web supplementary materials contain Appendix 1 (regularity conditions), Appendix 2 (sketch of the proof of Theorem 1), and Appendix 3 (proof of Theorem 2).

Acknowledgements

The authors thank the Co-Editor-in-Chief, Associate Editor, and referee for their valuable comments.


References

  1. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. The Annals of Statistics. 1982;10:1100–1120.
  2. Bien J, Taylor J, Tibshirani R. A lasso for hierarchical interactions. The Annals of Statistics. 2013;41:1111–1141.
  3. Boyd S, Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press; 2004.
  4. Bradic J, Fan J, Jiang J. Regularization for Cox's proportional hazards model with NP-dimensionality. The Annals of Statistics. 2011;39:3092–3120.
  5. Breiman L. Better subset regression using the non-negative garrote. Technometrics. 1995;37:373–384.
  6. Chipman H. Bayesian variable selection with related predictors. Canadian Journal of Statistics. 1996;24:17–36.
  7. Choi NH, Li W, Zhu J. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association. 2010;105:354–364.
  8. Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  9. Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
  10. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression (with discussion). The Annals of Statistics. 2004;32:407–499.
  11. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  12. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. The Annals of Statistics. 2002;30:74–99.
  13. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics. 2004;32:928–961.
  14. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1:302–332.
  15. Fu WJ. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
  16. Green PJ. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B. 1984;46:149–192.
  17. Hamada M, Wu CJ. Analysis of designed experiments with complex aliasing. Journal of Quality Technology. 1992;24:130–137.
  18. Joseph VR. A Bayesian approach to the design and analysis of fractionated experiments. Technometrics. 2006;48:219–229.
  19. Lin W, Lv J. High-dimensional sparse additive hazards regression. Journal of the American Statistical Association. 2013;108:247–264.
  20. McCullagh P, Nelder JA. Generalized Linear Models. Chapman & Hall/CRC; 1989.
  21. Nelder JA. The statistics of linear models: back to basics. Statistics and Computing. 1994;4:221–234.
  22. Radchenko P, James GM. Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association. 2010;105:1541–1553.
  23. Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395.
  24. Yuan M, Joseph VR, Lin Y. An efficient variable selection approach for analyzing designed experiments. Technometrics. 2007;49:430–439.
  25. Yuan M, Joseph VR, Zou H. Structured variable selection and estimation. The Annals of Applied Statistics. 2009;3:1738–1757.
  26. Zhang HH, Lu W. Adaptive Lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
  27. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
