Published in final edited form as: Stat Probab Lett. 2014 Jul 5;93:126–133. doi: 10.1016/j.spl.2014.06.024

A Modified Adaptive Lasso for Identifying Interactions in the Cox Model with the Heredity Constraint

Lu Wang a,*, Jincheng Shen a, Peter F Thall b
PMCID: PMC4111275  NIHMSID: NIHMS612537  PMID: 25071299

Abstract

In many biomedical studies, identifying effects of covariate interactions on survival is a major goal. Important examples are treatment-subgroup interactions in clinical trials, and gene-gene or gene-environment interactions in genomic studies. A common problem when implementing a variable selection algorithm in such settings is the requirement that the model must satisfy the strong heredity constraint, wherein an interaction may be included in the model only if the interaction’s component variables are included as main effects. We propose a modified Lasso method for the Cox regression model that adaptively selects important single covariates and pairwise interactions while enforcing the strong heredity constraint. The proposed method is based on a modified log partial likelihood including two adaptively weighted penalties, one for main effects and one for interactions. A two-dimensional tuning parameter for the penalties is determined by generalized cross-validation. Asymptotic properties are established, including consistency and rate of convergence, and it is shown that the proposed selection procedure has oracle properties, given proper choice of regularization parameters. Simulations illustrate that the proposed method performs reliably across a range of different scenarios.

Keywords: Modified adaptive Lasso, Oracle property, Penalized partial likelihood, Regularization, Variable selection

1. Introduction

The Cox proportional hazards regression model (Cox, 1972, 1975) is the most widely used statistical model for evaluating relationships between an event time, T, and baseline covariates, X = (X1, ⋯, Xp). The Cox model is characterized by the hazard function

$$h(t \mid X, \beta) = h_0(t)\exp\{g(X, \beta)\}, \qquad t > 0, \tag{1}$$

for a subject with covariates X, where h_0(t) is an unspecified baseline hazard and the linear component g(X, β) = Σ_{j=1}^p β_j X_j is characterized by a vector β = (β_1, ⋯, β_p)^T of unknown regression coefficients. In many applications, this simple form of g(X, β) does not adequately describe the relationship between T and X due to interactions between elements of X. For example, an anti-cancer agent tailored to attack a certain biological target typically has effects hypothesized to differ between the subgroups of patients who do and do not have the target, identified by a binary “biomarker” covariate. In a randomized trial of the targeted agent versus standard therapy, such a differential effect is characterized as a treatment-biomarker interaction which, if found to be sufficiently large, may lead to regulatory approval of the agent for patients who are biomarker positive. Identifying such treatment-biomarker interactions thus is a key step in developing personalized treatments. The presence of other covariates that may or may not be associated with T, and that also may interact with treatment, complicates identification and estimation of biomarker effects. Fitting the Cox model with interactions is challenging since, even with a moderate number of covariates, the number of interaction terms may be large, and not all covariates and interactions may have meaningful effects on h(t|X, β).

There is a large literature on variable selection methods for survival models. A family of penalized partial likelihood methods have been proposed for the Cox proportional hazard model, including the Lasso (Tibshirani et al., 1997) and the smoothly clipped absolute deviation method (Fan and Li, 2002). By shrinking some regression coefficients to zero, these methods simultaneously select important variables and estimate the regression model parameters. Zhang and Lu (2007) proposed an adaptive Lasso estimator for variable selection in the Cox model, with an adaptively weighted L1 penalty on the regression coefficients that has a convex form. They showed that this method enjoys the oracle properties, global optima exist, and it can be implemented efficiently using standard numerical algorithms (Boyd and Vandenberghe, 2004).

All the above variable selection methods for the Cox model treat the candidate variables equally. However, when interactions are included, there is a natural hierarchical ordering among the variables in the model (Chipman, 1996; Joseph, 2006; Yuan et al., 2007). This motivates the strong heredity requirement that an interaction can be included in the model only if the interaction’s component variables are included as main effects (Hamada and Wu, 1992), since models that violate this property are difficult to interpret. For linear and generalized linear regression models, Yuan et al. (2009) proposed non-negative garrote methods that naturally incorporate a general hierarchical structure among predictors. Along this line, Choi et al. (2010) extended the Lasso to identify interaction terms while obeying the strong heredity constraint, which is achieved by reparameterizing the coefficients of the interaction terms. Bien et al. (2013) investigated a Lasso for hierarchical interactions, and Radchenko and James (2010) considered a more general case with nonlinear interactions. To our knowledge, none of these papers studied the setting with time-to-event data, and none of the variable selection methods for the Cox model noted above satisfy the strong heredity constraint when interactions are included. This paper aims at filling this gap.

We propose a modified Lasso procedure for the Cox model to adaptively select covariates and interactions while automatically enforcing the strong heredity constraint. The main challenges compared to linear/generalized linear regression are that the Cox model is semiparametric and involves right-censored data. We carry out estimation and variable selection by optimizing a modified log partial likelihood that includes two adaptively weighted penalty terms, one for main effects and one for reparameterized interactions, with each penalty multiplied by a tuning parameter. The main reason that we chose the reparameterization approach is that this method can handle all p main effects simultaneously in one iteration, which has tremendous numerical advantage, especially for survival analysis when no closed form can be found. The two-dimensional tuning parameter is determined by generalized cross-validation. The proposed method is computationally convenient, has good convergence properties, and implementation is straightforward. We establish asymptotic properties, including selection consistency, rate of convergence, and the oracle property (Fan and Li, 2001; Fan and Peng, 2004) that it performs as well as if the correct underlying model were known in advance. These theoretical properties and algorithms have not been studied previously for variable selection for Cox models with interaction terms subject to the strong hierarchy constraint.

2. Adaptive Lasso with Strong Heredity Constraint Using Penalized Partial Likelihood

Let T_i denote the failure time, C_i the censoring time, and X_i = (X_{i1}, ⋯, X_{ip}) the covariate vector of the ith subject, for i = 1, ⋯, n, with observed time T̃_i = min(T_i, C_i) and event indicator δ_i = I(T_i ≤ C_i). Suppose that (X_1, T_1, C_1), ⋯, (X_n, T_n, C_n) are independent and identically distributed. We further assume non-informative censoring, T_i ⫫ C_i | X_i, i.e., T_i is conditionally independent of C_i given X_i. Although our methods can be generalized to handle higher order interactions, for ease of exposition we consider the Cox model whose linear component includes all possible two-way interactions. That is, in model (1)

$$g(X, \beta, \alpha) = \sum_{j=1}^{p} \beta_j X_j + \sum_{1 \le j < j' \le p} \alpha_{j,j'} X_j X_{j'}, \tag{2}$$

where α = (α_{1,2}, ⋯, α_{p−1,p})^T. Our goals are to provide a method that determines which terms in g(X, β, α) have important effects on the hazard, and to develop a corresponding computational algorithm and parameter estimators having desirable properties. The existing variable selection methods for the Cox model do not guarantee the strong heredity constraint, as they treat all elements of (β, α) equally and do not distinguish between elements of β and α.
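To make the bookkeeping concrete, the following sketch (ours, not part of the paper; all names are hypothetical) builds the design matrix containing the p main effects and the p(p − 1)/2 pairwise interaction columns that enter (2):

```python
import numpy as np
from itertools import combinations

def expand_pairwise(X):
    """Append all pairwise interaction columns X_j * X_j' (j < j') to X.

    X : (n, p) array of main-effect covariates.
    Returns the (n, p + p(p-1)/2) expanded design matrix and the list
    of (j, j') index pairs, in the order the columns were appended.
    """
    pairs = list(combinations(range(X.shape[1]), 2))
    inter = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    return np.hstack([X, inter]), pairs

# Example: p = 5 covariates yield 10 interaction columns, 15 candidate terms in all.
X = np.random.default_rng(0).normal(size=(200, 5))
Z, pairs = expand_pairwise(X)
assert Z.shape == (200, 15)
```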

We first re-parameterize the coefficients of the interaction terms in (2) as α_{j,j′} = γ_{j,j′} β_j β_{j′}, so that the linear term becomes

$$g(X, \theta) = \sum_{j=1}^{p} \beta_j X_j + \sum_{1 \le j < j' \le p} \gamma_{j,j'} \beta_j \beta_{j'} X_j X_{j'}, \tag{3}$$

and the parameter vector is θ = (β^T, γ^T)^T = (β_1, ⋯, β_p, γ_{1,2}, ⋯, γ_{p−1,p})^T. With this reparameterization, the coefficient of an interaction term X_j X_{j′} must be 0 if either of its two main effects X_j or X_{j′} has coefficient 0. Conversely, if γ_{j,j′} β_j β_{j′} ≠ 0 then both β_j ≠ 0 and β_{j′} ≠ 0, which guarantees the strong heredity constraint. With the reparameterization (3), the log partial likelihood is

$$l_n(\theta) = \sum_{i=1}^{n} \delta_i \left( g(X_i, \theta) - \log\left[ \sum_{r=1}^{n} I(\tilde T_r \ge \tilde T_i) \exp\{ g(X_r, \theta) \} \right] \right), \tag{4}$$

where I(A) denotes the indicator of the event A. For the variable selection problem at hand, we will minimize the adaptive penalized negative log partial likelihood

$$Q_n(\theta, \lambda_\beta, \lambda_\gamma) = -l_n(\theta) + n\lambda_\beta \sum_{j=1}^{p} w_j^\beta |\beta_j| + n\lambda_\gamma \sum_{1 \le j < j' \le p} w_{j,j'}^\gamma |\gamma_{j,j'}|, \tag{5}$$

where {w_j^β} and {w_{j,j′}^γ} are prespecified weights, and λ_β, λ_γ are tuning parameters. Following Breiman (1995) and Zou (2006), we compute the weights in (5) using

$$w_j^\beta = \frac{\log(n)}{n} \frac{1}{|\tilde\beta_j|}, \qquad w_{j,j'}^\gamma = \frac{\log(n)}{n} \frac{|\tilde\beta_j \tilde\beta_{j'}|}{|\tilde\alpha_{j,j'}|},$$

where the β̃_j’s and α̃_{j,j′}’s are the estimates from a usual, unpenalized Cox model fit with linear component (2). The multiplier log(n)/n is included to satisfy the convergence rate and the asymptotic properties introduced in Section 3, below. The objective function becomes

$$Q_n(\theta, \lambda_\beta, \lambda_\gamma) = -l_n(\theta) + \log(n)\,\lambda_\beta \sum_{j=1}^{p} \left| \frac{\beta_j}{\tilde\beta_j} \right| + \log(n)\,\lambda_\gamma \sum_{1 \le j < j' \le p} \left| \frac{\gamma_{j,j'} \tilde\beta_j \tilde\beta_{j'}}{\tilde\alpha_{j,j'}} \right|. \tag{6}$$

The two tuning parameters in (6) control the coefficient estimates at different levels. The first tuning parameter λ_β controls main effect estimates. If β_j is shrunk to zero, all terms involving X_j, including β_j X_j and the interactions γ_{j,j′} β_j β_{j′} X_j X_{j′} for any j′, are removed from the model. The second tuning parameter λ_γ controls the estimates only at the interaction effect level. Even if both β_j ≠ 0 and β_{j′} ≠ 0, it is possible that γ_{j,j′} = 0 if X_j and X_{j′} do not interact. The penalty term controlled by λ_γ thus provides the flexibility of selecting only the main effects of X_j and X_{j′} but not their interaction.
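The following toy calculation (ours; the numerical values are hypothetical) illustrates the two levels of control: zeroing a main effect wipes out every interaction that contains it, while γ_{j,j′} = 0 removes an interaction even when both of its main effects remain in the model.

```python
import numpy as np

# Hypothetical coefficients; beta[3] = 0 means the fourth covariate X4 is excluded.
beta = np.array([0.5, -0.8, -0.8, 0.0, 0.3])
gamma = {(1, 2): 1.25, (2, 3): 2.0, (0, 4): 0.0}    # gamma_{j,j'} on 0-based pairs

# Interaction coefficients implied by the reparameterization alpha = gamma * beta_j * beta_j'.
alpha = {(j, k): g * beta[j] * beta[k] for (j, k), g in gamma.items()}
print(alpha)
# (1, 2): 0.8  -> the X2 X3 interaction is retained
# (2, 3): 0.0  -> forced to zero because beta[3] = 0 (strong heredity)
# (0, 4): 0.0  -> removed by gamma = 0, although both main effects are nonzero
```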

The weights act on the objective function, Q_n(θ, λ_β, λ_γ), as follows. If the initial estimate β̃_j is close to 0, then w_j^β will be large and hence, as can be seen from (5), the coefficient β_j of X_j will be heavily penalized. Similarly, if α̃_{j,j′} is small relative to β̃_j β̃_{j′}, then w_{j,j′}^γ will be large and the coefficient γ_{j,j′} of the interaction term X_j X_{j′} will be heavily penalized.
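As a concrete reference for the pieces above, here is a minimal numpy sketch (ours; ties in the observed times are handled naively, and all names are our own) of the log partial likelihood (4) and the penalized objective (5):

```python
import numpy as np

def neg_log_partial_lik(theta, Z, time, event):
    """Negative log partial likelihood (4) for linear predictor Z @ theta.

    Z holds the main-effect and interaction columns; event is the
    indicator delta_i = I(T_i <= C_i). Ties in `time` are broken arbitrarily.
    """
    eta = Z @ theta
    order = np.argsort(time)
    eta_s, event_s = eta[order], event[order]
    # log of sum_{r: T~_r >= T~_i} exp(eta_r): reversed cumulative log-sum-exp
    log_risk = np.logaddexp.accumulate(eta_s[::-1])[::-1]
    return -np.sum(event_s * (eta_s - log_risk))

def penalized_objective(beta, gamma, Z, time, event, lam_b, lam_g, w_b, w_g, pairs):
    """Objective (5): -l_n(theta) plus the two adaptively weighted L1 penalties.

    gamma is aligned with `pairs`; interaction coefficients are
    alpha_{j,j'} = gamma_{j,j'} * beta_j * beta_j' as in (3).
    """
    n = len(time)
    alpha = np.array([g * beta[j] * beta[k] for g, (j, k) in zip(gamma, pairs)])
    theta = np.concatenate([beta, alpha])
    pen = n * lam_b * np.sum(w_b * np.abs(beta)) + n * lam_g * np.sum(w_g * np.abs(gamma))
    return neg_log_partial_lik(theta, Z, time, event) + pen
```

With the weights w_j^β and w_{j,j′}^γ computed from an unpenalized Cox fit as above, this is exactly the criterion (6).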

3. Theoretical Properties of the Estimator

In this section, we study the asymptotic properties of our proposed variable selection procedure and the corresponding estimator. As n → ∞, our estimator possesses the oracle property under certain regularity conditions, that is, it performs as well as if the true model were known in advance (Fan and Li, 2001). The regularity conditions that we need throughout the development are given in Appendix 1 in the web supplementary materials.

Let θ_0 = (β_0^T, γ_0^T)^T denote the true parameter vector, where

$$\gamma_{0j,j'} = \begin{cases} \dfrac{\alpha_{0j,j'}}{\beta_{0j}\beta_{0j'}}, & \text{if } \beta_{0j} \neq 0 \text{ and } \beta_{0j'} \neq 0, \\[4pt] 0, & \text{otherwise.} \end{cases} \tag{7}$$

This guarantees that the true model obeys the strong heredity constraint, that is, α_{0j,j′} = γ_{0j,j′} = 0 if either β_{0j} = 0 or β_{0j′} = 0.

Define the covariate-specific tuning parameters λ_{j,n}^β = λ_β w_j^β = n^{−1} log(n) λ_β / |β̃_j| for j = 1, ⋯, p, and the interaction-specific tuning parameters λ_{j,j′,n}^γ = λ_γ w_{j,j′}^γ = n^{−1} log(n) λ_γ / |γ̃_{j,j′}| for 1 ≤ j < j′ ≤ p, where γ̃_{j,j′} = α̃_{j,j′}/(β̃_j β̃_{j′}). Our proposed estimator is

$$\hat\theta = \arg\min_\theta \left\{ -l_n(\theta) + n \sum_{j=1}^{p} \lambda_{j,n}^\beta |\beta_j| + n \sum_{1 \le j < j' \le p} \lambda_{j,j',n}^\gamma |\gamma_{j,j'}| \right\}. \tag{8}$$

Note that the criterion in (8) is the same as Qn(θ, λβ, λγ) in (6) with different notation.

Without loss of generality, we denote β_0 = (β_{a0}^T, β_{b0}^T)^T, where β_{a0} consists of all nonzero components and β_{b0} consists of the remaining zero components of β_0. Similarly, for the true interaction coefficients, write γ_0 = (γ_{a0}^T, γ_{b0}^T, γ_{c0}^T)^T, where γ_{a0} contains all nonzero components of γ_0, γ_{b0} contains the zero components of γ_0 whose corresponding main effects are both nonzero, and γ_{c0} contains the remaining zero components of γ_0 for which at least one corresponding main effect is zero. Correspondingly, we denote the minimizer of (8) as θ̂_n = (β̂_{an}^T, β̂_{bn}^T, γ̂_{an}^T, γ̂_{bn}^T, γ̂_{cn}^T)^T, the covariate-specific tuning parameters as {(λ_{an}^β)^T, (λ_{bn}^β)^T}, and the interaction-specific tuning parameters as {(λ_{an}^γ)^T, (λ_{bn}^γ)^T, (λ_{cn}^γ)^T}.

Let

$$\xi_n = \max\left\{ (\lambda_{an}^\beta)^T, (\lambda_{an}^\gamma)^T \right\}, \qquad \zeta_n = \min\left\{ (\lambda_{bn}^\beta)^T, (\lambda_{bn}^\gamma)^T \right\}.$$

Then ξ_n is the maximum of the covariate-specific and interaction-specific tuning parameters that correspond to nonzero coefficients in the model. For ζ_n, by contrast, we consider only the tuning parameters corresponding to zero main effects and to the zero interaction coefficients in γ_{b0}. That is, we consider all zero main effects but, among the interactions, only those zero terms whose corresponding main effects are both nonzero. We refer to such terms in the definition of ζ_n as non-trivial zero terms.

For the proposed variable selection procedure to perform properly, as n → ∞ the covariate-specific and interaction-specific penalties for the terms whose true coefficients are nonzero should converge to 0, and the penalties for the non-trivial zero terms should be large enough that the estimates shrink to 0. In fact, if the tuning parameters in (8) that correspond to β_{a0} and γ_{a0} converge to 0, and those corresponding to β_{b0} and γ_{b0} are sufficiently large, then our proposed estimation procedure has the so-called oracle property (Fan and Li, 2001). This is guaranteed when √n-consistent estimates of θ_0 are used in the definitions of λ_{j,n}^β and λ_{j,j′,n}^γ, in which case one can easily show that √n ξ_n → 0 and √n ζ_n → ∞.
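To see why √n-consistent initial estimates deliver these two rates simultaneously, a short calculation (ours, directly from the definitions above; the interaction-specific λ_{j,j′,n}^γ behaves analogously with γ̃_{j,j′} in place of β̃_j):

```latex
\sqrt{n}\,\lambda_{j,n}^{\beta}
  = \frac{\log(n)\,\lambda_\beta}{\sqrt{n}\,|\tilde\beta_j|}
  \;\longrightarrow\;
  \begin{cases}
    0, & \beta_{0j} \neq 0,\ \text{since } |\tilde\beta_j| \xrightarrow{\,p\,} |\beta_{0j}| > 0,\\[4pt]
    \infty, & \beta_{0j} = 0,\ \text{since } \sqrt{n}\,|\tilde\beta_j| = O_p(1).
  \end{cases}
```

Taking the maximum over the nonzero-coefficient terms gives √n ξ_n → 0, and taking the minimum over the non-trivial zero terms gives √n ζ_n → ∞.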

Denote the initial estimates as β̃ = (β̃_1, ⋯, β̃_p) and γ̃ = (γ̃_{1,2}, ⋯, γ̃_{p−1,p}). We summarize the above results in Theorem 1; a sketch of the proof of Theorem 1 is given in Appendix 2.

Theorem 1. When √n-consistent initial estimates, √n(β̃ − β_0) = O_p(1) and √n(γ̃ − γ_0) = O_p(1), are used in (8) to calculate λ_{j,n}^β and λ_{j,j′,n}^γ, we have √n ξ_n → 0 and √n ζ_n → ∞ as n → ∞. Under the regularity conditions (1)-(3) in Appendix 1, there exists a local minimizer θ̂_n of Q_n(θ) such that

  1. (Sparsity) P(β̂_bn = 0) → 1, P(γ̂_bn = 0) → 1, and P(γ̂_cn = 0) → 1 as n → ∞.

  2. (Asymptotic normality)

$$\sqrt{n}\left\{ \begin{pmatrix} \hat\beta_{an} \\ \hat\gamma_{an} \end{pmatrix} - \begin{pmatrix} \beta_{a0} \\ \gamma_{a0} \end{pmatrix} \right\} \xrightarrow{d} N\left\{ 0,\; I_a^{-1}(\beta_{a0}, \gamma_{a0}) \right\},$$

where I_a(β_{a0}, γ_{a0}) is the Fisher information matrix evaluated at β_{a0} and γ_{a0}, assuming that β_{b0} = 0, γ_{b0} = 0, and γ_{c0} = 0 is known in advance.

Part (i) of Theorem 1 presents the sparsity property and shows that our proposed method consistently removes the zero-effect terms with probability tending to 1. This implies that, with a sufficiently large sample, in practice our method can select the underlying true model with high probability. In part (ii) of Theorem 1, we establish that the estimates of the nonzero elements of θ_0 are √n-consistent and asymptotically normal. The asymptotic distribution is the same as what would be obtained if it were known in advance which elements of θ_0 are 0 and which are not, the so-called oracle property.

Theorem 2, below, establishes the asymptotic behavior of the proposed method when p is very large, in particular when p → ∞ as n → ∞. This is an important case in many biomedical studies, given the advance of new high-throughput technologies. A proof of Theorem 2 is given in Appendix 3 in the web supplementary materials.

When the number of predictors may increase with the sample size n, we denote p as pn, which allows the possibility that pn → ∞ as n → ∞. Including all main effects and pairwise interactions, the total number of parameters is qn = (pn + 1)pn/2. Similarly, when appropriate we add a subscript n to other notation, and we let dn denote the number of non-zero coefficients in the underlying true model. Then we have

Theorem 2. Under the regularity conditions (4)-(6) in Appendix 1, if p_n = o(n^{1/5}), √(n/q_n) ξ_n → 0, and √(n/q_n) ζ_n → ∞ as n → ∞, then there exists a local minimizer θ̂_n of Q_n(θ) such that

  1. (Sparsity) P(β̂_bn = 0) → 1, P(γ̂_bn = 0) → 1, and P(γ̂_cn = 0) → 1 as n → ∞.

  2. (Asymptotic Normality)

$$\sqrt{n}\,\Omega_n I_{an}^{1/2}(\beta_{a0}, \gamma_{a0}) \left\{ \begin{pmatrix} \hat\beta_{an} \\ \hat\gamma_{an} \end{pmatrix} - \begin{pmatrix} \beta_{a0} \\ \gamma_{a0} \end{pmatrix} \right\} \xrightarrow{d} N\{0, \Sigma\},$$

where Ω_n is an arbitrary d × d_n matrix with finite d such that Ω_n Ω_n^T → Σ, Σ is a d × d positive semidefinite symmetric matrix, and I_{an}(β_{a0}, γ_{a0}) is the d_n × d_n Fisher information matrix evaluated at (β_{a0}, γ_{a0}), assuming that β_{b0} = 0, γ_{b0} = 0, and γ_{c0} = 0 is known in advance.

The reason we consider an arbitrary linear combination Ω_n(β̂_{an}^T, γ̂_{an}^T)^T in Theorem 2, instead of (β̂_{an}^T, γ̂_{an}^T)^T as in Theorem 1, is that the latter has dimension d_n → ∞ as n → ∞ in the current setup, while the former has finite dimension d.

4. Computational Algorithm

Expanding l_n(θ) in the objective function (6), Q_n(θ, λ_β, λ_γ) becomes

$$\begin{aligned} & -\sum_{i=1}^{n} \delta_i \Bigg[ \Bigg( \sum_{j=1}^{p} \beta_j X_{ij} + \sum_{1 \le j < j' \le p} \gamma_{j,j'}\beta_j\beta_{j'} X_{ij}X_{ij'} \Bigg) - \log\Bigg\{ \sum_{r=1}^{n} I(\tilde T_r \ge \tilde T_i) \exp\Bigg( \sum_{j=1}^{p} \beta_j X_{rj} + \sum_{1 \le j < j' \le p} \gamma_{j,j'}\beta_j\beta_{j'} X_{rj}X_{rj'} \Bigg) \Bigg\} \Bigg] \\ & \qquad + \log(n)\,\lambda_\beta \sum_{j=1}^{p} \left| \frac{\beta_j}{\tilde\beta_j} \right| + \log(n)\,\lambda_\gamma \sum_{1 \le j < j' \le p} \left| \frac{\gamma_{j,j'}\tilde\beta_j\tilde\beta_{j'}}{\tilde\alpha_{j,j'}} \right|. \end{aligned}$$

Our estimator minimizes Q_n(θ, λ_β, λ_γ), that is, θ̂_n = argmin_θ Q_n(θ, λ_β, λ_γ). To carry out the optimization, we apply a modified version of the iteratively reweighted least squares algorithm (Green, 1984) with weighted L1 penalties. Denote the gradient vector of the log partial likelihood by l̇(θ) = ∂l_n(θ)/∂θ and the Hessian matrix by l̈(θ) = ∂²l_n(θ)/∂θ∂θ^T. Using the Cholesky decomposition −l̈(θ) = MM^T, where M is an invertible lower triangular matrix, we define the pseudo-response vector Y = M^{−1}{l̇(θ) − l̈(θ)θ}. By the usual second-order Taylor expansion, −l_n(θ) in (6) can be approximated, up to a constant, by the quadratic form (Y − M^Tθ)^T(Y − M^Tθ)/2, and at each penalized iteratively reweighted least squares iteration we minimize

$$\frac{1}{2}(Y - M^T\theta)^T(Y - M^T\theta) + \log(n)\,\lambda_\beta \sum_{j=1}^{p} \left| \frac{\beta_j}{\tilde\beta_j} \right| + \log(n)\,\lambda_\gamma \sum_{1 \le j < j' \le p} \left| \frac{\gamma_{j,j'}\tilde\beta_j\tilde\beta_{j'}}{\tilde\alpha_{j,j'}} \right|. \tag{9}$$
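A minimal sketch (ours) of the pseudo-response construction, assuming the gradient and Hessian of the log partial likelihood at the current iterate are available as arrays:

```python
import numpy as np

def pseudo_response(theta, grad, hess):
    """Pseudo-response Y and Cholesky factor M for the local quadratic
    approximation -l_n(theta') ~ (1/2)||Y - M^T theta'||^2 + const.

    grad = l-dot(theta), hess = l-ddot(theta): gradient and Hessian of the
    LOG partial likelihood, so -hess is positive definite near the optimum.
    """
    M = np.linalg.cholesky(-hess)                # -l-ddot(theta) = M M^T
    Y = np.linalg.solve(M, grad - hess @ theta)  # Y = M^{-1}{l-dot - l-ddot theta}
    return Y, M
```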

Because the main effect coefficients β and the interaction coefficients γ are controlled at different levels, in each step we also iterate between these two sets, first fixing β to estimate γ, then fixing γ to estimate β, and iterating until convergence. This algorithm is guaranteed to converge, since the objective function decreases at each step. When β is fixed, the optimization in γ becomes a Lasso problem, hence one can use either the LARS/Lasso algorithm (Efron et al., 2004) or quadratic programming to solve for γ efficiently. When γ is fixed, we solve for β_1, ⋯, β_p sequentially. For each j = 1, ⋯, p, we fix γ and β_{[−j]} = (β_1, ⋯, β_{j−1}, β_{j+1}, ⋯, β_p), and the optimization becomes a simple Lasso problem with only one parameter, β_j. This is similar to the shooting algorithm (Fu, 1998; Zhang and Lu, 2007; Friedman et al., 2007). The tuning parameters are selected by minimizing the generalized cross-validation statistic,

$$C_{\text{pseudo-GCV}} = \frac{-l(\hat\theta)}{n\left(1 - df/n\right)^2},$$

over a reasonable range of λβ and λγ, where df is the number of nonzero parameters in the fitted model.
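For instance, the two-dimensional search could be organized as below (our sketch; fit_penalized is a hypothetical routine returning the fitted θ̂, its number of nonzero parameters df, and the attained negative log partial likelihood):

```python
import numpy as np
from itertools import product

def select_tuning(fit_penalized, Z, time, event, lam_b_grid, lam_g_grid):
    """Grid search for (lambda_beta, lambda_gamma) minimizing C_pseudo-GCV."""
    n = len(time)
    best_pair, best_gcv = None, np.inf
    for lam_b, lam_g in product(lam_b_grid, lam_g_grid):
        theta_hat, df, neg_loglik = fit_penalized(Z, time, event, lam_b, lam_g)
        gcv = neg_loglik / (n * (1.0 - df / n) ** 2)
        if gcv < best_gcv:
            best_pair, best_gcv = (lam_b, lam_g), gcv
    return best_pair
```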

For a fixed λβ and λγ, the optimization algorithm proceeds as follows:

Step 1. Center and normalize each term Xj, XjXj′, j < j′, and j, j′ = 1, …, p.

Step 2. Start with plausible initial values β̂_j^(0) and γ̂_{j,j′}^(0), j < j′, and j, j′ = 1, …, p, such as the conventional Cox regression parameter estimates. Set m = 1.

Step 3. Compute Y and M based on the current value θ̂^(m−1), and define

$$\tilde l(\theta) = \frac{1}{2}(Y - M^T\theta)^T(Y - M^T\theta).$$

Step 4. To update γ̂, let

$$\hat\gamma^{(m)} = \arg\min_{\gamma} \left\{ \tilde l(\hat\beta^{(m-1)}, \gamma) + n\lambda_\gamma \sum_{j<j'} w_{j,j'}^\gamma |\gamma_{j,j'}| \right\}.$$

Step 5. To update β̂, for each j = 1, …, p in sequence, let

$$\hat\beta_j^{(m)} = \arg\min_{\beta_j} \left\{ \tilde l(\hat\beta_{[-j]}^{(m-1)}, \beta_j, \hat\gamma^{(m)}) + n\lambda_\beta w_j^\beta |\beta_j| \right\}.$$

Step 6. If the relative difference between Q_n(θ̂^(m−1)) and Q_n(θ̂^(m)),

$$\Delta^{(m)} = \frac{Q_n(\hat\theta^{(m-1)}) - Q_n(\hat\theta^{(m)})}{Q_n(\hat\theta^{(m-1)})},$$

is small enough, then stop. Otherwise, increment m to m+1 and return to Step 3.

Since γ_{j,j′} = α_{j,j′}/(β_j β_{j′}), in Step 4 the minimization is actually over {α_{j,j′}} with β = β̂^(m−1). This algorithm gives exact zeros for some coefficients and guarantees that the coefficient of an interaction is set to 0 whenever either of its corresponding main effects is shrunk to 0. Based on our empirical experience, the algorithm converges quickly.
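Each one-parameter Lasso subproblem in Steps 4-5 has a closed-form soft-thresholding solution under the quadratic approximation. A generic sketch (ours; here A plays the role of M^T with the coordinates not being updated held fixed):

```python
import numpy as np

def soft_threshold(z, t):
    """Return sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_update(j, b, A, Y, lam):
    """Minimize (1/2)||Y - A b||^2 + lam |b_j| over b_j, other coordinates fixed."""
    r = Y - A @ b + A[:, j] * b[j]              # partial residual excluding column j
    return soft_threshold(A[:, j] @ r, lam) / (A[:, j] @ A[:, j])
```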

5. Numerical Results

In this section, we report results of a simulation study comparing our proposed method with the Lasso and adaptive Lasso, two popular variable selection methods for the Cox model, neither of which guarantees the strong heredity constraint. We follow the simulation setup in Zhang and Lu (2007), but also include two-way interaction terms as candidates for the variable selection. We consider sample size n = 200, and assume that there are p = 5 covariates of interest in each simulated dataset, denoted by X_1, X_2, ⋯, X_5. Thus, the number of all possible two-way interaction terms is p(p − 1)/2 = 10, and there are 15 candidate terms in total. We assume each covariate follows a standard normal distribution, and consider two scenarios: (i) the covariates are independent, and (ii) the covariates have pairwise correlations Corr(X_j, X_{j′}) = ρ^{|j−j′|}, with ρ = 0.5. We generate the censoring times from a Uniform distribution with support (0, τ), where τ is chosen to obtain a specified censoring rate of either 25% or 40%.

Failure times are generated from a Cox model with constant baseline hazard λ_0 = 0.1, where the coefficients of X_2, X_3, and the interaction term X_2X_3 are nonzero and the other 12 coefficients are zero. We consider the following two models (a data-generation sketch follows the model definitions below):

Model 1: β_{02} = −0.8, β_{03} = −0.8, and α_{02,3} = −0.8, corresponding to large effects.

Model 2: β_{02} = −0.3, β_{03} = −0.3, and α_{02,3} = −0.3, corresponding to small effects.
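For reference, one replicate of this design can be generated by inverse-transform sampling, since the baseline hazard is constant; a sketch (ours; the value of τ shown is hypothetical, and in the study τ is tuned to the target censoring rate):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, rho, lam0 = 200, 5, 0.5, 0.1

# Scenario (ii): Corr(X_j, X_j') = rho^{|j - j'|}; use rho = 0 for scenario (i).
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Model 1 linear predictor (1-based): beta_2 = beta_3 = alpha_{2,3} = -0.8
eta = -0.8 * X[:, 1] - 0.8 * X[:, 2] - 0.8 * X[:, 1] * X[:, 2]

# Constant baseline hazard lam0 gives T = -log(U) / (lam0 * exp(eta))
T = -np.log(rng.uniform(size=n)) / (lam0 * np.exp(eta))
C = rng.uniform(0, 60.0, size=n)                 # tau = 60 is a hypothetical choice
time, event = np.minimum(T, C), (T <= C).astype(int)
```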

We simulate each case 100 times and apply all three methods to each simulated dataset. For all methods, the tuning parameters are selected using the generalized cross-validation criterion C_pseudo-GCV.

Table 1 summarizes the percentage of fully correct variable selection among 100 replications for each method under each scenario. For all cases, the proposed method correctly selects the true model more frequently than the regular Lasso and adaptive Lasso. Specifically, for all scenarios under Model 2, where important variables and interactions have small effects, the Lasso and adaptive Lasso perform about the same, with the latter slightly better, while our proposed method performs much better than these two methods. In Model 1, where important variables and interactions have large effects, the advantage of our proposed method is still substantial when the censoring percentage is high, 40%. When censoring is moderate, 25%, the adaptive Lasso becomes competitive, but still is worse than the proposed method. The performance of the Lasso is always the worst among the three methods.

Table 1.

Percentages of correctly selected models among 100 replications.

Model     Censoring Percentage   ρ for Correlation   Lasso   Adaptive Lasso   Proposed Method
Model 1   25%                    0                   11%     70%              86%
          25%                    0.5                 6%      66%              79%
          40%                    0                   3%      58%              80%
          40%                    0.5                 3%      54%              82%

Model 2   25%                    0                   43%     49%              74%
          25%                    0.5                 26%     31%              63%
          40%                    0                   28%     32%              74%
          40%                    0.5                 22%     33%              62%

Table 2 summarizes the frequency with which each individual main effect and interaction term is selected into the model. Due to space limitations, we present results only for 25% censoring under both models when the covariates are independent (ρ = 0); similar results are observed for the 40% censoring case and the correlated covariates case. As shown in Table 2, the proposed variable selection procedure always chooses the important variables and interaction terms much more often than the other two methods. Under Model 1 with large effects, the adaptive Lasso and the proposed method generally select the unimportant terms less often than the Lasso, while for Model 2 with small effects, the proposed method actually selects some of the unimportant terms more often. However, as illustrated in Table 1, the proposed method simultaneously selects all the important terms and removes the unimportant ones more frequently.

Table 2.

Term-specific selection percentages for each main effect and interaction (25% censoring and ρ = 0 for independent covariates).

Method             X1    X2     X3     X4    X5    X1X2   X1X3   X1X4
Model 1
  Lasso            22%   100%   100%   18%   25%   15%    12%    15%
  Adaptive Lasso   6%    100%   100%   3%    7%    2%     4%     1%
  Proposed method  5%    100%   100%   4%    6%    5%     5%     0%
Model 2
  Lasso            6%    91%    91%    6%    5%    2%     6%     3%
  Adaptive Lasso   2%    87%    85%    4%    4%    1%     2%     1%
  Proposed method  10%   98%    99%    11%   8%    10%    10%    1%

Method             X1X5   X2X3   X2X4   X2X5   X3X4   X3X5   X4X5
Model 1
  Lasso            15%    100%   13%    13%    11%    13%    15%
  Adaptive Lasso   2%     100%   3%     2%     2%     2%     0%
  Proposed method  1%     100%   4%     6%     4%     6%     0%
Model 2
  Lasso            3%     89%    4%     3%     9%     5%     2%
  Adaptive Lasso   2%     82%    1%     2%     1%     0%     0%
  Proposed method  1%     97%    10%    8%     11%    8%     2%

A simple, ad hoc alternative often employed in practice for variable selection involving interaction terms is to run either the Lasso or the adaptive Lasso and then, depending on which terms were selected, manually add back any main effects that were not selected but are components of a selected interaction term. This ensures the strong heredity constraint. As seen in our simulation, however, this practice does not help much in terms of how often the true model is selected correctly, while the false positive rate is greatly increased. For example, as one can see in Table 2 for Model 2, since the proposed method selected the true interaction term X2X3 more often than the other two methods, this ad hoc approach following the Lasso or the adaptive Lasso still cannot beat the proposed method. For Model 1 with large effects in Table 2, the frequency of X2, X3, and X2X3 being correctly selected already reaches 100%, so the ad hoc approach does not help at all.

To measure estimation accuracy, following Tibshirani et al. (1997) and Zhang and Lu (2007) we average the mean squared error MSE = (θ̂ − θ)^T Γ (θ̂ − θ) over 100 replications, where Γ is the population variance-covariance matrix of the covariates. Standard errors are given in parentheses. For all scenarios, the proposed method has the smallest mean squared error (Table 3), and thus outperforms the other two competitors in terms of prediction accuracy.
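The criterion itself is a simple quadratic form; a minimal sketch (ours):

```python
import numpy as np

def model_error(theta_hat, theta_true, Gamma):
    """(theta_hat - theta)^T Gamma (theta_hat - theta), with Gamma the
    population covariance matrix of the 15 candidate terms."""
    d = np.asarray(theta_hat) - np.asarray(theta_true)
    return float(d @ Gamma @ d)
```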

Table 3.

Mean squared error, with standard errors in parentheses.

Model     Censoring Percentage   ρ for Correlation   Lasso           Adaptive Lasso   Proposed Method
Model 1   25%                    0                   0.213 (0.124)   0.072 (0.068)    0.063 (0.065)
          25%                    0.5                 0.219 (0.136)   0.071 (0.068)    0.069 (0.065)
          40%                    0                   0.337 (0.209)   0.168 (0.144)    0.117 (0.219)
          40%                    0.5                 0.351 (0.215)   0.162 (0.172)    0.118 (0.216)

Model 2   25%                    0                   0.118 (0.053)   0.105 (0.055)    0.042 (0.041)
          25%                    0.5                 0.131 (0.059)   0.130 (0.059)    0.054 (0.048)
          40%                    0                   0.117 (0.053)   0.122 (0.058)    0.061 (0.062)
          40%                    0.5                 0.137 (0.066)   0.142 (0.068)    0.082 (0.078)

6. Discussion

We have extended the adaptive Lasso method to accommodate the Cox proportional hazards model with interaction terms while ensuring that the strong heredity constraint is satisfied in the selected model. Hamada and Wu (1992) considered other constraints, such as weak heredity, under which only one of the two main effects is required to be in the model when an interaction term is included. Although this situation is not our main interest, our method can easily be modified to handle it by employing a different reparameterization.

As discussed in Zhang and Lu (2007), the adaptive choice of weights may become problematic when some elements of θ are not estimable. This may occur, for example, when strong collinearity exists among covariates, or when the number of covariates p is much larger than the sample size n in high-dimensional data. In such settings, one cannot obtain the initial estimates needed to determine the adaptive weights, and a more robust estimator of θ, such as ridge regression, may be used instead.
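As one concrete possibility (our illustration, assuming the CoxPHFitter interface of the Python lifelines package; the data frame here is synthetic), a ridge-penalized Cox fit can supply stabilized initial estimates for the weights:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Synthetic stand-in data frame with survival time, event indicator, and covariates.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{j}" for j in range(1, 6)])
df["time"] = rng.exponential(size=200)
df["event"] = rng.integers(0, 2, size=200)

cph = CoxPHFitter(penalizer=0.1, l1_ratio=0.0)   # pure L2 (ridge) penalty
cph.fit(df, duration_col="time", event_col="event")
beta_tilde = cph.params_.to_numpy()              # stabilized initial estimates
```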

Our work was motivated by the desire to identify key treatment-biomarker interactions for developing personalized treatments, where the number of candidate biomarkers is usually fixed or grows slowly with n. Even with a moderate number of candidate biomarkers, our methodology can have an important impact on physicians' actual behavior if clinically meaningful treatment-biomarker effects are identified. The condition p = o(n^{1/5}) that we used may be relaxed, though. There have been recent theoretical developments on high-dimensional survival models, such as those in Bradic et al. (2011) and Lin and Lv (2013), but their underlying theory cannot be applied directly to our setting, which uses a reparameterization approach for selecting important interactions under the heredity constraint. Relaxing our condition p = o(n^{1/5}) would require modifying the above theory to our setting, which would entail a substantial amount of work. While this is beyond the scope of the current paper, it would certainly be an interesting future research project.

Supplementary Material

The web supplementary materials contain Appendix 1 (regularity conditions), Appendix 2 (sketch of the proof of Theorem 1), and Appendix 3 (proof of Theorem 2).

Acknowledgements

The authors thank the Co-Editor-in-Chief, Associate Editor, and referee for their valuable comments.


References

  1. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. The Annals of Statistics. 1982;10:1100–1120.
  2. Bien J, Taylor J, Tibshirani R. A lasso for hierarchical interactions. The Annals of Statistics. 2013;41:1111–1141.
  3. Boyd S, Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press; 2004.
  4. Bradic J, Fan J, Jiang J. Regularization for Cox's proportional hazards model with NP-dimensionality. The Annals of Statistics. 2011;39:3092–3120.
  5. Breiman L. Better subset regression using the non-negative garrote. Technometrics. 1995;37:373–384.
  6. Chipman H. Bayesian variable selection with related predictors. Canadian Journal of Statistics. 1996;24:17–36.
  7. Choi NH, Li W, Zhu J. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association. 2010;105:354–364.
  8. Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  9. Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
  10. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression (with discussion). The Annals of Statistics. 2004;32:407–499.
  11. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  12. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. The Annals of Statistics. 2002;30:74–99.
  13. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics. 2004;32:928–961.
  14. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1:302–332.
  15. Fu WJ. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
  16. Green PJ. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B. 1984;46:149–192.
  17. Hamada M, Wu CJ. Analysis of designed experiments with complex aliasing. Journal of Quality Technology. 1992;24:130–137.
  18. Joseph VR. A Bayesian approach to the design and analysis of fractionated experiments. Technometrics. 2006;48:219–229.
  19. Lin W, Lv J. High-dimensional sparse additive hazards regression. Journal of the American Statistical Association. 2013;108:247–264.
  20. McCullagh P, Nelder JA. Generalized Linear Models. Chapman & Hall/CRC; 1989.
  21. Nelder JA. The statistics of linear models: back to basics. Statistics and Computing. 1994;4:221–234.
  22. Radchenko P, James GM. Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association. 2010;105:1541–1553.
  23. Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395.
  24. Yuan M, Joseph VR, Lin Y. An efficient variable selection approach for analyzing designed experiments. Technometrics. 2007;49:430–439.
  25. Yuan M, Joseph VR, Zou H. Structured variable selection and estimation. The Annals of Applied Statistics. 2009;3:1738–1757.
  26. Zhang HH, Lu W. Adaptive Lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
  27. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
