Published in final edited form as: Biometrics. 2012 Apr 16;68(2):419–428. doi: 10.1111/j.1541-0420.2011.01692.x

Model Selection for Cox Models with Time-Varying Coefficients

Jun Yan 1,2,3,*, Jian Huang 4,5

Summary

Cox models with time-varying coefficients offer great flexibility in capturing the temporal dynamics of covariate effects on right-censored failure times. Because not all covariate coefficients are time-varying, model selection for such models presents an additional challenge: distinguishing covariates with time-varying coefficients from those with time-independent coefficients. We propose an adaptive group lasso method that not only selects important variables but also selects between time-independent and time-varying specifications of their presence in the model. Each covariate effect is partitioned into a time-independent part and a time-varying part, the latter characterized by a group of coefficients of basis splines without intercept. Model selection and estimation are carried out through a fast, iterative group shooting algorithm. Our approach is shown to have good properties in a simulation study that mimics realistic situations with up to 20 variables. A real example illustrates the utility of the method.

Keywords: B-spline, Group lasso, Varying-coefficient

1. Introduction

Cox models with time-varying coefficients offer great flexibility in assessing the temporal dynamics of covariate effects on right-censored failure times. When a large number of covariates are available, it is important to select a subset of significant variables together with the forms of their effects, time-varying or time-independent. Therefore, an ideal model selection procedure for Cox models with time-varying coefficients should distinguish three kinds of covariates: 1) those not in the model; 2) those in the model with time-independent coefficients; and 3) those in the model with time-varying coefficients.

For standard Cox models with time-independent coefficients, effective variable selection techniques are available. The lasso approach, widely used in variable selection for linear regression models (Tibshirani, 1996), has been extended to Cox models (Tibshirani, 1997). Zhang and Lu (2007) further proposed the adaptive lasso, where the penalty on each coefficient is weighted by the inverse magnitude of an initial estimate of that coefficient. Fan and Li (2002) proposed a general nonconcave penalized partial likelihood approach, extending their methods for linear models (Fan and Li, 2001). Both approaches have the oracle property; that is, the asymptotic distribution of an estimated coefficient is the same as when it is known a priori which variables are in the model.

For Cox models with time-varying coefficients, model selection has not been extensively studied. In fact, the literature on model selection for varying-coefficient models in general appears to be limited. In the framework of smoothing spline analysis of variance, Lin and Zhang (2006) proposed a component selection and smoothing operator (COSSO) that replaces the squared-norm penalty in traditional smoothing spline methods with an L1 norm. This approach was later extended to varying-coefficient Cox models (Leng and Zhang, 2006). Li and Liang (2008) proposed a two-part variable selection approach for semiparametric regression models. Variables in the parametric component are selected with the method of Fan and Li (2001), with the unknown nonparametric coefficients replaced by estimates obtained from maximizing kernel-based local likelihood. Variables in the nonparametric component are selected by backward elimination via a sequence of generalized quasi-likelihood ratio tests (Fan et al., 2001). For varying-coefficient models of repeated measurements, Wang et al. (2008) proposed regularized estimation using the smoothly clipped absolute deviation (SCAD) penalty, with nonparametric coefficients expanded over basis functions. For nonparametric additive models, Huang et al. (2010) approximated additive components with B-spline expansions and selected nonzero components by selecting the groups of coefficients in the expansion via an adaptive group lasso. An interesting work that selects between constant and varying coefficients is Leng (2009), where the COSSO penalty was redesigned to distinguish the two types of coefficients; this approach, however, does not select between nonzero and zero coefficients. Most recently, Zhang et al. (2011) proposed an automatic approach to discover whether a covariate effect is linear or nonlinear, in addition to whether it is nonzero, using different penalty terms on linear and nonlinear effects. In the context of Cox models with time-varying coefficients, simultaneous selection between varying and fixed coefficients, in addition to selection between nonzero and zero coefficients, has not been studied.

Two main classes of approaches for varying-coefficient Cox models have been studied in the literature. The penalized partial likelihood approach uses smooth functions for the coefficients, maximizing the log partial likelihood with a penalty on the roughness of the coefficients (Zucker and Karr, 1990). The kernel-weighted partial likelihood approach finds a pointwise estimator at each time by maximizing a weighted "local" log partial likelihood function (Cai and Sun, 2003; Tian et al., 2005). We focus on the first class of models, where each time-varying coefficient is expanded over a B-spline basis. Each coefficient is then characterized by a set of basis coefficients, which is further treated as two groups. The first group captures the time-independent, overall level of the covariate effect, while the second group captures the temporal changes relative to that overall level. We propose to select significant variables and the temporal dynamics of their effects by applying the group lasso approach (Yuan and Lin, 2006) over these groups of coefficients.

The rest of the article is organized as follows. An adaptive group lasso method with penalized partial likelihood based on B-splines is proposed in Section 2. Computation details of the proposed model selection procedure are presented in Section 3. Numerical studies on the finite sample performance of the procedure are summarized in Section 4. The method is applied to real data in Section 5. A discussion concludes in Section 6.

2. Adaptive Group Lasso with B-Splines

Consider a random sample of size n. Let $T_i$ be the failure time and $C_i$ the censoring time of subject i, i = 1, …, n. Let $X_i = (X_{i1}, \ldots, X_{ip})$ be the vector of covariates for subject i. Define $\tilde T_i = \min(T_i, C_i)$ and $\Delta_i = I(T_i \le C_i)$. Assume that $T_i$ and $C_i$ are conditionally independent given $X_i$, and that the censoring scheme is noninformative. The observed data are independent and identically distributed copies $\{\tilde T_i, \Delta_i, X_i\}$, i = 1, …, n.

The Cox model with time-varying coefficients is

$$h(t \mid X_i) = h_0(t) \exp\{X_i \beta(t)\}, \qquad (1)$$

where $h_0$ is an unspecified baseline hazard function and $\beta(t)$ is a $p \times 1$ vector of time-varying coefficients. Let $B_j(t)$, $j = 1, \ldots, q-1$, $q > 1$, be a set of B-spline basis functions with $q - 1$ degrees of freedom and no intercept on a predetermined time interval $[0, \tau]$. Assume that $\beta(t)$ is expanded over the B-spline basis as $\beta(t) = \Theta F(t)$, where $F(t) = \{1, B_1(t), \ldots, B_{q-1}(t)\}^\top$ and $\Theta$ is a $p \times q$ matrix of parameters to be estimated. Each time-varying coefficient $\beta_j(t) = \Theta_j F(t)$, $j = 1, \ldots, p$, is therefore determined by $\Theta_j$, the jth row of the parameter matrix $\Theta$.

We decompose each $\beta_j(t)$ into two parts by partitioning $\Theta_j$ into two parts, each corresponding to a partition of $F(t)$. That is, we write $\Theta_j = (\Theta_{j,1}, \Theta_{j,-1})$, where $\Theta_{j,1}$ is the coefficient of the first component of $F(t)$, the constant one, and $\Theta_{j,-1}$ consists of the coefficients of the remaining components of $F(t)$, $\{B_1(t), \ldots, B_{q-1}(t)\}$. The intercept $\Theta_{j,1}$ represents a time-independent, overall effect, while $\Theta_{j,-1}$ determines the temporal changes in $\beta_j(t)$ relative to the intercept. Because of this construction, the B-spline basis $\{B_1(t), \ldots, B_{q-1}(t)\}$ cannot contain an intercept. With the package splines in base R (R Development Core Team, 2011), such a basis can be obtained from the function bs with intercept = FALSE. In our simulations and data analysis, we used bs with quadratic splines (degree = 2), q − 1 degrees of freedom (df = q − 1), and equally spaced interior knots.
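As an illustration, here is a minimal R sketch of this basis construction with $\tau = 2$ and $q = 5$, the configuration used later in our simulations; the time grid is only for evaluating the basis and is not part of the model.

```r
library(splines)

tau <- 2
q <- 5                                   # F(t) has q components in total
tgrid <- seq(0, tau, length.out = 200)   # grid for evaluating the basis

## Quadratic B-splines, q - 1 = 4 df, no intercept; with df = 4 and
## degree = 2, bs() places 2 interior knots, here equally spaced
## because tgrid is equally spaced.
B <- bs(tgrid, degree = 2, df = q - 1, intercept = FALSE,
        Boundary.knots = c(0, tau))

## F(t): the explicit intercept column prepended to the spline columns,
## matching the partition Theta_j = (Theta_{j,1}, Theta_{j,-1}).
Fmat <- cbind(1, B)
dim(Fmat)                                # 200 x q
```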

Let $\theta = \mathrm{vec}(\Theta^\top)$, the vectorization of $\Theta$ by row. Assuming no ties among the observed failure times, the log partial likelihood function is

$$\ell_n(\theta) = \sum_{i=1}^n \Delta_i \left[ X_i \Theta F(\tilde T_i) - \log\left\{ \sum_{j \in R_i} \exp\left( X_j \Theta F(\tilde T_i) \right) \right\} \right], \qquad (2)$$

where $R_i = \{k : \tilde T_k \ge \tilde T_i\}$ is the risk set at time $\tilde T_i$. We propose to estimate $\theta$ by minimizing the negative penalized log partial likelihood

$$Q_{\lambda_n}(\theta) = -\ell_n(\theta) + P(\theta; \lambda_n), \qquad (3)$$

where P(θ; λn) is a penalty function that penalizes coefficient estimates in groups with a tuning penalty parameter λn (Yuan and Lin, 2006).

Suppose we partition $\theta$ into g groups, $\theta_1, \ldots, \theta_g$. The penalty function is $P(\theta; \lambda_n) = \lambda_n \sum_{i=1}^g W_i \|\theta_i\|$, where $W_i$ is a penalty weight for group i. The weight $W_i$ can have the group size $p_i$ built in, as in Yuan and Lin (2006), and can be chosen adaptively, as in Zhang and Lu (2007). In particular, we use

$$W_i = \sqrt{p_i} / \|\tilde\theta_i\|, \qquad (4)$$

where $p_i$ is the size of group i and $\tilde\theta_i$ is an initial, consistent estimator of $\theta_i$. This weight penalizes more heavily if the group size $p_i$ is larger or if the norm $\|\tilde\theta_i\|$ is smaller.

To select significant variables and the temporal nature of their effects, we consider two ways to partition $\theta$. The first way puts each row of $\Theta$ into a single group, which leads to the penalty function

$$P(\theta; \lambda_n) = \lambda_n \sum_{j=1}^p W_j \|\Theta_j\|. \qquad (5)$$

This penalty treats $\Theta_j$ as one whole group, without distinguishing whether $\beta_j$ can be described by a time-independent effect. We call the penalty in (5) the combined penalty because each covariate coefficient is penalized via a single penalty term. Significant variables can be selected, but all selected variables are bound to have time-varying coefficients. The second way further separates each $\Theta_j$ into two groups, a time-independent part $\Theta_{j,1}$ and a time-varying part $\Theta_{j,-1}$, $j = 1, \ldots, p$. The penalty function is

$$P(\theta; \lambda_n) = \lambda_n \sum_{j=1}^p \left\{ W_{j1} |\Theta_{j,1}| + W_{j2} \|\Theta_{j,-1}\| \right\}, \qquad (6)$$

where $W_{j1}$ and $W_{j2}$ are weights as in (4), computed under the new partition of $\Theta_j$. This penalty is expected to pick up the difference between a time-varying and a time-independent coefficient whenever a covariate coefficient is selected to be nonzero. We call the penalty in (6) the separate penalty because the overall level and the temporal changes of each covariate coefficient are penalized separately. When $\Theta_{j,-1}$ is zero and $\Theta_{j,1}$ is nonzero, $\beta_j(t)$ is time-independent. When both $\Theta_{j,-1}$ and $\Theta_{j,1}$ are nonzero, $\beta_j(t)$ is time-varying. It is possible for $\Theta_{j,1}$ to be zero while $\Theta_{j,-1}$ is nonzero, in which case the coefficient $\beta_j(t)$ crosses zero.

Our model selection procedure is summarized as follows.

  1. Minimize (3) with the combined penalty (5) and weights $W_j = \sqrt{q}$, j = 1, …, p, to obtain $\tilde\theta$.

  2. Minimize (3) with combined penalty (5) and weight Wj, j = 1, …, p, computed from (4).

  3. Minimize (3) with separate penalty (6) and weight Wj1 and Wj2, j = 1, …, p, computed from (4).

The last step accomplishes the tasks of selecting significant variables and selecting the temporal nature of their effects at the same time. The second step is not strictly necessary; it is listed here to allow comparison of results from the combined penalty with those from the separate penalty.
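A schematic R sketch of this three-step flow follows; `fit_penalized` is a hypothetical stand-in for a routine that minimizes (3) by the algorithm of Section 3, and only the weight construction follows (4) exactly.

```r
## Hypothetical driver for the three-step procedure; fit_penalized()
## stands in for a minimizer of (3) and returns the p x q matrix Theta.
p <- ncol(X)

## Step 1: combined penalty, non-adaptive weights sqrt(q).
fit1 <- fit_penalized(time, status, X, Fmat, penalty = "combined",
                      weights = rep(sqrt(q), p), lambda = lambda)
Theta0 <- fit1$Theta                     # initial estimator theta-tilde

## Step 2 (optional): combined penalty with adaptive weights from (4).
w_comb <- sqrt(q) / sqrt(rowSums(Theta0^2))
fit2 <- fit_penalized(time, status, X, Fmat, penalty = "combined",
                      weights = w_comb, lambda = lambda)

## Step 3: separate penalty; per covariate, a size-1 group (intercept
## part) and a size-(q - 1) group (time-varying part), weighted by (4).
w_int <- 1 / abs(Theta0[, 1])            # sqrt(1) / |Theta_{j,1}|
w_tv  <- sqrt(q - 1) / sqrt(rowSums(Theta0[, -1, drop = FALSE]^2))
fit3 <- fit_penalized(time, status, X, Fmat, penalty = "separate",
                      weights = cbind(w_int, w_tv), lambda = lambda)
```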

3. Computation

3.1 Iterative Group Shooting Algorithm

We propose an iterative group shooting algorithm to minimize $Q_{\lambda_n}(\theta)$ in (3). For a fixed penalty parameter $\lambda_n$ and fixed weights $W_j$, j = 1, …, g, the algorithm is an adaptation of the iterative reweighted least squares (IRLS) procedure (Tibshirani, 1997; Zhang and Lu, 2007) to group penalties. Let $G = -\nabla \ell_n(\theta) = -\partial \ell_n(\theta)/\partial \theta$ and $H = -\nabla^2 \ell_n(\theta) = -\partial^2 \ell_n(\theta)/\partial\theta\,\partial\theta^\top$. Let $H = X^\top X$ be the Cholesky decomposition of H, and define the pseudo response vector $Y = (X^\top)^{-1}(H\theta - G)$. Then, a quadratic approximation of $Q_{\lambda_n}(\theta)$ is

$$\frac{1}{2} (Y - X\theta)^\top (Y - X\theta) + \lambda_n \sum_{j=1}^g W_j \|\theta_j\|. \qquad (7)$$

This is a penalized least squares problem. A necessary and sufficient condition for $\theta$ to be a solution of (7) is (Yuan and Lin, 2006)

$$-X_j^\top (Y - X\theta) + \lambda_j \frac{\theta_j}{\|\theta_j\|} = 0, \qquad \theta_j \neq 0, \qquad (8)$$
$$\left\| X_j^\top (Y - X\theta) \right\| \le \lambda_j, \qquad \theta_j = 0, \qquad (9)$$

where $\lambda_j = \lambda_n W_j$. The closed-form solution of Yuan and Lin (2006) is not applicable because $X$ is not group orthonormal; $X$ is a triangular matrix from a Cholesky decomposition.

The condition (8) is equivalent to

$$S_j = \left( X_j^\top X_j + \frac{\lambda_j}{\|\theta_j\|} I_{p_j} \right) \theta_j, \qquad (10)$$

where $S_j = X_j^\top (Y - X\theta_{-j})$, with $\theta_{-j} = (\theta_1, \ldots, \theta_{j-1}, 0, \theta_{j+1}, \ldots, \theta_g)$. Consider the iteration

$$\theta_j^{(1)} = \left( X_j^\top X_j + \frac{\lambda_j}{\|\theta_j^{(0)}\|} I_{p_j} \right)^{-1} S_j. \qquad (11)$$

This iteration is similar to the unified algorithm of Fan and Li (2002), except that it is carried out group by group, as in the shooting algorithm of Fu (1998). When $X_j^\top X_j = I_{p_j}$ indeed holds, it reduces to the closed-form solution of Yuan and Lin (2006).

Our iterative shooting algorithm is summarized as follows.

  1. Initialize with θ(0).

  2. For each j = 1, …, g, obtain θj(1) from
    $$\theta_j^{(1)} = \begin{cases} \left( X_j^\top X_j + \dfrac{\lambda_j}{\|\theta_j^{(0)}\|} I_{p_j} \right)^{-1} S_j, & \|S_j\| > \lambda_j, \\ 0, & \|S_j\| \le \lambda_j. \end{cases}$$
  3. Let $\theta_j^{(0)} = \theta_j^{(1)}$ and repeat until convergence.

Note that when updating $\theta_j$, $S_j$ is computed with the most recent values of the other groups $\theta_{-j}$.

This algorithm does not have the drawback that once a coefficient is shrunk to zero it stays at zero (Fan and Li, 2002, p. 1354), because in each iteration every coefficient group is re-checked for being nonzero based on the most recent estimates. Also, since the algorithm can be considered a special case of the block coordinate descent method, it is guaranteed to converge to a local minimizer (Tseng, 2001; Tseng and Yun, 2009). Because the negative log partial likelihood and the penalty function are both convex, the algorithm converges to a global minimizer. In our simulation studies, the algorithm usually converges in a few steps under a moderate tolerance, with starting values obtained from the fit at the previous λ value.
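For concreteness, a minimal, self-contained R sketch of the group shooting sweep for the penalized least squares problem (7) follows. It is an illustration of conditions (8)-(9) and update (11), not the authors' released code; in the IRLS context, X and Y would be the Cholesky factor and pseudo response of the current quadratic approximation, and the small `eps` guard against a zero-norm previous iterate is our own simplification.

```r
## One sweep-until-convergence solver for (7):
## min 0.5 ||Y - X theta||^2 + sum_j lam_j ||theta_j||.
## `grp` maps each column of X to a group label in 1..g; lam[g] = lambda * W_g.
group_shooting <- function(X, Y, grp, lam, tol = 1e-6, maxit = 100,
                           eps = 1e-8) {
  theta <- rep(0.1, ncol(X))                      # nonzero starting values
  for (it in seq_len(maxit)) {
    theta_old <- theta
    for (g in unique(grp)) {
      j <- which(grp == g)
      ## S_j = X_j' (Y - X theta_{-j}), with theta_j set to zero.
      Sj <- crossprod(X[, j, drop = FALSE],
                      Y - X[, -j, drop = FALSE] %*% theta[-j])
      if (sqrt(sum(Sj^2)) <= lam[g]) {
        theta[j] <- 0                             # condition (9): drop group
      } else {
        nrm <- max(sqrt(sum(theta[j]^2)), eps)    # guard against division by 0
        A <- crossprod(X[, j, drop = FALSE]) + diag(lam[g] / nrm, length(j))
        theta[j] <- solve(A, Sj)                  # update (11)
      }
    }
    if (max(abs(theta - theta_old)) < tol) break  # sweep converged
  }
  theta
}
```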

3.2 Choosing the Tuning Parameter

The tuning penalty parameter $\lambda_n$ is chosen by generalized cross-validation (GCV) (Craven and Wahba, 1979). We illustrate with the combined penalty function (5). The minimizer of (7) can be approximated by a ridge solution $(H + \lambda_n D)^{-1} X^\top Y$, where $D = \mathrm{diag}\{(W_1/\|\theta_1\|) I_{(p_1)}, \ldots, (W_g/\|\theta_g\|) I_{(p_g)}\}$. The number of effective parameters is then approximated by $p(\lambda_n) = \mathrm{tr}\{(H + \lambda_n D)^{-1} H\}$, and the GCV function is approximated by

$$\mathrm{GCV}(\lambda_n) = \frac{-\ell_n(\hat\theta)}{n \{1 - p(\lambda_n)/n\}^2}.$$

The optimal λn is chosen as the minimizer of GCV over a grid of λn values.
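Schematically, the grid search can look like the following R sketch, where `fit_penalized` is again the hypothetical routine from Section 2 and its returned components `H`, `D`, and `loglik` are assumptions, not a real API.

```r
## GCV grid search for lambda_n (illustrative only).
lambdas <- exp(seq(log(0.01), log(10), length.out = 50))
gcv <- sapply(lambdas, function(lam) {
  fit <- fit_penalized(time, status, X, Fmat, penalty = "combined",
                       weights = w_comb, lambda = lam)
  ## effective number of parameters: p(lam) = tr{(H + lam D)^{-1} H}
  p_eff <- sum(diag(solve(fit$H + lam * fit$D, fit$H)))
  -fit$loglik / (n * (1 - p_eff / n)^2)           # GCV(lambda)
})
lambda_opt <- lambdas[which.min(gcv)]
```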

The flexibility of the B-spline basis is determined by its degrees of freedom q, which are in turn determined by the number and locations of the interior knots. In our implementation, we used quadratic B-splines with interior knots either equally spaced or placed at sample quantiles of the observed failure times.

3.3 Likelihood Derivatives Evaluation

To minimize (3) using the iterative group shooting algorithm, efficient evaluation of the derivatives of the log partial likelihood function (2) is needed. A naive approach using standard software for time-dependent covariates is to construct $p \times q$ pseudo time-dependent covariates $X_i \otimes F(t)$, where $\otimes$ is the Kronecker product. This is computationally expensive even for moderate sample sizes because the pseudo covariates need to be constructed at each observed event time.

Taking advantage of the Kronecker product structure, the fast routine suggested by Perperoglou et al. (2006) can be used. The gradient of (2) is
$$\nabla \ell_n(\theta) = \sum_{i=1}^n \Delta_i \{X_i - \bar X_i(\Theta)\} \otimes F(\tilde T_i),$$
where
$$\bar X_i(\Theta) = \frac{\sum_{j \in R_i} X_j \exp\{X_j \Theta F(\tilde T_i)\}}{\sum_{j \in R_i} \exp\{X_j \Theta F(\tilde T_i)\}}$$
is the mean of the covariates $X_j$ over the risk set $R_i$, weighted by $\exp\{X_j \Theta F(\tilde T_i)\}$. The Hessian matrix is
$$\nabla^2 \ell_n(\theta) = -\sum_{i=1}^n \Delta_i C_i(\Theta) \otimes \{F(\tilde T_i) F(\tilde T_i)^\top\},$$
where $C_i(\Theta)$ is the covariance matrix of the covariate vectors $X_j$ over the risk set $R_i$, again weighted by $\exp\{X_j \Theta F(\tilde T_i)\}$. As shown by Perperoglou et al. (2006) and in the numerical study of this article, this formulation is very efficient even for large sample sizes.
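The following R sketch (our own illustration, assuming no ties and $\theta$ equal to $\Theta$ vectorized by row) spells out these formulas directly.

```r
## Gradient and Hessian of the log partial likelihood (2) via the
## Kronecker structure; Fbasis is the n x q matrix with row i = F(T_i).
partial_lik_derivs <- function(Theta, X, time, status, Fbasis) {
  pq   <- ncol(X) * ncol(Fbasis)
  grad <- numeric(pq)
  hess <- matrix(0, pq, pq)
  for (i in which(status == 1)) {
    risk <- which(time >= time[i])                       # risk set R_i
    eta  <- drop(X[risk, , drop = FALSE] %*% Theta %*% Fbasis[i, ])
    w    <- exp(eta) / sum(exp(eta))                     # weights on R_i
    xbar <- drop(crossprod(X[risk, , drop = FALSE], w))  # weighted mean
    Xc   <- sweep(X[risk, , drop = FALSE], 2, xbar)      # centered covariates
    Ci   <- crossprod(Xc * sqrt(w))                      # weighted covariance
    Fi   <- Fbasis[i, ]
    grad <- grad + kronecker(X[i, ] - xbar, Fi)          # term of the gradient
    hess <- hess - kronecker(Ci, tcrossprod(Fi))         # term of the Hessian
  }
  list(gradient = grad, hessian = hess)
}
```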

3.4 Variance Estimation

Following Fan and Li (2002), when the algorithm converges, the estimator satisfies the iteration
$$\hat\theta^{(1)} = \hat\theta^{(0)} - \left\{ -\nabla^2 \ell_n(\hat\theta^{(0)}) + \Sigma(\hat\theta^{(0)}; \lambda_n) \right\}^{-1} \left\{ -\nabla \ell_n(\hat\theta^{(0)}) + U(\hat\theta^{(0)}; \lambda_n) \right\},$$
where
$$\Sigma(\theta; \lambda_n) = \mathrm{diag}\left\{ \frac{\lambda_n W_1}{\|\theta_1\|} I_{(p_1)}, \ldots, \frac{\lambda_n W_g}{\|\theta_g\|} I_{(p_g)} \right\},$$
$I_{(k)}$ is the identity matrix of dimension k, and
$$U(\theta; \lambda_n) = \Sigma(\theta; \lambda_n)\, \theta.$$
The corresponding sandwich formula can be used as an estimator of the covariance of $\hat\theta_{NZ}$, the nonzero components of $\hat\theta$. That is, $\widehat{\mathrm{cov}}(\hat\theta_{NZ}) = A^{-1} B A^{-1}$, where
$$A = -\nabla^2 \ell_n\{(\hat\theta_{NZ}, 0)\} + \Sigma(\hat\theta_{NZ}; \lambda_n)$$
and $B = \widehat{\mathrm{cov}}[\nabla \ell_n\{(\hat\theta_{NZ}, 0)\}]$, both restricted to the nonzero components.

Once the variance estimator of $\hat\theta$ is obtained, the variance estimator of a nonzero coefficient $\beta_j(t)$ is $F(t)^\top \widehat{\mathrm{cov}}(\hat\Theta_j) F(t)$, where $\widehat{\mathrm{cov}}(\hat\Theta_j)$ is the block for $\hat\Theta_j$ extracted from the estimated covariance matrix of $\hat\theta$. This estimator can be used to construct pointwise confidence intervals for $\beta_j(t)$.
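As a sketch, with `H`, `Sigma_lam`, `B`, `cov_Theta_j`, and `beta_j_hat` standing for quantities defined above but computed elsewhere (they are assumptions of this illustration, not outputs of a real API), the sandwich covariance and a pointwise interval can be assembled as follows.

```r
## Sandwich covariance for the nonzero block of theta-hat; H is the
## negative Hessian, Sigma_lam is Sigma(theta; lambda_n), B the estimated
## covariance of the score, and nz indexes the nonzero components.
A     <- H[nz, nz] + Sigma_lam[nz, nz]
covNZ <- solve(A) %*% B %*% solve(A)

## Pointwise 95% CI for a selected beta_j(t) = Theta_j F(t); cov_Theta_j
## is the q x q block of covNZ for Theta_j, beta_j_hat its fitted curve.
Fg    <- cbind(1, bs(tgrid, degree = 2, df = q - 1, intercept = FALSE,
                     Boundary.knots = c(0, tau)))
se_j  <- sqrt(rowSums((Fg %*% cov_Theta_j) * Fg))  # F(t)' cov(Theta_j) F(t)
ci_lo <- beta_j_hat - 1.96 * se_j
ci_hi <- beta_j_hat + 1.96 * se_j
```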

4. Numerical Studies

Simulations were conducted to study the finite sample performance of the proposed adaptive group lasso with B-splines. In particular, we want to check whether the proposed method can 1) correctly pick out the important variables (in the model or not); and 2) correctly identify the form of each important variable's effect (time-varying versus time-independent).

Four factors are considered in our simulation design: number of covariates (10 and 20), censoring percentage cp (20% and 40%), sample size n (200 and 400), and effect scale s (1 and 2). The effect scale — a multiplier on all the coefficients — is designed to study the influence of effect size or signal level on the performance of the proposed methods.

Event times are generated from the varying-coefficient Cox model (1) with a time-independent covariate vector X and coefficients $\beta(t)$ whose nonzero components are $\beta_2(t) = -s\{1 + \cos(\pi t)\} I(0 < t < 1)$, $\beta_3(t) = s\{0.5 + \sin(\pi t/2)\}$, and $\beta_8(t) = -s$; see Figure 1 for $t \in (0, 2)$. That is, of the 10 or 20 covariates, the 2nd and 3rd have time-varying coefficients, the 8th has a time-independent coefficient, and the rest have coefficient zero. Note that $\beta_2(t)$ diminishes to zero at t = 1 and remains zero afterwards, which makes model selection and estimation harder. The baseline hazard function also has the effect scale s built in: $h_0(t) = \exp\{-s \cos(\pi t/2)\}$. The covariate vector X is generated from a multivariate normal distribution whose marginals are all N(0, 0.5) and whose pairwise correlation coefficient is $0.5^{|j-k|}$ for pair (j, k). Censoring times are generated from a mixture of a uniform distribution over (0, 2) and a point mass at 2, with the mixing probability calibrated to yield the desired censoring percentage cp. For each scenario, 100 datasets are generated.
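A minimal R sketch of this data-generating scheme follows (ours, not the authors' code; the mixing probability `pmix` is illustrative and would be calibrated by trial to hit the target cp).

```r
set.seed(1)
n <- 400; p <- 20; s <- 1

## Covariates: multivariate normal, N(0, 0.5) marginals, corr 0.5^|j-k|.
Sigma <- 0.5 * 0.5^abs(outer(1:p, 1:p, "-"))
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

## Nonzero coefficients beta_2, beta_3, beta_8 as functions of time.
beta_t <- function(t) {
  b <- matrix(0, length(t), p)
  b[, 2] <- -s * (1 + cos(pi * t)) * (t > 0 & t < 1)
  b[, 3] <-  s * (0.5 + sin(pi * t / 2))
  b[, 8] <- -s
  b
}
hazard <- function(t, x) exp(-s * cos(pi * t / 2)) * exp(drop(beta_t(t) %*% x))

## Event times by inverting the cumulative hazard: H(T | x) = -log(U).
rsurv <- function(x) {
  target <- -log(runif(1))
  H <- function(t) integrate(function(u) hazard(u, x), 0, t)$value
  if (H(2) < target) return(Inf)               # no event within (0, 2)
  uniroot(function(t) H(t) - target, c(1e-8, 2))$root
}
Tev <- apply(X, 1, rsurv)

## Censoring: mixture of Uniform(0, 2) and a point mass at 2.
pmix   <- 0.5                                  # calibrate by trial for cp
Cens   <- ifelse(runif(n) < pmix, runif(n, 0, 2), 2)
time   <- pmin(Tev, Cens)
status <- as.numeric(Tev <= Cens)
```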

Figure 1.

Estimated curves (gray) of the three nonzero coefficients from 100 replicates in the scenario with sample size n = 400, censoring percentage cp = 40%, and 20 covariates, when $\beta_2(t) = -s\{1 + \cos(\pi t)\} I(0 < t < 1)$. The dark lines are the true curves. The dashed lines are the averages of the 100 estimates. The dotted lines are the pointwise 95% confidence intervals.

Given a simulated dataset, we use quadratic B-splines with 5 degrees of freedom for each covariate coefficient, with equally spaced knots in the time window (0, 2). This gives two equally spaced interior knots in (0, 2). Model selection results are obtained from the combined penalty (5) and the separate penalty (6), denoted Method 1 and Method 2, respectively. For comparison, we also report the model selection results from the adaptive lasso with all covariate coefficients specified as time-independent, denoted Method 0.

Tables 1 and 2 summarize the variable selection results for 10 and 20 covariates, respectively, regardless of the temporal nature of the effects. We report the frequency with which each variable is selected, the average number of groups selected (NG), and the average MSE over the 100 replicates. The "correct" NGs are 3, 3, and 5 for Methods 0, 1, and 2, respectively. The MSE at a specific time t is calculated as $\{\hat\beta(t) - \beta(t)\}^\top V \{\hat\beta(t) - \beta(t)\}$, where V is the population covariance matrix of the covariates. The reported MSE is the average of the pointwise MSE over an equally spaced grid of 100 points in the time interval (0, 2).
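For concreteness, the MSE computation can be sketched as follows, with `Theta_hat` standing for the fitted p × q coefficient matrix (an assumption of this illustration) and `beta_t`, `Sigma` taken from the generation sketch above.

```r
## Average pointwise MSE over 100 grid points in (0, 2):
## MSE(t) = {beta_hat(t) - beta(t)}' V {beta_hat(t) - beta(t)}, V = Sigma.
tgrid <- seq(0, 2, length.out = 102)[2:101]
Fg  <- cbind(1, bs(tgrid, degree = 2, df = 4, intercept = FALSE,
                   Boundary.knots = c(0, 2)))
err <- Fg %*% t(Theta_hat) - beta_t(tgrid)    # 100 x p matrix of errors
mse <- mean(rowSums((err %*% Sigma) * err))   # reported MSE
```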

Table 1.

Model selection results from 100 runs with 10 covariates. The three entries in each table cell are the counts of each variable being selected in time-independent-coefficient models (Method 0), combined-penalty-varying-coefficient models (Method 1), and separate-penalty-varying-coefficient models (Method 2), respectively.

n cp X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 NG MSE
s = 1
200 20 7 51 100 3 4 2 4 100 1 5 2.8 1.053
1 32 99 1 1 0 1 100 1 0 2.4 1.014
8 63 100 2 5 2 4 100 1 5 3.7 0.852
40 11 67 96 5 8 9 11 98 13 6 3.2 1.123
4 63 96 0 0 2 3 97 5 1 2.7 0.937
11 82 98 3 6 8 8 98 10 6 4.2 0.845
400 20 9 87 100 2 6 1 1 100 3 0 3.1 0.870
1 73 100 0 0 0 0 99 0 0 2.7 0.692
6 94 100 0 1 0 1 99 0 0 4.3 0.591
40 10 95 100 6 2 4 1 100 4 7 3.3 0.894
2 94 100 0 0 1 1 100 0 0 3.0 0.555
10 97 100 3 1 2 2 100 4 5 4.7 0.503
s = 2
200 20 6 98 100 8 9 7 8 100 7 4 3.5 3.421
0 87 100 0 0 1 1 100 1 0 2.9 2.030
2 98 100 3 2 3 2 100 3 0 4.8 1.631
40 10 100 100 21 12 10 5 100 15 11 3.8 3.491
0 100 100 0 0 0 1 100 1 0 3.0 1.442
4 100 100 9 2 2 3 100 8 4 5.1 1.311
400 20 14 100 100 2 7 4 3 100 3 3 3.4 3.230
0 100 100 0 0 1 0 100 0 0 3.0 0.800
1 100 100 1 2 2 0 100 1 1 5.0 0.780
40 19 100 100 15 10 12 17 100 12 20 4.0 3.289
0 100 100 0 0 0 0 100 0 0 3.0 0.774
4 100 100 2 0 1 2 100 3 0 5.0 0.706

NG: number of selected groups. MSE: mean squared error.

Table 2.

Model selection results from 100 runs with 20 covariates. The three entries in each table cell are the counts of each variable being selected in time-independent-coefficient models (Method 0), combined-penalty-varying-coefficient models (Method 1), and separate-penalty-varying-coefficient models (Method 2), respectively.

n cp X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 NG MSE
s = 1
200 20 8 51 100 6 3 3 6 100 3 2 6 3 4 5 8 6 3 4 1 4 3.3 1.046
1 30 100 2 0 0 2 98 1 0 1 0 0 0 0 0 0 0 0 1 2.4 1.037
7 56 100 6 5 1 5 100 3 1 7 2 6 2 5 3 1 2 3 4 3.9 0.889
40 6 69 96 7 7 6 9 100 9 6 6 4 7 9 9 4 6 8 4 7 3.8 1.125
2 64 97 3 3 2 5 100 1 3 2 0 2 1 1 0 2 0 1 4 2.9 0.922
7 80 97 6 8 4 5 100 9 6 6 3 4 9 5 3 3 6 5 9 4.8 0.856
400 20 2 95 100 1 5 5 4 100 4 4 0 0 4 1 3 3 1 3 0 5 3.4 0.827
0 82 100 0 0 0 1 100 0 0 0 0 0 0 0 0 0 0 0 0 2.8 0.614
1 98 100 0 0 2 2 100 0 0 0 0 2 0 2 0 0 1 1 2 4.5 0.510
40 8 97 100 3 4 3 6 100 5 5 3 6 8 5 6 4 9 5 6 7 3.9 0.883
0 98 100 0 0 0 1 100 0 0 0 0 0 0 0 1 0 0 1 0 3.0 0.512
5 100 100 1 3 4 3 100 1 4 1 4 7 3 4 3 6 3 5 4 5.2 0.447
s = 2
200 20 11 99 100 9 4 6 13 100 8 2 5 10 8 8 7 6 4 6 3 4 4.1 3.362
0 93 100 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 2.9 1.936
1 98 100 2 0 0 1 100 0 1 0 2 2 0 1 0 0 0 0 0 4.6 1.680
40 15 99 100 16 9 14 11 100 12 13 11 9 13 10 11 13 15 10 9 11 5.0 3.418
1 99 100 1 0 0 3 100 2 0 0 0 0 0 0 1 0 0 0 0 3.1 1.402
7 100 100 5 3 3 7 100 5 3 4 4 4 2 4 3 4 0 5 2 5.4 1.243
400 20 10 100 100 7 6 5 10 100 5 5 6 2 3 11 7 9 6 7 5 7 4.1 3.210
0 100 100 0 0 0 1 100 0 0 0 0 0 0 0 0 0 0 0 0 3.0 0.812
2 100 100 1 1 2 3 100 1 1 4 1 0 3 1 3 1 2 3 0 5.2 0.762
40 18 100 100 10 14 17 20 100 5 15 19 13 18 8 16 14 12 13 9 16 5.4 3.259
0 100 100 0 0 1 0 100 0 0 0 0 0 0 0 0 0 1 0 0 3.0 0.749
7 100 100 4 2 3 6 100 0 2 5 5 2 3 5 6 1 2 2 3 5.6 0.682

NG: number of selected groups. MSE: mean squared error.

In all scenarios, both Method 1 and Method 2 work reasonably well in selecting covariates X3 and X8. Covariate X2, with its diminishing effect, is difficult to select. It is selected more frequently by Method 2 than by Method 1, which is expected because the separate penalty can pick the effect up more easily as a time-independent one. As the effect scale s increases from 1 to 2, both Method 1 and Method 2 select X2 more often, while the selection of non-important variables becomes less frequent or at least no worse. This is not true for Method 0, which selects X2 more often but, at the same time, also selects non-important variables more often. All methods improve as the sample size increases. For sample size n = 200, Method 2 performs similarly to Method 0 in that non-important variables are over-selected. As the sample size increases, the advantage of Method 2 over Method 0 becomes evident, with less over-selection and smaller MSEs. For instance, in the scenario with s = 1, n = 400, and cp = 40% with 20 covariates, the MSE is 0.447 for Method 2 and 0.883 for Method 0; the MSE of Method 1, 0.512, lies in between. As censoring gets heavier, correct selection of X2 improves and the overall MSE decreases for both Method 1 and Method 2. This may be explained by the fact that, under heavier censoring, the proportion of events at earlier times is higher, which increases the chance that the early part of β2 is picked up as a negative time-independent effect.

It is of particular interest to check whether Method 2 can tell whether a coefficient is time-independent. Table 3 summarizes these results for the three variables with nonzero coefficients. We report the frequencies with which the intercept (Int) component and the time-varying (TV) component of each nonzero effect are selected. The performance of Method 2 improves as the effect scale or the sample size increases, with a much higher frequency of X2 being selected to have a time-varying effect. Covariate X3, which has a positive bump effect, is selected to have a time-varying effect about two-thirds of the time or more for s = 1, and almost always for s = 2. Covariate X8 is correctly selected to have a time-independent effect most of the time, even at sample size n = 200. For instance, consider again the scenario with s = 1, n = 400, and cp = 40% with 20 covariates. Variables X2 and X3 are correctly selected to have time-varying coefficients 84 and 80 times, respectively; variable X8 is incorrectly selected to have a time-varying coefficient only 4 times.

Table 3.

Time-varying selection results of separate-penalty-varying-coefficient models (Method 2) for variables 2, 3, and 8 from 100 runs.

              10 covariates                    20 covariates
         X2       X3       X8             X2       X3       X8
n    cp  Int TV   Int TV   Int TV         Int TV   Int TV   Int TV
s = 1
200 20 63 12 97 67 100 1 56 11 96 64 100 0
40 81 26 86 76 98 3 80 36 87 71 100 4
400 20 94 43 94 89 99 1 98 56 93 91 100 1
40 97 71 89 82 100 1 100 84 90 80 100 4
s = 2
200 20 98 79 85 98 100 1 98 78 75 100 100 0
40 100 95 83 97 100 4 100 97 79 99 100 3
400 20 100 100 85 100 100 2 100 100 90 100 100 3
40 100 100 86 100 100 3 100 100 88 100 100 10

Int: the intercept component. TV: the time-varying component.

Finally, to study recovery of the nonzero coefficients, we plot in Figure 1 the 100 estimated coefficient curves overlaid with the true curves for the scenario with n = 400 and cp = 40%, using both the combined penalty and the separate penalty. It is clear that the effect scale s plays an important role here. For the stronger signal (s = 2), the estimated curves are much closer to, and tighter around, the true curves for all three coefficients. In particular, for s = 2, estimates of the diminishing effect β2(t) recover the true curve reasonably well; for s = 1, however, the early negative effect is more visibly shrunken toward zero, and under the separate penalty many of the estimates are negative but time-independent. With the separate penalty, estimates of β3(t) are all time-varying for s = 2, but a noticeable number are time-independent for s = 1. The separate penalty performs very well in estimating the time-independent coefficient β8(t) and gives less bias than the estimates from the combined penalty. Comparing the estimated curves under the two penalties, it seems that when the true curve is time-independent, as with β8(t), the separate penalty gives lighter shrinkage toward zero and less variability; when the true curve is time-varying, as with β2(t) or β3(t), the combined penalty provides less variability. This is observed for both s = 1 and s = 2. The observation may be expected, since the separate penalty tries to achieve more than the combined penalty, and this comes at a cost: when the true curves are time-varying, there is a chance that the separate penalty fails to select the necessary intercept, as seen in Table 3.

Also plotted in Figure 1 are the averages of the 100 estimated coefficient curves and their pointwise 95% confidence intervals, constructed using the variance estimator of Section 3.4. The standard errors appear to underestimate the true variation, which may be related to the shrinkage in the estimation. Underestimation of the variation was also observed for Cox models with constant coefficients (Zhang and Lu, 2007). In our setting, the number of parameters in Θ is even larger and, hence, an even larger sample size is necessary for the asymptotic variance to provide a good approximation.

The performance of the methods is stress-tested further by replacing the diminishing effect β2(t) with a crosszero effect β2(t) = −s cos(πt/2), which makes the problem much harder since β2(t) integrates to zero over (0, 2). Results analogous to Tables 1–3 and Figure 1 are reported in the Web Appendix. In this study, the crosszero effect is very hard to pick up with s = 1; for example, with n = 400, cp = 40%, and 20 covariates in Web Table A.2, X2 is selected only 19 and 38 times out of 100 by Method 1 and Method 2, respectively. With s = 2, these frequencies increase to 96 and 98, and X2 is further selected as having a time-varying coefficient 94 and 98 times, respectively (Web Table A.3). The estimated β2(t) curves in Web Figure A.1 are not close to the true curve with s = 1. Nevertheless, with s = 2, the estimates recover the true curve reasonably well for both Method 1 and Method 2, albeit shrunken toward zero at the two endpoints, where the curve is farthest from zero. Observations about recovering β3(t) and β8(t) are similar to those from Figure 1. The poor performance of β̂2(t) for s = 1 is not a surprise, because the problem is a much harder one. As the signal gets stronger, our methods can be useful in detecting and estimating such crosszero effects.

5. The Primary Biliary Cirrhosis Data

We apply the proposed method to the primary biliary cirrhosis (PBC) data, which have been analyzed in the context of model selection for Cox models with time-independent coefficients (Tibshirani, 1997; Zhang and Lu, 2007). PBC is a rare but fatal chronic autoimmune liver disease, with a prevalence of about 50 cases per million population (Fleming and Harrington, 1991). The dataset contains the follow-up of 312 randomized and 106 unrandomized PBC patients at the Mayo Clinic between January 1974 and May 1984. The dependence of survival time on 17 covariates is studied in a Cox model with possibly time-varying coefficients. The survival time is the number of days between registration and the earlier of death or the study analysis time in 1986. We consider the 312 randomized patients and, after removing missing values, end up with 276 observations. The 17 covariates are, in the same order as in Tibshirani (1997): 1) trt, treatment indicator (1 = treatment); 2) age (in 10 years); 3) female, gender indicator (1 = female); 4) ascites, presence of ascites; 5) hepato, presence of hepatomegaly; 6) spiders, presence of spiders; 7) edema, severity of edema; 8) logbili, logarithm of serum bilirubin (mg/dl); 9) chol, serum cholesterol (mg/dl); 10) logalb, logarithm of albumin (g/dl); 11) copper, urine copper (mg/day); 12) alk.phos, alkaline phosphatase (U/l); 13) ast, aspartate aminotransferase (U/ml); 14) trig, triglycerides (mg/dl); 15) platelet, platelet count (per cubic ml/1000); 16) logprotime, logarithm of prothrombin time (sec); 17) stage, histologic stage of disease (graded 1, 2, 3, or 4). Note that we took logarithms of serum bilirubin, albumin, and prothrombin time because Tian et al. (2005) and Martinussen and Scheike (2002) found possibly time-varying coefficients for these covariates.

We first fit the Cox model with time-independent coefficients for all 17 covariates without any penalty. The inverses of the absolute values of these estimates were then used as weights in two adaptive procedures: the adaptive lasso (ALASSO) with time-independent coefficients and the proposed adaptive group lasso (AGLASSO) with B-splines. The ALASSO approach is the same as that of Zhang and Lu (2007), except that we took logs of the three aforementioned covariates. The AGLASSO approach allows time-varying coefficients and, for each covariate, penalizes the time-independent part and the time-varying part of the coefficient as separate groups. The B-spline basis is quadratic with 5 degrees of freedom over the time interval (0, 3200) days, where 3200 is approximately the 90th percentile of the observed event times. After a final model is selected by ALASSO or AGLASSO, we refit a Cox model without any penalty, treating the selected model as known. Table 4 summarizes the results. Both ALASSO and AGLASSO selected the same set of covariates. The only variable selected as having a time-varying coefficient was logbili, which is consistent with the finding of Tian et al. (2005). The estimate and pointwise 95% confidence interval of this coefficient are plotted in Figure 2. The estimated effect has a bump between days 1000 and 1500, after which it diminishes gradually. Two other covariates, logprotime and edema, were identified as possibly having time-varying coefficients by Tian et al. (2005), who started with only 5 of the 17 variables. Under AGLASSO, these two variables were selected as significant, but their effects were not found to be time-varying.
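The data preparation described above can be reproduced approximately with the pbc data shipped in the R package survival; the following is a sketch under that package's column naming, and the paper's exact data file may differ.

```r
library(survival)

## The 312 randomized patients have non-missing trt in survival::pbc.
d <- subset(pbc, !is.na(trt))
d <- within(d, {
  logbili    <- log(bili)
  logalb     <- log(albumin)
  logprotime <- log(protime)
  female     <- as.numeric(sex == "f")
  age10      <- age / 10                  # age in 10 years
})
vars <- c("trt", "age10", "female", "ascites", "hepato", "spiders",
          "edema", "logbili", "chol", "logalb", "copper", "alk.phos",
          "ast", "trig", "platelet", "logprotime", "stage")
d <- d[complete.cases(d[, vars]), ]
nrow(d)                                   # 276 complete cases
surv_time <- d$time                       # days from registration
death     <- as.numeric(d$status == 2)    # status 2 = death in survival::pbc
```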

Figure 2.

Time-varying coefficient estimate of the covariate logbili.

6. Discussion

Variable selection for semiparametric models differs from traditional variable selection for linear models in that the temporal nature of each selected variable's coefficient needs to be selected as well. The method of Li and Liang (2008), developed for generalized varying-coefficient partially linear models, could be extended to varying-coefficient Cox models, in which case nonparametric coefficients would be fitted with kernel-based local partial likelihood. Nevertheless, this method assumes a priori knowledge about which covariates have varying coefficients. Our nonparametric coefficients are fitted with smooth functions expanded over a B-spline basis. By penalizing a time-independent part and a time-varying part separately for each coefficient, our adaptive group lasso approach not only selects significant variables but also identifies which of them have varying coefficients. This is important for practitioners who do not have prior knowledge or are not willing to make assumptions about the functional form of the covariate coefficients. Our simulation studies show rather good results for sample sizes as large as 400 with moderate censoring in selecting 3 important variables out of 20. A working version of our implementation, as an informal R package, is available upon request.

Our focus is on the methodology, its computational implementation, and numerical evaluation of its performance. An important question that we have not addressed is the estimation and selection consistency of the proposed method. This is an interesting and challenging problem, especially if p is allowed to diverge with n. We conjecture that the procedure can correctly distinguish time-varying and time-independent covariate effects as the sample size goes to infinity, in light of the results of Huang et al. (2010) for nonparametric additive models. A rigorous proof, however, is not straightforward. The main difficulty arises from the fact that the log partial likelihood is not a sum of independent terms; therefore, the tools from empirical process theory (e.g., maximal inequalities for independent random variables) are not applicable. The martingale method that is effective in studying Cox models with time-independent covariates does not apply to the current problem either. Research to carefully address all the technical details is warranted.

The proposed method raises several further questions. A factor with multiple degrees of freedom would lead to a collection of groups, each formed by the spline basis coefficients corresponding to one degree of freedom. For instance, the histologic stage of disease in our analysis of the PBC data was treated as a numerical variable, but it could, perhaps even preferably, be treated as a factor. A naive solution would be to treat all the groups as one big group and then apply the proposed method; this way, all contrasts of the factor are either in or out of the model together. A better solution would be to apply different penalties at different levels of grouping, similar to the bi-level penalty of Breheny and Huang (2009). It is known that a nonlinear effect in a Cox model may be mis-identified as a time-varying effect (Therneau and Grambsch, 2000). Model selection with nonlinear effects may be done with the fractional polynomials approach (Royston and Altman, 1994; Sauerbrei and Royston, 1999); a sensitivity study of the performance of the proposed method under Cox models with nonlinear effects would be interesting. Comparison with nonconcave penalty approaches, such as group SCAD (Fan and Li, 2001) and the group minimax concave penalty (MCP) (Zhang, 2010), is of great interest as always. Our computing algorithm, however, is built on the Karush–Kuhn–Tucker conditions of the group lasso (Yuan and Lin, 2006), which makes it nontrivial to adapt to SCAD and MCP. The coordinate descent algorithms for group SCAD and group MCP of Breheny and Huang (2011) may be extended to handle groups of basis coefficients and the context of Cox models with varying coefficients. Such extensions, their implementation, and their numerical performance, however, deserve separate manuscripts of their own.


Table 4.

Estimated coefficients and standard errors from ML, ALASSO, AGLASSO for the PBC data. Results for ALASSO and AGLASSO were obtained from refitting the selected model without penalty.

                ML                ALASSO            AGLASSO
Covariate       Coef    Std.Err   Coef    Std.Err   Coef    Std.Err
trt −0.062 0.211
age 0.261 0.113 0.270 0.124 0.263 0.126
female −0.256 0.317
ascites 0.162 0.381
hepato −0.100 0.254
spiders 0.049 0.243
edema 0.926 0.378 0.842 0.410 0.932 0.443
logbili 0.723 0.162 0.699 0.115 See Figure 2
chol 0.000 0.000
logalb −2.270 0.947 −2.538 0.762 −2.440 0.789
copper 1.694 1.251 2.218 1.236 2.089 1.261
alk.phos 0.000 0.000
ast 0.003 0.002
trig −0.002 0.001
platelet 0.001 0.001
logprotime 2.335 1.321 2.099 1.241 1.822 1.249
stage 0.381 0.176 0.274 0.140 0.278 0.143

Acknowledgments

Yan’s research was partially supported by U.S. National Science Foundation grant DMS 0805965. Huang’s research was partially supported by NIH grants R01CA120988 and R01CA142774 and NSF grant DMS 0805670. The computing was facilitated by a Beowulf cluster at the Department of Statistics, University of Connecticut, acquired under the partial support of NSF SCREMS grant 0723557.

Footnotes

Supplementary Materials

The Web Appendix, tables, and figures referenced in Section 4 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

References

  1. Breheny P, Huang J. Penalized methods for bi-level variable selection. Statistics and Its Interface. 2009;2:369–380. doi: 10.4310/sii.2009.v2.n3.a10.
  2. Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics. 2011;5:232–253. doi: 10.1214/10-AOAS388.
  3. Cai Z, Sun Y. Local linear estimation for time-dependent coefficients in Cox’s regression models. Scandinavian Journal of Statistics. 2003;30:93–111.
  4. Craven P, Wahba G. Smoothing noisy data with spline functions. Numerische Mathematik. 1979;31:377–403.
  5. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  6. Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. The Annals of Statistics. 2002;30:74–99.
  7. Fan J, Zhang C, Zhang J. Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics. 2001;29:153–193.
  8. Fleming TR, Harrington DP. Counting Processes and Survival Analysis. John Wiley & Sons; 1991.
  9. Fu WJ. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
  10. Huang J, Horowitz JL, Wei F. Variable selection in nonparametric additive models. Annals of Statistics. 2010;38:2282–2313. doi: 10.1214/09-AOS781.
  11. Leng C. A simple approach for varying-coefficient model selection. Journal of Statistical Planning and Inference. 2009;139:2138–2146.
  12. Leng C, Zhang HH. Model selection in nonparametric hazard regression. Journal of Nonparametric Statistics. 2006;18:417–429.
  13. Li R, Liang H. Variable selection in semiparametric regression modeling. Annals of Statistics. 2008;36:261–286. doi: 10.1214/009053607000000604.
  14. Lin Y, Zhang HH. Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics. 2006;34:2272–2297.
  15. Martinussen T, Scheike TH. A flexible additive multiplicative hazard model. Biometrika. 2002;89:283–298.
  16. Perperoglou A, le Cessie S, van Houwelingen HC. A fast routine for fitting Cox models with time varying effects of the covariates. Computer Methods and Programs in Biomedicine. 2006;25:154–161. doi: 10.1016/j.cmpb.2005.11.006.
  17. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2011.
  18. Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: Parsimonious parametric modelling. Applied Statistics. 1994;43:429–467.
  19. Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: Transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society: Series A (Statistics in Society). 1999;162:71–94.
  20. Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. Springer-Verlag; 2000.
  21. Tian L, Zucker D, Wei L. On the Cox model with time-varying regression coefficients. Journal of the American Statistical Association. 2005;100:172–183.
  22. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B: Methodological. 1996;58:267–288.
  23. Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
  24. Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications. 2001;109:475–494.
  25. Tseng P, Yun S. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications. 2009;140:513–535.
  26. Wang L, Li H, Huang JZ. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association. 2008;103:1556–1569. doi: 10.1198/016214508000000788.
  27. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B: Statistical Methodology. 2006;68:49–67.
  28. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics. 2010;38:894–942.
  29. Zhang HH, Cheng G, Liu Y. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association. 2011. Forthcoming. doi: 10.1198/jasa.2011.tm10281.
  30. Zhang HH, Lu W. Adaptive lasso for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
  31. Zucker DM, Karr AF. Nonparametric survival analysis with time-dependent covariate effects: A penalized partial likelihood approach. The Annals of Statistics. 1990;18:329–353.
