Abstract
The Cox proportional hazard (PH) model is widely used to determine the effects of risk factors and treatments (covariates) on the survival time of subjects that might be right censored. The selection of covariates depends crucially on the specific form of the conditional hazard model, which is often assumed to be PH, accelerated failure time (AFT) or proportional odds (PO). However, we show that none of these semi-parametric models allows for the crossing of the survival functions and hence such strong assumptions may adversely affect the selection of variables. Moreover, the most commonly used PH assumption may also be violated when there is a delayed effect of the risk factors. Taking into account all of these modeling assumptions, this study examines the effect of the PH assumption on covariate selection when the data generating model may be non-PH. In particular, variable selection under two alternative models is explored: (i) the penalized PH model (using the elastic-net penalty) and (ii) the linear spline based hazard regression model. We apply the aforementioned models to the ACTG-175 data set and simulated data sets with survival times generated from the Weibull and log-normal distributions. We also examine the effect on covariate selection of stratifying the analysis on the off-treatment indicator.
Keywords: AIDS trials, Crossing survival curves, Hazard regression, Penalized regression
1. Introduction
The Proportional Hazard (PH) regression model is perhaps the most widely used statistical technique to analyze survival data with censored values. Let T denote the (random) survival time of a subject with baseline vector of predictor variables (often called covariates) x⊤ = (x1, …, xp). The conditional survival function S(t|x) is then defined as
$$S(t \mid x) = \Pr(T > t \mid X = x),$$
which provides the probability that a subject’s survival time exceeds t units of time given a set of (baseline) characteristics represented by X = x when the subject enters a clinical study. For example, if T denotes the survival time of AIDS patients receiving a treatment denoted by X, where X = j denotes the j-th treatment type for j = 0, 1, the conditional survival functions S0(t) = S(t|x = 0) and S1(t) = S(t|x = 1) correspond to the two treatment groups. Often it is of interest to compare the survival functions between two groups by computing their ratios or, equivalently, by comparing the ratios of the conditional cumulative hazard function H(t|x) = −log S(t|x) across different baseline covariate vectors x. In the two-treatment-group scenario, we can compute the ratio γ(t) = H1(t)/H0(t), where Hj(t) = −log Sj(t) for j = 0, 1. As the ratio is a function of time t, a simplifying assumption is often made that γ(t) = γ0 for all t, i.e., the ratio is fixed across all time points. This simplifying condition is known as the PH assumption. Notice that by differentiating the cumulative hazard, we can express this assumption in terms of the conditional hazard function, which leads to the name PH and the equivalent condition h(t|x) = γ(x)h0(t), where the proportionality factor γ(x) depends only on the covariate vector x and not on time t.
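As a worked step, the equivalence between the cumulative-hazard form and the hazard form of the PH assumption in the two-group case can be written out as follows. This uses only the definitions above, together with the standard assumption that the cumulative hazards are differentiable.

$$
H_1(t) = \gamma_0 H_0(t) \;\Longrightarrow\; h_1(t) = \frac{d}{dt} H_1(t) = \gamma_0\, h_0(t),
\qquad
S_1(t) = \exp\{-H_1(t)\} = \exp\{-\gamma_0 H_0(t)\} = S_0(t)^{\gamma_0}.
$$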
Moreover, in practice, one makes another simplifying assumption about the proportionality factor and assumes that γ(x) = exp(x⊤β) for some (unknown) regression parameter vector β⊤ = (β1, …, βp). The goal then becomes to estimate the regression parameter vector based on a set of observations. However, the subject’s survival time T is often not fully observed, as the clinical study may be stopped early at time C (measured from the subject’s entry into the study) due to budget or other unrelated causes, and we may only get to observe U = min{T, C} and Δ = I(T ≤ C), where Δ indicates whether the survival time T is observed (Δ = 1) or censored (Δ = 0).
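The right-censoring mechanism described above can be illustrated with a minimal Python sketch. The exponential rates are hypothetical values chosen only so that roughly three quarters of the times are censored; they are not estimated from any data set, and the helper name `observe` is ours.

```python
import random

def observe(t, c):
    """Return (U, Delta) for a true survival time t and a censoring time c."""
    u = min(t, c)
    delta = 1 if t <= c else 0  # 1 = event observed, 0 = right censored
    return u, delta

random.seed(1)
# Hypothetical exponential survival and censoring times (illustrative rates only).
pairs = [observe(random.expovariate(0.2), random.expovariate(0.6)) for _ in range(1000)]
censored_fraction = 1 - sum(d for _, d in pairs) / len(pairs)
print(round(censored_fraction, 2))
```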
Our goal is often to estimate the conditional survival function S(t|x) or, equivalently, the (conditional) cumulative hazard function H(t|x) or the (conditional) hazard function h(t|x), with or without the PH assumption. In many clinical studies such as ACTG-175 (described later in Section 2), many covariates are measured at the baseline as potential risk factors; one of the primary goals is then to determine which among these potential risk factors denoted by the vector X⊤ = (x1, …, xp) significantly (statistically speaking) affect the survival times.
The determination of significant risk factors (i.e., a subset of variables among the p baseline covariates), is known as the variable selection problem. Clearly, if the variables are selected under a specific simplifying assumption (such as PH), the selected subset might be different than if another simplifying assumption is used. The primary goal of this paper is to explore (through empirical data analysis) to what extent a specific modeling assumption (e.g., PH) affects the selection of covariates. Towards this end, instead of theoretical investigations, we have taken more of an exploratory approach to compare several existing models (i.e., standard PH model, penalized PH model and linear spline based hazard regression model) to see the effect of modeling assumptions on covariate selection.
In our case study, we demonstrate that a couple of scenarios (e.g., crossing of survival curves and delayed effects, see section 3.1.1) are likely to happen and hence may cause violation of the PH assumption. Thus, it is natural to ask what is the effect on predictors when the risk factors are selected using possibly incorrect models that may enforce selection of irrelevant variables or rejection of relevant ones.
In order to explore the modeling methods, we first present in Section 2 the details of a motivating data set, which will serve as a case study for this paper. In Section 3, we present an overview of PH models, penalized PH models and non-PH models like hazard regression models. In Section 4, we present the results of applying regular and penalized regression models to the motivating data set under both PH and non-PH assumptions. In Section 5, we present further variable selection analyses stratified by a key covariate, which provide further insights about the role of the PH assumption. In Section 6, we present the results of applying the models to simulated data sets generated under both PH and non-PH assumptions. Finally, in Section 7, we provide a few concluding remarks. Additional exploratory data analyses are also included in an Appendix.
2. A Motivating Data Set: ACTG-175
To explore the effect of the PH assumption on covariate selection, we analyzed the data from AIDS Clinical Trials Group (ACTG) study 175 (abbreviated as ACTG-175 from now on) consisting of 2139 patients (Hammer et al. 1996). ACTG-175 randomized HIV-1-infected subjects with CD4 cell counts between 200 and 500 per cubic millimeter to four antiretroviral therapies: zidovudine monotherapy (ZDV), ZDV plus didanosine (ZDV+ddI), ZDV plus zalcitabine (ZDV+ZAL), and didanosine (ddI) monotherapy. The primary end point was a ≥ 50 percent decline in the CD4 cell count, progression to acquired immunodeficiency syndrome, or death. Approximately 76% of the survival times were right censored. The Kaplan-Meier curves by the four treatment regimens (Figure 1) and the Kaplan-Meier curves by the Off-treatment indicator (Figure 2) suggest that non-PH may be present. In the first case, the survival curves cross; in the second case, there is a delayed treatment effect. The ACTG-175 study recorded many baseline covariates; we included sixteen of them (see Table 1) while omitting the others (see Appendix A for further details on this initial selection of variables).
Figure 1:
Kaplan-Meier curves by the four treatment groups. The violation of the PH assumption is illustrated by the crossing survival curve estimates.
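For readers unfamiliar with the Kaplan-Meier estimator used in the figures, the following is a minimal Python sketch of the product-limit calculation. The toy data are hypothetical and are not taken from ACTG-175; in practice one would use an established survival analysis library.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, survival probability)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same observed time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving
    return curve

# Toy data (hypothetical): event = 1 means the end point was observed, 0 means censored.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
km = kaplan_meier(times, events)
```

Computing one such curve per treatment group and checking whether their difference changes sign over time is the informal crossing check that the figure illustrates.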
Table 1:
Covariates included in the analysis.
| No. | Covariate | Description |
|---|---|---|
| 1 | ZDV+ddI | = 1 if assigned to ZDV+ddI, = 0 if assigned to ZDV |
| 2 | ZDV+ZAL | = 1 if assigned to ZDV+ZAL, = 0 if assigned to ZDV |
| 3 | ddI | = 1 if assigned to ddI, = 0 if assigned to ZDV |
| 4 | Age | age (years) at baseline |
| 5 | Wtkg | weight (kilograms) at baseline |
| 6 | Hemo | = 1 if hemophiliac, = 0 otherwise |
| 7 | Drugs | = 1 if injection-drug use, = 0 otherwise |
| 8 | Karnof | Karnofsky score, on a scale of 0–100 |
| 9 | Oprior | = 1 if non-ZDV anti-retroviral therapy was used prior to the study, = 0 otherwise |
| 10 | Preanti | number of days of anti-retroviral therapy prior to the study |
| 11 | Race | = 1 if non-white, = 0 if white |
| 12 | Gender | = 1 if male, = 0 if female |
| 13 | Symptom | = 1 if symptomatic HIV infection, = 0 if otherwise |
| 14 | Offtrt | = 1 if treatment discontinued before 96 plus/minus 5 weeks, = 0 otherwise |
| 15 | CD40 | CD4 cell count (cells per cubic millimeter) at baseline |
| 16 | CD80 | CD8 cell count (cells per cubic millimeter) at baseline |
In section 4 of this paper we explore how the chosen conditional hazard regression model affects the selection of significant risk factors for the HIV patients among these sixteen baseline variables. We will also demonstrate the effect of interactions among these variables and interactions with time that essentially capture violations of the PH assumption.
3. Conditional Hazard Models
In this section, we present a brief review of some of the conditional hazard models that we consider for our case studies.
3.1. Cox PH Models
The Cox PH Model (Cox 1972) is widely used to determine the impact of treatments and risk factors on time to an event, particularly death, for subjects that may be right-censored. Let right-censored time-to-event data be represented as (Ui, Δi, xi), i = 1, …, n, where Ui = min(Ti, Ci) is the (possibly censored) time to event, Δi = I(Ti ≤ Ci) is the censoring indicator (1 = death, 0 = censored), and xi = (xi1, …, xip)⊤ is a vector of p baseline covariates. The Cox model assumes that the hazard function has the following form:
$$h(t \mid x) = h_0(t)\exp(x^\top \beta), \tag{1}$$
where β is a vector of p coefficients and the conditional hazard function is as defined in Section 1. The Cox model does not assume any specific functional form for the baseline hazard, h0(t). Hence, the Cox model is semi-parametric, as the estimates of β can be obtained without knowledge of the baseline hazard function; this is one of the primary theoretical appeals of the PH model. Clearly, the hazard ratio for two covariate vectors x and x′ at any given time point, h(t|x)/h(t|x′) = exp((x − x′)⊤β), does not depend on time t.
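To make concrete why β can be estimated without knowing h0(t), here is a minimal Python sketch of the Cox partial log-likelihood. It assumes there are no tied event times, and the function name and toy data are ours. The baseline hazard never appears in the computation.

```python
import math

def cox_partial_loglik(beta, times, events, X):
    """Partial log-likelihood for h(t|x) = h0(t) exp(x'beta), assuming no tied event times."""
    eta = [sum(b * xj for b, xj in zip(beta, x)) for x in X]
    loglik = 0.0
    for i, (ti, di) in enumerate(zip(times, events)):
        if di == 1:
            # Risk set: subjects still under observation at time ti.
            risk = [math.exp(eta[k]) for k, tk in enumerate(times) if tk >= ti]
            loglik += eta[i] - math.log(sum(risk))
    return loglik

# Toy data (hypothetical): one binary covariate, four subjects.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
X = [[1.0], [0.0], [1.0], [0.0]]
ll_at_zero = cox_partial_loglik([0.0], times, events, X)
```

At β = 0 each event contributes minus the log of the risk-set size, which gives a quick sanity check on the implementation.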
3.1.1. Possible Causes of Violation of the PH assumption
We present two common practical scenarios that may lead to the violation of the PH assumption:
(1). Crossing Survival Curves:
It is well known that when the covariates simply represent treatment groups (e.g., x = 0, 1), one may compute the non-parametric Kaplan-Meier estimates (KME) of the survival functions and compare the two groups by plotting them, as illustrated in Figure 1. If the KMEs cross, then it is likely that the PH assumption is violated, because if the PH assumption were true, then S1(t) = S0(t)^γ0 for some γ0 > 0 and hence the sign of S1(t) − S0(t) would not change across t. Many other popular models, e.g., the Accelerated Failure Time (AFT) model, which assumes S(t|x) = S0(tγ(x)), or the Proportional Odds (PO) model, which assumes S(t|x)/(1 − S(t|x)) = γ(x)S0(t)/(1 − S0(t)) for some baseline survival function S0(t), also do not allow the crossing of survival functions. In Figure 1, we demonstrate this case when comparing four different treatments from an AIDS clinical trial; notice that the red curve crosses the green curve twice, which likely indicates a violation of the PH assumption.
(2). Delayed Treatment Effects:
Another source of violation occurs when the two treatment groups show no significant differences in the survival rates in the initial phase of the trial, but then a delayed effect is observed among the survival curves. In other words, for some initial time interval t ∈ (0, t0], we may observe S1(t) ≈ S0(t), but then we may observe S1(t) >> S0(t) for t > t0, hence violating the PH assumption that log H1(t) − log H0(t) = log γ0 for all t. In such cases, although survival curves may not cross, the PH assumption may still be violated as the separation of the logarithm of the cumulative hazard may not be constant across all time points. Again in Figure 1, we see that this is the case when comparing the blue and green curves; a more prominent feature of this delayed effect can be seen in Figure 2.
Figure 2:
Kaplan-Meier curves by the covariate Offtrt. The violation of the PH assumption is depicted by a delayed treatment effect.
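The delayed-effect scenario can be checked numerically with a small Python sketch. It assumes a hypothetical constant baseline hazard of 0.1, a hazard ratio γ0 = 0.5, and a delay t0 = 2; none of these values come from the ACTG-175 analysis. Under PH the log ratio of cumulative hazards is constant, whereas under a delayed effect it starts at zero and drifts towards log γ0.

```python
import math

def cum_hazard_ph(t, gamma0):
    """Group-1 cumulative hazard under PH with a hypothetical baseline H0(t) = 0.1 * t."""
    return gamma0 * 0.1 * t

def cum_hazard_delayed(t, gamma0, t0):
    """Group-1 cumulative hazard when the hazard ratio is 1 up to t0 and gamma0 afterwards."""
    return 0.1 * min(t, t0) + gamma0 * 0.1 * max(t - t0, 0.0)

grid = [0.5 * k for k in range(1, 11)]  # time points 0.5, 1.0, ..., 5.0
log_ratio_ph = [math.log(cum_hazard_ph(t, 0.5) / (0.1 * t)) for t in grid]
log_ratio_delayed = [math.log(cum_hazard_delayed(t, 0.5, 2.0) / (0.1 * t)) for t in grid]
```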
To check whether the PH assumption is satisfied, one may allow the regression coefficient vector β to depend on time t and then check whether the estimated coefficient vector over discretized intervals of time is nearly constant. In our analyses, we use the readily available R function cox.zph, available in the R package survival, to test the PH assumption (see Grambsch and Therneau 1994 for further details). We use the coxph function in R to obtain the parameter estimates under the PH assumption. However, the regular PH model doesn’t allow for selection of covariates and only provides a marginal test for significance of each baseline variable. One possible approach would be to run all 2^16 − 1 possible subset regression models and then choose one using an information criterion (such as AIC, BIC or Mallows’s Cp type methods). However, such subset selection becomes increasingly prohibitive with large numbers of variables (especially when interaction effects are to be considered as well). Next, we describe an extension of Cox PH regression to penalized regression methods that allow for simultaneous estimation and selection of the baseline covariates, still within the PH assumption.
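The size of the all-subsets search can be made explicit with a few lines of Python arithmetic. The count with interactions assumes that every pairwise interaction is a separate candidate term and ignores any hierarchy constraint, which is a simplification on our part.

```python
from math import comb

p = 16                                        # baseline covariates in Table 1
main_effect_subsets = 2**p - 1                # non-empty subsets of the main effects
pairwise_terms = comb(p, 2)                   # candidate two-way interactions
all_terms = p + pairwise_terms
subsets_with_interactions = 2**all_terms - 1  # ignoring any hierarchy constraints
print(main_effect_subsets, pairwise_terms, all_terms)  # prints: 65535 120 136
```

With 136 candidate terms the number of subsets exceeds 10^40, which is why exhaustive search is not practical once interactions are considered.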
3.2. Penalized PH Models
The primary idea of any penalized generalized linear regression model is to optimize the log-likelihood of the regression coefficients (in the case of the Cox PH model, one maximizes the partial log-likelihood instead) subject to some penalty for including too many baseline covariates. Such a goal is achieved by setting some of the βj = 0 within the vector β⊤ = (β1, …, βp) by using a penalty function like ∑j |βj|^a for some suitable a ≥ 1, which is essentially the same as maximizing the log-likelihood subject to ∑j |βj|^a ≤ τ for some threshold τ > 0. Clearly, as we decrease τ, some of the βj are naturally set to zero and hence the corresponding variables in the link function that connects to the response are dropped. The choice of the power a ≥ 1 plays a crucial role: the choice of a = 1 leads to the so-called LASSO estimator, whereas the choice of a = 2 leads to the so-called Ridge estimator. As the former is not a strictly convex function, it has been shown that the LASSO penalty can achieve simultaneous estimation and selection of variables within a generalized linear modeling framework. For the penalized PH models, we penalize the negative log of the partial likelihood with the elastic-net penalty, which is a mixture of l1 (the lasso penalty) and l2 (the ridge penalty). The elastic-net penalty is defined as
$$\lambda P_\alpha(\beta) = \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right), \tag{2}$$
which can be adjusted in favor of the ridge penalty (α = 0) or the lasso penalty (α = 1) (Friedman et al. 2010). When α = 1 − ϵ for some small ϵ > 0, the elastic-net penalty performs similarly to lasso while also removing degenerate behavior caused by extreme correlations. For our analyses, we set α = .95.
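As an illustration of how α interpolates between the two penalties, the following Python sketch evaluates the elastic-net penalty in the form used by glmnet, λ(α‖β‖1 + (1 − α)/2 · ‖β‖2²). The coefficient vector and the value of λ are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2), the glmnet-style form."""
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)

beta = [0.5, -1.0, 0.0]   # hypothetical coefficients
lasso = elastic_net_penalty(beta, lam=0.1, alpha=1.0)   # pure l1 penalty
ridge = elastic_net_penalty(beta, lam=0.1, alpha=0.0)   # pure l2 penalty
mixed = elastic_net_penalty(beta, lam=0.1, alpha=0.95)  # the setting used in the text
```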
More specifically, to fit the penalized PH models, we use the R package glmnet, which uses cyclical coordinate descent to solve the elastic-net penalized Cox models. To determine the optimal tuning parameter λ, we run cross validation through the function cv.glmnet one hundred times and take the median of the one hundred resulting values of λ1se, where λ1se is the largest value of λ such that the error in terms of the partial likelihood deviance is within one standard error of the minimum. We thus overestimate λ in order to err on the side of parsimony, according to the so-called “one-standard-error” rule (Hastie et al. 2009).
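The one-standard-error rule and the median over repeated cross-validation runs can be sketched in Python as follows. The cross-validation curves are hypothetical numbers, three runs stand in for the one hundred used in the text, and the helper `lambda_1se` is our own illustration rather than the glmnet implementation.

```python
from statistics import median

def lambda_1se(lambdas, cv_mean, cv_se):
    """Largest lambda whose mean CV error is within one SE of the minimum."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    return max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)

# Hypothetical CV curves (mean deviance, standard error) from three repeated runs.
lambdas = [0.20, 0.10, 0.05, 0.02, 0.01]
runs = [
    ([13.0, 12.4, 12.1, 12.0, 12.05], [0.2, 0.2, 0.15, 0.15, 0.15]),
    ([13.1, 12.2, 12.1, 12.0, 12.02], [0.2, 0.2, 0.25, 0.25, 0.25]),
    ([12.9, 12.5, 12.3, 12.1, 12.0], [0.2, 0.2, 0.15, 0.15, 0.15]),
]
per_run = [lambda_1se(lambdas, m, s) for m, s in runs]
lambda_hat = median(per_run)
```

Taking the median across runs reduces the dependence of the chosen λ on any single random partition into folds.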
3.3. Hazard Regression (HARE) Models
One of the primary concerns with even the penalized Cox regression model is that it is still based on the PH assumption, which may not be correct in practice (as illustrated in Section 3.1.1), and hence any variables that were selected within this framework can be erroneous. Unlike the Cox PH model and the penalized PH model discussed previously, linear spline based hazard regression (HARE) models do not require the PH assumption to be valid (Kooperberg et al. 1995). HARE models use linear splines and their tensor products to fit a generalized linear model for the conditional log-hazard function
$$g(t \mid x) = \log h(t \mid x), \quad t \ge 0. \tag{3}$$
Let the vector of p covariates x take on values in 𝒳, let G be a p-dimensional linear space of functions on [0, ∞) × 𝒳 such that g(·|x) is bounded on [0, ∞) for g ∈ G and x ∈ 𝒳, and let B1(·), …, Bp(·) be a basis of G. Then, the conditional log-hazard function can be modeled as
$$g(t \mid x; \beta) = \sum_{j=1}^{p} \beta_j B_j(t \mid x), \tag{4}$$
where β⊤ = (β1, …, βp). The coefficients β are estimated through maximum likelihood estimation.
HARE, as implemented in the R function hare, part of the package polspline, automatically considers knots in the covariates and time. If the jth covariate xj has k knots, then the kth knot xjk is represented by the basis function (xj−xjk)+, where x+ = max(x, 0). Similarly, if time t has k knots, then the kth knot tk is represented by the basis function (tk−t)+. Knots in the covariates and time have the same units as those by which the covariates and survival times were measured, respectively (e.g., for ACTG-175 the knots in time are in years). HARE also automatically considers the interaction between each pair of covariates or their knots. It also considers the interaction between each covariate or its knot and time or its knot. If an interaction of the latter sort is included, then the HARE model becomes a non-PH model. HARE selects among these basis functions by using stepwise addition, stepwise deletion, and the Akaike Information Criterion (AIC). Further, HARE’s model selection can be restricted to PH models, but it still considers knots in time. We use both HARE based PH models and HARE non-PH models in our analysis.
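To show how a time-by-covariate basis function turns the model into a non-PH model, the following Python sketch builds a log-hazard from truncated linear basis functions. The knot locations 0.61 and 1.3 years are the time knots reported later in Table 2; the coefficients are hypothetical, since fitted values are not reported in the text, and the function names are ours.

```python
def pos(u):
    """Truncated linear function u+ = max(u, 0)."""
    return max(u, 0.0)

def log_hazard(t, offtrt, coef):
    """Log-hazard built from a constant, two time knots, Offtrt, and time-by-Offtrt terms."""
    b_t1 = pos(0.61 - t)   # knot in time at 0.61 years
    b_t2 = pos(1.3 - t)    # knot in time at 1.3 years
    return (coef["const"] + coef["t1"] * b_t1 + coef["t2"] * b_t2
            + coef["offtrt"] * offtrt
            + coef["t1_offtrt"] * b_t1 * offtrt
            + coef["t2_offtrt"] * b_t2 * offtrt)

# Hypothetical coefficients, for illustration only.
ph = {"const": -2.0, "t1": 0.5, "t2": -0.3, "offtrt": 0.4, "t1_offtrt": 0.0, "t2_offtrt": 0.0}
non_ph = dict(ph, t1_offtrt=-1.0, t2_offtrt=0.8)

def log_hr(t, coef):
    """Log hazard ratio of Offtrt = 1 versus Offtrt = 0 at time t."""
    return log_hazard(t, 1, coef) - log_hazard(t, 0, coef)
```

With the interaction coefficients set to zero, `log_hr` is the same at every t; once they are nonzero, the log hazard ratio varies with t before the last knot and is constant afterwards.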
Thus, in our analyses, we use four models for the covariate selection: the (i) Cox PH model, (ii) penalized Cox model, (iii) HARE PH model, and (iv) HARE non-PH model. The significance level for the p-values for all of our analyses was set to .05.
4. Variable Selection Based on PH and non-PH models
In this section we present detailed analysis of the ACTG-175 data set to illustrate the effect of the PH assumption on variable selection. A summary of selected risk factors under each of the four models is presented in Table 2, which clearly indicates that the selection of a subset of the risk factors does depend on the chosen model.
Table 2:
Covariates chosen in each of the four models.
| Covariates | PH | Penalized PH | HARE PH | HARE non-PH |
|---|---|---|---|---|
| Treatment: ZDV + ddI | ✔* | ✔ | ✔ | ✔ |
| Treatment: ZDV + ZAL | ✔* | ✔ | ✔ | ✔ |
| Treatment: ddI | ✔ | ✔ | ✔ | ✔ |
| Age | | ✔ | ✔ | ✔ |
| Wtkg | | | | |
| Hemo | | | | |
| Drugs | ✔ | ✔ | | |
| Karnof | ✔ | ✔ | | |
| Oprior | | | | |
| Preanti | ✔ | ✔ | ✔ | ✔ |
| Race | | | | |
| Gender | | | | |
| Symptom | ✔ | ✔ | ✔ | ✔ |
| Offtrt | ✔* | ✔ | ✔ | ✔ |
| CD40 | ✔* | ✔ | ✔ | ✔ |
| CD80 | ✔* | ✔ | ✔ | ✔ |
| Age × Offtrt | N/A | N/A | ✔ | ✔ |
| (Age – 43)+ | N/A | N/A | ✔ | |
| (Age – 45)+ | N/A | N/A | | ✔ |
| (CD40 – 140)+ | N/A | N/A | ✔ | ✔ |
| (CD40 – 220)+ | N/A | N/A | ✔ | ✔ |
| (.61 – t)+ | N/A | N/A | ✔ | ✔ |
| (1.3 – t)+ | N/A | N/A | ✔ | ✔ |
| (.61 – t)+ × Offtrt | N/A | N/A | N/A | ✔ |
| (1.3 – t)+ × Offtrt | N/A | N/A | N/A | ✔ |
* This covariate significantly violates the PH assumption, according to the R function cox.zph.
4.1. Variable Selection based on the Cox PH Model
According to the Cox PH model, covariates ZDV+ddI, ZDV+ZAL, and ddI are significant at the .05 level. This is consistent with Hammer et al. (1996), which found the relative hazard ratios of treatments ZDV+ddI, ZDV+ZAL, and ddI compared with ZDV alone to be significantly less than one. Covariates Drugs, Karnof, Preanti, Symptom, Offtrt, CD40, and CD80 are also significant at the .05 level (Table 2).
To test departures from the PH assumption, we test the significance of the correlation between the Schoenfeld residuals for each covariate and the ranked failure times with cox.zph (Grambsch & Therneau 1994). The global p-value of this test is 0.0002; therefore, the null hypothesis that there are no departures from proportionality can safely be rejected. In particular, the covariates for which the correlation between the Schoenfeld residuals and the ranked failure times is significant are ZDV+ddI (p = 0.019), ZDV+ZAL (p = 0.005), Offtrt (p = 0.0003), CD40 (p = 0.001), and CD80 (p = 0.029) (Table 2). In Hammer et al. (1996), a Cox PH model was used to compare ZDV+ddI and ZDV+ZAL with ZDV alone without accounting for possible departures from proportionality.
4.2. Variable Selection based on the Penalized PH Model
The penalized PH model selects all the covariates that the Cox PH model found significant, i.e., ZDV+ddI, ZDV+ZAL, ddI, Drugs, Karnof, Preanti, Symptom, Offtrt, CD40, and CD80 (Table 2). Therefore, using the tuning parameter λ1se does not prevent any covariates found significant in the Cox PH model from entering the penalized PH model.
Even though the covariate Age is not found to be significant in the Cox PH model, it is selected in the penalized PH model. This agrees with the finding that Age is an important covariate that determines the relative effectiveness of the combination therapies ZDV+ddI and ZDV+ZAL. That is, for subjects that are 34 years old or less, the survival probabilities for the treatment ZDV+ZAL are almost uniformly larger than those for the treatment ZDV+ddI, whereas the opposite is true for patients that are more than 34 years old (Jiang et al. 2017). Note, however, that the penalized PH model does not automatically consider interactions between covariates, particularly those between Age and each treatment.
4.3. Variable Selection based on the HARE PH Model
Next, we apply the HARE algorithm to the data, restricting model selection to PH models. Of the baseline covariates in Table 1, the HARE PH model selects basis functions that correspond to covariates ZDV+ddI, ZDV+ZAL, ddI, Age, Preanti, Symptom, Offtrt, CD40, and CD80, respectively (Table 2). The HARE PH model selects all the covariates that the penalized PH model does, except for Drugs and Karnof. Of note, even though the basis function for Offtrt was included in the HARE PH model, it has a Wald statistic of −1.72, which is not significant under the .05 significance level.
The HARE algorithm automatically considers interactions between covariates, which is not the case for the Cox PH model or the penalized PH model. Here, the HARE PH model selects the interaction between Age and Offtrt. This is similar to the result from Jiang et al. (2017) discussed in Section 4.2, i.e., there exists an interaction between Age and one of ZDV+ZAL and ZDV+ddI. While not equivalent, the interaction between Age and Offtrt is likely related to the interaction between Age and one of ZDV+ZAL and ZDV+ddI, as whether the subject discontinues the treatment necessarily impacts the effectiveness of the treatment.
In addition to interactions, the HARE algorithm also considers knots in the covariates. The HARE PH model selects the basis functions (Age–43)+, (CD40–140)+, and (CD40–220)+. The HARE PH model also considers knots in time, selecting basis functions (.61 − t)+ and (1.3 − t)+. The basis functions for knots in time, along with the constant basis function, give an estimate of the baseline log-hazard function. In the case of the Cox PH model and the penalized PH model, the baseline hazard function is not estimated at all.
4.4. Variable Selection based on the HARE Non-PH Model
Finally, we apply the HARE algorithm to the data again, this time without restricting the model selection to PH models. This allows for the possibility of a non-PH model. Of the baseline covariates in Table 1, the HARE non-PH model selects the same covariates as the HARE PH model (Table 2). Thus, expanding the family of allowable spaces to include interactions between the covariates and time does not impact covariate selection in this case.
The HARE non-PH model selects the interaction Age × Offtrt, like the HARE PH model. The HARE non-PH model selects the basis functions (Age–45)+, (CD40–140)+, (CD40–220)+, (.61 − t)+, and (1.3 − t)+; only the knot in Age is different from that in the HARE PH model. However, the difference may not be statistically significant. Of note, the basis function (.61 − t)+ has a Wald statistic of 0.02, which is not significant under the .05 significance level.
The HARE non-PH model selects the interactions (.61 − t)+ × Offtrt and (1.3 − t)+ × Offtrt, which show that Offtrt exhibits non-PH. Therefore, this model is in fact a non-PH model.
We use the HARE non-PH model to estimate conditional survival curves for two subjects from ACTG-175, one with Offtrt = 0 and one with Offtrt = 1 (Figure 3). Figure 3 shows the interactions between Offtrt and the other covariates: the survival curve corresponding to one of the levels of Offtrt changes significantly depending on the values of the other covariates. Figure 3 also shows Offtrt’s departure from proportionality: when the values of the covariates other than Offtrt are set to those of the 575th patient, the hazards for Offtrt = 0 and Offtrt = 1, respectively, cross.
Figure 3:
The black and red solid curves refer to the survival curves conditioned on the covariates for the 1380th and 575th patients in the ACTG-175 data set, respectively. Each dashed line has the same Offtrt value as the solid line of the same color, but all other covariates take on the values of the other patient. Note that when the solid red line is compared with the dashed black line, between which only the Offtrt value is different, the hazards seem to cross. The vertical lines indicate the knots of the interaction between Time and Offtrt (Table 2).
Thus, as the unrestricted HARE model is not necessarily based on the PH assumption and automatically considers possible interaction effects, it seems to be a robust choice for not only the ACTG-175 data set but also other similar clinical trials. We explore this using several simulation studies using synthetic data which are similar in structure to our real motivating data set (see section 6).
5. Variable Selection Based on Stratified Models
One way to enforce the PH assumption is to stratify the patients into different groups and hope that the PH assumption is valid within each stratum. Thus, we stratify the analyses into two groups: one group contains subjects with Offtrt = 0 and the other group contains subjects with Offtrt = 1. There are 1363 subjects in the Offtrt = 0 group and 776 subjects in the Offtrt = 1 group. We fit the Cox PH model, penalized PH model, HARE PH model, and HARE non-PH model to each group. Because the analyses are stratified on Offtrt, Offtrt is no longer a covariate and so cannot be selected by the models. Therefore, we disregard Offtrt when discussing the covariates that were selected or found significant in each of the four models.
5.1. Variable Selection based on the Cox PH Model
For the Offtrt = 0 group, the Cox PH model finds ZDV+ddI, ZDV+ZAL, ddI, Karnof, Preanti, Symptom, CD40, and CD80 to be significant (Table 3). For the Offtrt = 1 group, the Cox PH model finds ZDV+ddI, ZDV+ZAL, ddI, Age, Drugs, Preanti, Symptom, CD40, and CD80 to be significant (Table 4).
Table 3:
Covariates chosen in each of the four models for the Offtrt = 0 group.
| Covariates | PH | Penalized PH | HARE PH | HARE non-PH |
|---|---|---|---|---|
| Treatment: ZDV + ddI | ✔ | ✔ | ✔ | ✔ |
| Treatment: ZDV + ZAL | ✔* | ✔ | ✔ | ✔ |
| Treatment: ddI | ✔ | ✔ | ✔ | ✔ |
| Age | | | | |
| Wtkg | * | | | |
| Hemo | | | | |
| Drugs | | | | |
| Karnof | ✔ | ✔ | | |
| Oprior | | ✔ | | |
| Preanti | ✔ | ✔ | | |
| Race | | | | |
| Gender | | | | |
| Symptom | ✔ | ✔ | | |
| CD40 | ✔* | ✔ | ✔ | ✔ |
| CD80 | ✔* | ✔ | ✔ | ✔ |
| (CD40 – 230)+ | N/A | N/A | ✔ | |
| (CD40 – 360)+ | N/A | N/A | | ✔ |
| (CD80 – 1200)+ | N/A | N/A | ✔ | ✔ |
| (1.3 – t)+ | N/A | N/A | ✔ | ✔ |
* This covariate significantly violates the PH assumption, according to the R function cox.zph.
Table 4:
Covariates chosen in each of the four models for the Offtrt = 1 group.
| Covariates | PH | Penalized PH | HARE PH | HARE non-PH |
|---|---|---|---|---|
| Treatment: ZDV + ddI | ✔ | | | |
| Treatment: ZDV + ZAL | ✔ | | | |
| Treatment: ddI | ✔ | | | |
| Age | ✔ | ✔ | ✔ | ✔ |
| Wtkg | | | | |
| Hemo | | | | |
| Drugs | ✔ | | | |
| Karnof | | | | |
| Oprior | | | | |
| Preanti | ✔ | ✔ | ✔ | ✔ |
| Race | | | | |
| Gender | | | | |
| Symptom | ✔ | ✔ | ✔ | ✔ |
| CD40 | ✔ | ✔ | ✔ | ✔ |
| CD80 | ✔ | ✔ | ✔ | ✔ |
| (Age – 40)+ | N/A | N/A | ✔ | ✔ |
| (CD40 – 140)+ | N/A | N/A | ✔ | ✔ |
| (CD80 – 310)+ | N/A | N/A | ✔ | ✔ |
| (.61 – t)+ | N/A | N/A | ✔ | ✔ |
These results suggest that Drugs, Karnof, and Age interact with Offtrt, as each of these covariates is significant according to the Cox PH model in one of the groups, but not in the other.
For the Offtrt = 0 group, the global p-value of the cox.zph test is 0.009; therefore, the null hypothesis that there are no departures from proportionality can be rejected. In particular, the covariates for which cox.zph found significant departures from proportionality are ZDV+ZAL (p = 0.003), Wtkg (p = 0.029), CD40 (p = 0.002), and CD80 (p = 0.006). For the Offtrt = 1 group, the global p-value of the cox.zph test is 0.103; therefore, the null hypothesis that there are no departures from proportionality cannot be rejected.
These results suggest that non-PH is only an issue if Offtrt = 0, that is, if subjects stay on their assigned treatment throughout their participation in the clinical trial.
5.2. Variable Selection based on the Penalized PH Model
For the Offtrt = 0 group, the penalized PH model selects all the covariates that the Cox PH model for the same group found significant, i.e., ZDV+ddI, ZDV+ZAL, ddI, Karnof, Preanti, Symptom, CD40, and CD80 (Table 3). Therefore, using the tuning parameter λ1se does not prevent any covariates found significant in the Cox PH model from entering the penalized PH model with stratification on Offtrt = 0. However, this penalized PH model also selects Oprior, which is not selected by the Cox PH model for the same group.
For the Offtrt = 1 group, the penalized PH model selects Age, Preanti, Symptom, CD40, and CD80 (Table 4). This set of covariates is a strict subset of the set of covariates that the Cox PH model for the same group found significant. In particular, this set of covariates is missing ZDV+ddI, ZDV+ZAL, ddI, and Drugs, which the Cox PH model for the same group found significant.
These results suggest that the three treatment covariates, ZDV+ddI, ZDV+ZAL, and ddI, also interact with Offtrt, in addition to Karnof, Oprior, and Age. These covariates are selected by the penalized PH model stratified on one of the groups, but not by the penalized PH model stratified on the other group.
5.3. Variable Selection based on the HARE PH Model
For the Offtrt = 0 group, the HARE PH model selects five of the baseline covariates in Table 1: ZDV+ddI, ZDV+ZAL, ddI, CD40, and CD80 (Table 3). This set of covariates is a strict subset of the set of covariates selected by the penalized PH model for the same group. In particular, this set of covariates is missing Karnof, Oprior, Preanti, and Symptom, which are selected by the penalized PH model for the same group. This HARE PH model selects the knots (CD40–230)+, (CD80–1200)+, and (1.3 − t)+.
For the Offtrt = 1 group, the HARE PH model selects five of the baseline covariates in Table 1: Age, Preanti, Symptom, CD40, and CD80 (Table 4). This set of covariates is equivalent to the set of covariates selected by the penalized PH model for the same group. This HARE PH model selects the knots (Age–40)+, (CD40–140)+, (CD80–310)+, and (.61 − t)+.
These results suggest that ZDV+ddI, ZDV+ZAL, ddI, Age, Preanti, and Symptom interact with Offtrt, as each of these covariates is selected by the HARE PH model stratified on one of the groups, but not by the HARE PH model stratified on the other group.
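The comparison behind this statement is a symmetric difference of the two selected sets. A small sketch, using the covariate names reported above for the HARE PH model (the helper name `stratum_specific` is only illustrative):

```python
# Covariates selected by the HARE PH model in each stratum (as reported in the text).
hare_ph_offtrt0 = {"ZDV+ddI", "ZDV+ZAL", "ddI", "CD40", "CD80"}
hare_ph_offtrt1 = {"Age", "Preanti", "Symptom", "CD40", "CD80"}

def stratum_specific(selected_a, selected_b):
    """Covariates selected in exactly one of two strata (symmetric difference)."""
    return selected_a ^ selected_b

candidates = stratum_specific(hare_ph_offtrt0, hare_ph_offtrt1)
print(sorted(candidates))
```

Covariates selected in both strata (here CD40 and CD80) drop out, leaving those whose selection depends on the value of Offtrt.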
5.4. Variable Selection based on the HARE Model not restricted to PH Models
For the Offtrt = 0 group, the unrestricted HARE model selects the same covariates and almost the same knots as the HARE PH model for the same group (Table 3). The only difference between the two HARE models is the knot in CD40, which is located at 360 for the unrestricted HARE model and at 230 for the HARE PH model. The unrestricted HARE model contains no interactions, in particular none between a covariate and time, and so it is another HARE PH model rather than a HARE non-PH model.
For the Offtrt = 1 group, the unrestricted HARE model selects the same set of covariates and knots as the HARE PH model (Table 4). Therefore, the unrestricted HARE model is another HARE PH model rather than a HARE non-PH model.
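The knots reported in this section, such as (CD40 − 230)+ and (1.3 − t)+, are truncated linear (hinge) basis functions, where (u)+ = max(u, 0). The sketch below evaluates both forms on made-up values; the function names and inputs are illustrative and are not part of any HARE implementation.

```python
import numpy as np

def hinge(x, knot):
    """Truncated linear basis (x - knot)+ = max(x - knot, 0)."""
    return np.maximum(np.asarray(x, dtype=float) - knot, 0.0)

def reverse_hinge(x, knot):
    """Mirrored basis (knot - x)+ = max(knot - x, 0), as in (1.3 - t)+."""
    return np.maximum(knot - np.asarray(x, dtype=float), 0.0)

cd40 = np.array([100.0, 230.0, 400.0])   # illustrative baseline CD4 counts
t = np.array([0.5, 1.3, 2.0])            # illustrative follow-up times

print(hinge(cd40, 230.0))      # zero at or below the knot, linear above it
print(reverse_hinge(t, 1.3))   # linear below the knot, zero at or above it
```

A knot in a covariate lets its effect on the log-hazard change slope at that value, while a knot in t changes the shape of the baseline hazard over time. Neither, on its own, introduces a covariate-by-time interaction, which is consistent with these fits remaining PH models.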
6. Simulation Studies
To further explore the effect of the PH assumption, or the violation thereof, on the four variable selection methods, we conducted several simulation studies. To mimic realistic simulation scenarios we used the ACTG-175 data to simulate the risk factors and generate the (uncensored) survival times from various parametric distributions. For each simulation study, 500 simulated data sets were used, in which survival times for n = 2139 subjects given the same set of risk factors (covariates) as in the ACTG-175 data set were generated from two parametric models, one satisfying the PH assumption and the other violating the PH assumption.
For each simulation study, the relative frequencies of selection of true covariates (i.e., those used to generate survival times) and unused covariates (i.e., those with coefficients equal to 0) were measured for each variable selection method. The variable selection results of the simulations, as well as the coefficient values used, are listed in Tables 5 and 6. The last column in each table, “× Time”, indicates the proportion of times that the HARE non-PH model selected the basis function corresponding to the interaction between the covariate and time, leading to non-PH. Notice that the HARE non-PH model imposes neither the PH nor the AFT assumption on the data generating mechanism and hence is robust to violations of such assumptions.
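As a sketch of how such relative frequencies can be tabulated, suppose each method's output is stored as a 0/1 matrix with one row per simulated data set and one column per covariate. The function name, the storage layout, and the toy input below are assumptions made for illustration only.

```python
import numpy as np

def selection_summary(selected, true_coef):
    """Per-covariate selection frequency, split into true and unused covariates."""
    selected = np.asarray(selected, dtype=bool)   # shape: (n_simulations, n_covariates)
    true_coef = np.asarray(true_coef, dtype=float)
    freq = selected.mean(axis=0)                  # proportion of data sets selecting each covariate
    is_true = true_coef != 0                      # covariates used to generate survival times
    return {
        "frequency": freq,
        "true_positive_range": (freq[is_true].min(), freq[is_true].max()),
        "false_positive_range": (freq[~is_true].min(), freq[~is_true].max()),
    }

# Toy input: 4 simulated data sets, 3 covariates (the last one is unused).
selected = [[1, 1, 0],
            [1, 0, 0],
            [1, 1, 1],
            [1, 1, 0]]
true_coef = [-0.323, 0.217, 0.0]
summary = selection_summary(selected, true_coef)
print(summary["frequency"])
```

The ranges returned here correspond to the "ranged from ... to ..." summaries quoted for the true and unused covariates in the discussion of Tables 5 and 6.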
Table 5:
Proportion of times each covariate is selected across the four models fitted on simulated survival times from a Weibull PH model. The column “Coefficient” denotes the true coefficient value used in the simulations, multiplied by the corresponding covariate’s standard deviation. The column “× Time” is the proportion of times the HARE non-PH model selected an interaction between the covariate and time.
| Covariates | Coefficient | PH | Penalized PH | HARE PH | HARE non-PH | × Time |
|---|---|---|---|---|---|---|
| ZDV + ddI | −0.323 | 1.000 | 0.986 | 0.992 | 0.992 | 0.008 |
| ZDV + ZAL | −0.299 | 1.000 | 0.978 | 0.982 | 0.980 | 0.004 |
| ddI | −0.237 | 0.992 | 0.786 | 0.966 | 0.962 | 0.006 |
| Age | −0.158 | 0.938 | 0.968 | 0.826 | 0.826 | 0.014 |
| Wtkg | 0.00 | 0.046 | 0.178 | 0.004 | 0.004 | 0.000 |
| Hemo | 0.00 | 0.030 | 0.210 | 0.010 | 0.010 | 0.002 |
| Drugs | 0.00 | 0.072 | 0.260 | 0.012 | 0.012 | 0.000 |
| Karnof | 0.00 | 0.056 | 0.190 | 0.008 | 0.008 | 0.000 |
| Oprior | 0.00 | 0.044 | 0.186 | 0.008 | 0.008 | 0.000 |
| Preanti | 0.217 | 1.000 | 1.000 | 0.986 | 0.986 | 0.006 |
| Race | 0.00 | 0.052 | 0.202 | 0.010 | 0.010 | 0.000 |
| Gender | 0.00 | 0.076 | 0.224 | 0.008 | 0.008 | 0.000 |
| Symptom | 0.164 | 0.962 | 0.972 | 0.866 | 0.868 | 0.000 |
| Offtrt | −0.420 | 1.000 | 1.000 | 1.000 | 1.000 | 0.008 |
| CD40 | 4.586 | 1.000 | 1.000 | 1.000 | 1.000 | 0.028 |
| CD80 | 0.222 | 1.000 | 1.000 | 0.984 | 0.984 | 0.014 |
Table 6:
Proportion of times each covariate is selected across the four models fitted on simulated survival times from the log-normal AFT model. The column “Coefficient” denotes the true coefficient value used in the simulations, multiplied by the corresponding covariate’s standard deviation. The column “× Time” is the proportion of times the HARE non-PH model selected an interaction between the covariate and time.
| Covariates | Coefficient | PH | Penalized PH | HARE PH | HARE non-PH | × Time |
|---|---|---|---|---|---|---|
| ZDV + ddI | −0.323 | 1.000 | 0.486 | 0.998 | 0.998 | 0.052 |
| ZDV + ZAL | −0.299 | 1.000 | 0.816 | 0.998 | 0.998 | 0.052 |
| ddI | −0.237 | 1.000 | 0.418 | 0.998 | 0.998 | 0.042 |
| Age | −0.158 | 0.862 | 0.484 | 0.922 | 0.922 | 0.024 |
| Wtkg | 0.00 | 0.058 | 0.042 | 0.024 | 0.024 | 0.004 |
| Hemo | 0.00 | 0.086 | 0.034 | 0.064 | 0.066 | 0.004 |
| Drugs | 0.00 | 0.442 | 0.088 | 0.070 | 0.064 | 0.002 |
| Karnof | 0.00 | 0.624 | 0.408 | 0.030 | 0.030 | 0.002 |
| Oprior | 0.00 | 0.054 | 0.042 | 0.036 | 0.034 | 0.002 |
| Preanti | 0.217 | 1.000 | 0.936 | 0.812 | 0.812 | 0.044 |
| Race | 0.00 | 0.264 | 0.122 | 0.040 | 0.040 | 0.004 |
| Gender | 0.00 | 0.108 | 0.026 | 0.054 | 0.054 | 0.000 |
| Symptom | 0.164 | 0.994 | 0.602 | 0.998 | 0.998 | 0.050 |
| Offtrt | −0.420 | 1.000 | 1.000 | 1.000 | 1.000 | 0.180 |
| CD40 | 4.586 | 1.000 | 1.000 | 1.000 | 1.000 | 0.348 |
| CD80 | 0.222 | 1.000 | 0.994 | 0.464 | 0.464 | 0.024 |
6.1. Data Generated from a Weibull PH model
The first simulation study generated survival times from a Weibull distribution, which satisfies the PH assumption. We used the approach of Bender et al. (2005), which transforms uniform random variables via the inverse of the cumulative baseline hazard function of the Weibull distribution. To estimate the shape parameter of the baseline hazard, we generated 100,000 values from the baseline of the HARE non-PH model fitted to the ACTG-175 data set (setting all covariates to 0) and fitted a Weibull distribution to these values. The estimated shape parameter was 262.429. We set the scale parameter of the baseline hazard to 3.348e-140 so that the range of the simulated survival times was similar to that of the ACTG-175 data set. The coefficients of the Weibull PH model were set equal to those of the HARE non-PH model fitted to the ACTG-175 data set, and the covariate values were taken from the ACTG-175 data set. For covariates that were not selected by the HARE non-PH model, the coefficients were set to zero. We incorporated neither the knots nor the tensor products present in the HARE non-PH model into the Weibull PH model. Of note, omitting the tensor products between Offtrt and time and between Offtrt and Age causes Offtrt to have a strictly positive effect on survival, rather than a delayed negative effect as in the ACTG-175 data set. The censoring times were generated from an exponential distribution with rate parameter 0.4 to achieve a censoring rate of 74%, which is similar to the censoring rate in the ACTG-175 data set.
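The inverse-transform step of Bender et al. (2005) can be sketched as follows, assuming the parameterization H(t|x) = λ t^ν exp(x⊤β) for the cumulative hazard, so that T = (−log U / (λ exp(x⊤β)))^(1/ν) with U uniform on (0, 1). The function names, the placeholder covariates, and the demo parameter values are assumptions for illustration; this is not the code used for the study.

```python
import numpy as np

def weibull_ph_times(X, beta, shape, scale, u):
    """Invert H(t | x) = scale * t**shape * exp(x @ beta) at -log(u)."""
    eta = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    # Solve on the log scale so that very small `scale` values stay numerically stable.
    log_t = (np.log(-np.log(u)) - np.log(scale) - eta) / shape
    return np.exp(log_t)

def exponential_censoring(times, rate, rng):
    """Right-censor `times` with independent Exp(rate) censoring times."""
    c = rng.exponential(scale=1.0 / rate, size=times.shape)   # Exp(rate) has mean 1/rate
    observed = np.minimum(times, c)
    event = (times <= c).astype(int)   # 1 = event observed, 0 = censored
    return observed, event

rng = np.random.default_rng(0)
n = 2139
X = rng.normal(size=(n, 2))     # placeholder covariates, NOT the ACTG-175 data
beta = np.array([-0.3, 0.2])    # placeholder coefficients
u = rng.uniform(size=n)
t = weibull_ph_times(X, beta, shape=1.5, scale=0.1, u=u)
obs, event = exponential_censoring(t, rate=0.4, rng=rng)
print("censoring rate:", 1 - event.mean())
```

Working with log T avoids forming intermediate quantities such as 1/λ directly, which matters when the scale is as extreme as the value reported above. In practice the censoring rate parameter would be tuned until the empirical censoring proportion matches the target.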
For survival times generated from the Weibull PH model, the Cox PH, HARE PH, and HARE non-PH models perform similarly in terms of selecting the true covariates (true positive rate), with the Cox PH model performing slightly better (Table 5). For the Cox PH model, the proportion of times a given true covariate was chosen, i.e., found significant at the 0.05 level, ranged from 0.938 to 1.000, while for the HARE models the proportion ranged from 0.826 to 1.000. However, the HARE models performed better in terms of minimizing the false positive rate. For the Cox PH model, the proportion of times a given unused covariate was chosen ranged from 0.030 to 0.076, whereas for the HARE models the proportion ranged from 0.004 to 0.012. There were no appreciable differences between the HARE PH and HARE non-PH models in terms of covariates selected, as in the analysis of the ACTG-175 data set (Table 2). The HARE non-PH model had a low false positive rate with regard to detecting (absent) interactions between the covariates and time, with the relative frequency ranging from 0.000 to 0.028.
The penalized PH model performed the worst, with a true positive rate as low as 0.786 and a false positive rate ranging from 0.178 to 0.260. This may be due to the well-known biased estimation caused by the selection of the tuning parameter within the penalized PH model (see Tibshirani (1997) for further details).
6.2. Data Generated from a Log-Normal AFT model
The second simulation study generated survival times from a log-normal AFT model, which violates the PH assumption, specified as
$$\log T_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma \epsilon_i, \tag{5}$$
where β1, …, βp are the coefficients from the HARE non-PH model, xi1, …, xip are the covariates from the ACTG-175 data set, σ is a scale parameter and ϵi are i.i.d. from N(0, 1). As before, coefficients of covariates that were not selected by the HARE non-PH model were set to zero, and the knots and tensor products in the HARE non-PH model were omitted. The intercept parameter, β0, was set to −11 in order to make the range of the simulated data’s survival times similar to that of the ACTG-175 data set. The scale parameter, σ, was set to be the standard deviation of the log of the survival times in the ACTG-175 study. The censoring times were generated from an exponential distribution with rate parameter 2.1 to achieve a censoring rate of 73%, which is similar to the censoring rate in the ACTG-175 data set.
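A minimal sketch of this data generating step, assuming the usual log-linear AFT form log T = β0 + x⊤β + σϵ with standard normal errors, followed by exponential censoring. The covariates, coefficients, intercept, and scale below are placeholders chosen for illustration, not the values used in the study.

```python
import numpy as np

def lognormal_aft_times(X, beta, intercept, sigma, eps):
    """Survival times with log T = intercept + X @ beta + sigma * eps."""
    eta = intercept + np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    return np.exp(eta + sigma * np.asarray(eps, dtype=float))

rng = np.random.default_rng(1)
n = 2139
X = rng.normal(size=(n, 2))      # placeholder covariates, NOT the ACTG-175 data
beta = np.array([-0.3, 0.2])     # placeholder coefficients
eps = rng.standard_normal(n)     # i.i.d. N(0, 1) errors
t = lognormal_aft_times(X, beta, intercept=0.0, sigma=0.5, eps=eps)

c = rng.exponential(scale=1.0 / 2.1, size=n)   # Exp(rate) has mean 1/rate
obs = np.minimum(t, c)
event = (t <= c).astype(int)     # 1 = event observed, 0 = censored
print("censoring rate:", 1 - event.mean())
```

In this formulation the covariates rescale time rather than the hazard, so a coefficient's sign describes its effect on log survival time.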
For survival times generated from the log-normal AFT model, the Cox PH model maintained a high true positive rate, ranging from 0.862 to 1.000 (Table 6). However, it performed markedly worse in terms of the false positive rate, choosing three covariates unused in the simulation, Drugs, Karnof, and Race, with relative frequencies of 0.442, 0.624, and 0.264, respectively. The penalized PH model performed poorly in terms of the true positive rate, selecting the true covariates ZDV+ddI, ddI, Age, and Symptom with relative frequencies of only 0.486, 0.418, 0.484, and 0.602, respectively. It also selected the unused covariate Karnof with a relative frequency of 0.408. Of note, in the analysis of the ACTG-175 data set, the Cox PH model and penalized PH model both selected Drugs and Karnof, whereas the HARE models ignored these covariates (Table 2).
The HARE models selected most of the true covariates with a high relative frequency, ranging from 0.812 to 1.000, except for CD80, which was selected with a relative frequency of 0.464. In this regard, the Cox PH model is superior. However, as in the Weibull PH simulation (Section 6.1), the HARE models performed better with regard to the false positive rate, with the relative frequency of selection of unused covariates ranging from 0.024 to 0.070. As in the Weibull PH simulation, there was almost no difference in terms of covariate selection between the HARE PH model and the HARE non-PH model. However, the HARE non-PH model detected interactions between the covariates and time. The most frequently selected interactions involved Offtrt and CD40, with relative frequencies of 0.180 and 0.348, respectively.
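One way to see why covariate-by-time interactions can arise under a log-normal AFT model is to compare the hazard ratio between two subjects over time. The sketch below uses the closed-form log-normal hazard and contrasts it with a Weibull PH hazard; the linear predictors (0 and 0.5), σ = 1, and the Weibull parameters are hypothetical values chosen only for illustration.

```python
import math

def lognormal_hazard(t, mu, sigma):
    """Hazard of T when log T ~ N(mu, sigma**2)."""
    z = (math.log(t) - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    surv = 0.5 * math.erfc(z / math.sqrt(2.0))                # 1 - Phi(z)
    return pdf / (sigma * t * surv)

def weibull_ph_hazard(t, eta, shape, scale):
    """Hazard scale * shape * t**(shape - 1) * exp(eta) of a Weibull PH model."""
    return scale * shape * t ** (shape - 1.0) * math.exp(eta)

times = [0.5, 1.0, 2.0, 4.0]
# Two hypothetical subjects whose linear predictors differ by 0.5.
aft_ratio = [lognormal_hazard(t, 0.5, 1.0) / lognormal_hazard(t, 0.0, 1.0) for t in times]
ph_ratio = [weibull_ph_hazard(t, 0.5, 1.5, 0.1) / weibull_ph_hazard(t, 0.0, 1.5, 0.1) for t in times]
print(aft_ratio)   # varies with t
print(ph_ratio)    # constant in t
```

Under the Weibull PH model the ratio equals exp(0.5) at every time point, whereas under the log-normal model it drifts with t, which is the kind of departure that a covariate-by-time basis function can capture.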
7. Conclusion
In this paper we have explored both the penalized PH model and the non-PH model when selecting risk factors for clinical studies using survival analysis methods. As demonstrated in the previous sections, the choice of a model structure (PH vs. non-PH) does affect the selection of the risk factors; in particular, the non-PH models have the ability to find potentially new risk factors (e.g., those that interact with time) that otherwise could not be identified by using PH models. Moreover, from the simulation studies that we have carried out in this paper, it is apparent that even when the true data generating mechanism is based on a PH model, the unrestricted HARE model (which does not enforce the PH assumption) works reasonably well in identifying the true risk factors and has a very low false positive rate. However, when the true data generating mechanism violates the PH assumption, the PH model and its penalized version falsely identify risk factors (with a proportion as high as 62% in our simulations), which could lead to serious consequences in medical applications. There are several alternatives to address the PH violation (e.g., time varying coefficient models), but many of these extensions of PH models still make use of the inherent multiplicative structure of the PH model, and it is not clear how to use penalized methods for such time varying coefficients.
Thus, it is prudent that practitioners use both the penalized PH model and the non-PH model to explore the effect of the chosen model on variable selection. One possible line of future research is to use a penalized regression method to select variables under the non-PH framework instead of using forward/backward selection methods as used in the HARE models.
Acknowledgement
The authors gratefully acknowledge the funding received from the NIH NHLBI SIBS program (award R25 HL131490) awarded to NC State University, which allowed the first author to collaborate with his mentor (the second author). The authors are also grateful to two anonymous reviewers for many constructive and helpful comments and suggestions, which led to a much improved version of an earlier manuscript.
Appendix A
Before applying formal statistical methods to select variables with or without the PH assumption, we explored the baseline covariates for multicollinearity. Such a phenomenon is known to cause problems within a generalized linear model. Below are pairs of variables that lead to almost perfect multicollinearity.
- ZDV-only indicator Treat and treatment indicator Trt. Treat was dropped. Trt was expanded into ZDV+ddI, ZDV+ZAL, and ddI.
- Antiretroviral-experienced indicator Str2 and antiretroviral history indicator Strat. Str2 was dropped.
Below are pairs of highly correlated variables, with their Spearman’s rank correlation coefficients in parentheses.
- Gender and homosexual status Homo (rs = 0.61). Homo was dropped.
- Number of days of antiretroviral therapy before ACTG-175, Preanti, and indicator of having taken ZDV in the 30 days prior to ACTG-175, Z30 (rs = 0.83). Z30 was dropped.
- Strat and Preanti (rs = 0.96). Strat was dropped.
- Baseline CD4 cell count CD40 and CD4 cell count at 20 ± 5 weeks CD420 (rs = 0.62). Note: due to its high correlation with CD420, CD40 serves as a good proxy for CD420.
- Baseline CD8 cell count CD80 and CD8 cell count at 20 ± 5 weeks CD820 (rs = 0.74). Note: due to its high correlation with CD820, CD80 serves as a good proxy for CD820.
Based on preliminary analysis, we have found that CD420 dominates the covariate selection, preventing other covariates from being selected. Since CD40 serves as a good proxy for CD420, we have decided to omit CD420. Since CD80 serves as a good proxy for CD820, we have also decided to omit CD820 (Lufkin n.d.). In addition, based on preliminary analysis, CD820 was not selected by many of the models.
References
- Bender R, Augustin T & Blettner M (2005), ‘Generating survival times to simulate Cox proportional hazards models’, Statistics in Medicine 24(11), 1713–1723.
- Cox DR (1972), ‘Regression models and life-tables’, Journal of the Royal Statistical Society: Series B 34(2), 187–220.
- Friedman J, Hastie T & Tibshirani R (2010), ‘Regularization paths for generalized linear models via coordinate descent’, Journal of Statistical Software 33(1), 1–22.
- Grambsch PM & Therneau TM (1994), ‘Proportional hazards tests and diagnostics based on weighted residuals’, Biometrika 81(3), 515–526.
- Hammer SM, Katzenstein DA, Hughes MD, Gundacker H, Schooley RT, Haubrich RH, Henry WK, Lederman MM, Phair JP, Niu M, Hirsch MS & Merigan TC (1996), ‘A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter’, The New England Journal of Medicine 335(15), 1081–1090.
- Hastie T, Tibshirani R & Friedman J (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn, Springer-Verlag, New York.
- Jiang R, Lu W, Song R & Davidian M (2017), ‘On estimation of optimal treatment regimes for maximizing t-year survival probability’, Journal of the Royal Statistical Society: Series B 79(4), 1165–1185.
- Kooperberg C, Stone CJ & Truong YK (1995), ‘Hazard regression’, Journal of the American Statistical Association 90(429), 78–94.
- Lufkin PS (n.d.), ‘Linear models with dominant variables’, SAS User Group International.
- Tibshirani R (1997), ‘The lasso method for variable selection in the Cox model’, Statistics in Medicine 16, 385–395.



