Abstract
This paper discusses methods for clustering a continuous covariate in a survival analysis model. The advantages of using a categorical covariate defined by discretizing a continuous covariate (via clustering) are (i) enhanced interpretability of the covariate's impact on survival and (ii) relaxed model assumptions of the kind usually required for survival models, such as the proportional hazards assumption. Simulations and examples are provided to illustrate the methods.
Keywords: Clustering, discretize, moderator analysis, stratification, survival analysis
1. Introduction
In statistical modeling, the practice of discretizing continuous variables has a long history and is utilized for various reasons. For example, [19] consider discretizing a continuous variable in a diagnostic testing setting and [8] notes issues with discretizing continuous variables in multiple linear regression models. [22] introduce a method of discretizing predictors in a generalized linear model framework with a focus on controlling type-I error rates when considering multiple cutpoints defining the threshold for discretizing the predictor variable. For survival analysis models, legitimate concerns about the practice of discretizing continuous variables have been raised [e.g. 2,5,21]. However, discretizing variables can nonetheless provide a clinically intuitive and interpretable understanding of the relationships between variables. Importantly, discretizing continuous variables is also essential in order to make clinical treatment decisions based on continuous measures. Treatment decisions are most often discrete (e.g. give drug A or give drug B), but the statistical models used to inform treatment decisions are often based on continuous covariates. Thus, there is often a need to discretize continuous variables, that is, to determine cutoff values based on these continuous variables to guide treatment decisions. However, this brings challenges of its own, such as defining an optimal cutoff [e.g. 3,4]. For example, a middle-aged male may be classified as hypertensive, and hence in need of blood pressure medication, if his blood pressure exceeds the cutoff of 140 on a continuous blood pressure scale. Additionally, discretizing a continuous covariate can be regarded as a way to statistically summarize data in order to provide a useful descriptive analysis in a modeling context. Note that partitioning approaches for discretizing a variable are not necessarily predicated on an assumption that true latent classes exist.
Instead, regardless of whether or not true latent classes exist, the partitioning strategy can be employed and yield decision cutpoints that many applications require.
If a continuous covariate is to be discretized, the challenge that arises is how to form the categories (or strata) for the discretized variable. An appealing approach is to apply a clustering algorithm to the covariate. In this paper we implement clustering algorithms that optimize the regression model fit in the setting of survival analysis models. In Section 3.2, we examine data from a depression trial and model the time to relapse for individuals who responded to an initial antidepressant treatment. The effect of continuous covariates on time to relapse is then investigated using the proposed partitioning strategies in a Cox proportional hazards model.
The problem of partitioning continuous covariates in survival analysis has been studied previously; [20] gives an overview of various approaches. Other related work on partitioning strategies in survival analysis models appears in [7,15–17], including approaches based on obtaining optimal cutpoints for a prognostic variable [1,13,14,18]. For example, [7] consider the problem of choosing a single cutpoint for a prognostic variable in order to identify a high and a low risk group. Their methodology is based on maximizing the magnitude of a log-rank statistic while controlling type I error rates for multiple testing. They show that their test statistic asymptotically follows a Brownian bridge distribution under the null hypothesis of no difference in survival probabilities between the high and low risk groups. The approach of [7] is also illustrated in [27], who developed a SAS macro to implement the cutpoint estimation procedure. Our focus is not on establishing statistical significance of a discretized covariate as in [12], and we recommend following the advice of [9] to assess the statistical significance of the covariate prior to partitioning.
Consider a survival analysis setup where survival times t are recorded along with a continuous baseline covariate x. This paper addresses the question of how to partition x in a survival model according to specified optimality criteria. By partitioning (or clustering) the continuous x, we are essentially discretizing the variable. Two natural options we will consider are: (1) include the partitioned x as a categorical covariate in the survival model or (2) use it to stratify the baseline hazard rate. These ideas are described in Section 2. Simulation results are presented in Section 3.1, two illustrative examples are presented in Section 3.2, and the paper concludes in Section 4.
A central issue in cluster analysis is choosing k, the number of clusters. In some applications, k may be determined or estimated based on an assumption that the population consists of k distinct but latent sub-populations. This approach is often problematic. Our approach in this paper is more pragmatic – we simply choose k in order to obtain interpretable and clinically useful results with the goal of keeping k as small as possible.
2. Methodology
In this section we describe two approaches for partitioning a continuous covariate for a survival model. The first approach is to simply use a categorical covariate (defined by discretizing x via the clustering) in the Cox proportional hazards model – we shall call this Venue 1. The second approach is to use the discretized covariate as a strata variable in a stratified proportional hazards model – we shall call this Venue 2. In the discussion below, x denotes a continuous covariate of interest, and for each venue we need to determine a clustering or partitioning of x. We denote by $\mathbf{z}$ other covariates that we may want to control for but that are not of primary interest. The next subsections describe our approaches.
2.1. Venue 1: use a discretized covariate in a Cox model
The Cox proportional hazards model is
$$h(t \mid x, \mathbf{z}) = h_0(t)\exp(\beta x + \boldsymbol{\gamma}'\mathbf{z}). \tag{1}$$
The goal here is to partition the continuous covariate x into k non-overlapping sets $S_1, \dots, S_k$ using some chosen partitioning strategy. Then one simply replaces x in the Cox model by its discretized version, defined by indicator variables. If x is one-dimensional, then the sets can be chosen to be intervals determined by a set of k−1 cutpoints $c_1 < \cdots < c_{k-1}$. The Cox model for the partitioned x variable using this approach is
$$h(t \mid x, \mathbf{z}) = h_0(t)\exp\Big(\sum_{j=2}^{k} \beta_j I(x \in S_j) + \boldsymbol{\gamma}'\mathbf{z}\Big), \tag{2}$$ where $S_1$ serves as the reference category.
A drawback of discretizing a continuous variable into two or more categories is that the resulting model has more parameters. However, the advantage is that the discretized covariate may lead to a model that is easier to interpret (particularly if the number of strata is small).
In practice, any clustering or partitioning strategy that yields a partition of x's support can be used (for illustration below in Section 3.1, we use k-means clustering and a uniform clustering). However, the next section describes a partitioning strategy to optimize the fit of the model for a given objective.
2.1.1. Optimal partitioning in a Cox model
The goal here is to choose a partitioning that optimizes some aspect of the fitted survival model. The optimization criterion can be, for example, to choose cutpoints that produce a fitted model with the highest partial log-likelihood or the smallest AIC. We shall call the partitioning based on maximizing the partial log-likelihood the optimal method.
Alternatively, suppose the primary goal of the survival analysis is to examine the strength of the statistical evidence for the effect of some variable or treatment on survival. A natural strategy is to determine the cutpoints that maximize the power of the statistical test to be used, such as the partial likelihood ratio test in a Cox proportional hazards model. Another criterion for choosing a partition, closely related to the magnitude of the likelihood ratio statistic, is to base it on the p-value of a statistical test of interest, i.e. pick the partition leading to the smallest p-value. To aid the visual interpretation of the results that follow, we can log-transform the p-values to better highlight the variability in their distribution. Again, we stress that the usual interpretation of statistical significance gleaned from the extreme p-values (which we call unadjusted p-values below) obtained from the partitioned model may not be valid [7]; instead, the goal is to obtain an interpretable model that best highlights the partitioned covariate's impact on survival.
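To make the p-value criterion concrete, here is a small sketch (in Python, though the analyses later use R's survival package) that converts a 1-degree-of-freedom likelihood ratio statistic into a p-value and its log-transform; the chi-square(1) tail probability is computed with the complementary error function.

```python
import math

def lr_pvalue(stat):
    # p-value of a 1-df likelihood-ratio statistic via the chi-square(1)
    # survival function: P(chi2_1 > s) = erfc(sqrt(s / 2))
    return math.erfc(math.sqrt(stat / 2.0))

def log10_pvalue(stat):
    # log-transform used to visualize very small p-values across cutpoints
    return math.log10(lr_pvalue(stat))
```

For example, a statistic of 3.841 (the 5% critical value of chi-square with 1 df) gives a p-value of about 0.05, and the log10 scale spreads out the tiny p-values that cluster near zero.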
These cutpoints can be found by a brute-force search over all possible partitions of the data into k non-overlapping sets. For k moderately large, this approach becomes computationally intensive. In practice, however, only a small number of strata is typically needed to provide useful partitioning results, which keeps the search computationally feasible.
First consider a partition of a continuous variable x into k = 2 groups. If there are n observations on a single continuous covariate, then there will be at most n−1 unique cutpoints. A simple strategy (which we implement below) is to fit the model using the indicator variables $I(x < c)$ and $I(x > c)$, where c is the cutpoint (the use of < and > can be changed to ≤ or ≥ depending on which choice provides a better fit). The optimal cutpoint can then be determined by examining which cutpoint gives the largest partial likelihood ratio test statistic in the survival analysis model. However, for cutpoints near the boundary of the support of the covariate, one of the two categories will contain very few observations, leading to a poorly conditioned model whose terms involving the categorized predictor can be statistically unstable.
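A minimal sketch of this single-cutpoint scan, in Python rather than the R tooling used later: `cox_loglik` is the Breslow partial log-likelihood for a binary group indicator, the supremum over β is taken on a coarse grid (a simplification of the Newton iterations a real fitter would use), and `best_cutpoint` performs the brute-force search with a minimum group-size guard.

```python
import math

def cox_loglik(beta, times, events, g):
    # Breslow partial log-likelihood for a single binary covariate g
    # (O(n^2) risk-set sums -- fine for a sketch, not for large n)
    ll = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue
        risk = sum(math.exp(beta * g[j])
                   for j in range(len(times)) if times[j] >= times[i])
        ll += beta * g[i] - math.log(risk)
    return ll

def lr_stat(times, events, g):
    # likelihood-ratio statistic 2*(sup_beta ll - ll(0)); the supremum is
    # approximated on a coarse grid, which is enough for ranking cutpoints
    ll0 = cox_loglik(0.0, times, events, g)
    ll1 = max(cox_loglik(b / 10.0, times, events, g) for b in range(-50, 51))
    return 2.0 * (ll1 - ll0)

def best_cutpoint(times, events, x, min_per_group=3):
    # brute-force scan over candidate cutpoints (the unique x values),
    # skipping cutpoints that leave either group with too few observations
    best_c, best_stat = None, -1.0
    for c in sorted(set(x))[:-1]:
        g = [1 if xi > c else 0 for xi in x]
        n1 = sum(g)
        if min(n1, len(g) - n1) < min_per_group:
            continue
        s = lr_stat(times, events, g)
        if s > best_stat:
            best_c, best_stat = c, s
    return best_c, best_stat
```

On a toy data set where subjects with x above 9 relapse much sooner, the scan recovers a cutpoint near that boundary.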
A similar brute-force approach can be used to partition a continuous covariate into k = 3 strata by simply looping over all combinations of two cutpoints (and hence every possible partition of the covariate into 3 intervals). This can be accomplished by using two loops in the software instead of one. The approach generalizes to any number of desired cutpoints, although the computational burden grows exponentially in k. For small values of k (say k = 2 or 3), the computation time is generally not overwhelming.
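The two-cutpoint enumeration can be sketched with `itertools.combinations`. To keep the example self-contained, a between-group sum of squares of the survival times stands in for the partial likelihood objective (as the comments note, in practice one would refit the Cox model for each candidate pair).

```python
from itertools import combinations

def best_two_cutpoints(times, x, score, min_per_group=3):
    # enumerate all pairs of cutpoints c1 < c2, i.e. all partitions of x
    # into three intervals, and keep the pair with the highest score
    cands = sorted(set(x))[:-1]
    best = (None, float("-inf"))
    for c1, c2 in combinations(cands, 2):
        groups = [0 if xi <= c1 else (1 if xi <= c2 else 2) for xi in x]
        sizes = [groups.count(g) for g in range(3)]
        if min(sizes) < min_per_group:
            continue
        s = score(times, groups)
        if s > best[1]:
            best = ((c1, c2), s)
    return best

def between_group_ss(times, groups):
    # stand-in objective: between-group sum of squares of the survival
    # times; in practice this would be the partial log-likelihood or -AIC
    overall = sum(times) / len(times)
    s = 0.0
    for g in set(groups):
        tg = [t for t, gi in zip(times, groups) if gi == g]
        s += len(tg) * (sum(tg) / len(tg) - overall) ** 2
    return s
```

With three well-separated survival-time groups, the enumeration recovers the cutpoint pair that separates them exactly.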
If one wishes to partition two or more covariates, one can simply iterate over all possible partitions of each covariate separately, forming a joint partition that corresponds to a Cartesian product, known as a product quantizer in the signal processing literature [e.g. 10, p. 42]. Product quantizers may not be optimal in terms of a stated objective function (e.g. yielding the smallest likelihood ratio p-value), but they often produce nearly optimal results. More importantly, product quantizers are easy to interpret and apply in practice: they give cut-off values for each of the covariates used in the partitioning.
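A product quantizer over two covariates is just the Cartesian product of per-covariate candidate cutpoints. The sketch below (with hypothetical candidate values) enumerates the joint cutpoint choices and labels each observation by its grid cell; each joint choice would then be scored by refitting the survival model.

```python
from itertools import product

def enumerate_joint_cutpoints(cand_lists):
    # the product quantizer search space is the Cartesian product of the
    # per-covariate candidate cutpoints
    return list(product(*cand_lists))

def cell_label(obs, cuts):
    # each covariate contributes a below/above indicator for its cutpoint,
    # so one cutpoint per covariate yields a 2 x 2 grid for two covariates
    return tuple(int(xi > ci) for xi, ci in zip(obs, cuts))
```

For example, with three candidate cutpoints for an age-like covariate and three for a second covariate, there are 3 × 3 = 9 joint choices to score.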
2.2. Venue 2: partitioning for a stratified proportional hazards model
Instead of using a categorized x as a covariate in a Cox proportional hazards model as described in Section 2.1.1, we could instead partition (or cluster) x and use the resulting strata in a stratified proportional hazards model. As above, consider a partitioning of the range of x into non-overlapping intervals $S_1, \dots, S_k$. Then we can consider the stratified proportional hazards model
$$h_j(t \mid \mathbf{z}) = h_{0j}(t)\exp(\boldsymbol{\gamma}'\mathbf{z}) \tag{3}$$
for strata $j = 1, \dots, k$ when $x \in S_j$, where $\mathbf{z}$ is a vector of other covariates that is not subject to partitioning.
The partitioning method for clustering x can be the same as in the previous section. The method we call ‘optimal’ for Venue 2 will be a partitioning that optimizes some aspect of the fit using the stratified proportional hazards model. In particular, we shall use minimization of the AIC as our optimization criterion.
3. Applications
In this section we first present simulation results illustrating the performance of our partitioning strategies and then provide a couple of data examples that motivated our work.
3.1. Simulation results
In this section we present results of simulation experiments to evaluate how the optimal partitioning approaches of Section 2.1.1 and Section 2.2 compare with two other clustering methods for both Venues 1 & 2:
Uniform partitioning. A straightforward approach to partition a single continuous covariate x is to simply partition x based on a uniform partitioning of its support. If k strata are needed in the partition, then simply form k intervals of equal length for the range of x values. Alternatively, one could partition x according to quantiles – this will likely lead to intervals in the partition of unequal length, but the interval lengths will mirror the distribution of x.
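Both variants of this partitioning reduce to computing k−1 cutpoints; a short sketch (with simple order-statistic quantiles) is:

```python
def uniform_cutpoints(x, k):
    # k equal-width intervals over the observed range of x
    lo, hi = min(x), max(x)
    w = (hi - lo) / k
    return [lo + i * w for i in range(1, k)]

def quantile_cutpoints(x, k):
    # cutpoints at the j/k sample quantiles (simple order-statistic
    # version; interpolation schemes would differ slightly)
    xs = sorted(x)
    n = len(xs)
    return [xs[int(j * n / k)] for j in range(1, k)]
```

The equal-width version ignores the shape of x's distribution, whereas the quantile version yields intervals with roughly equal counts.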
k-means clustering. A very popular partitioning approach is k-means clustering [e.g. 11]. For k-means clustering, an initial set of k cluster means is given and each data point is assigned to the cluster whose mean it is closest to. Once the data points are assigned to clusters, the cluster means are re-computed, and this process iterates until convergence. To implement this approach, one simply forms a partition by running the k-means algorithm on the covariate. The k-means approach can easily be applied to a multi-dimensional set of covariates by running the algorithm on the multi-dimensional covariate distribution. However, this approach, like uniform partitioning, forms partitions using the covariate information only and ignores the information in the survival times. Additionally, in the multivariate setting, the sets comprising the k-means partition, although convex, can have complicated forms that make interpretation difficult.
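For a single covariate, Lloyd's k-means algorithm takes only a few lines. The quantile-spaced initialization below is one simple deterministic choice (standard implementations typically use random restarts); interval cutpoints can then be taken as midpoints between adjacent sorted cluster means.

```python
def kmeans_1d(x, k, iters=100):
    # Lloyd's algorithm on a single covariate: assign each point to its
    # nearest mean, recompute means, repeat until assignments stabilize
    xs = sorted(x)
    # initialize means at spread-out quantiles to avoid empty clusters
    means = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for xi in x:
            j = min(range(k), key=lambda m: abs(xi - means[m]))
            clusters[j].append(xi)
        new = [sum(c) / len(c) if c else means[j]
               for j, c in enumerate(clusters)]
        if new == means:
            break
        means = new
    return means, clusters
```

Running this on three well-separated clumps of values recovers the clump centers, and the cutpoints `[(m1 + m2) / 2, (m2 + m3) / 2]` define the corresponding intervals.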
The optimization criterion for the k-means algorithm is to minimize the within cluster sum-of-squares. In the current context of survival analysis, a natural idea to consider for modifying the k-means clustering approach is to change the objective function from minimizing the within cluster sum-of-squares to maximizing the (partial) log-likelihood. As the algorithm iterates, instead of assigning points to clusters based on minimal distance to cluster means, one can assign points to clusters that result in the largest increase in the partial log-likelihood using the discretized covariate. We implemented this algorithm but unfortunately, the resulting clusters were not convex sets in the predictor space (e.g. for a single continuous variable, the clusters were not intervals) which makes interpreting the results difficult.
Our simulations also consider two cases for the covariate x: a homogeneous covariate (e.g. normally distributed) and a mixture covariate (e.g. a finite mixture of normal distributions). Only for Venue 1 did we calculate Schoenfeld residuals for each method, since only there does it make sense to check whether the proportional hazards assumption is met; we report p-values for these checks. For both venues, with k = 2 and k = 3, we calculated AIC scores averaged across simulations in order to compare methods and cutpoints.
3.1.1. Homogeneous covariate
Survival data sets were simulated from a Cox proportional hazards model
$$h(t \mid x, \text{trt}) = h_0(t)\exp(\beta_1 x + \beta_2\,\text{trt}) \tag{4}$$
when the predictor x is homogeneous – specifically, normally distributed so as to mimic an age distribution for adults – with a binary treatment effect (trt) included and the coefficients $\beta_1$ and $\beta_2$ held fixed. A Weibull distribution is used for the baseline hazard throughout. The methods used to partition x were uniform partitioning, k-means, optimal partitioning (searching for the partition that produced the smallest AIC among all possible cutpoints, subject to at least 3 observations per cluster), and Venue 2 based on fitting a stratified Cox proportional hazards model. 500 data sets were simulated and, for each one, the Cox proportional hazards model was fit with the partitioned x variable using each of the four methods. The Schoenfeld residuals in Table 1 show that, on average, the proportional hazards assumption is met. The AIC was recorded and the results are shown in the beanplots of Figure 1 for k = 3. The plot shows clearly that the uniform and k-means partitioning methods perform similarly in terms of AIC and are not competitive with the optimal partitioning. The AIC results are tabulated in Table 2, where we see that the optimal partitioning clearly performs best in terms of AIC. Note also that for k = 3, its AIC estimation is much more stable (i.e. smaller standard deviation) than partitioning based on the uniform and k-means methods. Simulation results for Venue 2 are shown in Table 3.
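Survival times under a Cox model with a Weibull baseline hazard can be generated by the inverse-CDF method. The sketch below is a hypothetical stand-in for this simulation: the paper's baseline parameters, covariate distribution, and coefficient values are not reproduced here, so the λ, ν, and N(70, 10²) choices are illustrative only.

```python
import math
import random

def simulate_cox_weibull(n, beta_x, beta_trt, lam=0.01, nu=1.5,
                         censor_time=90.0, seed=1):
    # inverse-CDF simulation: with Weibull baseline hazard
    # h0(t) = lam * nu * t**(nu - 1) and linear predictor eta,
    # T = (-log U / (lam * exp(eta)))**(1 / nu) for U ~ Uniform(0, 1)
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(70.0, 10.0)      # hypothetical age-like covariate
        trt = rng.randint(0, 1)        # binary treatment indicator
        eta = beta_x * x + beta_trt * trt
        u = rng.random()
        t = (-math.log(u) / (lam * math.exp(eta))) ** (1.0 / nu)
        event = 1 if t <= censor_time else 0   # administrative censoring
        data.append((min(t, censor_time), event, x, trt))
    return data
```

Each simulated data set of this form would then be partitioned by the four methods and the resulting AIC values compared.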
Table 1.
Simulation results: Schoenfeld residuals (normal).
| Methods | Min | Mean | Max | Std deviation |
|---|---|---|---|---|
| **k = 2** | | | | |
| Uniform | 0.0600 | 0.5578 | 0.9966 | 0.2188 |
| K-means | 0.0506 | 0.5725 | 0.9899 | 0.2138 |
| Optimal | 0.0000 | 0.3161 | 0.6590 | 0.1527 |
| **k = 3** | | | | |
| Uniform | 0.0080 | 0.7461 | 0.9999 | 0.2032 |
| K-means | 0.0000 | 0.7170 | 0.9998 | 0.2020 |
| Optimal | 0.0000 | 0.1522 | 0.4166 | 0.0811 |
Figure 1.
Bean plots of the simulated AIC distributions when data are simulated from the normal model with k = 3 clusters, for Venue 1 (left panel) and Venue 2 (right panel). The grey line marks the overall average.
Table 2.
Simulation results: AIC results: Venue 1 (normal).
| Methods | Min | Mean | Max | Std deviation |
|---|---|---|---|---|
| **k = 2** | | | | |
| Uniform | 1291 | 1426 | 1549 | 36.358 |
| K-means | 1286 | 1420 | 1524 | 34.314 |
| Optimal | 1282 | 1409 | 1516 | 33.383 |
| **k = 3** | | | | |
| Uniform | 1259 | 1377 | 2859 | 94.893 |
| K-means | 1214 | 1342 | 3226 | 107.978 |
| Optimal | 1178 | 1309 | 1399 | 34.440 |
Table 3.
Simulation results: AIC results: Venue 2 (normal).
| Methods | Min | Mean | Max | Std deviation |
|---|---|---|---|---|
| **k = 2** | | | | |
| Uniform | 1234 | 1342 | 1474 | 34.031 |
| K-means | 1215 | 1334 | 1423 | 30.494 |
| Optimal | 1215 | 1329 | 1413 | 30.203 |
| **k = 3** | | | | |
| Uniform | 1152 | 1257 | 1364 | 38.858 |
| K-means | 1087 | 1195 | 1283 | 30.676 |
| Optimal | 1070 | 1171 | 1247 | 29.225 |
3.1.2. Mixture covariate
The next simulation scenario generates survival data from the Cox model (1) with the same coefficients as in the normal case, but where x has a finite mixture density
$$f(x) = \sum_{k=1}^{K} \pi_k f_k(x), \tag{5}$$
where the weights $\pi_k$ sum to one and the $f_k$ are component density functions (e.g. normal densities). We used a mixture of K = 2 normal component distributions with equal weighting, $\pi_1 = \pi_2 = 1/2$. We found for the mixture distribution that the optimal method again performs better than k-means and uniform partitioning for k = 2 and k = 3. For k = 3 the difference between the optimal and k-means methods is less pronounced, but both perform better than uniform partitioning, as seen in Figure 2. Table 4 demonstrates that proportionality is maintained for the Cox model with the mixture-distributed covariate for both k = 2 and k = 3. The same pattern held for both venues, as seen in Tables 5 and 6.
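Sampling the covariate from a K = 2 normal mixture with equal weights is straightforward: draw a component index with the mixture weights, then sample from that component. The component means and standard deviations below are hypothetical placeholders, not the paper's values.

```python
import random

def sample_mixture(n, means=(60.0, 80.0), sds=(5.0, 5.0),
                   weights=(0.5, 0.5), seed=2):
    # two-component normal mixture: pick a component with probability
    # weights[j], then draw from N(means[j], sds[j]**2)
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        j = 0 if rng.random() < weights[0] else 1
        out.append(rng.gauss(means[j], sds[j]))
    return out
```

With well-separated component means, the resulting bimodal covariate gives the k-means and optimal partitions a natural two-group structure to recover.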
Figure 2.
Bean plots of the simulated AIC distributions for the four methods when data are simulated from the mixture model.
Table 4.
Simulation results: Schoenfeld residuals (mixture).
| Methods | Min | Mean | Max | Std deviation |
|---|---|---|---|---|
| **k = 2** | | | | |
| Uniform | 0.0920 | 0.6085 | 0.9985 | 0.2032 |
| K-means | 0.1093 | 0.6356 | 0.9993 | 0.1943 |
| Optimal | 0.0067 | 0.3748 | 0.7086 | 0.1446 |
| **k = 3** | | | | |
| Uniform | 0.1680 | 0.7374 | 0.9999 | 0.1808 |
| K-means | 0.0002 | 0.7119 | 0.9949 | 0.1895 |
| Optimal | 0.0000 | 0.1861 | 0.5041 | 0.0729 |
Table 5.
Simulation results: AIC results: Venue 1 (mixture).
| Methods | Min | Mean | Max | Std deviation |
|---|---|---|---|---|
| **k = 2** | | | | |
| Uniform | 1255 | 1365 | 1470 | 38.249 |
| K-means | 1240 | 1361 | 1481 | 38.530 |
| Optimal | 1237 | 1351 | 1460 | 37.639 |
| **k = 3** | | | | |
| Uniform | 1176 | 1304 | 1410 | 40.478 |
| K-means | 1156 | 1285 | 1393 | 38.504 |
| Optimal | 1136 | 1262 | 1361 | 38.022 |
Table 6.
Simulation results: AIC results: Venue 2 (mixture).
| Methods | Min | Mean | Max | Std deviation |
|---|---|---|---|---|
| **k = 2** | | | | |
| Uniform | 1183 | 1296 | 1389 | 34.192 |
| K-means | 1183 | 1293 | 1392 | 33.704 |
| Optimal | 1178 | 1288 | 1384 | 33.937 |
| **k = 3** | | | | |
| Uniform | 1042 | 1169 | 1280 | 36.137 |
| K-means | 1027 | 1147 | 1240 | 32.793 |
| Optimal | 1025 | 1136 | 1236 | 32.595 |
To summarize the simulation results, for each venue, the optimal method performs better than k-means and the uniform partitioning as is to be expected.
3.2. Data applications
In this section we illustrate the partitioning strategies with two examples.
3.2.1. Cancer example
Data from the SEER cancer site (https://seer.cancer.gov/data/citation.html) are used to illustrate the methods in this section; 2000 observations on lung and colon cancer records are used. Figure 3 shows estimated Kaplan–Meier survivorship curves for subjects with lung and colon cancer. The plot shows that survival is worse for lung cancer than for colon cancer.
Figure 3.

Kaplan–Meier survivorship curves for lung and colon cancer subjects.
This example uses only the covariate age (not controlling for other covariates in (3)). We illustrate the optimal partitioning in the survival models by partitioning age into k = 2 levels. Figure 4 shows a plot of the log-likelihood test statistic for all possible cutpoints into k = 2 levels for both Venues 1 and 2. The black curve is for partitioning the Cox model as in Section 2.1.1 and the blue curve is for the partitioning based on a stratified Cox model described in Section 2.2 (the latter curve has been scaled down by a factor of 50 so that both curves show up on the plot). The optimal cutpoint for the Cox regression model is x = 76.5 years and, for the stratified Cox model, x = 74.5 years. In order to assess the operating characteristics of the cutpoint estimators, we performed bootstrap resampling (using 500 bootstrap samples). The Venue 1 and Venue 2 cutpoints were estimated for each bootstrap sample (for k = 2) and the bootstrap sampling distributions are shown in Figure 5. The Venue 1 cutpoint distribution (black density curve) is shifted to the right of the Venue 2 cutpoint distribution, and the Venue 1 density shows evidence of a local maximum that roughly coincides with the average of the Venue 2 cutpoints. It is interesting to note that the optimal cutpoints for both venues lie to the right of the average age (70.2 years), indicating that an optimal partition is not achieved simply by grouping observations according to whether they lie above or below the mean age.
Figure 4.
Log-likelihood test statistic for all possible cutpoints for k = 2 strata for the lung and colon cancer data set.
Figure 5.
Bootstrap distribution for venue 1 (black) and 2 (dashed) cutpoint estimators (k = 2) for the cancer example.
For k = 3, the optimal age cutpoints using the Cox regression model are x = 65.5 and 79.5 years. For the stratified Cox model, the two optimal cutpoints were found to be x = 70.5 and 80.5 years.
The primary purpose of this example is to illustrate and contrast the partitioning strategies for Venues 1 and 2. Although the optimal clustering criterion gives similar cutpoints for the two venues, the cutpoints are used in different contexts: in one, the covariate is discretized for inclusion in the model, while in the other, the categorized predictor forms a stratification variable for the baseline hazard rate.
3.2.2. Depression study example
This section presents the primary illustration of the partitioning approach for modeling survival times. The example is related to moderator analysis, where the goal is to examine baseline covariates that are treatment effect modifiers – that is, to identify baseline covariates for which the outcome on one treatment compared to another depends on the value of the covariate. This section looks at moderators in terms of survival models where there are differences in survival based on the values of baseline covariates. The goal is to find a partition of continuous baseline covariates that yields clinically interpretable modifiers. All of the partitioning results that follow use the optimal partitioning of Venue 1.
First we describe the data. A depression study was conducted to determine optimal treatment duration for depression with fluoxetine using a randomized discontinuation trial design [23]. Study subjects were outpatients aged 18 to 65 years, meeting diagnostic criteria for major depression with severity scores of 16 or more on a modified 17-item Hamilton Rating Scale for Depression (HRSD). The study had two phases: a 12 week open phase and a 90 day discontinuation phase. In the open phase, all patients were treated openly with fluoxetine, 20 mg/day, for 12 weeks. The HRSD scores were assessed at weeks 0, 1, 2, 3, 4, 6, 8, 10, 11, 12. Remission was defined as HRSD scores of 7 or less and failure to meet the diagnostic criteria for major depression for the last 2 weeks. Demographic (gender, age) and clinical characteristics (such as persistency of depression, age-at-onset, melancholia type, chronicity, neurovegetative type, number of previous episodes, and many others), were recorded at baseline. Of 839 patients who openly received fluoxetine in the study, 395 completed the 12 weeks open treatment phase, met the remission criteria and agreed to be enrolled in a double blind discontinuation phase. In the discontinuation phase, these remitters were randomized to either continue taking fluoxetine or to be switched to placebo. They were followed for 90 days and the time to relapse was recorded. Subjects still in remission were censored at 90 days.
The goal of this analysis is to perform an optimal stratification of baseline variables for a survival analysis in order to maximize interaction effects (in this case between the two treatments, fluoxetine and placebo). Tarpey et al. [26] performed a partitioning strategy using only the initial open-label 12 week phase of these data and did not incorporate information from the discontinuation phase of the study. Our objective is to cluster baseline covariates in the hope of finding simple and interpretable modifiers of treatment outcome.
Figure 6 shows Kaplan–Meier survivorship functions for the time to relapse during the discontinuation phase of the study. As the figure clearly shows, subjects treated with placebo were more likely to relapse into depression than those treated with Prozac (fluoxetine). A Cox proportional hazards model fit to these data using only an indicator for treatment during the discontinuation phase yields a highly significant treatment indicator (not surprising, as seen from Figure 6) with a very small likelihood ratio p-value. In this section, we examine the effect of initial improvement, and also of age, on the survivorship function during the discontinuation phase.
Figure 6.

Kaplan–Meier survivorship functions for placebo and fluoxetine (Prozac) treated subjects during the 3-month discontinuation phase.
First, we examine our partitioning approach using the covariate improvement after one-week of treatment during the first open-label phase. This covariate is defined as
$$x = \text{HRSD}(\text{week } 0) - \text{HRSD}(\text{week } 1). \tag{6}$$
Since lower values on the HRSD correspond to lower levels of depression, larger values of the variable x correspond to higher levels of response. It is generally believed that SSRI treatments for depression (such as fluoxetine) take some time to build up in the system before having a specific, chemically mediated effect on mood. However, placebo effects, which can be quite substantial in depression treatment, can manifest themselves immediately. Thus, if a subject is experiencing improvement immediately, say within the first week of treatment, then a placebo response is a plausible explanation, whereas a specific drug response is either unlikely or would be quite weak. Another explanation is spontaneous improvement (say, due to factors other than treatment). Figure 7 shows a histogram of the 1-week improvement x, which is skewed to the right.
Figure 7.
Improvement on the HRSD after one-week of open-label fluoxetine treatment.
As an initial analysis, a Cox proportional hazards model was fit with hazard function
$$h(t \mid x, \text{trt}) = h_0(t)\exp(\beta_1 x + \beta_2\,\text{trt} + \beta_3\, x \cdot \text{trt}), \tag{7}$$
where t is the time to relapse, $h_0$ is the baseline hazard function, x denotes the 1-week improvement score (6), and ‘trt’ is a treatment indicator (0 for placebo and 1 for fluoxetine). The model was fit using the ‘coxph’ function from the R package survival. The improvement variable and its interaction with treatment were both highly significant based on Wald tests (p-values of 0.0022 and 0.0205, respectively).
To illustrate the difference between placebo and Prozac treated subjects, Figure 8 shows a plot of the estimated hazard ratio from the estimated Cox model based on initial improvement
$$\widehat{\text{HR}}(x, \text{trt}) = \exp(\hat{\beta}_1 x + \hat{\beta}_2\,\text{trt} + \hat{\beta}_3\, x \cdot \text{trt}). \tag{8}$$
This figure indicates that the initial improvement (as measured by x) appears to be predictive as to whether or not a person will relapse when treated with placebo; however, for subjects that remained on Prozac during the discontinuation phase, the initial improvement is not predictive of relapse at all.
Figure 8.

Estimated hazard ratios for Prozac and placebo treated patients as a function of the improvement in HRSD after one-week of treatment.
A major drawback to the analysis presented here is that a unit increase in the HRSD is difficult to interpret since the HRSD value is obtained by aggregating results from an instrument consisting of several depression-related questions. Thus, one motivation for partitioning the improvement covariate here is to obtain a more coherent interpretation of the impact of initial improvement on the likelihood of relapsing.
An optimal stratification (for k = 2, Venue 1) was performed over all possible cutpoints for the continuous predictor improvement (after 1 week) and the results are shown in the left panel of Figure 9. The optimal cutpoint (in terms of yielding the smallest Wald p-value for the significance of the interaction term) occurs at x = 12.5. Since the smaller p-values are concentrated in a tiny interval near zero, the p-values were log-transformed and the results are shown in the right panel of Figure 9. In order to better distinguish the signal from the large degree of variation in these p-values, a penalized cubic B-spline smooth was estimated and plotted to show the overall trend in the log-transformed p-values. The right panel shows more clearly a global minimum to the log-transformed p-values (again with the vertical line indicating the smallest value).
Figure 9.
Left panel: A plot of p-values versus cutpoint for the depression study. The cutpoints are for the 1-week improvement scores (on HRSD). The optimal cutpoint occurs at x = 12.5 which is marked by the vertical line. (The horizontal dashed grey line is at 0.05.) Right panel shows the log-transformed p-values with a cubic B-spline smooth.
For the optimal cutpoint in the partitioned Cox model, all the terms are highly significant via Wald's test, which may not be surprising since the non-discretized improvement covariate was highly significant. However, as noted in the Introduction, the p-values based on the discretized covariate need to be interpreted with caution since they represent the extremes of the p-values across all cutpoints. For a partition of improvement into two groups, low improvement was coded as 0 and high improvement (HI) was coded as 1. Using the optimal cutpoint, the estimated Cox model is proportional to
where the reference group is placebo-treated subjects with low initial improvement. From this fitted model, we estimate that placebo-treated subjects who had a high level of initial improvement were times more likely to relapse than placebo-treated subjects who did not have a high initial improvement. For Prozac-treated subjects with a low level of initial improvement we have ; thus, Prozac-treated subjects with a low level of initial improvement were only about half as likely to relapse as placebo-treated subjects with a low level of initial improvement. The estimated interaction coefficient of is quite large, yielding an unadjusted Wald p-value of 0.0028. From this we see that the hazard ratio comparing Prozac-treated subjects with a high level of initial improvement to Prozac-treated subjects with a low level of initial improvement is . Thus, for Prozac-treated subjects, there is a slightly lower risk of relapse if there is a large initial improvement, but for placebo-treated subjects, those who had a large initial improvement were almost 6 times more likely to relapse.
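To make the hazard-ratio arithmetic in this paragraph concrete, the following sketch computes hazard ratios between covariate cells from the coefficients of a two-by-two Cox model with an interaction. The coefficient values below are purely illustrative placeholders, chosen only to mimic the qualitative pattern described in the text; they are not the fitted values from the depression study.

```python
import math

# Placeholder coefficients for a Cox model of the form
#   h(t) = h0(t) * exp(b_hi*HI + b_rx*RX + b_int*HI*RX),
# where HI = 1 for high initial improvement and RX = 1 for Prozac.
# These values are illustrative only, NOT the paper's fitted estimates.
b_hi, b_rx, b_int = 1.75, -0.70, -1.90

def hazard_ratio(hi, rx, hi_ref, rx_ref):
    """Hazard ratio comparing cell (hi, rx) to reference cell (hi_ref, rx_ref).
    The baseline hazard h0(t) cancels, leaving exp of a difference of
    linear predictors."""
    lp = lambda h, r: b_hi * h + b_rx * r + b_int * h * r
    return math.exp(lp(hi, rx) - lp(hi_ref, rx_ref))
```

For instance, `hazard_ratio(1, 0, 0, 0)` compares placebo-treated high-improvers to the reference cell, while `hazard_ratio(1, 1, 0, 1)` compares high- to low-improvers within the Prozac arm, illustrating how the interaction coefficient enters only the latter comparison.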
Recall that all the subjects in this study were rated as responders after the initial open-label 12-week phase of the study. What we see here is that those who experienced a strong degree of initial improvement (which could be labeled a placebo response, since the specific drug effect is weak to non-existent during the first week) and were later randomized to placebo had much higher rates of relapse. The higher rate of relapse did not occur, though, for subjects who remained on Prozac during the discontinuation phase. The conclusion, then, is that when treating someone for depression with a drug like Prozac, a patient who shows immediate improvement is more likely to need the drug for maintenance in order to reduce the risk of relapsing.
For illustration, Figure 10 shows a plot of the estimated interaction coefficient versus the improvement cutpoint, with the vertical line again marking the optimal cutpoint. Recall that in a moderator analysis, it is the interaction effect (in this case between treatment and initial improvement) that is of primary interest. The magnitude of the interaction coefficient is mostly flat as the cutpoint increases until one reaches the optimal value of the cutpoint, where the interaction coefficient becomes more negative. The far-right points in Figure 10 show very large (negative) interaction coefficients; although these values are large in magnitude, they do not correspond to small unadjusted p-values, as can be seen in the plot of the associated p-values in Figure 9. This illustrates that using unadjusted p-values rather than the magnitude of the coefficient can provide a clearer interpretation of the results with a partitioned covariate.
Figure 10.

A plot of the estimated interaction coefficient in the Cox model, with the improvement variable partitioned into k = 2 strata, versus the cutpoint. The vertical line marks the optimal cutpoint (as shown in Figure 9).
The solid red curve in Figure 9 corresponds to the (unadjusted) p-values based on the stratified Cox model, where the stratification is obtained by partitioning the initial improvement into two intervals using a single cutpoint. For each cutpoint, the full model is a stratified Cox model with a factor for treatment, while the reduced model is a Cox model with only a treatment effect and no stratification.
For a further illustration, the above procedure was repeated using the age covariate, which produced Figure 11 showing the age-treatment interaction p-values versus the age cutpoints. The fitted Cox model with a continuous age variable is proportional to
and each coefficient is significant (p<0.03) using Wald's criterion. Partitioning the age covariate yields interpretations for ‘young’ and ‘old’ individuals. Model (7) was fit using a dichotomized age variable. The right panel of Figure 11 shows the log-transformed p-values along with a penalized cubic B-spline fit (the solid curve). The optimal cutpoint is estimated to be at (the minimal point of the penalized cubic B-spline curve shown in Figure 11). In the left panel of Figure 11 it is difficult to ascertain where the optimal age cutpoint lies, but in the (log-transformed) right panel a clear optimal value for the age cutpoint (at about the age of 32) can be seen. The estimated coefficients (along with their standard errors) are
and each of these coefficients is large relative to its standard error, yielding very small unadjusted p-values. From these results, we have that for placebo-treated subjects, the risk of relapse during the discontinuation phase is (or about half) for older patients (age ) relative to younger patients. However, and thus, for Prozac-treated subjects, the risk of relapse for older patients is . That is, older Prozac-treated patients were more than one and a half times more likely to relapse than younger patients.
Figure 11.
Left panel: Plot of the interaction p-values versus age cutpoints for testing significance of the treatment versus age category (younger versus older). Right Panel: The log-transformed p-values versus the age cutpoints with a penalized cubic B-spline overlaid to indicate the trend in p-values versus the age cutpoints. The vertical line marks the age cutpoint of 32.14 giving the smallest interaction p-value.
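The smooth-then-minimize step used in Figure 11 can be illustrated with a simple moving-average smoother standing in for the paper's penalized cubic B-spline; the function name and window choice below are ours, for illustration only.

```python
import math

def smoothed_argmin(cutpoints, pvalues, window=5):
    """Smooth log10(p-value) with a centered moving average -- a simple
    stand-in for a penalized cubic B-spline -- and return the cutpoint
    at which the smoothed curve attains its minimum."""
    logs = [math.log10(p) for p in pvalues]
    half = window // 2
    smoothed = []
    for i in range(len(logs)):
        lo, hi = max(0, i - half), min(len(logs), i + half + 1)
        smoothed.append(sum(logs[lo:hi]) / (hi - lo))
    i_min = min(range(len(smoothed)), key=smoothed.__getitem__)
    return cutpoints[i_min]
```

Smoothing on the log scale, as in the right panels of Figures 9 and 11, keeps tiny p-values near zero from being visually (and numerically) indistinguishable.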
The above approach can also be extended to an optimal partitioning of a model using both the improvement and age covariates, partitioning each variable into two categories to form four classes (Low Improvement & Younger, Low Improvement & Older, High Improvement & Younger, High Improvement & Older). We do not give the results here, but note that this can be performed by a grid search in the two-dimensional covariate space.
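A minimal sketch of such a two-dimensional grid search, assuming an arbitrary user-supplied scoring criterion (e.g. an interaction Wald statistic or p-value) where smaller is better; the function name and interface are ours:

```python
def best_cutpoint_pair(x1, x2, score):
    """Grid search over pairs of cutpoints (one per covariate).  Each pair
    induces four classes (low/low, low/high, high/low, high/high); `score`
    is any user-supplied criterion on the class labels, smaller = better."""
    best = None
    for c1 in sorted(set(x1))[1:-1]:
        for c2 in sorted(set(x2))[1:-1]:
            labels = [2 * (a > c1) + (b > c2) for a, b in zip(x1, x2)]
            s = score(labels)
            if best is None or s < best[2]:
                best = (c1, c2, s)
    return best  # (cutpoint for x1, cutpoint for x2, score)
```

The grid has one candidate pair per combination of interior values of the two covariates, so the search cost grows with the product of the numbers of distinct values; coarsening the grid is a practical option for large samples.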
The depression study example illustrates the use of our partitioning methodology to create a binary predictor from a continuous covariate. This principled approach produces a binary predictor with a clear interpretation of the results (e.g. the effect of high versus low baseline depression levels on the likelihood of relapse). The example also shows how data summarization via partitioning can be very useful to clinicians in practice, because it provides a clearer and more succinct picture of how the covariate impacts the outcome.
As an additional verification of our methods, we compared our approaches to the method in [6], where the optimal cutpoints are estimated so as to produce equal log-relative hazard values. We applied the method of [6] to both the SEER cancer dataset and the depression data example. The approach in [6] had difficulties with the depression example: its implementation could not handle missing covariate values, and problems persisted even after the missingness was eliminated. For the SEER cancer data, the method in [6] produced results that were difficult to interpret and very discrepant from what we had obtained. For example, for k = 3, the method in [6] found a left cutpoint of 8.213 and a right cutpoint of 76.640. This solution is difficult to interpret because the lower cutpoint of 8.213 produces an essentially empty cluster, since very few individuals have age below this cutpoint. In contrast, for our approaches with k = 3, we obtain interpretable age cutpoints of x = 65.5 and 79.5 years using venue 1. For the stratified Cox model (venue 2), the two optimal cutpoints were found to be x = 70.5 and 80.5 years. We therefore feel this further demonstrates that our method has more flexibility with different types of data than the current state of the art.
4. Conclusion and discussion
We have described methods for discretizing continuous covariates via clustering in a survival model in order to obtain better insight into the effect of a covariate on survival. We considered two venues: Venue 1 discretizes a continuous covariate and uses the discretized covariate in the model as a predictor variable. Venue 2 discretizes a continuous covariate and stratifies the baseline hazard rate using the discretized variable, which is useful when that covariate is suspected of causing non-proportional hazards.
We showed, through simulations as well as two applications, that the optimal strategy appears to perform better for selecting cutpoints in both venues than either a uniform or a k-means clustering strategy. In particular, using discretized measures (obtained via clustering) of age and initial improvement in our depression example provided useful insight into whether or not a depressed individual will need active maintenance treatment, based on the person's age and whether or not the person exhibits an initial placebo response.
In this paper we focused on partitioning a single continuous covariate with a single cutpoint. We have performed preliminary work on simultaneously partitioning more than one covariate and on using multiple cutpoints. This problem has been addressed in the partitioning literature [24,25], and it would be an interesting avenue for future research for survival data.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1.Abdolell M., LeBlanc M., Stephens D., and Harrison R., Binary partitioning for continuous longitudinal data: categorizing a prognostic variable, Stat. Med. 21 (2002), pp. 3395–3409. [DOI] [PubMed] [Google Scholar]
- 2.Altman D., Categorising continuous variables, Br. J. Cancer 64 (1991), p. 975. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Altman D., Suboptimal analysis using ‘optimal’ cut points, Br. J. Cancer 78 (1998), pp. 556–557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Altman D., Lausen B., Sauerbrei W., and Schumacher M., Dangers of using ‘optimal’ cut points in the evaluation of prognostic factors, J. Natl. Cancer Inst. 86 (1994), pp. 829–835. [DOI] [PubMed] [Google Scholar]
- 5.Altman D.G. and Royston P., The cost of dichotomising continuous variables, Br. Med. J. 332 (2006), p. 1080. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Chen Y., Huang J., He X., Gao Y., Mahara G., Lin Z., and Zhang J., A novel approach to determine two optimal cut-points of a continuous predictor with a U-shaped relationship to hazard ratio in survival data: Simulation and application, BMC. Med. Res. Methodol. 19 (2019), p. 96. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Contal C. and O'Quigley J., An application of changepoint methods in studying the effect of age on survival in breast cancer, Comput. Stat. Data. Anal. 30 (1999), pp. 253–270. [Google Scholar]
- 8.Cumsille F., Bangdiwala S.I., Sen P.K., and Kupper L.L., Effect of dichotomizing a continuous variable on the model structure in multiple linear regression models, Comm. Statist. Theory Methods 29 (2000), pp. 643–654. [Google Scholar]
- 9.Faraggi D. and Simon R., A simulation study of cross-validation for selecting an optimal cutpoint in univariate survival analysis, Stat. Med. 15 (1996), pp. 2203–2213. [DOI] [PubMed] [Google Scholar]
- 10.Graf S. and Luschgy H., Foundations of Quantization for Probability Distributions, Springer, Berlin, 2000. [Google Scholar]
- 11.Hartigan J.A. and Wong M.A., A k-means clustering algorithm, Appl. Stat. 28 (1979), pp. 100–108. [Google Scholar]
- 12.Hilsenbeck S. and Clark G., Practical p-value adjustment for optimally selected cut points, Stat. Med. 15 (1996), pp. 103–112. [DOI] [PubMed] [Google Scholar]
- 13.Hollander N., Sauerbrei W., and Schumacher M., Confidence intervals for the effect of a prognostic factor after selection of an ‘optimal’ cutpoint, Stat. Med. 23 (2004), pp. 1701–1713. [DOI] [PubMed] [Google Scholar]
- 14.Hollander N. and Schumacher M., On the problem of using ‘optimal’ cut points in the assessment of quantitative prognostic factors, Onkologie 24 (2001), pp. 194–199. [DOI] [PubMed] [Google Scholar]
- 15.Icuma T.R., Achcar J.A., Martinez E.Z., and Davarzani N., Determination of optimum medical cut points for continuous covariates in lifetime regression models, Model. Assist. Stat. Appl. 13 (2018), pp. 141–159. [Google Scholar]
- 16.Jespersen N., Dichotomizing a continuous covariate in the Cox regression model, Tech. Rep., Statistical Research Unit of University of Copenhagen, Research Report, 1986.
- 17.Klein J.P. and Wu J.T., Discretizing a continuous covariate in survival studies, in Handbook of Statistics: Advances in Survival Analysis, N. Balakrishnan and C.R. Rao, eds., Vol. 23, Elsevier, New York, 2004, pp. 27–42.
- 18.Lausen B. and Schumacher M., Evaluating the effect of optimized cutoff values in the assessment of prognostic factors, Comput. Statist. Data Anal. 21 (1996), pp. 307–326. [Google Scholar]
- 19.Magder L.S. and Fix A.D., Optimal choice of a cut point for a quantitative diagnostic test performed for research purposes, J. Clin. Epidemiol. 56 (2003), pp. 956–962. [DOI] [PubMed] [Google Scholar]
- 20.Mazumdar M. and Glassman J.R., Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments, Stat. Med. 19 (2000), pp. 113–132. [DOI] [PubMed] [Google Scholar]
- 21.Royston P., Altman D.G., and Sauerbrei W., Dichotomizing continuous predictors in multiple regression: A bad idea, Stat. Med. 25 (2006), pp. 127–141. [DOI] [PubMed] [Google Scholar]
- 22.Silva G.T. and Klein J.P., Cutpoint selection for discretizing a continuous covariate for generalized estimating equations, Comput. Statist. Data Anal. 55 (2011), pp. 226–235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Stewart J.W., Quitkin F.M., McGrath P.J., Amsterdam J., Fava M., Fawcett J., Reimherr F., Rosenbaum J., Beasley C., and Roback P., Use of pattern analysis to predict differential relapse of remitted patients with major depression during 1 year of treatment with fluoxetine or placebo, Arch. Gen. Psychiatry. 55 (1998), pp. 334–343. [DOI] [PubMed] [Google Scholar]
- 24.Tarpey T., Self-consistent patterns for symmetric multivariate distributions, J. Classif. 15 (1998), pp. 57–79. [Google Scholar]
- 25.Tarpey T., A parametric k-means algorithm, Comput. Stat. 22 (2007), pp. 71–89. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Tarpey T., Petkova E., Lu Y., and Govindarajulu U., Optimal partitioning for linear mixed effects models: Applications to identifying placebo responders, J. Amer. Statist. Assoc. 105 (2010), pp. 968–977. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Williams B.A., Jayawant M.S., Mandrekar N., and Mandrekar S.J., Finding optimal cutpoints for continuous covariates with binary and time-to-event outcomes, Tech. Rep., Mayo Foundation, 2006.