Abstract
The mixture cure rate model (MCM) is the most widely used model for the analysis of survival data with a cured subgroup. In this context, the most common strategy to model the cure probability is to assume a generalized linear model with a known link function, such as the logit link function. However, the logit model can only capture simple effects of covariates on the cure probability. In this article, we propose a new MCM where the cure probability is modeled using a decision tree-based classifier and the survival distribution of the uncured is modeled using an accelerated failure time structure. To estimate the model parameters, we develop an expectation maximization algorithm. Our simulation study shows that the proposed model performs better in capturing nonlinear classification boundaries when compared to the logit-based MCM and the spline-based MCM. This results in more accurate and precise estimates of the cure probabilities, which in turn results in improved predictive accuracy of cure. We further show that capturing the nonlinear classification boundary also improves the estimation results corresponding to the survival distribution of the uncured subjects. Finally, we apply our proposed model and the EM algorithm to analyze existing bone marrow transplant data.
KEYWORDS: Decision tree, EM algorithm, multiple imputation, cure rate, cross-validation
MATHEMATICS SUBJECT CLASSIFICATION: 62N02
1. Introduction
In traditional survival models, such as the accelerated failure time (AFT) model or the proportional hazards (PH) model, an implicit assumption is that all patients will eventually experience the event of interest if the follow-up is continued for a sufficiently long period of time [12,14]. Examples of an event of interest could be death due to a disease, recurrence of a disease, or complete recovery from a disease, among others. Even though the AFT model has received much less attention compared to the PH model, it is a useful alternative to the PH model, specifically when the PH assumption is not satisfied. In clinical trials with good overall prognosis, we often find that a significant proportion of patients respond favorably to the treatment and consequently show long-term survival with respect to the event. This group of patients results in heavy censoring towards the end of a long follow-up study and is considered as cured or long-term survivors [2–6]. The other group of patients, which experiences the event, is called the non-cured or susceptible group. Thus, the population can be considered as a mixture of cured and susceptible patients [20–22,34]. The traditional survival models (AFT or PH) are not appropriate for such a mixture population since they cannot accommodate the cured group of patients, owing to their rigid assumption. Mixture cure rate models (MCMs) are well known for analyzing time-to-event data that indicate the presence of a cured subgroup [27].
Let T denote the time-to-event (or lifetime) corresponding to a mixture population and $S_p(t)$ be the survival function of T, also known as the population or overall survival function. For cured patients, we have $P(T = \infty) = 1$, whereas for susceptible patients we have $P(T < \infty) = 1$. Let U be an indicator variable that takes the value 0 if $T = \infty$ (i.e. the patient is cured) and 1 if $T < \infty$ (i.e. the patient is susceptible). Furthermore, let $S_u(t)$ denote the survival function of the susceptible patients only, i.e. $S_u(t) = P(T > t \mid U = 1)$, and $\pi$ denote the uncured probability, i.e. $\pi = P(U = 1)$. In addition, let $\boldsymbol{x}$ and $\boldsymbol{z}$ be the vectors of covariates affecting the survival distribution of the uncured (also called latency) and the uncured probability (also called incidence), respectively. Note that $\boldsymbol{x}$ and $\boldsymbol{z}$ may share common elements. Then, the MCM can be expressed as follows:
$$S_p(t \mid \boldsymbol{x}, \boldsymbol{z}) = 1 - \pi(\boldsymbol{z}) + \pi(\boldsymbol{z})\, S_u(t \mid \boldsymbol{x}). \tag{1}$$
In the context of the MCM, most studies have used a generalized linear model for the incidence $\pi(\boldsymbol{z})$ with a known link function. In this regard, the logistic link function is the most common choice [2,3,6]. Other choices of link functions are the probit and complementary log-log link functions. However, these parametric link functions produce similar results and they are not flexible enough to capture complex patterns in the data [13]. From a classification point of view, this implies that the parametric link functions assume the boundary separating the cured and susceptible patients to be linear, which may not be true in practice. Recently, researchers have considered a semi-parametric approach to model the incidence with smooth effects of covariates [29]; however, the effects still act on the incidence through the logistic link. Other nonparametric approaches to model the incidence include spline-based methods [8,35], which may not work well in the presence of complicated interactions among covariates. Thus, it is clear that there is room to improve the modeling of the incidence part of the MCM.
With the popularity of data mining, machine learning (ML)-based classification methods have turned out to be more flexible and robust when compared to traditional models such as logistic regression [13,16,36]. In this article, we propose to use a decision tree (DT)-based approach to model the incidence, i.e. $\pi(\boldsymbol{z})$, with the objective of capturing complex relationships between the cured status and $\boldsymbol{z}$. To model the latency, i.e. $S_u(t \mid \boldsymbol{x})$, we use an AFT structure with an unspecified baseline survival function. This leads to a semi-parametric MCM where the incidence and latency parts are modeled using DT and AFT, respectively. To estimate the unknown model parameters, we develop an expectation maximization (EM) algorithm. In this regard, interested readers may also see some recent works on a stochastic variation of the EM algorithm in the context of cure rate models [10,15,24]. To demonstrate the superiority of our proposed model (MCM-AFT-DT), we compare our model with the logistic regression-based MCM (MCM-AFT-Logit) as well as the spline regression-based MCM (MCM-AFT-Spline), noting that spline-based models can also capture non-linearity in the data. Given the availability of other ML-based approaches, such as the support vector machine (SVM) [13,25,26] and neural networks (NN) [36], we prefer the DT since it is easy to interpret and can deal with collinearity better than other ML techniques such as the SVM. Moreover, the DT is expected to be computationally less expensive when compared to NN and random forests.
The rest of this article is organized as follows. In Section 2, we discuss the formulation of the MCM-AFT-DT model and develop the steps of the EM algorithm for parameter estimation. In Section 3, we carry out an extensive Monte Carlo simulation study to demonstrate the performance and superiority of the proposed model. In Section 4, we apply the proposed model to analyze survival data on leukemia patients who underwent bone marrow transplantation. Finally, in Section 5, we make some concluding remarks and discuss some potential future research problems.
2. Methods
Let us assume that the observed data are of the form $(t_i, \delta_i, \boldsymbol{x}_i, \boldsymbol{z}_i)$, $i = 1, \ldots, n$, where $t_i$ is the observed lifetime, $\delta_i$ is the right-censoring indicator with $\delta_i = 0$ if $t_i$ is right censored and $\delta_i = 1$ if $t_i$ is uncensored, $\boldsymbol{x}_i$ and $\boldsymbol{z}_i$ are the observed values of $\boldsymbol{x}$ and $\boldsymbol{z}$ for the ith subject, and n is the sample size. We also assume the censoring to be independent and non-informative. The semi-parametric MCM with AFT structure for the latency can be written as follows:
$$S_u(t \mid \boldsymbol{x}) = S_0\!\left(t\, e^{-\boldsymbol{\beta}^\top \boldsymbol{x}}\right), \tag{2}$$
where $S_0(\cdot)$ is an unspecified baseline survival function and $\log T = \boldsymbol{\beta}^\top \boldsymbol{x} + \varepsilon$, with $\boldsymbol{\beta}$ denoting the vector of regression coefficients. For the incidence $\pi(\boldsymbol{z})$, instead of using the traditional logistic regression model, we use the DT to model the effect of $\boldsymbol{z}$ on $\pi(\boldsymbol{z})$.
Let the covariate vectors $\boldsymbol{z}_i$ and the cured status indicators $u_i$, for $i = 1, \ldots, n$, be the input data to build the DT model for the incidence. Now, consider a partition of the covariate space into M regions, say, $R_1, R_2, \ldots, R_M$, and assume the response can be modeled as binary in each region, where we let m denote the index of a terminal node. Then, in node m, representing the region $R_m$ with $N_m$ observations, let us define
$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{i \in R_m} I(u_i = k) \tag{3}$$
as the proportion of class k observations in node m, where k takes the value 0 if the input data relate to a cured subject and 1 if they relate to a susceptible subject. Observations in node m are classified to the majority class, i.e. $k(m) = \arg\max_k \hat{p}_{mk}$. In the given context, although one can think of different measures of node impurity, we consider the Gini index in this work, which is defined as follows:
$$\text{Gini index} = \sum_{k=1}^{K} \hat{p}_{mk}\left(1 - \hat{p}_{mk}\right) \tag{4}$$
In Equation (4), K denotes the number of classes and it takes the value 2 (cured and susceptible). The complexity of the DT is controlled by the size of the tree, which is a tuning parameter. The recursive binary splitting technique may lead to a very large tree which may over-fit the data. To circumvent this issue, the cost-complexity pruning can be used to prune the full tree and narrow down to a number of sub-trees for comparisons.
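As a concrete illustration of Equations (3) and (4), the following minimal R sketch computes the class proportions and the Gini index for a single node; the function name and example data are ours, not part of the original implementation.

```r
# Class proportions (Equation (3)) and Gini index (Equation (4)) for one node;
# u_node holds the 0/1 cured statuses of the observations falling in node m.
gini_node <- function(u_node) {
  p_mk <- table(factor(u_node, levels = c(0, 1))) / length(u_node)
  sum(p_mk * (1 - p_mk))      # for K = 2 this equals 2 * p * (1 - p)
}

gini_node(c(0, 0, 1, 1, 1))   # node with 2 cured and 3 susceptible subjects: 0.48
```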
Let $T_0$ denote the full tree, $T \subseteq T_0$ denote a sub-tree obtained by pruning $T_0$, and $|T|$ denote the number of terminal nodes in T. Then, the cost-complexity criterion can be defined as:
$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m\, Q_m(T) + \alpha\, |T|, \tag{5}$$
where $Q_m(T)$ is the impurity of node m, here measured by the Gini index of Equation (4), and α represents the cost-complexity tuning parameter (cp). Two further hyper-parameters constrain the tree-growing process: $n_s$, the minimum number of observations that must exist in a node for a split to be attempted (minsplit), and $n_b$, the minimum number of observations in any terminal or leaf node (minbucket). The goal here is to find the sub-tree that minimizes Equation (5) for each α. Note that α controls the trade-off between the size of the tree and its flexibility of fit to the data, which implies that with an increase in α the number of terminal nodes in the sub-tree decreases, and vice versa. In particular, if $\alpha = 0$, there is no penalty and the best sub-tree is the full tree $T_0$ created through recursive binary splitting [9].
To avoid over-fitting, we split the data into a training set and a test set. Since a DT can easily over-fit the training data, we use a two-stage modeling approach. In the first stage, we perform a grid-search ten-fold cross-validation on the training set, over grids of candidate values for α and $n_s$ and with $n_b$ held fixed, to obtain the optimal hyper-parameters. The best hyper-parameters obtained through this search are then used to grow the first tree. In the second stage, we perform cost-complexity pruning on this first tree. The cost-complexity pruning parameter is also obtained using cross-validation, and the value corresponding to the lowest cross-validation error is used to prune the tree grown in the first stage. The pruned tree obtained in the second stage is considered as the final optimal model for predictions on new data. Next, we validate the performance of the optimal model using the test set. For this purpose, we use the graphical receiver operating characteristic (ROC) curve and its area under the curve (AUC).
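This two-stage tuning can be sketched with the R package 'rpart', whose cp, minsplit and minbucket arguments correspond to the hyper-parameters named above. The grid values and the data frame 'train' (with cure status u and covariates z1, z2) are illustrative assumptions, not the settings used in the paper; minbucket is left at its rpart default here.

```r
library(rpart)

# Stage 1: grid-search 10-fold CV over cp and minsplit (illustrative grids)
grid <- expand.grid(cp = c(0.001, 0.005, 0.01, 0.05), minsplit = c(10, 20, 30))
cv_err <- apply(grid, 1, function(g) {
  fit <- rpart(factor(u) ~ z1 + z2, data = train, method = "class",
               control = rpart.control(cp = g["cp"], minsplit = g["minsplit"],
                                       xval = 10))
  min(fit$cptable[, "xerror"])   # lowest 10-fold CV error for this setting
})
best <- grid[which.min(cv_err), ]
fit1 <- rpart(factor(u) ~ z1 + z2, data = train, method = "class",
              control = rpart.control(cp = best$cp, minsplit = best$minsplit,
                                      xval = 10))

# Stage 2: cost-complexity pruning at the cp with the lowest CV error
cp_star    <- fit1$cptable[which.min(fit1$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit1, cp = cp_star)
```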
Let $f(\boldsymbol{z})$ denote the output of the DT model, which is treated as an uncalibrated prediction for classification. To obtain calibrated posterior uncured probabilities, we pass these outputs through the following sigmoid function [28]:
$$\pi(\boldsymbol{z}) = \frac{1}{1 + \exp\{A f(\boldsymbol{z}) + B\}}, \tag{6}$$
where A and B are unknown parameters to be estimated. The parameters A and B can be obtained as the solution of the following optimization problem:
$$\min_{A,\, B}\; \left\{ -\sum_{i=1}^{n} \left[ t_i \log p_i + (1 - t_i) \log(1 - p_i) \right] \right\}, \tag{7}$$
where $p_i = 1/[1 + \exp\{A f(\boldsymbol{z}_i) + B\}]$ and $t_i$ denotes the calibration target for the ith subject. Note that Equation (7) can be solved using the gradient descent algorithm. Now, using the same data to train the DT and the sigmoid in Equation (6) can lead to unwanted bias in the sigmoid training set, which can in turn lead to poor fitted results. To resolve this issue, a k-fold cross-validation technique can be used, which allows both the DT and the sigmoid to be trained on the full training set while supplying the sigmoid with out-of-sample DT outputs. For all practical purposes, a 3-fold cross-validation can be used [28]. Moreover, let $N_+$ and $N_-$ denote the number of uncured and cured subjects in the training set, respectively. To avoid over-fitting to the training set, we use the out-of-sample model, where for each subject Platt calibration uses outcome values $t_+$ and $t_-$ instead of 1 and 0, defined as
$$t_+ = \frac{N_+ + 1}{N_+ + 2}, \qquad t_- = \frac{1}{N_- + 2}. \tag{8}$$
Note that the out-of-sample values $t_+$ and $t_-$ are non-binary, but they converge to 1 and 0, respectively, as the training size approaches infinity.
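A minimal sketch of this calibration step in R, assuming 'score' holds the cross-validated (out-of-sample) DT outputs and 'u' the corresponding 0/1 cure statuses; the general-purpose optim() stands in for the gradient-descent solver mentioned above.

```r
# Fit the sigmoid of Equation (6) by minimizing the cross-entropy of
# Equation (7), with the out-of-sample targets of Equation (8).
platt_fit <- function(score, u) {
  n_pos <- sum(u == 1)
  n_neg <- sum(u == 0)
  t_i <- ifelse(u == 1, (n_pos + 1) / (n_pos + 2), 1 / (n_neg + 2))
  nll <- function(par) {                     # par = c(A, B)
    p <- 1 / (1 + exp(par[1] * score + par[2]))
    -sum(t_i * log(p) + (1 - t_i) * log(1 - p))
  }
  optim(c(0, 0), nll)$par
}

# Calibrated posterior uncured probability for a new DT output
platt_predict <- function(par, score) 1 / (1 + exp(par[1] * score + par[2]))
```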
2.1. Development of the EM algorithm
The unobserved $u_i$'s for the set of censored lifetimes introduce missing data and motivate the development of the EM algorithm. Let the complete data be defined as $\{(t_i, \delta_i, u_i, \boldsymbol{x}_i, \boldsymbol{z}_i): i = 1, \ldots, n\}$, which includes the missing $u_i$'s. The complete data log-likelihood function can be expressed as $\ell_c = \ell_{c1} + \ell_{c2}$, where
$$\ell_{c1} = \sum_{i=1}^{n} \left[ u_i \log \pi(\boldsymbol{z}_i) + (1 - u_i) \log\{1 - \pi(\boldsymbol{z}_i)\} \right] \tag{9}$$
and
$$\ell_{c2} = \sum_{i=1}^{n} u_i \left[ \delta_i \log h_u(t_i \mid \boldsymbol{x}_i) + \log S_u(t_i \mid \boldsymbol{x}_i) \right], \tag{10}$$
where $h_u(t \mid \boldsymbol{x})$ is the hazard function corresponding to $S_u(t \mid \boldsymbol{x})$, i.e. $h_u(t \mid \boldsymbol{x}) = e^{-\boldsymbol{\beta}^\top \boldsymbol{x}}\, h_0(t\, e^{-\boldsymbol{\beta}^\top \boldsymbol{x}})$, with $h_0(\cdot)$ denoting the baseline hazard function. Note that the $u_i$'s appear linearly in both $\ell_{c1}$ and $\ell_{c2}$. Furthermore, $\ell_{c1}$ and $\ell_{c2}$ do not share common model parameters. These simplify the development of the EM algorithm. At the kth iteration of the EM algorithm, the E-step calculates the conditional expectation of $\ell_c$ with respect to the distribution of the $u_i$'s, given the observed data and the parameter values from the $(k-1)$th iteration, denoted by $\pi^{(k-1)}(\cdot)$, $\boldsymbol{\beta}^{(k-1)}$ and $S_0^{(k-1)}(\cdot)$. This reduces to the computation of the conditional expectation of $u_i$, which is given by
$$w_i^{(k)} = \delta_i + (1 - \delta_i)\, \frac{\pi^{(k-1)}(\boldsymbol{z}_i)\, S_u^{(k-1)}(t_i \mid \boldsymbol{x}_i)}{1 - \pi^{(k-1)}(\boldsymbol{z}_i) + \pi^{(k-1)}(\boldsymbol{z}_i)\, S_u^{(k-1)}(t_i \mid \boldsymbol{x}_i)}. \tag{11}$$
The E-step thus replaces the $u_i$'s in Equations (9) and (10) with the $w_i^{(k)}$'s, and the resulting conditional expectation can be expressed as $Q = Q_1 + Q_2$, where
$$Q_1 = \sum_{i=1}^{n} \left[ w_i^{(k)} \log \pi(\boldsymbol{z}_i) + (1 - w_i^{(k)}) \log\{1 - \pi(\boldsymbol{z}_i)\} \right] \tag{12}$$
and
$$Q_2 = \sum_{i=1}^{n} w_i^{(k)} \left[ \delta_i \log h_u(t_i \mid \boldsymbol{x}_i) + \log S_u(t_i \mid \boldsymbol{x}_i) \right]. \tag{13}$$
The M-step of the EM algorithm maximizes Equations (12) and (13) independently with respect to the incidence and latency parameters, respectively, to obtain improved estimates $\pi^{(k)}(\cdot)$, $\boldsymbol{\beta}^{(k)}$ and $S_0^{(k)}(\cdot)$. However, depending on the specific needs of modeling, alternate functions to Equations (12) and (13) may be considered. For example, Zhang and Peng [37] suggested the use of the following function instead of Equation (13) to obtain $\boldsymbol{\beta}^{(k)}$:
$$\tilde{Q}_2(\boldsymbol{\beta}) = \prod_{i:\, \delta_i = 1} \frac{w_i^{(k)}}{\sum_{j:\, \varepsilon_j(\boldsymbol{\beta}) \ge \varepsilon_i(\boldsymbol{\beta})} w_j^{(k)}}, \tag{14}$$
where $\varepsilon_i(\boldsymbol{\beta}) = \log t_i - \boldsymbol{\beta}^\top \boldsymbol{x}_i$. Details of this part of the M-step can be found in [38]. Using the estimate of $\boldsymbol{\beta}$, the baseline survival function $S_0(\cdot)$ can be estimated non-parametrically using a Breslow-type estimator [7]. Similarly, instead of maximizing Equation (12) using a parametric link function for $\pi(\boldsymbol{z})$, we consider the more flexible DT approach, as explained in Section 2, to obtain $\pi^{(k)}(\cdot)$. Now, to implement the DT approach, we need the values of the $u_i$'s for all subjects. However, since the $u_i$'s are unknown for the set of censored observations, we propose a multiple imputation-based approach to impute the missing $u_i$'s using the current $w_i^{(k)}$, which is based on the suggestion of [13]. That is, we simulate multiple values of each missing $u_i$ from a Bernoulli distribution with success probability $w_i^{(k)}$. The kth update of $\pi(\cdot)$, i.e. $\pi^{(k)}(\cdot)$, is obtained as the average of the estimates of $\pi(\cdot)$ from the multiply imputed $u_i$'s. Once the E- and M-steps are complete, they are iterated until some specified convergence criterion is achieved, such as $\|\boldsymbol{\theta}^{(k)} - \boldsymbol{\theta}^{(k-1)}\|_2 < \epsilon$, where $\boldsymbol{\theta}$ collects the model parameters, $\epsilon$ is a chosen tolerance and $\|\cdot\|_2$ is the $L_2$-norm. Along the lines of [13], the bootstrap method is employed to calculate the standard errors of the estimated parameters. The steps of the proposed EM algorithm are summarized below, followed by a skeleton implementation.
Step 1: Use the censoring indicator to initialize the value of $w_i^{(0)}$. That is, $w_i^{(0)} = 0$ if the ith patient is censored, and $w_i^{(0)} = 1$ otherwise.
Step 2: Impute the values of the missing $u_i$'s based on the current $w_i$'s, and then apply the DT together with the Platt scaling method to estimate $\pi(\boldsymbol{z})$. The final estimate of $\pi(\boldsymbol{z})$ is calculated as the average over the multiple imputations.
Step 3: Use Equation (14) to obtain an estimate of $\boldsymbol{\beta}$ [38], and then obtain $S_0(\cdot)$ non-parametrically using the estimate of $\boldsymbol{\beta}$ [7].
Step 4: Use the current estimates of $\pi(\boldsymbol{z})$, $\boldsymbol{\beta}$ and $S_0(\cdot)$ to update the $w_i$'s using Equation (11).
Step 5: Repeat steps (2)–(4) above until convergence is achieved.
Step 6: Use the bootstrap method to calculate the standard errors of the estimated parameters based on B bootstrap samples.
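Putting the steps together, the following R skeleton shows the overall control flow of the algorithm. Here fit_dt_platt(), update_beta_zp() and breslow_Su() are hypothetical placeholders for the DT-plus-Platt step, the Zhang–Peng update of Equation (14), and the Breslow-type estimator, respectively; they are not functions from an existing package.

```r
# Skeleton of the proposed EM algorithm; the three helper functions named in
# the lead-in are placeholders and must be supplied by the user.
em_mcm_aft_dt <- function(t_obs, delta, X, Z, M = 5, tol = 1e-3, max_iter = 100) {
  n    <- length(t_obs)
  w    <- delta                  # Step 1: w_i = 1 if uncensored, 0 if censored
  beta <- rep(0, ncol(X))
  for (k in seq_len(max_iter)) {
    # Step 2: impute u_i ~ Bernoulli(w_i) for censored subjects, M times,
    # and average the calibrated DT estimates of the uncured probability
    pi_hat <- rowMeans(sapply(seq_len(M), function(m) {
      u_imp <- ifelse(delta == 1, 1, rbinom(n, 1, w))
      fit_dt_platt(Z, u_imp)     # returns calibrated pi(z_i) for all i
    }))
    # Step 3: update beta via Equation (14), then the Breslow-type S_u
    beta_new <- update_beta_zp(t_obs, delta, X, w)
    Su_hat   <- breslow_Su(t_obs, delta, X, beta_new, w)
    # Step 4: E-step of Equation (11)
    w <- delta + (1 - delta) * pi_hat * Su_hat / (1 - pi_hat + pi_hat * Su_hat)
    # Step 5: stop when the change in beta is small (L2 norm)
    converged <- sqrt(sum((beta_new - beta)^2)) < tol
    beta <- beta_new
    if (converged) break
  }
  list(pi_uncured = pi_hat, beta = beta, w = w)
}
```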
3. Simulation study
To assess the performance of the proposed MCM-AFT-DT model and the EM algorithm, we carry out a Monte Carlo simulation study. We consider two different sample sizes: n = 300 and 600. Two-thirds of the data are used as the training set and the remaining one-third as the test set. The cured status U in the incidence part is generated from a Bernoulli distribution with probability $\pi(\boldsymbol{z})$, where $\pi(\boldsymbol{z})$ is generated according to the following three scenarios:
In the above scenarios, $z_1$ and $z_2$ are generated from the standard normal distribution. Note that the first scenario represents the standard logistic regression model, whereas scenarios 2 and 3 represent complex relationships between U and $\boldsymbol{z}$ that involve non-logistic or nonlinear effects. The resulting cure rates are roughly 0.50, 0.40 and 0.30 for scenarios 1, 2 and 3, respectively. Figure 1 presents the classification boundaries for the three considered scenarios. We use the same set of covariates in the latency part of the model, i.e. we consider $\boldsymbol{x} = \boldsymbol{z}$. For a susceptible subject, the failure time is generated from an AFT model with fixed true values of the regression coefficients $\beta_0$, $\beta_1$ and $\beta_2$. The censoring time is generated independently from a Uniform(0, 20) distribution, which results in censoring rates of 0.65, 0.50 and 0.45 for scenarios 1, 2 and 3, respectively. The number of imputations and the number of bootstrap samples are taken as 5 and 100, respectively, along the lines of the recent work of [13]. All simulation results are averaged over 500 Monte Carlo runs; a sketch of the data-generating mechanism is given below. For all results, we compare the proposed MCM-AFT-DT model with the MCM-AFT-Logit and MCM-AFT-Spline models. To fit the MCM-AFT-Spline model, we use a nonparametric additive model with a thin plate spline. For this purpose, we use the ‘gam’ function available in the R package ‘mgcv’, which allows the effective degrees of freedom of each covariate function to be selected automatically.
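As an illustration of the data-generating mechanism, the sketch below uses a scenario-1-type (logistic) incidence; the incidence coefficients, the latency coefficients b0, b1, b2 and the N(0, 1) AFT errors are our assumptions for this sketch, not the values used in the paper.

```r
# Illustrative data generation for the simulation study (assumed coefficients)
set.seed(2024)
n  <- 600
z1 <- rnorm(n); z2 <- rnorm(n)

pi_z <- plogis(0.3 + 1.2 * z1 - 0.8 * z2)        # hypothetical logit incidence
u    <- rbinom(n, 1, pi_z)                       # 1 = susceptible, 0 = cured

b0 <- 1; b1 <- -0.75; b2 <- -1                   # hypothetical latency coefficients
T_fail <- exp(b0 + b1 * z1 + b2 * z2 + rnorm(n)) # AFT failure time (susceptible)
C      <- runif(n, 0, 20)                        # censoring time, as in the text
t_obs  <- ifelse(u == 1, pmin(T_fail, C), C)     # cured subjects are always censored
delta  <- as.numeric(u == 1 & T_fail <= C)       # 1 = event observed
```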
Figure 1.
Simulated cured and non-cured statuses under the three scenarios.
Table 1 presents the biases and mean square errors (MSEs) of the estimated uncured probability. We note that when the true classification boundary is nonlinear (scenarios 2 and 3), MCM-AFT-DT results in much smaller biases and MSEs when compared to both MCM-AFT-Logit and MCM-AFT-Spline. This implies that the DT can capture non-linearity in the incidence better than the logit and spline, which consequently results in much improved area under the curve (AUC) values corresponding to the receiver operating characteristic (ROC) curves; see Table 2 for the AUC values and Figure 2 for the ROC curves. Furthermore, the similarity in the AUC values from the training and test sets indicates that there is no issue with over-fitting. Note that the true label for calculating the AUC is the true cure index for each subject when the data are generated. Under scenario 1 (linear classification boundary) in Table 1, it is not surprising to see MCM-AFT-Logit performing better than both MCM-AFT-DT and MCM-AFT-Spline, since logit-based models are well known to capture linearity in the data. Table 3 presents the biases and MSEs of the estimated overall survival probability, which is a function comprising parameters from both incidence and latency. Under scenarios 2 and 3, we again note that MCM-AFT-DT performs better in terms of bias and MSE when compared to both MCM-AFT-Logit and MCM-AFT-Spline, whereas under scenario 1 the performance of MCM-AFT-Logit is the best. Table 4 presents the biases and MSEs of the estimated susceptible survival probability, which is only a function of the latency parameters. Once again, we notice that under nonlinear classification boundaries, the proposed MCM-AFT-DT model estimates the survival distribution of uncured subjects with the smallest bias and MSE. From the results in Table 4, it is interesting to note that even though the DT algorithm is only applied to the incidence to improve the estimation of the uncured probability (specifically for nonlinear classification boundaries), it also improves the estimation results corresponding to the latency. This can be viewed as an added advantage of our proposed model. In Table 5, we present the computation time (in seconds) to obtain the estimation results for one dataset, which includes the multiple imputation (of size 5) for the incidence and the bootstrap method (of size 100) to obtain the standard errors. Our reported computation times are consistent with those reported by Li et al. [13]. Furthermore, it is clear that even though the proposed method is computationally intensive, it is still possible to obtain the estimation results in a reasonable amount of time. Finally, in Table 6, we present the estimates and standard errors corresponding to the AFT regression parameters for n = 600. For n = 300, the results are similar and hence are not reported for the sake of brevity. Under nonlinear classification boundaries, the performances of MCM-AFT-DT and MCM-AFT-Spline are comparable, with MCM-AFT-DT performing considerably better in terms of bias under scenario 2. Overall, under all scenarios, the estimates of the regression parameters obtained from the MCM-AFT-DT model are reasonably close to the true parameter values. Note the poor performance (in terms of bias) of MCM-AFT-Logit under scenario 3.
Table 1.
Comparison of bias and MSE of the uncured probability for different models.
| Scenario | n | Bias (DT) | Bias (Spline) | Bias (Logit) | MSE (DT) | MSE (Spline) | MSE (Logit) |
|---|---|---|---|---|---|---|---|
| 1 | 300 | 0.144 | 0.046 | 0.031 | 0.048 | 0.010 | 0.003 |
| 1 | 600 | 0.125 | 0.030 | 0.021 | 0.036 | 0.005 | 0.002 |
| 2 | 300 | 0.143 | 0.305 | 0.345 | 0.040 | 0.183 | 0.221 |
| 2 | 600 | 0.132 | 0.307 | 0.348 | 0.032 | 0.183 | 0.225 |
| 3 | 300 | 0.140 | 0.277 | 0.247 | 0.053 | 0.129 | 0.151 |
| 3 | 600 | 0.122 | 0.282 | 0.237 | 0.040 | 0.130 | 0.153 |
Table 2.
Comparison of AUC values for different models and scenarios.
| Scenario | n | Training AUC (DT) | Training AUC (Spline) | Training AUC (Logit) | Testing AUC (DT) | Testing AUC (Spline) | Testing AUC (Logit) |
|---|---|---|---|---|---|---|---|
| 1 | 300 | 0.938 | 0.970 | 0.973 | 0.891 | 0.964 | 0.971 |
| 1 | 600 | 0.947 | 0.972 | 0.973 | 0.914 | 0.969 | 0.972 |
| 2 | 300 | 0.931 | 0.567 | 0.536 | 0.866 | 0.559 | 0.553 |
| 2 | 600 | 0.926 | 0.554 | 0.523 | 0.885 | 0.549 | 0.536 |
| 3 | 300 | 0.923 | 0.746 | 0.543 | 0.853 | 0.704 | 0.562 |
| 3 | 600 | 0.927 | 0.745 | 0.530 | 0.880 | 0.721 | 0.545 |
Figure 2.
ROC curves for different models and scenarios.
Table 3.
Bias and MSE of the overall survival probability for different models.
| Scenario | n | Bias (DT) | Bias (Spline) | Bias (Logit) | MSE (DT) | MSE (Spline) | MSE (Logit) |
|---|---|---|---|---|---|---|---|
| 1 | 300 | 0.096 | 0.045 | 0.039 | 0.023 | 0.006 | 0.004 |
| 1 | 600 | 0.080 | 0.031 | 0.027 | 0.017 | 0.003 | 0.002 |
| 2 | 300 | 0.099 | 0.157 | 0.163 | 0.021 | 0.045 | 0.046 |
| 2 | 600 | 0.088 | 0.156 | 0.162 | 0.017 | 0.046 | 0.045 |
| 3 | 300 | 0.094 | 0.167 | 0.148 | 0.025 | 0.057 | 0.049 |
| 3 | 600 | 0.077 | 0.169 | 0.139 | 0.018 | 0.058 | 0.044 |
Table 4.
Bias and MSE of the susceptible survival probability for different models.
| Scenario | n | Bias (DT) | Bias (Spline) | Bias (Logit) | MSE (DT) | MSE (Spline) | MSE (Logit) |
|---|---|---|---|---|---|---|---|
| 1 | 300 | 0.046 | 0.038 | 0.038 | 0.007 | 0.004 | 0.004 |
| 1 | 600 | 0.032 | 0.026 | 0.026 | 0.003 | 0.002 | 0.002 |
| 2 | 300 | 0.046 | 0.113 | 0.140 | 0.005 | 0.032 | 0.042 |
| 2 | 600 | 0.033 | 0.107 | 0.140 | 0.003 | 0.029 | 0.042 |
| 3 | 300 | 0.045 | 0.054 | 0.124 | 0.005 | 0.008 | 0.036 |
| 3 | 600 | 0.034 | 0.040 | 0.129 | 0.003 | 0.004 | 0.037 |
Table 5.
Computation times for different models under varying scenarios and sample sizes.
| Scenario | Model | Time (s), n = 300 | Time (s), n = 600 |
|---|---|---|---|
| 1 | DT | 1270.95 | 1578.65 |
| 1 | Spline | 1053.88 | 1460.03 |
| 1 | Logit | 286.62 | 465.93 |
| 2 | DT | 1191.13 | 1685.12 |
| 2 | Spline | 861.73 | 1079.74 |
| 2 | Logit | 391.54 | 542.91 |
| 3 | DT | 1588.79 | 2246.58 |
| 3 | Spline | 1411.51 | 1794.57 |
| 3 | Logit | 532.80 | 768.79 |
Table 6.
Estimation results corresponding to the AFT regression parameters for n = 600.
| Scenario | Parameter | Estimate (DT) | Estimate (Spline) | Estimate (Logit) | SE (DT) | SE (Spline) | SE (Logit) |
|---|---|---|---|---|---|---|---|
| 1 | $\beta_0$ | 0.581 | 1.014 | 0.856 | 0.497 | 0.557 | 0.623 |
| 1 | $\beta_1$ | −0.626 | −0.570 | −0.670 | 0.275 | 0.478 | 0.126 |
| 1 | $\beta_2$ | −1.114 | −0.931 | −1.170 | 0.167 | 0.304 | 0.101 |
| 2 | $\beta_0$ | 1.000 | 1.361 | 1.351 | 0.114 | 0.114 | 0.212 |
| 2 | $\beta_1$ | −0.731 | −1.286 | −1.302 | 0.091 | 0.090 | 0.102 |
| 2 | $\beta_2$ | −1.001 | −1.689 | −1.678 | 0.126 | 0.130 | 0.114 |
| 3 | $\beta_0$ | 0.812 | 0.633 | 1.206 | 0.104 | 0.111 | 0.170 |
| 3 | $\beta_1$ | −0.519 | −0.417 | −0.008 | 0.130 | 0.129 | 0.093 |
| 3 | $\beta_2$ | −0.689 | −0.742 | −1.406 | 0.128 | 0.154 | 0.102 |
3.1. Comparison of DT and SVM with AFT structure for latency
In this section, we compare the proposed MCM-AFT-DT with the MCM-AFT-SVM [13]. This comparison is done through a simulation study using a sample size of n = 600. The results are presented in Tables 7 and 8. It is clear that for all three considered scenarios (both linear and nonlinear classification boundaries), the biases and MSEs of the estimated uncured probabilities, as well as of the estimated overall and susceptible survival probabilities, are smaller for the MCM-AFT-DT model. Furthermore, from Table 8, the DT model results in higher predictive accuracy for cure when compared to the SVM model. These results once again demonstrate the superiority of the MCM-AFT-DT model.
Table 7.
Comparison of DT and SVM models through the biases and MSEs of different quantities of interest for n = 600.
| Scenario | Model | Bias (uncured prob.) | MSE (uncured prob.) | Bias (overall surv.) | MSE (overall surv.) | Bias (susceptible surv.) | MSE (susceptible surv.) |
|---|---|---|---|---|---|---|---|
| 1 | DT | 0.1256 | 0.0373 | 0.0789 | 0.0178 | 0.0281 | 0.0022 |
| 1 | SVM | 0.2006 | 0.0864 | 0.1069 | 0.0255 | 0.1496 | 0.0664 |
| 2 | DT | 0.1276 | 0.0286 | 0.0858 | 0.0161 | 0.0284 | 0.0019 |
| 2 | SVM | 0.2477 | 0.1704 | 0.1887 | 0.1048 | 0.0420 | 0.0038 |
| 3 | DT | 0.1118 | 0.0338 | 0.0717 | 0.0152 | 0.0373 | 0.0034 |
| 3 | SVM | 0.2924 | 0.2284 | 0.2780 | 0.1403 | 0.0490 | 0.0065 |
Table 8.
Comparison of DT and SVM models through the AUC values for n = 600.
| Scenario | Training AUC (DT) | Training AUC (SVM) | Testing AUC (DT) | Testing AUC (SVM) |
|---|---|---|---|---|
| 1 | 0.949 | 0.837 | 0.895 | 0.818 |
| 2 | 0.933 | 0.925 | 0.891 | 0.891 |
| 3 | 0.937 | 0.893 | 0.885 | 0.870 |
4. Application: analysis of leukemia data
In this section, we apply the proposed MCM-AFT-DT model and the EM algorithm to analyze bone marrow transplantation data on leukemia patients. For the purpose of comparison, we also fit the MCM-AFT-Spline and MCM-AFT-Logit models. In these data, there are 137 leukemia patients who were followed for 2640 days, and the event of interest is relapse or death due to leukemia following bone marrow transplantation. At the end of the study, 54 patients survived disease free and hence their lifetimes were right censored. Interested readers can download the complete data from the R package ‘KMsurv’. In the data, we have information on Methotrexate (MTX) used as a prophylactic agent to prevent graft-versus-host disease (1 = Yes, 0 = No), which can be considered as a covariate that may affect patient survival. A plot of the Kaplan–Meier estimates of the survival probabilities, stratified by the MTX status, shows long plateaus that level off to nonzero survival probabilities (see Figure 3). This clearly indicates the presence of a cured subgroup and hence the MCM is appropriate. On the other hand, the plots of the logarithm of the estimated cumulative hazard functions for the two groups (MTX = 1 and MTX = 0), based on the nonparametric Kaplan–Meier survival estimator, clearly suggest that the two curves are not parallel. Hence, the PH assumption for the latency is not appropriate for this dataset, which motivates the proposed semi-parametric AFT structure; see Figure 4. In our application, we also consider patient's age and donor's age, both measured in years, as other possible covariates. All three considered covariates (MTX, patient's age and donor's age) are used in both the incidence and latency parts of the model. To circumvent the issue of over-fitting, and given the moderate sample size, we apply a 10-fold cross-validation technique that allows us to train both the DT and sigmoid models on the full data. This is along the lines of the work of Hastie et al. [11]. As in the simulation study, the number of multiple imputations is chosen as 5 and the number of bootstrap samples is chosen as 100.
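The exploratory plots in Figures 3 and 4 can be reproduced along the following lines, assuming the usual coding of the ‘bmt’ data in ‘KMsurv’ (t2 = disease-free survival time in days, d3 = relapse/death indicator, z10 = MTX):

```r
library(KMsurv)
library(survival)
data(bmt)   # 137 leukemia patients following bone marrow transplantation

# Kaplan-Meier curves stratified by MTX (cf. Figure 3): long plateaus that
# level off above zero suggest a cured subgroup
km <- survfit(Surv(t2, d3) ~ z10, data = bmt)
plot(km, col = c(1, 2), xlab = "Days", ylab = "Survival probability")

# PH check (cf. Figure 4): under PH the log cumulative hazards are parallel
plot(km, fun = "cloglog", col = c(1, 2),
     xlab = "Days (log scale)", ylab = "log cumulative hazard")
```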
Figure 3.
Kaplan–Meier survival curves stratified by MTX for the leukemia data.
Figure 4.
Logarithm of the cumulative hazard function plots.
In Figure 5, we present the 3-dimensional plots of the estimated uncured probabilities with respect to the covariates, along with their 95% confidence bounds. It is interesting to note that, for all candidate models, the uncured probabilities attain a local minimum when donor's age is between 20 and 40 years and patient's age is between 10 and 30 years, irrespective of the MTX status. Also, while the uncured probabilities obtained from the MCM-AFT-Logit model show a monotonic behavior with respect to patient's and donor's ages, the uncured probabilities from the MCM-AFT-DT and MCM-AFT-Spline models tend to rise slowly as patient's age and donor's age increase, irrespective of the MTX status. Lower uncured probabilities are also observed among younger patients who had younger donors and used MTX during the treatment process. Thus, it is clear that unlike the MCM-AFT-Logit model, both the MCM-AFT-DT and MCM-AFT-Spline models can capture nonlinear age effects on the uncured probability. Now, it is important to understand whether capturing nonlinearity in the data can improve the predictive accuracy of cure, and which model (specifically when comparing DT and spline) results in the highest predictive accuracy of cure. For this purpose, we calculate the ROC curves and the associated AUC values. However, to calculate the ROC curves for the real data, note that the true cured statuses are unknown for all censored lifetimes. As such, we impute these missing cured statuses. For this imputation, we first estimate the conditional probability of being uncured, given in Equation (11), for each censored observation. Then, using this estimated conditional probability, we generate a Bernoulli random variable whose value represents the uncured (if the value is 1) or cured (if the value is 0) status. This process is repeated 500 times, and the averaged ROC curves together with the AUC values are presented in Figure 6; a sketch of this imputation-based calculation is given below. From the figure, it is clear that the proposed MCM-AFT-DT model results in the highest predictive accuracy of cure, and hence is our preferred model for these leukemia data. Note that the proposed simulation-based approach to impute the missing cured statuses in real data for the calculation of the ROC and AUC is preferred since it does not require any naive assumptions, such as the existence of a known threshold time beyond which all censored observations are considered cured [1].
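A minimal sketch of the imputation-based AUC calculation, assuming 'delta' is the censoring indicator and 'w' and 'pi_hat' are the fitted conditional uncured probabilities (Equation (11)) and estimated uncured probabilities from the model (hypothetical object names):

```r
library(pROC)
# Impute the unknown cured statuses of censored subjects from Bernoulli(w_i),
# compute the AUC of pi_hat against the imputed labels, and average over
# 500 repetitions, as described above.
auc_reps <- replicate(500, {
  u_imp <- ifelse(delta == 1, 1, rbinom(length(delta), 1, w))
  as.numeric(auc(roc(u_imp, pi_hat, quiet = TRUE)))
})
mean(auc_reps)   # averaged AUC over the 500 imputations
```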
Figure 5.
Plots of estimated uncured probabilities as a function of donor's age and patient's age, stratified by MTX, for the leukemia data. The last two rows present the plots along with 95% confidence bounds.
Figure 6.
ROC curves for different models corresponding to the leukemia data.
In Figure 7, we present the final decision tree plot, which shows how different variables play a role in classifying a patient as cured or uncured. The terminal nodes represent the predicted classes, where 0 denotes the cured group and 1 denotes the uncured group. The decision tree algorithm first selects the variable that minimizes the impurity (error rate) level of the tree; this variable is usually considered the most significant variable in building the tree. As can be seen from the tree plot, donor's age is selected for the first split, with the split made at age less than 40 years. The tree predicts patients who received bone marrow from older donors (i.e. donors over 40 years old) to have a higher probability (0.88) of being uncured. On the other hand, younger patients (i.e. patients less than 28 years old) who received bone marrow from donors between 21 and 40 years old are more likely to be cured. This is consistent with previous findings and may be due to several factors, such as decreased immune functioning and a higher risk of genetic abnormalities associated with the age and health status of older donors. Note that the final tree only retained donor's age and patient's age as variables that are important in predicting the cured status of leukemia patients. The observations from the decision tree plot are consistent with the overall variable importance plot (see Figure 8); both plots can be reproduced as sketched below.
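Plots of this kind can be generated directly from a fitted ‘rpart’ object; ‘fit_pruned’ below refers to the hypothetical pruned tree from the sketch in Section 2.

```r
library(rpart.plot)
rpart.plot(fit_pruned)   # decision tree plot (cf. Figure 7)

# Overall variable importance for the incidence (cf. Figure 8)
vi <- fit_pruned$variable.importance
barplot(sort(vi, decreasing = TRUE), ylab = "Importance")
```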
Figure 7.

Decision tree plot corresponding to the leukemia data. Note that 0 represents the cured group and 1 represents the uncured group.
Figure 8.

Overall variable importance plot for the incidence corresponding to the leukemia data.
Next, we turn our attention to the inference related to the latency part of the model. Table 9 presents the estimation results corresponding to the latency regression parameters, where $\beta_1$, $\beta_2$ and $\beta_3$ represent the effects of patient's age, donor's age and MTX, respectively. For all three considered models, the effects of the covariates on the latency are essentially the same. At the 5% level of significance, there is a significant difference in the survival distribution between patients who used MTX as a prophylactic agent and those who did not, given that they are not cured. Furthermore, the sign of the estimated coefficient corresponding to MTX agrees with the Kaplan–Meier plot presented in Figure 3.
Table 9.
Estimation of the latency regression parameters corresponding to the leukemia data.
| Model | Est. ($\beta_0$) | Est. ($\beta_1$) | Est. ($\beta_2$) | Est. ($\beta_3$) | SE ($\beta_0$) | SE ($\beta_1$) | SE ($\beta_2$) | SE ($\beta_3$) | p ($\beta_0$) | p ($\beta_1$) | p ($\beta_2$) | p ($\beta_3$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DT | 0.576 | 0.029 | −0.034 | −0.766 | 0.321 | 0.026 | 0.023 | 0.256 | 0.073 | 0.184 | 0.102 | 0.003 |
| Spline | 0.575 | 0.029 | −0.036 | −0.767 | 0.304 | 0.026 | 0.025 | 0.257 | 0.058 | 0.253 | 0.109 | 0.003 |
| Logit | 0.619 | 0.030 | −0.033 | −0.762 | 0.344 | 0.029 | 0.024 | 0.302 | 0.070 | 0.266 | 0.132 | 0.012 |
5. Conclusion
In this article, we propose a novel MCM where the incidence is modeled using a DT algorithm and the latency is modeled using an AFT structure. For the estimation of the model parameters, we develop an EM algorithm in which we make use of the Platt scaling method to convert the DT outputs to posterior probabilities of being uncured. The proposed DT-based model can capture complex patterns in the incidence better than the existing logit-based and spline-based mixture cure models. Consequently, the proposed model results in improved predictive accuracy of cure. Interestingly, we notice that capturing complex patterns in the incidence also results in improved estimation results (in terms of bias and MSE) for the overall and susceptible survival functions. As future work, it is of interest to model both the incidence and latency using suitable machine learning techniques to see how the results compare with those presented in this article. In addition, it is also of great interest to extend the current setup to accommodate high-dimensional covariates and address variable selection. Another possible future research direction is to integrate machine learning with cure models that are based on competing risks and address the elimination of risks after a certain passage of time [17–19,23,30–33]. We are currently working on some of these problems and hope to report the findings in future articles.
Acknowledgments
The authors would like to thank two anonymous reviewers for their useful comments and suggestions, which led to this improved version of the manuscript.
Funding Statement
The research reported in this publication was supported by the National Institute Of General Medical Sciences of the National Institutes of Health under Award Number R15GM150091. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Suvra Pal's research was also supported by the Research Innovation Grant awarded by the College of Science at University of Texas at Arlington, USA.
Data availability statement
The leukemia data analyzed in this article is freely available in the R package ‘KMsurv’. The computational codes are available in the supplemental material.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Supplemental Data
Supplemental data for this article can be accessed online at https://doi.org/10.1080/02664763.2024.2418476.
References
- 1. Asano J., Hirakawa A., and Hamada C., Assessing the prediction accuracy of cure in the Cox proportional hazards cure model: An application to breast cancer data, Pharm. Stat. 13 (2014), pp. 357–363.
- 2. Balakrishnan N. and Pal S., EM algorithm-based likelihood estimation for some cure rate models, J. Stat. Theory Pract. 6 (2012), pp. 698–724.
- 3. Balakrishnan N. and Pal S., Lognormal lifetimes and likelihood-based inference for flexible cure rate models based on COM-Poisson family, Comput. Stat. Data Anal. 67 (2013), pp. 41–67.
- 4. Balakrishnan N. and Pal S., An EM algorithm for the estimation of parameters of a flexible cure rate model with generalized gamma lifetime and model discrimination using likelihood- and information-based methods, Comput. Stat. 30 (2015), pp. 151–189.
- 5. Balakrishnan N. and Pal S., Likelihood inference for flexible cure rate models with gamma lifetimes, Commun. Stat. – Theory Methods 44 (2015), pp. 4007–4048.
- 6. Balakrishnan N. and Pal S., Expectation maximization-based likelihood inference for flexible cure rate models with Weibull lifetimes, Stat. Methods Med. Res. 25 (2016), pp. 1535–1563.
- 7. Breslow N.E., Covariate analysis of censored survival data, Biometrics 30 (1974), pp. 89–99.
- 8. Chen T. and Du P., Mixture cure rate models with accelerated failures and nonparametric form of covariate effects, J. Nonparametr. Stat. 30 (2018), pp. 216–237.
- 9. Cheng-Min C., Ya-Wen Y., Bor-Wen C., and Yao-Lung K., Construction the model on the breast cancer survival analysis use support vector machine, logistic regression and decision tree, J. Med. Syst. 38 (2014), p. 106.
- 10. Davies K., Pal S., and Siddiqua J.A., Stochastic EM algorithm for generalized exponential cure rate model and an empirical study, J. Appl. Stat. 48 (2021), pp. 2112–2135.
- 11. Hastie T., Tibshirani R., and Friedman J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
- 12. Hoang L.Q., Pal S., Liu Z., Senkowsky J., and Tang L., A time-dependent survival analysis for early prognosis of chronic wounds by monitoring wound alkalinity, Int. Wound J. 20 (2022), pp. 1459–1475. doi:10.1111/iwj.14001.
- 13. Li P., Peng Y., Jiang P., and Dong Q., A support vector machine based semiparametric mixture cure model, Comput. Stat. 35 (2020), pp. 931–945.
- 14. Overman T. and Pal S., Statistical tools and techniques in modeling survival data, in Mathematics Research for the Beginning Student Volume 2: Accessible Projects for Students After Calculus, E. Goldwyn, A. Wootton, and S. Ganzell, eds., Chap. 3, Birkhauser-Springer, 2022, pp. 75–99.
- 15. Pal S., A simplified stochastic EM algorithm for cure rate model with negative binomial competing risks: An application to breast cancer data, Stat. Med. 40 (2021), pp. 6387–6409.
- 16. Pal S. and Aselisewine W., A semiparametric promotion time cure model with support vector machine, Ann. Appl. Stat. 17 (2023), pp. 2680–2699.
- 17. Pal S. and Balakrishnan N., Destructive negative binomial cure rate model and EM-based likelihood inference under Weibull lifetime, Stat. Probab. Lett. 116 (2016), pp. 9–20.
- 18. Pal S. and Balakrishnan N., An EM type estimation procedure for the destructive exponentially weighted Poisson regression cure model under generalized gamma lifetime, J. Stat. Comput. Simul. 87 (2017), pp. 1107–1129.
- 19. Pal S. and Balakrishnan N., Likelihood inference for the destructive exponentially weighted Poisson cure rate model with Weibull lifetime and an application to melanoma data, Comput. Stat. 32 (2017), pp. 429–449.
- 20. Pal S. and Roy S., On the estimation of destructive cure rate model: A new study with exponentially weighted Poisson competing risks, Stat. Neerl. 75 (2021), pp. 324–342.
- 21. Pal S. and Roy S., A new non-linear conjugate gradient algorithm for destructive cure rate model and a simulation study: Illustration with negative binomial competing risks, Commun. Stat. Simul. Comput. 51 (2022), pp. 6866–6880.
- 22. Pal S. and Roy S., On the parameter estimation of Box-Cox transformation cure model, Stat. Med. 42 (2023), pp. 2600–2618.
- 23. Pal S., Majakwara J., and Balakrishnan N., An EM algorithm for the destructive COM-Poisson regression cure rate model, Metrika 81 (2018), pp. 143–171.
- 24. Pal S., Barui S., Davies K., and Mishra N., A stochastic version of the EM algorithm for mixture cure model with exponentiated Weibull family of lifetimes, J. Stat. Theory Pract. 16 (2022), p. 48.
- 25. Pal S., Peng Y., and Aselisewine W., A new approach to modeling the cure rate in the presence of interval censored data, Comput. Stat. 39 (2023), pp. 2743–2769. doi:10.1007/s00180-023-01389-7.
- 26. Pal S., Peng Y., Aselisewine W., and Barui S., A support vector machine-based cure rate model for interval censored data, Stat. Methods Med. Res. 32 (2023), pp. 2405–2422.
- 27. Peng Y. and Yu B., Cure Models: Methods, Applications and Implementation, Chapman and Hall/CRC, Boca Raton, FL, 2021.
- 28. Platt J., Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 1999, pp. 61–74.
- 29. Ramires T.G., Hens N., Cordeiro G.M., and Ortega E.M., Estimating nonlinear effects in the presence of cure fraction using a semi-parametric regression model, Comput. Stat. 33 (2018), pp. 709–730.
- 30. Rodrigues J., de Castro M., Balakrishnan N., and Cancho V.G., Destructive weighted Poisson cure rate models, Lifetime Data Anal. 17 (2011), pp. 333–346.
- 31. Treszoks J. and Pal S., On the estimation of interval censored destructive negative binomial cure model, Stat. Med. 42 (2023), pp. 5113–5134.
- 32. Treszoks J. and Pal S., A destructive shifted Poisson cure model for interval censored data and an efficient estimation algorithm, Commun. Stat. Simul. Comput. 53 (2024), pp. 2135–2149.
- 33. Treszoks J. and Pal S., Likelihood inference for unified transformation cure model with interval censored data, Comput. Stat. (2024). doi:10.1007/s00180-024-01480-7.
- 34. Wang P. and Pal S., A two-way flexible generalized gamma transformation cure rate model, Stat. Med. 41 (2022), pp. 2427–2447.
- 35. Wang L., Du P., and Liang H., Two-component mixture cure rate model with spline estimated nonparametric components, Biometrics 68 (2012), pp. 726–735.
- 36. Xie Y. and Yu Z., Mixture cure rate models with neural network estimated nonparametric components, Comput. Stat. 36 (2021), pp. 2467–2489.
- 37. Zhang J. and Peng Y., A new estimation method for the semiparametric accelerated failure time mixture cure model, Stat. Med. 26 (2007), pp. 3157–3171.
- 38. Zhang J. and Peng Y., An alternative estimation method for the accelerated failure time frailty model, Comput. Stat. Data Anal. 51 (2007), pp. 4413–4423.