Abstract
Survival analysis has been a topic of active statistical research in the past few decades, with applications spread across several areas. Traditional applications usually consider data with only a small number of predictors and a few hundred or thousand observations. Recent advances in data acquisition techniques and computational power have led to considerable interest in analyzing very high-dimensional data, where the number of predictor variables and the number of observations range between 10^4 and 10^6. In this paper, we present a tool for performing large-scale regularized parametric survival analysis using a variant of the cyclic coordinate descent method. Through our experiments on two real data sets, we show that applying regularized models to high-dimensional data avoids overfitting and can provide improved predictive performance and calibration over corresponding low-dimensional models.
Keywords: Survival analysis, parametric models, regularization, penalized regression, pediatric trauma
1. Introduction
Regression analysis of time-to-event data occupies a central role in statistical practice [1, 2], with applications spread across several fields including biostatistics, sociology, economics, demography, and engineering [3, 4, 5, 6]. Newer applications often gather high-dimensional data that present computational challenges to existing survival analysis methods. As an example, new technologies in genomics have led to high-dimensional microarray gene expression data where the number of predictor variables is of the order of 10^5. Other large-scale applications include medical adverse event monitoring, longitudinal clinical trials, and business data mining tasks. All of these applications require methods for analyzing high-dimensional data in a survival analysis framework.
In this paper we consider high-dimensional parametric survival regression models involving both large numbers of predictor variables and large numbers of observations. While Cox models continue to attract much attention [7], parametric survival models have always been a popular choice among statisticians for analyzing time-to-event data [4, 6, 8, 9]. Parametric survival models feature prominently in commercial statistical software, are straightforward to interpret, and can provide competitive predictive accuracy. To bring all of these advantages to high-dimensional data analysis, these methods need to be scaled to data involving 10^4–10^6 predictor variables and even larger numbers of observations.
Computing the maximum likelihood fit of a parametric survival model requires solving a non-linear optimization problem. Standard implementations work well for small-scale problems. Because these approaches typically require matrix inversion, solving large-scale problems using standard software is typically impossible. One possible remedy is to perform feature selection as a pre-processing step. Although feature selection does reduce memory and computational requirements and also serves as a practical solution to overfitting, it introduces new problems. First, the statistical consequences of most feature selection methods remain unclear, making it difficult to choose the number of features for a given task in a principled way. Second, the most efficient feature selection methods are greedy and may choose redundant or ineffective combinations of features. Finally, it is often unclear how to combine heuristic feature selection methods with domain knowledge. Even when standard software does produce estimates, numerical ill conditioning can result in a lack of convergence, large estimated coefficient variances and poor predictive accuracy or calibration.
In this paper, we describe a regularized approach to parametric survival analysis [10]. The main idea is the use of a regularizing prior probability distribution for the model parameters that favors sparseness in the fitted model, leading to point estimates for many of the model parameters being zero. To solve the optimization problem, we use a variation of the cyclic coordinate descent method [11, 12, 13, 14]. We show that application of this type of model to high-dimensional data avoids overfitting, can provide improved predictive performance and calibration over corresponding low-dimensional models, and is efficient both during fitting and at prediction time.
In Section 2, we describe related work that addresses similar problems. In Section 3, we describe the basics of the regularized approach to parametric survival analysis. In Section 4, we briefly describe the four parametric models used for survival analysis in the high-dimensional setting, together with our optimization technique, tailored to each specific model, for computing point estimates of the model parameters. More involved algorithmic details of the method pertaining to each of the four parametric models can be found in Appendix A. We describe the data sets and methods used in our experiments in Section 5 and provide experimental results in Section 6; Section 7 reports computation times. The first application uses a large data set of hospitalized injured children to develop a model for predicting survival. Through our experiments, we establish that an analysis using our proposed approach can add significantly to predictive performance as compared to traditional low-dimensional models. In the second application, we apply our method to a publicly available breast cancer gene expression data set and show that the high-dimensional parametric model can achieve similar performance to a low-dimensional Cox model while being better calibrated. Finally, we conclude in Section 8 with directions for future work.
We have publicly released the C++ implementation of our algorithm, which can be downloaded from http://code.google.com/p/survival-analysis-cmake/. This code has been derived from the widely-used BBR/BXR software for performing large-scale Bayesian logistic regression (http://www.bayesianregression.org). All four parametric models discussed in this paper are included in the implementation; it is also fairly straightforward to extend it to other parametric models. Computation of the various evaluation metrics discussed in Section 5 is also integrated into the code.
2. Related Work
Survival analysis is an old subject in statistics that continues to attract considerable research attention. Traditionally, survival analysis is concerned with the study of survival times in clinical and health related studies [4, 15]. However, over the past several decades, survival analysis has found an array of applications ranging from reliability studies in industrial engineering to analyses of inter-child birth times in demography and sociology [3]. Other application areas have also benefited from the use of these methods [8, 6].
Earlier applications usually consisted of a relatively small number of predictors (usually less than 20) with a few hundred or sometimes a few thousand examples. Recently, there has been considerable interest in analyzing high-dimensional time-to-event problems. For instance, a large body of work has focused on methodologies to handle the overwhelming amount of data generated by new technologies in biology such as gene expression microarrays and single nucleotide polymorphism (SNP) data. The goal of the work in high-dimensional survival analysis has been both to develop new statistical methods [16, 17, 18, 19, 20, 21] and to extend the existing methods to handle new data sets [22, 23, 13, 24]. For example, recent work [16, 18] has extended the traditional support vector machines used for regression to survival analysis by additionally penalizing discordant pairs of observations. Another method [19, 20] extends the use of random forests for variable selection for survival analysis. Similarly, other methods [22, 25] have used an elastic net approach for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model. This method is similar to other work [24] that applies an efficient method to compute L1-penalized parameter estimates for Cox models. A recent review [26] provides a survey of the existing methods for variable selection and model estimation for high-dimensional data.
Some more recent tools such as coxnet [27] and fastcox [25] adopt optimization approaches that can scale to the high-dimensional, high-sample-size data that we focus on. Both coxnet and fastcox provide estimates for the Cox proportional hazards model, and fastcox additionally supports elastic net regularization. However, although other models provided by the R package glmnet do support sparse formats, neither coxnet nor fastcox currently supports a sparse matrix format for the input data.
3. Regularized Survival Analysis
Denote by n the number of individuals in the training data. We represent their survival times by yi = min(ti, ci), i = 1, …, n, where ti and ci are the time to event (failure time) and right-censoring time for each individual respectively. Let δi = I(ti ≤ ci) be the indicator variable such that δi is one if the observation is not censored and zero otherwise. Further, let xi = [xi1, xi2, …, xip]⊤ be a p-vector of covariates. We assume that ti and ci are conditionally independent given xi and that the censoring mechanism is non-informative. The observed data comprise triplets D = {(yi, δi, xi):i = 1, …, n}.
Let θ be the set of unknown, underlying model parameters. We assume that the survival times y1, y2, …, yn arise in an independent and identically distributed fashion from density and survival functions f(y|θ) and S(y|θ) respectively, parametrized by θ. We are interested in the likelihood L(θ|D) of the parametric model, where
$$L(\theta \mid D) = \prod_{i=1}^{n} f(y_i \mid \theta)^{\delta_i}\, S(y_i \mid \theta)^{1-\delta_i} \tag{1}$$
We analyze and compare the performance of four different parametric models by modeling the distributions of the survival times using Exponential, Weibull, Log-Logistic, or Log-Normal distributions. Each of these distributions can be fully parametrized by the parameter pair θ = (λ, α). Typically, the parameter λ is re-parametrized in terms of the covariates x = [x1, x2, …, xp]⊤ and the vector β = [β1, β2, …, βp]⊤ such that λi = φ(β⊤xi), i = 1, …, n. In general, the mapping function φ(·) is different for each model and standard choices exist. The likelihood function of (1) in terms of the new parameters can be written as
$$L(\beta, \alpha \mid D) = \prod_{i=1}^{n} f\big(y_i \mid \varphi(\beta^{\top}x_i), \alpha\big)^{\delta_i}\, S\big(y_i \mid \varphi(\beta^{\top}x_i), \alpha\big)^{1-\delta_i} \tag{2}$$
The parameters β and α are estimated by maximizing their joint posterior density
$$\pi(\beta, \alpha \mid D) \propto L(\beta, \alpha \mid D)\, \pi(\beta)\, \pi(\alpha) \tag{3}$$
The maximizer of the joint posterior distribution of (β, α) does not usually have a closed form, but it can be shown that the conditional posterior distributions π(β|α, D) and π(α|β, D) are log-concave and can therefore be maximized efficiently. In practice, it is sufficient to estimate β and α by maximizing the conditional posterior of just β
$$(\hat{\beta}, \hat{\alpha}) = \arg\max_{\beta,\,\alpha}\ \big\{\log L(\beta, \alpha \mid D) + \log \pi(\beta)\big\} \tag{4}$$
For a model to generalize well to unseen test data, it is important to avoid overfitting the training data. In the Bayesian paradigm, this goal can be achieved by specifying an appropriate prior distribution on β such that each βj is likely to be near zero. As we will see in the following sections, both Gaussian and Laplacian priors fall into this category. Since we focus on posterior mode estimation in this paper, one can view our procedure as Bayesian or simply as a form of regularization or penalization. We use the Bayesian terminology in part because we view fully Bayesian computation as the next desirable step in large-scale survival analysis. We return to this point in the conclusion section.
3.1. Gaussian Priors and Ridge Regression
For L2 regularization, we assume a Gaussian prior for each βj with zero mean and variance τj, i.e.,
$$\pi(\beta_j \mid \tau_j) = \frac{1}{\sqrt{2\pi\tau_j}} \exp\!\Big(-\frac{\beta_j^{2}}{2\tau_j}\Big), \qquad j = 1, \ldots, p \tag{5}$$
The mean of zero encodes a prior preference for values of βj that are close to zero. The variances τj are positive constants that control the degree of regularization and are typically chosen through cross-validation. Smaller values of τj imply a stronger belief that βj is close to zero, while larger values impose a less-informative prior. In the simplest case we assume that τ1 = τ2 = … = τp. Assuming that the components of β are independent a priori, the overall prior for β can be expressed as the product of the priors of the individual βj, i.e., π(β) = ∏j π(βj|τj). Finding the maximum a posteriori estimate of β with this prior is equivalent to performing ridge regression [28]. Note that although the Gaussian prior favors values of βj close to zero, the posterior mode is generally not exactly zero for any βj.
3.2. Laplacian Prior and Lasso Regression
For L1 regularization, we again assume that each βj follows a Gaussian distribution with mean zero and variance τj. However, instead of being fixed, each τj is assumed to arise from an exponential distribution parametrized by γj and having density
$$\pi(\tau_j \mid \gamma_j) = \frac{\gamma_j}{2} \exp\!\Big(-\frac{\gamma_j \tau_j}{2}\Big), \qquad \tau_j > 0 \tag{6}$$
Integrating out τj gives an equivalent non-hierarchical, double-exponential (Laplace) distribution with density
$$\pi(\beta_j \mid \gamma_j) = \frac{\lambda_j}{2} \exp\!\big(-\lambda_j |\beta_j|\big), \qquad \lambda_j = \sqrt{\gamma_j} \tag{7}$$
Again, in the simplest case, we assume that γ1 = γ2 = … = γp. As with the Gaussian prior, assuming that the components of β are independent, π(β) = ∏j π(βj|γj). Finding the maximum a posteriori estimate of β with the Laplacian prior is equivalent to performing lasso regression [29]. With this approach, a sparse solution typically ensues, meaning that the posterior mode for many components of the β vector will be exactly zero.
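Written out explicitly, combining (4) with the Gaussian prior (5) or the Laplacian prior (7), with a single hyperparameter shared across coefficients, gives the familiar ridge- and lasso-type penalized maximum likelihood problems (up to additive constants):

$$\hat{\beta}_{\mathrm{ridge}} = \arg\max_{\beta}\Big\{\log L(\beta, \alpha \mid D) - \frac{1}{2\tau}\sum_{j=1}^{p}\beta_j^{2}\Big\}, \qquad \hat{\beta}_{\mathrm{lasso}} = \arg\max_{\beta}\Big\{\log L(\beta, \alpha \mid D) - \lambda\sum_{j=1}^{p}|\beta_j|\Big\}$$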
4. Parametric Models
It can be shown that, for both Gaussian and Laplacian regularization, the negated log-posterior of (4) is convex for all four parametric models considered in our work. Therefore, a wide range of optimization algorithms can be used. Due to the high dimensionality of our target applications, standard methods like Newton-Raphson cannot be used because of their high memory requirements. Many alternative optimization approaches have been proposed for computing MAP estimates in high-dimensional regression problems [30, 31]. We use the Combined Local and Global (CLG) algorithm of [30], a type of cyclic coordinate descent algorithm, because of CLG's favorable scaling to high-dimensional data and its ease of implementation. This method was successfully adapted to the lasso [14] for performing large-scale logistic regression and implemented in the widely-used BBR/BXR software (http://www.bayesianregression.org).
A cyclic coordinate descent algorithm begins by setting all p parameters βj, j = 1, …, p, to some initial value. It then sets the first parameter to a value that minimizes the objective function, holding all other parameters constant; this is a one-dimensional optimization problem. The algorithm then finds the minimizing value of the second parameter, while holding all other values constant (including the new value of the first parameter). The third parameter is then optimized, and so on. When all variables have been traversed, the algorithm returns to the first parameter and starts again. Multiple passes are made over the parameters until some convergence criterion is met. We note that, similar to previous work [14], instead of iteratively updating each parameter until convergence, we update it only once before proceeding to the next parameter. Since the optimal values of the other parameters are themselves changing, tuning a particular parameter to very high precision in each pass of the algorithm is not necessary. For more details of the CLG method, we refer readers to the relevant publications [30, 14]; a schematic sketch of the update loop is given below, and the details of our algorithm for the different parametric models are described in Appendix A. We also note that for the case of the Laplacian prior, the derivative of the negated log-posterior is undefined at βj = 0, j = 1, …, p. Section 4.3 of [14] describes the modification of the CLG algorithm that we utilize to address this issue.
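The following is a minimal sketch of this CLG-style loop, written in Python for illustration only; it is not the released C++ implementation. The callables `grad_j` and `hess_j` are hypothetical stand-ins for the model-specific first and second derivatives of the penalized negative log-posterior derived in Appendix A, and a fixed step cap stands in for the adaptive trust region of [30, 14].

```python
import numpy as np

def cyclic_coordinate_descent(grad_j, hess_j, p, n_passes=100, tol=1e-6, step_cap=1.0):
    """Sketch of a CLG-style cyclic coordinate descent loop.

    grad_j(beta, j) and hess_j(beta, j) are assumed to return the first and
    second derivatives of the penalized negative log-posterior with respect
    to coordinate j, holding the other coordinates fixed (see Appendix A).
    """
    beta = np.zeros(p)
    for _ in range(n_passes):
        max_change = 0.0
        for j in range(p):                                    # visit each coordinate in turn
            step = -grad_j(beta, j) / hess_j(beta, j)         # one Newton step, not iterated to convergence
            step = float(np.clip(step, -step_cap, step_cap))  # crude stand-in for the adaptive trust region
            beta[j] += step
            max_change = max(max_change, abs(step))
        if max_change < tol:                                  # stop when a full pass changes little
            break
    return beta
```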
5. Experiments
We tested our algorithm for large-scale parametric survival analysis on two different data sets. Below, we briefly describe the data sets and also motivate their choice for our work. In both cases, we compare the performance of the high-dimensional parametric model trained on all predictors with that of a low-dimensional model trained on a small subset of features.
5.1. Pediatric Trauma Data
Trauma is the leading cause of death and acquired disability in children and adolescents. Injuries result in more deaths in children than all other causes combined [32]. To provide optimal care, children at high risk for mortality need to be identified and triaged to centers with the resources to manage these patients. The overall goal of our analysis is to develop a model for predicting mortality after pediatric injury. For prediction within the domain of trauma, the literature has traditionally focused on low-dimensional analysis, i.e., modeling using a small set of features from the injury scene or emergency department [33, 34]. While approaches that use low-dimensional data to predict outcome may be easier to implement, the trade-off may be poorer predictive performance.
We obtained our data set from the National Trauma Data Bank (NTDB), a trauma database maintained by the American College of Surgeons. The data set includes 210,555 patient records of injured children < 15 years old collected over five years (2006–2010). We divided these data into a training data set (153,402 patients for years 2006–2009) and a testing data set (57,153 patients for year 2010). The mortality rate of the training set is 1.68% while that of the test set is 1.44%. There are a total of 125,952 binary predictors indicating the presence or absence of a particular attribute (or interaction among various attributes). The high-dimensional model was trained using all 125,952 predictors, while the low-dimensional model used only 41 predictors. The information about the various predictors used for both the high and low-dimensional models is summarized in Table 1.
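As an illustration of how such binary predictors can be formed, the sketch below builds main-effect indicators and pairwise co-occurrence indicators for a single patient record. The feature-naming scheme and the example codes are hypothetical and are shown only to make the structure of Table 1 concrete; the actual feature construction used for the NTDB analysis may differ.

```python
from itertools import combinations

def indicator_features(codes):
    """Build main-effect indicators plus pairwise co-occurrence indicators
    for one patient record (hypothetical naming scheme)."""
    codes = sorted(set(codes))
    feats = {f"code:{c}": 1 for c in codes}
    for a, b in combinations(codes, 2):
        feats[f"pair:{a}|{b}"] = 1           # co-occurrence of two codes
    return feats

# Toy record with two ICD-9 codes and one AIS code (code values are illustrative only).
print(indicator_features(["ICD9:807.00", "ICD9:860.0", "AIS:450210"]))
```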
Table 1.
Description of the predictors used for high and low-dimensional models for pediatric trauma data.
Predictor Type | # Predictors | Description | High-dim | Low-dim
---|---|---|---|---
Main Effects | | | |
ICD-9 Codes | 1,890 | International Classification of Disease, Ninth Revision (ICD-9) codes. | ✓ | ✗
AIS Codes (predots) | 349 | Abbreviated Injury Scale (AIS) codes that include the body region, the anatomic structure associated with the injury, and the level of injury. | ✓ | ✗
Interactions/Combinations | | | |
ICD-9, ICD-9 | 102,284 | Co-occurrences of two ICD-9 injury codes. | ✓ | ✗
AIS Code, AIS Code | 20,809 | Co-occurrences of two AIS codes. | ✓ | ✗
Body Region, AIS Score | 41 | Combinations of any of the 9 body regions associated with an injury and the injury severity score (between 1 and 6) determined according to the AIS coding scheme. | ✓ | ✓
[Body Region, AIS Score], [Body Region, AIS Score] | 579 | Co-occurrences of two [Body Region, AIS Score] combinations. | ✓ | ✗
5.2. Breast Cancer Gene Expression Data
For our second application, we analyzed the well-known breast cancer gene expression data set [35, 36]. This data set is publicly available and consists of cDNA expression profiles of 295 tumor samples from patients with breast cancer. These patients were diagnosed with breast cancer between 1984 and 1995 at the Netherlands Cancer Institute and were aged 52 years or younger at the time of diagnosis. Overall, 79 (26.78%) patients died during the follow-up time and the remaining 216 were censored. The total number of predictors (number of genes) is 24,885, each of which represents the log-ratio of the intensities of the two color dyes used for a specific gene. The high-dimensional model was trained using all 24,885 predictors. For the low-dimensional model, we used the glmpath (coxpath) package of R [37], which generates the entire regularization path and outputs predictors in order of their relative importance; we picked the top 5 predictors from this list and used them to train a low-dimensional model for comparison. The data were randomly split into training (67%) and testing (33%) sets such that the mortality rate in both parts was equal to the original mortality rate.
5.3. Hyperparameter Selection
The Gaussian and Laplace priors both require a prior variance, τj = σ², j = 1, …, p, for the parameter values. The actual hyperparameters are the variance σ² for the Gaussian prior and λ = √(2/σ²) for the Laplace prior. For both applications, the regularization parameter was selected using four-fold cross-validation on the training data. The prior variance σ² was varied between 10^-5 and 10^6 by multiples of 10. This amounts to varying the actual regularization parameter for the Gaussian prior between 10^-5 and 10^6 by multiples of 10, and that of the Laplacian between 0.0014 and 447.21 by multiples of √10. For each choice of the hyperparameter, we computed the sum of log-likelihoods for the patients in all four validation sets and chose the hyperparameter value that maximized this sum.
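A small sketch of the search grid implied by this description (assuming the variance-to-λ mapping λ = √(2/σ²) stated above) is:

```python
import numpy as np

# Candidate prior variances on a log10 grid, as described above.
variances = 10.0 ** np.arange(-5, 7)        # 1e-5, 1e-4, ..., 1e6

gaussian_grid = variances                   # Gaussian prior: the variance itself
laplace_grid = np.sqrt(2.0 / variances)     # Laplace prior: lambda = sqrt(2 / sigma^2)

# Endpoints of the Laplace grid, matching the 0.0014 - 447.21 range quoted above.
print(laplace_grid.min(), laplace_grid.max())
```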
5.4. Performance Evaluation
There are several ways to compare the performance of fitted parametric models. These evaluation metrics can be divided into two categories: those that assess the discriminative quality of the model and those that assess its calibration. As for any other regression model, the log-likelihood on the test data is an obvious choice for measuring the performance of penalized parametric survival regression. In the past several years, survival analysis-specific metrics have been proposed that take censoring into account. Among these metrics, the area under the ROC curve (AUC), also known as Harrell's c-statistic [38, 39], a measure of the discriminative accuracy of the model, has become one of the important criteria used to evaluate the performance of survival models [40]. Motivated by a similar test proposed by Hosmer and Lemeshow for the logistic regression model, we evaluated calibration using an overall goodness-of-fit test that has been proposed for survival data [41, 42]. Although the original test was proposed for Cox models, its use for parametric models has also been described [6, Chapter 8]. Below, we briefly describe these two metrics.
5.4.1. Harrell’s c-statistic
Harrell’s c-statistic is an extension of the traditional area under the curve (AUC) statistic, but is better suited to time-to-event data since it is independent of any thresholding process. The method is based on the comparison of the estimated versus the ground-truth ordering of risks between pairs of comparable subjects. Two subjects are said to be comparable if at least one of the subjects in the pair has developed the event (e.g., death) and if the follow-up duration for that subject is less than that of the other. Using the notation of Section 3, the comparability of an ordered pair of subjects (i, j) can be represented by the indicator variable ζij = I(yi < yj, δi = 1), such that ζij is one when the two subjects are comparable and zero otherwise. The total number of comparable pairs in the test data containing n subjects can then be computed as
$$P = \sum_{i=1}^{n}\sum_{j=1}^{n} \zeta_{ij} \tag{8}$$
To measure the predicted discrimination, the concordance of the comparable pairs is then estimated. A pair of comparable subjects (as defined above) is said to be concordant if the estimated risk of the subject who developed the event earlier is higher than that of the other subject. Therefore, the concordance of the ordered subject pair (i, j) can be represented using the indicator variable ξij = I(ζij = 1, ri > rj), where ri = −β⊤xi and rj = −β⊤xj are the relative risk scores of the ith and jth subjects. Thus, ξij is one when the two subjects are concordant and zero otherwise. The total number of concordant pairs can then be written as
$$C = \sum_{i=1}^{n}\sum_{j=1}^{n} \xi_{ij} \tag{9}$$
Finally, the c-statistic is given by
$$c = \frac{C}{P} \tag{10}$$
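A direct O(n²) implementation of (8)–(10) can be sketched as follows (a simple illustration in Python, not the evaluation code integrated into our released tool):

```python
import numpy as np

def harrell_c(y, delta, risk):
    """Harrell's c-statistic computed directly from its definition above.
    `risk` holds the relative risk scores r_i; ties are not treated specially."""
    y, delta, risk = map(np.asarray, (y, delta, risk))
    comparable = concordant = 0
    n = len(y)
    for i in range(n):
        if delta[i] != 1:
            continue                       # (i, j) is comparable only if subject i had the event
        for j in range(n):
            if i != j and y[i] < y[j]:     # ...and i's follow-up is shorter than j's
                comparable += 1
                if risk[i] > risk[j]:      # concordant: the earlier event has the higher risk
                    concordant += 1
    return concordant / comparable

# Toy usage: three subjects, one censored.
print(harrell_c(y=[2.0, 5.0, 3.0], delta=[1, 0, 1], risk=[0.9, 0.1, 0.5]))
```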
5.4.2. Hosmer-Lemeshow Statistic
To assess the overall goodness-of-fit using this test, the subjects are first sorted in order of their relative risk score (ri = −β⊤xi, i = 1, …, n) and divided into G equal-sized groups. The observed number of events og of the gth group is obtained by summing the number of non-censored observations in that group, while the expected number of events eg is computed by summing the fitted cumulative hazards of all the subjects in the same group. The χ² statistic for the overall goodness-of-fit is then given by
$$\chi^{2} = \sum_{g=1}^{G} \frac{(o_g - e_g)^{2}}{e_g} \tag{11}$$
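The corresponding computation of (11) can be sketched as below; here `cum_hazard` is assumed to hold each subject's fitted cumulative hazard, whose form is model-specific:

```python
import numpy as np

def hosmer_lemeshow_survival(risk, delta, cum_hazard, G=10):
    """Overall goodness-of-fit statistic as described above: sort subjects by
    risk score, split into G equal-sized groups, and compare observed event
    counts with expected counts (sums of fitted cumulative hazards)."""
    risk, delta, cum_hazard = map(np.asarray, (risk, delta, cum_hazard))
    order = np.argsort(risk)                 # sort subjects by relative risk score
    chi2 = 0.0
    for idx in np.array_split(order, G):     # G (nearly) equal-sized groups
        o_g = delta[idx].sum()               # observed events in group g
        e_g = cum_hazard[idx].sum()          # expected events in group g
        chi2 += (o_g - e_g) ** 2 / e_g
    return chi2
```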
6. Results
We now compare the performance of the low and high-dimensional parametric models on both data sets. For both applications, the hyperparameter σ² was selected by performing four-fold cross-validation on the training data.
6.1. Pediatric Trauma Data
Tables 2 and 3 summarize the results for the pediatric trauma data set for the low and high-dimensional models using all four parametric models with Gaussian and Laplacian penalization. Note that under the L2 prior, the estimate for any βj is never exactly zero and thus all variables contribute to the final model. The number of selected predictors in Table 2 therefore refers to the number of significant predictors (|βj| > 10^-4). The Hosmer-Lemeshow χ² index was computed using G = 50. The c-statistic is always substantially better for the high-dimensional models than for the corresponding low-dimensional models, and the test log-likelihood is also better in most cases (in all cases under Laplacian penalization). In many cases, the high-dimensional model is also better calibrated.
Table 2.
Comparison of low and high-dimensional models for pediatric trauma data with Gaussian penalization. The number of selected predictors refers to the number of significant predictors (|βj| > 10^-4).
Model Type | Predictors (Overall) | Predictors (Selected) | Log-likelihood | c-statistic | χ²
---|---|---|---|---|---
Exponential | | | | |
Low-dim | 41 | 41 | −4270.01 | 0.88 | 124.39
High-dim | 125,952 | 101,733 | −4372.41 | 0.94 | 543.28
Weibull | | | | |
Low-dim | 41 | 41 | −4242.38 | 0.88 | 131.60
High-dim | 125,952 | 101,794 | −4557.10 | 0.94 | 749.22
Log-Logistic | | | | |
Low-dim | 41 | 41 | −4120.66 | 0.89 | 95.37
High-dim | 125,952 | 100,889 | −3765.45 | 0.94 | 95.02
Log-Normal | | | | |
Low-dim | 41 | 41 | −3234.00 | 0.89 | 76.95
High-dim | 125,952 | 88,244 | −3129.02 | 0.93 | 165.68
Table 3.
Comparison of low and high-dimensional models for pediatric trauma data with Laplacian penalization.
Model Type | Predictors (Overall) | Predictors (Selected) | Log-likelihood | c-statistic | χ²
---|---|---|---|---|---
Exponential | | | | |
Low-dim | 41 | 41 | −4271.23 | 0.88 | 122.58
High-dim | 125,952 | 153 | −4034.67 | 0.92 | 94.34
Weibull | | | | |
Low-dim | 41 | 41 | −4243.57 | 0.88 | 126.84
High-dim | 125,952 | 151 | −3997.28 | 0.92 | 107.99
Log-Logistic | | | | |
Low-dim | 41 | 41 | −4122.07 | 0.89 | 94.73
High-dim | 125,952 | 432 | −3777.83 | 0.94 | 83.00
Log-Normal | | | | |
Low-dim | 41 | 41 | −3236.79 | 0.89 | 80.71
High-dim | 125,952 | 168 | −2974.49 | 0.93 | 89.36
To provide further insight, we grouped the subjects into low, medium, and high risk groups by sorting their relative risk scores in increasing order and using threshold values at the 33rd and 66th percentiles. For each group, we counted the number of events (number of non-censored observations). Tables 4 and 5 summarize the results for Gaussian and Laplacian penalization respectively. The results show that in almost all cases, the subjects assigned to the high-risk group by the high-dimensional models had more events than the ones assigned to the high-risk group by the low-dimensional models. Although both kinds of models observe similar numbers of events in the low-risk group, the mean observed survival time of the subjects having events is longer for the high-dimensional models than for the corresponding low-dimensional ones. These findings also suggest that in most cases the high-dimensional models are better calibrated than their low-dimensional counterparts.
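For reference, the grouping procedure just described can be sketched as follows (a simplified illustration, not the exact analysis code):

```python
import numpy as np

def risk_group_summary(risk, delta, y):
    """Split subjects at the 33rd and 66th percentiles of the risk score and
    report, for each group, the group size, the number of events, and the
    mean observed survival time (MST) among subjects with events."""
    risk, delta, y = map(np.asarray, (risk, delta, y))
    lo, hi = np.percentile(risk, [33, 66])
    labels = np.where(risk <= lo, "low", np.where(risk <= hi, "medium", "high"))
    for name in ("low", "medium", "high"):
        in_group = labels == name
        events = in_group & (delta == 1)
        mst = y[events].mean() if events.any() else float("nan")
        print(name, int(in_group.sum()), int(events.sum()), round(float(mst), 2))
```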
Table 4.
Comparison of number of events in low, medium and high risk groups for various low and high-dimensional models using Gaussian penalization for pediatric trauma data. MST stands for mean observed survival time in days.
Model Type | Risk Group | # Subjects | # Events (Low-dim) | MST (Low-dim) | # Events (High-dim) | MST (High-dim)
---|---|---|---|---|---|---
Exponential | Low | 18,860 | 10 | 2.00 | 8 | 3.38
Exponential | Medium | 18,860 | 41 | 4.10 | 18 | 4.33
Exponential | High | 19,433 | 774 | 3.66 | 799 | 3.65
Weibull | Low | 18,860 | 11 | 2.00 | 7 | 3.43
Weibull | Medium | 18,860 | 34 | 3.44 | 18 | 3.67
Weibull | High | 19,433 | 780 | 3.69 | 800 | 3.66
Log-Logistic | Low | 18,860 | 10 | 2.00 | 11 | 3.73
Log-Logistic | Medium | 18,860 | 32 | 4.13 | 21 | 4.14
Log-Logistic | High | 19,433 | 783 | 3.66 | 793 | 3.65
Log-Normal | Low | 18,860 | 10 | 2.00 | 16 | 6.56
Log-Normal | Medium | 18,860 | 31 | 4.10 | 39 | 3.23
Log-Normal | High | 19,433 | 784 | 3.66 | 770 | 3.62
Table 5.
Comparison of number of events in low, medium and high risk groups for various low and high-dimensional models using Laplacian penalization for pediatric trauma data. MST stands for mean observed survival time in days.
Model Type | Risk Group | # Subjects | # Events (Low-dim) | MST (Low-dim) | # Events (High-dim) | MST (High-dim)
---|---|---|---|---|---|---
Exponential | Low | 18,860 | 10 | 2.00 | 10 | 3.00
Exponential | Medium | 18,860 | 40 | 3.95 | 28 | 7.75
Exponential | High | 19,433 | 775 | 3.67 | 787 | 3.52
Weibull | Low | 18,860 | 10 | 2.00 | 10 | 3.00
Weibull | Medium | 18,860 | 37 | 3.32 | 20 | 9.80
Weibull | High | 19,433 | 778 | 3.70 | 795 | 3.51
Log-Logistic | Low | 18,860 | 10 | 2.00 | 4 | 4.50
Log-Logistic | Medium | 18,860 | 32 | 4.13 | 23 | 3.65
Log-Logistic | High | 19,433 | 783 | 3.66 | 798 | 3.66
Log-Normal | Low | 18,860 | 10 | 2.00 | 12 | 3.00
Log-Normal | Medium | 18,860 | 32 | 4.10 | 25 | 2.96
Log-Normal | High | 19,433 | 783 | 3.67 | 788 | 3.69
6.2. Breast Cancer Gene Expression Data
The results on the gene expression data set for the low and high-dimensional models using all four parametric models with Gaussian and Laplacian penalization are summarized in Tables 6 and 7 respectively. The Hosmer-Lemeshow χ² index was computed using G = 10 because of the small number of observations in the test set. While the discriminative performance of the two approaches is similar in most cases, the high-dimensional models are better calibrated in most cases, and uniformly so under Laplacian penalization.
Table 6.
Comparison of low and high-dimensional models for gene expression data with Gaussian penalization. The number of selected predictors refers to the number of significant predictors (|βj| > 10^-4).
Model Type | Predictors (Overall) | Predictors (Selected) | Log-likelihood | c-statistic | χ²
---|---|---|---|---|---
Exponential | | | | |
Low-dim | 5 | 5 | −86.26 | 0.71 | 16.83
High-dim | 24,496 | 24,344 | −98.72 | 0.75 | 112.69
Weibull | | | | |
Low-dim | 5 | 5 | −85.64 | 0.71 | 12.79
High-dim | 24,496 | 22,299 | −85.56 | 0.70 | 9.34
Log-Logistic | | | | |
Low-dim | 5 | 5 | −85.65 | 0.70 | 14.74
High-dim | 24,496 | 22,090 | −86.14 | 0.70 | 5.37
Log-Normal | | | | |
Low-dim | 5 | 5 | −52.10 | 0.70 | 16.02
High-dim | 24,496 | 24,154 | −65.14 | 0.66 | 23.61
Table 7.
Comparison of low and high-dimensional models for gene expression data with Laplacian penalization.
Model Type | Predictors (Overall) | Predictors (Selected) | Log-likelihood | c-statistic | χ²
---|---|---|---|---|---
Exponential | | | | |
Low-dim | 5 | 5 | −86.27 | 0.71 | 17.14
High-dim | 24,496 | 13 | −87.51 | 0.67 | 2.74
Weibull | | | | |
Low-dim | 5 | 5 | −85.63 | 0.71 | 12.79
High-dim | 24,496 | 13 | −86.80 | 0.66 | 3.75
Log-Logistic | | | | |
Low-dim | 5 | 5 | −85.65 | 0.70 | 14.71
High-dim | 24,496 | 9 | −86.20 | 0.68 | 3.66
Log-Normal | | | | |
Low-dim | 5 | 5 | −52.10 | 0.70 | 15.98
High-dim | 24,496 | 9 | −53.46 | 0.66 | 8.57
7. Computation Time
Table 8 summarizes the training time taken to fit the different parametric models on the low and high-dimensional data sets. All experiments were performed on a system with an Intel 2.4 GHz processor and 8 GB of memory. Note that even though the time taken to fit high-dimensional models to the pediatric trauma data set is much greater than that taken to fit the low-dimensional models, given the scale of the problem (153,402 patients with 125,952 predictors), the performance of the methods may be acceptable in many applications.
Table 8.
Computation time (in seconds) taken for training various low and high-dimensional models using Gaussian and Laplacian penalization for pediatric trauma and gene expression data.
Model Type | Trauma: Gaussian, Low-dim | Trauma: Gaussian, High-dim | Trauma: Laplacian, Low-dim | Trauma: Laplacian, High-dim | Gene Expr.: Gaussian, Low-dim | Gene Expr.: Gaussian, High-dim | Gene Expr.: Laplacian, Low-dim | Gene Expr.: Laplacian, High-dim
---|---|---|---|---|---|---|---|---
Exponential | 1 | 4,453 | 1 | 2,588 | 1 | 4 | 1 | 8
Weibull | 3 | 3,406 | 2 | 2,600 | 1 | 12 | 1 | 30
Log-Logistic | 2 | 2,794 | 2 | 3,278 | 1 | 13 | 1 | 38
Log-Normal | 6 | 3,250 | 6 | 3,101 | 1 | 51 | 1 | 52
8. Conclusions
We present a method to perform regularized parametric survival analysis on data with 10^4–10^6 predictor variables and a large number of observations. Through our experiments in the context of two different applications, we have demonstrated the advantage of using high-dimensional survival analysis over the corresponding low-dimensional models. We have provided a freely available software tool that implements our proposed algorithm. Future work will provide the extension to Cox proportional hazards models in addition to accelerated failure time models and Aalen's additive hazards model. We have also developed software for high-dimensional regularized generalized linear models that utilizes inexpensive massively parallel devices known as graphics processing units (GPUs) [43]. This provides more than an order-of-magnitude speed-up and in principle could be further developed to include survival analysis. Fully Bayesian extensions to our current work could explore the hierarchical framework to simultaneously model multiple time-to-event endpoints, model multi-level structure such as patients nested within hospitals, and incorporate prior information when available.
A. Appendix
Here we describe the details of our algorithm for the Exponential, Weibull, Log-Logistic, and Log-Normal distributions of the survival times.
A.1. Exponential
The exponential model is the simplest of all and can be parametrized using a single parameter λ, such that the density and survival functions can respectively be written as f(y|λ) = λ exp(−λy) and S(y|λ) = exp(−λy). The likelihood function of λ1, …, λn can be written as

$$L(\lambda_1,\ldots,\lambda_n \mid D) = \prod_{i=1}^{n} \lambda_i^{\delta_i} \exp(-\lambda_i y_i) \tag{12}$$
A common form for the mapping function φ(·) is φ(β⊤xi) = exp(β⊤xi). The likelihood function is then

$$L(\beta \mid D) = \prod_{i=1}^{n} \exp\!\big(\delta_i\, \beta^{\top}x_i\big) \exp\!\big(-y_i \exp(\beta^{\top}x_i)\big) \tag{13}$$
and the corresponding log-likelihood is
$$l(\beta) = \sum_{i=1}^{n} \Big[\delta_i\, \beta^{\top}x_i - y_i \exp(\beta^{\top}x_i)\Big] \tag{14}$$
Adding the Gaussian prior with mean zero and variance τj, the log-posterior can be written, up to an additive constant, as

$$l_{G}(\beta) = \sum_{i=1}^{n} \Big[\delta_i\, \beta^{\top}x_i - y_i \exp(\beta^{\top}x_i)\Big] - \sum_{j=1}^{p} \frac{\beta_j^{2}}{2\tau_j} \tag{15}$$
Similarly, for the Laplacian prior, the log-posterior can be written, up to an additive constant, as

$$l_{L}(\beta) = \sum_{i=1}^{n} \Big[\delta_i\, \beta^{\top}x_i - y_i \exp(\beta^{\top}x_i)\Big] - \sum_{j=1}^{p} \lambda_j |\beta_j| \tag{16}$$
Using the CLG algorithm, the one-dimensional problem involves finding βj(new), the value of the j-th entry of β that minimizes the negated log-posterior while the other entries of β are held at their current values. Writing ηi(z) = β⊤xi + (z − βj)xij for the linear predictor with the j-th coefficient set to z, and using (14) and (15) for the Gaussian prior (ignoring terms that do not depend on z), finding βj(new) is equivalent to finding the z that minimizes

$$g(z) = \sum_{i=1}^{n} \Big[y_i \exp\!\big(\eta_i(z)\big) - \delta_i\, \eta_i(z)\Big] + \frac{z^{2}}{2\tau_j} \tag{17}$$
The classic Newton method approximates the objective function g(·) by the first three terms of its Taylor series at the current βj
$$g(z) \approx g(\beta_j) + g'(\beta_j)(z - \beta_j) + \tfrac{1}{2}\, g''(\beta_j)(z - \beta_j)^{2} \tag{18}$$

where

$$g'(\beta_j) = \sum_{i=1}^{n} x_{ij}\Big[y_i \exp(\beta^{\top}x_i) - \delta_i\Big] + \frac{\beta_j}{\tau_j} \tag{19}$$

$$g''(\beta_j) = \sum_{i=1}^{n} x_{ij}^{2}\, y_i \exp(\beta^{\top}x_i) + \frac{1}{\tau_j} \tag{20}$$
Similarly, for the Laplacian prior (with βj ≠ 0),

$$g(z) = \sum_{i=1}^{n} \Big[y_i \exp\!\big(\eta_i(z)\big) - \delta_i\, \eta_i(z)\Big] + \lambda_j |z| \tag{21}$$

$$g'(\beta_j) = \sum_{i=1}^{n} x_{ij}\Big[y_i \exp(\beta^{\top}x_i) - \delta_i\Big] + \lambda_j\, \mathrm{sign}(\beta_j) \tag{22}$$

$$g''(\beta_j) = \sum_{i=1}^{n} x_{ij}^{2}\, y_i \exp(\beta^{\top}x_i) \tag{23}$$
The value of βj(new) for both types of priors can then be computed as

$$\beta_j^{(\mathrm{new})} = \beta_j - \frac{g'(\beta_j)}{g''(\beta_j)} \tag{24}$$
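For concreteness, a single coordinate update of this form for the exponential model can be sketched in Python as follows (a dense-matrix illustration of (17)–(24), not the released C++ implementation; the βj = 0 case of the Laplacian prior, handled as in Section 4.3 of [14], is omitted):

```python
import numpy as np

def exponential_coordinate_update(X, y, delta, beta, j, tau=None, lam=None):
    """One CLG-style Newton step for beta_j in the exponential model,
    following (17)-(24).  Pass tau (prior variance) for the Gaussian prior
    or lam for the Laplacian prior."""
    eta = X @ beta                           # linear predictors beta^T x_i
    w = y * np.exp(eta)                      # y_i * exp(beta^T x_i)
    g1 = np.sum(X[:, j] * (w - delta))       # first derivative of the negative log-likelihood, (19)/(22)
    g2 = np.sum(X[:, j] ** 2 * w)            # second derivative, (20)/(23)
    if tau is not None:                      # Gaussian prior: quadratic penalty terms
        g1 += beta[j] / tau
        g2 += 1.0 / tau
    else:                                    # Laplacian prior, beta_j != 0 case only
        g1 += lam * np.sign(beta[j])         # |beta_j| adds nothing to the second derivative
    return beta[j] - g1 / g2                 # Newton step (24)
```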
A.2. Weibull
The Weibull model is a more general parametric model with density and survival functions given by f(y|λ, α) = λ α y^(α−1) exp(−λ y^α) and S(y|λ, α) = exp(−λ y^α) respectively. The corresponding likelihood function can be written as

$$L(\lambda_1,\ldots,\lambda_n, \alpha \mid D) = \prod_{i=1}^{n} \big(\lambda_i\, \alpha\, y_i^{\alpha-1}\big)^{\delta_i} \exp\!\big(-\lambda_i y_i^{\alpha}\big) \tag{25}$$

Similar to the previous case, using λi = exp(β⊤xi), the log-likelihood function can be written as

$$l(\alpha, \beta) = \sum_{i=1}^{n} \Big[\delta_i\big(\log\alpha + (\alpha-1)\log y_i + \beta^{\top}x_i\big) - y_i^{\alpha} \exp(\beta^{\top}x_i)\Big] \tag{26}$$
Similar to equations (15) and (16), the conditional posteriors corresponding to the Gaussian and Laplacian priors can be written, up to additive constants, as

$$l_{G}(\alpha, \beta) = l(\alpha, \beta) - \sum_{j=1}^{p} \frac{\beta_j^{2}}{2\tau_j} \tag{27}$$

$$l_{L}(\alpha, \beta) = l(\alpha, \beta) - \sum_{j=1}^{p} \lambda_j |\beta_j| \tag{28}$$
Using the CLG algorithm, the one-dimensional problems involve finding α(new) and βj(new), the values of α and of the j-th entry of β that minimize −l(α, β) while all the other parameters are held at their current values. For both the Gaussian and Laplacian priors, finding α(new) is equivalent (ignoring terms that do not depend on z) to finding the z that minimizes

$$g(z) = \sum_{i=1}^{n} \Big[y_i^{z} \exp(\beta^{\top}x_i) - \delta_i\big(\log z + (z-1)\log y_i\big)\Big] \tag{29}$$
Writing the Taylor series expansion of g(z) around the current value α,

$$g(z) \approx g(\alpha) + g'(\alpha)(z - \alpha) + \tfrac{1}{2}\, g''(\alpha)(z - \alpha)^{2} \tag{30}$$

where

$$g'(\alpha) = \sum_{i=1}^{n} \Big[y_i^{\alpha} (\log y_i) \exp(\beta^{\top}x_i) - \delta_i\Big(\frac{1}{\alpha} + \log y_i\Big)\Big] \tag{31}$$

$$g''(\alpha) = \sum_{i=1}^{n} \Big[y_i^{\alpha} (\log y_i)^{2} \exp(\beta^{\top}x_i) + \frac{\delta_i}{\alpha^{2}}\Big] \tag{32}$$
The value of α(new) for both types of priors can then be computed as

$$\alpha^{(\mathrm{new})} = \alpha - \frac{g'(\alpha)}{g''(\alpha)} \tag{33}$$
The stepwise updates for β for both the Gaussian and Laplacian priors are similar to those of the exponential model and can be obtained by replacing yi with yi^α in equations (18)–(24).
A.3. Log-Logistic
For the Log-Logistic model, the density and survival functions are given by

$$f(y \mid \lambda, \alpha) = \frac{\lambda\, \alpha\, y^{\alpha-1}}{\big(1 + \lambda y^{\alpha}\big)^{2}}, \qquad S(y \mid \lambda, \alpha) = \frac{1}{1 + \lambda y^{\alpha}} \tag{34}$$
The corresponding likelihood function can be written as
$$L(\lambda_1,\ldots,\lambda_n, \alpha \mid D) = \prod_{i=1}^{n} \big(\lambda_i\, \alpha\, y_i^{\alpha-1}\big)^{\delta_i} \big(1 + \lambda_i y_i^{\alpha}\big)^{-(1+\delta_i)} \tag{35}$$

Re-parameterizing λi = exp(β⊤xi), the log-likelihood function can be written as

$$l(\alpha, \beta) = \sum_{i=1}^{n} \Big[\delta_i\big(\log\alpha + (\alpha-1)\log y_i + \beta^{\top}x_i\big) - (1+\delta_i)\log\!\big(1 + y_i^{\alpha}\exp(\beta^{\top}x_i)\big)\Big] \tag{36}$$
The posterior distributions corresponding to the Gaussian and Laplacian priors are similar to those for the Weibull model, (27) and (28). Using the CLG algorithm, for both types of priors, finding the update α(new) is equivalent (ignoring terms that do not depend on z) to finding the z that minimizes

$$g(z) = \sum_{i=1}^{n} \Big[(1+\delta_i)\log\!\big(1 + y_i^{z}\exp(\beta^{\top}x_i)\big) - \delta_i\big(\log z + (z-1)\log y_i\big)\Big] \tag{37}$$

The value of α(new) for both types of priors can then be computed using (33), where

$$g'(\alpha) = \sum_{i=1}^{n} \Big[(1+\delta_i)\, \frac{u_i \log y_i}{1 + u_i} - \delta_i\Big(\frac{1}{\alpha} + \log y_i\Big)\Big], \qquad u_i = y_i^{\alpha}\exp(\beta^{\top}x_i) \tag{38}$$

$$g''(\alpha) = \sum_{i=1}^{n} \Big[(1+\delta_i)\, \frac{u_i (\log y_i)^{2}}{(1 + u_i)^{2}} + \frac{\delta_i}{\alpha^{2}}\Big] \tag{39}$$
For the Gaussian prior, finding βj(new) is equivalent to finding the z that minimizes

$$g(z) = \sum_{i=1}^{n} \Big[(1+\delta_i)\log\!\big(1 + y_i^{\alpha}\exp(\eta_i(z))\big) - \delta_i\, \eta_i(z)\Big] + \frac{z^{2}}{2\tau_j} \tag{40}$$

with ηi(z) = β⊤xi + (z − βj)xij as in (17). Similar to the previous cases, the updated value can be computed using (24), where

$$g'(\beta_j) = \sum_{i=1}^{n} x_{ij}\Big[(1+\delta_i)\, \frac{u_i}{1 + u_i} - \delta_i\Big] + \frac{\beta_j}{\tau_j} \tag{41}$$

$$g''(\beta_j) = \sum_{i=1}^{n} x_{ij}^{2}\, (1+\delta_i)\, \frac{u_i}{(1 + u_i)^{2}} + \frac{1}{\tau_j} \tag{42}$$

and ui = yi^α exp(β⊤xi) as in (38).
Similarly, for the Laplacian prior (with βj ≠ 0),

$$g(z) = \sum_{i=1}^{n} \Big[(1+\delta_i)\log\!\big(1 + y_i^{\alpha}\exp(\eta_i(z))\big) - \delta_i\, \eta_i(z)\Big] + \lambda_j |z| \tag{43}$$

$$g'(\beta_j) = \sum_{i=1}^{n} x_{ij}\Big[(1+\delta_i)\, \frac{u_i}{1 + u_i} - \delta_i\Big] + \lambda_j\, \mathrm{sign}(\beta_j) \tag{44}$$

$$g''(\beta_j) = \sum_{i=1}^{n} x_{ij}^{2}\, (1+\delta_i)\, \frac{u_i}{(1 + u_i)^{2}} \tag{45}$$
A.4. Log-Normal
Assuming that the survival times yi, i = 1, …, n follow a log-normal distribution is equivalent to assuming that their logarithms wi = log yi, i = 1, …, n follow a normal N(μ, σ²) distribution with density and survival functions given by f(w|μ, σ) = (1/(σ√(2π))) exp(−(w − μ)²/(2σ²)) and S(w|μ, σ) = 1 − Φ((w − μ)/σ) respectively, where Φ(·) is the standard Gaussian cumulative distribution function. Following convention, we replace λ and α with μ and σ respectively. The likelihood function of μ1, μ2, …, μn and σ can be written as

$$L(\mu_1,\ldots,\mu_n, \sigma \mid D) = \prod_{i=1}^{n} \Big[\frac{1}{\sigma\sqrt{2\pi}}\exp\!\Big(-\frac{(w_i-\mu_i)^{2}}{2\sigma^{2}}\Big)\Big]^{\delta_i}\Big[1 - \Phi\Big(\frac{w_i-\mu_i}{\sigma}\Big)\Big]^{1-\delta_i} \tag{46}$$

where wi = log yi. Re-parameterizing μi = β⊤xi, the log-likelihood function can be written as

$$l(\sigma, \beta) = \sum_{i=1}^{n} \Big[\delta_i\Big(-\log\sigma - \tfrac{1}{2}\log(2\pi) - \frac{(w_i - \beta^{\top}x_i)^{2}}{2\sigma^{2}}\Big) + (1-\delta_i)\log\!\Big(1 - \Phi\Big(\frac{w_i - \beta^{\top}x_i}{\sigma}\Big)\Big)\Big] \tag{47}$$
The conditional posterior densities corresponding to the Gaussian and Laplacian priors can be written, up to additive constants, as

$$l_{G}(\sigma, \beta) = l(\sigma, \beta) - \sum_{j=1}^{p} \frac{\beta_j^{2}}{2\tau_j} \tag{48}$$

$$l_{L}(\sigma, \beta) = l(\sigma, \beta) - \sum_{j=1}^{p} \lambda_j |\beta_j| \tag{49}$$
Assuming that the other parameters are held at their current values, the one-dimensional problems involve finding σ(new) and βj(new) that minimize the negated posterior. Using the CLG algorithm, for both types of priors, finding σ(new) is equivalent (ignoring terms that do not depend on z) to finding the z that minimizes

$$g(z) = \sum_{i=1}^{n} \Big[\delta_i\Big(\log z + \frac{(w_i - \beta^{\top}x_i)^{2}}{2z^{2}}\Big) - (1-\delta_i)\log\!\Big(1 - \Phi\Big(\frac{w_i - \beta^{\top}x_i}{z}\Big)\Big)\Big] \tag{50}$$
The value of σ(new) can then be computed as

$$\sigma^{(\mathrm{new})} = \sigma - \frac{g'(\sigma)}{g''(\sigma)} \tag{51}$$

Here

$$g'(\sigma) = \frac{1}{\sigma}\sum_{i=1}^{n} \Big[\delta_i\big(1 - p_i^{2}\big) - (1-\delta_i)\, q_i\, p_i\Big] \tag{52}$$

$$g''(\sigma) = \frac{1}{\sigma^{2}}\sum_{i=1}^{n} \Big[\delta_i\big(3p_i^{2} - 1\big) + (1-\delta_i)\, q_i\, p_i\big(p_i(q_i - p_i) + 2\big)\Big] \tag{53}$$

where

$$p_i = \frac{w_i - \beta^{\top}x_i}{\sigma}, \qquad q_i = \frac{\phi(p_i)}{1 - \Phi(p_i)} \tag{54}$$

and φ(·) denotes the standard Gaussian density.
For the Gaussian prior, finding βj(new) is equivalent to finding the z that minimizes

$$g(z) = \sum_{i=1}^{n} \Big[\delta_i\, \frac{\big(w_i - \eta_i(z)\big)^{2}}{2\sigma^{2}} - (1-\delta_i)\log\!\Big(1 - \Phi\Big(\frac{w_i - \eta_i(z)}{\sigma}\Big)\Big)\Big] + \frac{z^{2}}{2\tau_j} \tag{55}$$

where ηi(z) = β⊤xi + (z − βj)xij as before. The updated value can be computed using (24), where

$$g'(\beta_j) = -\frac{1}{\sigma}\sum_{i=1}^{n} x_{ij}\Big[\delta_i\, p_i + (1-\delta_i)\, q_i\Big] + \frac{\beta_j}{\tau_j} \tag{56}$$

$$g''(\beta_j) = \frac{1}{\sigma^{2}}\sum_{i=1}^{n} x_{ij}^{2}\Big[\delta_i + (1-\delta_i)\, q_i\big(q_i - p_i\big)\Big] + \frac{1}{\tau_j} \tag{57}$$

and pi and qi are the same as in (54). Similarly, for the Laplacian prior (with βj ≠ 0),

$$g(z) = \sum_{i=1}^{n} \Big[\delta_i\, \frac{\big(w_i - \eta_i(z)\big)^{2}}{2\sigma^{2}} - (1-\delta_i)\log\!\Big(1 - \Phi\Big(\frac{w_i - \eta_i(z)}{\sigma}\Big)\Big)\Big] + \lambda_j |z| \tag{58}$$

$$g'(\beta_j) = -\frac{1}{\sigma}\sum_{i=1}^{n} x_{ij}\Big[\delta_i\, p_i + (1-\delta_i)\, q_i\Big] + \lambda_j\, \mathrm{sign}(\beta_j) \tag{59}$$

$$g''(\beta_j) = \frac{1}{\sigma^{2}}\sum_{i=1}^{n} x_{ij}^{2}\Big[\delta_i + (1-\delta_i)\, q_i\big(q_i - p_i\big)\Big] \tag{60}$$
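Numerically, the ratio qi is prone to underflow when pi is large; one stable way to evaluate it (a short sketch assuming the definitions in (54)) is via log-densities:

```python
import numpy as np
from scipy.stats import norm

def mills_terms(w, eta, sigma):
    """Compute p_i = (w_i - beta^T x_i)/sigma and q_i = phi(p_i)/(1 - Phi(p_i))
    using log-density and log-survival functions for numerical stability."""
    p = (np.asarray(w) - np.asarray(eta)) / sigma
    q = np.exp(norm.logpdf(p) - norm.logsf(p))   # phi(p) / (1 - Phi(p)), computed in log space
    return p, q
```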
Footnotes
This research was supported by an NIH-NIGMS grant awarded to Childrens National Medical Center (R01GM087600-01)
References
1. Oakes D. Biometrika centenary: Survival analysis. Biometrika. 2001;88(1):99–142.
2. Kalbfleisch J, Prentice R. The Statistical Analysis of Failure Time Data. John Wiley and Sons; New York: 1980.
3. Box-Steffensmeier JM, Jones BS. Event History Modeling: A Guide for Social Scientists. Cambridge University Press; Cambridge, UK: 2004.
4. Collett D. Modelling Survival Data in Medical Research. 2nd ed. Chapman & Hall; London, UK: 2003.
5. Heckman J, Singer B. Longitudinal Analysis of Labor Market Data. Cambridge University Press; Cambridge, UK: 1985.
6. Hosmer DW, Lemeshow S, May S. Applied Survival Analysis: Regression Modeling of Time to Event Data. 2nd ed. Wiley-Interscience; 2008.
7. Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16(4):385–395.
8. Klein J, Moeschberger M. Survival Analysis: Techniques for Censored and Truncated Data. 2nd ed. John Wiley and Sons; New York: 2003.
9. Lee ET, Wang J. Statistical Methods for Survival Data Analysis. 2nd ed. Wiley-Interscience; 2003.
10. Ibrahim JG, Chen MH, Sinha D. Bayesian Survival Analysis. Springer-Verlag; 2001.
11. Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications. 2001;109(3):475–494.
12. Koh K, Kim SJ, Boyd S. An interior-point method for large-scale L1-regularized logistic regression. Journal of Machine Learning Research. 2007;8:1519–1555.
13. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1–22.
14. Genkin A, Lewis DD, Madigan D. Large-scale Bayesian logistic regression for text categorization. Technometrics. 2007;49(3):291–304.
15. Lawless JF. Statistical Models and Methods for Lifetime Data. 2nd ed. Wiley-Interscience; 2003.
16. Evers L, Messow CM. Sparse kernel methods for high-dimensional survival data. Bioinformatics. 2008;24(14):1632–1638.
17. Shivaswamy PK, Chu W, Jansche M. A support vector approach to censored targets. In: IEEE International Conference on Data Mining; 2007. p. 655–660.
18. Van Belle V, Pelckmans K, Van Huffel S, Suykens JAK. Improved performance on high-dimensional survival data by application of survival-SVM. Bioinformatics. 2011;27(1):87–94.
19. Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS. High-dimensional variable selection for survival data. Journal of the American Statistical Association. 2010;105(489):205–217.
20. Ishwaran H, Kogalur UB, Chen X, Minn AJ. Random survival forests for high-dimensional data. Statistical Analysis and Data Mining. 2011;4(1):115–132.
21. Lisboa PJG, Etchells TA, Jarman IH, Arsene CTC, Aung MSH, Eleuteri A, Taktak AFG, Ambrogi F, Boracchi P, Biganzoli E. Partial logistic artificial neural network for competing risks regularized with automatic relevance determination. IEEE Transactions on Neural Networks. 2009;20(9):1403–1416.
22. Engler D, Li Y. Survival analysis with high-dimensional covariates: An application in microarray studies. Statistical Applications in Genetics and Molecular Biology. 2009;8(1):1–22.
23. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21(13):3001–3008.
24. Goeman JJ. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal. 2010;52(1):70–84.
25. Yang Y, Zou H. A cocktail algorithm for solving the elastic net penalized Cox's regression in high dimensions. Statistics and Its Interface. 2012.
26. Witten DM, Tibshirani R. Survival analysis with high-dimensional covariates. Statistical Methods in Medical Research. 2010;19(1):29–51.
27. Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software. 2011;39(5):1–13.
28. Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
29. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
30. Zhang T, Oles FJ. Text categorization based on regularized linear classification methods. Information Retrieval. 2000;4:5–31.
31. Kivinen J, Warmuth MK. Relative loss bounds for multidimensional regression problems. Machine Learning. 2001:301–329.
32. National Center for Injury Prevention and Control. CDC Injury Fact Book. Centers for Disease Control and Prevention; Atlanta, GA: 2006.
33. Mackersie RC. History of trauma field triage development and the American College of Surgeons criteria. Prehospital Emergency Care. 2006;10(3):287–294.
34. Committee on Trauma. Resources for Optimal Care of the Injured Patient. American College of Surgeons; Chicago, IL: 2006.
35. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–536.
36. van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine. 2002;347(25):1999–2009.
37. Park MY, Hastie T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2007;69(4):659–677.
38. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. The Journal of the American Medical Association. 1982;247(18):2543–2546.
39. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine. 1996;15(4):361–387.
40. Chambless LE, Cummiskey CP, Cui G. Several methods to assess improvement in risk prediction models: Extension to survival analysis. Statistics in Medicine. 2011;30(1):22–38.
41. Grønnesby JK, Borgan Ø. A method for checking regression models in survival analysis based on the risk score. Lifetime Data Analysis. 1996;2(4):315–328.
42. May S, Hosmer DW. A simplified method of calculating an overall goodness-of-fit test for the Cox proportional hazards model. Lifetime Data Analysis. 1998;4(2):109–120.
43. Suchard M, Simpson S, Zorych I, Ryan P, Madigan D. Massive parallelization of serial inference algorithms for a complex generalized linear model. ACM Transactions on Modeling and Computer Simulation. 2013; to appear.