Author manuscript; available in PMC: 2014 Oct 28.
Published in final edited form as: Stat Med. 2013 Apr 28;32(23):3955–3971. doi: 10.1002/sim.5817

Large-Scale Parametric Survival Analysis

Sushil Mittal a,*, David Madigan a, Jerry Cheng b, Randall S Burd c
PMCID: PMC3796130  NIHMSID: NIHMS472885  PMID: 23625862

Abstract

Survival analysis has been a topic of active statistical research in the past few decades, with applications spread across several areas. Traditional applications usually consider data with only a small number of predictors and a few hundred or thousand observations. Recent advances in data acquisition techniques and computing power have led to considerable interest in analyzing very high-dimensional data where the number of predictor variables and the number of observations range between 10^4 and 10^6. In this paper, we present a tool for performing large-scale regularized parametric survival analysis using a variant of the cyclic coordinate descent method. Through our experiments on two real data sets, we show that application of regularized models to high-dimensional data avoids overfitting and can provide improved predictive performance and calibration over corresponding low-dimensional models.

Keywords: Survival analysis, parametric models, regularization, penalized regression, pediatric trauma

1. Introduction

Regression analysis of time-to-event data occupies a central role in statistical practice [1, 2] with applications spread across several fields including biostatistics, sociology, economics, demography, and engineering [3, 4, 5, 6]. Newer applications often gather high-dimensional data that present computational challenges to existing survival analysis methods. As an example, new technologies in genomics have led to high-dimensional microarray gene expression data where the number of predictor variables is of the order of 10^5. Other large-scale applications include medical adverse event monitoring, longitudinal clinical trials, and business data mining tasks. All of these applications require methods for analyzing high-dimensional data in a survival analysis framework.

In this paper we consider high-dimensional parametric survival regression models involving both large numbers of predictor variables as well as large numbers of observations. While Cox models continue to attract much attention [7], parametric survival models have always been a popular choice among statisticians for analyzing time-to-event data [4, 6, 8, 9]. Parametric survival models feature prominently in commercial statistical software, are straightforward to interpret, and can provide competitive predictive accuracy. To bring all these advantages to high-dimensional data analysis, these methods need to be scaled to data involving 10^4 – 10^6 predictor variables and even larger numbers of observations.

Computing the maximum likelihood fit of a parametric survival model requires solving a non-linear optimization problem. Standard implementations work well for small-scale problems, but because these approaches typically require matrix inversion, solving large-scale problems with standard software is usually impossible. One possible remedy is to perform feature selection as a pre-processing step. Although feature selection reduces memory and computational requirements and also serves as a practical defense against overfitting, it introduces new problems. First, the statistical consequences of most feature selection methods remain unclear, making it difficult to choose the number of features for a given task in a principled way. Second, the most efficient feature selection methods are greedy and may choose redundant or ineffective combinations of features. Finally, it is often unclear how to combine heuristic feature selection methods with domain knowledge. Even when standard software does produce estimates, numerical ill-conditioning can result in a lack of convergence, large estimated coefficient variances, and poor predictive accuracy or calibration.

In this paper, we describe a regularized approach to parametric survival analysis [10]. The main idea is the use of a regularizing prior probability distribution for the model parameters that favors sparseness in the fitted model, leading to point estimates for many of the model parameters being zero. To solve the optimization problem, we use a variation of the cyclic coordinate descent method [11, 12, 13, 14]. We show that application of this type of model to high-dimensional data avoids overfitting, can provide improved predictive performance and calibration over corresponding low-dimensional models, and is efficient both during fitting and at prediction time.

In Section 2, we review related work that addresses similar problems. In Section 3, we describe the basics of the regularized approach to parametric survival analysis. In Section 4, we briefly describe the four parametric models used for survival analysis in the high-dimensional formulation, together with our optimization technique, which is tailored to each specific model for computing point estimates of the model parameters. More detailed algorithmic derivations for each of the four parametric models can be found in Appendix A. We describe the data sets and methods used in our experiments in Section 5 and provide experimental results in Section 6. The first application uses a large data set of hospitalized injured children to develop a model for predicting survival. Through our experiments, we establish that an analysis using our proposed approach can add significantly to predictive performance compared with traditional low-dimensional models. In the second application, we apply our method to a publicly available breast cancer gene expression data set and show that the high-dimensional parametric model can achieve performance similar to a low-dimensional Cox model while being better calibrated. Section 7 reports computation times, and we conclude in Section 8 with directions for future work.

We have publicly released the C++ implementation of our algorithm, which can be downloaded from http://code.google.com/p/survival-analysis-cmake/. This code is derived from the widely used BBR/BXR software for large-scale Bayesian logistic regression (http://www.bayesianregression.org). All four parametric models discussed in this paper are included in the implementation; however, it is fairly straightforward to extend it to other parametric models. Computation of the various evaluation metrics discussed in Section 5 is also integrated into the code.

2. Related Work

Survival analysis is an old subject in statistics that continues to attract considerable research attention. Traditionally, survival analysis is concerned with the study of survival times in clinical and health related studies [4, 15]. However, over the past several decades, survival analysis has found an array of applications ranging from reliability studies in industrial engineering to analyses of inter-child birth times in demography and sociology [3]. Other application areas have also benefited from the use of these methods [8, 6].

Earlier applications usually consisted of a relatively small number of predictors (usually less than 20) with a few hundred or sometimes a few thousand examples. Recently, there has been considerable interest in analyzing high-dimensional time-to-event problems. For instance, a large body of work has focused on methodologies to handle the overwhelming amount of data generated by new technologies in biology such as gene expression microarrays and single nucleotide polymorphism (SNP) data. The goal of the work in high-dimensional survival analysis has been both to develop new statistical methods [16, 17, 18, 19, 20, 21] and to extend the existing methods to handle new data sets [22, 23, 13, 24]. For example, recent work [16, 18] has extended the traditional support vector machines used for regression to survival analysis by additionally penalizing discordant pairs of observations. Another method [19, 20] extends the use of random forests for variable selection for survival analysis. Similarly, other methods [22, 25] have used an elastic net approach for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model. This method is similar to other work [24] that applies an efficient method to compute L1-penalized parameter estimates for Cox models. A recent review [26] provides a survey of the existing methods for variable selection and model estimation for high-dimensional data.

Some more recent tools, such as coxnet [27] and fastcox [25], adopt optimization approaches that can scale to the high-dimensional, large-sample-size data that we focus on. Although other models provided by the R package glmnet do support sparse formats, neither coxnet nor fastcox currently supports a sparse matrix format for the input data. Both coxnet and fastcox provide estimates for the Cox proportional hazards model, and fastcox additionally supports elastic net regularization.

3. Regularized Survival Analysis

Denote by n the number of individuals in the training data. We represent their survival times by yi = min(ti, ci), i = 1, …, n, where ti and ci are the time to event (failure time) and the right-censoring time for each individual, respectively. Let δi = I(ti ≤ ci) be the indicator variable such that δi is one if the observation is not censored and zero otherwise. Further, let xi = [xi1, xi2, …, xip] be a p-vector of covariates. We assume that ti and ci are conditionally independent given xi and that the censoring mechanism is non-informative. The observed data comprise the triplets D = {(yi, δi, xi) : i = 1, …, n}.

Let θ be the set of unknown, underlying model parameters. We assume that the survival times y1, y2, …, yn arise in an independent and identically distributed fashion from density and survival functions f(y|θ) and S(y|θ) respectively, parametrized by θ. We are interested in the likelihood L(θ|D) of the parametric model, where

$$L(\theta \mid D) = \prod_{i=1}^{n} f(y_i \mid \theta)^{\delta_i}\, S(y_i \mid \theta)^{(1-\delta_i)} \qquad (1)$$

We analyze and compare the performance of four different parametric models by modeling the distributions of the survival times using Exponential, Weibull, Log-Logistic or Log-Normal distributions. Each of these distributions can be fully parametrized by the parameter pair θ = (λ, α). Typically, the parameter λ is re-parametrized in terms of the covariates x = [x1, x2, …, xp] and the vector β = [β1, β2, …, βp] such that λi = φ(β′xi), i = 1, …, n. In general, the mapping function φ(·) is different for each model and standard choices exist. The likelihood function of (1) in terms of the new parameters can be written as

$$L(\beta, \alpha \mid D) = \prod_{i=1}^{n} f(y_i \mid \lambda_i, \alpha)^{\delta_i}\, S(y_i \mid \lambda_i, \alpha)^{(1-\delta_i)}. \qquad (2)$$
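To make the role of the censoring indicator in (2) concrete, the following minimal Python sketch evaluates a censored log-likelihood of this form for generic log-density and log-survival functions; the function name and the exponential example values at the bottom are purely illustrative.

```python
import numpy as np

def censored_log_likelihood(y, delta, log_f, log_S):
    """Censored log-likelihood of the form (2): uncensored observations
    contribute log f(y_i), censored observations contribute log S(y_i)."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return np.sum(delta * log_f(y) + (1.0 - delta) * log_S(y))

# Illustrative exponential example with lambda_i = exp(beta'x_i)
# (all numerical values below are made up).
X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.7]])
beta = np.array([0.5, -1.0])
lam = np.exp(X @ beta)                        # lambda_i for each subject
y = np.array([2.0, 5.0, 1.0])                 # observed times y_i = min(t_i, c_i)
delta = np.array([1, 0, 1])                   # 1 = event observed, 0 = censored
ll = censored_log_likelihood(
    y, delta,
    log_f=lambda t: -np.log(lam) - t / lam,   # log f(t) for the exponential model
    log_S=lambda t: -t / lam)                 # log S(t) for the exponential model
print(ll)
```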

The parameters β and α are estimated by maximizing their joint posterior density

$$L(\beta, \alpha) \propto L(\beta, \alpha \mid D)\, \pi(\beta)\, \pi(\alpha). \qquad (3)$$

The joint posterior distribution of (β, α) does not usually have a closed form, but it can be shown that the conditional posterior distributions π(β|α, D) and π(α|β, D) are log-concave and can therefore be maximized efficiently. In practice, it is sufficient to estimate β and α by maximizing the conditional posterior of just β

$$L(\beta) \propto L(\beta, \alpha \mid D)\, \pi(\beta). \qquad (4)$$

For a model to generalize well to unseen test data, it is important to avoid overfitting the training data. In the Bayesian paradigm, this goal can be achieved by specifying an appropriate prior distribution on β such that each βj is likely to be near zero. As we will see in the following sections, both Gaussian and Laplacian priors fall into this category. Since we focus on posterior mode estimation in this paper, one can view our procedure as Bayesian or simply as a form of regularization or penalization. We use the Bayesian terminology in part because we view fully Bayesian computation as the next desirable step in large-scale survival analysis. We return to this point in the concluding section.

3.1. Gaussian Priors and Ridge Regression

For L2 regularization, we assume a Gaussian prior for each βj with zero mean and variance τj, i.e.,

$$\pi(\beta_j \mid \tau_j) = N(0, \tau_j) = \frac{1}{\sqrt{2\pi\tau_j}} \exp\!\left(-\frac{\beta_j^2}{2\tau_j}\right). \qquad (5)$$

The mean of zero encodes a prior preference for values of βj that are close to zero. The variances τj are positive constants that control the degree of regularization and are typically chosen through cross-validation. Smaller values of τj imply a stronger belief that βj is close to zero, while larger values impose a less informative prior. In the simplest case, we assume that τ1 = τ2 = … = τp. Assuming that the components of β are independent a priori, the overall prior for β can be expressed as the product of the priors on the individual βj, i.e., π(β | τ1, …, τp) = ∏_{j=1}^{p} π(βj | τj). Finding the maximum a posteriori estimate of β with this prior is equivalent to performing ridge regression [28]. Note that although the Gaussian prior favors values of βj close to zero, the posterior mode is generally not exactly zero for any βj.
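Equivalently, up to additive constants, finding the posterior mode under this prior amounts to the ridge-penalized maximization

$$\hat{\beta} = \arg\max_{\beta}\;\Big\{ \log L(\beta, \alpha \mid D) \;-\; \sum_{j=1}^{p} \frac{\beta_j^2}{2\tau_j} \Big\},$$

so that a single shared variance τ plays the role of the usual ridge penalty with weight 1/(2τ).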

3.2. Laplacian Prior and Lasso Regression

For L1 regularization, we again assume that each βj follows a Gaussian distribution with mean zero and variance τj. However, instead of fixing the τj's, we assume that each τj arises from an exponential distribution parametrized by γj and having density

$$\pi(\tau_j \mid \gamma_j) = \frac{\gamma_j}{2} \exp\!\left(-\frac{\gamma_j}{2}\,\tau_j\right). \qquad (6)$$

Integrating out τj gives an equivalent non-hierarchical, double-exponential (Laplace) distribution with density

$$\pi(\beta_j \mid \gamma_j) = \frac{\gamma_j}{2} \exp\!\left(-\gamma_j |\beta_j|\right). \qquad (7)$$

Again, in the simplest case, we assume that γ1 = γ2 = … = γp. As with the Gaussian prior, assuming that the components of β are independent, π(β | γ1, …, γp) = ∏_{j=1}^{p} π(βj | γj). Finding the maximum a posteriori estimate of β with the Laplacian prior is equivalent to performing lasso regression [29]. With this approach a sparse solution typically ensues, meaning that the posterior mode of many components of the β vector will be zero.
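Equivalently, up to additive constants, the posterior mode under the Laplacian prior solves the lasso-penalized problem

$$\hat{\beta} = \arg\max_{\beta}\;\Big\{ \log L(\beta, \alpha \mid D) \;-\; \sum_{j=1}^{p} \gamma_j |\beta_j| \Big\},$$

so that a single shared γ plays the role of the usual lasso penalty weight, and the non-differentiability of the penalty at βj = 0 is what produces exact zeros in the solution.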

4. Parametric Models

It can be shown that for both Gaussian and Laplacian regularization, the negated log-posterior of (4) is convex for all four parametric models considered in our work. Therefore, a wide range of optimization algorithms can be used. Because of the high dimensionality of our target applications, standard methods such as Newton-Raphson cannot be used because of their high memory requirements. Many alternative optimization approaches have been proposed for MAP estimation in high-dimensional regression problems [30, 31]. We use the Combined Local and Global (CLG) algorithm of [30], a type of cyclic coordinate descent algorithm, because of CLG's favorable scaling to high-dimensional data and its ease of implementation. This method was successfully adapted to the lasso [14] for performing large-scale logistic regression and implemented in the widely used BBR/BXR software (http://www.bayesianregression.org).

A cyclic coordinate descent algorithm begins by setting all p parameters βj, j = 1, …, p, to some initial value. It then sets the first parameter to a value that minimizes the objective function, holding all other parameters constant; this is a one-dimensional optimization. The algorithm then finds the minimizing value of the second parameter, while holding all other values constant (including the new value of the first parameter). The third parameter is then optimized, and so on. When all variables have been traversed, the algorithm returns to the first parameter and starts again. Multiple passes are made over the parameters until some convergence criterion is met. We note that, as in previous work [14], instead of iteratively updating each parameter until convergence, we update it only once before proceeding to the next parameter. Since the optimal values of the other parameters are themselves changing, tuning a particular parameter to very high precision in each pass of the algorithm is unnecessary. For more details of the CLG method, we refer readers to the relevant publications [30, 14]. The details of our algorithm for the different parametric models are described in Appendix A. We also note that for the Laplacian prior, the derivative of the negated log-posterior is undefined at βj = 0, j = 1, …, p. Section 4.3 of [14] describes the modification of the CLG algorithm that we use to address this issue.
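To make the structure of this scheme concrete, the following Python sketch illustrates a one-Newton-step-per-coordinate cyclic pass; the coordinate-wise derivative callables `grad_j` and `hess_j` stand in for the model-specific expressions derived in Appendix A, and the fixed step bound `delta_max` is a simplified stand-in for the adaptive trust-region control used in the actual CLG algorithm.

```python
import numpy as np

def cyclic_coordinate_descent(beta, grad_j, hess_j, n_passes=100,
                              delta_max=1.0, tol=1e-6):
    """Schematic one-Newton-step-per-coordinate cyclic descent.

    beta    : initial coefficient vector (numpy array, updated in place)
    grad_j  : grad_j(beta, j) -> first derivative of the negated log-posterior
              with respect to beta_j (model-specific, see Appendix A)
    hess_j  : hess_j(beta, j) -> corresponding second derivative
    """
    for _ in range(n_passes):
        max_change = 0.0
        for j in range(len(beta)):
            g = grad_j(beta, j)
            h = hess_j(beta, j)
            if h <= 0.0:                      # skip a degenerate coordinate
                continue
            step = -g / h                     # single Newton step, as in (24)
            step = float(np.clip(step, -delta_max, delta_max))  # bounded update
            beta[j] += step
            max_change = max(max_change, abs(step))
        if max_change < tol:                  # converged over a full pass
            break
    return beta
```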

5. Experiments

We tested our algorithm for large-scale parametric survival analysis on two different data sets. Below, we briefly describe the data sets and also motivate their choice for our work. In both cases, we compare the performance of the high-dimensional parametric model trained on all predictors with that of a low-dimensional model trained on a small subset of features.


5.1. Pediatric Trauma Data

Trauma is the leading cause of death and acquired disability in children and adolescents. Injuries result in more deaths in children than all other causes combined [32]. To provide optimal care, children at high risk for mortality need to be identified and triaged to centers with the resources to manage these patients. The overall goal of our analysis is to develop a model for predicting mortality after pediatric injury. For prediction within the domain of trauma, the literature has traditionally focused on low-dimensional analysis, i.e., modeling using a small set of features from the injury scene or emergency department [33, 34]. While approaches that use low-dimensional data to predict outcome may be easier to implement, the trade-off may be poorer predictive performance.

We obtained our data set from the National Trauma Data Bank (NTDB), a trauma database maintained by the American College of Surgeons. The data set includes 210,555 patient records of injured children < 15 years old collected over five years (2006–2010). We divided these data into a training data set (153,402 patients for years 2006–2009) and a testing data set (57,153 patients for year 2010). The mortality rate of the training set is 1.68%, while that of the test set is 1.44%. There are a total of 125,952 binary predictors indicating the presence or absence of a particular attribute (or interaction among attributes). The high-dimensional model was trained using all 125,952 predictors, while the low-dimensional model used only 41 predictors. The predictors used for both the high and low-dimensional models are summarized in Table 1.

Table 1.

Description of the predictors used for high and low-dimensional models for pediatric trauma data.

| Predictor Type | # Predictors | Description | High-dim | Low-dim |
|---|---|---|---|---|
| Main Effects | | | | |
| ICD-9 Codes | 1,890 | International Classification of Disease, Ninth Revision. | ✓ | |
| AIS Codes (predots) | 349 | Abbreviated Injury Scale codes that include the body region, the anatomic structure associated with the injury, and the level of injury. | ✓ | |
| Interactions/Combinations | | | | |
| ICD-9, ICD-9 | 102,284 | Co-occurrences of two ICD-9 injury codes. | ✓ | |
| AIS Code, AIS Code | 20,809 | Co-occurrences of two AIS codes. | ✓ | |
| Body Region, AIS Score | 41 | Combinations of any of the 9 body regions associated with an injury with the injury severity score (1–6) determined according to the AIS coding scheme. | ✓ | ✓ |
| [Body Region, AIS Score], [Body Region, AIS Score] | 579 | Co-occurrences of two [Body Region, AIS Score] combinations. | ✓ | |

5.2. Breast Cancer Gene Expression Data

For our second application, we analyzed the well-known breast cancer gene expression data set [35, 36]. This publicly available data set consists of cDNA expression profiles of 295 tumor samples from patients with breast cancer. These patients were diagnosed with breast cancer between 1984 and 1995 at the Netherlands Cancer Institute and were aged 52 years or younger at the time of diagnosis. Overall, 79 (26.78%) patients died during the follow-up period and the remaining 216 are censored. The total number of predictors (number of genes) is 24,885, each of which represents the log-ratio of the intensities of the two color dyes used for a specific gene. The high-dimensional model was trained using all 24,885 predictors. For the low-dimensional model, we used the glmpath (coxpath) package in R [37], which generates the entire regularization path and outputs predictors in order of their relative importance. We picked the top 5 predictors from this list and used our model to train a low-dimensional model for comparison. The data were then randomly split into training (67%) and testing (33%) sets such that the mortality rate in both parts was equal to the overall mortality rate.

5.3. Hyperparameter Selection

The Gaussian and Laplace priors both require a prior variance σj², j = 1, …, p, for the parameter values. The actual hyperparameters are τ = τj = σj² for the Gaussian prior and γ = γj = √2/σj for the Laplace prior. For both applications, the regularization parameter was selected using four-fold cross-validation on the training data. The variance σ² was varied between 10^−5 and 10^6 in multiples of 10. This amounts to varying the actual regularization parameter for the Gaussian prior between 10^−5 and 10^6 in multiples of 10, and that of the Laplacian between 0.0014 and 447.21 in multiples of √10. For each choice of the hyperparameter, we computed the sum of the log-likelihoods of the patients in the four validation sets and chose the hyperparameter value that maximized this sum.
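A minimal sketch of this grid search is shown below; `fit_model` and `validation_log_likelihood` are hypothetical stand-ins for the model-fitting routine and the held-out log-likelihood computation, and the grid follows the 10^−5 to 10^6 range described above.

```python
import numpy as np

def choose_prior_variance(X, y, delta, fit_model, validation_log_likelihood,
                          n_folds=4, seed=0):
    """Four-fold cross-validation over a grid of prior variances sigma^2,
    maximizing the summed held-out log-likelihood."""
    sigma2_grid = 10.0 ** np.arange(-5, 7)        # 1e-5, 1e-4, ..., 1e6
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)

    best_sigma2, best_score = None, -np.inf
    for sigma2 in sigma2_grid:
        score = 0.0
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
            model = fit_model(X[train], y[train], delta[train], sigma2)
            score += validation_log_likelihood(model, X[val], y[val], delta[val])
        if score > best_score:
            best_sigma2, best_score = sigma2, score
    return best_sigma2
```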

5.4. Performance Evaluation

There are several ways to compare the performance of fitted parametric models. These evaluation metrics can be divided into two categories: those that assess the discrimination quality of the model and those that assess its calibration. As for any other regression model, the log-likelihood on the test data is an obvious choice for measuring the performance of penalized parametric survival regression. In the past several years, survival-analysis-specific metrics have been proposed that take censoring into account. Among these metrics, the area under the ROC curve (AUC), also known as Harrell's c-statistic [38, 39], a measure of the discriminative accuracy of the model, has become one of the important criteria used to evaluate the performance of survival models [40]. Motivated by a similar test proposed by Hosmer and Lemeshow for the logistic regression model, we evaluated calibration using an overall goodness-of-fit test that has been proposed for survival data [41, 42]. Although the original test was proposed for Cox models, its use for parametric models has also been described [6, Chapter 8]. Below, we briefly describe these two metrics.

5.4.1. Harrell’s c-statistic

Harrell's c-statistic is an extension of the traditional area under the curve (AUC) statistic, but is better suited to time-to-event data since it is independent of any thresholding process. The method is based on comparing the estimated and ground-truth orderings of risk between pairs of comparable subjects. Two subjects are said to be comparable if at least one of the subjects in the pair has developed the event (e.g., death) and the follow-up duration for that subject is less than that of the other. Using the notation of Section 3, the comparability of an ordered pair of subjects (i, j) can be represented by the indicator variable ζi,j = I(yi < yj, δi = 1), such that ζi,j is one when the two subjects are comparable and zero otherwise. The total number of comparable pairs in a test data set containing n subjects can then be computed as

$$n_\zeta = \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \zeta_{i,j}. \qquad (8)$$

To measure the predicted discrimination, the concordance of the comparable pairs is then estimated. A pair of comparable subjects (as defined above) is said to be concordant if the estimated risk of the subject who developed the event earlier is greater than that of the other subject. The concordance of the ordered subject pair (i, j) can therefore be represented using the indicator variable ξi,j = I(ζi,j = 1, ri > rj), where ri = −β′xi and rj = −β′xj are the relative risk scores of the ith and jth subjects. Thus, ξi,j is one when the two subjects are concordant and zero otherwise. The total number of concordant pairs can then be written as

$$n_\xi = \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \xi_{i,j}. \qquad (9)$$

Finally, the c-statistic is given by

$$\text{c-statistic} = \frac{n_\xi}{n_\zeta}. \qquad (10)$$
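The following Python sketch is a direct O(n²) transcription of (8)–(10); it ignores ties in survival times and risk scores, as do the definitions above.

```python
import numpy as np

def harrell_c_statistic(y, delta, risk):
    """Harrell's c-statistic of (8)-(10).

    y     : observed times y_i
    delta : event indicators (1 = event observed, 0 = censored)
    risk  : relative risk scores r_i (larger = higher risk)
    """
    y = np.asarray(y)
    delta = np.asarray(delta)
    risk = np.asarray(risk)
    n = len(y)
    n_comparable = 0
    n_concordant = 0
    for i in range(n):
        if delta[i] != 1:
            continue                       # pair (i, j) requires delta_i = 1
        for j in range(n):
            if j == i or not (y[i] < y[j]):
                continue
            n_comparable += 1              # zeta_{i,j} = 1
            if risk[i] > risk[j]:
                n_concordant += 1          # xi_{i,j} = 1
    return n_concordant / n_comparable
```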

5.4.2. Hosmer-Lemeshow Statistic

To assess the overall goodness-of-fit using this test, the subjects are first sorted in order of their relative risk scores (ri = −β′xi, i = 1, …, n) and divided into G equal-sized groups. The observed number of events og in the gth group is obtained by summing the number of non-censored observations in that group, while the expected number of events eg is computed by summing the cumulative hazards of all the subjects in the same group. The χ² statistic for the overall goodness-of-fit is then given by

$$\chi^2 = \sum_{g=1}^{G} \frac{(o_g - e_g)^2}{e_g}. \qquad (11)$$
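A minimal Python sketch of this computation is given below; the per-subject cumulative hazards are assumed to have been computed from the fitted parametric model and are passed in as `cum_hazard`.

```python
import numpy as np

def hosmer_lemeshow_chi2(risk, delta, cum_hazard, G=10):
    """Overall goodness-of-fit statistic of (11).

    risk       : relative risk scores used to order the subjects
    delta      : event indicators (1 = event observed)
    cum_hazard : fitted cumulative hazard for each subject at its observed time
    """
    delta = np.asarray(delta, dtype=float)
    cum_hazard = np.asarray(cum_hazard, dtype=float)
    order = np.argsort(risk)                      # sort subjects by risk score
    groups = np.array_split(order, G)             # G (near-)equal-sized groups
    chi2 = 0.0
    for g in groups:
        observed = delta[g].sum()                 # o_g: events in the group
        expected = cum_hazard[g].sum()            # e_g: summed cumulative hazards
        chi2 += (observed - expected) ** 2 / expected
    return chi2
```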

6. Results

We now compare the performance of the low and high-dimensional parametric models on both data sets. For both applications, the hyperparameter σ² was selected by performing four-fold cross-validation on the training data.

6.1. Pediatric Trauma Data

Tables 2 and 3 summarize the results on the pediatric trauma data set for the low and high-dimensional models using all four parametric models with Gaussian and Laplacian penalization. Note that under the L2 prior, the estimate of any βj is never exactly zero, and thus all variables contribute to the final model. The number of selected predictors in Table 2 therefore refers to the number of significant predictors (|βj| > 10^−4). The Hosmer-Lemeshow χ² index was computed using G = 50. Both discriminative measures (log-likelihood, c-statistic) are always significantly better for the high-dimensional models than for the corresponding low-dimensional models. In many cases, the high-dimensional model is also better calibrated.

Table 2.

Comparison of low and high-dimensional models for pediatric trauma data with Gaussian penalization. The number of selected predictors refers to the number of significant predictors (|βj| > 10^−4).

| Model Type | Overall Predictors | Selected Predictors | Log-likelihood | c-statistic | χ² |
|---|---|---|---|---|---|
| Exponential, Low-dim | 41 | 41 | −4270.01 | 0.88 | 124.39 |
| Exponential, High-dim | 125,952 | 101,733 | 4372.41 | 0.94 | 543.28 |
| Weibull, Low-dim | 41 | 41 | −4242.38 | 0.88 | 131.60 |
| Weibull, High-dim | 125,952 | 101,794 | 4557.10 | 0.94 | 749.22 |
| Log-Logistic, Low-dim | 41 | 41 | −4120.66 | 0.89 | 95.37 |
| Log-Logistic, High-dim | 125,952 | 100,889 | 3765.45 | 0.94 | 95.02 |
| Log-Normal, Low-dim | 41 | 41 | −3234.00 | 0.89 | 76.95 |
| Log-Normal, High-dim | 125,952 | 88,244 | 3129.02 | 0.93 | 165.68 |

Table 3.

Comparison of low and high-dimensional models for pediatric trauma data with Laplacian penalization.

| Model Type | Overall Predictors | Selected Predictors | Log-likelihood | c-statistic | χ² |
|---|---|---|---|---|---|
| Exponential, Low-dim | 41 | 41 | −4271.23 | 0.88 | 122.58 |
| Exponential, High-dim | 125,952 | 153 | 4034.67 | 0.92 | 94.34 |
| Weibull, Low-dim | 41 | 41 | −4243.57 | 0.88 | 126.84 |
| Weibull, High-dim | 125,952 | 151 | 3997.28 | 0.92 | 107.99 |
| Log-Logistic, Low-dim | 41 | 41 | −4122.07 | 0.89 | 94.73 |
| Log-Logistic, High-dim | 125,952 | 432 | 3777.83 | 0.94 | 83.00 |
| Log-Normal, Low-dim | 41 | 41 | −3236.79 | 0.89 | 80.71 |
| Log-Normal, High-dim | 125,952 | 168 | 2974.49 | 0.93 | 89.36 |

To provide further insight, we grouped the subjects into low, medium, and high risk groups by sorting their relative risk scores in increasing order and using threshold values at the 33rd and 66th percentiles. For each group, we counted the number of events (number of non-censored observations). Tables 4 and 5 summarize the results for Gaussian and Laplacian penalization, respectively. The results show that in almost all cases, the subjects assigned to the high risk group by the high-dimensional models had more events than those assigned to the high risk group by the low-dimensional models. Although both kinds of models assign similar numbers of subjects to the low risk group, the mean observed survival time of the subjects having events is longer for the high-dimensional models than for the corresponding low-dimensional ones. These findings also establish that in most cases the high-dimensional models are better calibrated than their low-dimensional counterparts.
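The grouping used for Tables 4 and 5 can be sketched as follows; this is an illustrative transcription of the procedure described above, not the code used to produce the tables.

```python
import numpy as np

def risk_group_summary(risk, delta, y):
    """Split subjects into low/medium/high risk terciles of the risk score and
    report the number of events and the mean observed survival time of events."""
    risk = np.asarray(risk)
    delta = np.asarray(delta)
    y = np.asarray(y, dtype=float)
    lo, hi = np.percentile(risk, [33, 66])        # tercile thresholds
    summary = {}
    for name, mask in [("low", risk <= lo),
                       ("medium", (risk > lo) & (risk <= hi)),
                       ("high", risk > hi)]:
        events = mask & (delta == 1)
        mst = float(y[events].mean()) if events.any() else float("nan")
        summary[name] = {"subjects": int(mask.sum()),
                         "events": int(events.sum()),
                         "mean_survival_time": mst}
    return summary
```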

Table 4.

Comparison of number of events in low, medium and high risk groups for various low and high-dimensional models using Gaussian penalization for pediatric trauma data. MST stands for mean observed survival time in days.

| Model Type | Risk Group | # Subjects | Low-dim # Events | Low-dim MST | High-dim # Events | High-dim MST |
|---|---|---|---|---|---|---|
| Exponential | Low | 18,860 | 10 | 2.00 | 8 | 3.38 |
| | Medium | 18,860 | 41 | 4.10 | 18 | 4.33 |
| | High | 19,433 | 774 | 3.66 | 799 | 3.65 |
| Weibull | Low | 18,860 | 11 | 2.00 | 7 | 3.43 |
| | Medium | 18,860 | 34 | 3.44 | 18 | 3.67 |
| | High | 19,433 | 780 | 3.69 | 800 | 3.66 |
| Log-Logistic | Low | 18,860 | 10 | 2.00 | 11 | 3.73 |
| | Medium | 18,860 | 32 | 4.13 | 21 | 4.14 |
| | High | 19,433 | 783 | 3.66 | 793 | 3.65 |
| Log-Normal | Low | 18,860 | 10 | 2.00 | 16 | 6.56 |
| | Medium | 18,860 | 31 | 4.10 | 39 | 3.23 |
| | High | 19,433 | 784 | 3.66 | 770 | 3.62 |

Table 5.

Comparison of number of events in low, medium and high risk groups for various low and high-dimensional models using Laplacian penalization for pediatric trauma data. MST stands for mean observed survival time in days.

| Model Type | Risk Group | # Subjects | Low-dim # Events | Low-dim MST | High-dim # Events | High-dim MST |
|---|---|---|---|---|---|---|
| Exponential | Low | 18,860 | 10 | 2.00 | 10 | 3.00 |
| | Medium | 18,860 | 40 | 3.95 | 28 | 7.75 |
| | High | 19,433 | 775 | 3.67 | 787 | 3.52 |
| Weibull | Low | 18,860 | 10 | 2.00 | 10 | 3.00 |
| | Medium | 18,860 | 37 | 3.32 | 20 | 9.80 |
| | High | 19,433 | 778 | 3.70 | 795 | 3.51 |
| Log-Logistic | Low | 18,860 | 10 | 2.00 | 4 | 4.50 |
| | Medium | 18,860 | 32 | 4.13 | 23 | 3.65 |
| | High | 19,433 | 783 | 3.66 | 798 | 3.66 |
| Log-Normal | Low | 18,860 | 10 | 2.00 | 12 | 3.00 |
| | Medium | 18,860 | 32 | 4.10 | 25 | 2.96 |
| | High | 19,433 | 783 | 3.67 | 788 | 3.69 |

6.2. Breast Cancer Gene Expression Data

The results on the gene expression data set for low and high-dimensional models using all four parametric models with Gaussian and Laplacian penalization are summarized in Tables 6 and 7, respectively. The Hosmer-Lemeshow χ² index was computed using G = 10 because of the small number of observations in the test set. While the discriminative performance of the two approaches is similar in most cases, the high-dimensional models are almost always better calibrated.

Table 6.

Comparison of low and high-dimensional models for gene expression data with Gaussian penalization. The number of selected predictors refers to the number of significant predictors (|βj| > 10^−4).

| Model Type | Overall Predictors | Selected Predictors | Log-likelihood | c-statistic | χ² |
|---|---|---|---|---|---|
| Exponential, Low-dim | 5 | 5 | −86.26 | 0.71 | 16.83 |
| Exponential, High-dim | 24,496 | 24,344 | −98.72 | 0.75 | 112.69 |
| Weibull, Low-dim | 5 | 5 | −85.64 | 0.71 | 12.79 |
| Weibull, High-dim | 24,496 | 22,299 | −85.56 | 0.70 | 9.34 |
| Log-Logistic, Low-dim | 5 | 5 | −85.65 | 0.70 | 14.74 |
| Log-Logistic, High-dim | 24,496 | 22,090 | −86.14 | 0.70 | 5.37 |
| Log-Normal, Low-dim | 5 | 5 | −52.10 | 0.70 | 16.02 |
| Log-Normal, High-dim | 24,496 | 24,154 | −65.14 | 0.66 | 23.61 |

Table 7.

Comparison of low and high-dimensional models for gene expression data with Laplacian penalization.

| Model Type | Overall Predictors | Selected Predictors | Log-likelihood | c-statistic | χ² |
|---|---|---|---|---|---|
| Exponential, Low-dim | 5 | 5 | −86.27 | 0.71 | 17.14 |
| Exponential, High-dim | 24,496 | 13 | −87.51 | 0.67 | 2.74 |
| Weibull, Low-dim | 5 | 5 | −85.63 | 0.71 | 12.79 |
| Weibull, High-dim | 24,496 | 13 | −86.80 | 0.66 | 3.75 |
| Log-Logistic, Low-dim | 5 | 5 | −85.65 | 0.70 | 14.71 |
| Log-Logistic, High-dim | 24,496 | 9 | −86.20 | 0.68 | 3.66 |
| Log-Normal, Low-dim | 5 | 5 | −52.10 | 0.70 | 15.98 |
| Log-Normal, High-dim | 24,496 | 9 | −53.46 | 0.66 | 8.57 |

7. Computation Time

Table 8 summarizes the training time for fitting the different parametric models on the low and high-dimensional data sets. All experiments were performed on a system with an Intel 2.4 GHz processor and 8 GB of memory. Note that even though the time taken to fit the high-dimensional models to the pediatric trauma data set is much greater than that taken to fit the low-dimensional models, given the scale of the problem (153,402 patients with 125,952 predictors), the performance of the method may be acceptable in many applications.

Table 8.

Computation time (in seconds) taken for training various low and high-dimensional models using Gaussian and Laplacian penalization for pediatric trauma and gene expression data.

| Model Type | Pediatric Trauma, Gaussian (Low-dim / High-dim) | Pediatric Trauma, Laplacian (Low-dim / High-dim) | Gene Expression, Gaussian (Low-dim / High-dim) | Gene Expression, Laplacian (Low-dim / High-dim) |
|---|---|---|---|---|
| Exponential | 1 / 4,453 | 1 / 2,588 | 1 / 4 | 1 / 8 |
| Weibull | 3 / 3,406 | 2 / 2,600 | 1 / 12 | 1 / 30 |
| Log-Logistic | 2 / 2,794 | 2 / 3,278 | 1 / 13 | 1 / 38 |
| Log-Normal | 6 / 3,250 | 6 / 3,101 | 1 / 51 | 1 / 52 |

8. Conclusions

We have presented a method to perform regularized parametric survival analysis on data with 10^4 – 10^6 predictor variables and a large number of observations. Through our experiments in the context of two different applications, we have demonstrated the advantage of high-dimensional survival analysis over the corresponding low-dimensional models. We have provided a freely available software tool that implements our proposed algorithm. Future work will extend the tool to Cox proportional hazards models and to Aalen's additive hazards model, in addition to the accelerated failure time models considered here. We have also developed software for high-dimensional regularized generalized linear models that exploits inexpensive massively parallel devices known as graphics processing units (GPUs) [43]. This provides more than an order-of-magnitude speed-up and in principle could be further developed to include survival analysis. Fully Bayesian extensions of the current work could explore hierarchical frameworks to simultaneously model multiple time-to-event endpoints, model multi-level structure such as patients nested within hospitals, and incorporate prior information when available.

A. Appendix

Here we give the details of our algorithm for Exponential, Weibull, Log-Logistic, and Log-Normal distributions of the survival times.

A.1. Exponential

The exponential model is the simplest of all and is parametrized by a single parameter λ, with density and survival functions f(y|θ) = f(y|λ) = (1/λ) exp(−y/λ) and S(y|θ) = S(y|λ) = exp(−y/λ), respectively. The likelihood function can be written as

$$L(\lambda_1, \ldots, \lambda_n \mid D) = \prod_{i=1}^{n} f(y_i \mid \lambda_i)^{\delta_i}\, S(y_i \mid \lambda_i)^{(1-\delta_i)} = \exp\!\left(\sum_{i=1}^{n} \left(-\delta_i \log \lambda_i - \frac{y_i}{\lambda_i}\right)\right). \qquad (12)$$

A common choice for the mapping function φ(·) is φ(β′xi) = exp(β′xi). The likelihood function is then

$$L(\beta \mid D) = \exp\!\left(-\sum_{i=1}^{n} \delta_i \beta' x_i - \sum_{i=1}^{n} y_i \exp(-\beta' x_i)\right) \qquad (13)$$

and the corresponding log-likelihood is

$$l(\beta \mid D) = -\sum_{i=1}^{n} \delta_i \beta' x_i - \sum_{i=1}^{n} y_i \exp(-\beta' x_i). \qquad (14)$$

Adding the Gaussian prior with mean zero and variance τj, the log-posterior can be written as

$$l_G(\beta) = l(\beta \mid D) + \log\!\left(\pi(\beta \mid \tau_1, \ldots, \tau_p)\right) = l(\beta \mid D) - \sum_{j=1}^{p} \left(\log\sqrt{\tau_j} + \tfrac{1}{2}\log 2\pi + \frac{\beta_j^2}{2\tau_j}\right). \qquad (15)$$

Similarly, for the Laplacian prior, the log-posterior can be written as

$$l_L(\beta) = l(\beta \mid D) + \log\!\left(\pi(\beta \mid \gamma_1, \ldots, \gamma_p)\right) = l(\beta \mid D) - \sum_{j=1}^{p} \left(\log 2 - \log\gamma_j + \gamma_j |\beta_j|\right). \qquad (16)$$

Using the CLG algorithm, the one-dimensional problem involves finding βj(new), the value of the jth entry of β that minimizes −l(β), assuming that the other entries of β are held at their current values. Therefore, using (14) and (15) for the Gaussian prior (and ignoring the constants log√τj and ½ log 2π), finding βj(new) is equivalent to finding the z that minimizes

$$g_G(z) = z\sum_{i=1}^{n} x_{ij}\delta_i + \sum_{i=1}^{n} y_i \exp\!\left(-z x_{ij} - \sum_{\substack{k=1 \\ k \neq j}}^{p} \beta_k x_{ik}\right) + \frac{z^2}{2\tau_j}. \qquad (17)$$

The classic Newton method approximates the objective function g(·) by the first three terms of its Taylor series at the current βj

$$g(z) \approx g(\beta_j) + g'(\beta_j)(z - \beta_j) + \tfrac{1}{2}\, g''(\beta_j)(z - \beta_j)^2 \qquad (18)$$

where

$$g_G'(\beta_j) = \left.\frac{d g_G(z)}{dz}\right|_{z=\beta_j} = \sum_{i=1}^{n} x_{ij}\delta_i - \sum_{i=1}^{n} y_i x_{ij} \exp(-\beta' x_i) + \frac{\beta_j}{\tau_j} \qquad (19)$$
$$g_G''(\beta_j) = \left.\frac{d^2 g_G(z)}{dz^2}\right|_{z=\beta_j} = \sum_{i=1}^{n} y_i x_{ij}^2 \exp(-\beta' x_i) + \frac{1}{\tau_j}. \qquad (20)$$

Similarly, for the Laplacian prior,

$$g_L(z) = z\sum_{i=1}^{n} x_{ij}\delta_i + \sum_{i=1}^{n} y_i \exp\!\left(-z x_{ij} - \sum_{\substack{k=1 \\ k \neq j}}^{p} \beta_k x_{ik}\right) + \gamma_j |z| \qquad (21)$$
$$g_L'(\beta_j) = \left.\frac{d g_L(z)}{dz}\right|_{z=\beta_j} = \sum_{i=1}^{n} x_{ij}\delta_i - \sum_{i=1}^{n} y_i x_{ij} \exp(-\beta' x_i) + \gamma_j\,\mathrm{sign}(\beta_j), \qquad \beta_j \neq 0 \qquad (22)$$
$$g_L''(\beta_j) = \left.\frac{d^2 g_L(z)}{dz^2}\right|_{z=\beta_j} = \sum_{i=1}^{n} y_i x_{ij}^2 \exp(-\beta' x_i), \qquad \beta_j \neq 0. \qquad (23)$$

The value of βj(new) for both types of priors can then be computed as

$$\beta_j^{(\mathrm{new})} = \beta_j + \Delta\beta_j = \beta_j - \frac{g'(\beta_j)}{g''(\beta_j)}. \qquad (24)$$
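For illustration, a minimal numpy sketch of this coordinate update for the exponential model with a Gaussian prior is given below; it follows (19), (20) and (24) directly and omits the step-size safeguards of the full CLG algorithm.

```python
import numpy as np

def update_beta_j_exponential_gaussian(beta, j, X, y, delta, tau):
    """One Newton step for beta_j in the exponential model with a Gaussian
    prior of variance tau, following (19), (20) and (24). The linear predictor
    is recomputed from scratch here for clarity; an efficient implementation
    would cache and incrementally update it."""
    eta = X @ beta                               # beta'x_i for every subject
    w = y * np.exp(-eta)                         # y_i * exp(-beta'x_i)
    g = np.sum(X[:, j] * delta) - np.sum(X[:, j] * w) + beta[j] / tau   # (19)
    h = np.sum(X[:, j] ** 2 * w) + 1.0 / tau                            # (20)
    beta[j] -= g / h                                                    # (24)
    return beta
```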

A.2. Weibull

The Weibull model is a more general parametric model with density and survival functions given by f(y|θ) = f(y|λ, α) = (α y^(α−1)/λ) exp(−y^α/λ) and S(y|θ) = S(y|λ, α) = exp(−y^α/λ), respectively. The corresponding likelihood function can be written as

$$L(\lambda_1, \ldots, \lambda_n, \alpha \mid D) = \prod_{i=1}^{n} f(y_i \mid \lambda_i, \alpha)^{\delta_i}\, S(y_i \mid \lambda_i, \alpha)^{(1-\delta_i)} = \alpha^{d} \prod_{i=1}^{n} \left(\frac{y_i^{\alpha-1}}{\lambda_i}\exp\!\left(-\frac{y_i^{\alpha}}{\lambda_i}\right)\right)^{\delta_i} \left(\exp\!\left(-\frac{y_i^{\alpha}}{\lambda_i}\right)\right)^{(1-\delta_i)} \qquad (25)$$

where d = ∑_{i=1}^{n} δi. Similar to the previous case, using λi = exp(β′xi), the log-likelihood function can be written as

$$l(\alpha, \beta \mid D) = d\log\alpha + \sum_{i=1}^{n} \left(\delta_i(\alpha-1)\log y_i - \delta_i \beta' x_i - y_i^{\alpha}\exp(-\beta' x_i)\right). \qquad (26)$$

Similar to equations (15) and (16), the conditional log-posteriors corresponding to the Gaussian and Laplacian priors can be written as

$$l_G(\alpha, \beta) = l(\alpha, \beta \mid D) + \log\!\left(\pi(\beta \mid \tau_1, \ldots, \tau_p)\right) = l(\alpha, \beta \mid D) - \sum_{j=1}^{p} \left(\log\sqrt{\tau_j} + \tfrac{1}{2}\log 2\pi + \frac{\beta_j^2}{2\tau_j}\right) \qquad (27)$$
$$l_L(\alpha, \beta) = l(\alpha, \beta \mid D) + \log\!\left(\pi(\beta \mid \gamma_1, \ldots, \gamma_p)\right) = l(\alpha, \beta \mid D) - \sum_{j=1}^{p} \left(\log 2 - \log\gamma_j + \gamma_j |\beta_j|\right). \qquad (28)$$

Using the CLG algorithm, the one-dimensional problems involve finding α(new) and βj(new), the values that minimize −l(α, β), assuming that the other parameters are held at their current values. Therefore, for both the Gaussian and Laplacian priors, finding α(new) is equivalent to finding the z that minimizes

$$g(z) = -d\log z - \sum_{i=1}^{n} \left(\delta_i(z-1)\log y_i - y_i^{z}\exp(-\beta' x_i)\right). \qquad (29)$$

Writing the Taylor series expansion of g(z) around the current value of α,

$$g(z) \approx g(\alpha) + g'(\alpha)(z - \alpha) + \tfrac{1}{2}\, g''(\alpha)(z - \alpha)^2 \qquad (30)$$

where

$$g'(\alpha) = \left.\frac{d g(z)}{dz}\right|_{z=\alpha} = -\frac{d}{\alpha} - \sum_{i=1}^{n} \left(\delta_i \log y_i - y_i^{\alpha}\exp(-\beta' x_i)\log y_i\right) \qquad (31)$$
$$g''(\alpha) = \left.\frac{d^2 g(z)}{dz^2}\right|_{z=\alpha} = \frac{d}{\alpha^2} + \sum_{i=1}^{n} y_i^{\alpha}(\log y_i)^2 \exp(-\beta' x_i). \qquad (32)$$

The value of α(new) for both types of priors can then be computed as

$$\alpha^{(\mathrm{new})} = \alpha + \Delta\alpha = \alpha - \frac{g'(\alpha)}{g''(\alpha)}. \qquad (33)$$

The stepwise updates for β under both the Gaussian and Laplacian priors are similar to those for the exponential model and can be obtained by replacing yi with yi^α in equations (18)–(24).
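For illustration, a minimal numpy sketch of the corresponding Newton step for the shape parameter α is given below; it follows (31)–(33) directly and omits any safeguard keeping α positive, which a production implementation would need.

```python
import numpy as np

def update_alpha_weibull(alpha, beta, X, y, delta):
    """One Newton step for the Weibull shape parameter alpha, following
    (31)-(33). No positivity safeguard is applied to the updated alpha."""
    d = np.sum(delta)
    log_y = np.log(y)
    w = y ** alpha * np.exp(-(X @ beta))         # y_i^alpha * exp(-beta'x_i)
    g = -d / alpha - np.sum(delta * log_y - w * log_y)     # (31)
    h = d / alpha ** 2 + np.sum(w * log_y ** 2)            # (32)
    return alpha - g / h                                   # (33)
```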

A.3. Log-Logistic

For the Log-Logistic model, the density and survival functions are given by

$$f(y \mid \theta) = f(y \mid \lambda, \alpha) = \frac{\alpha y^{\alpha-1}}{\lambda\left(1 + \frac{y^{\alpha}}{\lambda}\right)^{2}} \qquad \text{and} \qquad S(y \mid \theta) = S(y \mid \lambda, \alpha) = \frac{1}{1 + \frac{y^{\alpha}}{\lambda}}. \qquad (34)$$

The corresponding likelihood function can be written as

$$L(\lambda_1, \ldots, \lambda_n, \alpha \mid D) = \prod_{i=1}^{n} f(y_i \mid \lambda_i, \alpha)^{\delta_i}\, S(y_i \mid \lambda_i, \alpha)^{(1-\delta_i)} = \alpha^{d} \prod_{i=1}^{n} \left(\frac{y_i^{\alpha-1}}{\lambda_i\left(1 + \frac{y_i^{\alpha}}{\lambda_i}\right)^{2}}\right)^{\delta_i} \left(\frac{1}{1 + \frac{y_i^{\alpha}}{\lambda_i}}\right)^{(1-\delta_i)}, \qquad (35)$$

where d = ∑_{i=1}^{n} δi. Re-parametrizing λi = exp(β′xi), the log-likelihood function can be written as

$$l(\alpha, \beta \mid D) = d\log\alpha + \sum_{i=1}^{n} \left(\delta_i(\alpha-1)\log y_i - \delta_i \beta' x_i - (1+\delta_i)\log\!\left(1 + y_i^{\alpha}\exp(-\beta' x_i)\right)\right). \qquad (36)$$

The log-posteriors corresponding to the Gaussian and Laplacian priors are similar to those for the Weibull model, (27) and (28). Using the CLG algorithm, for both types of priors, finding the update α(new) is equivalent to finding the z that minimizes

$$g(z) = -d\log z - \sum_{i=1}^{n} \left(\delta_i(z-1)\log y_i - (1+\delta_i)\log\!\left(1 + y_i^{z}\exp(-\beta' x_i)\right)\right). \qquad (37)$$

The value of α(new) for both types of priors can then be computed using (33), where

$$g'(\alpha) = \left.\frac{d g(z)}{dz}\right|_{z=\alpha} = -\frac{d}{\alpha} - \sum_{i=1}^{n} \left(\delta_i \log y_i - \frac{(1+\delta_i)\, y_i^{\alpha}\exp(-\beta' x_i)\log y_i}{1 + y_i^{\alpha}\exp(-\beta' x_i)}\right) \qquad (38)$$
$$g''(\alpha) = \left.\frac{d^2 g(z)}{dz^2}\right|_{z=\alpha} = \frac{d}{\alpha^2} + \sum_{i=1}^{n} \frac{(1+\delta_i)\, y_i^{\alpha}\exp(-\beta' x_i)(\log y_i)^2}{\left(1 + y_i^{\alpha}\exp(-\beta' x_i)\right)^2}. \qquad (39)$$

For the Gaussian prior, finding βj(new) is equivalent to finding the z that minimizes

$$g_G(z) = z\sum_{i=1}^{n} x_{ij}\delta_i + \sum_{i=1}^{n} (1+\delta_i)\log\!\left(1 + y_i^{\alpha}\exp\!\left(-z x_{ij} - \sum_{\substack{k=1 \\ k \neq j}}^{p} \beta_k x_{ik}\right)\right) + \frac{z^2}{2\tau_j}. \qquad (40)$$

Similar to the previous cases, the updated value βj(new) can be computed using (24), where

$$g_G'(\beta_j) = \left.\frac{d g_G(z)}{dz}\right|_{z=\beta_j} = \sum_{i=1}^{n} x_{ij}\delta_i - \sum_{i=1}^{n} \frac{(1+\delta_i)\, y_i^{\alpha} x_{ij}\exp(-\beta' x_i)}{1 + y_i^{\alpha}\exp(-\beta' x_i)} + \frac{\beta_j}{\tau_j} \qquad (41)$$
$$g_G''(\beta_j) = \left.\frac{d^2 g_G(z)}{dz^2}\right|_{z=\beta_j} = \sum_{i=1}^{n} \frac{(1+\delta_i)\, y_i^{\alpha} x_{ij}^2 \exp(-\beta' x_i)}{\left(1 + y_i^{\alpha}\exp(-\beta' x_i)\right)^2} + \frac{1}{\tau_j}. \qquad (42)$$

Similarly, for the Laplacian prior,

$$g_L(z) = z\sum_{i=1}^{n} x_{ij}\delta_i + \sum_{i=1}^{n} (1+\delta_i)\log\!\left(1 + y_i^{\alpha}\exp\!\left(-z x_{ij} - \sum_{\substack{k=1 \\ k \neq j}}^{p} \beta_k x_{ik}\right)\right) + \gamma_j |z| \qquad (43)$$
$$g_L'(\beta_j) = \left.\frac{d g_L(z)}{dz}\right|_{z=\beta_j} = \sum_{i=1}^{n} x_{ij}\delta_i - \sum_{i=1}^{n} \frac{(1+\delta_i)\, y_i^{\alpha} x_{ij}\exp(-\beta' x_i)}{1 + y_i^{\alpha}\exp(-\beta' x_i)} + \gamma_j\,\mathrm{sign}(\beta_j), \qquad \beta_j \neq 0 \qquad (44)$$
$$g_L''(\beta_j) = \left.\frac{d^2 g_L(z)}{dz^2}\right|_{z=\beta_j} = \sum_{i=1}^{n} \frac{(1+\delta_i)\, y_i^{\alpha} x_{ij}^2 \exp(-\beta' x_i)}{\left(1 + y_i^{\alpha}\exp(-\beta' x_i)\right)^2}, \qquad \beta_j \neq 0. \qquad (45)$$

A.4. Log-Normal

Assuming that the survival times yi, i = 1, …, n, follow a log-normal distribution is equivalent to assuming that their logarithms wi = log yi, i = 1, …, n, follow a normal N(μ, σ²) distribution with density and survival functions given by f(y|θ) = f(w|μ, σ) = (1/√(2πσ²)) exp(−(w−μ)²/(2σ²)) and S(y|θ) = S(w|μ, σ) = 1 − Φ((w−μ)/σ), respectively, where Φ(·) is the standard Gaussian cumulative distribution function. Following the earlier convention, we replace λ and α with μ and σ, respectively. The likelihood function of μ1, …, μn and σ can be written as

$$L(\mu_1, \ldots, \mu_n, \sigma \mid D) = \prod_{i=1}^{n} f(y_i \mid \mu_i, \sigma)^{\delta_i}\, S(y_i \mid \mu_i, \sigma)^{(1-\delta_i)} = (2\pi\sigma^2)^{-d/2} \prod_{i=1}^{n} \left(\exp\!\left(-\frac{(w_i - \mu_i)^2}{2\sigma^2}\right)\right)^{\delta_i} \left(1 - \Phi\!\left(\frac{w_i - \mu_i}{\sigma}\right)\right)^{(1-\delta_i)} \qquad (46)$$

where d = ∑_{i=1}^{n} δi. Re-parametrizing μi = β′xi, the log-likelihood function can be written as

$$l(\sigma, \beta \mid D) = -\frac{d}{2}\log 2\pi - d\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \delta_i (w_i - \beta' x_i)^2 + \sum_{i=1}^{n} (1-\delta_i)\log\!\left(1 - \Phi\!\left(\frac{w_i - \beta' x_i}{\sigma}\right)\right). \qquad (47)$$

The conditional log-posteriors corresponding to the Gaussian and Laplacian priors can be written as

$$l_G(\sigma, \beta) = l(\sigma, \beta \mid D) + \log\!\left(\pi(\beta \mid \tau_1, \ldots, \tau_p)\right) = l(\sigma, \beta \mid D) - \sum_{j=1}^{p} \left(\log\sqrt{\tau_j} + \tfrac{1}{2}\log 2\pi + \frac{\beta_j^2}{2\tau_j}\right) \qquad (48)$$
$$l_L(\sigma, \beta) = l(\sigma, \beta \mid D) + \log\!\left(\pi(\beta \mid \gamma_1, \ldots, \gamma_p)\right) = l(\sigma, \beta \mid D) - \sum_{j=1}^{p} \left(\log 2 - \log\gamma_j + \gamma_j |\beta_j|\right). \qquad (49)$$

Assuming that the other parameters are held at their current values, the one-dimensional problems involve finding σ(new) and βj(new) that minimize the negated log-posterior. Using the CLG algorithm, for both types of priors, finding σ(new) is equivalent to finding the z that minimizes

$$g(z) = d\log z + \frac{1}{2z^2}\sum_{i=1}^{n} \delta_i (w_i - \beta' x_i)^2 - \sum_{i=1}^{n} (1-\delta_i)\log\!\left(1 - \Phi\!\left(\frac{w_i - \beta' x_i}{z}\right)\right). \qquad (50)$$

The value of σ(new) can then be computed as

$$\sigma^{(\mathrm{new})} = \sigma + \Delta\sigma = \sigma - \frac{g'(\sigma)}{g''(\sigma)}. \qquad (51)$$

Here

$$g'(\sigma) = \left.\frac{d g(z)}{dz}\right|_{z=\sigma} = \frac{d}{\sigma} - \frac{1}{\sigma}\sum_{i=1}^{n} \delta_i p_i^2 - \frac{1}{\sigma}\sum_{i=1}^{n} (1-\delta_i)\, p_i q_i \qquad (52)$$
$$g''(\sigma) = \left.\frac{d^2 g(z)}{dz^2}\right|_{z=\sigma} = -\frac{d}{\sigma^2} + \frac{3}{\sigma^2}\sum_{i=1}^{n} \delta_i p_i^2 - \frac{1}{\sigma^2}\sum_{i=1}^{n} (1-\delta_i)\left(p_i^3 q_i - p_i^2 q_i^2 - 2 p_i q_i\right). \qquad (53)$$

where

$$p_i = \frac{w_i - \beta' x_i}{\sigma} \qquad \text{and} \qquad q_i = \frac{N\!\left((w_i - \beta' x_i)/\sigma\right)}{1 - \Phi\!\left((w_i - \beta' x_i)/\sigma\right)}. \qquad (54)$$

Here N(·) denotes the standard normal density function.
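Because 1 − Φ(p) underflows for large positive p, the ratio in (54) is best evaluated on the log scale. A small Python sketch, assuming SciPy is available, is given below.

```python
import numpy as np
from scipy.stats import norm

def mills_quantities(w, X, beta, sigma):
    """Compute p_i and q_i of (54). The ratio N(p)/(1 - Phi(p)) is evaluated
    as exp(logpdf - logsf) so that it remains finite even when the survival
    probability 1 - Phi(p) underflows for large positive p."""
    p = (w - X @ beta) / sigma                   # standardized residuals p_i
    q = np.exp(norm.logpdf(p) - norm.logsf(p))   # q_i = N(p_i) / (1 - Phi(p_i))
    return p, q
```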

For the Gaussian prior, finding βj(new) is equivalent to finding the z that minimizes

$$g_G(z) = \frac{1}{2\sigma^2}\sum_{i=1}^{n} \delta_i (r_i - x_{ij}z)^2 - \sum_{i=1}^{n} (1-\delta_i)\log\!\left(1 - \Phi\!\left(\frac{r_i - x_{ij}z}{\sigma}\right)\right) + \frac{z^2}{2\tau_j} \qquad (55)$$

where r_i = w_i − ∑_{k=1, k≠j}^{p} βk xik. The updated value βj(new) can be computed using (24), where

$$g_G'(\beta_j) = \left.\frac{d g_G(z)}{dz}\right|_{z=\beta_j} = -\frac{1}{\sigma}\sum_{i=1}^{n} x_{ij}\left(\delta_i p_i + (1-\delta_i) q_i\right) + \frac{\beta_j}{\tau_j} \qquad (56)$$
$$g_G''(\beta_j) = \left.\frac{d^2 g_G(z)}{dz^2}\right|_{z=\beta_j} = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_{ij}^2\left(\delta_i - q_i(1-\delta_i)(p_i - q_i)\right) + \frac{1}{\tau_j} \qquad (57)$$

and pi and qi are the same as in (54). Similarly, for the Laplacian prior,

$$g_L(z) = \frac{1}{2\sigma^2}\sum_{i=1}^{n} \delta_i (r_i - x_{ij}z)^2 - \sum_{i=1}^{n} (1-\delta_i)\log\!\left(1 - \Phi\!\left(\frac{r_i - x_{ij}z}{\sigma}\right)\right) + \gamma_j |z| \qquad (58)$$
$$g_L'(\beta_j) = \left.\frac{d g_L(z)}{dz}\right|_{z=\beta_j} = -\frac{1}{\sigma}\sum_{i=1}^{n} x_{ij}\left(\delta_i p_i + (1-\delta_i) q_i\right) + \gamma_j\,\mathrm{sign}(\beta_j), \qquad \beta_j \neq 0 \qquad (59)$$
$$g_L''(\beta_j) = \left.\frac{d^2 g_L(z)}{dz^2}\right|_{z=\beta_j} = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_{ij}^2\left(\delta_i - q_i(1-\delta_i)(p_i - q_i)\right), \qquad \beta_j \neq 0. \qquad (60)$$

Footnotes

This research was supported by an NIH-NIGMS grant awarded to Children's National Medical Center (R01GM087600-01).

References

  1. Oakes D. Biometrika centenary: Survival analysis. Biometrika. 2001;88(1):99–142.
  2. Kalbfleisch J, Prentice R. The Statistical Analysis of Failure Time Data. John Wiley and Sons; New York: 1980.
  3. Box-Steffensmeier JM, Jones BS. Event History Modeling: A Guide for Social Scientists. Cambridge University Press; Cambridge, UK: 2004.
  4. Collett D. Modelling Survival Data for Medical Research. 2nd ed. Chapman-Hall; London, UK: 2003.
  5. Heckman J, Singer B. Longitudinal Analysis of Labor Market Data. Cambridge University Press; Cambridge, UK: 1985.
  6. Hosmer DW, Lemeshow S, May S. Applied Survival Analysis: Regression Modeling of Time to Event Data (Wiley Series in Probability and Statistics). 2nd ed. Wiley-Interscience; 2008.
  7. Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16(4):385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
  8. Klein J, Moeschberger M. Survival Analysis: Techniques for Censored and Truncated Data. 2nd ed. John Wiley and Sons; New York: 2003.
  9. Lee ET, Wang J. Statistical Methods for Survival Data Analysis (Wiley Series in Probability and Statistics). 2nd ed. Wiley-Interscience; 2003.
  10. Ibrahim JG, Chen MH, Sinha D. Bayesian Survival Analysis. Springer-Verlag; 2001.
  11. Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications. 2001;109(3):475–494.
  12. Koh K, Kim SJ, Boyd S. An interior-point method for large-scale L1-regularized logistic regression. Journal of Machine Learning Research. 2007;8:1519–1555.
  13. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1–22.
  14. Genkin A, Lewis DD, Madigan D. Large-scale Bayesian logistic regression for text categorization. Technometrics. 2007;49(3):291–304.
  15. Lawless JF. Statistical Models and Methods for Lifetime Data (Wiley Series in Probability and Statistics). 2nd ed. Wiley-Interscience; 2003.
  16. Evers L, Messow CM. Sparse kernel methods for high-dimensional survival data. Bioinformatics. 2008;24(14):1632–1638. doi: 10.1093/bioinformatics/btn253.
  17. Shivaswamy PK, Chu W, Jansche M. A support vector approach to censored targets. IEEE International Conference on Data Mining; 2007. pp. 655–660.
  18. Van Belle V, Pelckmans K, Van Huffel S, Suykens JAK. Improved performance on high-dimensional survival data by application of survival-SVM. Bioinformatics. 2011;27(1):87–94. doi: 10.1093/bioinformatics/btq617.
  19. Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS. High-dimensional variable selection for survival data. Journal of the American Statistical Association. 2010;105(489):205–217.
  20. Ishwaran H, Kogalur UB, Chen X, Minn AJ. Random survival forests for high-dimensional data. Statistical Analysis and Data Mining. 2011;4(1):115–132.
  21. Lisboa PJG, Etchells TA, Jarman IH, Arsene CTC, Aung MSH, Eleuteri A, Taktak AFG, Ambrogi F, Boracchi P, Biganzoli E. Partial logistic artificial neural network for competing risks regularized with automatic relevance determination. IEEE Transactions on Neural Networks. 2009;20(9):1403–1416. doi: 10.1109/TNN.2009.2023654.
  22. Engler D, Li Y. Survival analysis with high-dimensional covariates: An application in microarray studies. Statistical Applications in Genetics and Molecular Biology. 2009;8(1):1–22. doi: 10.2202/1544-6115.1423.
  23. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21(13):3001–3008. doi: 10.1093/bioinformatics/bti422.
  24. Goeman JJ. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal. 2010;52(1):70–84. doi: 10.1002/bimj.200900028.
  25. Yang Y, Zou H. A cocktail algorithm for solving the elastic net penalized Cox's regression in high dimensions. Statistics and its Interface. 2012.
  26. Witten DM, Tibshirani R. Survival analysis with high-dimensional covariates. Statistical Methods in Medical Research. 2010;19(1):29–51. doi: 10.1177/0962280209105024.
  27. Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software. 2011;39(5):1–13. doi: 10.18637/jss.v039.i05.
  28. Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
  29. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B). 1996;58:267–288.
  30. Zhang T, Oles FJ. Text categorization based on regularized linear classification methods. Information Retrieval. 2000;4:5–31.
  31. Kivinen J, Warmuth MK. Relative loss bounds for multidimensional regression problems. Machine Learning. MIT Press; 2001. pp. 301–329.
  32. National Center for Injury Prevention and Control. CDC Injury Fact Book. Atlanta (GA): Centers for Disease Control and Prevention; 2006.
  33. Mackersie RC. History of trauma field triage development and the American College of Surgeons criteria. Prehospital Emergency Care. 2006;10(3):287–294. doi: 10.1080/10903120600721636.
  34. Resources for Optimal Care of the Injured Patient. Committee on Trauma, American College of Surgeons; Chicago, IL: 2006.
  35. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–536. doi: 10.1038/415530a.
  36. van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine. 2002;347(25):1999–2009. doi: 10.1056/NEJMoa021967.
  37. Park MY, Hastie T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2007;69(4):659–677.
  38. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. The Journal of the American Medical Association. 1982;247(18):2543–2546.
  39. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine. 1996;15(4):361–387. doi: 10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
  40. Chambless LE, Cummiskey CP, Cui G. Several methods to assess improvement in risk prediction models: Extension to survival analysis. Statistics in Medicine. 2011;30(1):22–38. doi: 10.1002/sim.4026.
  41. Grønnesby JK, Borgan Ø. A method for checking regression models in survival analysis based on the risk score. Lifetime Data Analysis. 1996;2(4):315–328. doi: 10.1007/BF00127305.
  42. May S, Hosmer DW. A simplified method of calculating an overall goodness-of-fit test for the Cox proportional hazards model. Lifetime Data Analysis. 1998;4(2):109–120. doi: 10.1023/a:1009612305785.
  43. Suchard M, Simpson S, Zorych I, Ryan P, Madigan D. Massive parallelization of serial inference algorithms for a complex generalized linear model. ACM Transactions on Modeling and Computer Simulation. 2013 (to appear).
