Abstract
In this paper, we develop a variable selection framework with the spike-and-slab prior distribution via the hazard function of the Cox model. Specifically, we consider the transformation of the score and information functions for the partial likelihood function, evaluated at the given data, from the parameter space into the space generated by the logarithm of the hazard ratio. Thereby, we reduce the nonlinear complexity of the estimation equation for the Cox model and allow the utilization of a wider variety of stable variable selection methods. Then, we use a stochastic variable search Gibbs sampling approach via the spike-and-slab prior distribution to obtain the sparsity structure of the covariates associated with the survival outcome. Additionally, we conduct numerical simulations to evaluate the finite-sample performance of our proposed method. Finally, we apply this novel framework to lung adenocarcinoma data to find important genes associated with decreased survival in subjects with the disease.
Keywords: Bayesian modeling, latent indicator, Markov chain Monte Carlo, score function, stochastic variable search, lung adenocarcinoma
2010 Mathematics Subject Classifications: 62J05, 62N02
1. Introduction
Some prominent cancers such as lung adenocarcinoma, a type of non-small cell lung cancer, have been widely studied in the 21st century, as many researchers have aimed to determine how overexpression or downregulation of various genes can influence the survival of a person with lung adenocarcinoma [2,4,6,29]. Finding genes associated with survival in lung adenocarcinoma is important because lung adenocarcinoma remains the most common lung cancer in the United States today, representing about 40% of all lung cancers [26]. Additionally, this subtype of lung cancer is the one most commonly diagnosed in people who have never smoked, stressing the importance of detecting significant, associated genes. A typical approach is to conduct univariate analyses for each gene, estimate the relationship between each gene and survival times, and choose a few genes with strong signals. As an illustration, [2] used this marginal analysis based on Kaplan-Meier survival curves for over 7000 genes and selected a few genes associated with survival in lung adenocarcinoma patients.
Although these existing approaches are useful for describing individual genes and their relations to lung adenocarcinoma, the genes identified may not sufficiently account for the underlying biological mechanism. Additionally, these forms of univariate analysis can be inefficient and can also cause complications due to multiple testing and repeated adjustments to the significance level. Thus, a model-based approach such as the proportional hazards model is needed for a more complete and potentially more efficient analysis, as the strength of model-based approaches lies in efficiently and simultaneously detecting significant genes while accounting for the joint effects of the covariates on the survival outcome.
The Cox proportional hazards model [5] specifies the association between survival time and a set of predictors via the hazard function. In recent years, important applications that examine this association have risen in prominence in many fields of biomedical research such as clinical trials and gene studies [13,30]. One of the primary advantages of the Cox proportional hazards model lies in its semi-parametric nature, which separates the covariate effects (the proportional risk component) from the unspecified baseline hazard function and thereby keeps the model interpretable. The proportional risk component allows for variable selection, as the hazard function is constructed from the predictors that most heavily influence the survival outcome of interest. In practice, researchers focusing on clinical trials may be interested in how clinical treatments or patient attributes such as age and sex affect survival for the disease being studied. Similarly, researchers focusing on genetic studies may be interested in performing variable selection by pinpointing which overexpressed or downregulated genes among thousands are associated with survival time for the disease being studied.
While various methods, extensions, and applications have been proposed to estimate the hazard function of the Cox model in settings where the number of predictors is small [7,10,15,19,20,31], these approaches experience difficulties in constructing the hazard function from high-dimensional predictors. This shortcoming becomes problematic in applications such as gene expression profiling, where the data are inherently high-dimensional. Some high-dimensional frequentist approaches in the Cox model framework that perform variable selection and can reduce dimensionality include the lasso [33], the smoothly clipped absolute deviation penalty [8], and the adaptive lasso [38]. These methods have been successfully utilized in some instances of gene expression profiling [35], but may suffer in performance as a result of diverging spectra, noise accumulation, computational burden, and inferential uncertainty [1,9,37].
Alternatively, high-dimensional Bayesian approaches to variable selection have received much attention, and a large number of methods, including the works of [12,21,28], have shown that the aforementioned frequentist approaches can potentially be outperformed by Bayesian methods in terms of variable selection. The primary characteristic of these approaches is their use of independent Laplace-type (i.e. double-exponential) prior distributions, which concentrate more mass near 0 and in the tails. This characteristic thereby yields a sparse structure for the estimated regression coefficients. Moreover, [32] introduced a double-exponential spike-and-slab prior distribution which has been successfully utilized to analyze genes associated with Dutch breast cancer data and myelodysplastic syndromes.
The spike-and-slab prior distribution, defined as a mixture of a normal distribution and a distribution degenerate at a certain point, has several advantages [11,17,23,25,36]. For instance, any prior distribution concentrating more mass near a point can be flexibly generated by adjusting that point. Additionally, it provides nonlinear shrinkage of the regression coefficients, which results in smoothed, regularized estimates as well as fully Bayesian inference after fitting via Markov chain Monte Carlo. Lastly, it makes it possible to develop a unified framework for variable selection that yields the sparsity structure of the coefficients in both low-dimensional and high-dimensional settings.
The aim of this paper is to develop a variable selection framework with the spike-and-slab prior distribution and to consider high-dimensional applications that specify the association between survival time and a set of predictors via the hazard function of the Cox model. Specifically, we consider the transformation of the score and information functions of the partial likelihood, evaluated at the given data, from the parameter space into the space generated by the logarithm of the hazard ratio. Thereby, we reduce the nonlinear complexity of the estimation equation for the Cox model and make it possible to utilize a wider variety of stable variable selection methods. Then, we consider using a stochastic variable search (SVS) Gibbs sampling approach via the spike-and-slab prior distribution in order to obtain the sparsity structure of the covariates associated with the survival outcome. By incorporating these two steps, a more established and potentially more stable form of sparse variable selection can be constructed without any loss of information. We show that this approach provides a model in which the resulting sparsity structure is easy to interpret, as our primary goal is to detect and differentiate significant covariates from white noise. We also conduct numerical simulations to evaluate the finite-sample performance of our method. Finally, we apply our proposed methodology to detect the sparsity structure in the lung adenocarcinoma data [2]. Unlike previous analyses, which focus primarily on univariate analytical methods to describe the associations between individual genes and lung adenocarcinoma, we obtain a more efficient form of variable selection that simultaneously selects dozens of genes associated with survival times.
This paper is organized as follows. In Section 2, we introduce the specifics behind our estimation procedure. In Section 3, we conduct simulation studies to evaluate the finite-sample performance of our method. In Section 4, we apply our method to the lung adenocarcinoma data collected by [2]. We provide concluding remarks in Section 5.
2. Methodology
2.1. Existing estimators of the Cox model
Let the random variables T and C be the survival time and censoring time, respectively. The observed time is given by $X = \min(T, C)$. Let $\Delta = I(T \le C)$ be the censoring indicator, which takes the value 1 when the event of interest is observed and the value 0 when the observation is censored. In most clinical and gene expression profiling applications, the event of interest is generally death, while censoring occurs when an individual is no longer at risk of dying from the disease. Let the random vector $Z = (Z_1, \ldots, Z_p)^{\top}$ be a set of covariates. Let $T_i$ be the true survival time of the ith individual and assume that it takes its value in $[0, \tau]$ for some positive constant τ. We consider a sample of n subjects given by $\{(X_i, \Delta_i, Z_i)\}_{i=1}^{n}$, where $Z_i$ denotes the vector of covariates of the ith individual and $\Delta_i$ denotes the censoring indicator of the ith individual.
The Cox proportional hazards model, which is semi-parametric in nature, leaves the baseline hazard unspecified and models the relationship between the survival time and a vector of covariates $Z$ through the following hazard function:
$$\lambda(t \mid Z) = \lambda_0(t) \exp(\beta^{\top} Z), \qquad (1)$$
where $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ is a vector of parameters to be estimated and $\lambda_0(t)$ is the baseline hazard function; the baseline hazard in (1) represents the underlying hazard for any individual with $Z = 0$.
Let $N_i(t) = I(X_i \le t, \Delta_i = 1)$ be the counting process and $Y_i(t) = I(X_i \ge t)$ be the at-risk process for the ith individual for any $t \in [0, \tau]$. Without loss of generality, suppose that there are no ties in the observed event times. The maximum partial likelihood estimator of β can be found by maximizing the partial log-likelihood function [10,19,20] given by
$$\ell(\beta) = \sum_{i=1}^{n} \int_{0}^{\tau} \left[ \beta^{\top} Z_i - \log\left\{ \sum_{j=1}^{n} Y_j(t) \exp(\beta^{\top} Z_j) \right\} \right] dN_i(t). \qquad (2)$$
It should be noted, however, that a primary limitation of this framework is that the partial likelihood estimator is not well defined with high-dimensional data. In these scenarios, other estimators of β can be obtained through regularization methods such as lasso and ridge regression. Generally, the regularized estimator, which adds a penalty term to the partial log-likelihood function, can be obtained by maximizing the following regularized partial log-likelihood:
$$\ell_{\lambda}(\beta) = \ell(\beta) - \lambda\, p(\beta), \qquad (3)$$
where $p(\beta)$ denotes the penalty function of β and $\lambda > 0$ is a regularization parameter.
When imposing the $\ell_2$ penalty, i.e. $p(\beta) = \|\beta\|_2^2$, on the partial log-likelihood in (2), maximizing the objective function in (3) yields the ridge estimator, for which the penalty term encourages smoothness and avoids problems with overfitting. Similarly, when imposing the $\ell_1$ penalty, $p(\beta) = \|\beta\|_1$, on (2), maximizing (3) yields the lasso estimator, which directly detects the sparsity structure within the Cox model. Much of the variable selection literature assumes the sparsity condition that only a few covariates are truly related to the outcome, whereas all other covariates serve as noise with no real effect on the outcome. Following this assumption, we develop a Cox proportional hazards framework that identifies the sparsity structure of the coefficients, based on the Bayesian approach given by the spike-and-slab prior distribution proposed by [32].
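To make the penalized criterion in (3) concrete, the following is a minimal R sketch of ridge and lasso estimation for the Cox model using the glmnet package. The glmnet package is not part of the proposed method, and the simulated data, sample sizes, and tuning choices below are illustrative assumptions only.

```r
# Illustrative only: ridge and lasso versions of (3) fitted with glmnet.
library(glmnet)

set.seed(1)
n <- 200; p <- 20
Z <- matrix(rnorm(n * p), n, p)                         # covariates (assumed design)
beta_true <- c(1, -1, 0.5, rep(0, p - 3))               # sparse coefficients (assumed)
T_true <- rexp(n, rate = exp(drop(Z %*% beta_true)))    # exponential proportional hazards
C <- rexp(n, rate = 0.2)                                # independent censoring
y <- cbind(time = pmin(T_true, C), status = as.numeric(T_true <= C))

cv_ridge <- cv.glmnet(Z, y, family = "cox", alpha = 0)  # L2 penalty (ridge)
cv_lasso <- cv.glmnet(Z, y, family = "cox", alpha = 1)  # L1 penalty (lasso)
coef(cv_lasso, s = "lambda.1se")                        # sparse coefficient estimates
```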
2.2. Estimation
In this subsection, we develop our proposed framework by transforming the score, originally defined on the p-dimensional parameter space, into a function defined on the n-dimensional space of the linear predictors, and by transforming the negative information in the same manner. Define $\eta_i = \beta^{\top} Z_i$ as the linear predictor of the ith observation for $i = 1, \ldots, n$. We consider the score function for (2) with respect to the linear predictors $\eta = (\eta_1, \ldots, \eta_n)^{\top}$, whose jth component is specifically given by
$$u_j(\eta) = \frac{\partial \ell(\beta)}{\partial \eta_j} = \int_{0}^{\tau} dN_j(t) - \int_{0}^{\tau} \frac{Y_j(t)\exp(\eta_j)}{\sum_{k=1}^{n} Y_k(t)\exp(\eta_k)}\, d\bar{N}(t), \qquad (4)$$

where $\bar{N}(t) = \sum_{i=1}^{n} N_i(t)$.
The original p-dimensional score function for (2) can be written as
$$U(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{n} \int_{0}^{\tau} \left\{ Z_i - \frac{\sum_{j=1}^{n} Y_j(t)\, Z_j \exp(\beta^{\top} Z_j)}{\sum_{j=1}^{n} Y_j(t) \exp(\beta^{\top} Z_j)} \right\} dN_i(t). \qquad (5)$$
Then, the relation between (4) and (5) can be represented as

$$U(\beta) = \sum_{j=1}^{n} \frac{\partial \eta_j}{\partial \beta}\, u_j(\eta),$$

where $\partial \eta_j / \partial \beta$ is a vector with its kth entry being $Z_{jk}$ for $k = 1, \ldots, p$.
Define

$$u(\eta) = \bigl(u_1(\eta), \ldots, u_n(\eta)\bigr)^{\top} \quad \text{and} \quad Z = (Z_{jk})_{j = 1, \ldots, n;\, k = 1, \ldots, p} \qquad (6)$$

as an $n \times 1$ vector and an $n \times p$ matrix with its $(j,k)$th entry being $Z_{jk}$, respectively.
The information function of the linear predictors can similarly be constructed and derived from the partial log-likelihood function in (2). Thus we have
$$A_{jj}(\eta) = -\frac{\partial^2 \ell(\beta)}{\partial \eta_j^2} = \int_{0}^{\tau} \left[ \frac{Y_j(t)\exp(\eta_j)}{\sum_{l=1}^{n} Y_l(t)\exp(\eta_l)} - \left\{ \frac{Y_j(t)\exp(\eta_j)}{\sum_{l=1}^{n} Y_l(t)\exp(\eta_l)} \right\}^2 \right] d\bar{N}(t) \qquad (7)$$
and
$$A_{jk}(\eta) = -\frac{\partial^2 \ell(\beta)}{\partial \eta_j \partial \eta_k} = -\int_{0}^{\tau} \frac{Y_j(t)\exp(\eta_j)\, Y_k(t)\exp(\eta_k)}{\bigl\{\sum_{l=1}^{n} Y_l(t)\exp(\eta_l)\bigr\}^2}\, d\bar{N}(t) \qquad (8)$$
for the diagonal entries ($j = k$) and the off-diagonal entries ($j \neq k$) of the $n \times n$ matrix $A(\eta)$, respectively. The derivative of the transformed score function with respect to the linear predictors, expressed through the matrix $A(\eta)$, can be approximated by a diagonal matrix, denoted by $W(\eta)$, as
(9)
Notice that the Fisher scoring method [24] allows the first approximation and the argument introduced in [14] allows the second approximation.
Now we utilize these transformed components, including (6) and (9), in the framework of the partial log-likelihood given by (2). Considering the quadratic approximation of the partial log-likelihood function around β, we derive an algebraic form for the maximum likelihood estimator based on the transformed functions. For fixed τ, the usual Taylor expansion about β yields
$$\ell(\beta) \approx \ell(\tilde{\beta}) + (\beta - \tilde{\beta})^{\top} U(\tilde{\beta}) - \frac{1}{2}\, (\beta - \tilde{\beta})^{\top} Z^{\top} W(\tilde{\eta})\, Z\, (\beta - \tilde{\beta}), \qquad (10)$$
where $U(\tilde{\beta}) = Z^{\top} u(\tilde{\eta})$, $\tilde{\eta} = Z\tilde{\beta}$, and $\tilde{\beta}$ denotes some value close to β.
Then the right-hand side in (10) is equal to
$$-\frac{1}{2}\, \bigl\{ \tilde{\eta} + W(\tilde{\eta})^{-1} u(\tilde{\eta}) - Z\beta \bigr\}^{\top} W(\tilde{\eta}) \bigl\{ \tilde{\eta} + W(\tilde{\eta})^{-1} u(\tilde{\eta}) - Z\beta \bigr\} \qquad (11)$$
$$+\; \ell(\tilde{\beta}) + \frac{1}{2}\, u(\tilde{\eta})^{\top} W(\tilde{\eta})^{-1} u(\tilde{\eta}), \qquad (12)$$
where $W(\tilde{\eta})^{-1}$ is the inverse of $W(\tilde{\eta})$ such that $W(\tilde{\eta})\, W(\tilde{\eta})^{-1} = I_n$, the $n \times n$ identity matrix.
In practice, we recommend utilizing the ridge estimator for $\tilde{\beta}$ in high-dimensional settings, while the partial likelihood estimator is used in low-dimensional settings. Note that the second term given in (12) does not depend on β within the approximated partial log-likelihood function given in (10). Thus, Equation (11) can be rewritten in the following algebraic form:
$$-\frac{1}{2}\, (y - X\beta)^{\top} W\, (y - X\beta), \qquad (13)$$
where $y = \tilde{\eta} + W(\tilde{\eta})^{-1} u(\tilde{\eta})$, $X = Z$, and $W = W(\tilde{\eta})$ are an $n \times 1$ response vector, an $n \times p$ design matrix, and an $n \times n$ weight matrix, respectively. For notational convenience, we denote $\tilde{y} = W^{1/2} y$ and $\tilde{X} = W^{1/2} X$, where $W^{1/2}$ is a diagonal matrix satisfying $W^{1/2} W^{1/2} = W$, and the transformed pair $(\tilde{y}, \tilde{X})$ is used in the sequel.
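As an illustration of the transformation leading to (13), the following R sketch computes the score with respect to the linear predictors, a diagonal approximation of the corresponding information, and the resulting working response and weighted design matrix from an initial estimator. The function name `working_transform`, the Breslow-type risk-set sums, and the numerical safeguard on the weights are our own assumptions for illustration, not code from the paper.

```r
# A minimal sketch (assumed implementation) of the transformation in (13):
# given an initial estimator beta_init, build the working response and the
# weighted design matrix using a diagonal approximation of the information.
working_transform <- function(time, status, Z, beta_init) {
  eta <- drop(Z %*% beta_init)                 # linear predictors
  ex  <- exp(eta)
  event_times <- time[status == 1]
  # risk-set sums S0(t) = sum_{k: X_k >= t} exp(eta_k), evaluated at event times
  S0 <- sapply(event_times, function(t) sum(ex[time >= t]))
  # cumulative sums over events occurring at or before each subject's time
  cum1 <- sapply(time, function(t) sum(1 / S0[event_times <= t]))
  cum2 <- sapply(time, function(t) sum(1 / S0[event_times <= t]^2))
  u <- status - ex * cum1                      # score with respect to eta
  w <- pmax(ex * cum1 - ex^2 * cum2, 1e-8)     # diagonal information entries
  y_work <- eta + u / w                        # working response
  list(y_tilde = sqrt(w) * y_work,             # W^{1/2} y
       X_tilde = sqrt(w) * Z,                  # W^{1/2} X
       weights = w)
}
```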
After the transformation leading to (13) is carried out via the proposed approximation, any existing sparse variable selection method developed for the linear regression framework can be utilized. While there are a wide variety of sparse methods to choose from, we consider using the spike-and-slab prior distribution that employs stochastic variable search Gibbs sampling [17]. Specifically, we consider the following model with the prior distributions:
$$
\begin{aligned}
(\tilde{y} \mid \tilde{X}, \beta, \sigma^2) &\sim \mathrm{N}_n(\tilde{X}\beta,\; \sigma^2 I_n),\\
(\beta_k \mid I_k, \tau_k^2) &\sim \mathrm{N}(0,\; I_k \tau_k^2), \qquad k = 1, \ldots, p,\\
(I_k \mid v_0, w) &\sim (1 - w)\, \delta_{v_0}(\cdot) + w\, \delta_1(\cdot),\\
(\tau_k^{-2} \mid a_1, a_2) &\sim \mathrm{Gamma}(a_1, a_2),\\
w &\sim \mathrm{Uniform}(0, 1),\\
(\sigma^{-2} \mid b_1, b_2) &\sim \mathrm{Gamma}(b_1, b_2),
\end{aligned}
\qquad (14)
$$
where $\delta_1(\cdot)$ is used to denote a degenerate point mass distribution concentrated at the value 1, and $v_0$ denotes a small value near zero. Additionally, $I_k$ denotes the latent indicator variable, which takes $I_k = 1$ if the kth covariate is classified in the nonzero group, or $I_k = v_0$ if the kth covariate is classified in the zero group, for each k. Our model allows $\gamma_k = I_k \tau_k^2$ as the conditional variance of the prior distribution for the kth parameter, i.e. $(\beta_k \mid \gamma_k) \sim \mathrm{N}(0, \gamma_k)$, where $v_0$ is a small positive value. While $I_k$ is treated as an independent Bernoulli-type random variable (taking values $v_0$ and 1) with parameter w, where w is a complexity parameter controlling the inclusion probability with 0<w<1 as mentioned in [17], we use the uniform prior as an indifference distribution. Thereby, the variance $\gamma_k$ follows a continuous bimodal distribution with a spike at $v_0 \tau_k^2$ and a right-continuous tail, where we denote $\gamma = (\gamma_1, \ldots, \gamma_p)^{\top}$ for brevity. This is important as the spike allows the posterior to shrink insignificant parameters towards 0 while the right-continuous tail can identify nonzero parameters. Additional details on the derivation and reasoning behind this structure can be found in [17].
We fit the Bayesian model in (14) using the SVS Gibbs sampler. The SVS Gibbs sampler works with the transformed design matrix $\tilde{X}$ and transformed response vector $\tilde{y}$ to obtain the posterior sample of $(\beta, \gamma, \sigma^2, w)$, where $\gamma_k = I_k \tau_k^2$ for $k = 1, \ldots, p$. We follow the computational algorithm in [17], which draws from the full conditional distributions as follows:
Step 1: Simulate β from its conditional distribution, which is multivariate normal with the mean vector and covariance matrix given in [17].
Step 2: Simulate the latent indicators $I_1, \ldots, I_p$ from their two-point conditional distributions.
Step 3: Simulate $\tau_1^{-2}, \ldots, \tau_p^{-2}$ from their conditional gamma distributions.
Step 4: Simulate w from its conditional beta distribution.
Step 5: Simulate $\sigma^{-2}$ from its conditional gamma distribution.
Step 6: Set $\gamma_k = I_k \tau_k^2$ for $k = 1, \ldots, p$.
Once we obtain the transformed design matrix and response vector, we can approach the problem as a Bayesian linear regression problem and use the spike-and-slab prior to identify the sparse structure of the covariates. There are several possible approaches using Markov chain Monte Carlo (MCMC) based on the complete conditional distributions listed above. One possible approach is to obtain the MCMC samples via the spikeslab package in R, which has been developed in recent years [16]. We implement this framework to produce the posterior samples. Model results can be somewhat sensitive to hyperparameter values, so we fit our model under prior distributions including two gamma distributions with fixed hyperparameters and a point mass $v_0$ set to a small positive value.
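A minimal sketch of invoking the SVS Gibbs sampler through the spikeslab package [16] is given below. The response and design matrix here form a toy linear-regression example standing in for the transformed pair $(\tilde{y}, \tilde{X})$ from (13), and the iteration counts are illustrative rather than the settings used in the paper.

```r
# Toy illustration of the spikeslab package [16]; in the proposed framework,
# x and y would be the transformed design matrix and response from (13).
library(spikeslab)

set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:4] %*% c(2, -2, 1.5, -1.5)) + rnorm(n)   # four true signals (assumed)

fit <- spikeslab(x = x, y = y, n.iter1 = 500, n.iter2 = 1000)  # burn-in, then sampling
print(fit)   # prints a summary of the estimated coefficients
```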
3. Simulation
In this section, we conducted a simulation study to examine the performance of our proposed method with simulated time-to-event data. We considered data sets generated from both low-dimensional and high-dimensional simulation settings. For the low-dimensional settings, we conducted scenarios with sample sizes of n = 1000 and n = 3000 with three different censoring rates of 0.2, 0.3, and 0.4. The number of covariates was set to be p = 100. For the high-dimensional settings, the same sample sizes n with the same corresponding censoring rates were considered, but the number of covariates was set to be p = 4000. In all simulation settings, four significant (nonzero) parameters were used, with all remaining coefficients set to zero. Hence, the true set of indices associated with the nonzero covariates is denoted by $\mathcal{A}_0$.
All of the covariates were independently generated for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. We independently generated random variates and then generated the true survival time $T_i$ based on the proportional hazards model given by
$$\lambda(t \mid Z_i) = \lambda_0(t) \exp(\beta^{\top} Z_i), \qquad (15)$$
where $\lambda_0(t)$ denotes the baseline hazard function, set to be 1 in this simulation. Note that the data generating process in (15) is then equivalent to an exponential model. After drawing the censoring time $C_i$, the observed time $X_i = \min(T_i, C_i)$, and the censoring indicator $\Delta_i = I(T_i \le C_i)$ for $i = 1, \ldots, n$, we generated a simulated data set of n subjects given by $\{(X_i, \Delta_i, Z_i)\}_{i=1}^{n}$ for each censoring rate.
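A sketch of one simulated data set under (15) with baseline hazard equal to 1 follows. The particular nonzero coefficient values, covariate distribution, and uniform censoring mechanism below are illustrative assumptions, since only the number of nonzero coefficients and the target censoring rates are fixed by the design described above.

```r
# Assumed illustration of the data-generating process in (15) with lambda_0 = 1.
set.seed(2023)
n <- 1000; p <- 100
beta_true <- c(rep(1, 4), rep(0, p - 4))       # four nonzero coefficients (values assumed)
Z <- matrix(rnorm(n * p), n, p)                # covariates (distribution assumed)
T_true <- rexp(n, rate = exp(drop(Z %*% beta_true)))   # exponential survival times
C <- runif(n, 0, quantile(T_true, 0.75))       # censoring; tune the upper bound for the rate
time   <- pmin(T_true, C)
status <- as.numeric(T_true <= C)
mean(status == 0)                              # empirical censoring rate
```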
As a form of comparison, we considered existing sparse variable selection methods including the Bayesian Cox model introduced by [32], denoted by (B), the Cox lasso method [33], denoted by (C), and the sparse estimation method using the approximated information criterion introduced by [31], denoted by (D); our proposed method is denoted by (A). Methods (B) and (D) serve as two competing sparse model-based approaches, whereas we consider the lasso (C) as a competing sparse frequentist approach. We define $\widehat{\mathcal{A}}$ as an estimated set of indices associated with the nonzero covariates. In order to compare with the competing methods, we examined three types of performance measures: the probability of obtaining the correct model, defined as $P(\widehat{\mathcal{A}} = \mathcal{A}_0)$, the probability of obtaining an overfitted model, defined as $P(\widehat{\mathcal{A}} \supsetneq \mathcal{A}_0)$, and the expected number of incorrect nonzero covariates, defined as $E(|\widehat{\mathcal{A}} \cap \mathcal{A}_0^{c}|)$, where $\mathcal{A}_0^{c}$ is the complement of $\mathcal{A}_0$. For each scenario, 100 simulated data sets were generated, and the three performance measures were empirically computed by averaging over the estimated sets $\widehat{\mathcal{A}}^{(m)}$, where $\widehat{\mathcal{A}}^{(m)}$ denotes the estimated set of nonzero indices for the mth simulated data set, $m = 1, \ldots, M$, and M denotes the total number of simulated data sets.
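The three selection measures can be computed from an estimated index set as in the short sketch below; the definition of an overfitted model as a strict superset of the true set is our assumption based on the description above, and the index set `1:4` in the usage comment is only an example.

```r
# Sketch of the three performance measures for an estimated index set A_hat,
# with A0 denoting the true set of nonzero indices (overfit definition assumed).
selection_measures <- function(A_hat, A0) {
  c(correct   = as.numeric(setequal(A_hat, A0)),
    overfit   = as.numeric(all(A0 %in% A_hat) && length(A_hat) > length(A0)),
    incorrect = length(setdiff(A_hat, A0)))     # falsely selected covariates
}

# Example usage: average over M estimated sets stored in a list `A_hat_list`
# rowMeans(sapply(A_hat_list, selection_measures, A0 = 1:4))
```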
Table 1 reports the performance measures for the low-dimensional scenarios. In order to transform the score and information functions, we chose the initial estimator $\tilde{\beta}$ to be the Cox partial likelihood estimator. The same estimator is utilized as the initial estimator for the Cox method based on the approximated information criterion. For the lasso, we chose to work with the smallest λ value within one standard error of the minimum λ selected based on 10-fold cross-validation. After transforming the score and information functions, 1000 iterations of the spike-and-slab Gibbs sampler were run with a total of 500 iterations of burn-in. For the existing Bayesian Cox model (B), we conducted 50 iterations of the EM algorithm to obtain the estimator. With n = 1000 and p = 100, we see that the proposed method (A) achieves the best performance of all four methods. Additionally, the proposed method selects by far the fewest incorrect covariates. Notably, the lasso (C) performs second best, but identifies the correct model far less often than (A). For n = 3000 and p = 100, the proposed method (A) is still the best-performing Bayesian method by a wide margin. We see that the lasso estimator (C) provides results similar to (A) now that n has been increased. Meanwhile, the existing Bayesian Cox method (B) and the approach based on the approximated information criterion (D) experience decreased performance when the sample size is increased in the low-dimensional settings.
Table 1.
Simulation results for the low-dimensional data (p = 100): We used the proposed method, denoted by (A), Bayesian Cox method, denoted by (B), Cox lasso method, denoted by (C), and Cox method with the approximated information criterion, denoted by (D).
| n | p | Censor | Incorrect (A) | Correct (A) | Overfit (A) | Incorrect (B) | Correct (B) | Overfit (B) | Incorrect (C) | Correct (C) | Overfit (C) | Incorrect (D) | Correct (D) | Overfit (D) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 100 | 0.2 | 0.13 | 0.88 | 0.12 | 34.80 | 0.00 | 1.00 | 0.72 | 0.57 | 0.43 | 3.59 | 0.08 | 0.92 |
| | | | (0.13) | (0.11) | (0.11) | (21.64) | (0.00) | (0.00) | (1.29) | (0.25) | (0.25) | (5.09) | (0.07) | (0.07) |
| | | 0.3 | 0.06 | 0.94 | 0.06 | 32.07 | 0.00 | 1.00 | 0.73 | 0.62 | 0.38 | 2.86 | 0.12 | 0.88 |
| | | | (0.06) | (0.06) | (0.06) | (16.83) | (0.00) | (0.00) | (1.86) | (0.24) | (0.24) | (4.22) | (0.11) | (0.11) |
| | | 0.4 | 0.11 | 0.89 | 0.11 | 28.11 | 0.00 | 1.00 | 0.52 | 0.69 | 0.31 | 2.86 | 0.09 | 0.91 |
| | | | (0.10) | (0.10) | (0.10) | (21.07) | (0.00) | (0.00) | (1.02) | (0.22) | (0.22) | (4.88) | (0.08) | (0.08) |
| 3000 | 100 | 0.2 | 0.05 | 0.95 | 0.05 | 58.11 | 0.00 | 1.00 | 0.12 | 0.92 | 0.08 | 2.81 | 0.05 | 0.95 |
| | | | (0.05) | (0.05) | (0.05) | (19.59) | (0.00) | (0.00) | (0.21) | (0.07) | (0.07) | (2.80) | (0.05) | (0.05) |
| | | 0.3 | 0.08 | 0.93 | 0.07 | 54.35 | 0.00 | 1.00 | 0.08 | 0.93 | 0.07 | 3.03 | 0.05 | 0.95 |
| | | | (0.09) | (0.07) | (0.07) | (24.96) | (0.00) | (0.00) | (0.09) | (0.07) | (0.07) | (3.42) | (0.05) | (0.05) |
| | | 0.4 | 0.11 | 0.90 | 0.10 | 53.19 | 0.00 | 1.00 | 0.14 | 0.92 | 0.08 | 3.32 | 0.05 | 0.95 |
| | | | (0.12) | (0.09) | (0.09) | (19.85) | (0.00) | (0.00) | (0.38) | (0.07) | (0.07) | (3.92) | (0.05) | (0.05) |
Note: 'Incorrect' is the expected number of incorrect nonzero covariates, 'Correct' is the empirical probability of selecting the correct model, 'Overfit' is the empirical probability of selecting an overfitted model, n is the sample size, p is the number of covariates, and 'Censor' denotes the censoring rate. The variances are provided in parentheses.
Table 2 contains the performance measures for the high-dimensional scenarios. It should be noted that the approximated information criterion method (D) is incompatible with high-dimensional data, meaning the high-dimensional results were compared solely among the remaining three methods. For the transformation, we chose the ridge estimator as the initial estimator in the high-dimensional setting. To use the ridge estimator, we first selected the tuning parameter as the smallest λ within one standard error of the minimum λ chosen via 10-fold cross-validation. After choosing λ, the ridge estimator was computed by maximizing the corresponding regularized partial log-likelihood. To obtain samples from the posterior distribution, 2600 iterations of the spike-and-slab Gibbs sampler were run with a total of 100 iterations of burn-in. We set the number of EM iterations to 50 for (B), and similarly chose the tuning parameter λ for (C) as mentioned above. For the setting with n = 1000 and p = 4000, we see that the proposed method (A) achieves its best performance out of all simulation settings. Meanwhile, the spike-and-slab lasso (B) and the lasso (C) both have more difficulty detecting the sparsity structure within the simulated data. It should be noted that the proposed method (A) correctly selects the model in the vast majority of replications, while the lasso (C) does so in at most half of them. Similarly, fewer than five incorrect nonzero covariates in total are selected across the 100 simulations by the proposed method. As for the simulation settings with n = 3000 and p = 4000, the proposed method (A) still maintains an excellent performance. Again, the existing Bayesian Cox model with the spike-and-slab prior (B) experiences worsened performance as n increases. While the lasso (C) improves when n increases, as in the low-dimensional settings, its performance is still not comparable to the precision of our method (A). Figure 1 shows the empirical average of incorrect nonzero covariates for each method across all scenarios, with Panels A, B, C, and D corresponding to the two low-dimensional settings (n = 1000 and n = 3000 with p = 100) and the two high-dimensional settings (n = 1000 and n = 3000 with p = 4000), respectively.
Table 2.
Simulation results for the high-dimensional data (p = 4000): We used the proposed method, denoted by (A), Bayesian Cox method, denoted by (B), and Cox lasso method, denoted by (C).
| n | p | Censor | Incorrect (A) | Correct (A) | Overfit (A) | Incorrect (B) | Correct (B) | Overfit (B) | Incorrect (C) | Correct (C) | Overfit (C) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 4000 | 0.2 | 0.03 | 0.97 | 0.03 | 553.48 | 0.00 | 1.00 | 1.91 | 0.41 | 0.59 |
| | | | (0.03) | (0.03) | (0.03) | (185.93) | (0.00) | (0.00) | (8.37) | (0.24) | (0.24) |
| | | 0.3 | 0.04 | 0.96 | 0.04 | 492.63 | 0.00 | 1.00 | 1.77 | 0.50 | 0.50 |
| | | | (0.04) | (0.04) | (0.04) | (183.71) | (0.00) | (0.00) | (11.55) | (0.25) | (0.25) |
| | | 0.4 | 0.02 | 0.98 | 0.02 | 427.08 | 0.00 | 1.00 | 1.88 | 0.44 | 0.56 |
| | | | (0.02) | (0.02) | (0.02) | (138.11) | (0.00) | (0.00) | (8.81) | (0.25) | (0.25) |
| 3000 | 4000 | 0.2 | 0.08 | 0.93 | 0.07 | 1695.04 | 0.00 | 1.00 | 0.68 | 0.67 | 0.33 |
| | | | (0.09) | (0.07) | (0.07) | (406.50) | (0.00) | (0.00) | (1.57) | (0.22) | (0.22) |
| | | 0.3 | 0.09 | 0.92 | 0.08 | 1562.78 | 0.00 | 1.00 | 0.56 | 0.76 | 0.24 |
| | | | (0.10) | (0.07) | (0.07) | (497.41) | (0.00) | (0.00) | (2.31) | (0.18) | (0.18) |
| | | 0.4 | 0.13 | 0.88 | 0.12 | 1415.71 | 0.00 | 1.00 | 0.40 | 0.83 | 0.17 |
| | | | (0.13) | (0.11) | (0.11) | (382.23) | (0.00) | (0.00) | (3.17) | (0.14) | (0.14) |
Note: 'Incorrect' is the expected number of incorrect nonzero covariates, 'Correct' is the empirical probability of selecting the correct model, 'Overfit' is the empirical probability of selecting an overfitted model, n is the sample size, p is the number of covariates, and 'Censor' denotes the censoring rate. The variances are provided in parentheses.
Figure 1.
The empirical average of incorrect nonzero covariates across all scenarios: Panels A, B, C, and D depict these averages for the two low-dimensional settings (n = 1000, p = 100 and n = 3000, p = 100) and the two high-dimensional settings (n = 1000, p = 4000 and n = 3000, p = 4000), respectively. In each panel, we include the proposed method (solid line with open circles), the Bayesian Cox method (dashed line with open triangles), the Cox lasso method (dotted line with filled circles), and the method with the minimized approximated information criterion (dot-dashed line with filled triangles).
In summary, it should be noted that while the performances of the lasso (C), the Bayesian Cox model (B), and the Cox model with the approximated information criterion (D) are heavily influenced by the sample size, the performance of the proposed method is relatively consistent across sample sizes in this simulation. These results suggest that, in practice, the proposed method is capable of pinpointing sparsity structures within data sets regardless of sample size. Similarly, increasing the number of noise variables in the data can substantially increase the number of incorrect covariates chosen by methods (B) and (C), leading to overfitted models. On the other hand, our method (A) tends to maintain similar, if not improved, results when the total number of covariates is increased.
Tables 3 and 4 contain the predictive performance measures. We considered the area under the receiver operating characteristic curve, denoted by AUC, the area under the precision-recall curve, denoted by PRC, and the concordance index, denoted by CCI, as three predictive measures. In the low-dimensional settings, all of the methods ultimately had rather similar performance. However, in the high-dimensional setting with n = 1000, we note that (A) and (C) performed significantly better than (B), which severely overfitted models. We see that for all three performance measures, (B) has significantly worse performance than (A) and (C). In the n = 3000 case, our proposed method (A) performed slightly better than (C) when we considered the concordance index. Additionally, we see that (B) performs significantly worse than (A) and (C) when we consider the performance measures of AUC and PRC.
Table 3.
Predictive results for the low-dimensional data (p = 100): We used the proposed method, denoted by (A), Bayesian Cox method, denoted by (B), Cox lasso method, denoted by (C), and Cox method with the approximated information criterion, denoted by (D).
| n | p | Censor | AUC (A) | PRC (A) | CCI (A) | AUC (B) | PRC (B) | CCI (B) | AUC (C) | PRC (C) | CCI (C) | AUC (D) | PRC (D) | CCI (D) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 100 | 0.2 | 0.834 | 0.831 | 0.821 | 0.832 | 0.829 | 0.821 | 0.834 | 0.831 | 0.822 | 0.831 | 0.828 | 0.820 |
| | | | (0.012) | (0.015) | (0.009) | (0.012) | (0.015) | (0.009) | (0.012) | (0.015) | (0.009) | (0.012) | (0.015) | (0.009) |
| | | 0.3 | 0.837 | 0.833 | 0.825 | 0.836 | 0.831 | 0.824 | 0.837 | 0.832 | 0.825 | 0.835 | 0.831 | 0.823 |
| | | | (0.013) | (0.016) | (0.010) | (0.013) | (0.016) | (0.010) | (0.011) | (0.015) | (0.009) | (0.013) | (0.016) | (0.010) |
| | | 0.4 | 0.837 | 0.833 | 0.825 | 0.836 | 0.832 | 0.824 | 0.837 | 0.833 | 0.825 | 0.835 | 0.831 | 0.823 |
| | | | (0.013) | (0.016) | (0.010) | (0.013) | (0.016) | (0.010) | (0.013) | (0.016) | (0.010) | (0.013) | (0.016) | (0.010) |
| 3000 | 100 | 0.2 | 0.836 | 0.834 | 0.824 | 0.835 | 0.832 | 0.824 | 0.836 | 0.834 | 0.824 | 0.835 | 0.833 | 0.824 |
| | | | (0.006) | (0.008) | (0.005) | (0.006) | (0.008) | (0.005) | (0.006) | (0.008) | (0.005) | (0.006) | (0.008) | (0.005) |
| | | 0.3 | 0.837 | 0.835 | 0.825 | 0.837 | 0.834 | 0.825 | 0.836 | 0.834 | 0.825 | 0.837 | 0.834 | 0.824 |
| | | | (0.006) | (0.007) | (0.004) | (0.006) | (0.007) | (0.004) | (0.006) | (0.007) | (0.004) | (0.006) | (0.008) | (0.004) |
| | | 0.4 | 0.837 | 0.834 | 0.825 | 0.836 | 0.834 | 0.825 | 0.836 | 0.834 | 0.825 | 0.836 | 0.833 | 0.824 |
| | | | (0.006) | (0.007) | (0.004) | (0.006) | (0.007) | (0.004) | (0.006) | (0.007) | (0.004) | (0.006) | (0.007) | (0.004) |
Note: Performance was assessed via three measures abbreviated as AUC (area under the receiver operating characteristic curve), PRC (area under the precision-recall curve), and CCI (concordance index). n is the sample size, p is the dimension of the covariates, and censor denotes the censoring rate, where the standard deviations are provided in parentheses.
Table 4.
Predictive results for the high-dimensional data (p = 4000): We used the proposed method, denoted by (A), Bayesian Cox method, denoted by (B), and Cox lasso method, denoted by (C).
| n | p | Censor | AUC (A) | PRC (A) | CCI (A) | AUC (B) | PRC (B) | CCI (B) | AUC (C) | PRC (C) | CCI (C) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 4000 | 0.2 | 0.838 | 0.831 | 0.827 | 0.811 | 0.802 | 0.811 | 0.837 | 0.833 | 0.825 |
| | | | (0.012) | (0.016) | (0.009) | (0.014) | (0.019) | (0.010) | (0.011) | (0.015) | (0.008) |
| | | 0.3 | 0.837 | 0.829 | 0.828 | 0.817 | 0.807 | 0.815 | 0.837 | 0.832 | 0.824 |
| | | | (0.011) | (0.016) | (0.008) | (0.012) | (0.017) | (0.008) | (0.012) | (0.016) | (0.008) |
| | | 0.4 | 0.838 | 0.831 | 0.828 | 0.824 | 0.815 | 0.817 | 0.837 | 0.831 | 0.825 |
| | | | (0.011) | (0.014) | (0.008) | (0.012) | (0.016) | (0.010) | (0.013) | (0.016) | (0.009) |
| 3000 | 4000 | 0.2 | 0.838 | 0.833 | 0.829 | 0.803 | 0.790 | 0.832 | 0.837 | 0.834 | 0.825 |
| | | | (0.006) | (0.008) | (0.004) | (0.006) | (0.009) | (0.005) | (0.006) | (0.008) | (0.005) |
| | | 0.3 | 0.838 | 0.832 | 0.828 | 0.817 | 0.805 | 0.832 | 0.835 | 0.832 | 0.824 |
| | | | (0.006) | (0.007) | (0.004) | (0.007) | (0.009) | (0.004) | (0.006) | (0.008) | (0.004) |
| | | 0.4 | 0.838 | 0.833 | 0.828 | 0.827 | 0.817 | 0.832 | 0.837 | 0.833 | 0.825 |
| | | | (0.006) | (0.008) | (0.004) | (0.006) | (0.010) | (0.004) | (0.005) | (0.008) | (0.004) |
Note: Performance was assessed via three measures abbreviated as AUC (area under the receiver operating characteristic curve), PRC (area under the precision-recall curve), and CCI (concordance index). n is the sample size, p is the dimension of the covariates, and censor denotes the censoring rate, where the standard deviations are provided in parentheses.
4. Real data application
4.1. Primary biliary cirrhosis data
We demonstrate our method by first performing an analysis on the primary biliary cirrhosis (PBC) data collected by the Mayo Clinic between 1974 and 1984. This dataset, which is publicly available in the survival package in R, consists of 418 observations and 17 covariates. Prior to applying our method to the PBC dataset, we first performed some filtering. All observations that contained missing covariate values were removed, leaving 276 observations. The PBC dataset, unlike most survival datasets, uses a three-level status indicator: the value 0 represents patients who survived and did not need a liver transplant, the value 1 represents patients who survived but needed a liver transplant, and the value 2 represents patients who died from PBC. To ensure that censoring was binary, we assigned patients who survived a censoring indicator of 0 and patients who died a censoring indicator of 1. After performing these adjustments to the PBC data, we were left with 111 censored observations, creating a censoring rate of 40.22%.
We first standardized all of the covariates. This was due to the varying scales within the PBC data, specifically with regards to variables such as alkaline phosphatase and serum bilirubin. The initial estimator was chosen as the maximum likelihood estimator provided by the partial likelihood function. After transforming the score and information functions, 1000 iterations of the spike-and-slab Gibbs sampler were run with a total of 500 iterations of burn-in. Again, we compared our results on the standardized survival data set with the existing Bayesian Cox model (B), Cox lasso method (C), and Cox method using approximated information criterion (D). As this data set was already explored in the publications of [27,33], we reported their original results for (C) and (D).
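The preprocessing described above can be reproduced along the following lines with the survival package; the complete-case filtering and the recoding of death as the event are as described in the text, while the column handling and numeric coding of sex are assumptions about implementation details.

```r
# Sketch of the PBC preprocessing described above (column handling assumed).
library(survival)
data(pbc, package = "survival")

pbc_cc <- na.omit(pbc)                         # complete cases (276 subjects)
death  <- as.numeric(pbc_cc$status == 2)       # 1 = died, 0 = censored or transplant
X <- pbc_cc[, setdiff(names(pbc_cc), c("id", "time", "status"))]
X$sex <- as.numeric(X$sex == "f")              # numeric coding for sex
X <- scale(as.matrix(X))                       # standardize the 17 covariates
mean(death == 0)                               # empirical censoring rate
```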
Table 5 shows the results for the PBC data set. Six variables were selected by all methods: age, the presence of edema, the amount of serum bilirubin in mg/dl, albumin in gm/dl, urine copper in ug/day, and the histologic stage of disease. Two additional variables, the AST enzyme in U/liter and the prothrombin time in seconds, were selected by all of the methods except the Bayesian Cox method (B). Notice that sex was selected by the lasso (C) but not by any other method. In the PBC data, 242 out of 276 subjects are female, creating a large imbalance for sex, which may have contributed to the result that only the lasso estimator detected sex as significant. Four additional covariates were selected by our method (A) but not by the method with the approximated information criterion (D): the presence of ascites, the presence of hepatomegaly, the presence of spiders, and the level of serum cholesterol in mg/dl.
Table 5.
Results for the PBC data set: We used the proposed method (A), Bayesian Cox method (B), Cox lasso method (C), and Cox method with the approximated information criterion (D).
| Covariate | (A) | (B) | (C) | (D) |
|---|---|---|---|---|
Trt | No | No | No | No |
Age | Yes | Yes | Yes | Yes |
Sex | No | No | Yes | No |
Ascites | Yes | No | Yes | No |
Hepato | Yes | No | No | No |
Spiders | Yes | No | Yes | No |
Edema | Yes | Yes | Yes | Yes |
Bili | Yes | Yes | Yes | Yes |
Chol | Yes | No | No | No |
Albumin | Yes | Yes | Yes | Yes |
Copper | Yes | Yes | Yes | Yes |
Alk.phos | No | No | No | No |
Ast | Yes | No | Yes | Yes |
Trig | No | No | No | No |
Platelet | No | No | No | No |
Protime | Yes | No | Yes | Yes |
Stage | Yes | Yes | Yes | Yes |
Figure 2 contains Kaplan-Meier plots for these four covariates. For the cholesterol covariate, the groups were split at the median value. The lasso estimator (C) selected the presence of ascites and the presence of spiders as significant, but did not select the presence of hepatomegaly or the level of serum cholesterol. However, two studies published after the initial lasso paper support the significance of these variables [18,34]. In all four plots, there is clear visual evidence of differences between the survival curves of the groups. Additionally, the p-values from the Kaplan-Meier log-rank test, which are calculated and displayed within each Kaplan-Meier plot, suggest that there are statistically significant differences between the two groups for each variable.
Figure 2.
Kaplan-Meier plots for the PBC data set: Panels A, B, C, and D contain the survival probabilities for the presence of ascites, presence of spiders, presence of hepatomegaly, and amount of serum cholesterol, respectively. For A, B, and C, the black lines indicate the presence of the covariate whereas the grey lines indicate the absence of the covariate. For D, the black line represents serum cholesterol levels higher than the median level, whereas the grey line represents subjects who had serum cholesterol levels lower than the median level. In each panel, p-values for the log-rank test are provided.
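One panel of Figure 2 can be reproduced in outline as below for serum cholesterol; the median split and the log-rank test follow the description in the text, while the plotting details (colors, labels) are assumptions.

```r
# Sketch of one panel of Figure 2: Kaplan-Meier curves and log-rank test for
# serum cholesterol split at its median (plotting details assumed).
library(survival)
data(pbc, package = "survival")

dat <- na.omit(pbc)
dat$death    <- as.numeric(dat$status == 2)
dat$chol_grp <- ifelse(dat$chol > median(dat$chol), "high", "low")

km <- survfit(Surv(time, death) ~ chol_grp, data = dat)
plot(km, col = c("black", "grey"), xlab = "Time (days)",
     ylab = "Survival probability")
survdiff(Surv(time, death) ~ chol_grp, data = dat)   # log-rank test p-value
```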
Within the context of clinical trials, under-fitted models can lead to severe consequences, as they might overlook important and significant covariates which are associated with patient survival times. The results of this analysis suggest that the proposed method can detect important and significant covariates that may not necessarily be reported by other existing methods within low-dimensional settings.
4.2. Lung adenocarcinoma data
Next, we demonstrate the performance of our method in a high-dimensional setting with the lung adenocarcinoma data presented by [2]. The data set consists of 86 observations with 7129 genes serving as covariates. Of the 86 observations, 62 are censored, yielding a censoring rate of 72.09%. As the data set is high-dimensional, the initial estimator for the transformation was chosen to be a ridge estimator. The tuning parameter was chosen as the smallest λ within one standard error of the minimum λ chosen by leave-one-out cross-validation (LOOCV). After choosing λ, the ridge estimator was computed by maximizing the corresponding regularized partial log-likelihood. After transforming the score and information functions, 5500 iterations of the spike-and-slab Gibbs sampler were run with 500 iterations of burn-in.
Ultimately, our method detected 28 genes as significant for predicting the survival of patients with lung adenocarcinoma. To elaborate on the results, we discuss four of the significant genes reported by our proposed method that have previously been reported to have strong associations with lung cancer: PRKACB, GAPDH, KLF6, and STX1A. In previous studies, there has been strong evidence to suggest that downregulation of the PRKACB gene can increase risk for those with lung adenocarcinoma [4]. Similarly, overexpression of GAPDH, KLF6, and STX1A is also associated with increased risk [2,6,29].
To further examine these genes, we showcase Kaplan-Meier plots in Figure 3 that correspond to the four previously mentioned genes. Although the genes have continuous values, observations were split into two groups based on the median values for each gene. In the Kaplan-Meier plots, we observe a clear difference between survival times within the groups plotted. For PRKACB, GAPDH, and STX1A, the p-values from the Kaplan-Meier log-rank test [3] also suggest significant differences between the subgroups of each gene. These results are consistent with other studies, and suggest that our proposed method is correctly detecting signals within a high-dimensional setting with a high censoring rate.
Figure 3.
Kaplan-Meier plots for the lung adenocarcinoma data set: Panels A, B, C, and D contain the survival probabilities for the PRKACB, GAPDH, KLF6, and STX1A genes, respectively. The black lines indicate levels of the gene higher than the median level, whereas the grey lines indicate levels of the gene lower than the median level. In each panel, p-values for the log-rank test are provided.
The Kaplan-Meier log-rank test for the KLF6 gene yields a p-value of 0.06, which, at the conventional 0.05 level, does not indicate a significant difference between the survival curves. However, the loss of power in the Kaplan-Meier log-rank test can be attributed to the early crossings between the two survival curves and the relatively small sample size [22]. Visually, after the time point of 25 there is a clear difference in survival probabilities between those who experience overexpression of the KLF6 gene and those who do not. This gene presents an important example of our method detecting a significant gene that may have been overlooked by other popular methods such as the log-rank test. The results of this analysis suggest that even in high-dimensional settings with high censoring rates, our method is able to detect significant variables and sparsity structures.
5. Discussion
In this paper, we have proposed a sparse variable selection method that transforms the nonlinear estimating equation for the partial likelihood function into a linear estimating equation framework prior to performing a method of sparse variable selection. Although our primary focus was motivated by improving sparse variable selection within the context of clinical trials and cancer gene-expression profiling, the method can easily be extended to other low- and high-dimensional contexts within the Cox model.
In our real data analysis on lung adenocarcinoma data, we found that the proposed method is capable of detecting signal and joint effects on the survival outcome among many noisy covariates. The proposed method was able to efficiently select 28 significant genes associated with decreased survival in patients with lung adenocarcinoma, whereas competing sparse variable selection methods were unable to identify any signal amongst the genes. The results suggest that in future genetic studies, our method can be utilized as a reliable way to efficiently detect joint effects of genes associated with survival times, serving as an improvement over the popular forms of univariate analysis that have been commonly used within the field.
It should be noted that the ridge estimator chosen for the initial transformation can be sensitive and time-consuming to obtain in high-dimensional settings. In our simulations, we chose to utilize 10-fold cross-validation to select the regularization parameter λ for better computational speed. However, in our real data application, LOOCV was utilized to select a more precise λ at the cost of computational time. Additionally, in future work, we may look into alternatives to the SVS Gibbs sampler for the variable selection step. Although the SVS Gibbs sampler has demonstrated exceptional performance, improvements may be possible by devising a variable selection method specific to the proposed transformed linear estimating equation.
Supplementary Material
Acknowledgments
We thank the Associate Editor and two reviewers for their insightful comments and constructive suggestions to improve the paper.
Correction Statement
This article has been corrected with minor changes. These changes do not impact the academic content of the article.
Funding Statement
Wu and Ahn were supported by National Institute of General Medical Sciences of the National Institutes of Health (NIH/NIGMS) under grant number P20 GM103650. Ahn was also supported by the Simons Foundation Award ID 714241. Yang was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (No. NRF2021R1C1C1007023).
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Ahn M., Zhang H.H., and Lu W., Moment-based method for random effects selection in linear mixed models, Stat. Sin. 22 (2012), pp. 1539–1562.
- 2. Beer D.G., Kardia S.L., Huang C.C., Giordano T.J., Levin A.M., Misek D.E., Lin L., Chen G., Gharib T.G., Thomas D.G., Lizyness M.L., Kuick R., Hayasaka S., Taylor J.M., Iannettoni M.D., Orringer M.B., and Hanash S., Gene-expression profiles predict survival of patients with lung adenocarcinoma, Nat. Med. 8 (2002), pp. 816–824.
- 3. Bland J.M. and Altman D.G., The logrank test, BMJ 328 (2004), p. 1073.
- 4. Chen Y., Gao Y., Tian Y., and Tian D.L., PRKACB is downregulated in non-small cell lung cancer and exogenous PRKACB inhibits proliferation and invasion of LTEP-A2 cells, Oncol. Lett. 5 (2013), pp. 1803–1808.
- 5. Cox D.R., Regression models and life-tables, J. R. Stat. Soc. Ser. B (Methodol.) 34 (1972), pp. 187–220.
- 6. DiFeo A., Feld L., Rodriguez E., Wang C., Beer D.G., Martignetti J.A., and Narla G., A functional role for KLF6-SV1 in lung adenocarcinoma prognosis and chemotherapy response, Cancer Res. 68 (2008), pp. 965–970.
- 7. Fan J. and Jiang J., Non- and semi-parametric modeling in survival analysis, in New Developments in Biostatistics and Bioinformatics, World Scientific & Higher Education Press, New Jersey, 2009, pp. 3–33.
- 8. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360.
- 9. Fan J. and Lv J., A selective overview of variable selection in high dimensional feature space, Stat. Sin. 20 (2010), pp. 101–148.
- 10. Fleming T.R. and Harrington D.P., Counting Processes and Survival Analysis, John Wiley & Sons, Hoboken, New Jersey, 2011.
- 11. George E.I. and McCulloch R.E., Approaches for Bayesian variable selection, Stat. Sin. 7 (1997), pp. 339–373.
- 12. Griffin J.E. and Brown P.J., Bayesian hyper-lassos with non-convex penalization, Aust. N. Z. J. Stat. 53 (2011), pp. 423–442.
- 13. Güler E., Gene expression profiling in breast cancer and its effect on therapy selection in early-stage breast cancer, Eur. J. Breast Health 13 (2017), pp. 168–174.
- 14. Hastie T. and Tibshirani R., Generalized Additive Models, Chapman and Hall, New York, 1990.
- 15. Ibrahim J.G., Chen M.H., and Sinha D., Bayesian Survival Analysis, Springer, New York, 2001.
- 16. Ishwaran H., Kogalur U.B., and Rao J.S., spikeslab: Prediction and variable selection using spike and slab regression, R Journal 2 (2010), pp. 68–73.
- 17. Ishwaran H. and Rao J.S., Spike and slab variable selection: Frequentist and Bayesian strategies, Ann. Statist. 33 (2005), pp. 730–773.
- 18. Janičko M., Veselíny E., Leško D., and Jarčuška P., Serum cholesterol is a significant and independent mortality predictor in liver cirrhosis patients, Ann. Hepatol. 12 (2013), pp. 413–419.
- 19. Kalbfleisch J.D. and Prentice R.L., The Statistical Analysis of Failure Time Data, 2nd ed., John Wiley & Sons, Hoboken, New Jersey, 2011.
- 20. Lawless J.F., Statistical Models and Methods for Lifetime Data, 2nd ed., John Wiley & Sons, Hoboken, New Jersey, 2011.
- 21. Leng C., Tran M.N., and Nott D., Bayesian adaptive lasso, Ann. Inst. Stat. Math. 66 (2014), pp. 221–244.
- 22. Li H., Han D., Hou Y., Chen H., and Chen Z., Statistical inference methods for two crossing survival curves: A comparison of methods, PLoS One 10 (2015), p. e0116774.
- 23. Madigan D. and Raftery A.E., Model selection and accounting for model uncertainty in graphical models using Occam's window, J. Am. Stat. Assoc. 89 (1994), pp. 1535–1546.
- 24. McCullagh P. and Nelder J., Generalized Linear Models, 2nd ed., Chapman & Hall, London, UK, 1989.
- 25. Mitchell T.J. and Beauchamp J.J., Bayesian variable selection in linear regression, J. Am. Stat. Assoc. 83 (1988), pp. 1023–1032.
- 26. Myers D.J. and Wallen J.M., Cancer, lung adenocarcinoma, in StatPearls [Internet], StatPearls Publishing, Treasure Island, FL, 2019.
- 27. Nabi R. and Su X., coxphMIC: An R package for sparse estimation of Cox proportional hazards models via approximated information criteria, R Journal 9 (2017), pp. 229–238.
- 28. Park T. and Casella G., The Bayesian lasso, J. Am. Stat. Assoc. 103 (2008), pp. 681–686.
- 29. Puzone R., Savarino G., Salvi S., Dal Bello M.G., Barletta G., Genova C., Rijavec E., Sini C., Esposito A.I., Ratto G.B., Truini M., Grossi F., and Pfeffer U., Glyceraldehyde-3-phosphate dehydrogenase gene over expression correlates with poor prognosis in non small cell lung cancer patients, Mol. Cancer 12 (2013), p. 97.
- 30. Singh K. and Gupta N., Palateless custom bar supported overdenture: A treatment modality to treat patient with severe gag reflex, Indian J. Dent. Res. 23 (2012), pp. 145–148.
- 31. Su X., Wijayasinghe C.S., Fan J., and Zhang Y., Sparse estimation of Cox proportional hazards models via approximated information criteria, Biometrics 72 (2016), pp. 751–759.
- 32. Tang Z., Shen Y., Zhang X., and Yi N., The spike-and-slab lasso Cox model for survival prediction and associated genes detection, Bioinformatics 33 (2017), pp. 2799–2807.
- 33. Tibshirani R., The lasso method for variable selection in the Cox model, Stat. Med. 16 (1997), pp. 385–395.
- 34. Uddenfeldt P. and Danielsson Å., Primary biliary cirrhosis: Survival of a cohort followed for 10 years, J. Intern. Med. 248 (2008), pp. 292–298.
- 35. Xu J., High-dimensional Cox regression analysis in genetic studies with censored survival outcomes, J. Probab. Stat. 2012 (2012), p. 478680.
- 36. Yang H., Baladandayuthapani V., Rao A.U., and Morris J.S., Quantile function on scalar regression analysis for distributional data, J. Am. Stat. Assoc. 115 (2020), pp. 90–106.
- 37. Yang H., Zhu H., and Ibrahim J.G., MILFM: Multiple index latent factor model based on high-dimensional features, Biometrics 74 (2018), pp. 834–844.
- 38. Zhang H.H. and Lu W., Adaptive Lasso for Cox's proportional hazards model, Biometrika 94 (2007), pp. 691–703.