Abstract
Efficient variable selection in high dimensional cancer genomic studies is critical for discovering genes associated with specific cancer types and for predicting response to treatment. Censored survival data is prevalent in such studies. In this article we introduce a Bayesian variable selection procedure that uses a mixture prior composed of a point mass at zero and an inverse moment prior in conjunction with the partial likelihood defined by the Cox proportional hazard model. The procedure is implemented in the R package BVSNLP, which supports parallel computing and uses a stochastic search method to explore the model space. Bayesian model averaging is used for prediction. The proposed algorithm provides better performance than other variable selection procedures in simulation studies, and appears to provide more consistent variable selection when applied to actual genomic datasets.
Keywords: Bayesian Variable Selection, Nonlocal Prior, High Dimensional Data, Survival Data Analysis, Cox Proportional Hazard Model, Cancer Genomics
1. Introduction.
Recent developments in sequencing technology have made it easier to collect massive genomic datasets that can be used to study cancer and other diseases. Given such data, there is great interest in linking genomic data to patient outcomes, and in many cases such outcomes are censored survival times.
Survival times for patients generally represent either the time to death or disease progression, the time to study termination, or the time until the subject is lost to follow up. In the latter cases, the subject’s survival time is censored. The relation between survival times and covariates is modeled through the conditional hazard function, which is the limiting probability of death in the interval (t, t+Δt) as Δt becomes small, given patient covariates. More precisely, the hazard function h for patient i may be defined as
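$$h(t \mid x_i) = \lim_{\Delta t \to 0} \frac{P(t \le T_i < t + \Delta t \mid T_i \ge t,\, x_i)}{\Delta t} \tag{1.1}$$
where Ti denotes the survival time of patient i and xi is a p vector of covariates thought to influence survival. We denote by X the n × p design matrix obtained by stacking n patient covariate vectors. Proportional hazard models take the form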
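$$h(t \mid x_i) = h_0(t)\,\Phi(x_i) \tag{1.2}$$
with an identifiability constraint of Φ(0) = 1. In this formula, h0(t) denotes the baseline hazard function. The Cox proportional hazards model (Cox, 1972) is defined by taking $\Phi(x_i) = \exp(x_i^T\beta)$, leading to
$$h(t \mid x_i) = h_0(t)\,\exp(x_i^T\beta) \tag{1.3}$$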
Here, β is a p × 1 vector of coefficients.
An important feature of the proportional hazards model is that it yields a partial likelihood function that is independent of the baseline hazard function, h0. For complete survival analyses, however, the baseline hazard function is necessary for predicting survival times and can be estimated non-parametrically. Further details regarding the Cox proportional hazard model may be found in Cox and Oakes (1984), Kalbfleisch and Prentice (1980) or Cox (1972).
Gene expression datasets usually contain measurements on thousands of genes collected for only hundreds of subjects. Biologically it seems plausible that only a relatively small number of these genes contribute significantly to survival. This implies that most of the elements in the vector β are small or are close to zero. The challenge is to find covariates with non-zero coefficients or, equivalently, those genes that contribute the most in determining the survival outcome.
Many common penalized likelihood methods originally introduced for linear regression have been extended to survival data. These methods include LASSO (Tibshirani et al., 1997), in which an L1 penalty is imposed on regression coefficients. Zhang and Lu (2007) utilized adaptive LASSO methodology for time to event data, while Antoniadis, Fryzlewicz and Letué (2010) adopted the Dantzig selector for survival outcomes. The extension of non-convex penalized likelihood approaches, in particular SCAD, to the Cox proportional hazard model is discussed in Fan and Li (2002). The Iterative Sure Independence Screening (ISIS) approach introduced by Fan and Lv (2008) is also extended for ultrahigh dimensional survival data in Fan et al. (2010), where it is used on Cox proportional hazard models and the SCAD penalty is employed for variable selection.
Some Bayesian approaches have also been proposed. Faraggi and Simon (1998) proposed a method based on approximating the posterior distribution of the parameters in the proportional hazard model by defining a Gaussian prior on regression coefficients. A loss function was then imposed to select a parsimonious model. A semi-parametric Bayesian approach was utilized by Ibrahim, Chen and MacEachern (1999), who employed a discrete gamma process for the baseline hazard function and a multivariate Gaussian prior for the coefficient vector. Sha, Tadesse and Vannucci (2006) considered Accelerated Failure Time (AFT) models along with data augmentation to impute failure times. A mixture prior proposed by George and McCulloch (1997) was used to impose sparsity. In more recent work, Held, Gravestock and Sabanés Bové (2016) proposed the use of a g-prior model for the coefficient vector and applied test-based Bayes factors (Johnson, 2005) to Cox proportional hazard models. However, this method is intended for use only when the number of covariates is less than the number of observations; that is, when p < n.
To our knowledge, all previous Bayesian procedures for variable selection in survival data have used local priors on model coefficients. In this article, we propose a Bayesian method based on a mixture prior comprised of a point mass at zero and a nonlocal prior on the regression coefficients. To handle the computational burden of implementing the resulting procedure we employ a stochastic search method, S5 (Shin, Bhattacharya and Johnson, 2018), which we implement in an R package BVSNLP. We also discuss a general procedure for setting the tuning parameter of the nonlocal prior.
This article is structured as follows. In Section 2 we introduce notation, and discuss the modeling of the problem in a Bayesian framework. Section 3 discusses the proposed method, with details of parameter selection, model search, and assessment of the accuracy of the proposed variable selection procedure. Sections 4 and 5 provide simulation and real data analyses with various predictive performance measures to demonstrate how the proposed method compares to several other competing methods. Section 6 concludes with discussion.
2. Problem Modeling.
2.1. Preliminaries.
Let Ti denote the survival time and Ci denote the censoring time for individual i. Each element in the observed vector of survival times, y, is defined as yi = min{Ti, Ci}. The status for each individual is defined as δi = I(Ti ≤ Ci). The status vector is represented by δ = (δ1, δ2,…, δn)T. We assume that the censoring mechanism is at random, meaning that Ci and Ti are conditionally independent given xi, where xi denotes the covariates for individual i and comprises the ith row of X. The observed data are of the form {(yi, δi, xi); i = 1, 2,…,n}.
Model k is defined as k = {k1,…, kj} where (1 ≤ k1 < · · · < kj ≤ p) and it is assumed that $\beta_{k_1} \ne 0, \ldots, \beta_{k_j} \ne 0$ and all other elements of β are 0. The design matrix corresponding to model k is denoted by Xk, and the regression vector by $\beta_k = (\beta_{k_1}, \ldots, \beta_{k_j})^T$.
Let $R(t) = \{i : y_i \ge t\}$ represent the risk set at time t, the set of all individuals who are still present in the study at time t and are neither dead nor censored. We assume throughout this article that the failure times are distinct. In other words, only one individual fails at a specific failure time. With this assumption and letting $x_{ik}^T$ denote the ith row of Xk, the partial likelihood (Cox, 1972) for βk in model k can be written as
$$L_p(\beta_k) = \prod_{i=1}^{n} \left[ \frac{\exp(x_{ik}^T\beta_k)}{\sum_{j \in R(y_i)} \exp(x_{jk}^T\beta_k)} \right]^{\delta_i} \tag{2.1}$$
Our method uses this partial likelihood as the sampling distribution in our Bayesian model selection procedure. We acknowledge that there is some information loss in (2.1) with respect to βk. For instance, Basu (Ghosh, 1988) argues that partial likelihoods cannot usually be interpreted as sampling distributions. On the other hand, Berger et al. (1999) encourage the use of partial likelihoods when the nuisance parameters are marginalized out.
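Sorting the observed unique survival times in ascending order and consequently re-ordering the status vector δ as well as the design matrix X with respect to the ordered y, the sampling distribution of y for model k can be written as
$$\pi(y \mid \beta_k) = \prod_{i=1}^{n} \left[ \frac{\exp(x_{ik}^T\beta_k)}{\sum_{j=i}^{n} \exp(x_{jk}^T\beta_k)} \right]^{\delta_i} \tag{2.2}$$
where $x_{ik}^T$ is the ith row of the re-ordered design matrix Xk.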
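As a minimal sketch of how this sampling distribution can be evaluated, the Python function below computes the standard Cox log partial likelihood after sorting by observed time, assuming distinct failure times as in the text. The function name `cox_log_partial_likelihood` and the toy data are illustrative assumptions and are not taken from the BVSNLP package.

```python
import math

def cox_log_partial_likelihood(times, status, X, beta):
    """Log Cox partial likelihood, assuming distinct failure times.

    times  : observed times y_i = min(T_i, C_i)
    status : delta_i = 1 if the failure was observed, 0 if censored
    X      : list of covariate rows restricted to the covariates in the model
    beta   : coefficient vector for those covariates
    """
    # Sort subjects by ascending observed time so that the risk set of the
    # i-th sorted subject is simply subjects i, i+1, ..., n-1.
    order = sorted(range(len(times)), key=lambda i: times[i])
    eta = [sum(x * b for x, b in zip(X[i], beta)) for i in order]
    delta = [status[i] for i in order]

    loglik = 0.0
    risk_sum = 0.0
    # Walk backwards through the sorted subjects, accumulating
    # sum_{j >= i} exp(eta_j), the denominator for subject i.
    for i in range(len(eta) - 1, -1, -1):
        risk_sum += math.exp(eta[i])
        if delta[i] == 1:
            loglik += eta[i] - math.log(risk_sum)
    return loglik

# Toy data (hypothetical): four subjects, one covariate, one censored subject.
times = [2.0, 5.0, 3.0, 8.0]
status = [1, 0, 1, 1]
X = [[0.5], [-1.0], [1.5], [0.0]]
null_value = cox_log_partial_likelihood(times, status, X, [0.0])
```

With all coefficients equal to zero, each observed failure contributes minus the log of its risk set size, which gives a quick sanity check of the implementation.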
A Bayesian hierarchical model can be defined in which π(y |βk) in (2.2) represents the sampling distribution, πk(βk) is the prior of model coefficients βk and p(k) is the prior for model k. Using Bayes rule, the posterior probability for model j is written as
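$$p(j \mid y) = \frac{p(j)\, m_j(y)}{\sum_{k \in \mathcal{J}} p(k)\, m_k(y)} \tag{2.3}$$
where $\mathcal{J}$ is the set of all possible models and the marginal probability of the data under model k is defined by
$$m_k(y) = \int \pi(y \mid \beta_k)\, \pi_k(\beta_k)\, d\beta_k \tag{2.4}$$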
The prior density for βk and the prior on the model space impact the overall performance of the selection procedure and the amount of sparsity imposed on candidate models. Note that the sampling distribution in (2.2) is continuous in βk, and in Section 2.3 we define an inverse moment prior (Johnson and Rossell, 2010) on each of the coefficients in model k.
2.2. Prior on Model Space.
Let γk = {γ1, · · · , γp} denote a binary vector indicating which covariates are included in model k. Suppose the size of model k is k. That is, there are k nonzero indices in γk. The nonzero indices of γk represent the indices of the nonzero elements in the coefficient vector, β, which a priori are modeled as independent Bernoulli random variables with success probability P (γi = 1) = θ for every 1 ≤ i ≤ p. As discussed in Scott et al. (2010), no fixed value for θ adjusts for multiplicity. As a result, it is necessary to define a prior on θ, say π(θ). The resulting marginal probability for model k in a fully Bayesian approach may then be written as
$$p(k) = \int_0^1 \theta^{k} (1 - \theta)^{p - k}\, \pi(\theta)\, d\theta \tag{2.5}$$
A common choice for π(θ) is the beta distribution, θ ~ Beta(a, b), where in the special case of a = b = 1, π(θ) is a uniform distribution. The marginal probability for model k derived from (2.5) is then equal to
$$p(k) = \frac{B(a + k,\, b + p - k)}{B(a, b)} \tag{2.6}$$
where B(·, ·) is the Beta function. A priori, the model size, k, thus follows a Beta-binomial distribution. By choosing b = p − a, the mean and variance of the selected model size k are
$$E(k) = a, \qquad \mathrm{Var}(k) = \frac{2a(p - a)}{p + 1} \approx 2a \tag{2.7}$$
The approximation in the variance formula follows from a large p and a fairly small a under the sparsity assumption on the true model size. To incorporate the belief that the optimal predictive models are sparse, we recommend setting a = 1 and b = p − a. The resulting prior assigns comparatively small prior probabilities to models that contain many covariates.
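The behaviour of this prior can be checked numerically. The sketch below, which uses hypothetical function names, evaluates the beta-binomial model prior B(a + k, b + p − k)/B(a, b) on the log scale and computes the implied mean and variance of the model size for a = 1, b = p − a and p = 1000.

```python
import math

def log_beta(x, y):
    """Log of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def log_model_prior(k, p, a=1.0, b=None):
    """Log prior probability of ONE specific model containing k covariates."""
    if b is None:
        b = p - a  # the choice recommended in the text
    return log_beta(a + k, b + p - k) - log_beta(a, b)

def log_size_prior(k, p, a=1.0, b=None):
    """Log prior probability that the model size equals k (beta-binomial)."""
    log_choose = (math.lgamma(p + 1) - math.lgamma(k + 1)
                  - math.lgamma(p - k + 1))
    return log_choose + log_model_prior(k, p, a, b)

p = 1000
size_probs = [math.exp(log_size_prior(k, p)) for k in range(p + 1)]
prior_mean = sum(k * q for k, q in enumerate(size_probs))
prior_var = sum((k - prior_mean) ** 2 * q for k, q in enumerate(size_probs))
```

Under these settings the prior mean of the model size equals a and the variance is close to 2a, so most prior mass sits on very small models.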
2.3. Product Inverse MOMent (piMOM) Prior.
We impose nonlocal prior densities on the non-zero coefficients, βk. Specifically, we assume the prior densities on the non-zero coefficients in model k take the form of a product of independent iMOM priors, or piMOM densities (Johnson and Rossell, 2012), expressible as
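$$\pi(\beta_k \mid \tau, r) = \frac{\tau^{rk/2}}{\Gamma(r/2)^{k}} \prod_{i=1}^{k} |\beta_{k_i}|^{-(r+1)} \exp\left(-\frac{\tau}{\beta_{k_i}^2}\right) \tag{2.8}$$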
The hyperparameter τ represents a scale parameter that determines the dispersion of the prior around 0, while r determines the tail behavior of the density. These priors have two symmetric modes with Cauchy-like tails when r = 1, and assign negligible probability to a region around zero. In comparison to local priors, this characteristic of nonlocal priors potentially leads to smaller false positive rates in selection procedures by discouraging the selection of variables with small coefficients. On the other hand, piMOM priors possess Cauchy-like tails, which introduce comparatively small penalties on large coefficients. Unlike many penalized likelihood methods, large values of regression coefficients are thus not heavily penalized by these priors. As a result, they do not necessarily impose significant penalties on non-sparse models provided that the estimated coefficients in those models are not small. For these reasons, piMOM priors work well as a default choice of priors on non-negligible coefficients in variable selection problems.
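To make the shape of this prior concrete, the following sketch evaluates the log piMOM density, assuming the iMOM form of Johnson and Rossell, τ^(r/2)/Γ(r/2) · |β|^−(r+1) · exp(−τ/β²), for each coefficient. The function names are hypothetical. Setting the derivative of the log density to zero gives modes at ±√(2τ/(r + 1)).

```python
import math

def imom_log_density(beta, tau=0.25, r=1):
    """Log iMOM density for a single coefficient (assumed form, see lead-in)."""
    if beta == 0.0:
        return float("-inf")  # the density vanishes at the origin
    return (0.5 * r * math.log(tau) - math.lgamma(0.5 * r)
            - (r + 1) * math.log(abs(beta)) - tau / beta ** 2)

def pimom_log_density(betas, tau=0.25, r=1):
    """Log piMOM density: a product of independent iMOM densities."""
    return sum(imom_log_density(b, tau, r) for b in betas)

def imom_mode(tau, r=1):
    """Positive mode of the iMOM density; the other mode is its negative."""
    return math.sqrt(2.0 * tau / (r + 1))

mode = imom_mode(0.25)  # with r = 1 this is sqrt(tau)
```

The density is symmetric, is exactly zero at the origin, and decays only polynomially in the tails, which is the combination of properties the text relies on.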
Consistency properties of piMOM priors for linear models were studied in Shin, Bhattacharya and Johnson (2018). In that setting it was shown that piMOM priors are consistent for $p = O(e^{n^{\alpha}})$, 0 < α < 1. By comparison, this property does not hold for pMOM priors. The source of inconsistency for pMOM priors in p ≫ n settings stems from the fact that their densities go to zero only at an inverse polynomial rate in a neighborhood of the origin. An example of an iMOM prior is depicted in Figure 4 for r = 1 and τ = 0.5.
Fig 4:

iMOM and MOM prior with r = 1 and τ = 0.5.
3. Methods.
3.1. Selection of Hyperparameters.
We use the procedure described in Nikooienejad, Wang and Johnson (2016) to select hyperparameter values for the piMOM prior. In that method, the null distribution of the maximum likelihood estimator for βk (i.e., all components of βk are 0), obtained from randomly selected design matrices Xk, is compared to the prior density on βk for various values of (r, τ). Fixing r = 1 to achieve Cauchy-like tails, a value of τ is chosen so that the overlap between the two densities is less than a specified threshold; this value is denoted by τ1. It can be shown that the maximum of the iMOM prior occurs at $\pm\sqrt{2\tau/(r+1)}$, which reduces to $\pm\sqrt{\tau}$ when r = 1. We also allow users to input a prior parameter α that controls where the modes in the prior occur. This can be useful in constraining the prior density when covariates are highly correlated, in which case the sampling distribution of the MLE under the null model becomes overly broad and the resulting prior is over-dispersed. We then set the value of τ according to
$$\tau = \min(\tau_1, \alpha^2) \tag{3.1}$$
To implement the procedure for computing τ1 for survival models, we generate response vectors under the null model using the procedure described by Bender, Augustin and Blettner (2005). Survival times are sampled from a standard exponential model.
Let ts and cs be the vectors of sampled survival times and censoring times, respectively. The sampled survival time and status for each observation are then computed elementwise as
$$y_s = \min(t_s, c_s), \qquad \delta_s = I(t_s \le c_s) \tag{3.2}$$
which comprise ys and δs under the null model. Using the pair (ys, δs), the MLE from the Cox model is computed. It should be noted that the asymptotic distribution of the MLE for the Cox model under the null hypothesis is $N(0, I(\beta)^{-1})$, where I(β) is the information matrix of the partial likelihood function. Thus, it is appropriate to approximate the pooled estimated coefficients in that algorithm with a normal density function. As the sample size grows, the variance of the MLE decreases, the overlap becomes small, and consequently small values of τ are selected.
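The null data generation step can be sketched as follows. Survival times are drawn from a standard exponential distribution as stated in the text; the censoring distribution is not specified in this section, so the exponential censoring with rate `censor_rate` used here is an assumption made purely for illustration, as is the function name.

```python
import random

def simulate_null_survival(n, censor_rate=0.1, seed=0):
    """Draw (y_s, delta_s) under the null model (no covariate effects).

    censor_rate is a hypothetical choice; the text does not state how the
    censoring times are sampled in the hyperparameter selection procedure.
    """
    rng = random.Random(seed)
    t = [rng.expovariate(1.0) for _ in range(n)]          # survival times
    c = [rng.expovariate(censor_rate) for _ in range(n)]  # censoring times
    y = [min(ti, ci) for ti, ci in zip(t, c)]
    delta = [1 if ti <= ci else 0 for ti, ci in zip(t, c)]
    return y, delta

y_s, delta_s = simulate_null_survival(20000)
censored_fraction = 1.0 - sum(delta_s) / len(delta_s)
```

With independent exponential survival (rate 1) and censoring (rate 0.1) times, the expected censoring fraction is 0.1 / 1.1, roughly 9%. The Cox MLE would then be fit to each simulated pair with randomly selected columns of the design matrix.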
In general, we find that r = 1 and τ = 0.25 are good default values if one chooses not to run the hyperparameter selection algorithm. When r = 1, the peaks of the iMOM prior occur at $-\sqrt{\tau}$ and $\sqrt{\tau}$. By equating $\sqrt{\tau}$ to the absolute value of the expected effect size for a given application, insight can be gained on what value of τ is appropriate. Further details regarding this algorithm can be found in Nikooienejad, Wang and Johnson (2016).
3.2. Computing Posterior Probability of Models.
Computing the posterior probability for each model requires the marginal probability of observed survival times under each model as shown in (2.3), (2.4). The marginal probability is approximated by using the Laplace approximation, where the regression coefficients in βk are integrated out. This leads to
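$$m_k(y) \approx (2\pi)^{k/2}\, \left|G(\hat\beta_k)\right|^{-1/2} \exp\left\{-g(\hat\beta_k)\right\} \tag{3.3}$$
Here, $\hat\beta_k$ is the maximum a posteriori (MAP) estimate of βk, and $G(\hat\beta_k)$ is the Hessian of the negative of the log posterior function,
$$g(\beta_k) = -\log \pi(y \mid \beta_k) - \log \pi_k(\beta_k) \tag{3.4}$$
computed at $\hat\beta_k$, and k is the size of model k. Finding the MAP of βk is equivalent to finding the minimum of g(βk).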
The details of computing the gradient and Hessian matrix of g(βk) are discussed in Section A of the Appendix. The gradient and Hessian matrix, described by equations (A.3) to (A.7), are used to find the MAP, and to compute the Laplace approximation of the marginal probability of y.
We use the limited memory version of the Broyden-Fletcher-Goldfarb-Shanno optimization algorithm (L-BFGS) (Liu and Nocedal, 1989) to find the MAP. The initial value for the algorithm is the MLE for the Cox proportional hazard model.
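The Laplace step can be illustrated on a one-dimensional integral with a known answer. The sketch below is a generic illustration, not the BVSNLP implementation: it applies the formula (2π)^(k/2) · |H|^(−1/2) · exp(−g(β̂)) to the hypothetical function g(β) = β − s·log β, whose exact integral of exp(−g) over (0, ∞) is Γ(s + 1).

```python
import math

def laplace_log_marginal(g_at_min, hessian_logdet, k):
    """Log of the Laplace approximation to the integral of exp(-g).

    g_at_min       : g evaluated at its minimizer (the MAP in the text)
    hessian_logdet : log determinant of the Hessian of g at the minimizer
    k              : dimension of the integral (the model size in the text)
    """
    return 0.5 * k * math.log(2.0 * math.pi) - 0.5 * hessian_logdet - g_at_min

s = 50.0

def g(b):
    """Toy negative log integrand; minimized at b = s."""
    return b - s * math.log(b)

beta_hat = s                   # closed-form minimizer of g
hessian = s / beta_hat ** 2    # second derivative of g at the minimizer
approx = laplace_log_marginal(g(beta_hat), math.log(hessian), k=1)
exact = math.lgamma(s + 1.0)   # log of the exact integral
```

Here the approximation reproduces Stirling's formula, and the log-scale error is about 1/(12s). In the model selection setting the minimizer is not available in closed form, which is why a numerical optimizer such as L-BFGS is needed.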
Having all the components of formula (2.3), it is possible to define an MCMC framework to sample from the posterior distribution on the model space. A birth-death scheme, similar to that used in Nikooienejad, Wang and Johnson (2016), could be used for this purpose. However, for computational reasons we use another stochastic algorithm to search the model space; this algorithm is described in the next section.
The highest posterior probability model (HPPM) is defined as the model having the highest posterior probability among all visited models. In practice, many models may be assigned probabilities that are close to the probability achieved by the HPPM. For this reason and for predictive purposes, it is useful to obtain the Median Probability Model (MPM) (Barbieri et al., 2004), which is the model containing covariates that have posterior inclusion probabilities of at least 0.5. According to Barbieri et al. (2004), the posterior inclusion probability for covariate i is defined as
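$$P(\gamma_i = 1 \mid y) = \sum_{k \in \mathcal{J}} \gamma_{ki}\, p(k \mid y) \tag{3.5}$$
That is, it is the sum of the posterior probabilities of all models that have covariate i as one of their variables. In this expression, $\mathcal{J}$ is the set of all possible models and γki is a binary value determining the inclusion of the ith covariate in model k.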
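The relationship between the HPPM, the posterior inclusion probabilities and the MPM can be sketched in a few lines. The code below is illustrative only: models are represented as tuples of zero-based covariate indices, the function names are hypothetical, and posterior probabilities are normalized over the visited models rather than over the full model space.

```python
import math

def inclusion_probabilities(log_post, p):
    """Posterior inclusion probability of each of p covariates.

    log_post maps a model (tuple of covariate indices) to its unnormalized
    log posterior probability; normalization is over these models only.
    """
    m = max(log_post.values())
    weights = {k: math.exp(v - m) for k, v in log_post.items()}
    total = sum(weights.values())
    probs = [0.0] * p
    for model, w in weights.items():
        for i in model:
            probs[i] += w / total
    return probs

def median_probability_model(probs, threshold=0.5):
    """Covariates whose inclusion probability is at least the threshold."""
    return tuple(i for i, q in enumerate(probs) if q >= threshold)

def highest_probability_model(log_post):
    """Visited model with the largest posterior probability."""
    return max(log_post, key=log_post.get)

# Hypothetical visited models with posterior probabilities 0.4, 0.3, 0.3.
visited = {(0,): math.log(0.4), (1, 2): math.log(0.3), (1, 3): math.log(0.3)}
probs = inclusion_probabilities(visited, p=4)
mpm = median_probability_model(probs)
hppm = highest_probability_model(visited)
```

In this toy example the HPPM is the single-covariate model (0,), while the MPM is (1,), because covariate 1 appears in two models whose combined probability is 0.6. This shows how the two summaries can disagree when posterior mass is spread over several models.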
3.2.1. Stochastic Search Algorithm.
To increase the efficiency of exploring the model space, we use the S5 algorithm. S5 was proposed by Shin, Bhattacharya and Johnson (2018) for variable selection in linear regression problems, and we adapt it here for survival models. It is a stochastic search method that screens covariates at each step. The algorithm is scalable and its computational complexity is only linearly dependent on p (Shin, Bhattacharya and Johnson, 2018).
Screening is the essential part of the S5 algorithm. In linear regression, screening is based on the correlation between excluded covariates and the residuals of the regression using the current model (Fan and Lv, 2008). The concept of screening covariates for survival response data is proposed in Fan et al. (2010), and is defined based on the marginal utility for each covariate.
To illustrate the screening technique, suppose that the current model is k. Let kc denote the complement of set k containing columns of the design matrix that are not in the current model, k. The conditional utility of covariate m ∈ kc represents the amount of information covariate m contributes to the survival outcome, given model k, and is defined as
| (3.6) |
By comparing um | k to (A.1), it follows heuristically that the conditional utility is the maximum likelihood for covariate m after accounting for the information provided by model k. Finding um | k is a univariate optimization procedure that can be computed rapidly.
With this background, the S5 algorithm for survival data works as follows. At each step, the d covariates with the highest conditional utility are candidates to be added to the current model k, and the models obtained by adding one of them comprise the addition set, Γ+. The deletion set, Γ−, contains the models obtained by removing a single variable from the current model. From the current model, k, we consider moves to each of its neighbors in Γ+ and Γ− with a probability proportional to the marginal probabilities of these neighboring models.
To avoid local maxima, the model probabilities used in S5 are raised to the power of 1/tl, where tl is the lth temperature in an annealing schedule in which “temperatures” decrease. To increase the number of visited models, a specified number of iterations are performed at each temperature. At the end of the procedure, the model with the highest posterior probability of visited models is identified as the HPPM.
In our version of the S5 algorithm, we used 10 equally spaced temperatures varying from 3 to 1 and 30 iterations within each temperature. Section D of the Appendix provides some discussion on how these values are chosen for this application. To increase the number of visited models, we parallelized the S5 procedure so that it could be distributed to multiple CPUs. Each CPU executes the S5 algorithm independently with a different starting model. All visited models are pooled together at the end and the HPPM and MPM are determined. Using posterior probabilities of the visited models, the posterior inclusion probability for each covariate can be computed using (3.5). In our simulations, we used 120 CPUs to explore the model space for design matrices with $O(10^4)$ covariates.
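The tempering step can be sketched as below. This is a simplified illustration of choosing among neighboring models with probabilities raised to the power 1/t, together with the 10-point schedule from 3 to 1 described above. The function names are hypothetical and the sketch omits screening and the construction of the neighbor sets.

```python
import math
import random

def tempered_move_probabilities(log_probs, temperature):
    """Normalize neighbor probabilities after raising them to 1/temperature."""
    scaled = [lp / temperature for lp in log_probs]
    m = max(scaled)  # subtract the maximum for numerical stability
    w = [math.exp(s - m) for s in scaled]
    total = sum(w)
    return [x / total for x in w]

def choose_move(log_probs, temperature, rng):
    """Sample the index of the neighboring model to move to."""
    probs = tempered_move_probabilities(log_probs, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def temperature_schedule(t_max=3.0, t_min=1.0, n_temps=10):
    """Equally spaced, decreasing temperatures from t_max to t_min."""
    step = (t_max - t_min) / (n_temps - 1)
    return [t_max - l * step for l in range(n_temps)]

schedule = temperature_schedule()
neighbor_log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
hot = tempered_move_probabilities(neighbor_log_probs, schedule[0])
cold = tempered_move_probabilities(neighbor_log_probs, schedule[-1])
move = choose_move(neighbor_log_probs, schedule[0], random.Random(0))
```

At high temperatures the move probabilities are flattened, so the search can leave a local maximum; at temperature 1 they coincide with the normalized model probabilities.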
3.3. Predictive Accuracy Assessment.
In addition to looking at the selected genes and their pathways to determine their biological relevance in analyzing the real data sets, we used the time dependent AUC, obtained from time dependent ROC curves as introduced by Heagerty, Lumley and Pepe (2000) for survival times, to summarize and compare the predictive performance of the various algorithms. This measure has a relatively straightforward interpretation, and unlike other summary measures such as the c-index (Harrell Jr et al., 1982), can be computed without requiring specific conditions or additional assumptions to hold (Blanche, Kattan and Gerds, 2018). However, other predictive performance measures, including the c-index, the Integrated Brier Score (IBS) (Gerds and Schumacher, 2006), and prediction error curves, are also investigated and reported in Sections 4 and 5 for both simulation and real data sets.
There are different methods to estimate time dependent sensitivity and specificity. In our algorithm, we adapted a method proposed by Uno et al. (2007), henceforth called Uno’s method. In that method, after splitting data into training and test sets, sensitivity is estimated by
| (3.7) |
and specificity is estimated by
| (3.8) |
These values are estimated for the test set. Therefore, in the equations above, n is the number of observations in the test set, δi is the status of observation i and Ti is the observed time for that observation in the test set. The variable c is the discrimination threshold that is varied to obtain the ROC curve. The Kaplan-Meier estimate of the survival function that appears in these equations is obtained from the training set. For each observation i in the test set with observed time Ti, its value is computed by a basic interpolation procedure. That is,
| (3.9) |
Here, Ttr is the set of all observed survival times in the training set. The coefficient estimate that appears in (3.7) and (3.8) is the one obtained under a specific model.
3.3.1. Bayesian Model Averaging (BMA).
BMA can be used to improve the predictive accuracy by accounting for the uncertainty in selected models. From (3.7) and (3.8) the final sensitivity and specificity using BMA may be defined as
| (3.10) |
and
| (3.11) |
where, is the posterior probability of model The value of depends on what type of BMA is used. We use Occam’s window, which means only models that have posterior probability of at least are used in model averaging. We set w = 0.01 for our applications.
In the proposed method, individual survival curves are estimated using the highest posterior probability model. Section B provides the details of this procedure. Similar approaches were also adopted by Held, Gravestock and Sabanés Bové (2016) in estimating the survival curve for each individual in a study.
4. Simulation Results.
To investigate the performance of the proposed model selection procedure, we applied our method to simulated datasets. To design different simulation cases we followed the guidance of Morris, White and Crowther (2019) as a basis for our simulation study protocol. In particular, the simulation design was based on the ADEMP structure (Aims, Data generating mechanism, Estimands, Methods, and Performance measures) discussed in that article. We refer to each of those elements as we explain different parts of the simulation design in the following.
Regarding ‘Methods’, we compared the performance of our algorithm to ISIS-SCAD (Fan et al., 2010) and GLMNET (Friedman, Hastie and Tibshirani, 2010), two of the most highly used algorithms for high dimensional variable selection for survival data. We used the published R packages of those two methods to run the simulations. We also performed a comparison with a case when pMOM priors are used as the prior for nonzero coefficients instead of piMOM priors.
The ‘Aim’ of the simulation study is to compare the performance of our method with the other two methods with respect to the correlation structure between covariates in the design matrix. More specifically, we reported three different simulation settings that consider different combinations of correlation structure, true model size, and the magnitude of true coefficients. This is the basis of our ‘Data generating mechanism’. The correlation structures used in those settings are similar to the simulations reported in Fan et al. (2010).
For Case 1, X1,…,Xp are multivariate Gaussian random variables with mean 0 and marginal variance of 1. The correlation structure is corr(Xi, X5) = 0 for all i ≠ 4, 5, , and corr(Xi, Xj) = 0.5 for i, j ∈ {1,…, p} \ {4, 5}. The size of the true model is 5 with non-zero regression coefficients β1 = 0.8, β2 = −0.8, β3 = 0.8, β4 = −2.2, β5 = 0.66 and βi = 0 for i > 5. The number of observations and covariates are n = 400 and p = 1000. The censoring rate for this case is approximately 27.6%. The survival and censoring times are both sampled from an exponential distribution. The rate parameter for the distribution of censoring times is set to 0.1.
For Case 2, X1,…, Xp are multivariate Gaussian random variables with mean 0 and marginal variance of 1. The correlation structure between variables is corr(Xi, Xj) = 0.5; i ≠ j. The size of the true model is 6 with nonzero regression coefficients β1 = −0.5140, β2 = 1.2799, β3 = −2.5307, β4 = 0.7164, β5 = 1.3020, β6 = −0.8833, and βi = 0 for i > 6. The number of observations and covariates are n = 400 and p = 1000. In this case, the survival times are sampled from a Weibull distribution with rate parameter λ = 0.1 and shape parameter k = 15. The censoring times are sampled uniformly from [0, 8], and the resulting censoring rate for this case is approximately 14.8%.
For Case 3, the design matrix and correlation structure between variables are the same as in Case 2, where corr(Xi, Xj) = 0.5, i ≠ j. The size of the true model is 20 with non-zero regression coefficients β1 = −1.6702, β2 = −1.7122, β3 = 0.2313, β4 = −1.3798, β5 = 0.9305, β6 = −1.3835, β7 = −1.8575, β8 = 1.0676, β9 = 1.2013, β10 = 0.5116, β11 = 0.9871, β12 = −0.7680, β13 = 0.5159, β14 = −0.2649, β15 = −1.7482, β16 = −1.6041, β17 = 1.4234, β18 = −1.2884, β19 = −1.1299, β20 = 1.2569, and βi = 0 for i > 20. The number of observations and covariates are n = 400 and p = 1000. The censoring rate for this case is approximately 34.1%. The survival and censoring times are both sampled from an exponential distribution. The rate parameter for the distribution of censoring times is set to 0.1.
Each simulation case is repeated 50 times (niter = 50), each time with a different random seed in order to generate a different dataset.
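A data generating step of this kind can be sketched with the inversion method of Bender, Augustin and Blettner (2005). The parametrization below is an assumption: it takes the Weibull baseline cumulative hazard to be λ·t^ν, so that the survival function given covariates is exp(−λ·t^ν·exp(xᵀβ)), and it uses uniform censoring on [0, 8] as in Case 2. The function names, toy design matrix and coefficient values are hypothetical.

```python
import math
import random

def weibull_cox_time(u, eta, lam=0.1, shape=15.0):
    """Invert S(t | x) = exp(-lam * t**shape * exp(eta)) at a uniform draw u."""
    return (-math.log(u) / (lam * math.exp(eta))) ** (1.0 / shape)

def simulate_case(X, beta, lam=0.1, shape=15.0, censor_max=8.0, seed=1):
    """Simulate observed times and status indicators for each row of X."""
    rng = random.Random(seed)
    y, delta = [], []
    for row in X:
        eta = sum(x * b for x, b in zip(row, beta))  # linear predictor
        u = 1.0 - rng.random()                       # in (0, 1], avoids log(0)
        t = weibull_cox_time(u, eta, lam, shape)     # survival time
        c = rng.uniform(0.0, censor_max)             # censoring time
        y.append(min(t, c))
        delta.append(1 if t <= c else 0)
    return y, delta

# Toy design matrix with two covariates (hypothetical values).
X = [[0.2, -1.0], [1.1, 0.3], [-0.4, 0.8]]
y, delta = simulate_case(X, beta=[1.0, -0.5])
t_check = weibull_cox_time(0.3, 0.5)
```

Setting `shape=1.0` recovers exponential survival times with rate λ·exp(xᵀβ), which covers the exponential settings as a special case of the same routine.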
The primary targets of our simulation study or the ‘Estimands’, according to Morris, White and Crowther (2019), are identifying the true model as well as estimating the vector of coefficients of the true model. Accordingly, we reported four different quantities as ‘Performance measures’ for those estimands. The first two quantities are the mean l1 norm of the error in estimating vector of coefficients, and the mean squared error (MSE). The mean l1 norm is computed as , and the MSE is computed as . The third quantity is the mean model size of the selected models and is denoted by MMS. MTP and MFP denote the mean true positive and mean false positive values, respectively, for each algorithm. Formal definitions of MFP, MTP are provided in Section C in the Appendix.
Table 1 compares the performance of our method, BVSNLP, with the default variants of the ISIS-SCAD and GLMNET algorithms. The λ parameter in GLMNET was picked by cross validation.
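The selection measures can be sketched as follows. The formal definitions are given in the Appendix, which is not reproduced here, so this sketch assumes that a true positive is a selected covariate that belongs to the true model and a false positive is a selected covariate that does not, with counts averaged over iterations. That reading is consistent with Table 1, where MMS equals MTP plus MFP (for example, 5.16 = 4.98 + 0.18). The function names and toy selections are hypothetical.

```python
def selection_counts(selected, true_model):
    """True and false positive counts for one selected model."""
    selected, true_model = set(selected), set(true_model)
    tp = len(selected & true_model)   # selected covariates that are truly active
    fp = len(selected - true_model)   # selected covariates that are inactive
    return tp, fp

def mean_selection_metrics(selected_models, true_model):
    """Average the counts and model sizes over simulation iterations."""
    n_iter = len(selected_models)
    counts = [selection_counts(s, true_model) for s in selected_models]
    mtp = sum(tp for tp, _ in counts) / n_iter
    mfp = sum(fp for _, fp in counts) / n_iter
    mms = sum(len(set(s)) for s in selected_models) / n_iter
    return {"MTP": mtp, "MFP": mfp, "MMS": mms}

# Three hypothetical iterations with a true model of size 5.
true_model = [1, 2, 3, 4, 5]
selections = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 7], [1, 2, 3, 4, 5, 9]]
metrics = mean_selection_metrics(selections, true_model)
```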
Table 1.
Comparison between BVSNLP, ISIS-SCAD and GLMNET for simulation Cases 1, 2, and 3 with n = 400 and p = 1000.
| | BVSNLP | ISIS-SCAD | GLMNET |
|---|---|---|---|
| Case 1: | | | |
| MSE | 0.185 | 0.367 | 2.314 |
| Mean l1 norm | 0.534 | 0.749 | 5.719 |
| MMS | 5.16 | 5.68 | 67.06 |
| MTP | 4.98 | 4.94 | 4 |
| MFP | 0.18 | 0.74 | 63.06 |
| Case 2: | | | |
| MSE | 0.058 | 0.268 | 0.983 |
| Mean l1 norm | 0.464 | 0.557 | 3.224 |
| MMS | 6 | 6.22 | 44.22 |
| MTP | 6 | 6 | 6 |
| MFP | 0 | 0.22 | 38.22 |
| Case 3: | | | |
| MSE | 0.302 | 1.520 | 2.627 |
| Mean l1 norm | 1.915 | 6.268 | 13.416 |
| MMS | 18.38 | 16 | 117.64 |
| MTP | 18.34 | 15.96 | 19.52 |
| MFP | 0.04 | 0.04 | 98.12 |
Table 2 compares the Monte Carlo standard errors (Morris, White and Crowther, 2019) of the MSEs for all three different methods.
Table 2.
Monte Carlo Standard Errors for the MSE of the coefficient vector for all three methods.
| | Case 1 | Case 2 | Case 3 |
|---|---|---|---|
| BVSNLP | 0.086 | 0.015 | 0.021 |
| ISIS-SCAD | 0.063 | 0.014 | 0.053 |
| GLMNET | 0.080 | 0.025 | 0.037 |
In the S5 algorithm, 30 iterations are used within each temperature. The parameter d was chosen as 2⌈log(p)⌉. As described in Section 3.2.1, d represents the number of candidate covariates that are added to the current model to make the addition set, Γ+. Each S5 algorithm was run in parallel on 120 CPUs for all three simulation cases. The beta-binomial prior was imposed on the model space with a = 1, b = p − a. The hyperparameters of the piMOM prior were selected using the algorithm discussed in Section 3.1 with α = 0.4 for Case 1 and α = 0.8 for Cases 2 and 3, imposed as the prior mode. Finally, the average run-time of the BVSNLP algorithm for the entire simulation is summarized in Table 3.
Table 3.
Average BVSNLP run time over 50 iterations in each simulation case using 120 CPUs.
| | Case 1 | Case 2 | Case 3 |
|---|---|---|---|
| Run time (seconds) | 21.5 | 19.0 | 23.7 |
As demonstrated in Table 1, our method performs better than the other two methods according to all selected metrics, regardless of the size of the true model. The difference between BVSNLP and ISIS-SCAD is best illustrated as the size of the true model increases. GLMNET has significantly higher mean false positive rates than the other two methods.
Figures 1, 2, and 3 compare the average IBS over 50 iterations between the methods discussed above. IBS is computed using the R package pec (Mogensen, Ishwaran and Gerds, 2012) based on a five-fold cross validation. A benchmark model based on the Kaplan-Meier estimate, which includes no covariates, is also added to the figures as a reference for the comparison. The average c-index measures for all the methods are also reported in Table 4. The c-index measures are computed based on the method discussed in van Houwelingen and Putter (2011), using the dynpred package in R. Because a new dataset was created at each iteration, it was not possible to obtain the average prediction errors: the time points at which prediction errors change differ across data sets.
Fig 1:

Average IBS for all methods in simulation case 1.
Fig 2:

Average IBS for all methods in simulation case 2.
Fig 3:

Average IBS for all methods in simulation case 3.
Table 4.
Average c-index measures over 50 iterations in each simulation case.
| | Case 1 | Case 2 | Case 3 |
|---|---|---|---|
| BVSNLP | 0.823 | 0.872 | 0.939 |
| ISIS-SCAD | 0.828 | 0.872 | 0.925 |
| GLMNET | 0.845 | 0.911 | 0.971 |
As shown in the IBS plots, all three methods perform better than the reference. BVSNLP and ISIS-SCAD have very similar performance. For Case 3, where the true model has 20 covariates, BVSNLP outperforms the other two methods, whereas in Case 2, GLMNET has the best performance. The c-index is similar for all methods, and seems to provide a smaller penalty for model size. This feature of the c-index is discussed further in Section 6.
4.1. Comparison With pMOM Prior.
Another nonlocal prior that might be considered as a potential candidate for the prior densities on the non-zero coefficients in model k is the product of independent MOM priors, or the pMOM densities (Johnson and Rossell, 2012), specified by
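$$\pi(\beta_k \mid \tau, r) = (2\pi)^{-k/2}\, \tau^{-rk - k/2}\, [(2r-1)!!]^{-k}\, \exp\left(-\frac{\beta_k^{T}\beta_k}{2\tau}\right) \prod_{i=1}^{k} \beta_i^{2r}. \tag{4.1}$$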
The hyperparameter τ has the same role as in piMOM densities in (2.8) and r is the order of the density. An example of a MOM prior for r = 1 and τ = 0.5 is depicted in Figure 4.
Following the discussion of nonlocal priors in Section 2.3, we again note that for r = 1 and a fixed τ, piMOM densities assign negligible probability to a wider region around zero than do pMOM densities. More specifically, pMOM densities decrease to zero at only an inverse polynomial rate, while piMOM densities decrease at a rate of order exp(−τ/β²), which is much faster. Consequently, smaller false positive rates are expected for procedures based on piMOM priors than for those based on pMOM priors. On the other hand, pMOM densities have tails that converge to zero at an exponential rate, while piMOM densities have heavier, Cauchy-like tails. Moreover, the consistency property of piMOM priors discussed previously for linear models does not hold for pMOM priors. For these reasons, piMOM-based procedures are more effective for variable selection in p ≫ n settings.
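The contrast between the two priors near zero and in the tails can be checked numerically. The sketch below assumes the standard single-coefficient forms of the two densities from Johnson and Rossell (2012), namely a pMOM density proportional to β^{2r} exp(−β²/(2τ)) and a piMOM density proportional to |β|^{−(r+1)} exp(−τ/β²); the function names are illustrative and are not part of the BVSNLP package.

```python
import math


def pmom_density(beta: float, tau: float, r: int = 1) -> float:
    """pMOM density for one coefficient (assumed standard form)."""
    # (2r-1)!! * tau^r is E[beta^(2r)] under N(0, tau), which normalizes the density.
    double_factorial = math.prod(range(1, 2 * r, 2))
    const = 1.0 / (double_factorial * tau ** r * math.sqrt(2.0 * math.pi * tau))
    return const * beta ** (2 * r) * math.exp(-beta ** 2 / (2.0 * tau))


def pimom_density(beta: float, tau: float, r: int = 1) -> float:
    """piMOM density for one coefficient (assumed standard form)."""
    if beta == 0.0:
        return 0.0  # limiting value of the density at the origin
    const = tau ** (r / 2.0) / math.gamma(r / 2.0)
    return const * abs(beta) ** (-(r + 1)) * math.exp(-tau / beta ** 2)


if __name__ == "__main__":
    tau = 0.5
    for beta in (0.05, 0.1, 0.5, 1.0, 5.0, 20.0):
        print(beta, pmom_density(beta, tau), pimom_density(beta, tau))
```

Under these assumed forms, the piMOM density is many orders of magnitude smaller than the pMOM density for small |β| and larger for large |β|, which is the behavior described above.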
To better demonstrate the practical importance of these properties, we performed 20 simulation studies where the number of observations and covariates were n = 200 and p = 10,000. The true model had size 6 with true coefficients equal to (0.5, 0.85, 1.00, 1.50, 1.85, 2.5). The signs of the coefficients were chosen randomly with probability 0.5 in each simulation. The columns of the design matrix were multivariate Gaussian random variables with mean 0 and marginal variance of 1. The correlation between every two variables was 0.5. In each of the simulations, we fixed r = 1 and assigned τ 15 different values in the interval [0.01, 10]. The survival times were simulated from an exponential distribution with mean 10.
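The covariate part of this design can be sketched as follows. The text does not say how the equicorrelated Gaussian columns were generated, so the shared-factor construction below is an assumption, as are the function names; p is reduced from 10,000 to 50 to keep the sketch fast, and the τ grid is copied from Table 5. Survival times are not simulated here because the exact mechanism is not fully specified in this section.

```python
import math
import random


def equicorrelated_design(n, p, rho, rng):
    """n x p matrix with N(0, 1) margins and pairwise correlation rho.

    Assumed construction: x_ij = sqrt(rho) * z_i + sqrt(1 - rho) * e_ij,
    where z_i is shared by all columns of row i.
    """
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        rows.append([
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)
        ])
    return rows


def signed_coefficients(magnitudes, rng):
    """Attach a random sign (probability 0.5 each) to every magnitude."""
    return [m if rng.random() < 0.5 else -m for m in magnitudes]


rng = random.Random(1)
X = equicorrelated_design(n=200, p=50, rho=0.5, rng=rng)  # p = 10,000 in the paper
beta_true = signed_coefficients([0.5, 0.85, 1.00, 1.50, 1.85, 2.5], rng)
# The 15 values of tau listed in Table 5.
tau_grid = [0.01, 0.2, 0.4, 0.6, 0.8, 1.0, 1.25, 1.6,
            2.0, 2.3, 3.8, 5.4, 6.9, 8.4, 10.0]
```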
The proposed variable selection algorithm was implemented on the simulation data using both pMOM and piMOM priors. Table 5 summarizes the outcome of the selection procedure using different hyperparameter values for both priors. The numbers are averaged over 20 simulations. In that table, MTPR is the mean true positive rate, MFPR is the mean false positive rate and TMP is the proportion of times that the true model was found without any false positives.
Table 5.
Comparison of variable selection outcomes between pMOM and piMOM for different values of the hyperparameter τ, over 20 simulations.
| Hyperparameter τ | MTPR (%), piMOM | MTPR (%), pMOM | MFPR (%), piMOM | MFPR (%), pMOM | TMP, piMOM | TMP, pMOM |
|---|---|---|---|---|---|---|
| 0.01 | 100 | 22.50 | 0 | 0.15 | 1 | 0 |
| 0.2 | 100 | 21.67 | 0 | 0.12 | 0.9 | 0 |
| 0.4 | 99.17 | 21.67 | 0.01 | 0.12 | 0.9 | 0 |
| 0.6 | 99.17 | 20.83 | 0.01 | 0.12 | 0.9 | 0 |
| 0.8 | 97.50 | 20.83 | 0.01 | 0.12 | 0.85 | 0 |
| 1.0 | 95 | 20.83 | 0 | 0.12 | 0.70 | 0 |
| 1.25 | 95 | 20.83 | 0 | 0.12 | 0.70 | 0 |
| 1.6 | 94.16 | 20 | 0 | 0.12 | 0.65 | 0 |
| 2.0 | 93.33 | 20 | 0 | 0.12 | 0.60 | 0 |
| 2.3 | 93.33 | 19.17 | 0 | 0.12 | 0.60 | 0 |
| 3.8 | 90.83 | 18.33 | 0 | 0.12 | 0.45 | 0 |
| 5.4 | 87.50 | 18.33 | 0 | 0.12 | 0.35 | 0 |
| 6.9 | 85 | 18.33 | 0 | 0.12 | 0.30 | 0 |
| 8.4 | 81.67 | 18.33 | 0 | 0.12 | 0.20 | 0 |
| 10.0 | 81.67 | 18.33 | 0 | 0.12 | 0.20 | 0 |
As shown in Table 5, the pMOM model never finds the true model for any of the τ values. Moreover, the average true positive rate for pMOM is roughly four to five times lower than that for piMOM at every value of τ, and the average false positive rate for pMOM is higher than it is for piMOM. This suggests variable selection based on piMOM priors in ultrahigh dimensional settings is likely to perform better than variable selection based on pMOM priors.
5. Application to Real Data.
We applied our method to select genes associated with patient survival times for two common cancer types using datasets from The Cancer Genome Atlas (TCGA) projects: kidney renal clear cell carcinoma (KIRC) (Cancer Genome Atlas Research Network, 2013) and kidney renal papillary cell carcinoma (KIRP) (Cancer Genome Atlas Research Network, 2016). We compared the performance of our algorithm to ISIS-SCAD (Fan et al., 2010), GLMNET (Friedman, Hastie and Tibshirani, 2010) and Stability Selection (Meinshausen and Bühlmann, 2010). Stability Selection is combined with a high dimensional selection algorithm such as GLMNET and selects the most stable features for a given level of Type I error.
We included patient’s ‘Age’, ‘Gender’ and a clinical stage variable, ‘Stage’, in the design matrix. On the advice of a clinician, the ‘Stage’ variable was developed by combining the histological stage, pathological stage and clinical stage, into one variable that is a summary of how advanced each subject’s cancer was when the tissue sample was taken. ‘Stage’, like ‘Gender’, is a categorical variable but with 3 levels, where ‘Stage i’ represents the ith class of that variable; ‘Stage 3’ represents the most advanced stage.
To remove stromal contamination from the gene expression data, the DeMixT algorithm (Wang et al., 2017) was applied to the design matrix, and the tumor-specific expression data were used in the analyses for all algorithms.
The predictive performance was measured by a time-dependent AUC, as discussed in Section 3.3, based on a five-fold cross-validation. The observations in each fold were randomly chosen under a constraint which balanced censoring rate between folds. The AUC values were computed for the test set using the model that was obtained by performing variable selection on the training set. The selected covariates for each cancer type were also compared. For our method, we report the covariates associated with the highest posterior probability model. The hyperparameter τ of the piMOM prior was selected using the algorithm in Section 3.1 with α = 0.1 as the maximum of the piMOM prior. This is our choice of α for real datasets. (In simulation data analyses, α was chosen to be 0.8 since the magnitude of non-zero coefficients in the dataset designed by Fan and Li (2002) were unrealistically large for unit variance covariates). The results for each cancer type are discussed in separate sections below. Note that GLMNET has a random output when the hyperparameter is selected by cross-validation. As a result, based on the recommendation of the inventors of that algorithm, we ran it 100 times for each fold and took the average of results as the outcome for that fold.
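One simple way to obtain folds with a balanced censoring rate is to assign censored and uncensored observations to folds separately. The sketch below is an assumption about how such a constraint could be implemented, not a description of the authors' code; the function name and arguments are hypothetical.

```python
import random


def censoring_balanced_folds(status, n_folds=5, seed=0):
    """Assign each observation to a fold, stratifying on censoring status.

    status[i] is 1 if the event was observed and 0 if the time is censored.
    Returns a list with the fold index (0 .. n_folds - 1) of each observation.
    """
    rng = random.Random(seed)
    folds = [0] * len(status)
    for value in (0, 1):
        # Shuffle the members of one stratum and deal them out round-robin.
        members = [i for i, s in enumerate(status) if s == value]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[i] = position % n_folds
    return folds


if __name__ == "__main__":
    # Illustrative status vector: 328 censored and 162 observed events.
    status = [0] * 328 + [1] * 162
    folds = censoring_balanced_folds(status)
    for f in range(5):
        in_fold = [s for s, g in zip(status, folds) if g == f]
        print(f, len(in_fold), 1.0 - sum(in_fold) / len(in_fold))
```

Because each stratum is dealt out evenly, the censoring rate in every fold differs from the overall rate only through rounding.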
We treated the categorical variables ‘Stage’ and ‘Gender’, as well as the continuous variable ‘Age’, as fixed covariates in our model. However, the available ISIS-SCAD and Stability Selection software packages are not able to force preselected covariates into all models. For this reason, dummy variables associated with ‘Stage’ and ‘Gender’ were manually added to the design matrix and were subject to selection for those two methods.
To run the Stability Selection method, we used the c060 R package (Sill et al., 2014) and the recommended values for function arguments.
5.1. Kidney Renal Clear Cell Carcinoma (KIRC).
The KIRC dataset (Cancer Genome Atlas Research Network, 2013) contains 490 observations with 13,267 covariates, after removing covariates with missing expressions and observations with missing survival times. The censoring rate for this dataset is 66.94%. Table 6 shows the covariates selected by each method. As mentioned previously, GLMNET produces random outputs at each run and therefore, for this table, only the output of one of the runs is shown; other runs produced a similar number of selected covariates.
Table 6.
Selected genes and covariates for KIRC across different variable selection algorithms
| Method | Selected covariates |
|---|---|
| BVSNLP | Age, Gender, Stage, SUDS3, AR |
| ISIS-SCAD | Stage 3, AR, Age, HEBP1, ATP2C1, GADD45A, MTERF2, ADGRL3, GPSM1, SERPINI1, SP6, ZNF815P, INAFM2 |
| GLMNET | Stage, AR, Age, HEBP1, SEC61A2, TRMT6, PCBP4, FAHD2A, MCM8, E2F5, SLC5A6, NARF, RAB28, DONSON, GPSM1, HACD1, MARS, FASN, TRAIP, RPL17P50, SLC26A6, GPR162, INAFM2, ACACA |
| Stability Selection | Stage 3, AR, Age, INAFM2 |
In addition to the categorical covariate ‘Stage’, BVSNLP selects ‘AR’ and ‘SUDS3’ in the HPPM as the most significant covariates in the design matrix. The posterior inclusion probabilities for ‘AR’ and ‘SUDS3’ are 0.80 and 0.08, respectively. ‘Age’, ‘Gender’, and ‘Stage’ were fixed in all models and thus were selected with probability 1. The MAP estimates for the coefficients of ‘Age’, ‘Gender Male’, ‘Stage 2’, ‘Stage 3’, ‘AR’ and ‘SUDS3’ were 0.33, −0.11, 0.45, 1.61, −0.60 and 0.36, respectively. These coefficients indicate that patients with the most advanced stages of cancer had the poorest survival rates, and that a patient whose tumor sample was characterized as advanced had a hazard rate exp(1.61) ≈ 5 times higher than a patient whose tumor sample was characterized as localized, when all other covariates were the same. These coefficients also show that the hazard rate in females is 1.12 times that in males, and that age has an unfavorable impact on the hazard rate, as expected. Moreover, the negative sign for the ‘AR’ gene indicates it has a favorable impact on survival for KIRC. ‘AR’, the Androgen Receptor gene, functions as a steroid-hormone activated transcription factor. It has been well documented that ‘AR’ promotes the progression of renal cell carcinoma (RCC) through hypoxia-inducible factors HIF-2α and vascular endothelial growth factor regulation (Fenner, 2016). The favorable impact of the ‘AR’ gene was also studied by Hata et al. (2017) in bladder cancer. ‘SUDS3’ is a regulatory protein that is part of the SIN3A corepressor complex and potentially has a role in tumor suppressor pathways through regulation of apoptosis. There was previous evidence of the down-regulation of the SIN3A gene in tumorigenesis of lung cancer (Suzuki et al., 2008).
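The hazard ratios quoted above follow directly from exponentiating the reported MAP coefficients, since under the Cox model a unit change in a covariate multiplies the hazard by exp(β). A short check of the arithmetic (the dictionary keys are labels chosen here for illustration):

```python
import math

# MAP coefficient estimates reported for the KIRC highest posterior probability model.
map_coefficients = {
    "Age": 0.33,
    "Gender Male": -0.11,
    "Stage 2": 0.45,
    "Stage 3": 1.61,
    "AR": -0.60,
    "SUDS3": 0.36,
}

# Hazard ratio per unit increase (or versus the reference level) is exp(beta).
hazard_ratios = {name: math.exp(b) for name, b in map_coefficients.items()}

# Male versus female is exp(-0.11); the female-to-male ratio is its reciprocal.
female_vs_male = 1.0 / hazard_ratios["Gender Male"]

if __name__ == "__main__":
    for name, hr in hazard_ratios.items():
        print(f"{name}: {hr:.2f}")
    print(f"Female vs. male: {female_vs_male:.2f}")
```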
It is noteworthy that the algorithm selected the same highest posterior probability model for different values of the hyperparameter τ in the range [0.01, 0.9], where there were no constraints on the modes of the piMOM prior. This shows the robustness of the proposed variable selection algorithm to the choice of hyperparameter τ for a range of plausible values.
For this particular run of GLMNET, a much larger model was selected with 24 variables including two of the variables reported by BVSNLP. ISIS-SCAD selected 13 covariates, which included the 4 covariates that were selected by the Stability Selection method. ‘AR’ and the last level of ‘Stage’ are the common covariates among all methods.
The time dependent AUC plot for all four methods, obtained by performing a five-fold cross validation, is depicted in Figure 5.
Fig 5:

Average AUC of different variable selection methods based on a five fold cross validation for KIRC dataset.
As illustrated in Figure 5, BVSNLP has slightly better predictive accuracy than GLMNET and Stability Selection. However, it achieves this accuracy with a much sparser model. We investigated the covariates that were selected by each of those algorithms in all five folds and found that BVSNLP, in addition to those fixed covariates, selects only 10 unique genes in total, where ‘AR’ is selected in three of the 5 folds.
GLMNET selected 160 different covariates across all 5 folds. Only five out of 24 selected covariates in Table 6 were selected in all five training datasets in cross validation. Those include ‘Age’, ‘Stage’ and ‘AR’. GLMNET was run 100 times for each fold.
ISIS-SCAD selected 45 different covariates and only ‘Stage 3’ was selected in all training datasets in cross validation. The Stability Selection method selected sparser models compared to ISIS-SCAD and GLMNET by selecting 13 different covariates. It picked only ‘Age’ and ‘Stage 3’ in all five folds.
Figures 6 and 7 compare IBS and prediction error curves, respectively, between different methods for the KIRC dataset. These two measures are computed based on a five-fold cross validation. Computation of IBS and prediction error was done using the R package pec (Mogensen, Ishwaran and Gerds, 2012). A benchmark model based on the Kaplan-Meier estimate, which includes no covariates, was also added to the figures as a reference for the comparison. The c-index measures are also reported in Table 7. The c-index was computed as it was in Section 4 using the dynpred package in R.
Fig 6:

IBS comparison between all methods for the KIRC dataset.
Fig 7:

Comparison of prediction errors between all methods for the KIRC dataset.
Table 7.
Average c-index measure of different methods for the KIRC dataset.
| | BVSNLP | GLMNET | ISIS-SCAD | Stability Selection |
|---|---|---|---|---|
| c-index measure | 0.804 | 0.816 | 0.846 | 0.797 |
GLMNET has almost the same IBS curve as the reference Kaplan-Meier curve. BVSNLP outperforms ISIS-SCAD, and Stability Selection has the best IBS performance among all methods. For prediction error curves, BVSNLP is second to ISIS-SCAD, and GLMNET and Stability Selection have almost the same performance. A different behavior can be seen for c-index measures, where GLMNET and ISIS-SCAD have higher c-indices than BVSNLP.
The average run-time for different methods in each fold of the cross validation is summarized in Table 8. BVSNLP is run on 120 CPUs, Stability Selection is run on 4 CPUs, while GLMNET and ISIS-SCAD are run on a single CPU.
Table 8.
Average run time for different methods in each fold of the cross validation for KIRC data set.
| | BVSNLP | GLMNET | ISIS-SCAD | Stability Selection |
|---|---|---|---|---|
| Run time (minutes) | 6.4 | 180 | 5.0 | 1.3 |
In our previous study of binary outcomes using the same dataset (Nikooienejad, Wang and Johnson, 2016), we performed hierarchical clustering on the de-convolved tumor-specific expression matrix and identified two clusters of patient samples. We saw that these two groups of patients presented significantly different survival outcomes and therefore assigned good vs. bad survival to the groups. The dichotomization was done solely based on the clustering results of de-convolved gene expression levels. Survival times and censoring did not play any role in that process. However, there was a loss of information in dichotomizing a survival dataset and analyzing it with logistic regression. Now, with BVSNLP, we are able to use the original time to event with censoring information. To further compare the biological implications between the two analyses, we looked for known expression regulation networks between the gene sets found in the binary analysis, SAV1 and NUMBL, and the new genes found in this analysis, AR and SUDS3, using Pathway Studio® (Nikitin et al., 2003; Elsevier, 2018). We found that the well-studied cancer genes TGFB1, BCL2, PPARG, NEDD4, and CTNNB1, and a regulatory microRNA, MIR21, constitute the shortest paths between SAV1 and AR. Similarly, we found that the cancer genes CDKN1A and WNT3A, two genes that determine cell fate (SOX17 (connected with CTNNB1) and NANOG), and PAX6, which regulates transcription, constitute the shortest paths between NUMBL and SUDS3. These are depicted in Figure 8. These findings suggest a high biological consistency between our two analyses, using BVSNLP to select features for binary and survival outcomes.
Fig 8:

Expression regulation networks connecting the old and new gene sets. a) This diagram shows all genes that are in the shortest pathways through the expression regulation between SAV1 and AR. b) This diagram shows all genes in the shortest pathways through expression regulation between NUMBL and SUDS3.
In summary, the binary model using SAV1 and NUMBL to predict overall survival of patients with kidney cancer is not as effective as the model using AR and SUDS3, as shown in Figure 9. Thus, although the findings of Nikooienejad, Wang and Johnson (2016) were all biologically justified, some limitations were associated with those findings due to the information loss incurred by clustering and dichotomizing the data, and the BVSNLP model provides better insight into the genes associated with this cancer type.
Fig 9:

Comparison between BVSNLP model selection using survival and dichotomized versions of the KIRC dataset.
5.2. Kidney Renal Papillary Cell Carcinoma (KIRP).
The KIRP dataset (Cancer Genome Atlas Research Network, 2016) contains 244 samples with 13,335 covariates (after necessary data cleaning) and has a fairly high censoring rate of 85.7%. The covariates selected by each method are summarized in Table 9.
Table 9.
Selected covariates for KIRP across different variable selection algorithms
| Method | Selected covariates |
|---|---|
| BVSNLP | Age, Gender, Stage, CDK1 |
| ISIS-SCAD | CDK1, COL6A1, C19orf33 |
| GLMNET | No covariates were selected |
| Stability Selection | Stage 3, MTC02P12, RPL39P3 |
In addition to the fixed covariates ‘Age’, ‘Gender’, and ‘Stage’, BVSNLP selects the ‘CDK1’ gene in the HPPM as the most significant covariate in the design matrix. The posterior inclusion probability for ‘CDK1’ was 0.12. The MAP estimates for the coefficients of ‘Age’, ‘Gender Male’, ‘Stage 2’, ‘Stage 3’ and ‘CDK1’ were 0.12, −0.10, 0.11, 0.79 and 1.13, respectively. This shows that a unit increase in ‘CDK1’ (Cyclin dependent kinase 1) gene expression increases the hazard rate by a factor of exp(1.13) ≈ 3, for given values of the other covariates. CDK1 is a cell cycle regulator and has been reported previously as a prognostic marker gene for various cancer types. Many experimental studies have been performed to further understand the molecular mechanism behind the complex functions of CDK1 (Malumbres and Barbacid, 2009). This is the first time, however, that CDK1 has been reported as a prognostic marker gene in human data for papillary renal cell carcinoma. As expected, patients at the most advanced stage of cancer have a hazard rate that is exp(0.79) ≈ 2.2 times higher than patients at a localized stage of cancer, given the values of all other covariates. As in the case of KIRC patients, age and male gender have unfavorable and favorable impacts on the hazard rate, respectively.
Surprisingly, GLMNET does not select any covariates and ISIS-SCAD selects covariates that do not intersect BVSNLP. Stability Selection picked 3 covariates, with only ‘Stage 3’ in common with BVSNLP. As in the previous dataset, we tested BVSNLP for different choices of τ in the interval [0.01, 0.9] and the same model was selected for all values within this range. The total run-time of BVSNLP for this dataset was around 5 minutes using 120 CPUs.
Figure 12 shows the predictive accuracy for the proposed method based on a five-fold cross-validation. The outcomes for GLMNET, ISIS-SCAD and Stability Selection are not displayed in the plot because those methods did not converge or failed to produce results for at least one of the five folds in the cross-validation experiment.
Fig 12:

Average AUC of BVSNLP based on a five fold cross validation for the KIRP dataset.
The small AUC values in this plot for t < 1 warrant comment. Because there were few events soon after entry of tissue samples into the TCGA database, the AUC for early timepoints falls close to the 50% benchmark reflecting no predictive value.
Figures 10 and 11 respectively depict the IBS and prediction error curves of the BVSNLP method, based on a five-fold cross validation for the KIRP dataset, and compare them to the reference curve obtained by the Kaplan-Meier method.
Fig 10:

IBS of BVSNLP for the KIRP dataset.
Fig 11:

Prediction error of BVSNLP for the KIRP dataset.
The c-index measure for the BVSNLP method is 0.876. The average run-time for BVSNLP in each fold of the cross validation was 3.6 minutes on 120 CPUs.
6. Discussion.
In this article a Bayesian variable selection method, BVSNLP, was proposed for selecting variables in high and ultrahigh dimensional datasets with survival times as outcomes. BVSNLP uses an inverse moment nonlocal prior density on non-zero regression coefficients. Analyses of simulated and real data suggest that BVSNLP performs comparably to or better than other existing methods for variable selection for survival data. Moreover, the real data results indicated that the proposed algorithm is robust to the choice of the hyperparameter τ in the piMOM prior for values of τ in the range [0.01, 0.9].
Various outputs are provided by the algorithm. These include the HPPM, MPM and the posterior inclusion probability for each covariate in the model. For real datasets, Bayesian model averaging is used to incorporate uncertainty in selected models when computing time dependent AUC plots using Uno’s method (Uno et al., 2007). Finally, an R package named BVSNLP has been implemented to make the algorithm freely available and adaptable to interested researchers. The package can be run in parallel fashion where hundreds of CPUs can be exploited in order to increase the number of visited models in the search for highest posterior probability model. The BVSNLP package is available in the R repository, CRAN, at https://CRAN.R-project.org/package=BVSNLP. The user manual for the package is also available from this site.
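The relationship between these outputs can be illustrated with a toy set of visited models. The sketch below is not the BVSNLP implementation; it assumes the posterior probabilities of the visited models are already available, uses hypothetical covariate names, and takes the median probability model (MPM) to be the set of covariates with posterior inclusion probability of at least 0.5, following Barbieri and Berger (2004).

```python
def summarize_models(model_probs):
    """Summarize visited models.

    model_probs maps a frozenset of covariate names to a posterior probability.
    Probabilities are renormalized over the visited models.
    Returns (HPPM, MPM, posterior inclusion probabilities).
    """
    total = sum(model_probs.values())
    normalized = {model: p / total for model, p in model_probs.items()}

    # Highest posterior probability model.
    hppm = max(normalized, key=normalized.get)

    # Inclusion probability of a covariate: total mass of models containing it.
    inclusion = {}
    for model, prob in normalized.items():
        for covariate in model:
            inclusion[covariate] = inclusion.get(covariate, 0.0) + prob

    # Median probability model: covariates with inclusion probability >= 0.5.
    mpm = frozenset(c for c, q in inclusion.items() if q >= 0.5)
    return hppm, mpm, inclusion


# Hypothetical example in which the HPPM and the MPM differ.
visited = {
    frozenset({"x1", "x2"}): 0.4,
    frozenset({"x1", "x3"}): 0.3,
    frozenset({"x2", "x3"}): 0.3,
}
hppm, mpm, inclusion = summarize_models(visited)
```

In this toy example no single model contains all three covariates, yet each covariate appears in models carrying more than half of the posterior mass, so the MPM is larger than the HPPM.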
Two real cancer genomic datasets from the TCGA website were considered in this article. Compared to other methods, BVSNLP found sparser models with biologically relevant genes. The proposed method showed a reliable predictive accuracy as measured by AUC using substantially fewer variables.
We have based our assessments on time dependent AUC and biological interpretation of the results, but other measures, like IBS, prediction error and the concordance index (also known as the c-index or Harrell’s c-index), are also reported. Difficulties associated with such measures are identified in Blanche, Kattan and Gerds (2018). In particular, the authors of that article demonstrate that the concordance index can favor misspecified models over the correctly specified model because it is based on the order of event times rather than the event status at the prediction horizon. This may explain the slightly higher c-index values for GLMNET in both simulation and real data sets. The time dependent AUC does not suffer from this deficiency. Of course, different evaluation criteria can be expected to result in different rankings of models, and criteria that emphasize prediction error over low false positive rates can be expected to favor larger models. Similarly, criteria that place a higher premium on eliminating false positives will tend to select smaller models.
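The dependence of the concordance index on the ordering of event times alone is visible in its definition. The following is a minimal sketch of Harrell's c-index for right-censored data; it is not the dynpred implementation used for the reported values, which may handle ties and censoring differently, and the function name is illustrative.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i has an observed event and
    subject j is still under observation after that time (times[j] > times[i]).
    The pair is concordant when the subject failing earlier has the higher
    risk score; ties in the risk score count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored time cannot anchor a comparable pair
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```

Any monotone transformation of the risk scores leaves this quantity unchanged, so it carries no information about calibration at a fixed prediction horizon.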
Acknowledgments
Supported by NIH grant R01CA158113.
Supported by 1R01CA174206, 1R01CA183793, 5R01CA158113 and P30CA016672.
APPENDIX A: CALCULATING THE GRADIENT AND HESSIAN OF g(βK)
Let l(y; βk) = log(π(y | βk)) and lπ (βk) = log(π(βk)). For a n × p matrix A, let A(i) denote the n × 1 vector corresponding to the ith column of A and Aj denote the 1 × p vector corresponding to the ith row of A. Also let , where Ai:n,. is the sub-matrix of A from row i to the last row where all columns are included. This makes the dimension of equal to p × (n − i + 1). Similarly, for a vector α of size n, let denote the sub-vector of α components i, i + 1,…, n, a vector of size (n − i + 1).
Let and . Also let η denote the n × 1 column vector exp{Xkβk}. The logarithm of π(y | βk) in (2.2) can then be expressed as
| (A.1) |
For each n × k design matrix Xk and βk vector, define a new k × n matrix , with ith column
| (A.2) |
Here, and are obtained from matrix Xk and vector η, respectively, using the notation described in the beginning of this section.
The negative gradient of l(y; βk) can then be written as
| (A.3) |
To compute the Hessian matrix, let be the (i, j) element of . The k × k identity matrix is denoted by Ik and D(α) denotes a diagonal matrix with the elements of the vector α on its diagonal. Finally, let ζj = Xk(j) denote the jth column of Xk.
Row j of the k × k Hessian matrix of −l(y; βk) is defined as
| (A.4) |
The matrix itself is constructed row by row, with row i equal to
| (A.5) |
Computing the Hessian can be implemented with a computational complexity of O(n).
The gradient and Hessian of the logarithm of the piMOM prior are more straightforward. The gradient is given by
| (A.6) |
The Hessian of −lπ(βk) is a diagonal matrix, D(α), where
| (A.7) |
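As an illustration of the prior terms, the sketch below assumes the piMOM density has the standard product form with each factor proportional to |βi|^{−(r+1)} exp(−τ/βi²) (Johnson and Rossell, 2012). Under that assumption the negative log prior has gradient components (r + 1)/βi − 2τ/βi³ and a diagonal Hessian with entries 6τ/βi⁴ − (r + 1)/βi². These expressions are derived here from the assumed form and may differ in notation from (A.6) and (A.7); the function names are hypothetical.

```python
import math


def neg_log_pimom(beta, tau, r=1):
    """Negative log of the assumed product piMOM density at the vector beta."""
    k = len(beta)
    const = k * (math.lgamma(r / 2.0) - (r / 2.0) * math.log(tau))
    return const + sum((r + 1) * math.log(abs(b)) + tau / b ** 2 for b in beta)


def neg_log_pimom_gradient(beta, tau, r=1):
    """Gradient of the negative log prior, one component per coefficient."""
    return [(r + 1) / b - 2.0 * tau / b ** 3 for b in beta]


def neg_log_pimom_hessian_diag(beta, tau, r=1):
    """Diagonal of the Hessian of the negative log prior (off-diagonals are zero)."""
    return [6.0 * tau / b ** 4 - (r + 1) / b ** 2 for b in beta]


if __name__ == "__main__":
    beta, tau = [0.8, -1.3], 0.5
    print(neg_log_pimom_gradient(beta, tau))
    print(neg_log_pimom_hessian_diag(beta, tau))
```

The gradient vanishes at βi = ±sqrt(2τ/(r + 1)), which are the two modes of each factor of the assumed density.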
APPENDIX B: ESTIMATING INDIVIDUAL SURVIVAL CURVES
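In the Cox proportional hazard model the survival function for individual i under model k is defined as
$$S_i(t \mid \beta_k) = \exp\left\{-H_0(t)\, \exp\left(x_{ik}^{T}\beta_k\right)\right\}, \tag{B.1}$$
where $H_0(t)$ is the cumulative baseline hazard function, which can be estimated by
$$\hat{H}_0(t) = \sum_{i:\, t_i \le t} \frac{\delta_i}{\sum_{j \in R(t_i)} \exp\left(x_{jk}^{T}\beta_k\right)}, \tag{B.2}$$
with $\delta_i$ the event indicator of individual i and $R(t_i) = \{j : t_j \ge t_i\}$ the risk set at time $t_i$. This is known as the Breslow estimator of $H_0(t)$ (the observed times are sorted as in (2.2)). At this point three approaches can be exploited to estimate the survival curve for individual i. The first approach is to compute the HPPM survival curve by replacing model k with the HPPM and using the MAP estimate of β under the HPPM in (B.1) and (B.2). That is,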
| (B.3) |
The second approach is computationally more intensive but takes into account the uncertainty of the posterior samples of the model space. In this approach, samples from the posterior distribution of the survival function are generated by replacing k in (B.1) with every posterior sample of the model space. The estimated survival curve is then obtained by taking the average of the posterior samples. That is,
| (B.4) |
where is the number of posterior samples.
The third approach is to use Bayesian model averaging. As discussed in the previous section, we use Occam’s window where only the models with posterior probability of at least are used in model averaging. Suppose models fall in Occam’s window. Then
| (B.5) |
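The ingredients of these three approaches can be sketched in a few lines. The code below assumes the usual Breslow form of the cumulative baseline hazard and, for model averaging, weights that are the posterior model probabilities renormalized over the models retained in Occam's window; both are stated here as assumptions, and all function names are hypothetical.

```python
import math


def breslow_cumulative_hazard(times, events, linear_predictors, t):
    """Breslow estimate of the cumulative baseline hazard at time t.

    linear_predictors[j] is x_j' beta for subject j under a given model.
    """
    total = 0.0
    for t_i, d_i in zip(times, events):
        if d_i and t_i <= t:
            # Sum of exp(x_j' beta) over subjects still at risk at time t_i.
            risk_sum = sum(
                math.exp(lp)
                for t_j, lp in zip(times, linear_predictors)
                if t_j >= t_i
            )
            total += 1.0 / risk_sum
    return total


def survival_probability(cumulative_baseline_hazard, linear_predictor):
    """S(t) = exp(-H0(t) * exp(x' beta)) under the Cox model."""
    return math.exp(-cumulative_baseline_hazard * math.exp(linear_predictor))


def model_averaged_survival(survival_values, model_probs):
    """Average per-model survival probabilities with renormalized weights."""
    total = sum(model_probs)
    return sum(p * s for p, s in zip(model_probs, survival_values)) / total
```

With all linear predictors equal to zero, `breslow_cumulative_hazard` reduces to the Nelson-Aalen estimator, which gives a quick sanity check of an implementation.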
APPENDIX C: DEFINITIONS OF MTP, MFP AND P
Let Si be the set of all covariates selected as the final model by the method at iteration i. Also let k be the set of covariates in the true model.
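Define
$$TP_i = |S_i \cap k|, \qquad FP_i = |S_i \setminus k|, \tag{C.1}$$
where |A| denotes the cardinality of set A and A \ B denotes the set-minus operation.
Following the definitions above, MTP and MFP are obtained as follows:
$$MTP = \frac{1}{m} \sum_{i=1}^{m} TP_i, \qquad MFP = \frac{1}{m} \sum_{i=1}^{m} FP_i, \tag{C.2}$$
where m is the total number of iterations.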
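These summaries are straightforward to compute from the selected sets. The sketch below assumes that the true positive and false positive counts at iteration i are the sizes of the intersection and set difference of Si with the true model, and that P is the proportion of iterations in which the selected set equals the true model exactly; the function name is illustrative.

```python
def selection_summary(selected_sets, true_model):
    """Mean true positives, mean false positives and exact-recovery proportion."""
    truth = set(true_model)
    m = len(selected_sets)
    true_positives = [len(set(s) & truth) for s in selected_sets]
    false_positives = [len(set(s) - truth) for s in selected_sets]
    exact = sum(1 for s in selected_sets if set(s) == truth)
    return {
        "MTP": sum(true_positives) / m,
        "MFP": sum(false_positives) / m,
        "P": exact / m,
    }


if __name__ == "__main__":
    # Three hypothetical iterations with true model {1, 2, 3}.
    print(selection_summary([{1, 2, 3}, {1, 2}, {1, 2, 3, 7}], {1, 2, 3}))
```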
APPENDIX D: DISCUSSION ON THE PARAMETERS OF THE S5 ALGORITHM
It should be noted that the final model is the one with the highest posterior probability out of all visited models, collectively obtained from 120 S5 procedures with different starting models. Thus, the main objective in our algorithm is to increase the number of visited models. This is the first attempt towards reducing the sensitivity of finding the highest posterior probability model (HPPM) to the parameters of the S5 algorithm.
There are two important parameters in the S5 algorithm: the temperature vector for the annealing schedule, and the number of iterations at each temperature. We use 10 equally spaced temperature values decreasing from 3 to 1, where at temperature tl, the posterior probability is raised to the power of 1/tl. Values of t < 1 increase the posterior probability to unreasonably large values, making the search susceptible to being trapped in local extremes and thus reducing the number of visited models. Therefore, 1.0 is the lowest chosen temperature for the annealing schedule. Values higher than 3, on the other hand, were found empirically to not improve the performance of the algorithm because it then visited too high a proportion of models with comparatively low posterior probability.
The other parameter, the number of iterations at each temperature, can be chosen by the user in the R package. Theoretically, the higher the number of iterations, the more models will be visited. For the analyses in this paper, the number of iterations was chosen to be 30. This choice was based on a sensitivity analysis performed on simulation data for numbers of iterations ranging from 10 to 50, where we investigated the identification of the HPPM and the number of visited models. The details of this experiment follow.
We defined a simulation batch as 50 different datasets that were generated with the same settings as Case 3 of the simulations discussed in Section 4, but with a true model size of 6 and coefficients equal to β1 = −1.5140, β2 = 1.2799, β3 = −1.5307, β4 = 1.5164, β5 = −1.3020, β6 = 1.5833, and βi = 0 for i > 6. BVSNLP was run on each dataset to recover the simulation truth. For each simulation batch, in addition to the average number of visited models, the proportion of times (out of 50) that the algorithm selected the true model, without any false positives or false negatives, was also recorded. The niter parameter of the S5 algorithm was varied for each simulation batch, ranging from 10 to 50 in increments of 5.
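The effect of the temperature on the posterior probabilities can be illustrated as follows. The sketch builds the schedule described above and raises a set of model probabilities to the power 1/t, working on the log scale; how S5 uses the tempered probabilities to propose moves is not reproduced here, and the function names are hypothetical.

```python
import math


def temperature_schedule(start=3.0, stop=1.0, count=10):
    """Equally spaced temperatures decreasing from start to stop."""
    step = (start - stop) / (count - 1)
    return [start - i * step for i in range(count)]


def tempered_weights(log_posteriors, temperature):
    """Normalized weights proportional to posterior ** (1 / temperature)."""
    scaled = [lp / temperature for lp in log_posteriors]
    top = max(scaled)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - top) for s in scaled]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    log_posteriors = [0.0, -1.0, -3.0]  # illustrative values only
    for t in temperature_schedule():
        print(round(t, 3), [round(w, 3) for w in tempered_weights(log_posteriors, t)])
```

Higher temperatures flatten the weights, so low-probability models are visited more often; at t = 1 the weights equal the normalized posterior probabilities.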
The outcome of the sensitivity analysis for this parameter of the S5 algorithm is summarized in Figure 13 for the average number of visited models, and Table 10 for the proportion of times, P, the true model was found with no false positives or negatives.
These results suggest that the S5 algorithm’s performance in finding the true model was not significantly affected by the parameter niter, and that the number of visited unique models changed by only 1.73%, from an average of 18,674.78 unique models at 10 iterations per temperature to 18,997.64 unique models at 50 iterations per temperature. This experiment suggests that the BVSNLP algorithm is relatively insensitive to the parameters of the S5 stochastic search algorithm, at least within the range of values considered in this study.
Fig 13:

Number of visited unique models by BVSNLP for different iterations in S5 algorithm, for simulation Case 3.
Table 10.
Proportion of times the true model is found for different iterations in S5 algorithm, for simulation Case 3.
| niter | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 |
|---|---|---|---|---|---|---|---|---|---|
| P | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
REFERENCES
- Antoniadis A, Fryzlewicz P and Letué F (2010). The Dantzig selector in Cox’s proportional hazards model. Scandinavian Journal of Statistics 37 531–552.
- Barbieri MM, Berger JO et al. (2004). Optimal predictive model selection. The Annals of Statistics 32 870–897.
- Bender R, Augustin T and Blettner M (2005). Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine 24 1713–1723.
- Berger JO, Liseo B, Wolpert RL et al. (1999). Integrated likelihood methods for eliminating nuisance parameters. Statistical Science 14 1–28.
- Blanche P, Kattan MW and Gerds TA (2018). The c-index is not proper for the evaluation of t-year predicted risks. Biostatistics.
- Cox DR (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological) 34 187–220.
- Cox DR and Oakes D (1984). Analysis of survival data 21 CRC Press.
- Elsevier (2018). PathwayStudio® pathwaystudio.com Elsevier Inc.
- Fan J and Li R (2002). Variable selection for Cox’s proportional hazards model and frailty model. Annals of Statistics 74–99.
- Fan J and Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 849–911.
- Fan J, Feng Y, Wu Y et al. (2010). High-dimensional variable selection for Cox’s proportional hazards model. In Borrowing Strength: Theory Powering Applications-A Festschrift for Lawrence D. Brown 70–86. Institute of Mathematical Statistics.
- Faraggi D and Simon R (1998). Bayesian variable selection method for censored survival data. Biometrics 1475–1485.
- Fenner A (2016). Kidney cancer: AR promotes RCC via lncRNA interaction. Nature Reviews Urology 13 242.
- Friedman J, Hastie T and Tibshirani R (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33 1–22.
- George EI and McCulloch RE (1997). Approaches for Bayesian variable selection. Statistica Sinica 7 339–373.
- Gerds TA and Schumacher M (2006). Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biometrical Journal 48 1029–1040.
- Ghosh J (1988). Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu. Lect. Notes in Statist 45.
- Harrell FE Jr, Califf RM, Pryor DB, Lee KL, Rosati RA et al. (1982). Evaluating the yield of medical tests. JAMA 247 2543–2546.
- Hata S, Ise K, Azmahani A, Konosu-Fukaya S, McNamara KM, Fujishima F, Shimada K, Mitsuzuka K, Arai Y, Sasano H et al. (2017). Expression of AR, 5αR1 and 5αR2 in bladder urothelial carcinoma and relationship to clinicopathological factors. Life Sciences 190 15–20.
- Heagerty PJ, Lumley T and Pepe MS (2000). Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56 337–344.
- Held L, Gravestock I and Sabanés Bové D (2016). Objective Bayesian model selection for Cox regression. Statistics in Medicine 35 5376–5390.
- Ibrahim JG, Chen M-H and MacEachern SN (1999). Bayesian variable selection for proportional hazards models. Canadian Journal of Statistics 27 701–717.
- Johnson VE (2005). Bayes factors based on test statistics. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 689–701.
- Johnson VE and Rossell D (2010). On the use of non-local prior densities in Bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 143–170.
- Johnson VE and Rossell D (2012). Bayesian Model Selection in High-Dimensional Settings. Journal of the American Statistical Association 107 649–660.
- Kalbfleisch J and Prentice R (1980). The statistical analysis of failure time data. John Wiley & Sons.
- Liu DC and Nocedal J (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 503–528.
- Malumbres M and Barbacid M (2009). Cell cycle, CDKs and cancer: a changing paradigm. Nature Reviews Cancer 9 153.
- Meinshausen N and Bühlmann P (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 417–473.
- Mogensen UB, Ishwaran H and Gerds TA (2012). Evaluating Random Forests for Survival Analysis Using Prediction Error Curves. Journal of Statistical Software 50 1–23.
- Morris TP, White IR and Crowther MJ (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine 38 2074–2102.
- Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499 43–49.
- Cancer Genome Atlas Research Network (2016). Comprehensive molecular characterization of papillary renal-cell carcinoma. New England Journal of Medicine 374 135–145.
- Nikitin A, Egorov S, Daraselia N and Mazo I (2003). Pathway studio—the analysis and navigation of molecular networks. Bioinformatics 19 2155–2157.
- Nikooienejad A, Wang W and Johnson VE (2016). Bayesian variable selection for binary outcomes in high-dimensional genomic studies using non-local priors. Bioinformatics 32 1338–1345.
- Scott JG, Berger JO et al. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics 38 2587–2619.
- Sha N, Tadesse MG and Vannucci M (2006). Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics 22 2262–2268.
- Shin M, Bhattacharya A and Johnson VE (2018). Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. Statistica Sinica 28 1053–1078.
- Sill M, Hielscher T, Becker N and Zucknick M (2014). c060: Extended Inference with Lasso and Elastic-Net Regularized Cox and Generalized Linear Models. Journal of Statistical Software 62 1–22.
- Suzuki H, Ouchida M, Yamamoto H, Yano M, Toyooka S, Aoe M, Shimizu N, Date H and Shimizu K (2008). Decreased expression of the SIN3A gene, a candidate tumor suppressor located at the prevalent allelic loss region 15q23 in non-small cell lung cancer. Lung Cancer 59 24–31.
- Tibshirani R et al. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine 16 385–395.
- Uno H, Cai T, Tian L and Wei L (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102 527–537.
- van Houwelingen H and Putter H (2011). Dynamic prediction in clinical survival analysis. CRC Press.
- Wang Z, Morris JS, Cao S, Ahn J, Liu R, Tyekucheva S, Li B, Lu W, Tang X, Wistuba II et al. (2017). Transcriptome Deconvolution of Heterogeneous Tumor Samples with Immune Infiltration. bioRxiv 146795.
- Zhang HH and Lu W (2007). Adaptive Lasso for Cox’s proportional hazards model. Biometrika 94 691–703.
