Summary
This article considers the problem of selecting predictors of time to an event from a high-dimensional set of candidate predictors using data from multiple studies. As an alternative to the current multistage testing approaches, we propose to model the study-to-study heterogeneity explicitly using a hierarchical model to borrow strength. Our method incorporates censored data through an accelerated failure time model. Using a carefully formulated prior specification, we develop a fast approach to predictor selection and shrinkage estimation for high-dimensional predictors. For model fitting, we develop a Monte Carlo expectation maximization (MC-EM) algorithm to accommodate censored data. The proposed approach, which is related to the relevance vector machine (RVM), relies on maximum a posteriori estimation to rapidly obtain a sparse estimate. As with the typical RVM, there is an intrinsic thresholding property in which unimportant predictors tend to have their coefficients shrunk to zero. We compare our method with some commonly used procedures through simulation studies. We also illustrate the method using the gene expression barcode data from three breast cancer studies.
Keywords: Accelerated failure time, Expectation maximization (EM) algorithm, Lasso, Maximum a posteriori (MAP) estimation, Meta-analysis, Relevance vector machine, Shrinkage
1. Introduction
In modern biomedical research, it has become routine to encounter problems involving massive numbers of predictors, with gene expression data as one example. Often, interest focuses on identifying important predictors of an event time, such as patient survival following cancer treatment. Because the sample size from any single study is typically insufficient to allow accurate selection of important predictors, there has been increased emphasis in recent years on borrowing of strength across data from multiple studies. Different studies are often conducted by different labs and may involve varying platforms and event definitions. These differences lead to study-to-study heterogeneity, which must be accommodated in statistical analysis. This article focuses on the problem of flexibly borrowing strength across studies in selecting predictors of an event time from a massive number of candidates.
Variable selection is typically of interest, even when prediction is the focus, since one may get better insight into the biological mechanisms by reducing the dimensionality of the predictive model. For variable selection using data from a single study, a broad variety of methods has been developed for prediction based on large numbers of predictors. Bovelstad et al. (2007) provided a recent review of the literature in this area, while also comparing predictive performance for different methods. They concluded that ridge regression had the best performance in terms of prediction for their data sets, with shrinkage outperforming simple variable selection methods, such as univariate selection or forward selection. Unlike ridge regression, Lasso (Tibshirani, 1996, 1997; Zhang and Lu, 2007) results in simultaneous shrinkage and variable selection, as many of the coefficients will be estimated to be zero. An alternative to Lasso, which also has this property and is widely used in the machine learning community, is the relevance vector machine (RVM; Tipping, 2001).
In considering generalizations of these approaches to accommodate data from multiple studies, an important factor is computational speed. When data are available from several studies and for thousands of genes, standard Bayes methods of posterior computation that rely on Markov chain Monte Carlo (MCMC) algorithms are very time consuming to implement. In fact, for large numbers of predictors, the model space is too enormous for stochastic search variable selection algorithms (George and McCulloch, 1997; Hans, Dobra, and West, 2007) to converge. Hence, as a pragmatic approach, it is useful to consider fast alternatives to MCMC based on maximum a posteriori (MAP) estimation. The Lasso and RVM procedures both have a Bayesian interpretation as MAP estimates, with the Lasso placing a double exponential prior on the coefficients, while the RVM uses an improper t prior with zero degrees of freedom. In addition, computational speed can be improved by focusing on normal linear regression models.
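To fix ideas, the two estimators can be summarized as MAP criteria under different priors; the display below is a schematic restatement (with λ a generic penalty parameter), not a formula from the cited papers:

$$
\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta}\ \log L(t \mid \beta) + \sum_{k}\log \pi(\beta_k),
\qquad
\pi_{\mathrm{Lasso}}(\beta_k) \propto e^{-\lambda|\beta_k|},
\qquad
\pi_{\mathrm{RVM}}(\beta_k) \propto |\beta_k|^{-1},
$$

so the Lasso contributes an ℓ1 penalty, while the RVM's improper t prior with zero degrees of freedom contributes a −log |βk| penalty whose unbounded spike at zero drives small coefficients exactly to zero.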
In the presence of multiple studies, a common approach for gene selection is to conduct independent analysis of each dataset, and then examine the intersection of the genes selected (see, e.g., Chan et al., 2008). An alternative is to pool the data, and conduct a single analysis ignoring heterogeneity among the studies. Recently, there has been increased emphasis on multistage designs, which identify a subset of candidate genes in an initial study and then validate these genes in the subsequent studies. Refer to Beckly et al. (2008) for a recent example of this approach. Both the independent analyses and multistage approaches fail to borrow strength across the studies. The multistage method fails to detect genes that are very highly significant in the second stage study, but were not significant enough at the first stage to be maintained after false discovery rate control. The same problem arises in the independent analyses approach, which requires genes to be significant in each study. If one uses a union of genes from the different studies instead of an intersection, the false discovery rate will be increased and important genes may still be missed.
To address these problems, one would ideally simultaneously analyze the data from the different studies, while accommodating heterogeneity. Motivated by the very different problem of borrowing information across related signals in performing signal reconstruction from compressive sensing measurements, Ji, Dunson, and Carin (2009) proposed a multitask RVM (MT-RVM). The MT-RVM approach incorporates dependence in the selection of basis functions for related signals, and can potentially be used directly to incorporate dependence in variable selection across studies. However, the method proposed by Ji et al. (2009) does not account for censoring.
To adapt the variable selection models from normal linear regression settings to the analysis of censored data, one can use an accelerated failure time (AFT) model (Buckley and James, 1979; Kalbfleisch and Prentice, 1980; Koul, Susarla, and Van Ryzin, 1981). For example, Datta, Le-Rademacher, and Datta (2007) used an AFT model to predict patient survival from microarray data for a single study, with partial least squares (PLS) and Lasso used for estimation and imputation used to account for censoring. They concluded that Lasso had better performance. Wang et al. (2008) instead related high-dimensional genomic data to survival outcomes using a semi-parametric AFT model, with a doubly penalized Buckley–James method used for estimation. The approach utilized an elastic net penalty (Zou and Hastie, 2005), which is a hybrid of ridge regression and Lasso. To our knowledge, there are no methods currently available for formally combining data from multiple studies in conducting fast high-dimensional variable selection for survival outcomes.
In this article, we propose a multistudy AFT model, which accounts for heterogeneity among studies. The study-specific coefficients for the different genes are assigned carefully chosen hierarchical t priors. Expressing the t prior as a scale mixture of normals following West (1987) leads to a Gamma prior for gene-specific precision parameters. Taking an approach related to that of Ji et al. (2009), we propose to utilize the same gene-specific precision parameters for the different studies in order to borrow information. This specification allows the gene-specific coefficients to vary across studies, while including dependence in the degree of shrinkage toward zero. To allow censoring, a Monte Carlo expectation maximization (MC-EM) algorithm is developed for simultaneous variable selection and coefficient estimation. The proposed approach, which we refer to as hierarchical RVM with censoring (HRVM-C), produces sparse estimates of the gene-specific coefficients, with many of the coefficients set to zero.
The remainder of this article is organized as follows. We first give a brief review of the MT-RVM in Section 2 and then introduce HRVM-C in Section 3, which also discusses the computational details. In Section 4, we present results from simulation studies. Section 5 demonstrates our method with the gene expression barcode data from three breast cancer studies in Zilliox and Irizarry (2007). Section 6 concludes with a discussion.
2. MT-RVM
Consider S related studies. Let ni be the number of samples in the ith study (i = 1, …, S), ti j be the response of the jth subject in study i (j = 1, …, ni), and xi j k (k = 1, …, p) be the corresponding kth predictor variable. For simplicity, we set ti = (ti 1, …, ti ni)′, xi j = (xi j 1, …, xi j p)′, and Xi = (xi 1, …, xi ni)′. The MT-RVM model in Ji et al. (2009) can be represented as
(1)  ti j = x′i j βi + εi j,  εi j ~ N(0, α0i−1),  βi k | αk ~ N(0, αk−1),
where α0i is the precision of the errors and βi = (βi 1, …, βi p)′ is the coefficient vector for study i. To encourage sparsity and information sharing across studies, the MT-RVM further places independent Gamma priors on α = (α1, …, αp):
(2)  p(αk | c, d) = Ga(αk | c, d),  k = 1, …, p.
Similarly, Gamma priors are specified, independently, for α0i, p(α0i| a, b) = Ga(α0i| a, b).
The MAP estimate for α is defined as α̂MAP = arg maxα {Σk log p(αk | c, d) + log L(t; α)}, where log L(t; α) is the marginal log-likelihood under (1), obtained after integrating out βi and α0i with respect to their prior distributions. We have
(3)  log L(t; α) = Σi=1S {−(1/2) log |Bi| − (ni/2 + a) log(b + t′i Bi−1 ti/2)} + constant,
with Bi = Ini + Xi A−1 X′i, and A = diag(α1, …, αp). A recommended default choice, given in Ji et al. (2009), is a = b = c = d = 0. Under the default choice of the hyperparameters, the MAP estimate for α is equal to the maximum likelihood estimate (MLE).
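For concreteness, a minimal numerical sketch of evaluating (3) is given below (Python; the function name and code organization are ours, not taken from the authors' software):

```python
import numpy as np

def marginal_loglik(ts, Xs, alpha, a=0.0, b=0.0):
    """Evaluate the marginal log-likelihood (3), up to an additive constant.
    ts: list of response vectors t_i; Xs: list of n_i x p design matrices;
    alpha: length-p vector of precisions."""
    loglik = 0.0
    for t_i, X_i in zip(ts, Xs):
        n_i = len(t_i)
        B_i = np.eye(n_i) + (X_i / alpha) @ X_i.T   # B_i = I + X_i A^{-1} X_i'
        _, logdet = np.linalg.slogdet(B_i)
        quad = t_i @ np.linalg.solve(B_i, t_i)      # t_i' B_i^{-1} t_i
        loglik += -0.5 * logdet - (n_i / 2 + a) * np.log(b + 0.5 * quad)
    return loglik
```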
The values of the hyperparameters a, b, c, and d control the shape of the priors, with small values corresponding to distributions having a large spike at 0 and heavy right tails. Although the default choice of the hyperparameters will result in an improper posterior distribution, the MAP estimates exist and have a sparseness-favoring property in which many of the regression coefficients will be exactly zero, with such elements shared across the different studies. We focus on the standard default noninformative prior for the error variances, which lets a, b → 0. When prior information is available, this prior can be easily modified. Under the recommended choice, one obtains simultaneous variable selection across studies, while allowing heterogeneity. The value of α̂k reflects the importance of predictor k. In particular, for certain k, one obtains α̂k = ∞, which implies that β̂i k = 0 for all i. At the other extreme, for values of α̂k close to zero, substantial heterogeneity is allowed, with βi k and βi′k potentially very different for i ≠ i′.
Note that αk is closely related to the shrinkage factor. In fact, the conditional posterior distribution of βi can be written as p(βi | ti, α0i, α) = N(μi, Σi), where Σi = (α0i X′i Xi + A)−1, μi = α0i Σi X′i ti, and A = diag(α1, …, αp). When the value of αk is close to 0, the conditional posterior distribution of βi k is centered close to the MLE instead of being shrunk toward zero; this suggests that the kth variable is important. On the other hand, when the value of αk is ∞, the βi k are shrunk to 0, and the kth variable is excluded. In this sense, the value of αk controls the importance of the kth predictor, with the predictors associated with smaller values being more important.
The key to borrowing information is the use of common hyperprior variances, which occurs in the second hierarchy of the MT-RVM. To further elaborate on this point, we consider the following simplified case: ti j ~ N(μi, 1), μi ~ N(0, α−1), i = 1, …, S, j = 1, …, n. We can write the log-likelihood for α in terms of the sufficient statistics t̄i· = n−1 Σj ti j as ℓ(α) = −(S/2) log{(α + n)/(nα)} − {nα/2(α + n)} Σi t̄i·², up to an additive constant. Differentiating ℓ(α) with respect to α and setting the result to 0, we get the MLE α̂ = S/(Σi t̄i·² − S/n), if Σi t̄i·² > S/n; and α̂ = ∞, otherwise.
The estimating equation for α involves the sufficient statistics t̄i· from all studies. Further, if α̂ = ∞, it allows simultaneous variable selection by setting all μi to 0. Borrowing information also occurs in estimation of the coefficients. In our simple example, the posterior mean of μi is nt̄i·/(n + α̂), which has been shrunk toward the prior mean of zero.
In many applications, the goal of the selection process is to identify not only the predictors that are consistently important in all of the studies, but also those that are very significant in some of the studies. The example suggests that the MT-RVM is sensitive to both types of signals. A predictor will be selected whenever Σi t̄i·² > S/n, which includes the following two cases: (a) t̄i·² > 1/n for all i. This corresponds to the case in which the signal is present in every study; clearly α̂ < ∞, and thus the μi will be selected. (b) t̄i·² ≫ S/n for some i. This corresponds to the case in which the signal is very strong in some studies; again α̂ < ∞, and thus the μi will be selected.
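This selection rule is easy to verify numerically; the short simulation below (our own illustration, not part of the original analysis) recovers α̂ and the shrunken posterior means:

```python
import numpy as np

rng = np.random.default_rng(0)
S, n, alpha_true = 10, 25, 4.0                 # studies, subjects per study, true precision
mu = rng.normal(0.0, alpha_true ** -0.5, size=S)
t = rng.normal(mu[:, None], 1.0, size=(S, n))  # t_ij ~ N(mu_i, 1)

tbar = t.mean(axis=1)                          # sufficient statistics
ssq = np.sum(tbar ** 2)
alpha_hat = S / (ssq - S / n) if ssq > S / n else np.inf
post_mean = n * tbar / (n + alpha_hat)         # shrunk toward the prior mean 0
print(alpha_hat, post_mean[:3])
```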
3. HRVM-C
3.1 Formulation
Building on the MT-RVM approach, we propose an HRVM-C method for high-dimensional variable selection in meta-analysis of survival data. We first extend the AFT model (Wei, 1992) to a multistudy AFT model as follows. Denoting the log-failure time (log-survival time) for subject j in study i by ti j, we first model the log-failure time for each individual study by the AFT model as in (1), and then combine data from multiple studies by placing multivariate Student’s t distributions as the priors for the study-specific coefficients, induced by βi | α ~ N(0, A−1), where A = diag(α1, …, αp). For the precision parameters α, we specify Gamma priors as in (2). Following West (1987), we can express the multivariate Student’s t distribution as a scale mixture of normals, which leads to the MT-RVM model as in (1) and (2). The hyperparameters are set at the default values of the MT-RVM.
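In symbols, the scale-mixture construction is as follows (a schematic restatement of West, 1987, in the article's notation):

$$
\beta_{ik}\mid\alpha_k \sim N(0,\alpha_k^{-1}), \quad \alpha_k \sim \mathrm{Ga}(c,d)
\;\Longrightarrow\;
\beta_{ik}\sim t_{2c}\!\left(0,\, d/c\right),
$$

with the same αk shared by all studies i = 1, …, S, so that the coefficients (β1k, …, βSk) for gene k are dependent a priori and are shrunk toward zero as a group.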
For censored data, the log-likelihood for α in (3) no longer holds. Here, we focus on the case of right censoring; accounting for interval censoring would be straightforward using the same type of strategy. Denote the censored observation by yi j and the censoring indicator by δi j, that is, yi j = ti j if δi j = 1, and ti j > yi j if δi j = 0. We thus observe (yi, δi), for i = 1, …, S, where yi = (yi 1, …, yi ni)′ and δi = (δi 1, …, δi ni)′. In the rest of this article, we use ync,i and yc,i to denote the vectors of the noncensored and censored observations, respectively, for study i, and Xnc,i and Xc,i for the corresponding matrices of predictor variables. Setting ync = (y′nc,1, …, y′nc,S)′, yc = (y′c,1, …, y′c,S)′, and y = (ync, yc), the log-likelihood can be written as log L(y, δ; α) = log L(ync; α) + Σi,j: δi j=0 log ∫ti j > yi j p(ti j | xi j, α) dti j, where log L(ync; α) takes the form in (3) with (ti, Xi) replaced by (ync,i, Xnc,i), and p(ti j | xi j, α) is the marginal density of ti j implied by (1). The MAP estimate for α under censoring is then defined as
(4)  α̂MAP = arg maxα {Σk log p(αk | c, d) + log L(y, δ; α)}.
After obtaining α̂MAP, we keep variable k in the model as long as α̂k < ∞. In practice, some elements of α̂MAP may be large but not infinite, implying that the corresponding coefficients are very small but not exactly zero. However, the number of such large finite values is typically very small, so that the procedure tends to exhibit a thresholding behavior in which coefficients close to zero are shrunk exactly to zero.
3.2 MC-EM Algorithm for Censored Data
The optimization problem in (4) is a critical step of our method, and it is challenging due to the high dimensionality of α. We develop an MC-EM algorithm to solve this problem. This algorithm follows Wei and Tanner (1990) in implementing the intractable E-step using Monte Carlo integration. In addition, similar to Tipping and Faul (2003), we break the M-step into a series of alternating conditional maximization steps. Because of the dimensionality of the maximization problem, direct maximization is infeasible, but solving the high-dimensional maximization through a sequence of one-dimensional maximizations makes the computation tractable.
3.2.1 E-step
Let tc be the complete data associated with the censored observations yc and set t = (ync, tc). In the E-step, we treat the censored observations as missing data, and define
(5)  Q(α; α(h−1)) = ∫ log L(t; α) p(tc | ync, yc, α(h−1)) dtc + Σk log p(αk | c, d),
where L(t; α) is the likelihood for the complete data defined in (3) and p(tc | ync, yc, α(h−1)) is the posterior predictive distribution of the censored data given the MAP estimate of α from the previous iteration (details are provided in the Appendix). The posterior predictive distribution of the censored data is the product of conditionally independent truncated univariate t distributions, which is straightforward to sample from. Let t(1), …, t(M) be M independent draws of complete data generated by setting the uncensored log-survival times equal to the observed values and imputing the censored log-survival times from their conditional predictive distribution given α(h−1). We approximate the integral in Q(α; α(h−1)) by Monte Carlo integration,
(6)  Q(α; α(h−1)) ≈ M−1 Σm=1M log L(t(m); α) + Σk log p(αk | c, d),
where t(m) = (t1(m)′, …, tS(m)′)′ and ti(m) is the completed observation vector for the ith study in the mth imputed data set.
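A minimal sketch of one E-step imputation, assuming the truncated Student's t predictive distribution derived in the Appendix, is given below (Python; function and variable names are ours):

```python
import numpy as np
from scipy import stats

def impute_censored(y, delta, X, alpha, a=1e-4, b=1e-4, M=1000, rng=None):
    """Draw M imputations of the censored log-survival times for one study
    from the truncated-t posterior predictive (see the Appendix)."""
    rng = rng or np.random.default_rng()
    ync, Xnc = y[delta == 1], X[delta == 1]
    Xc, lower = X[delta == 0], y[delta == 0]
    Sigma = np.linalg.inv(Xnc.T @ Xnc + np.diag(alpha))
    mu = Sigma @ Xnc.T @ ync
    a_t = a + len(ync) / 2
    b_t = b + 0.5 * (ync @ ync - mu @ np.linalg.solve(Sigma, mu))
    df = 2 * a_t
    loc = Xc @ mu                                            # predictive locations
    scale = np.sqrt((b_t / a_t) * (1 + np.einsum('ij,jk,ik->i', Xc, Sigma, Xc)))
    # inverse-CDF sampling from a t distribution truncated to (y_ij, infinity)
    F_lo = stats.t.cdf((lower - loc) / scale, df)
    u = rng.uniform(F_lo, 1.0, size=(M, len(lower)))
    return loc + scale * stats.t.ppf(u, df)                  # M x (#censored) draws
```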
3.2.2 M-step
Our goal in the M-step is to update α by maximizing Q(α; α (h −1)), that is, α (h) = arg maxα Q(α; α (h −1)). Note that this is a very challenging maximization problem due to the high dimensionality. In order to simplify this high-dimensional maximization task, we propose to use an alternating conditional maximization approach, which only requires a sequence of one-dimensional conditional optimizations. In our experience, the steps are all simple and efficient to implement, and convergence has occurred rapidly in each of the cases we have considered.
The key is to consider the dependence of the target function on a single hyperparameter, say αk. Using results from linear algebra, we first write |Bi| and Bi−1 as |Bi| = |Bi,k| (αk + si,k)/αk and Bi−1 = Bi,k−1 − Bi,k−1 xi,k x′i,k Bi,k−1/(αk + si,k), where xi,k is the kth column of Xi and si,k = x′i,k Bi,k−1 xi,k. Here, Bi,k denotes the matrix after removing the contribution of xi,k from Bi. Plugging the above facts into (6), we have
(7)  Q(α; α(h−1)) = Q(α−k; α(h−1)) + ℓ(αk),  ℓ(αk) = (2M)−1 Σm=1M Σi=1S {log αk − log(αk + si,k) + (qi,k(m))²/(αk + si,k)},
where qi,k(m) = x′i,k Bi,k−1 ti(m), and α−k is the resulting vector of α after removing the kth component.
Differentiating Q(α; α(h−1)) with respect to αk and setting the result to zero does not yield a closed-form solution when S > 1. Assuming αk ≪ si,k, however, we obtain an estimate of αk that approximately maximizes the conditional log-likelihood with respect to the kth dimension,
(8)  α̂k = SM/[Σm=1M Σi=1S {(qi,k(m))²/si,k² − 1/si,k}], if Σm=1M Σi=1S {(qi,k(m))² − si,k}/si,k² > 0; and α̂k = ∞, otherwise.
In each iteration of the M-step, we first calculate α̂k and the conditional likelihood ℓ(α̂k) for all k, and then update the kth element of α that has the maximum ℓ(α̂k) among all k. Performing this local maximization iteratively for varying k until convergence, we obtain a simple and, in the cases we have considered, efficient algorithm for the M-step. In practice, we monitor convergence by specifying a threshold η2, and stop the M-step when the change in maxk ℓ(α̂k) is less than η2.
The proposed optimization strategy is a version of alternating conditional maximization, which is well known to converge to a local mode of the likelihood surface. In each iteration, the algorithm performs one of three operations: (a) if αk(h−1) < ∞ and the update sets α̂k = ∞, then βi k = 0 for all i, thereby removing the kth predictor from the model in all studies; (b) if αk(h−1) = ∞ and the update sets α̂k < ∞, then βi k ≠ 0 for all i, thereby adding the kth predictor to the model in all studies; (c) if both values are finite, we have simply re-estimated the hyperparameter value and the kth predictor remains in the model. From expression (8), it is clear that α̂k can take a value of exactly ∞, and each of the operations (a)–(c) is possible prior to convergence.
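The following sketch implements one conditional update of the form (8) (Python; our reconstruction of the update, with names of our choosing):

```python
import numpy as np

def update_alpha_k(s_k, q_k):
    """Conditional M-step update for alpha_k, in the spirit of (8).
    s_k: length-S array of s_{i,k} = x'_{i,k} B_{i,k}^{-1} x_{i,k};
    q_k: M x S array of q_{i,k}^(m) = x'_{i,k} B_{i,k}^{-1} t_i^(m).
    Returns np.inf when the stationary point is negative, which removes
    the kth predictor from all studies."""
    M, S = q_k.shape
    denom = np.sum(q_k ** 2 / s_k ** 2 - 1.0 / s_k)  # s_k broadcasts over imputations
    return S * M / denom if denom > 0 else np.inf
```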
We iterate between the E-step and the M-step until the MC-EM algorithm converges, which is judged to have occurred when the change in the maximized log-likelihood between iterations is less than the threshold η1. After convergence, the kth predictor is excluded from all studies if α̂k = ∞. If α̂k is finite but large, the coefficients for the kth predictor are shrunk strongly toward zero and thus tend to be small in all the studies. If α̂k is small, the coefficients for the kth predictor are shrunk less, and the model allows substantial heterogeneity in the kth predictor across studies.
4. Simulation Study
4.1 Comparison with Existing Methods
In this section, we assess the performance of HRVM-C and compare it with popular existing methods for variable selection in meta-analysis. Because HRVM-C is the only method that accounts for censoring, the comparison with the other methods is carried out in the setting of complete data. We then consider separately how HRVM-C performs with censored data.
We first consider the Group Lasso method (Grp-Lasso) of Yuan and Lin (2006), which is applied by stacking the studies into a single regression, t = diag(X1, …, XS) β + ε with β = (β′1, …, β′S)′, where for a given k we define the βi k (i = 1, …, S) as a group, that is, the regression coefficients in the different studies for a given gene. In addition, the following generic methods are considered for combining multiple studies in high-dimensional variable selection problems:
Fitting each study independently and then reporting the union of the selected predictor variables for each study (Ind).
Reducing the false positive rate by a multistage analysis (MSA), that is, considering only the predictor variables selected in the first study when analyzing the follow-up data sets.
Fitting with pooled data (Pool).
For each individual study, we consider the following variable selection techniques: RVM, Lasso, and ranking by p-values from simple linear regression models (Pvals). For the Pvals method, predictors are ranked according to their p-values, each obtained from a simple linear regression model including only that predictor. We thus consider 11 procedures: HRVM-C, Grp-Lasso, Ind-RVM, Ind-Lasso, Ind-Pvals, MSA-RVM, MSA-Lasso, MSA-Pvals, Pool-RVM, Pool-Lasso, and Pool-Pvals.
The simulation is set up to mimic the sample size of the real data in Section 5. In particular, we simulate three related studies with 226 subjects in study 1, 156 subjects in study 2, and 101 subjects in study 3. We fix p, the total number of predictors, at 1000 and then randomly choose p0 = 20 of them to be related to the survival time, with the regression coefficients simulated, independently, from a uniform U([−1, −0.1] ∪ [0.1, 1]) distribution.
In simulating the predictor variables for microarray data, we use the strategy of Gui and Li (2005) and Sha, Tadesse, and Vannucci (2006), which runs as follows. For a study with sample size n, we first draw an n × n matrix A from a uniform U(−1.5, 1.5) distribution and randomly choose p0 columns of A to be relevant to the survival time. The orthonormal basis of A, constructed by Gram–Schmidt orthonormalization, is then obtained as {ξ1, …, ξp0, ζ1, …, ζn−p0}, where {ξ1, …, ξp0} is an orthonormal basis for the p0 columns that are relevant to the survival time. Let T be a p0 × (n − p0) matrix such that the largest eigenvalue of T′T is ρ². By Cauchy’s inequality, for any vector in the linear space spanned by {ξ1, …, ξp0} and any vector in the linear space spanned by ζ + ξT, the correlation between them is at most ρ/√(1 + ρ²). We thus generate the remaining p − p0 variables, not relevant to the survival time, from the linear space ζ + ξT. The log-survival time is then generated using the AFT model according to (1), where εi j ~ N(0, 1).
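A compact sketch of this construction follows (Python; a simplified version of the recipe above, with implementation details that are ours):

```python
import numpy as np

def simulate_predictors(n, p, p0, rho, rng):
    """Build an n x p design whose p - p0 irrelevant columns have correlation
    at most rho / sqrt(1 + rho^2) with the p0 relevant ones."""
    A = rng.uniform(-1.5, 1.5, size=(n, n))
    Q, _ = np.linalg.qr(A)                    # Gram-Schmidt orthonormal basis
    xi, zeta = Q[:, :p0], Q[:, p0:]           # relevant part / orthogonal complement
    T = rng.normal(size=(p0, n - p0))
    top = np.linalg.norm(T, ord=2)            # sqrt of largest eigenvalue of T'T
    T *= rho / top if top > 0 else 0.0        # enforce largest eigenvalue rho^2
    W = zeta + xi @ T                         # spans the space of irrelevant predictors
    X_irr = W @ rng.normal(size=(n - p0, p - p0))
    return np.hstack([xi, X_irr])             # first p0 columns are relevant
```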
We choose the value of ρ such that the maximum correlation is controlled at 0 (low level), 0.5 (medium level), and 0.9 (high level). For each chosen ρ, we generate 100 data sets and select variables using each of the 11 methods. In fitting HRVM-C and RVM, we set the hyperparameters a, b, c, and d at their recommended values. The tuning parameter in Grp-Lasso is chosen as the maximal value of the penalty parameter in Group Lasso. For the thresholds required in the Lasso and Pvals methods, we utilize sliding thresholds that keep the most influential 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 predictors in each study. For a fair comparison, we also apply the same thresholds to the HRVM-C and RVM methods. The final set of predictors selected by Ind-RVM, Ind-Lasso, and Ind-Pvals is the union of the predictors that pass thresholding in the individual studies, while the predictors reported by MSA-RVM, MSA-Lasso, and MSA-Pvals are obtained by keeping the predictors that pass thresholding in the first-stage study and then restricting attention to those predictors in the subsequent analyses. Finally, in Pool-RVM, Pool-Lasso, and Pool-Pvals, we pool the data together and select the most influential predictors after thresholding.
To evaluate the performance of each procedure, we consider the true positive rate (TPR) and the false discovery rate (FDR), defined, respectively, as the ratio of the number of correctly identified predictors to the total number of truly active predictors, and the ratio of the number of falsely identified predictors to the total number of selected predictors.
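Equivalently, in code (our helper; `selected` and `truth` are index sets of selected and truly active predictors):

```python
def tpr_fdr(selected, truth):
    """True positive rate and false discovery rate, as defined above."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)
    tpr = tp / len(truth)
    fdr = (len(selected) - tp) / max(len(selected), 1)  # empty selection gives FDR 0
    return tpr, fdr
```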
The FDR is the error rate in variable selection and is typically controlled at a prespecified level. We report, in Table 1, the TPRs for the 11 approaches, averaged over the 100 simulations, with the FDR controlled at 0.05 for all methods. HRVM-C, with TPRs of about 0.97, clearly outperforms all the other methods. Table 1 also reflects the drawbacks of the other methods. First, Grp-Lasso tends to put a large penalty on the nonzero coefficients due to the large number of groups, leading to a final model that is too sparse; indeed, in our experience, Grp-Lasso identifies only one or two predictors each time, missing most of the active predictors, so its TPRs are very small. Second, in the independent methods, few predictors survive FDR control within each study, so these methods still miss important predictors. Third, the multistage approaches, which control the FDR in the first-stage study, miss important predictors that become significant only after the second-stage study. Finally, by pooling the data, we may miss predictors that are positively correlated with the response in one study but negatively correlated with the response in another. The advantage of HRVM-C over the other methods is consistent across all correlation levels considered.
Table 1. TPRs for the 11 procedures, averaged over 100 simulated data sets, with the FDR controlled at 0.05, under the three correlation levels

Method | Low | Medium | High
---|---|---|---
HRVM-C | 0.9745 | 0.9685 | 0.9725 |
Grp-Lasso | 0.0205 | 0.0195 | 0.0260 |
Ind-RVM | 0.7675 | 0.7430 | 0.7580 |
Ind-Lasso | 0.8515 | 0.8445 | 0.8495 |
Ind-Pvals | 0.8540 | 0.8485 | 0.8545 |
MSA-RVM | 0.7825 | 0.7435 | 0.7450 |
MSA-Lasso | 0.6495 | 0.6490 | 0.6495 |
MSA-Pvals | 0.6165 | 0.6125 | 0.6125 |
Pool-RVM | 0.7395 | 0.7150 | 0.7215 |
Pool-Lasso | 0.7400 | 0.7375 | 0.7380 |
Pool-Pvals | 0.7415 | 0.7360 | 0.7370 |
4.2 Censored Case
In this section, we conduct simulation studies to investigate the performance of HRVM-C in the presence of censoring. We first simulate the complete log-survival times ti j as in Section 4.1. We then generate the censored observations using the strategy described in Sha et al. (2006), as follows. Given the censoring rate λ, we first set the censoring indicators δi j to 0 for the first 100λ% of subjects in each study. If δi j = 0, we observe a censored value yi j with exp(yi j) ~ Uniform(0, exp(ti j)); otherwise, we observe the noncensored value ti j. This yields a right-censored data set. In this simulation, the total number of predictor variables p is fixed at 1000, 20 of which are related to the survival time (the same as in Section 4.1). For the censoring rate λ, we consider a low censoring rate λ = 0.1, a medium censoring rate λ = 0.5, and a high censoring rate λ = 0.9.
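In code, the censoring mechanism looks as follows (Python; our sketch of the Sha et al., 2006, recipe described above):

```python
import numpy as np

def censor(t, lam, rng):
    """Right-censor the log-survival times t of one study at rate lam."""
    n = len(t)
    delta = np.ones(n, dtype=int)
    n_cens = int(np.floor(lam * n))
    delta[:n_cens] = 0                         # first 100*lam% of subjects censored
    y = t.copy()
    u = rng.uniform(0.0, np.exp(t[:n_cens]))   # exp(y_ij) ~ Uniform(0, exp(t_ij))
    y[:n_cens] = np.log(u)
    return y, delta
```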
We apply HRVM-C to 100 simulated data sets. To avoid numerical problems, we choose a = 10−4 and b = 10−4. In the EM algorithm, we impute 1000 complete data sets within each E-step, and the two thresholds are chosen as η1 = 0.01 and η2 = 0.01. For each data set, we start the EM algorithm from the empty set, that is, with no predictors included in the model. After convergence, we rank the selected predictors based on the values of the shrinkage factors α̂k and apply the same sliding thresholds as in Section 4.1. In Table 2, we report the TPRs averaged over the 100 replications while controlling the FDR at 0.05. For the low censoring rate, the performance of HRVM-C, with TPRs of about 0.95, is very good at all correlation levels. As expected, the performance degrades as the censoring rate increases. Nevertheless, with TPRs around 0.75 for the medium censoring rate and around 0.5 for the high censoring rate, HRVM-C performs reasonably well when data are censored. Finally, in the last column of Table 2, we report the CPU time for one replicate under the varying censoring rates and correlation levels. The computation is inexpensive. We also note that the computational time decreases as the censoring rate gets higher; in our experience, this is because the M-step needs fewer iterations to converge when the censoring rate increases.
Table 2. TPRs and CPU times (one replicate) for HRVM-C under varying censoring rates and correlation levels, with the FDR controlled at 0.05
Censoring rate | Correlation | TPR | CPU time |
---|---|---|---|
Low | Low | 0.9555 | 183.2900 |
Medium | 0.9595 | 190.0400 | |
High | 0.9510 | 188.7900 | |
Medium | Low | 0.7490 | 98.2700 |
Medium | 0.7475 | 95.6500 | |
High | 0.7495 | 89.4200 | |
High | Low | 0.4895 | 25.8700 |
Medium | 0.4955 | 24.9900 | |
High | 0.4930 | 31.5000 |
In Table 3, we summarize the frequencies with which the 20 predictors related to the response are selected by HRVM-C. As expected, the predictors with larger coefficients are more likely to be selected. We also note that the results do not vary much as the maximum correlation between the related and unrelated predictors increases. This is an advantage of borrowing strength from all studies: two predictors that are highly correlated in one specific study are not necessarily correlated in the other studies, so by incorporating all of the information through hierarchical modeling, we are able to mitigate the collinearity issues present in any single study. Finally, as the censoring rate increases, predictors with smaller coefficients are less likely to be selected.
Table 3. Frequencies with which the 20 active predictors are selected by HRVM-C, by censoring rate λ and correlation parameter ρ

Coefficients | (λ, ρ) = (0.1, 0) | (0.1, 4/3) | (0.5, 0) | (0.5, 4/3) | (0.9, 0) | (0.9, 4/3)
---|---|---|---|---|---|---
(−0.97, −0.36, −0.30) | 0.96 | 0.94 | 0.83 | 0.87 | 0.34 | 0.41 |
(−0.35, −0.23, −0.81) | 1.00 | 1.00 | 1.00 | 0.99 | 0.50 | 0.44 |
(0.40, 0.50, 0.69) | 1.00 | 1.00 | 1.00 | 1.00 | 0.69 | 0.78 |
(0.61, −0.68, 0.39) | 1.00 | 0.99 | 0.86 | 0.91 | 0.56 | 0.40 |
(−0.59, −0.43, 0.78) | 1.00 | 1.00 | 1.00 | 1.00 | 0.83 | 0.84 |
(0.26, 0.23, 0.44) | 1.00 | 1.00 | 0.88 | 0.87 | 0.21 | 0.16 |
(−0.21, −0.33, −0.46) | 1.00 | 0.99 | 0.93 | 0.92 | 0.19 | 0.25 |
(0.35, −0.43, −0.11) | 0.76 | 0.68 | 0.55 | 0.51 | 0.11 | 0.12 |
(−0.48, −0.67, −0.90) | 1.00 | 1.00 | 1.00 | 1.00 | 0.89 | 0.90 |
(0.51, 0.16, 0.77) | 1.00 | 1.00 | 0.98 | 1.00 | 0.68 | 0.72 |
(−0.25, 0.64, 0.66) | 1.00 | 1.00 | 1.00 | 0.99 | 0.67 | 0.65 |
(−0.28, 0.25, 0.97) | 1.00 | 1.00 | 1.00 | 1.00 | 0.46 | 0.38 |
(0.95, −0.94, 0.81) | 1.00 | 1.00 | 1.00 | 1.00 | 0.90 | 0.88 |
(0.89, 0.32, −0.94) | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 0.96 |
(−0.66, −0.15, −0.25) | 0.90 | 0.92 | 0.63 | 0.80 | 0.26 | 0.29 |
(−0.59, 0.36, 0.74) | 1.00 | 1.00 | 1.00 | 1.00 | 0.73 | 0.84 |
(0.80, −0.65, 0.83) | 1.00 | 1.00 | 1.00 | 1.00 | 0.89 | 0.86 |
(0.64, 0.98, −0.96) | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 0.97 |
(−0.75, −0.16, −0.25) | 0.93 | 0.97 | 0.76 | 0.79 | 0.39 | 0.25 |
(−0.85, −0.48, −0.83) | 1.00 | 1.00 | 1.00 | 1.00 | 0.88 | 0.90 |
5. Analysis of the Gene Expression Barcode Data
We demonstrate our method with the gene expression barcode data in Zilliox and Irizarry (2007). The data consist of three breast cancer studies (Affymetrix HGU133A array), from Miller et al. (2005), Pawitan et al. (2005), and Sotiriou et al. (2006), that include patient survival data. There are 243 subjects in the first study, 156 in the second study, and 101 in the third study. In the first study, 52 patients are censored and 15 have missing data in the survival status. All observations in the second study are censored and in the third study, 61 observations are censored. Here, we focus on gene selection by HRVM-C, and hence remove the patients with missing data from consideration.
The gene expression profile consists of 22,215 genes. To remove variability in the gene expression profiles between studies, we use the gene barcode of the microarray data as our predictor variables. Many genes have the same barcode value (1 or 0) for all subjects; to avoid the identifiability issues arising from including such constant genes in the model, we eliminate them from consideration. This reduces the total number of genes to 11,879. Our goal is to select the genes that affect patient survival time.
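The preprocessing step is simple; a sketch follows (our helper, assuming the barcodes are stored as 0/1 matrices with one row per subject):

```python
import numpy as np

def filter_constant_genes(barcodes):
    """Drop genes whose barcode is identical for every subject in every study.
    barcodes: list of n_i x p arrays of 0/1 values, one per study."""
    stacked = np.vstack(barcodes)
    keep = stacked.min(axis=0) != stacked.max(axis=0)  # gene varies somewhere
    return [B[:, keep] for B in barcodes], keep
```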
We perform gene selection by HRVM-C for the gene barcode data, under the choice of a = 10−4, b = 10−4, N = 1000 imputed data sets, η1 = 0.001, and η2 = 0.001. In our experience, the results do not appear to be sensitive to the choices of a, b, and N. On the other hand, the choices of η1 and η2 are critical. These thresholds control how long the algorithm runs: values that are too large can stop the algorithm before it has converged, while smaller values make the algorithm run longer. We have tried a variety of values, and the ones chosen seem to provide a reasonable balance between convergence and running time. At the end of the algorithm, 34 genes are selected as related to the survival time.
Finally, we examine the biological relevance of the selected genes. Most of the selected genes are known to be cancer related. In Table 4, we list information on the genes selected by our method.
Table 4. The 34 genes selected by HRVM-C from the breast cancer barcode data

AffyID | Gene symbol | Description
---|---|---
201167 x at | ARHGDIA | Rho GDP dissociation inhibitor (GDI) alpha |
200969 at | SERP1 | Stress-associated endoplasmic reticulum protein 1 |
201280 s at | DAB2 | Disabled homolog 2, mitogen-responsive phosphoprotein (Drosophila) |
200008 s at | GDI2 | GDP dissociation inhibitor 2 |
200958 s at | SDCBP | Syndecan binding protein (syntenin) |
201341 at | ENC1 | Ectodermal-neural cortex (with BTB-like domain) |
201384 s at | NBR1 | Neighbor of BRCA1 gene 1 |
201399 s at | TRAM1 | Translocation associated membrane protein 1 |
160020 at | MMP14 | Matrix metallopeptidase 14 (membrane-inserted) |
200957 s at | SSRP1 | Structure specific recognition protein 1 |
201275 at | FDPS | Farnesyl diphosphate synthase |
200994 at | IPO | Importin 7 |
200835 s at | MAP4 | Microtubule-associated protein 4 |
200902 at | SEP15 | 15 kDa selenoprotein |
201404 x at | PSMB2 | Proteasome (prosome, macropain) subunit, beta type, 2 |
200626 s at | MATR3 | Matrin 3 |
200923 at | LGALS3BP | Lectin, galactoside-binding, soluble, 3 binding protein |
201087 at | PXN | Paxillin |
201040 at | GNAI2 | Guanine nucleotide binding protein (G protein), alpha inhibiting activity polypeptide 2 |
201264 at | COPE | Coatomer protein complex, subunit epsilon |
200744 s at | GNB1 | Guanine nucleotide binding protein (G protein), beta polypeptide 1 |
200672 x at | SPTBN1 | Spectrin, beta, non-erythrocytic 1 |
200914 x at | KTN1 | Kinectin 1 (kinesin receptor) |
200607 s at | RAD21 | RAD21 homolog (S. pombe) |
201091 s at | CBX3 | Chromobox homolog 3 (HP1 gamma homolog, Drosophila) |
200749 at | RAN | Member RAS oncogene family |
201316 at | PSMA2 | Proteasome (prosome, macropain) subunit, alpha type, 2 |
201343 at | Hs.693967 | Transcribed locus |
201041 s at | DUSP1 | Dual specificity phosphatase 1 |
201069 at | MMP2 | Matrix metallopeptidase 2 (gelatinase A, 72kDa gelatinase, 72kDa type IV collagenase) |
200962 at | RPL31 | Ribosomal protein L31 |
201129 at | SFRS7 | Splicing factor, arginine/serine-rich 7, 35kDa |
201291 s at | TOP2A | Topoisomerase (DNA) II alpha 170kDa |
200920 s at | BTG1 | B-cell translocation gene 1, anti-proliferative |
6. Discussion
In this article, we develop HRVM-C to combine multiple studies in high-dimensional variable selection. In contrast with commonly used approaches, our method systematically borrows information across the studies using an explicit overall statistical model. For model fitting with censored data, we develop an MC-EM algorithm that can be implemented quickly even in high dimensions. In simulation studies, our method is found to outperform existing approaches in the setting of complete data and to perform well with censored data. We demonstrate the usefulness of our method in a meta-analysis of multiple breast cancer studies.
HRVM-C provides a useful tool for dealing with censored data across multiple studies in high-dimensional variable selection problems. The MC-EM algorithm developed here can be easily extended to a Monte Carlo expectation conditional maximization (MC-ECM) algorithm in which the M-step is replaced by a conditional maximization (CM) step (Meng and Rubin, 1993). According to (8), αk can be (approximately) maximized conditional on the rest of the parameters, and therefore the MC-ECM algorithm has the potential to further reduce the computational expense. HRVM-C requires at least one noncensored observation in each study in order to impute the complete data sets (see the discussion in the Appendix). To overcome this limitation, one may consider utilizing a similar hierarchical structure within the Cox proportional hazards model. We will explore this extension in future work.
7. Supplementary Material
Further details and complete information needed to recapitulate the analyses reported are available at the Biometrics website http://www.biometrics.tibs.org. This includes the Matlab code to conduct the analysis and a brief readme on use of the code.
Acknowledgments
This research was supported in part by the Statistical and Applied Mathematical Sciences Institute (SAMSI) Summer 2008 research program on Meta-analysis: Synthesis and Appraisal of Multiple Sources of Empirical Evidence. The gene barcode data used in this article were kindly provided by Dr Rafael Irizarry and Dr Michael Zilliox. The authors gratefully acknowledge the many helpful comments received from the editor, the associate editor, and the two anonymous referees. Research of Fei Liu was partially supported by the University of Missouri-Columbia research board award. Research of Fei Zou was partially supported by NIH (R01GM074175). Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NIH.
Appendix
Imputing Missing Data
The conditional distribution for the censored observation can be written as
(A.1)  p(ti j | ync,i, δi j = 0, α) ∝ I(ti j > yi j) ∫∫ N(ti j | x′i j βi, α0i−1) p(βi, α0i | ync,i, α) dβi dα0i,
where I(·) is the indicator function of the set ti j > yi j, Σi = (X′nc,i Xnc,i + A)−1, μi = Σi X′nc,i ync,i, ã = a + nnc,i/2, and b̃ = b + (y′nc,i ync,i − μ′i Σi−1 μi)/2, with nnc,i denoting the number of noncensored observations in study i. Integrating out βi and α0i, we have p(ti j | ync,i, δi j = 0, α) ∝ St(ti j | x′i j μi, (b̃/ã)(1 + x′i j Σi xi j), 2ã) I(ti j > yi j). This is a truncated noncentral Student’s t distribution, with degrees of freedom 2ã, location parameter x′i j μi, and scale parameter (b̃/ã)(1 + x′i j Σi xi j). The distribution of tc is thus the product of these truncated univariate t distributions across the censored observations. At each E-step, given the current value of α, we obtain a complete data set by sampling the censored observations from this distribution.
Note that the above distribution is well defined only when the degrees of freedom 2ã = 2a + nnc,i > 0. With the default choice a → 0, this implies that there must be at least one noncensored observation in every study. For this reason, HRVM-C cannot incorporate studies in which all the observations are censored.
References
- Beckly J, Hancock L, Geremia A, Cummings J, Morris A, Cooney R, Pathan S, Guo C, Jewell D. Two-stage candidate gene study of chromosome 3p demonstrates an association between nonsynonymous variants in the MST1R gene and Crohn’s disease. Inflammatory Bowel Diseases. 2008;14:500–507. doi: 10.1002/ibd.20365.
- Bovelstad H, Nygard S, Storvold H, Aldrin M, Borgan O, Frigessi A, Lingjaerde O. Predicting survival from microarray data—a comparative study. Bioinformatics. 2007;23:2080–2087. doi: 10.1093/bioinformatics/btm305.
- Buckley J, James I. Linear regression with censored data. Biometrika. 1979;66:429–436.
- Chan S, Griffith O, Tai I, Jones S. Meta-analysis of colorectal cancer gene expression profiling studies identifies consistently reported candidate biomarkers. Cancer Epidemiology Biomarkers and Prevention. 2008;17:543–552. doi: 10.1158/1055-9965.EPI-07-2615.
- Datta S, Le-Rademacher J, Datta S. Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and lasso. Biometrics. 2007;63:259–271. doi: 10.1111/j.1541-0420.2006.00660.x.
- George E, McCulloch R. Approaches for Bayesian variable selection. Statistica Sinica. 1997;7:339–373.
- Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–3008. doi: 10.1093/bioinformatics/bti422.
- Hans C, Dobra A, West M. Shotgun stochastic search for “large p” regression. Journal of the American Statistical Association. 2007;102:507–516.
- Ji S, Dunson D, Carin L. Multitask compressive sensing. IEEE Transactions on Signal Processing. 2009;57:92–106.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New York: Wiley; 1980.
- Koul H, Susarla V, Van Ryzin J. Regression analysis with randomly right-censored data. Annals of Statistics. 1981;9:1276–1288.
- Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika. 1993;80:267–278.
- Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proceedings of the National Academy of Sciences, USA. 2005;102:13550–13555.
- Pawitan Y, Bjohle J, Amler L, Borg A, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu E, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw P, Smeds J, Skoog L, Wedren S, Bergh J. Gene expression profiling spares early breast cancer patients from adjuvant therapy: Derived and validated in two population-based cohorts. Breast Cancer Research. 2005;7:R953–R964. doi: 10.1186/bcr1325.
- Sha N, Tadesse MG, Vannucci M. Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics. 2006;22:2262–2268. doi: 10.1093/bioinformatics/btl362.
- Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M. Gene expression profiling in breast cancer: Understanding the molecular basis of histologic grade to improve prognosis. Journal of the National Cancer Institute. 2006;98:262–272. doi: 10.1093/jnci/djj052.
- Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tibshirani R. The Lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
- Tipping ME, Faul AC. Fast marginal likelihood maximisation for sparse Bayesian models. Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics; Key West, FL; 2003.
- Wang S, Nan B, Zhu J, Beer D. Doubly penalized Buckley–James method for survival data with high-dimensional covariates. Biometrics. 2008;64:132–140. doi: 10.1111/j.1541-0420.2007.00877.x.
- Wei GCG, Tanner MA. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithm. Journal of the American Statistical Association. 1990;85:699–704.
- Wei LJ. The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine. 1992;11:1871–1879. doi: 10.1002/sim.4780111409.
- West M. On scale mixtures of normal distributions. Biometrika. 1987;74:646–648.
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B. 2006;68:49–67.
- Zhang H, Lu W. Adaptive Lasso for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
- Zilliox M, Irizarry R. A gene expression bar code for microarray data. Nature Methods. 2007;4:911–913. doi: 10.1038/nmeth1102.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.