Abstract
Many biological processes are represented as network graphs, such as regulatory networks, metabolic pathways, and protein-protein interaction networks. Since genes that are linked on these networks usually have biologically similar functions, linked genes form molecular modules that affect clinical phenotypes and outcomes. Similarly, in large-scale genetic association studies, many SNPs are in high linkage disequilibrium (LD), which can also be summarized as an LD graph. To incorporate the graph information into regression analysis with high-dimensional genomic data as predictors, we introduce a Bayesian approach for graph-constrained estimation (Bayesian GRACE) and regularization, which controls the amount of regularization for sparsity and smoothness of the regression coefficients. The posterior distributions provide credible intervals and standard errors for the estimates of the regression coefficients. The deviance information criterion (DIC) is applied for model assessment and tuning parameter selection. The performance of the proposed Bayesian approach is evaluated through simulation studies and is compared with the Bayesian Lasso and Bayesian Elastic-net procedures. We demonstrate our method in an analysis of data from a case-control genome-wide association study of neuroblastoma using a weighted LD graph.
Keywords: Bayesian Lasso, biological network, Laplacian matrix, high dimensional data, DIC
1 Introduction
Partially motivated by genomic applications, regularization methods for variable selection and coefficient estimation for linear regression models have been extensively studied in recent years, including the popular procedures such as Lasso [1] and SCAD [2] and their various extensions. The fused Lasso [3] imposes the L1 penalty on the absolute differences of the regression coefficients in order to account for some smoothness of the regression coefficients when the covariates are naturally ordered. For grouped variables, Yuan and Lin [4] proposed the group Lasso, where the penalty function is intermediate between the L1 penalty and L2 penalty. Zou and Hastie [5] proposed the Elastic-net as a stabilized version of the Lasso. Particularly, it is useful for the analysis of genomic data because it deals with groups of highly correlated variables.
One limitation of all these regularized regression approaches is that they do not provide valid variance estimates automatically. Methods based on data splitting have recently been proposed to obtain valid p-values for high-dimensional regression [6]. However, these methods require large sample sizes for data splitting. Alternatively, the Lasso estimates can be interpreted as the Bayes posterior mode under independent Laplace priors for the regression coefficients. Park and Casella [7] developed a Bayesian approach that uses the Gibbs sampler for the Lasso estimates with the Laplace prior in a hierarchical model. They showed that the posterior median estimates of the regression coefficients in their model are similar to the ordinary Lasso estimates. Such a Bayesian Lasso provides interval estimates, i.e., Bayesian credible intervals, along with standard errors for the regression coefficient estimates. This is a great advantage over the ordinary Lasso, which does not provide any confidence assessment of the point estimates of the regression coefficients. Yi and Xu [8] applied the Bayesian Lasso to the problem of mapping multiple quantitative trait loci (QTL). Since the traditional linear regression model is used to relate genotypes to phenotypes, the Bayesian Lasso can fit the model to estimate all possible genetic effects associated with all molecular markers across the entire genome.
Besides QTL mapping problems, gene expression data are often studied in a linear regression framework with phenotypes as responses and numerical measurements on genes as predictors. External information about biological graphs, such as regulatory networks, metabolic pathways, and protein-protein interaction networks, is often available and provides additional information about how genes are related on the networks. Similarly, in large-scale genetic association studies, many SNPs are in high linkage disequilibrium (LD), which can also be summarized as an LD graph [9]. To use this extra graph information in the analysis of genomic data, a graph-constrained estimation (GRACE) procedure [10, 11] was proposed for fitting linear regression models and for variable selection. It uses a Laplacian matrix-based penalty to incorporate a priori information about the graph structure into regression analysis. The cyclic coordinate descent algorithm [12] provides an efficient way to compute the GRACE estimates. However, GRACE shares the limitation of other regularization methods in that it does not provide variance estimates or confidence intervals for the estimates.
In this paper, extending the idea of the Bayesian Lasso [7], we develop a Bayesian formulation of graph-constrained estimation for high-dimensional linear and probit regression models, which provides posterior estimates of the regression coefficients and valid standard errors. We call this procedure the Bayesian GRACE. We develop an efficient Markov chain Monte Carlo (MCMC) procedure for sampling from the posterior distributions of the parameters. We use the deviance information criterion (DIC) [13] to choose the tuning parameters and to assess the model. Variable selection is carried out by selecting the coefficients whose Bayesian credible intervals do not include zero. The performance of the proposed approach is evaluated and compared with the Bayesian Lasso and Bayesian Elastic-net on simulated data sets. We also demonstrate our method in an analysis of data from a case-control genome-wide association study (GWAS) of neuroblastoma, incorporating a weighted LD graph defined as in [9].
The rest of the paper is organized as follows. We first briefly review the graph-constrained estimation for linear models [10, 11]. We then introduce a Bayesian hierarchical model with graph constraints on coefficients and present an efficient Gibbs sampling approach to obtain the posterior distributions of the regression coefficients. We also present a similar formulation for the probit models for binary outcomes. We present simulation studies and application to a real data set. Finally, we give a brief discussion and conclusions.
2 Graph-constrained Regularization for Linear Models
We first briefly describe the graph-constrained estimation procedure [10, 11]. Consider a standard linear regression model with n observations and p predictors:

$$Y = X\beta + \varepsilon,$$

where Y = (y1, …, yn)⊤ ∈ ℝⁿ, X = (X1, …, Xp) ∈ ℝⁿˣᵖ, Xj = (x1j, …, xnj)⊤, β = (β1, …, βp)⊤, and ε ~ Nn(0, σ²In). We assume that the predictors and the response are centered so that

$$\sum_{i=1}^{n} y_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_{ij} = 0$$

for j = 1, …, p.
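Assume that the p predictors are nodes of a graph that corresponds to a known genetic network or a weighted LD graph. Denote the graph as G = (V, E), where V is the set of vertices corresponding to the p predictors and E = {u ~ v} is the set of edges, with u ~ v indicating that predictors u and v are linked on the graph. Let w(u, v) ≥ 0 denote the weight of the edge between u and v. The Laplacian matrix then encodes which genes are linked on the graph. The normalized weighted Laplacian matrix L = (luv)p×p is defined as

$$l_{uv} = \begin{cases} 1 - w(u,u)/d_u & \text{if } u = v \text{ and } d_u \neq 0, \\ -w(u,v)/\sqrt{d_u d_v} & \text{if } u \text{ and } v \text{ are adjacent}, \\ 0 & \text{otherwise}, \end{cases}$$

where du = Σu~v w(u, v) is the degree of the vertex u. Note that u is an isolated vertex if du = 0. Let w(u, u) = 0. It is well known that the matrix L is positive semidefinite, with 0 as its smallest eigenvalue and all eigenvalues bounded above by 2 [14]. To incorporate the graph information into regression analysis, Li and Li [10] proposed an improper Gaussian Markov random field prior on β, i.e., a zero-mean Gaussian prior whose precision matrix is proportional to L, which is improper because L is singular.
Note that

$$\beta^\top L \beta = \sum_{u \sim v} \left( \frac{\beta_u}{\sqrt{d_u}} - \frac{\beta_v}{\sqrt{d_v}} \right)^2 w(u, v),$$

which can be regarded as a measure of the smoothness of the regression coefficients with respect to the graph structure. Li and Li [11] proposed the following graph-constrained regularization procedure for variable selection and parameter estimation,

$$\hat{\beta} = \arg\min_{\beta} \left\{ (Y - X\beta)^\top (Y - X\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2\, \beta^\top L \beta \right\}, \tag{1}$$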
where the tuning parameters λ1 and λ2 control the amount of regularization for sparsity and smoothness, respectively. When λ2 = 0, the graph-constrained penalty reduces to the ordinary Lasso penalty, and when L = I, the penalty becomes the Elastic-net penalty [5].
The graph-constrained estimation in equation (1) scales the coefficients to account for the different degrees of the vertices on the graph, allowing genes with more connections, such as hub genes, to have larger coefficients, so that small changes in the expression of such genes can lead to large changes in the response. Biologically, genes that are linked on the networks often have similar functions and correlated expression levels, so they are expected to have smooth regression coefficients. Friedman et al. [12] presented a cyclic coordinate descent algorithm for solving Lasso and Elastic-net regularization. Li and Li [11] presented a similar coordinate descent algorithm for problem (1) and employed cross-validation to select the tuning parameters.
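As a small illustration of the quantities in this section, the sketch below builds a normalized weighted Laplacian from a symmetric weight matrix and evaluates the graph-constrained penalty. It assumes the standard form L = I − D^(−1/2) W D^(−1/2) with zero diagonal entries for isolated vertices; the function names `normalized_laplacian` and `grace_penalty` and the toy graph are hypothetical and are not taken from [10, 11].

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized weighted Laplacian of a symmetric weight matrix W (zero diagonal)."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                       # vertex degrees d_u
    nonisolated = d > 0
    inv_sqrt_d = np.zeros_like(d)
    inv_sqrt_d[nonisolated] = 1.0 / np.sqrt(d[nonisolated])
    L = -W * np.outer(inv_sqrt_d, inv_sqrt_d)             # off-diagonal entries
    L[np.diag_indices_from(L)] = nonisolated.astype(float)  # l_uu = 1 unless u is isolated
    return L

def grace_penalty(beta, L, lam1, lam2):
    """lam1 * ||beta||_1 + lam2 * beta' L beta."""
    return lam1 * np.abs(beta).sum() + lam2 * (beta @ L @ beta)

# Toy graph: vertex 0 is a hub linked to vertices 1 and 2; vertex 3 is isolated.
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[0, 2] = W[2, 0] = 2.0
L = normalized_laplacian(W)
beta = np.array([1.0, 0.5, -0.5, 0.3])

# The quadratic form equals the degree-scaled squared differences summed over edges.
d = W.sum(axis=1)
edge_sum = sum(
    W[u, v] * (beta[u] / np.sqrt(d[u]) - beta[v] / np.sqrt(d[v])) ** 2
    for u in range(4) for v in range(u + 1, 4) if W[u, v] > 0
)
quad = beta @ L @ beta
eigvals = np.linalg.eigvalsh(L)             # all eigenvalues lie in [0, 2]
```

Passing the identity matrix in place of `L` gives the Elastic-net penalty, and setting `lam2 = 0` gives the Lasso penalty, matching the special cases noted above.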
3 A Bayesian Approach to Graph-constrained Estimation and Regularization
In this section, we present a Bayesian hierarchical model for the graph-constrained estimation (Bayesian GRACE) and present a Gibbs sampling procedure to obtain the posterior distributions of the regression coefficients. Our development is similar to Bayesian Lasso [7]. We consider both high-dimensional linear and probit regression models.
3.1 A hierarchical graph-constrained estimation
In the penalized linear regression with the graph-constrained penalty, the conditional prior of β, given σ2 can be expressed as
| (2) |
where L is the Laplacian matrix defined for the graph. As with the Bayesian Lasso [7], this prior can be obtained from the basic identity of the Laplace distribution,

$$\frac{a}{2}\, e^{-a|z|} = \int_0^\infty \frac{1}{\sqrt{2\pi s}}\, e^{-z^2/(2s)}\, \frac{a^2}{2}\, e^{-a^2 s/2}\, ds, \qquad a > 0,$$

i.e., a scale mixture of normals with an exponential mixing density is equivalent to the Laplace distribution. Specifically, the Bayesian formulation of the graph-constrained estimation is then given by the following hierarchical model
Any inverse Gamma prior for σ2 would maintain conjugacy, including the improper prior density π(σ2) = 1/σ2. After integrating out the latent mixing variances and using the scale mixture of normals representation of the Laplace distribution, the conditional prior on β has the desired form (2). Park and Casella [7] have pointed out that conditioning on σ2 is important because it guarantees a unimodal full posterior. Lack of unimodality slows convergence of the Gibbs sampler and makes point estimates less meaningful.
Using the improper prior density π(σ2) = 1/σ2, the full conditional posterior distributions of β, the latent mixing variances, and σ2 for fixed λ1 and λ2 are then
| (3) |
These full conditional distributions are easy to sample from and therefore form the basis of an efficient Gibbs sampler. The full conditional posterior distribution of each univariate βj can also be obtained as
| (4) |
after some algebra. Updating each βj becomes simple and fast. It is clear from this conditional distribution that larger λ1 leads to larger and therefore bigger regularization and bigger shrinkage of βk. On the other hand, larger value of λ2 leads to stronger influence of the values of the regression coefficients of the neighboring genes.
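The Gibbs sampler described above can be sketched as follows. This is a minimal sketch under explicitly assumed conditionals, not a transcription of (3): it assumes β | σ², τ² ~ N(0, σ²(D_τ⁻¹ + λ2 L)⁻¹) with D_τ = diag(τ1², …, τp²), a mixing prior on the τj² chosen so that 1/τj² has an inverse-Gaussian full conditional as in the Bayesian Lasso [7], and π(σ²) = 1/σ². The symbol τj², the function name `bayes_grace_gibbs`, and the toy data are all hypothetical.

```python
import numpy as np

def bayes_grace_gibbs(X, Y, L, lam1, lam2, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler sketch; returns the post-burn-in draws of beta."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, XtY = X.T @ X, X.T @ Y
    sigma2, inv_tau2 = 1.0, np.ones(p)
    draws = np.empty((n_iter - burn_in, p))
    for t in range(n_iter):
        # beta | rest ~ N(A^{-1} X'Y, sigma2 * A^{-1}), A = X'X + D_tau^{-1} + lam2 * L
        prior_prec = np.diag(inv_tau2) + lam2 * L
        A = XtX + prior_prec
        chol = np.linalg.cholesky(A)          # A = C C'
        mean = np.linalg.solve(A, XtY)
        # C'^{-1} z has covariance A^{-1} when z is standard normal
        beta = mean + np.sqrt(sigma2) * np.linalg.solve(chol.T, rng.standard_normal(p))
        # sigma2 | rest ~ Inverse-Gamma(shape, rate) (assumed form)
        resid = Y - X @ beta
        shape = (n - 1 + p) / 2.0
        rate = (resid @ resid + beta @ prior_prec @ beta) / 2.0
        sigma2 = rate / rng.gamma(shape)
        # 1/tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lam1^2 sigma2 / beta_j^2), lam1^2)
        ig_mean = np.sqrt(lam1**2 * sigma2 / np.maximum(beta**2, 1e-12))
        inv_tau2 = rng.wald(ig_mean, lam1**2)
        if t >= burn_in:
            draws[t - burn_in] = beta
    return draws

# Toy run: chain graph on p = 6 predictors, first three relevant.
rng = np.random.default_rng(1)
n, p = 400, 6
W = np.zeros((p, p))
for j in range(p - 1):
    W[j, j + 1] = W[j + 1, j] = 1.0
d = W.sum(axis=1)
L = np.eye(p) - W / np.sqrt(np.outer(d, d))   # valid here: no isolated vertices
beta_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
Y = X @ beta_true + rng.standard_normal(n)
Y -= Y.mean()
draws = bayes_grace_gibbs(X, Y, L, lam1=1.0, lam2=1.0, n_iter=1500, burn_in=500)
post_median = np.median(draws, axis=0)
```

The block update of β uses a Cholesky factor rather than an explicit inverse. For large p, updating each βj from its univariate conditional avoids factorizing a p × p matrix at every iteration.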
It is also interesting to notice that the mean of the conditional distribution for βk given by equation (4) is closely related to the soft-thresholding update formula based on the cyclic coordinate descent algorithm [11], which is given by
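where S(z, γ) is the soft-thresholding operator with value

$$S(z, \gamma) = \operatorname{sign}(z)\,(|z| - \gamma)_{+} = \begin{cases} z - \gamma & \text{if } z > 0 \text{ and } \gamma < |z|, \\ z + \gamma & \text{if } z < 0 \text{ and } \gamma < |z|, \\ 0 & \text{if } \gamma \ge |z|. \end{cases}$$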
This link explains both the similarity and the key difference between the regularization method [10] and the Bayesian GRACE proposed here: the estimates of the coefficients of the relevant variables should be similar; however, the Bayesian GRACE procedure can yield estimates of β that are very small but not exactly zero, and therefore it does not automatically select variables. The problems of selecting the tuning parameters and the variables are discussed in Section 3.3.
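For reference, the soft-thresholding operator can be written in a few lines. The function name `soft_threshold` is hypothetical; the sketch only illustrates why the coordinate descent update produces exact zeros, whereas draws from a continuous posterior do not.

```python
def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0   # every |z| <= gamma is mapped to exactly zero

shrunk = [soft_threshold(z, 1.0) for z in (3.0, -3.0, 0.5)]
```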
3.2 A Bayesian approach for graph-constrained estimation for binary response
Binary data, such as the absence or presence of a disease or a positive or negative response to a treatment, are often considered as response variables in biomedical studies. The logistic regression model is most commonly used for a binary response, but including Y directly in the hierarchical model is not feasible because Y follows a Bernoulli distribution. Albert and Chib [15] suggested a probit (standard Gaussian) link function and a latent variable framework to estimate the posterior distribution of β using the Gibbs sampler.
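Let us denote Xβ = (g1, …, gn)⊤, and suppose that we have latent variables Z = (z1, z2, …, zn)⊤ that are independently normally distributed with mean gi and unit variance. If we assume that

$$y_i = \begin{cases} 1 & \text{if } z_i > 0, \\ 0 & \text{if } z_i \le 0, \end{cases}$$

then it can be easily shown that

$$P(y_i = 1 \mid X, \beta) = \Phi(g_i), \qquad \pi(z_i \mid y_i, X, \beta) \propto \phi(z_i - g_i)\,\{1(z_i > 0)\,1(y_i = 1) + 1(z_i \le 0)\,1(y_i = 0)\},$$

where φ(·) is the standard normal density function, Φ(·) is the standard normal distribution function, and 1(·) is an indicator function. The conditional distribution of zi, given Y, X, and β, is then a truncated normal distribution.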
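To complete a Gibbs sampler, these latent variables are sampled from

$$z_i \mid y_i = 1, X, \beta \sim N_{+}(g_i, 1), \qquad z_i \mid y_i = 0, X, \beta \sim N_{-}(g_i, 1),$$

where N+ denotes a normal distribution truncated on the left at 0 (so that zi > 0) and N− denotes a normal distribution truncated on the right at 0 (so that zi ≤ 0).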
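One simple way to draw the truncated-normal latent variables is inverse-CDF sampling, sketched below with the Python standard library only. The function name `sample_latent` is hypothetical, and the approach is an illustration rather than the sampler used for the results in this paper; it is adequate for moderate |g|, while extreme values of g call for a dedicated tail sampler.

```python
import random
from statistics import NormalDist

_STD = NormalDist()   # standard normal, provides cdf() and inv_cdf()

def sample_latent(g, y, rng=random):
    """Draw z ~ N(g, 1) restricted to z > 0 if y == 1 and to z <= 0 if y == 0."""
    p0 = _STD.cdf(-g)                        # P(z <= 0) for z ~ N(g, 1)
    if y == 1:
        u = p0 + (1.0 - p0) * rng.random()   # uniform on [p0, 1)
    else:
        u = p0 * rng.random()                # uniform on [0, p0)
    u = min(max(u, 1e-12), 1.0 - 1e-12)      # keep inv_cdf away from 0 and 1
    return g + _STD.inv_cdf(u)

rng = random.Random(0)
z_case = sample_latent(0.3, 1, rng)      # non-negative draw
z_control = sample_latent(0.3, 0, rng)   # non-positive draw
```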
We can still easily update β, the latent mixing variances, and σ2 from
where Z̃ = Z − z̄1n.
3.3 Variable and tuning parameters selection
Bayesian estimation based on the output of a Gibbs sampler yields posterior means or medians of the coefficients, together with standard errors from their posterior distributions, but these posterior estimates cannot be exactly zero. Therefore, the Bayesian GRACE procedure itself does not directly lead to variable selection. One strategy is to set to zero any coefficient estimate whose Bayesian credible interval contains zero. Since we have estimates of the standard errors of these point estimates, we can assess how sure we are that such coefficients are actually zero. This is a great advantage over the GRACE procedure [11].
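The selection rule above can be sketched from a matrix of posterior draws (rows are MCMC iterations, columns are coefficients). The function name `select_by_credible_interval`, the use of equal-tailed intervals, and the simulated draws are assumptions made for illustration.

```python
import numpy as np

def select_by_credible_interval(draws, level=0.95):
    """Select coefficients whose equal-tailed credible interval excludes zero."""
    alpha = 1.0 - level
    lower = np.quantile(draws, alpha / 2.0, axis=0)
    upper = np.quantile(draws, 1.0 - alpha / 2.0, axis=0)
    selected = (lower > 0) | (upper < 0)
    # posterior median for selected coefficients, exactly zero otherwise
    estimate = np.where(selected, np.median(draws, axis=0), 0.0)
    return selected, estimate, lower, upper

# Stand-in for MCMC output: three coefficients with posterior means 1.0, 0.0, -0.8.
rng = np.random.default_rng(0)
draws = rng.normal(loc=[1.0, 0.0, -0.8], scale=0.2, size=(4000, 3))
selected, estimate, lower, upper = select_by_credible_interval(draws, level=0.95)
```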
Since the tuning parameters λ1 and λ2 influence the posterior estimates, with larger λ1 values leading to more shrinkage of the estimates toward zero, these tuning parameters have to be determined before the relevant variables are selected. Typically, cross-validation is used for choosing these tuning parameters, but it can be computationally very expensive. An alternative is to use the BIC or AIC; however, it is not clear how one should define the degrees of freedom or the effective number of parameters. Park and Casella [7] suggested marginal maximum likelihood for the parameters, using a Monte Carlo EM algorithm [16]. However, we observed that in our setting this approach does not work well when p is large.
We instead use the DIC to select the tuning parameters. The DIC has been the criterion of choice for Bayesian model selection and model comparison since it was proposed by Spiegelhalter et al. [13]. The DIC is a generalization of the AIC, but is intended for use with MCMC output and for hierarchical linear models. The DIC has recently been used in genetics for evaluating hierarchical models of gene-environment interactions [17, 18] and for mapping multiple QTLs [19].
Let us assume, in general, that the distribution of the data y depends on a multidimensional parameter vector θ, and that L(θ|y) is the likelihood function of θ. The statistical deviance is defined as

$$D(\theta) = -2 \log L(\theta \mid y)$$

by Spiegelhalter et al. [13]. The DIC consists of two components, a term that measures goodness of fit and a penalty term for increasing model complexity, i.e.,

$$\mathrm{DIC} = \bar{D} + p_D,$$

where D̄ is the posterior mean of the deviance. The second component pD measures the complexity of the model by the effective number of parameters, which is defined as the difference between the posterior mean of the deviance and the deviance evaluated at the posterior estimates, i.e.,

$$p_D = \bar{D} - D(\bar{\theta}),$$

where θ̄ is a posterior point estimate of θ, such as the posterior mean, median, or mode.
Specifically, the posterior distribution of β in (3) forms a p-dimensional multivariate normal distribution, so the likelihood is
where μ = (X⊤X + Σ⁻¹)⁻¹X⊤Y and the observed data are (X, Y, L, λ2).
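Let us denote the t-th MCMC sample of θ by θ⁽ᵗ⁾, t = 1, …, K. Then the posterior mean deviance is

$$\bar{D} = \frac{1}{K} \sum_{t=1}^{K} D(\theta^{(t)}) = -\frac{2}{K} \sum_{t=1}^{K} \log L(\theta^{(t)} \mid y),$$

and the deviance at the posterior estimate of θ, denoted by θ̄, is

$$D(\bar{\theta}) = -2 \log L(\bar{\theta} \mid y).$$

The DIC is finally computed as

$$\mathrm{DIC} = \bar{D} + p_D = 2\bar{D} - D(\bar{\theta}).$$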
Since the data (X, Y, L) are fixed, the DIC depends only on the tuning parameters (λ1, λ2). The covariance matrix of β involves the parameter λ2, and the posterior distributions of the latent mixing variances are determined by the magnitude of λ1. The larger λ1 is, the sparser the model is. To determine the optimal tuning parameters, we run the MCMC sampler for each pair (λ1, λ2) on a grid of candidate values, compute the DIC from each output, and choose the pair (λ1, λ2) with the smallest DIC.
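The DIC computation from MCMC output can be sketched as follows. As a simplifying assumption, the sketch uses the ordinary Gaussian likelihood of Y given (β, σ²) and posterior means as point estimates, which may differ from the likelihood used for the results reported here; the function names and the stand-in draws are hypothetical. The last lines apply the "smallest DIC" rule to the B.GRACE values reported in Table 1 for a correlation of 0.1.

```python
import numpy as np

def gaussian_deviance(X, Y, beta, sigma2):
    """D(theta) = -2 log L(theta | y) for Y ~ N(X beta, sigma2 * I)."""
    resid = Y - X @ beta
    return len(Y) * np.log(2.0 * np.pi * sigma2) + resid @ resid / sigma2

def dic(X, Y, beta_draws, sigma2_draws):
    """Return (DIC, p_D, D_bar) computed from MCMC draws."""
    dev = np.array([gaussian_deviance(X, Y, b, s)
                    for b, s in zip(beta_draws, sigma2_draws)])
    d_bar = dev.mean()                                   # posterior mean deviance
    d_hat = gaussian_deviance(X, Y, beta_draws.mean(axis=0), sigma2_draws.mean())
    p_d = d_bar - d_hat                                  # effective number of parameters
    return d_bar + p_d, p_d, d_bar

# Stand-in for MCMC output: noisy copies of the least-squares fit.
rng = np.random.default_rng(0)
n, p, K = 50, 3, 200
X = rng.standard_normal((n, p))
Y = X @ np.array([1.0, 0.0, -1.0]) + rng.standard_normal(n)
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
beta_draws = beta_hat + 0.1 * rng.standard_normal((K, p))
sigma2_draws = np.ones(K)
dic_value, p_d, d_bar = dic(X, Y, beta_draws, sigma2_draws)

# Choosing lambda1 by the smallest DIC (B.GRACE, correlation 0.1, Table 1).
dic_by_lam1 = {0.1: -1347.92, 1: -1352.36, 10: -1384.57, 50: -1421.52, 100: -1388.90}
best_lam1 = min(dic_by_lam1, key=dic_by_lam1.get)
```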
4 Simulation Results
To demonstrate the proposed Bayesian GRACE and to compare its results with GRACE, we simulated a network graph that mimics gene regulation modules. Suppose that the graph consists of 50 unconnected regulatory modules, each containing one transcription factor (TF) that regulates 10 different genes, so that we have a total of 550 variables or genes. Among these modules and genes, we further assume that the first four TFs and their regulated genes, i.e., a total of 44 variables, are associated with the response through the following model,
| (5) |
where ε is randomly generated from the normal distribution of a mean of 0 and a variance of . For each TF, the Xj value is simulated from a standard normal distribution with a sample size of n = 1000. The 10 regulated genes for each TF are then simulated from a conditional normal distribution with a correlation of 0.1, 0.5, and 0.9 between the genes and its regulator, respectively. To assess the prediction performance of the methods, we also generated additional 500 test samples 50 times to compute the prediction mean squared errors. We considered two different models for the regression coefficients for the relevant genes.
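The predictor and graph structure of this simulation can be sketched as below. The sketch assumes that each regulated gene is generated as ρ times its TF plus independent normal noise with variance 1 − ρ², which gives unit variance and correlation ρ with the TF, and that every TF-gene edge has weight 1. The function name `simulate_modules` is hypothetical, and the response is not generated because it depends on the coefficients in model (5).

```python
import numpy as np

def simulate_modules(n, n_modules=50, genes_per_tf=10, rho=0.5, seed=0):
    """Simulate expression data X (n x p) and the module graph W (p x p)."""
    rng = np.random.default_rng(seed)
    size = 1 + genes_per_tf                 # one TF plus its regulated genes
    p = n_modules * size
    X = np.empty((n, p))
    W = np.zeros((p, p))
    for m in range(n_modules):
        tf = m * size                       # column index of the TF in module m
        X[:, tf] = rng.standard_normal(n)
        for g in range(tf + 1, tf + size):
            noise = rng.standard_normal(n)
            X[:, g] = rho * X[:, tf] + np.sqrt(1.0 - rho**2) * noise
            W[tf, g] = W[g, tf] = 1.0       # edge between the TF and its target
    return X, W

X, W = simulate_modules(n=1000, rho=0.5)
tf_gene_corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
```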
For the first model, we assume that the true coefficients are specified as
This is also the model used in [11]. For each of the simulated data sets with different network correlations, we fitted a linear model using the Bayesian Lasso (λ2 = 0), the Bayesian Elastic-net (L = I), and the Bayesian GRACE, respectively. For each model, we ran the Gibbs sampler in (3) for 6000 iterations and retained the 5000 MCMC draws obtained after discarding the first 1000 as burn-in. All models were fitted repeatedly with 5 different shrinkage parameters, λ1 = 0.1, 1, 10, 50, 100. Since the Bayesian Elastic-net and the Bayesian GRACE have a second tuning parameter for smoothness, we ran them with λ2 = 1, 10, 10², 10³, 10⁴ for each fixed λ1. For each given λ1, the DICs were then computed to choose λ2. The Bayesian Elastic-net had the minimum DIC with λ2 = 1, and the Bayesian GRACE achieved the smallest DIC with λ2 = 10⁴. Table 1 shows the DIC results of the three Bayesian procedures for the three simulated data sets with different correlation structures. The Bayesian GRACE has a much smaller DIC than the other two procedures for all shrinkage and correlation parameters.
Table 1.
Deviance information criterion (DIC) for the Bayesian Lasso (B.Lasso), the Bayesian Elastic-net (B.Elastic-net), and the Bayesian GRACE (B.GRACE) for three simulated data sets based on Model 1 with different correlation structures. For each λ1, λ2 was searched over a range of values to obtain the minimum DICs.
| Correlation | Method | λ1 = 0.1 | λ1 = 1 | λ1 = 10 | λ1 = 50 | λ1 = 100 |
|---|---|---|---|---|---|---|
| 0.1 | B.Lasso | −317.76 | −332.06 | −421.78 | −680.56 | −920.04 |
| 0.1 | B.Elastic-net | −313.68 | −328.30 | −425.44 | −680.31 | −919.79 |
| 0.1 | B.GRACE | −1347.92 | −1352.36 | −1384.57 | −1421.52 | −1388.90 |
| 0.5 | B.Lasso | −249.94 | −267.41 | −373.32 | −680.28 | −912.33 |
| 0.5 | B.Elastic-net | −246.34 | −264.36 | −374.24 | −682.02 | −912.07 |
| 0.5 | B.GRACE | −1411.03 | −1414.42 | −1439.82 | −1465.14 | −1423.96 |
| 0.9 | B.Lasso | 501.17 | 446.17 | 141.86 | −517.24 | −854.50 |
| 0.9 | B.Elastic-net | 505.23 | 450.84 | 142.38 | −515.67 | −851.36 |
| 0.9 | B.GRACE | −1395.35 | −1398.83 | −1424.41 | −1452.28 | −1411.22 |
We next computed the prediction mean squared errors (PMSEs) based on 500 test samples for each model and repeated this 50 times. The average PMSEs are shown in Figure 1. The results of the Bayesian Lasso and the Bayesian Elastic-net are very similar for all three data sets. Both methods achieved their smallest prediction errors at λ1 = 50. In contrast, the Bayesian GRACE has smaller prediction errors than the other two methods, regardless of the shrinkage parameter λ1. Since the Bayesian GRACE has the smallest DIC at λ1 = 50, λ1 = 50 was chosen for all three Bayesian estimation procedures.
Figure 1.
The average of prediction mean squared errors of the Bayesian Lasso, the Bayesian Elastic-net, and the Bayesian GRACE for three simulated data sets based on Model 1 with different correlations between the TFs and the genes they regulate.
We also examine the estimates of the regression coefficients. The scatter plots of the squared errors of the posterior median estimates, (β̂j − βj)², j = 1, …, 550, are presented in Figure 2 for the three simulated data sets. Most of the posterior median estimates of the zero regression coefficients are close to zero for all three procedures. The squared errors of the median estimates from the Bayesian GRACE for the 44 relevant variables are much smaller than those from the Bayesian Lasso or the Bayesian Elastic-net, especially when the network correlation is strong.
Figure 2.
The squared errors of posterior median estimates for the regression coefficients by Bayesian Lasso, Bayesian Elastic-net, and Bayesian GRACE for three simulated data sets from Model 1 with a correlation of 0.1, 0.5, and 0.9 between the TFs and the genes that they regulate, respectively. The x-axis represents the variable index, where the first 44 variables are relevant.
We finally examine how well the Bayesian credible intervals can be used for variable selection. We computed 99% and 95% Bayesian credible intervals for the coefficient estimates and selected the variables whose credible intervals did not include zero. Accordingly, we set the median estimate to zero if its credible interval contained zero. Since only the first 44 variables are associated with the response, we can evaluate and compare the performance of each method in terms of estimation and selection. The sensitivity and specificity of the selection show how well the methods choose the relevant variables, and the mean squared error (MSE) of the median estimates, computed with only the selected variables retained, measures how accurately each method estimates the regression coefficients. Table 2 presents a summary of the selection results from the three Bayesian procedures for the three simulated data sets. The Bayesian Lasso and the Bayesian Elastic-net selected fewer than 44 variables when 99% or 95% credible intervals were used, so the sensitivities are relatively low for these two procedures. The specificities of these methods were 1 or close to 1 because of the large number of irrelevant variables. The Bayesian GRACE selected exactly the 44 relevant variables, a perfect selection result. The MSEs of the Bayesian GRACE estimates were much smaller; specifically, they were 100–200 times smaller for the simulated data set with a correlation of 0.9. Overall, the Bayesian GRACE outperformed the other methods in terms of both estimation and variable selection. In particular, it showed better performance when the predictors are highly correlated. The perfect selection results of the Bayesian GRACE can also be attributed to the large sample size in our simulations.
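The sensitivity and specificity reported in Table 2 follow directly from the set of selected variables. A minimal sketch, with the hypothetical function name `selection_summary`, is given below; the example reproduces the Bayesian Lasso row for a correlation of 0.1 and the 99% interval (29 selected variables) under the assumption that all 29 selected variables are relevant, which is consistent with the reported specificity of 1.00.

```python
def selection_summary(selected, relevant, p):
    """Sensitivity and specificity of a selected index set among p variables."""
    selected, relevant = set(selected), set(relevant)
    true_pos = len(selected & relevant)
    false_pos = len(selected - relevant)
    sensitivity = true_pos / len(relevant)
    specificity = 1.0 - false_pos / (p - len(relevant))
    return sensitivity, specificity

# 29 of the 44 relevant variables selected, no irrelevant ones, p = 550.
sens, spec = selection_summary(range(29), range(44), p=550)
```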
Table 2.
Summary of model selection by 99% and 95% Bayesian credible intervals for Bayesian Lasso, Bayesian Elastic-net, and Bayesian GRACE for three data sets based on Model 1 with different correlation structures. Mean squared errors (MSEs) are also presented.
| Correlation | Method | Credible interval | No. of selected variables | Sensitivity | Specificity | MSE |
|---|---|---|---|---|---|---|
| 0.1 | B.Lasso | 99% | 29 | 0.66 | 1.00 | 0.014 |
| 0.1 | B.Lasso | 95% | 36 | 0.82 | 1.00 | 0.010 |
| 0.1 | B.Elastic-net | 99% | 28 | 0.64 | 1.00 | 0.015 |
| 0.1 | B.Elastic-net | 95% | 35 | 0.80 | 1.00 | 0.010 |
| 0.1 | B.GRACE | 99% | 44 | 1.00 | 1.00 | 0.0024 |
| 0.1 | B.GRACE | 95% | 44 | 1.00 | 1.00 | 0.0024 |
| 0.5 | B.Lasso | 99% | 31 | 0.70 | 1.00 | 0.0144 |
| 0.5 | B.Lasso | 95% | 35 | 0.80 | 1.00 | 0.0120 |
| 0.5 | B.Elastic-net | 99% | 31 | 0.71 | 1.00 | 0.014 |
| 0.5 | B.Elastic-net | 95% | 36 | 0.82 | 1.00 | 0.011 |
| 0.5 | B.GRACE | 99% | 44 | 1.00 | 1.00 | 0.0008 |
| 0.5 | B.GRACE | 95% | 44 | 1.00 | 1.00 | 0.0008 |
| 0.9 | B.Lasso | 99% | 23 | 0.52 | 1.00 | 0.044 |
| 0.9 | B.Lasso | 95% | 32 | 0.73 | 1.00 | 0.027 |
| 0.9 | B.Elastic-net | 99% | 22 | 0.50 | 1.00 | 0.045 |
| 0.9 | B.Elastic-net | 95% | 32 | 0.73 | 1.00 | 0.023 |
| 0.9 | B.GRACE | 99% | 44 | 1.00 | 1.00 | 0.0002 |
| 0.9 | B.GRACE | 95% | 44 | 1.00 | 1.00 | 0.0002 |
For Model 2, we assume the following true regression coefficients,
where , so the error variance remains unchanged. We simply generated uk by setting uk = k and dividing each by . Y and Xj are generated in the same way as for Model 1, but we fix the network correlation at 0.5.
As in the analysis for Model 1, the DIC was used to choose the tuning parameters λ1 and λ2. All three Bayesian procedures selected the sparsity tuning parameter λ1 = 50, and the smoothness tuning parameters λ2 = 1 and λ2 = 10² were selected by the Bayesian Elastic-net and the Bayesian GRACE, respectively. Under these selected tuning parameters, Figure 3 compares the squared errors of the median estimates of the three Bayesian procedures. The Bayesian GRACE still resulted in the smallest estimation errors, especially for the relevant variables. Additionally, we computed the 99%, 95%, and 90% Bayesian credible intervals from the MCMC runs for all three Bayesian procedures to assess the selection performance. Table 3 summarizes the variables selected with these different credible intervals. The mean squared errors of the median estimates in the table are based only on the selected variables, i.e., we set the median estimates of unselected variables to zero. The Bayesian GRACE procedure selected more relevant variables, with higher sensitivities and smaller MSEs, even though all three procedures employ the same sparsity tuning parameter.
Figure 3.
The squared errors of posterior median estimates for the regression coefficients of the Bayesian Lasso, the Bayesian Elastic-net, and the Bayesian GRACE for simulated data set from Model 2 with a correlation of 0.5 between the TFs and the genes that they regulate. The x-axis represents the variable index, where the first 44 variables are relevant.
Table 3.
Summary of model selection by 99%, 95%, and 90% Bayesian credible intervals for Bayesian Lasso, Bayesian Elastic-net, and Bayesian GRACE for simulated data set based on Model 2 with a correlation of 0.5 between the TFs and the genes that they regulate. Mean squared errors (MSEs) are also presented.
| Method | Credible interval | No. of selected variables | Sensitivity | Specificity | MSE |
|---|---|---|---|---|---|
| B.Lasso | 99% | 28 | 0.64 | 1.00 | 0.0099 |
| B.Lasso | 95% | 31 | 0.70 | 1.00 | 0.0079 |
| B.Lasso | 90% | 34 | 0.77 | 1.00 | 0.0069 |
| B.Elastic-net | 99% | 28 | 0.64 | 1.00 | 0.0098 |
| B.Elastic-net | 95% | 31 | 0.70 | 1.00 | 0.0078 |
| B.Elastic-net | 90% | 34 | 0.77 | 1.00 | 0.0068 |
| B.GRACE | 99% | 30 | 0.68 | 1.00 | 0.0057 |
| B.GRACE | 95% | 36 | 0.82 | 1.00 | 0.0039 |
| B.GRACE | 90% | 38 | 0.86 | 1.00 | 0.0032 |
5 Real Data Analysis
Neuroblastoma (NB) is a common and lethal pediatric malignancy, but despite significant effort the genetic events that initiate tumorigenesis were until recently unknown [20]. We had hypothesized that NB is a complex disease that results from the interaction of mutant alleles with relatively low to moderate effects on tumor initiation. To identify these genetic variants, Maris et al. [21] reported a GWAS of NB in which 1032 neuroblastoma cases and 2043 controls of European descent were genotyped using the Illumina 550K SNP chips. Using single SNP trend tests, they observed a significant association between NB and the common minor alleles of three consecutive SNPs at chromosome band 6p22, a region containing the predicted genes FLJ22536 and FLJ44180 (p-value = 1.71×10⁻⁹ to 7.01×10⁻¹⁰; allelic odds ratio, 1.39 to 1.40). Homozygosity for the at-risk G allele of the most significantly associated SNP, rs6939340, resulted in an increased likelihood of NB development (odds ratio, 1.97; 95% confidence interval, 1.58 to 2.45).
To demonstrate the Bayesian GRACE procedure, we reanalyzed 1000 SNPs on chromosome 6 around the 6p22 region, where the three SNPs were identified by the single SNP analysis [21]. To account for the LD among the SNPs, we created a weighted LD graph linking these 1000 SNPs, where two SNPs are linked if their LD measurement r² is greater than 0.4. We assumed a linear regression model with the response being 1 for the cases and −1 for the controls. In addition, we coded the SNP genotypes as 0, 1, and 2 for 0, 1, and 2 minor alleles and centered and scaled the genotypes of each SNP, so that ΣXj = 0 for the jth SNP, where the sum is over all individuals. The linear model was fitted repeatedly with λ1 = 1, 10, 50, 100, 200 and λ2 = 1, 10, 10², 10³, 10⁴, 10⁵, and the DIC was computed for each combination of λ1 and λ2. The DIC selected the model with λ1 = 100 and λ2 = 10⁵.
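Thresholding pairwise LD to obtain a weighted graph can be sketched as follows. Two assumptions are made for illustration: r² is approximated by the squared Pearson correlation of the 0/1/2 genotype counts, and retained edges are weighted by r² itself; the weights defined in [9] may differ. The function name `ld_graph` and the toy genotypes are hypothetical, and monomorphic SNPs would need to be removed first because their correlation is undefined.

```python
import numpy as np

def ld_graph(G, threshold=0.4):
    """Weighted LD graph from an (individuals x SNPs) genotype matrix G."""
    r2 = np.corrcoef(G, rowvar=False) ** 2   # pairwise squared correlations
    W = np.where(r2 > threshold, r2, 0.0)    # keep only edges above the threshold
    np.fill_diagonal(W, 0.0)                 # no self-loops
    return W

# Toy genotypes: SNP 1 is a near copy of SNP 0, SNP 2 is independent of both.
rng = np.random.default_rng(0)
g0 = rng.integers(0, 3, size=500)
g1 = g0.copy()
g1[:25] = rng.integers(0, 3, size=25)
g2 = rng.integers(0, 3, size=500)
G = np.column_stack([g0, g1, g2]).astype(float)
W = ld_graph(G, threshold=0.4)
```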
The 95% Bayesian credible intervals for the median estimates of these 1000 SNPs are presented in Figure 4, showing that most of the intervals include zero and that most of the SNPs are not associated with the NB risk. The bottom three plots in Figure 4 show the 99%, 95%, and 90% Bayesian credible intervals, respectively, for the median estimates of the top 30 SNPs ranked by the absolute values of the estimates, along with the SNP names. With the 99% intervals only one SNP, rs4712653, was selected, and the 95% and 90% credible intervals selected 4 and 5 more SNPs, respectively. They are rs6939340, rs9295536, rs9466269, rs11759745, and rs12210008 (selected only when the 90% credible interval was used). The first 4 SNPs and rs4712653 are all linked together and are in high LD with each other. Three of these SNPs, rs4712653, rs6939340, and rs9295536, were also identified by the single SNP analysis [21]. The SNP rs12210008 has the largest median estimate but a relatively large standard error, and it was therefore selected only when the 90% credible interval was used. This SNP, which is in the neurensin 1 (NRSN1) gene, is not linked with any of the other SNPs identified. The gene NRSN1 is related to human nervous system development and neuron projection, which may be related to the risk of NB. This deserves further validation.
Figure 4.
Analysis of 1000 SNPs from a case-control genetic association study of neuroblastoma. The upper plot shows the posterior median estimates and the 95% Bayesian credible intervals for the 1000 SNPs. The lower plots show the posterior median estimates of the top 30 SNPs ranked by the absolute values of the median estimates and their 99%, 95%, and 90% Bayesian credible intervals, respectively.
6 Discussion and Conclusions
We have introduced a Bayesian approach for graph-constrained estimation and regularization for high-dimensional linear regressions where the predictors are the nodes of a graph. Such a graph-constrained regularization procedure utilizes the prior biological information encoded on graphs in identifying the genes or sub-networks or the SNPs that are related to the outcomes. An efficient Gibbs sampling procedure is developed to obtain the posterior distributions of the regression coefficients, which can be used to assess the variability of the estimates. The Gibbs sampling algorithm is very fast and is feasible for analyzing high dimensional data. When the regression coefficients are smooth w.r.t. the graphical structures, the proposed procedure performed better than the Bayesian Lasso or Bayesian Elastic-net procedure both in identifying the relevant variables and also in prediction. This agrees with what was observed by Li and Li [11] when comparing the graph-constrained regularization with Lasso or Elastic net procedures.
Compared to GRACE, Bayesian GRACE provides Bayesian credible intervals of the parameter estimates that can be used for statistical inference. Because the conditional distributions in the Gibbs sampling steps are available in closed form, the computational burden is not excessive. However, unlike standard regularization procedures, which can directly produce zero estimates for some covariates and therefore directly lead to variable selection, Bayesian formulations of regularization procedures, such as that of [7] and our proposed procedure, do not directly produce zero estimates of the coefficients. When p is small, it is reasonable to use Bayesian credible intervals to select the relevant variables. However, when p is large and greater than the sample size n, selecting variables based on the credible intervals can be problematic, since the selection depends on the choice of the interval level. In this paper, we used the DIC to select the tuning parameters λ1 and λ2. This seems to provide sensible results for simulated and real data sets. We have also tried BIC and found that it tends to select many fewer variables. Alternatively, we can sample λ1 and λ2 in the Gibbs sampler with appropriate prior distributions for λ1 and λ2. For example, Park and Casella [7] assigned a Gamma prior on the squared tuning parameter (λ²) in the Bayesian Lasso model, so that the resulting conjugacy allows an easy extension of the Gibbs sampler. However, estimation of the hyper-parameters in the prior distributions can also be a problem.
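The DIC-based choice of tuning parameters can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes a Gaussian linear model with deviance equal to minus twice the log-likelihood, and it replaces the Bayesian GRACE Gibbs sampler with a simple conjugate ridge-type posterior (known error variance, single tuning parameter `lam`) so that the example is self-contained. The names `gaussian_deviance`, `dic`, and `ridge_posterior_draws` are hypothetical.

```python
import numpy as np

def gaussian_deviance(y, X, beta, sigma2):
    """Deviance (-2 * log-likelihood) of a Gaussian linear model."""
    resid = y - X @ beta
    return y.shape[0] * np.log(2.0 * np.pi * sigma2) + resid @ resid / sigma2

def dic(y, X, beta_draws, sigma2_draws):
    """DIC = D_bar + p_D, with p_D = D_bar - D(posterior means)."""
    dev = np.array([gaussian_deviance(y, X, b, s2)
                    for b, s2 in zip(beta_draws, sigma2_draws)])
    d_bar = dev.mean()
    d_hat = gaussian_deviance(y, X, beta_draws.mean(axis=0), sigma2_draws.mean())
    p_d = d_bar - d_hat  # effective number of parameters
    return d_bar + p_d, p_d

def ridge_posterior_draws(y, X, lam, n_draws, rng):
    """Stand-in sampler (assumption): beta ~ N(0, I / lam) prior, sigma2 = 1."""
    precision = X.T @ X + lam * np.eye(X.shape[1])
    cov = np.linalg.inv(precision)
    cov = (cov + cov.T) / 2.0  # enforce symmetry against round-off
    mean = cov @ X.T @ y
    return rng.multivariate_normal(mean, cov, size=n_draws)

# Synthetic data (assumption: illustration only).
rng = np.random.default_rng(1)
n, p, n_draws = 60, 8, 2000
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# Evaluate DIC on a grid of tuning parameters and keep the minimizer.
grid = [0.1, 1.0, 10.0, 100.0]
results = {}
for lam in grid:
    beta_draws = ridge_posterior_draws(y, X, lam, n_draws, rng)
    results[lam] = dic(y, X, beta_draws, np.ones(n_draws))
best_lam = min(results, key=lambda lam: results[lam][0])
print(best_lam, results[best_lam])
```

For the two-parameter case, the same loop would run over a grid of (λ1, λ2) pairs, with the stand-in sampler replaced by the actual Gibbs sampler and the deviance replaced by the one appropriate to the outcome model.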
The Bayesian regularization procedure proposed in this paper and that of Park and Casella [7] are different from the class of Bayesian variable selection procedures [22–24]. Instead of obtaining the posterior probability that a variable should be in the model, the regularization methods obtain the posterior distributions of the regression coefficients by assuming an appropriate prior on these coefficients that reflects the desired penalty terms, as in penalized regression. Methods to incorporate prior graph information into Bayesian variable selection have also recently been developed and studied [25–27]. The key to such approaches is to assume a Markov random field prior for the Bernoulli random variables that indicate whether the covariates should be included in the model. In contrast, our approach assumes a combination of Laplacian and Gaussian Markov random field priors on the regression coefficients. It would be interesting to compare the performance of these different Bayesian approaches for variable selection in high-dimensional settings.
Acknowledgments
This research was supported by NIH grants ES009911 and CA127334.
References
- 1. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B. 1996;58:267–288.
- 2. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- 3. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society B. 2005;67:91–108.
- 4. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B. 2006;68:49–67.
- 5. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B. 2005;67:301–320.
- 6. Wasserman L, Roeder K. High dimensional variable selection. The Annals of Statistics. 2009;37:2178–2201. doi: 10.1214/08-aos646.
- 7. Park T, Casella G. The Bayesian lasso. Journal of the American Statistical Association. 2008;103:681–686.
- 8. Yi N, Xu S. Bayesian lasso for quantitative trait loci. Genetics. 2008;179:1045–1055. doi: 10.1534/genetics.107.085589.
- 9. Li H, Wei Z, Maris J. A hidden Markov random field model for genome-wide association studies. Biostatistics. 2010;11:129–150. doi: 10.1093/biostatistics/kxp043.
- 10. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–1182. doi: 10.1093/bioinformatics/btn081.
- 11. Li C, Li H. Variable selection and regression analysis for covariates with a graphical structure with an application to genomics. Annals of Applied Statistics. 2010. doi: 10.1214/10-AOAS332. In press.
- 12. Friedman J, Hastie T, Hofling H, Tibshirani R. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1:302–332.
- 13. Spiegelhalter DJ, Best NG, Carlin BP. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society B. 2002;64:583–639.
- 14. Chung F. Spectral Graph Theory. American Mathematical Society; 1997.
- 15. Albert J, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88:669–679.
- 16. Casella G. Empirical Bayes Gibbs sampling. Biostatistics. 2001;2:485–500. doi: 10.1093/biostatistics/2.4.485.
- 17. Fikse WF, Rekaya R, Weigel KA. Genotype × environment interaction for milk production in Guernsey cattle. J Dairy Sci. 2003;86:1821–1827. doi: 10.3168/jds.S0022-0302(03)73768-0.
- 18. Rekaya R, Weigel KA, Gianola D. Bayesian estimation of parameters of a structural model for genetic covariances between milk yield in five regions of the United States. J Dairy Sci. 2003;86:1837–1844. doi: 10.3168/jds.S0022-0302(03)73770-9.
- 19. Shriner D, Yi N. Deviance information criterion (DIC) in Bayesian multiple QTL mapping. Computational Statistics and Data Analysis. 2009;53:1850–1860. doi: 10.1016/j.csda.2008.01.016.
- 20. Mosse YP, Laudenslager M, Longo L, Cole KA, Attiyeh EF, Wood A, Laquaglia MJ, Sennett R, Lynch JE, Perri P, Laureys G, Speleman F, Kim C, Hou C, Hakonarson H, Torkamani A, Schork NJ, Brodeur GM, Tonini GP, Rappaport E, Devoto M, Maris JM. Identification of ALK as a major familial neuroblastoma predisposition gene. Nature. 2008;455:930–935. doi: 10.1038/nature07261.
- 21. Maris JM, Mosse YP, Bradfield JP, Hou C, Monni S, Scott RH, Asgharzadeh S, Attiyeh EF, Diskin SJ, Laudenslager M, Winter C, Cole K, Glessner JT, Kim C, Frackelton EC, Casalunovo T, Eckert AW, Capasso M, Rappaport EF, McConville C, London WB, Seeger RC, Rahman N, Devoto M, Grant SFA, Li H, Hakonarson H. A genome-wide association study identifies a susceptibility locus to clinically aggressive neuroblastoma at 6p22. New England Journal of Medicine. 2008;358:2585–2593. doi: 10.1056/NEJMoa0708698.
- 22. George E. The variable selection problem. Journal of the American Statistical Association. 2000;95:1304–1308.
- 23. George E, McCulloch RE. Variable selection via Gibbs sampling. Journal of the American Statistical Association. 1993;88:881–889.
- 24. George E, McCulloch RE. Approaches for Bayesian variable selection. Statistica Sinica. 1997;7:339–374.
- 25. Tai F, Pan W. Bayesian variable selection in regression with networked predictors. 2008. Manuscript.
- 26. Li F, Zhang N. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association. 2010. In press.
- 27. Monni S, Li H. Bayesian methods for network-structured genomics data. Frontier of Statistical Decision Making and Bayesian Analysis - In honor of James O Berger. 2010. In press.




