Abstract
A Bayesian bi-level variable selection method (BAGB: Bayesian Analysis of Group Bridge) is developed for regularized regression and classification. This new development is motivated by grouped data, where generic variables can be divided into multiple groups, with variables in the same group being mechanistically related or statistically correlated. As an alternative to frequentist group variable selection methods, BAGB incorporates structural information among predictors through a group-wise shrinkage prior. Posterior computation proceeds via an efficient MCMC algorithm. In addition to the usual ease-of-interpretation of hierarchical linear models, the Bayesian formulation produces valid standard errors, a feature that is notably absent in the frequentist framework. Empirical evidence of the attractiveness of the method is illustrated by extensive Monte Carlo simulations and real data analysis. Finally, several extensions of this new approach are presented, providing a unified framework for bi-level variable selection in general models with flexible penalties.
Keywords: Bayesian Regularization, Bayesian Variable Selection, Bi-level Variable Selection, Group Bridge, Group Variable Selection, MCMC
1. Introduction
In many scientific applications, covariates are naturally grouped. One such example arises in association studies, where genetic variants can be divided into multiple groups, with variants in the same group being biologically related or statistically correlated. When the features to be associated possess an inherent grouping structure, traditional variable selection methods are known to produce nonsensical results (Breheny and Huang, 2009; Breheny, 2015), a phenomenon that becomes particularly apparent when categorical variables with multiple levels are present (e.g. in an ANOVA model) or nonparametric components are approximated by a linear combination of some basis functions (e.g. in an additive model) (Yuan and Lin, 2006; Qian et al., 2016). Any standard variable selection method thus offers at best a partial solution to the problem, and the longstanding interest in sparse group selection in many fields including genetics and ecology has led to a fair amount of recent work extending these individual variable selection approaches to grouped predictors.
Specifically, several methods have been proposed in the context of linear models; virtually all enforce within-group sparsity and identify predictive features at both the group and individual levels (hereafter referred to as ‘bi-level variable selection’). The exception is the group LASSO (Yuan and Lin, 2006), which performs ‘all-in-all-out’ group selection using an L2 penalty and therefore cannot do bi-level variable selection. Huang et al. (2009) noted that the resulting estimates are not guaranteed to be selection-consistent and proposed the group bridge, which achieves ‘oracle group selection’ consistency using a generalized L1 regularization. Simon et al. (2013) proposed the sparse group LASSO, which places a convex combination of the LASSO (Tibshirani, 1996) and group LASSO penalties on the coefficients. While several other methods have been proposed (Breheny, 2015; Geng et al., 2015), these three remain the most popular, being thoroughly studied and routinely used in various benchmarks (Breheny and Huang, 2009; Huang et al., 2012).
Whatever their individual strengths, these frequentist approaches all lack a systematic approach to inference and cannot flexibly tune multiple shrinkage parameters simultaneously (Leng et al., 2014). Huang et al. (2009) proposed an inferential procedure based on a ridge regression approximation of the covariance matrix; however, this approximation yields standard error estimates of 0 whenever the corresponding estimated coefficient is 0. An alternative route to standard errors is the bootstrap, which is known to run into serious numerical difficulties when the true parameter values are exactly 0 or close to 0; in such cases, bootstrap sampling introduces a bias that does not vanish asymptotically (Kyung et al., 2010; Knight and Fu, 2000). As a result, investigators typically must use the resulting estimate without any quantification of its uncertainty, and must further compromise on the precision with which the tuning parameters are chosen.
We address both these issues by providing a flexible, fully Bayesian approach to perform simultaneous variable selection and group identification. It provides coherent inference through the associated posterior and estimates the tuning parameters as an automatic byproduct of the Markov chain Monte Carlo (MCMC) procedure. We have implemented this method as BAGB (Bayesian Analysis of Group Bridge); it estimates the model parameters by using a structured shrinkage prior which enforces group-wise sparsity on the coefficients. In this study, we also introduce a new data-augmentation strategy based on a novel scale mixture of multivariate uniform (SMU) hierarchical representation. The SMU distribution allows us to implement a computationally efficient Gibbs sampling algorithm, which suggests interesting generalizations in a unified modeling framework. Taken together, the major contributions of this paper are as follows: (i) introduction of a group sparsity-promoting prior in Bayesian regularization, (ii) the SMU representation (defined in Proposition 2), which is central to the proposed methodology, (iii) a simple Gibbs sampler with tractable full conditional posterior distributions, (iv) data-adaptive estimation of the tuning parameters, and (v) extended characterization of Bayesian sparse group selection for modeling a variety of outcomes.
2. The Model
2.1. Background and Notation
Consider the normal linear regression model
y = μ1n + Xβ + ε,   (1)
where y is the n × 1 vector of responses, X is the n × p matrix of regressors, μ is the intercept, 1n is the n × 1 vector of ones, β is the p × 1 vector of coefficients to be estimated, and ε is the n × 1 vector of independent and identically distributed normal errors with mean 0 and variance σ². We assume that there is an intrinsic grouping structure among the predictors, i.e. β = (β1, …, βK)′, where K is the number of groups, the kth group consists of mk elements with m1 + ⋯ + mK = p, and βk is the group-specific coefficient vector (k = 1, …, K). Without loss of generality, we assume that y and X are centered so that μ can be omitted from the model. The group bridge estimator (Huang et al., 2009) results from the following regularization problem
β̂ = arg minβ { (1/2) ||y − Xβ||² + λ Σ_{k=1}^{K} wk ||βk||1^α },   (2)
where ||βk||1 is the L1 norm of βk, λ > 0 is the tuning parameter which controls the degree of penalization, α > 0 is the concavity parameter which controls the shape of the penalty function, the wk’s (wk > 0, k = 1, …, K) are known weights depending on some preliminary estimates, and for any z = (z1, …, zq)′ ∈ ℝq (the q-dimensional Euclidean space), ||z||s is defined as ||z||s = (Σ_{i=1}^{q} |zi|^s)^(1/s), s > 0. In contrast to the original formulation (2), we consider a multi-parameter version that gives rise to the following regularization problem
β̂ = arg minβ { (1/2) ||y − Xβ||² + Σ_{k=1}^{K} λk ||βk||1^α }.   (3)
Introducing multiple tuning parameters provides a natural way to pool information among variables within a group, while also accommodating differential shrinkage through group-specific tuning parameters. Specifically, this allows us to propose a Bayesian approach that incorporates the structural information among predictors by means of a suitable prior distribution (hereafter referred to as the ‘group bridge prior’)
π(β) ∝ ∏_{k=1}^{K} exp(−λk ||βk||1^α).
Rather than minimizing (3), BAGB solves the group bridge problem by using a Gibbs sampler that constructs a Markov chain having the joint posterior of β as its stationary distribution. Unlike most existing sparsity-inducing priors, the BAGB prior induces a multimodal joint posterior distribution. Therefore, both classical and Bayesian approaches to group bridge estimation may encounter practical difficulties related to multimodality. However, as noted in Polson et al. (2014), multimodality is one of the strongest arguments for pursuing a Bayesian approach: by fully exploring the posterior distribution we can handle model uncertainty in a systematic manner, whereas a frequentist approach forces one to summarize by means of a single point estimate. As will be clear from the simulation studies and real data analyses, the proposed Gibbs sampler appears to be very effective at exploring the whole parameter space, unlike the classical group bridge estimator, which suffers from a number of computational drawbacks due to non-convexity that limit its applicability in high-dimensional data analysis (Breheny, 2015).
2.2. Prior Formulation
With a slight abuse of notation, we drop the dependence on the group index k and consider a q-dimensional group bridge prior of the form
π(β) = C(λ, α)⁻¹ exp(−λ ||β||1^α),   β ∈ ℝq,
where C(λ, α) is the normalizing constant. The normalizing constant C(λ, α) can be calculated explicitly as C(λ, α) = 2^q Γ(q/α + 1) / (Γ(q + 1) λ^(q/α)) (detailed in Appendix A). Hence, the normalized group bridge prior is
π(β) = [Γ(q + 1) λ^(q/α) / (2^q Γ(q/α + 1))] exp(−λ ||β||1^α).   (4)
This shows that the group bridge prior is a proper prior, so posterior propriety is retained in any Bayesian analysis that uses it. The posterior mode under a normal likelihood and the group bridge prior (4) coincides exactly with the solution of (3). The group bridge prior can be construed as a multivariate extension of the Bayesian bridge prior (Polson et al., 2014): while Polson et al. (2014) place independent and identical generalized Gaussian priors on the individual coefficients, we assign independent multivariate group bridge priors to each group of coefficients, both to take the structural blocking information among predictors into account and to mimic the behavior of the group bridge penalty. Although the above prior formulation may appear conceptually modest, there are considerable challenges to capitalizing on the BAGB prior distribution, primarily because a scale mixture of multivariate normal (SMN) representation is not explicitly available, owing to the lack of a closed-form expression for the mixing density. Posterior inference under BAGB is further impeded by the multimodality of the resulting posterior and the presence of multiple shrinkage parameters. To overcome these challenges, we appeal to the SMU representation, which allows us to formulate a hierarchical model that alleviates the problem of estimating many parameters. The resulting Gibbs sampler appears to be very effective at exploring the multimodal parameter space, ensuring enriched posterior inference.
2.3. Properties
Proposition 1
Let β ~ π(β), the group bridge prior defined in (4). Then β has the stochastic representation
β =d R U,
where U = β/||β||1 is a random vector uniformly distributed on the unit L1 norm sphere in ℝq, and R = ||β||1 is an absolutely continuous non-negative random variable, independent of U, with density function
f(r) = [α λ^(q/α) / Γ(q/α)] r^(q−1) exp(−λ r^α),   r > 0,
which corresponds to a generalized Gamma distribution.
Proof
See Appendix B.
Based on the above proposition, we provide a straightforward algorithm for sampling directly from the group bridge prior (4) (see Appendix C), which allows us to investigate the shape of the group bridge prior for varying α (Figures 1–2). When 0 < α < 1, the density function is strongly peaked at zero, suggesting that priors in this range are suitable for variable selection, shrinking small coefficients to zero while minimally shrinking large coefficients. Specifically, when α ≤ 1, the density preserves its strong peak at zero (Figure 1); when α > 1, the sharpness of the density diminishes with increasing α (Figure 2), suggesting that priors in this range may not be appropriate if variable selection is desired. Based on this, we consider 0 < α < 1 throughout this article and develop our Gibbs sampler based on a novel mixture representation of the group bridge prior, which we describe next.
Figure 1.
Density plots of four bivariate group bridge priors corresponding to α = {0.05, 0.2, 0.5, 0.8} and λ = 1, based on 10,000 samples drawn from the group bridge prior distribution.
Figure 2.
Density plots of four bivariate group bridge priors corresponding to α = {1, 2, 5, 10} and λ = 1, based on 10,000 samples drawn from the group bridge prior distribution.
2.4. Scale Mixture of Multivariate Uniform Representation
Proposition 2
The group bridge prior (4) can be written as a scale mixture of multivariate uniform (SMU) distributions, with the mixing density being a particular Gamma distribution: β | u ~ Multivariate Uniform(A), where A = {β ∈ ℝq : ||β||1 < u^(1/α)}, u > 0, and u ~ Gamma(q/α + 1, λ) (shape–rate parameterization).
Proof
Detailed in Appendix D.
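As a quick numerical sanity check of Proposition 2, one can draw β by first sampling the mixing variable u from the stated Gamma distribution and then sampling β uniformly on the L1 ball of radius u^(1/α); the resulting ||β||1 draws should match the generalized Gamma density of Proposition 1. The following R sketch illustrates this (the values of q, λ, and α are arbitrary illustration choices, not taken from the paper):

```r
# Monte Carlo check of the SMU representation (Proposition 2).
set.seed(1)
q <- 3; lambda <- 1; alpha <- 0.5
n_draw <- 1e5

# Step 1: mixing variable u ~ Gamma(shape = q/alpha + 1, rate = lambda).
u <- rgamma(n_draw, shape = q / alpha + 1, rate = lambda)

# Step 2: beta | u uniform on the L1 ball of radius u^(1/alpha):
# direction uniform on the unit L1 sphere (normalized i.i.d. Laplace draws),
# radius r with density proportional to r^(q-1) on (0, u^(1/alpha)).
lap <- matrix(rexp(n_draw * q) * sample(c(-1, 1), n_draw * q, replace = TRUE), n_draw, q)
dir <- lap / rowSums(abs(lap))
r   <- u^(1 / alpha) * runif(n_draw)^(1 / q)
beta <- dir * r

# ||beta||_1 should follow the generalized Gamma density of Proposition 1,
# f(r) = alpha * lambda^(q/alpha) * r^(q-1) * exp(-lambda * r^alpha) / gamma(q/alpha).
hist(rowSums(abs(beta)), breaks = 100, freq = FALSE, main = "||beta||_1 under the SMU draw")
curve(alpha * lambda^(q / alpha) * x^(q - 1) * exp(-lambda * x^alpha) / gamma(q / alpha),
      add = TRUE, col = "red")
```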
2.5. Hierarchical Representation
Based on Proposition 2, we formulate our hierarchical representation as follows
y | β, σ² ~ Nn(Xβ, σ²In),
βk | uk ~ Uniform{βk ∈ ℝ^(mk) : ||βk||1 < uk^(1/α)},   independently for k = 1, …, K,
uk | λk ~ Gamma(mk/α + 1, λk),   independently for k = 1, …, K,   (5)
where a non-informative scale-invariant marginal prior is assumed on σ2, i.e. π(σ2) ∝ 1/σ2.
2.6. Full Conditional Distributions
Introduction of u = (u1, u2, …, uK)′ enables us to derive the full conditional distributions as follows
π(uk | β, σ², y) ∝ exp(−λk uk) I(uk > ||βk||1^α),   k = 1, …, K,   (6)
π(β | u, σ², y) ∝ exp{ −||y − Xβ||² / (2σ²) } ∏_{k=1}^{K} I(||βk||1^α < uk),   (7)
σ² | β, u, y ~ Inverse-Gamma(n/2, ||y − Xβ||²/2),   (8)
where I(.) denotes an indicator function. The proof involves only simple algebra, and is therefore omitted.
2.7. Sampling Coefficients and the Latent Variables
Equations (6), (7), and (8) lead us to a Gibbs sampler that starts from initial guesses for β and σ² and iterates the following steps (an R sketch of one full scan is given after this list):

- Generate uk from the left-truncated exponential distribution (6) by inversion: draw tk ~ Uniform(0, 1) and set uk = ||βk||1^α − log(1 − tk)/λk, k = 1, …, K.
- Generate β from the truncated multivariate normal distribution proportional to the posterior distribution of β (described in Appendix E).
- Generate σ² from the Inverse Gamma distribution proportional to the posterior distribution of σ² described in (8).
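For concreteness, a minimal R sketch of one full Gibbs scan is given below. The helper draw_beta_tmvn() stands in for the group-wise truncated multivariate normal update of Appendix E; this helper name, and the assumption that α and the λk's are held fixed within the scan, are illustrative choices rather than part of the original implementation.

```r
# One Gibbs scan for the BAGB linear model; 'groups' is a list of index vectors,
# one per group of coefficients.
gibbs_scan <- function(y, X, beta, sigma2, lambda, alpha, groups) {
  K <- length(groups)

  # Step 1 (eq. 6): u_k | . is Exponential(lambda_k) left-truncated at ||beta_k||_1^alpha;
  # by memorylessness this equals the truncation point plus an Exponential(lambda_k) draw.
  u <- sapply(seq_len(K), function(k)
    sum(abs(beta[groups[[k]]]))^alpha + rexp(1, rate = lambda[k]))

  # Step 2 (eq. 7): beta | . is N((X'X)^{-1} X'y, sigma2 (X'X)^{-1}) truncated to the
  # region where ||beta_k||_1 < u_k^(1/alpha) for every group k (hypothetical helper).
  beta <- draw_beta_tmvn(y, X, sigma2, radius = u^(1 / alpha), groups)

  # Step 3 (eq. 8): sigma2 | . is Inverse-Gamma(n/2, ||y - X beta||^2 / 2).
  rss <- sum((y - X %*% beta)^2)
  sigma2 <- 1 / rgamma(1, shape = length(y) / 2, rate = rss / 2)

  list(beta = beta, sigma2 = sigma2, u = u)
}
```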
2.8. Sampling Hyperparameters
To update the tuning parameters (λ1, …, λK), we work directly with the group bridge prior, marginalizing out the latent variables uk's. From (5), we observe that the posterior for λk, given β, is conditionally independent of y. Therefore, if each λk has a Gamma(a, b) prior, the tuning parameters can be updated by generating samples from their joint conditional posterior distribution
λk | β, α ~ Gamma(a + mk/α, b + ||βk||1^α),   independently for k = 1, …, K.
The concavity parameter α is usually prefixed beforehand. However, it can also be estimated by assigning a suitable prior π(α). Since 0 < α < 1, a natural choice for prior on α is the Beta distribution. Assuming a Beta (c, d) prior on α, the posterior distribution of α is given by
π(α | β, λ) ∝ ∏_{k=1}^{K} [λk^(mk/α) / Γ(mk/α + 1)] exp(−λk ||βk||1^α) × α^(c−1) (1 − α)^(d−1),   0 < α < 1.   (9)
To sample from (9), one can use a Metropolis–Hastings step in which candidates are drawn from the Beta(c, d) prior; since the proposal equals the prior, the prior terms cancel and a move from the current state α to a candidate α′ is accepted with probability
min{ 1, ∏_{k=1}^{K} [π(βk | λk, α′) / π(βk | λk, α)] },
where π(βk | λk, ·) denotes the normalized group bridge prior (4) for the kth group. In our experience, a Beta(10, 10) prior leads to a better rate of convergence and mixing than the uniform prior, and we therefore recommend it as the default prior for real data analyses.
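A sketch of these hyperparameter updates in R follows; the function names and default hyperparameter values are illustrative, and π(βk | λk, α) is evaluated through the normalized prior density derived in Appendix A.

```r
# Conjugate update for the group-specific tuning parameters:
# lambda_k | beta, alpha ~ Gamma(shape = a + m_k/alpha, rate = b + ||beta_k||_1^alpha).
update_lambda <- function(beta, groups, alpha, a = 1, b = 0.1) {
  sapply(groups, function(idx)
    rgamma(1, shape = a + length(idx) / alpha,
              rate  = b + sum(abs(beta[idx]))^alpha))
}

# Log of the normalized group bridge prior (4) for one group of size m.
log_gb_prior <- function(beta_k, lambda_k, alpha) {
  m <- length(beta_k)
  (m / alpha) * log(lambda_k) + lgamma(m + 1) - m * log(2) -
    lgamma(m / alpha + 1) - lambda_k * sum(abs(beta_k))^alpha
}

# Metropolis-Hastings update for alpha with candidates drawn from the Beta(c, d)
# prior; the prior cancels, leaving the ratio of the group bridge prior terms.
update_alpha <- function(alpha, beta, lambda, groups, c = 10, d = 10) {
  alpha_prop <- rbeta(1, c, d)
  log_ratio <- sum(mapply(function(idx, lam)
      log_gb_prior(beta[idx], lam, alpha_prop) - log_gb_prior(beta[idx], lam, alpha),
    groups, lambda))
  if (log(runif(1)) < log_ratio) alpha_prop else alpha
}
```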
3. Results
In this section, we evaluate the statistical properties of BAGB and compare its performance with three competing approaches, viz. the group LASSO (GL), the sparse group LASSO (SGL), and the group bridge (GB). We make use of the default parameters available in the R software (https://www.r-project.org/) implementations of these methods (grpreg and SGL), none of which assume invariance under within-group orthonormality of the design matrix. The frequentist methods provide an estimate of the degrees of freedom (DF) as a standard output, with the exception of SGL, for which DF is estimated as the number of non-zero coefficients in the model. Regarding the selection of the tuning parameters, Huang et al. (2009) indicated that tuning based on BIC (Schwarz, 1978) in general leads to better variable selection performance than other model selection criteria such as AIC (Akaike, 1974), Cp (Mallows, 1973), and GCV (Golub et al., 1979). Therefore, the tuning parameters of all the frequentist methods under consideration are selected by BIC, which, for a fitted model with residual sum of squares RSS and estimated degrees of freedom DF, is defined as
BIC = log(RSS/n) + DF · log(n)/n.
For BAGB, we set the hyperparameters corresponding to the λj's as a = 1 and b = 0.1, which is relatively flat and results in high posterior probability near the MLE (Kyung et al., 2010). BAGB estimates are posterior means computed from the post-burn-in samples of the Gibbs sampler. To decide on the burn-in length, we calculate the potential scale reduction factor R̂ (Gelman and Rubin, 1992) using the coda package in R. Once R̂ < 1.1 for all parameters of interest, we continue to draw the desired number of iterations to obtain samples from the joint posterior distribution. The convergence of the MCMC algorithm is also investigated by trace and ACF plots of the generated samples. In particular, the Markov chain is run for a total of 30,000 iterations, of which 15,000 are discarded as burn-in (here and in the simulation study as well) and the rest are used as draws from the posterior distribution for making inference. This choice of running parameters appears to work satisfactorily based on the convergence diagnostics. The response is centered and the predictors are normalized to have zero means and unit variances before applying any group regularization method. For variable selection with BAGB, we make use of the Scaled Neighbourhood Criterion (SNC) (Li and Lin, 2010) with threshold τ = 0.5: for each coefficient βj, the SNC is the posterior probability P(|βj| ≤ sd(βj | y) | y), and the predictor is excluded from the model when this probability exceeds τ.
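The SNC rule can be applied directly to the post-burn-in draws; a minimal sketch (with an illustrative function name and a draws-by-predictors matrix of posterior samples) is:

```r
# Scaled Neighbourhood Criterion (Li and Lin, 2010): exclude predictor j when the
# posterior probability that beta_j lies within one posterior standard deviation
# of zero exceeds the threshold tau.
snc_select <- function(beta_draws, tau = 0.5) {
  snc <- apply(beta_draws, 2, function(b) mean(abs(b) <= sd(b)))
  which(snc <= tau)   # indices of the predictors retained in the model
}
```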
3.1. The Birth Weight Example
First, we illustrate our proposal by re-examining the birth weight dataset from Hosmer and Lemeshow (1989). We make use of a reparametrized data frame available from the R package grpreg. The original dataset, collected at Baystate Medical Center, Springfield, Massachusetts, USA in 1986, consists of the birth weights (in grams) of 189 babies, which serve as the response variable of interest, and 8 predictors concerning the mother. Among the eight predictors, two are continuous (mother’s age in years and mother’s weight in pounds at the last menstrual period) and six are categorical (mother’s race, smoking status during pregnancy, number of previous premature labors, history of hypertension, presence of uterine irritability, and number of physician visits during the first trimester). Since a preliminary analysis suggested that nonlinear effects of both mother’s age and weight might exist, both effects are modeled using third-order polynomials (Yuan and Lin, 2006). The reparametrized data frame includes p = 16 predictors corresponding to K = 8 groups consisting of 3, 3, 2, 1, 2, 1, 1, and 3 variables respectively (Table 1). As in the original analysis (Yuan and Lin, 2006), we randomly select 151 of the 189 observations (roughly 80%) for model fitting and reserve the remaining 38 as the test set.
Table 1.
Description of variables in the birth weight dataset.
| Group | Variables | Description |
|---|---|---|
| Group 1 | age1, age2, age3 | Orthogonal polynomials of first, second, and third degree representing mother’s age in years |
| Group 2 | lwt1, lwt2, lwt3 | Orthogonal polynomials of first, second, and third degree representing mother’s weight in pounds at last menstrual period |
| Group 3 | white, black | Indicator functions for mother’s race (“other” is the reference group) |
| Group 4 | smoke | Smoking status during pregnancy |
| Group 5 | ptl1, ptl2m | Indicator functions for 1 and ≥ 2 previous premature labors (“no previous premature labors” is the reference category) |
| Group 6 | ht | History of hypertension |
| Group 7 | ui | Presence of uterine irritability |
| Group 8 | ftv1, ftv2, ftv3m | Indicator functions for 1, 2, and ≥ 3 physician visits during the first trimester (“no visits” is the reference category) |
We first investigate the possible susceptibility of the results to the Beta prior on α by considering several hyperparameter settings that cover a range of non-informative choices (Figure 3). This sensitivity analysis reveals that BAGB significantly outperforms the other three methods in terms of prediction accuracy regardless of the values of the hyperparameters (Table 2). We repeat the random selection of training and test sets many times and obtain results similar to those in Table 2. The analysis also reveals that the posterior estimate of α can be sensitive to the choice of the prior distribution. Following arguments similar to those in Chakraborty and Guo (2011), we recommend that the mean of the Beta distribution be kept above 0.3 and that the variance be kept relatively small.
Table 2.
Summary of birth weight data analysis. ‘MSE’ reports the average of mean squared errors on the test data. The variables selected by each method, as well as an estimate of the concavity parameter (whenever applicable), are also provided.
| Method | Selected Variables | MSE | Estimated α |
|---|---|---|---|
| BAGB - Beta (1, 9) | white, ui | 710151.1 | 0.04 |
| BAGB - Beta (2, 8) | white, ptl1, ui | 701676.6 | 0.04 |
| BAGB - Beta (3, 7) | white, ui | 696373.8 | 0.06 |
| BAGB - Beta (4, 6) | white, ui | 697454.7 | 0.4 |
| BAGB - Beta (5, 5) | white, ptl1, ui | 693814.3 | 0.07 |
| BAGB - Beta (6, 4) | white, ht, ui | 664035.1 | 0.3 |
| BAGB - Beta (7, 3) | white, ptl1, ui | 683937.7 | 0.1 |
| BAGB - Beta (8, 2) | white, ptl1, ui | 686022.2 | 0.1 |
| BAGB - Beta (9, 1) | white, ui | 681118.1 | 0.1 |
| BAGB - Beta (10, 10) | white, ht, ui | 662615.2 | 0.3 |
| BAGB - Uniform | white, ui | 699259.5 | 0.06 |
| GB | lwt1, lwt3, white, ptl1, ht, ui | 758162.2 | Prefixed at 0.5 |
| GL | lwt1, lwt2, lwt3, white, black, smoke, ptl1, ptl2m, ht, ui | 744161.4 | Not applicable |
| SGL | white, ui | 868976.3 | Not applicable |
Next, we compare the prediction accuracy of the two group bridge estimators for varying α, which clearly suggests a preference for the BAGB model across the values of α (Table 3). We observe that the GB solutions do not always coincide with the posterior mode estimates (Figure 4). This can be attributed to BAGB’s ability to account for the uncertainty associated with the hyperparameters, which is ignored altogether by frequentist GB. Another notable feature of this figure is the pronounced multimodality in the joint BAGB posterior. Clearly, summarizing β by means of a single point estimate may not be reasonable, yet such a point estimate is the only summary available in the frequentist framework. We also observe that, for this particular dataset, BAGB performs better with α held fixed than with α estimated, possibly because estimating α explains little additional variance here (Tables 2–3). The mixing of the MCMC chain is reasonably good, as confirmed by the trace and ACF plots and convergence diagnostics (not shown). Thus, in addition to mimicking the bi-level variable selection property of frequentist group variable selection methods, BAGB provides the full posterior summary of the coefficients, together with better uncertainty assessment of the tuning parameters.
Table 3.
A comparison of GB and BAGB with respect to MSE for varying α, based on the 38 test set observations.
| Method | α = 0.1 | α = 0.3 | α = 0.5 | α = 0.7 | α = 0.9 |
|---|---|---|---|---|---|
| BAGB | 705374.0 | 672173.1 | 654960.2 | 640146.8 | 625812.3 |
| GB | 839626.6 | 838866.3 | 758162.2 | 764094.2 | 728786.2 |
3.2. Simulation Experiments
Next, we conduct extensive Monte Carlo simulation experiments to benchmark our method against other state-of-the-art group variable selection methods. For a fair comparison, we prefix α = 0.5 for both group bridge methods. To evaluate prediction performance, following Zou (2006), we calculate the median of relative prediction errors (MRPE) over 100 replications, where the relative prediction error is defined as RPE = E(ŷ − x′β*)²/σ², β* is the true coefficient vector, and ŷ is the predicted response. After partitioning each synthetic dataset into a training set and a test set, models are fitted on the training set and RPEs are calculated on the test set. For any variable selection procedure, we measure its performance by Sensitivity (Sen) and Specificity (Spe), as defined in Chen and Wang (2013). We also report the percentage of occasions on which the correct groups are selected (averaged over 100 replications) for the various estimators, where a group is counted as selected if any of the variables within that group is selected.
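A small sketch of how these performance measures can be computed for a single replication is given below (the helper functions are illustrative; Sen is taken as the proportion of truly nonzero coefficients that are selected and Spe as the proportion of truly zero coefficients that are excluded, which is the usual convention):

```r
# Relative prediction error on a test design matrix, following the definition above.
rpe <- function(X_test, beta_hat, beta_true, sigma2) {
  mean((X_test %*% beta_hat - X_test %*% beta_true)^2) / sigma2
}

# 'selected' is a logical vector indicating which coefficients a method retains.
sens_spec <- function(selected, beta_true) {
  truth <- beta_true != 0
  c(Sen = mean(selected[truth]),    # fraction of true signals that are selected
    Spe = mean(!selected[!truth]))  # fraction of true nulls that are excluded
}
```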
We simulate data from the true model y = Xβ + ε, ε ~ N(0, σ²I). Following Huang et al. (2009), we consider two scenarios. For the generating model in Example 1, the number of groups is moderately large, the group sizes are equal and relatively large, and within each group the coefficients are either all nonzero or all zero. In Example 2, the group sizes vary and some nonzero groups contain coefficients equal to zero. For a relatively complete evaluation, we vary the signal-to-noise ratio (SNR) from low to moderate to high for varying sample sizes {nT, nP}, where nT denotes the size of the training set and nP the size of the test set. Most previous simulation experiments (including those reported in Huang et al. (2009)) have considered only high SNR values that do not accurately reflect the characteristics of real data (Levina and Zhu, 2008). Therefore, with the simulation study attempted in this paper, we are able to investigate the possible sensitivity of the methods on realistic synthetic data, which is an important contribution to the benchmarking of existing and future methods. In particular, we consider four levels of SNR ranging from low (1) to moderate (3, 5) to high (10) and two sample-size settings, {nT, nP} = {200, 200} and {400, 400}; σ² is chosen such that the desired level of SNR is achieved.
Simulation 1 (“All-In-All-Out”)
In this simulation example, we assume that there are K = 5 groups and each group consists of eight covariates, resulting in p = 40 predictors. The design matrix X consists of 5 sub-matrices, i.e. X = (X1, X2, X3, X4, X5), where Xj = (x8(j−1)+1, … x8(j−1)+8)′ for any j ∈ {1, …, 5}. The covariates x1, …, x40 are generated as follows:
First simulate R1, …, R40 independently from the standard normal distribution.
Simulate Z1, …, Z5 from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between Zj1 and Zj2 equal to 0.5^|j1−j2| for all j1 ≠ j2, j1, j2 = 1, …, 5.
Obtain xj by combining Rj and Zgj, j = 1, …, 40, where gj is the smallest integer greater than (j − 1)/8, so that the xj’s with the same value of gj belong to the same group.
The coefficients corresponding to the first two groups are given as (β1, …, β8)′ = (0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4)′ and (β9, …, β16)′ = (2, …, 2)′; the remaining coefficients are set to zero. Thus, the coefficients in each group are either all nonzero or all zero, leading to an “All-In-All-Out” scenario. A short R sketch of this data generating process is given below.
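The sketch below implements the construction just described. The exact formula for combining Rj and Zgj is not spelled out in the text, so the combination xj = (Zgj + Rj)/√2 (which yields a within-group correlation of 0.5) is an assumption made for illustration, as is the use of Var(Xβ)/SNR for choosing σ².

```r
# Simulation 1 design ("All-In-All-Out"); the combination of R_j and Z_{g_j} and the
# SNR-to-sigma^2 conversion below are assumptions made for this sketch.
gen_sim1 <- function(n, snr) {
  p <- 40; K <- 5
  R <- matrix(rnorm(n * p), n, p)
  S <- 0.5^abs(outer(1:K, 1:K, "-"))            # corr(Z_j1, Z_j2) = 0.5^|j1 - j2|
  Z <- matrix(rnorm(n * K), n, K) %*% chol(S)   # latent group-level variables
  g <- rep(1:K, each = 8)                       # group membership g_j
  X <- (Z[, g] + R) / sqrt(2)                   # assumed combination rule
  beta <- c(seq(0.5, 4, by = 0.5), rep(2, 8), rep(0, 24))
  mu <- drop(X %*% beta)
  sigma2 <- var(mu) / snr                       # assumed SNR definition
  y <- mu + rnorm(n, sd = sqrt(sigma2))
  list(X = X, y = y, beta = beta, sigma2 = sigma2)
}
```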
Simulation 2 (“Not-All-In-All-Out”)
In this example, we assume that the coefficients in a group can be either all zero, all nonzero or partly zero, leading to a “Not-All-In-All-Out” scenario. We also assume that the group sizes differ across groups. There are K = 6 groups made up of three groups each of size 10 and three groups each of size 4, resulting in p = 42 predictors. The design matrix X consists of 6 sub-matrices, i.e. X = (X1, X2, X3, X4, X5, X6), where Xj = (x10(j−1)+1, … x10(j−1)+10)′ for any j ∈ {1, 2, 3} and Xj = (x4(j−4)+31, …, x4(j−4)+34)′ for any j ∈ {4, 5, 6}. The covariates x1, …, x42 are generated as follows:
First simulate R1, …, R42 independently from the standard normal distribution.
Simulate Z1, …, Z6 from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between Zj1 and Zj2 equal to 0.5^|j1−j2| for all j1 ≠ j2, j1, j2 = 1, …, 6.
For j = 1, …, 30, let gj be the largest integer less than j/10 + 1, and for j = 31, …, 42, let gj be the largest integer less than (j − 30)/4 + 1. Then xj is obtained by combining Rj and Zgj as in Simulation 1, j = 1, …, 42.
The coefficients are given as follows
The operating characteristics measuring the variable selection and prediction performance of the various methods are summarized in Tables 4–5. First, in low SNR (high noise) settings, BAGB vastly outperforms the existing methods, regardless of the sample size and the sparsity structure of the data generating model. Even for moderate to high SNRs, it consistently produces the smallest prediction error across all simulation examples. This reflects the ability of BAGB to account for uncertainty in the tuning parameters in a systematic fashion, which leads to its better prediction performance. Second, none of the methods uniformly dominates the others in variable selection performance. In the presence of within-group sparsity (Simulation 2), BAGB has higher or similar sensitivity (and lower specificity) compared with both GB and SGL, indicating that, as far as bi-level variable selection is concerned, BAGB is not as ‘aggressive’ as the frequentist methods in excluding predictors. This holds true even in the absence of within-group sparsity (Simulation 1) when the corresponding SNR is moderate or high, although BAGB maintains good specificity there. For low SNR scenarios in Simulation 1, BAGB has slightly worse sensitivity than the frequentist GB estimator, although it maintains better specificity. Third, consistent with previous studies (Huang et al., 2009), group LASSO has the lowest specificity in all the simulation examples, regardless of the sample size, which is expected as it tends to select all variables in an unimportant group; on the other hand, it has the highest sensitivity across all examples. Fourth, in terms of the percentage of correct groups selected, BAGB always performs better than the frequentist GB estimator in Simulation 1, although it is outperformed by both GL and SGL, especially in low and moderate SNR scenarios; all the methods have similar performance in selecting correct groups in Simulation 2. Lastly, the variable selection performance of BAGB depends on the threshold parameter τ: a higher threshold results in higher sensitivity but lower specificity, so the researcher may choose the threshold according to the relative importance of sensitivity and specificity in a particular study.
Table 4.
Results from Simulation 1. MRPE and Correct Selection (percentage of occasions on which the correct groups are selected, averaged over 100 replications) of the various estimators are reported. The numbers in parentheses are the corresponding standard errors of the RPEs based on 500 bootstrap resamples.
| nT = nP | SNR | Method | Sen | Spe | MRPE | % Correct Selection |
|---|---|---|---|---|---|---|
| 200 | 1 | BAGB | 0.175 | 0.921 | 0.009(0.019) | 11 |
| 200 | 1 | GB | 0.241 | 0.881 | 0.012(0.013) | 7 |
| 200 | 1 | GL | 0.535 | 0.663 | 0.049(0.011) | 27 |
| 200 | 1 | SGL | 0.076 | 0.965 | 0.068(0.014) | 13 |
| 200 | 3 | BAGB | 0.352 | 0.903 | 0.011(0.020) | 41 |
| 200 | 3 | GB | 0.352 | 0.937 | 0.036(0.014) | 21 |
| 200 | 3 | GL | 0.880 | 0.493 | 0.088(0.015) | 77 |
| 200 | 3 | SGL | 0.376 | 0.837 | 0.076(0.022) | 72 |
| 200 | 5 | BAGB | 0.478 | 0.892 | 0.026(0.021) | 64 |
| 200 | 5 | GB | 0.451 | 0.948 | 0.049(0.017) | 35 |
| 200 | 5 | GL | 0.955 | 0.443 | 0.112(0.018) | 91 |
| 200 | 5 | SGL | 0.538 | 0.768 | 0.056(0.022) | 94 |
| 200 | 10 | BAGB | 0.682 | 0.889 | 0.043(0.020) | 88 |
| 200 | 10 | GB | 0.584 | 0.958 | 0.081(0.018) | 56 |
| 200 | 10 | GL | 1.000 | 0.393 | 0.128(0.016) | 100 |
| 200 | 10 | SGL | 0.698 | 0.743 | 0.051(0.017) | 100 |
| 400 | 1 | BAGB | 0.237 | 0.932 | 0.016(0.008) | 20 |
| 400 | 1 | GB | 0.292 | 0.934 | 0.023(0.009) | 13 |
| 400 | 1 | GL | 0.760 | 0.553 | 0.047(0.009) | 53 |
| 400 | 1 | SGL | 0.172 | 0.915 | 0.067(0.014) | 31 |
| 400 | 3 | BAGB | 0.505 | 0.912 | 0.023(0.009) | 67 |
| 400 | 3 | GB | 0.482 | 0.962 | 0.034(0.011) | 32 |
| 400 | 3 | GL | 0.980 | 0.417 | 0.064(0.010) | 96 |
| 400 | 3 | SGL | 0.584 | 0.770 | 0.039(0.008) | 96 |
| 400 | 5 | BAGB | 0.669 | 0.910 | 0.029(0.009) | 92 |
| 400 | 5 | GB | 0.586 | 0.982 | 0.053(0.008) | 52 |
| 400 | 5 | GL | 0.995 | 0.387 | 0.072(0.010) | 99 |
| 400 | 5 | SGL | 0.698 | 0.741 | 0.045(0.008) | 100 |
| 400 | 10 | BAGB | 0.848 | 0.910 | 0.036(0.010) | 100 |
| 400 | 10 | GB | 0.795 | 0.998 | 0.053(0.009) | 93 |
| 400 | 10 | GL | 1.000 | 0.320 | 0.085(0.011) | 100 |
| 400 | 10 | SGL | 0.841 | 0.732 | 0.046(0.008) | 100 |
Additional Simulations with Small Sample Sizes and Categorical Predictors
To investigate the impact of small sample sizes and categorical covariates on the performance of the proposed method, as suggested by an anonymous reviewer, we conduct additional simulation experiments. In particular, we repeat the simulation studies above (Simulation 1 and Simulation 2) with {nT, nP} = {100, 100} and SNR = {1, 3, 5}. In addition, we carry out another simulation study with categorical covariates (Simulation 3), in which we generate 15 latent variables Z1, Z2, …, Z15 from a multivariate normal distribution with mean 0, variance 1, and pairwise correlations between Zi and Zj equal to 0.5^|i−j| for all i ≠ j (Park and Yoon, 2011). Each Zi is trichotomized as 0, 1, or 2 according to whether it is smaller than Φ⁻¹(1/3), larger than Φ⁻¹(2/3), or in between. Then y is generated from
where I(.) is the indicator function and the noise ε is normally distributed with variance σ² chosen so that the SNR is approximately 1.8. For each run, two sample sizes are considered: {nT, nP} = {100, 100} and {nT, nP} = {200, 200}. The results (presented in Table 6) indicate that BAGB performs as well as or better than the frequentist methods in prediction accuracy. Here also, no method dominates the others with respect to variable selection performance. In terms of the percentage of occasions on which the correct groups are selected, BAGB exhibits superior performance in both Simulations 2 and 3, although it is outperformed by both GL and SGL in Simulation 1. In Table 7, we also report the computing time of our method for each of the simulation studies. We note that the runtime of our algorithm is not directly comparable to that of the frequentist methods, as it depends on many factors unrelated to the theoretical properties of the sampler (e.g. programming language, compiler options, whether a lower-level language is used, the libraries one links to, and the system hardware, among others); beyond that, how one implements each Gibbs step also affects the run time. While the method is more computationally intensive than calculating a solution path for a classical penalized regression approach like group bridge, this is the price to be paid for exploring uncertainty in the parameter estimates, and the computational demands of the method are certainly smaller than those of Bayesian approaches that search the model space directly. The results presented here employ the default options and libraries of the R software. The computation time of BAGB could be reduced significantly by reimplementing the sampler in an efficient low-level language such as C++ or by using GPU programming, which we intend to do in future work.
Table 6.
Results from the additional simulation experiments (Simulations 1 and 2 with nT = nP = 100, and Simulation 3). MRPE and Correct Selection (percentage of occasions on which the correct groups are selected, averaged over 100 replications) of the various estimators are reported. The numbers in parentheses are the corresponding standard errors of the RPEs based on 500 bootstrap resamples.
| Simulation | nT = nP | SNR | Method | Sen | Spe | MRPE | % Correct Selection |
|---|---|---|---|---|---|---|---|
| Simulation 1 | 100 | 1 | BAGB | 0.206 | 0.870 | 0.088(0.023) | 20 |
| Simulation 1 | 100 | 1 | GB | 0.304 | 0.754 | 0.114(0.013) | 29 |
| Simulation 1 | 100 | 1 | GL | 0.350 | 0.793 | 0.111(0.022) | 16 |
| Simulation 1 | 100 | 1 | SGL | 0.064 | 0.970 | 0.103(0.028) | 12 |
| Simulation 1 | 100 | 3 | BAGB | 0.322 | 0.854 | 0.107(0.022) | 36 |
| Simulation 1 | 100 | 3 | GB | 0.238 | 0.905 | 0.107(0.018) | 13 |
| Simulation 1 | 100 | 3 | GL | 0.745 | 0.557 | 0.175(0.028) | 54 |
| Simulation 1 | 100 | 3 | SGL | 0.192 | 0.908 | 0.164(0.027) | 44 |
| Simulation 1 | 100 | 5 | BAGB | 0.414 | 0.843 | 0.120(0.021) | 54 |
| Simulation 1 | 100 | 5 | GB | 0.311 | 0.921 | 0.120(0.015) | 23 |
| Simulation 1 | 100 | 5 | GL | 0.865 | 0.483 | 0.203(0.027) | 73 |
| Simulation 1 | 100 | 5 | SGL | 0.359 | 0.845 | 0.120(0.023) | 77 |
| Simulation 2 | 100 | 1 | BAGB | 0.821 | 0.653 | 0.384(0.024) | 72 |
| Simulation 2 | 100 | 1 | GB | 0.446 | 0.934 | 0.470(0.022) | 7 |
| Simulation 2 | 100 | 1 | GL | 0.856 | 0.688 | 0.467(0.032) | 41 |
| Simulation 2 | 100 | 1 | SGL | 0.220 | 0.957 | 0.651(0.034) | 16 |
| Simulation 2 | 100 | 3 | BAGB | 0.961 | 0.586 | 0.498(0.034) | 96 |
| Simulation 2 | 100 | 3 | GB | 0.835 | 0.850 | 0.583(0.042) | 60 |
| Simulation 2 | 100 | 3 | GL | 0.993 | 0.479 | 0.579(0.060) | 94 |
| Simulation 2 | 100 | 3 | SGL | 0.602 | 0.898 | 1.059(0.079) | 70 |
| Simulation 2 | 100 | 5 | BAGB | 0.980 | 0.565 | 0.540(0.032) | 100 |
| Simulation 2 | 100 | 5 | GB | 0.953 | 0.808 | 0.554(0.039) | 91 |
| Simulation 2 | 100 | 5 | GL | 1.000 | 0.345 | 0.598(0.051) | 100 |
| Simulation 2 | 100 | 5 | SGL | 0.720 | 0.887 | 1.323(0.084) | 85 |
| Simulation 3 | 100 | ≈1.8 | BAGB | 0.980 | 0.812 | 0.060(0.019) | 99 |
| Simulation 3 | 100 | ≈1.8 | GB | 0.910 | 0.946 | 0.060(0.008) | 92 |
| Simulation 3 | 100 | ≈1.8 | GL | 0.993 | 0.776 | 0.080(0.013) | 18 |
| Simulation 3 | 100 | ≈1.8 | SGL | 0.780 | 0.780 | 0.120(0.019) | 16 |
| Simulation 3 | 200 | ≈1.8 | BAGB | 0.900 | 0.768 | 0.110(0.01) | 89 |
| Simulation 3 | 200 | ≈1.8 | GB | 0.768 | 0.924 | 0.110(0.01) | 63 |
| Simulation 3 | 200 | ≈1.8 | GL | 0.970 | 0.766 | 0.110(0.01) | 13 |
| Simulation 3 | 200 | ≈1.8 | SGL | 0.548 | 0.945 | 0.230(0.02) | 5 |
Table 7.
Average computing time (in minutes) of BAGB based on 100 replications. The numbers in parentheses are the corresponding standard errors.
| Simulation | SNR | nT = nP = 100 | nT = nP = 200 | nT = nP = 400 |
|---|---|---|---|---|
| Simulation 1 (p = 40) | 1 | 5.22(0.21) | 5.21(0.22) | 4.42(0.17) |
| Simulation 1 (p = 40) | 3 | 5.19(0.21) | 5.20(0.09) | 4.30(0.21) |
| Simulation 1 (p = 40) | 5 | 4.30(0.21) | 5.10(0.19) | 5.14(0.22) |
| Simulation 1 (p = 40) | 10 | 4.25(0.21) | 5.12(0.18) | |
| Simulation 2 (p = 42) | 1 | 5.40(0.10) | 5.54(0.20) | 4.71(0.13) |
| Simulation 2 (p = 42) | 3 | 5.53(0.22) | 5.51(0.22) | 4.50(0.09) |
| Simulation 2 (p = 42) | 5 | 5.43(0.20) | 5.52(0.22) | 5.52(0.12) |
| Simulation 2 (p = 42) | 10 | 4.60(0.16) | 5.40(0.17) | |
| Simulation 3 (p = 30) | — | 5.50(0.19) | 5.49(0.22) | |
4. Extensions
4.1. Extension to General Models
In this section, we briefly discuss how BAGB can be extended to several other models (e.g. GLMs, Cox’s model, etc.) beyond linear regression. Our extension is based on the least squares approximation (LSA) of Wang and Leng (2007). Following Wang and Leng (2007), for a general model, the likelihood of β is approximated by the normal form
exp{ −(1/2) (β − β̃)′ Σ̂⁻¹ (β − β̃) },
where β̃ is the MLE of β and Σ̂⁻¹ = −∂²L(β)/∂β∂β′ evaluated at β = β̃ (the observed information), with L(β) denoting the log-likelihood. Thus, we can easily extend our method to several other models by approximating the corresponding likelihood with this normal likelihood. Combining the SMU representation of the group bridge prior with the LSA approximation of the general likelihood, the hierarchical representation of BAGB for general models is obtained by replacing the first stage of (5) with the LSA working likelihood:
β̃ | β ~ N(β, Σ̂),
βk | uk ~ Uniform{βk ∈ ℝ^(mk) : ||βk||1 < uk^(1/α)},   k = 1, …, K,
uk | λk ~ Gamma(mk/α + 1, λk),   k = 1, …, K.
As before, an efficient Gibbs sampler can be easily carried out based on the full conditionals, which are easily derived as before (detailed in Appendix F).
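As a concrete illustration of the LSA ingredients, they can be obtained from a standard maximum likelihood fit of the working model; the sketch below uses a logistic regression fitted with glm() as the example likelihood (the function name lsa_ingredients is illustrative):

```r
# Least squares approximation (Wang and Leng, 2007): approximate the likelihood of
# beta by a normal centered at the MLE with covariance equal to the inverse observed
# information, then use these quantities in place of (y, X, sigma2) in the hierarchy.
lsa_ingredients <- function(y, X) {
  fit <- glm(y ~ X - 1, family = binomial())
  list(beta_tilde = coef(fit),   # MLE of beta
       Sigma_hat  = vcov(fit))   # estimated covariance of the MLE
}
```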
4.2. Application to the Birthweight Data - Logistic Regression Illustration
As a second example, we revisit the birth weight data, this time to evaluate the performance of BAGB in general models. The binary response variable of interest is whether the infant’s birth weight is less than 2.5 kg (Y = 1) or not (Y = 0). As before, we randomly select 151 observations for model fitting and use the rest as the test set to validate the test errors. Table 8 summarizes the comparison of test errors, which reveals that BAGB performs best, with SGL following closely. Figure 5 displays the 95% equal-tailed credible intervals for the regression parameters for this dataset corresponding to α = 0.5. We emphasize that, among the methods depicted here, only BAGB can provide valid standard errors, based on a geometrically ergodic Markov chain. It is interesting to note that all the frequentist estimates fall inside the credible intervals, which suggests that the resulting conclusions will be similar regardless of which method is used. Hence, the analyses show strong support for the use of the proposed method.
Table 5.
Results from Simulation 2. MRPE and Correct Selection (percentage of occasions on which the correct groups are selected, averaged over 100 replications) of the various estimators are reported. The numbers in parentheses are the corresponding standard errors of the RPEs based on 500 bootstrap resamples.
| nT = nP | SNR | Method | Sen | Spe | MRPE | % Correct Selection |
|---|---|---|---|---|---|---|
| 200 | 1 | BAGB | 0.916 | 0.732 | 0.137(0.015) | 93 |
| 200 | 1 | GB | 0.720 | 0.925 | 0.225(0.021) | 29 |
| 200 | 1 | GL | 0.982 | 0.608 | 0.226(0.018) | 88 |
| 200 | 1 | SGL | 0.507 | 0.920 | 0.363(0.033) | 27 |
| 200 | 3 | BAGB | 0.988 | 0.674 | 0.146(0.015) | 100 |
| 200 | 3 | GB | 0.978 | 0.851 | 0.185(0.014) | 99 |
| 200 | 3 | GL | 1.000 | 0.454 | 0.212(0.021) | 100 |
| 200 | 3 | SGL | 0.878 | 0.903 | 0.568(0.016) | 100 |
| 200 | 5 | BAGB | 0.994 | 0.653 | 0.153(0.014) | 100 |
| 200 | 5 | GB | 0.993 | 0.839 | 0.192(0.014) | 100 |
| 200 | 5 | GL | 1.000 | 0.379 | 0.229(0.022) | 100 |
| 200 | 5 | SGL | 0.903 | 0.938 | 0.870(0.026) | 100 |
| 200 | 10 | BAGB | 0.999 | 0.630 | 0.175(0.018) | 100 |
| 200 | 10 | GB | 0.999 | 0.830 | 0.190(0.021) | 100 |
| 200 | 10 | GL | 1.000 | 0.282 | 0.260(0.029) | 100 |
| 200 | 10 | SGL | 0.915 | 0.962 | 1.641(0.049) | 100 |
| 400 | 1 | BAGB | 0.977 | 0.704 | 0.078(0.010) | 100 |
| 400 | 1 | GB | 0.928 | 0.895 | 0.115(0.015) | 77 |
| 400 | 1 | GL | 1.000 | 0.527 | 0.114(0.011) | 100 |
| 400 | 1 | SGL | 0.858 | 0.870 | 0.192(0.014) | 100 |
| 400 | 3 | BAGB | 0.996 | 0.658 | 0.082(0.011) | 100 |
| 400 | 3 | GB | 0.994 | 0.850 | 0.094(0.012) | 100 |
| 400 | 3 | GL | 1.000 | 0.408 | 0.124(0.010) | 100 |
| 400 | 3 | SGL | 0.919 | 0.963 | 0.451(0.013) | 100 |
| 400 | 5 | BAGB | 0.999 | 0.639 | 0.091(0.011) | 100 |
| 400 | 5 | GB | 0.999 | 0.841 | 0.096(0.014) | 100 |
| 400 | 5 | GL | 1.000 | 0.291 | 0.128(0.011) | 100 |
| 400 | 5 | SGL | 0.929 | 0.986 | 0.705(0.016) | 100 |
| 400 | 10 | BAGB | 1.000 | 0.618 | 0.109(0.009) | 100 |
| 400 | 10 | GB | 1.000 | 0.837 | 0.109(0.011) | 100 |
| 400 | 10 | GL | 1.000 | 0.299 | 0.148(0.011) | 100 |
| 400 | 10 | SGL | 0.934 | 0.994 | 1.360(0.033) | 100 |
Figure 5.
For the birth weight data, posterior mean Bayesian group bridge estimates and corresponding 95% equal-tailed credible intervals based on 15,000 Gibbs samples using a logistic regression model. Overlaid are the corresponding GB, GL, and SGL estimates.
4.3. Composite Group Bridge Penalty
Seetharaman (2013) proposed the composite group bridge estimator as an extension of the original group bridge penalty, resulting from the following regularization problem
β̂ = arg minβ { (1/2) ||y − Xβ||² + Σ_{k=1}^{K} λk ( Σ_{j=1}^{mk} |βkj|^γ )^α },
where λk’s are the tuning parameters (λk > 0, k = 1, …, K), and α and γ are the concavity parameters (0 < α < 1, 0 < γ < 1). As before, the hierarchical representation of the composite group bridge estimator can be formulated as follows
The full conditional distributions can be derived as before, which are provided in Appendix G.
5. Discussion
Despite recent methodological advances, there is a surprising dearth of published methods devoted to the Bayesian sparse group selection problem. Here, we describe a Bayesian method, BAGB, for bi-level variable selection in regularized regression and classification. Assuming a group-wise shrinkage prior on the coefficients, BAGB estimates the model parameters using a computationally efficient MCMC algorithm. Unlike established frequentist methods that summarize inference using a single point estimate, BAGB yields uncertainty estimates of the coefficients and enables coherent inference through the posterior distribution. Further, by allowing data-adaptive tuning of multiple shrinkage parameters, it provides a novel and natural treatment of the uncertainty associated with the hyperparameters. Realistic benchmark results reveal that BAGB outperforms a variety of published methods in both estimation and prediction. Finally, we have presented a unified framework which is applicable to models above and beyond the linear regression.
Analysis using SMU mixture representation provided a new data-augmentation strategy for fully Bayesian inference with general likelihoods. This mixture formulation is in sharp contrast to previous studies that relied primarily on SMN hierarchical representations of the associated priors (Kyung et al., 2010). In addition to facilitating efficient computation, this new multivariate representation provided interesting generalizations including several non-trivial variants of the proposed method. This is an important contribution to the characterization of existing and future analogous MCMC algorithms for other sparse Bayesian models, which to the best of our knowledge has not been explored in depth before. In our experience, the resulting Gibbs sampler is efficient with good rates of convergence, opening up new avenues for the potential application of BAGB in various scientific disciplines.
BAGB is similar in essence to Bayesian Group LASSO (Kyung et al., 2010), which considers independent multivariate Laplace priors on the group-specific coefficients. However, there are several differences between these two methods. Similar to its frequentist cousin, Bayesian group LASSO cannot do bi-level variable selection, and therefore, it is expected to perform poorly for problems that are actually sparse at both group and individual levels (Hernández-Lobato et al., 2013). In addition, Bayesian group LASSO assumes invariance under within-group orthonormal transformations (Kyung et al., 2010), which is difficult to justify without making untestable assumptions on the design matrix. Due to these fundamental differences, a comparison with Bayesian group LASSO is not attempted in this paper.
We anticipate several computational and methodological improvements that may further refine BAGB’s performance. While BAGB uses the heuristic SNC criterion (Li and Lin, 2010) for posterior variable selection, this criterion can be highly sensitive to the choice of threshold (Li and Pati, 2017); alternative approaches, such as the posterior 2-means method (Chakraborty and Guo, 2011; Li and Pati, 2017), may improve on this. Although we have developed our method for non-overlapping groups, we recognize that in many practical problems features can belong to multiple, overlapping groups, a setting to which the current version of BAGB may not be applicable; extending it to overlapping groups is left as future work. Another path for future investigation is a computationally efficient adaptation of our approach based on fast algorithms for rapid estimation and inference. We thus hope to refine both the scalability and the applicability of BAGB in future studies.
Figure 3.
Beta densities for different hyperparameter values.
Figure 4.
Marginal posterior densities (based on 15,000 posterior samples) of the 16 predictors in the birth weight data. Solid red line: penalized likelihood (group bridge) solution with λ chosen by BIC. Dashed green line: BAGB posterior means. Dotted purple line: BAGB posterior modes. α is prefixed at 0.5.
Table 8.
Summary of the birth weight data analysis results using logistic regression. ‘AUC’ reports the area under the curve (AUC) estimates on the test data. ‘Test Error’ reports the misclassification errors on the test data.
| BAGB | GB | GL | SGL | |
|---|---|---|---|---|
| Test Error | 0.21 | 0.24 | 0.26 | 0.21 |
| AUC | 0.78 | 0.67 | 0.71 | 0.64 |
Highlights.
A Bayesian bi-level variable selection method, BAGB (Bayesian Analysis of Group Bridge), is developed for linear models.
A new data-augmentation scheme is introduced based on a novel scale mixture of multivariate uniform (SMU) hierarchical representation.
BAGB facilitates posterior inference via a computationally efficient MCMC algorithm.
BAGB performs as well as or better than published frequentist methods in both estimation and prediction.
Simple extensions of BAGB are discussed for general models with flexible penalties.
Acknowledgments
We thank the Associate Editor and the two anonymous reviewers for their helpful comments. This work was supported in part by the research computing resources acquired and managed by University of Alabama at Birmingham IT Research Computing. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the University of Alabama at Birmingham. Himel Mallick was supported in part by the research grants U01 NS041588 and NSF 1158862. Nengjun Yi was supported in part by the research grant NIH 5R01GM069430-08.
Appendix A: Calculating the Normalizing Constant
Here we derive the normalizing constant for the group bridge prior.
Proof
The group bridge prior is given by π(β) ∝ exp(−λ ||β||1^α), where β ∈ ℝq, λ > 0, α > 0, and ℝq denotes the q-dimensional Euclidean space. Consider the transformation r = ||β||1 and zi = βi/r, i = 1, …, q − 1.
The Jacobian of the above transformation is |J| = rq−1.
Therefore, the joint pdf of (r, z1, z2, …, zq−1) is given by
where −1 < zi < 1, i = 1, …, (q − 1), and Σ_{i=1}^{q−1} |zi| < 1.
Note that ∫₀^∞ r^(q−1) exp(−λ r^α) dr = Γ(q/α) / (α λ^(q/α)).
Therefore, integrating over r and (z1, …, zq−1), we have C(λ, α) = 2^q Γ(q/α + 1) / (Γ(q + 1) λ^(q/α)).
This completes the proof.
Appendix B: Stochastic Representation
Here we study various properties of the group bridge prior in light of its connection to the multivariate L1 norm spherically symmetric distributions (also known as simplicially contoured distributions) (Fang et al., 1989). To this end, we introduce the following notations and lemma.
Definition 1
A random vector β = (β1, …, βq)′ with location parameter μ and scale parameter Σ follows an L1 norm spherically symmetric distribution if its density function f is of the form for some ψ: ℝ+ → ℝ+.
Definition 2
A random vector β = (β1, …, βq)′ follows a multivariate uniform distribution on an arbitrary region A ⊂ ℝq if its density function f is of the form f(β) = I(β ∈ A) / Ω(A),
where Ω(A)=Volume of A.
Lemma
If f is of the form for some ψ: ℝ+ → ℝ+, it can also be expressed as
where , g(.) is the density of r(β).
Proof
Without loss of generality, we assume that the random vector β* follows an L1 norm spherically symmetric distribution with Σ = Iq and μ = 0, where Iq is the q × q identity matrix and 0 is the q-dimensional vector with all entries equal to 0. Following the proof of Lemma 1.4 in Fang et al. (1989), we have the following
for any non-negative measurable function ζ. Therefore, for any non-negative measurable function η, we define ζ = η. ψ and get the following
Therefore, the density of r(β*) is
Now, consider the transformation . We get,
This completes the proof.
Proof of Proposition 1
By applying Lemma 1, the proof is an immediate consequence of the fact that the group bridge prior belongs to the family of multivariate L1 norm spherically symmetric distributions, as defined above.
Appendix C: Directly Sampling from the Group Bridge Prior
Based on Proposition 1, we observe that when α = 1, β follows the joint distribution of q i.i.d. Laplace random variables, which can be decomposed as β = RU, where U = β/||β||1 follows a multivariate ‘L1 norm sphere’ uniform distribution (a multivariate uniform distribution defined on the unit L1 norm sphere) and R follows a particular Gamma distribution. Therefore, any sample from the joint distribution of i.i.d. Laplace random variables, divided by its L1 norm, can be regarded as a sample from the multivariate ‘L1 norm sphere’ uniform distribution. Hence, given λ and α, the algorithm for generating observations from the q-dimensional group bridge prior proceeds as follows:
Step I: Generate an observation r* from a Gamma(q/α, λ) distribution (shape–rate parameterization) and calculate r = r*^(1/α).
Step II: Generate z = (z1, …, zq)′ from the joint distribution of q i.i.d. Laplace random variables with location parameter 0 and scale parameter 1 (the scale can be chosen arbitrarily, since it cancels in the next step).
Step III: Calculate u = z/||z||1; u can be regarded as a sample generated from the multivariate ‘L1 norm sphere’ uniform distribution.
Step IV: β = r · u is a sample generated from the desired group bridge prior.
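A direct R transcription of Steps I–IV is given below (the function name is illustrative); it returns an n × q matrix of draws from the prior (4).

```r
# Draw n samples from the q-dimensional group bridge prior (4), following
# Steps I-IV above.
rgroupbridge <- function(n, q, lambda, alpha) {
  # Step I: r* ~ Gamma(shape = q/alpha, rate = lambda), then r = r*^(1/alpha).
  r <- rgamma(n, shape = q / alpha, rate = lambda)^(1 / alpha)
  # Step II: q i.i.d. standard Laplace draws per sample (random sign times Exponential).
  z <- matrix(rexp(n * q) * sample(c(-1, 1), n * q, replace = TRUE), n, q)
  # Step III: normalize by the L1 norm -> uniform on the unit L1 norm sphere.
  u <- z / rowSums(abs(z))
  # Step IV: scale the direction by the radius.
  u * r
}
```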
Appendix D: Scale Mixture of Multivariate Uniform Representation
Proof of Proposition 2
Consider the L1 norm open ball A = {β ∈ ℝq : ||β||1 < u^(1/α)}. The volume of A is given by Ω(A) = 2^q u^(q/α) / Γ(q + 1). We have
π(β) = ∫_{||β||1^α}^{∞} [Γ(q + 1) / (2^q u^(q/α))] × [λ^(q/α + 1) / Γ(q/α + 1)] u^(q/α) exp(−λ u) du = [Γ(q + 1) λ^(q/α) / (2^q Γ(q/α + 1))] exp(−λ ||β||1^α),
which is exactly the normalized group bridge prior (4).
This completes the proof.
Appendix E: A Gibbs Sampler for Sampling from the Multivariate Normal Distribution Truncated On An L1 Norm Open Ball
I. Problem Statement
Let Ωn(ρ) denote the following L1 norm open ball in ℝn:
Ωn(ρ) = {x ∈ ℝn : ||x||1 < ρ},   (10)
where ρ > 0 is the radius of Ωn(ρ) and ||.||1 denotes the L1 norm. With a slight abuse of notation, let us denote an n-dimensional truncated multivariate normal (TMVN) distribution defined on the open ball Ωn(ρ) with mean vector μ and covariance matrix Σ by NΩn(μ, Σ). The following describes an efficient Gibbs sampler for this distribution.
II. Gibbs Sampler for the TMVN Distribution
The full conditional distributions of the above TMVN distribution are truncated univariate normals. Therefore, the generation of x̃ distributed according to NΩn(μ, Σ) proceeds by successively drawing the components x̃i, i = 1, …, n, from their conditional distributions; that is, at step (t + 1),
x̃i(t+1) | x̃−i ~ TUVN(μ̃i, σ̃i²; Li, Ui),   i = 1, 2, …, n,
where x̃−i collects the most recently updated values of the remaining components and TUVN(μ̃i, σ̃i²; Li, Ui) denotes a univariate normal distribution with mean μ̃i and variance σ̃i², two-sided truncated to the interval (Li, Ui). These quantities are given as follows
μ̃i = μi + si′ (Σ−i)⁻¹ (x̃−i − μ−i),   σ̃i² = Σii − si′ (Σ−i)⁻¹ si,   Ui = ρ − ||x̃−i||1 = −Li,   (11)
where μi is the ith element of μ, μ−i is μ with its ith element removed, si is the (n − 1) × 1 vector obtained from the ith column of Σ by removing the ith row, and Σ−i denotes the (n − 1) × (n − 1) matrix obtained from Σ by removing the ith row and ith column. Sampling from the above TUVN distributions can be done easily by implementing the efficient algorithm proposed by Li and Ghosh (2015).
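A sketch of one sweep of this Gibbs sampler in R is given below, using the standard conditional-normal formulas from (11); for simplicity the truncated univariate normal draw is done by inverting the normal CDF, whereas the Li and Ghosh (2015) sampler mentioned above is more robust in the tails.

```r
# One Gibbs sweep for N(mu, Sigma) truncated to the L1 ball {x : ||x||_1 < rho}.
# Each coordinate is drawn from its univariate conditional normal, truncated to
# (-(rho - ||x_{-i}||_1), rho - ||x_{-i}||_1).
tmvn_l1_sweep <- function(x, mu, Sigma, rho) {
  n <- length(mu)
  for (i in seq_len(n)) {
    s_i  <- Sigma[i, -i, drop = FALSE]                           # 1 x (n-1)
    S_mi <- Sigma[-i, -i]                                        # (n-1) x (n-1)
    m_i  <- drop(mu[i] + s_i %*% solve(S_mi, x[-i] - mu[-i]))    # conditional mean
    v_i  <- drop(Sigma[i, i] - s_i %*% solve(S_mi, t(s_i)))      # conditional variance
    bound <- rho - sum(abs(x[-i]))                               # remaining L1 budget
    lo <- pnorm(-bound, m_i, sqrt(v_i))
    hi <- pnorm( bound, m_i, sqrt(v_i))
    x[i] <- qnorm(runif(1, lo, hi), m_i, sqrt(v_i))              # inverse-CDF truncated draw
  }
  x
}
```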
III. Sampling Coefficients in the Bayesian Group Bridge Regression
It is easy to note that the truncation region in the BAGB posterior is group-wise independent. Therefore, at each step, we sample from the group-wise conditional distribution (using the algorithm described above), conditional on the other parameters, i.e. βk | β−k, u, λ, α, σ², y, X ~ NΩmk(μk, Σk), where NΩmk(μk, Σk) is the mk-dimensional TMVN distribution defined on the open ball Ωmk(uk^(1/α)) with mean vector μk and covariance matrix Σk. These are easily obtained by partitioning the full mean vector and variance–covariance matrix of the conditional posterior (7) and conditioning on β−k = (β1, …, βk−1, βk+1, …, βK), k = 1, …, K.
Appendix F: The BAGB Posterior Distributions for General Models
The full conditional distributions for the general models are given as
Appendix G: The BAGB Posterior Distributions for Composite Group Bridge
The full conditional distributions for the composite group bridge (Seetharaman, 2013) can be derived as
When γ = 1, the above reduces to the BAGB estimator.
References
- Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716–723.
- Breheny P, Huang J. Penalized methods for bi-level variable selection. Statistics and Its Interface. 2009;2(3):369. doi: 10.4310/sii.2009.v2.n3.a10.
- Breheny P. The group exponential lasso for bilevel variable selection. Biometrics. 2015;71(3):731–740. doi: 10.1111/biom.12300.
- Chakraborty S, Guo R. A Bayesian hybrid Huberized support vector machine and its applications in high-dimensional medical data. Computational Statistics & Data Analysis. 2011;55(3):1342–1356.
- Chen Q, Wang S. Variable selection for multiply-imputed data with application to dioxin exposure study. Statistics in Medicine. 2013;32(21):3646–3659. doi: 10.1002/sim.5783.
- Fang KT, Kotz S, Ng KW. Symmetric Multivariate and Related Distributions. London: Chapman and Hall; 1989.
- Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7(4):457–472.
- Geng Z, Wang S, Yu M, Monahan PO, Champion V, Wahba G. Group variable selection via convex log-exp-sum penalty with application to a breast cancer survivor study. Biometrics. 2015;71(1):53–62. doi: 10.1111/biom.12230.
- Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics. 1979;21(2):215–223.
- Hernández-Lobato D, Hernández-Lobato JM, Dupont P. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research. 2013;14(1):1891–1945.
- Hosmer DW, Lemeshow S. Applied Logistic Regression. John Wiley & Sons; 1989.
- Huang J, Ma S, Xie H, Zhang C. A group bridge approach for variable selection. Biometrika. 2009;96(2):339–355. doi: 10.1093/biomet/asp020.
- Huang J, Breheny P, Ma S. A selective review of group selection in high-dimensional models. Statistical Science. 2012;27(4). doi: 10.1214/12-STS392.
- Knight K, Fu W. Asymptotics for lasso-type estimators. Annals of Statistics. 2000:1356–1378.
- Kyung M, Gill J, Ghosh M, Casella G. Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis. 2010;5:369–412.
- Leng C, Tran M, Nott D. Bayesian adaptive lasso. Annals of the Institute of Statistical Mathematics. 2014;66(2):221–244.
- Levina E, Zhu J. Discussion of “Sure independence screening for ultrahigh dimensional feature space”. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(5):897–898. doi: 10.1111/j.1467-9868.2008.00674.x.
- Li H, Pati D. Variable selection using shrinkage priors. Computational Statistics & Data Analysis. 2017;107:107–119.
- Li Q, Lin N. The Bayesian elastic net. Bayesian Analysis. 2010;5(1):151–170.
- Li Y, Ghosh SK. Efficient sampling methods for truncated multivariate normal and student-t distributions subject to linear inequality constraints. Journal of Statistical Theory and Practice. 2015;9(4):712–732.
- Mallows CL. Some comments on Cp. Technometrics. 1973;15(4):661–675.
- Park C, Yoon YJ. Bridge regression: adaptivity and group selection. Journal of Statistical Planning and Inference. 2011;141:3506–3519.
- Polson NG, Scott JG, Windle J. The Bayesian bridge. Journal of the Royal Statistical Society, Series B (Methodological). 2014;76(4):713–733.
- Qian W, Yang Y, Zou H. Tweedie’s compound Poisson model with grouped elastic net. Journal of Computational and Graphical Statistics. 2016;25(2):606–625.
- Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2):461–464.
- Seetharaman I. Consistent bi-level variable selection via composite group bridge penalized regression. Master's thesis, Kansas State University; 2013.
- Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. Journal of Computational and Graphical Statistics. 2013;22(2):231–245.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996;58:267–288.
- Wang H, Leng C. Unified lasso estimation by least squares approximation. Journal of the American Statistical Association. 2007;102(479):1039–1048.
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B (Methodological). 2006;68:49–67.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.