Summary
Penalization schemes like Lasso or ridge regression are routinely used to regress a response of interest on a high-dimensional set of potential predictors. Despite being decisive, the question of the relative strength of penalization is often glossed over and only implicitly determined by the scale of individual predictors. At the same time, additional information on the predictors is available in many applications but left unused. Here, we propose to make use of such external covariates to adapt the penalization in a data-driven manner. We present a method that differentially penalizes feature groups defined by the covariates and adapts the relative strength of penalization to the information content of each group. Using techniques from the Bayesian tool-set our procedure combines shrinkage with feature selection and provides a scalable optimization scheme. We demonstrate in simulations that the method accurately recovers the true effect sizes and sparsity patterns per feature group. Furthermore, it leads to an improved prediction performance in situations where the groups have strong differences in dynamic range. In applications to data from high-throughput biology, the method enables re-weighting the importance of feature groups from different assays. Overall, using available covariates extends the range of applications of penalized regression, improves model interpretability and can improve prediction performance.
Keywords: Classification, External covariates, Feature selection, Penalized regression, Variational Bayes
1. Introduction
We are interested in the setup where we observe a continuous or categorical response $y$ together with a vector of potential predictors, or features, $x \in \mathbb{R}^p$, and aim to find a relationship of the form $y \approx f(x)$. Two main questions are of potential interest in this setting. First, we want to obtain an estimate $\hat{f}$ that yields good predictions for $y$ given a new observation $x$. Second, we aim at finding which components of $x$ are the "important ones" for the prediction.
A common and useful approach to this end is given by (generalized) linear regression methods, which assume that the distribution of $y$ depends on $x$ via a linear term $x^T\beta$. In order to cope with the high dimensionality of $x$ and avoid over-fitting, penalization on $\beta$ is employed, e.g., in ridge regression (Hoerl and Kennard, 1970), Lasso (Tibshirani, 1996), or elastic net (Zou and Hastie, 2005). By constraining the values of $\beta$, the complexity of the model is restricted, resulting in biased but less variable estimates and improved prediction performance. In addition, some choices of the penalty yield estimates with a relatively small number of non-zero components, thereby facilitating feature selection. An example is the $\ell_1$-penalty employed in Lasso or elastic net.
Commonly, penalization methods apply a penalty that is symmetric in the model coefficients. Real data, however, often consist of a collection of heterogeneous features, which such an approach does not account for. In particular, it ignores any additional information or structural differences that may be present in the features. Often we encounter predictors $x$ whose components comprise multiple data modalities and data qualities, e.g., measurement values from different assays. Other side-information on individual features could include temporal or spatial information, quality metrics associated with each measurement, or the features' sample variance, frequency, or signal-to-noise ratio. It has already been observed in multiple testing that the power of the analysis can be improved by making use of such external information (e.g., Ferkingstad and others, 2008; Dobriban and others, 2015; Ignatiadis and others, 2016; Li and Barber, 2019; Lei and Fithian, 2018). However, in current penalized regression models this information is frequently ignored. Making use of it could, on the one hand, improve prediction performance. On the other hand, it might yield important insight into the relationship of external covariates to the features' importance. For example, if the covariate encodes different data modalities, insights into their relative importance could help cut costs by reducing future assays to the essential data modalities.
As a motivating example, we consider applications in molecular biology and precision medicine. Here, the aim is to predict phenotypic outcomes, such as treatment response, and to identify reliable disease markers based on molecular data. Nowadays, different high-throughput technologies can be combined to jointly measure thousands of molecular features from different biological layers (Ritchie and others, 2015; Hasin and others, 2017). Examples include genetic alterations, gene expression, methylation patterns, protein abundances, or microbiome occurrences. However, despite the increasing availability of molecular and clinical data, outcome prediction remains challenging (Hamburg and Collins, 2010; Chen and Snyder, 2013; Alyass and others, 2015). Common applications of penalized regression only make use of parts of the available data. For example, different assay types are simply concatenated or analyzed separately. In addition, available annotations on individual features are left unused, such as their chromosomal location or gene set and pathway membership. Incorporating side-information on the assay type and on spatial or functional annotations could help to improve prediction performance. Furthermore, it could help to prioritize feature groups, such as different assays or gene sets.
Here, we propose a method that incorporates external covariates in order to guide penalization and can learn the relationship of the covariate to the features' effect sizes in a data-driven way. We introduce the method for linear models and extend it to classification. We demonstrate that this can improve prediction performance and yield insights into the relative importance of different feature sets, both on simulated data and in applications to high-throughput biology.
2. Methods
2.1. Problem statement
Assume we are given observations $(x_i, y_i)$, $i = 1, \ldots, n$, with $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$ (possibly $p \gg n$) from a linear model, i.e.,

$$ y_i = x_i^T \beta + \epsilon_i, \qquad \epsilon_i \overset{\mathrm{iid}}{\sim} N(0, \sigma^2), \qquad (2.1) $$

with $\beta \in \mathbb{R}^p$. In addition, we suppose that we have access to a covariate $\zeta_j$ for each predictor $j = 1, \ldots, p$. We hope, loosely speaking, that $\zeta_j$ contains some sort of information on the magnitude of $\beta_j$. The question we want to address is: can we use the information from $\zeta$ to improve upon estimation of $\beta$ and prediction of $y$?
In order to estimate $\beta$ from a finite sample $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$ we can employ penalization on the negative log-likelihood of the model, i.e.,

$$ \hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \; -\ell(\beta \mid X, y) + \lambda \sum_{j=1}^{p} \mathrm{pen}(\beta_j), \qquad (2.2) $$

where $\mathrm{pen}$ denotes a penalty function on the model coefficients. For example, $\mathrm{pen}(\beta_j) = |\beta_j|^q$ leads to Lasso ($q = 1$) or ridge regression ($q = 2$). The parameter $\lambda \geq 0$ controls the amount of penalization and thereby the model complexity. Ideally, we would like to choose an optimal $\lambda$. For estimation this means minimizing the mean squared error $\mathbb{E}\,\|\hat{\beta} - \beta\|_2^2$; for prediction this means minimizing the expected prediction error. In practice, $\lambda$ is often chosen to minimize the cross-validated error.
In most applications, the penalization is symmetric, i.e., for any permutation matrix $P$ we have $\mathrm{pen}(P\beta) = \mathrm{pen}(\beta)$. However, as we have external information on each feature, given by $\zeta_j$, we want to allow for differential penalization guided by $\zeta$. For this, we will consider the following non-symmetric generalization, which still leads to a convex optimization problem in $\beta$ for convex penalty functions $\mathrm{pen}$, such as $|\cdot|$ or $(\cdot)^2$:

$$ \hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \; -\ell(\beta \mid X, y) + \sum_{j=1}^{p} \lambda(\zeta_j)\, \mathrm{pen}(\beta_j). \qquad (2.3) $$

Instead of a constant $\lambda$, here $\lambda(\cdot)$ provides a mapping from the covariate $\zeta_j$ to a non-negative penalty factor $\lambda(\zeta_j)$. This additional flexibility compared to a single penalty parameter can be helpful if $\zeta$ contains information on $\beta$. For example, in the simple case of ridge regression with deterministic orthonormal design matrix, known noise variance $\sigma^2$ and "oracle covariate" $\zeta_j = |\beta_j|$ the optimal penalty factor is seen to be $\lambda(\zeta_j) = \sigma^2 / \zeta_j^2$. However, in practice the information in $\zeta$ is not that explicit and hence we do not know which $\lambda$ is optimal.
If $\lambda$ takes values in a small set of discrete values, e.g., for categorical covariates $\zeta_j$, cross-validation could be used to determine a suitable set of function values. This approach is employed by Boulesteix and others (2017), where categorical covariates encode different data modalities. However, cross-validation soon becomes prohibitive, as it requires a grid search exponential in the number of categories defined by $\zeta$. Similarly, cross-validation can be employed with $\lambda$ parametrized by a small number of tuning parameters, using domain knowledge to come up with a suitable parametric form for $\lambda$ (Bergersen and others, 2011; Veríssimo and others, 2016). However, such an explicit form is often not available. In many situations, it is a major problem in itself to come up with a helpful relationship between $\zeta$ and $\beta$, and thereby with knowledge of which values of a covariate would require more or less penalization. Therefore, we aim at finding $\lambda$ in a data-driven manner and with improved scalability compared to cross-validation.
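For the $\ell_1$ case, group-wise penalty factors $\lambda(\zeta_j)$ as in (2.3) can already be emulated with standard software, which also illustrates why cross-validating them quickly becomes expensive. The R sketch below uses the penalty.factor argument of the glmnet package to tune the relative weight of a second feature group by cross-validation; the group sizes, candidate weights, and signal pattern are hypothetical illustrations.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
beta <- c(rnorm(10, sd = 2), rep(0, p - 10))      # signal only in the first group
y <- as.vector(X %*% beta + rnorm(n))

# external covariate: two hypothetical feature groups of equal size
group <- rep(1:2, each = p / 2)

# candidate relative penalty factors lambda(zeta_j) for group 2 (group 1 fixed at 1)
candidate_w2 <- c(0.5, 1, 2, 4)
cv_err <- sapply(candidate_w2, function(w2) {
  pf <- ifelse(group == 1, 1, w2)                 # per-feature penalty factors
  min(cv.glmnet(X, y, penalty.factor = pf)$cvm)   # CV error at the best overall lambda
})
candidate_w2[which.min(cv_err)]                   # selected relative weight for group 2
```

With $G$ groups, a full grid over relative weights grows exponentially in $G - 1$, which is the scalability problem addressed by the data-driven approach developed below.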
2.2. Problem statement from a Bayesian perspective
There is a direct correspondence between estimates obtained from penalized regression and Bayesian estimates with penalization via corresponding priors on the coefficients. For example, the ridge estimate corresponds to the maximum a posteriori (MAP) estimate in a Bayesian regression model with a normal prior on $\beta$, and the Lasso estimate to the MAP estimate with a Laplace prior on $\beta$. This correspondence opens up alternative strategies using tools from the Bayesian mindset to approach the problem outlined above: differential penalization translates to introducing different priors on the components of $\beta$. Our belief that $\zeta$ carries information on $\beta$ can be incorporated by using prior distributions whose parameters depend on $\zeta$. Wiel and others (2016) used this idea to derive an empirical Bayes approach for finding group-wise penalty parameters in ridge regression. However, this approach does not obviously generalize to other penalties such as the Lasso.
Moving completely into the Bayesian mindset, we instead turn to explicit specification of priors to implement the penalization task. Different priors have been suggested (Mitchell and Beauchamp, 1988; MacKay, 1996; Park and Casella, 2008; Carvalho and others, 2009), and structural knowledge has been incorporated into the penalization by employing multivariate priors that encode the structure in the covariance or non-exchangeable priors with different hyper-parameters (e.g., Hernández-Lobato and others, 2013; Engelhardt and Adams, 2014; Rockova and Lesaffre, 2014; Wu and others, 2014; Andersen and others, 2017; Xu and Ghosh, 2015, and references therein). Despite the possible gains in prediction performance when incorporating such structural knowledge, these methods have not been widely applied. A limiting factor has often been the lack of scalability to large datasets.
2.3. Setup and notation
From the linear model assumption we have

$$ y \mid X, \beta, \tau \;\sim\; N\!\left(X\beta,\; \tau^{-1} I_n\right), \qquad (2.4) $$

where $\tau$ denotes the precision of the noise. Based on the external covariate $\zeta$ we define a partition of the $p$ predictors into $G$ groups,

$$ \{1, \ldots, p\} = \bigcup_{k=1}^{G} \mathcal{G}_k. \qquad (2.5) $$

For instance, categorical covariates $\zeta_j$, such as different assay types, naturally define such a partition. For continuous covariates the partition can be defined based on suitable binning or clustering.

To achieve penalization in dependence of $\zeta$ we consider a spike-and-slab prior (Mitchell and Beauchamp, 1988) on the model coefficients $\beta$ with a different slab precision $\gamma_k$ and mixing parameter $\pi_k$ for each group. We re-parametrize $\beta_j$ as $\beta_j = s_j b_j$ with

$$ b_j \mid \gamma_k \;\sim\; N\!\left(0, \gamma_k^{-1}\right) \quad \text{for } j \in \mathcal{G}_k, \qquad (2.6) $$

$$ s_j \mid \pi_k \;\sim\; \mathrm{Ber}(\pi_k) \quad \text{for } j \in \mathcal{G}_k. \qquad (2.7) $$

In the special case of $\pi_k = 1$, this yields a normal prior as in MacKay (1996) corresponding to ridge regression. With $\pi_k < 1$ we additionally promote sparsity on the coefficients, and the value of $\pi_k$ controls the number of active predictors in each group. The value of $\gamma_k$ controls the overall shrinkage per group. To learn the model hyper-parameters $\gamma_k$, $\pi_k$ and the noise precision $\tau$, we choose the following conjugate priors

$$ \tau \;\sim\; \mathrm{Gamma}(r_\tau, d_\tau) \qquad (2.8) $$

and, for each group $k = 1, \ldots, G$,

$$ \gamma_k \;\sim\; \mathrm{Gamma}(r_\gamma, d_\gamma), \qquad (2.9) $$

$$ \pi_k \;\sim\; \mathrm{Beta}(d_\pi, e_\pi), \qquad (2.10) $$

with fixed values for the hyper-parameters $r_\tau, d_\tau, r_\gamma, d_\gamma$ and $d_\pi, e_\pi$. Hence, the joint probability of the model is given by

$$ p(y, b, s, \gamma, \pi, \tau \mid X) \;=\; p(y \mid X, b, s, \tau)\, p(\tau) \prod_{k=1}^{G} \Big[ p(\gamma_k)\, p(\pi_k) \prod_{j \in \mathcal{G}_k} p(b_j \mid \gamma_k)\, p(s_j \mid \pi_k) \Big]. \qquad (2.11) $$
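To make the hierarchical model concrete, the following R sketch draws one dataset from (2.4)–(2.10); the group sizes and hyper-parameter values are arbitrary illustrations, not the settings used in Section 3.

```r
set.seed(1)
n <- 100; p <- 300; G <- 3
groups <- rep(1:G, each = p / G)            # partition (2.5) induced by the covariate zeta

# group-level parameters and noise precision, drawn from the priors (2.8)-(2.10)
gamma_k <- rgamma(G, shape = 2, rate = 0.5) # slab precisions
pi_k    <- rbeta(G, 1, 1)                   # mixing probabilities
tau     <- rgamma(1, shape = 2, rate = 1)   # noise precision

# spike-and-slab coefficients beta_j = s_j * b_j, Equations (2.6)-(2.7)
b    <- rnorm(p, mean = 0, sd = 1 / sqrt(gamma_k[groups]))
s    <- rbinom(p, size = 1, prob = pi_k[groups])
beta <- s * b

# linear model (2.4)
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% beta + rnorm(n, sd = 1 / sqrt(tau)))
```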
2.4. Inference using variational Bayes
The challenge now lies in inferring the posterior of the model parameters from the observed data $(X, y)$ and the covariate $\zeta$. While Markov chain Monte Carlo methods are frequently used for this purpose, they do not scale well to large datasets. Here, we adopt a variational inference framework (Bishop, 2006; Blei and others, 2017) that has been used (in combination with importance sampling) for variable selection with exchangeable priors (Carbonetto and Stephens, 2012; Carbonetto and others, 2017). Denoting all unobserved model components by $\theta = (b, s, \gamma, \pi, \tau)$, we approximate the posterior $p(\theta \mid y, X)$ by a distribution $q(\theta)$ from a restricted class of distributions $\mathcal{Q}$, where the goodness of the approximation is measured in terms of the Kullback–Leibler (KL) divergence, i.e.,

$$ q^* = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid y, X)\big). \qquad (2.12) $$

A common and useful choice for the class $\mathcal{Q}$ is the mean-field approximation, i.e., that the distribution factorizes in its parameters. We consider

$$ q(\theta) \;=\; \prod_{j=1}^{p} q(b_j, s_j)\; \prod_{k=1}^{G} q(\gamma_k)\, q(\pi_k)\;\; q(\tau), \qquad (2.13) $$

where $b_j$ and $s_j$ are not factorized due to their strong dependencies (Titsias and Lázaro-Gredilla, 2011).

The variational approach leads to an iterative inference algorithm (Blei and others, 2017) by observing that minimizing the KL-divergence is equivalent to maximizing the evidence lower bound $\mathrm{ELBO}(q)$ defined by

$$ \log p(y \mid X) \;=\; \mathrm{ELBO}(q) + \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid y, X)\big). \qquad (2.14) $$

From this, we have

$$ \mathrm{ELBO}(q) \;=\; \mathbb{E}_q\big[\log p(y, \theta \mid X)\big] - \mathbb{E}_q\big[\log q(\theta)\big] \qquad (2.15) $$

$$ \phantom{\mathrm{ELBO}(q)} \;=\; \mathbb{E}_q\big[\log p(y, \theta \mid X)\big] + H(q), \qquad (2.16) $$

with $H$ denoting the differential entropy.

Variational methods are based on maximization of the functional $\mathrm{ELBO}(q)$ with respect to $q$ in order to obtain a tight lower bound on the log model evidence and minimize the KL-divergence between the density $q$ and the true (intractable) posterior. Under a mean-field assumption $q(\theta) = \prod_i q_i(\theta_i)$, the optimal factor $q_j^*$ keeping all other factors fixed is given by

$$ q_j^*(\theta_j) \;\propto\; \exp\!\Big( \mathbb{E}_{q_{-j}}\big[\log p(y, \theta \mid X)\big] \Big). \qquad (2.17) $$

Iterative optimization of each factor results in Algorithm S1 of the supplementary material available at Biostatistics online. Further details on the variational inference and the updates can be found in Sections 1 and 2 of the supplementary material available at Biostatistics online. The method is implemented in the open-source Bioconductor package graper. From the obtained approximation $q^*$ of the posterior distribution, we obtain point estimates for the model parameters. In particular, we will use the posterior means $\hat{\beta} = \mathbb{E}_{q^*}[\beta]$, $\hat{\gamma}_k = \mathbb{E}_{q^*}[\gamma_k]$, and $\hat{\pi}_k = \mathbb{E}_{q^*}[\pi_k]$.
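To illustrate the resulting update scheme (2.17), the sketch below implements coordinate-ascent variational inference for the dense special case $\pi_k = 1$ (normal slab only) under the fully factorized mean-field assumption. It is a simplified illustration under these assumptions, not the full spike-and-slab algorithm implemented in the graper package (Algorithm S1), and the hyper-parameter defaults are arbitrary.

```r
# Illustrative CAVI for the dense special case pi_k = 1 with fully factorized q.
cavi_dense <- function(X, y, groups, r_gamma = 0.01, d_gamma = 0.01,
                       r_tau = 0.01, d_tau = 0.01, n_iter = 100) {
  n <- nrow(X); p <- ncol(X); G <- max(groups)
  xtx <- colSums(X^2)
  mu <- rep(0, p); s2 <- rep(1, p)            # q(beta_j) = N(mu_j, s2_j)
  E_gamma <- rep(1, G); E_tau <- 1            # current expectations of gamma_k and tau
  resid <- as.vector(y - X %*% mu)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {                   # update q(beta_j) via (2.17)
      resid <- resid + X[, j] * mu[j]         # residual without predictor j
      prec_j <- E_tau * xtx[j] + E_gamma[groups[j]]
      s2[j] <- 1 / prec_j
      mu[j] <- E_tau * sum(X[, j] * resid) / prec_j
      resid <- resid - X[, j] * mu[j]
    }
    for (k in seq_len(G)) {                   # update q(gamma_k) = Gamma(a_k, b_k)
      idx <- which(groups == k)
      a_k <- r_gamma + length(idx) / 2
      b_k <- d_gamma + sum(mu[idx]^2 + s2[idx]) / 2
      E_gamma[k] <- a_k / b_k
    }
    E_rss <- sum(resid^2) + sum(s2 * xtx)     # E_q || y - X beta ||^2
    E_tau <- (r_tau + n / 2) / (d_tau + E_rss / 2)   # update q(tau)
  }
  list(beta = mu, gamma = E_gamma, tau = E_tau)
}
```

A call such as cavi_dense(X, y, groups) on data simulated as above returns group-wise posterior mean precisions, mirroring how differential shrinkage per group is learned from the data.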
Remark on the choice of the mean-field assumption.
An interesting deviation from the standard fully factorized mean-field assumption in Equation (2.13) is taking a multivariate variational distribution for the model coefficients. This is easily possible for the dense model ($\pi_k = 1$), where we can consider the factorization

$$ q(\theta) \;=\; q(\beta)\; \prod_{k=1}^{G} q(\gamma_k)\;\; q(\tau). $$

In particular, a multivariate distribution is kept for the model coefficients $\beta$ instead of factorizing into $\prod_j q(\beta_j)$. Thereby, this approach allows capturing dependencies between model coefficients in the inferred posterior and is less approximative. We will show below that this can improve the prediction results. However, a drawback of this approach is its computational complexity, as it requires the calculation and inversion of a $p \times p$ covariance matrix in each step. While this can be reduced to a quadratic complexity as described in Section 2.1 of the supplementary material available at Biostatistics online, this is still prohibitive for many applications. Therefore, we concentrate in the following on the fully factorized mean-field assumption but include comparisons to the multivariate approach in the Results.
2.5. Extension to logistic regression
The model of Section 2.3 can be flexibly adapted to other types of generalized linear regression setups with suitable link functions and likelihoods. However, the inference framework needs to be adapted due to the loss of conjugacy. Here, we extend the model to the framework of logistic regression with a binary response variable, where we assume that the response follows a Bernoulli likelihood with a logistic link function,

$$ p(y_i = 1 \mid x_i, \beta) \;=\; \sigma(x_i^T \beta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}. \qquad (2.18) $$

While the prior structure and core of the variational inference are identical to the case of a linear model, additional approximations are necessary. For this purpose, we adopt work by Jaakkola and Jordan (2000) and approximate the likelihood using a lower bound on the logistic function. For an arbitrary $\xi \in \mathbb{R}$ we have

$$ \sigma(z) \;\geq\; \sigma(\xi)\, \exp\!\left( \frac{z - \xi}{2} - \lambda(\xi)\,(z^2 - \xi^2) \right) \qquad (2.19) $$

with $\lambda(\xi) = \frac{1}{4\xi}\tanh\!\left(\frac{\xi}{2}\right)$. With this, the likelihood of observation $i$ can be bounded by

$$ p(y_i \mid x_i, \beta) \;\geq\; \sigma(\xi_i)\, \exp\!\left( y_i\, x_i^T\beta - \frac{x_i^T\beta + \xi_i}{2} - \lambda(\xi_i)\big((x_i^T\beta)^2 - \xi_i^2\big) \right). \qquad (2.20) $$

As this approximation restores a quadratic form in $\beta$, the remaining updates can be adopted from the case of a linear model above with the additional variational parameter $\xi_i$ (see Section 2.2 of the supplementary material available at Biostatistics online for details).
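The bound (2.19) is tight at $z = \pm\xi$ and can be checked numerically; the short R snippet below evaluates $\lambda(\xi)$ and verifies both properties on a grid of values (the choice $\xi = 2$ is arbitrary).

```r
sigmoid   <- function(z) 1 / (1 + exp(-z))
lambda_xi <- function(xi) tanh(xi / 2) / (4 * xi)

# lower bound (2.19) of Jaakkola and Jordan (2000) at a fixed xi
jj_bound <- function(z, xi) {
  sigmoid(xi) * exp((z - xi) / 2 - lambda_xi(xi) * (z^2 - xi^2))
}

z  <- seq(-6, 6, by = 0.1)
xi <- 2
all(jj_bound(z, xi) <= sigmoid(z) + 1e-12)   # TRUE: the bound holds for all z
abs(jj_bound(xi, xi) - sigmoid(xi)) < 1e-12  # TRUE: the bound is tight at z = xi
```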
3. Results
3.1. Results on simulated data
First, we evaluated the method on simulated data to test its ability to recover the model coefficients and hyper-parameters per group. For this, a random $n \times p$ design matrix $X$ was generated from a multivariate normal distribution with mean zero and a Toeplitz covariance structure $\Sigma_{ij} = \rho^{|i-j|}$, and the response was simulated from a linear model with normal error. The $p$ predictors were split into six groups of equal size, and the coefficients were simulated from the model as described in Equations (2.6) and (2.7) with fixed $\gamma_k$ and $\pi_k$ for each group. In particular, we used a small $\gamma_k$ (large coefficient amplitudes) for the first pair of groups, an intermediate value for the second pair, and a large $\gamma_k$ (small coefficient amplitudes) for groups 5 and 6. For each pair of groups with the same $\gamma$-value, the sparsity level $\pi_k$ was varied via a parameter $\nu$ determining the sparsity level from 0 (sparse) to 1 (dense). We then varied the number of features $p$, the number of samples $n$, the correlation strength $\rho$, the noise precision $\tau$, and the sparsity level $\nu$ (Table S1 of the supplementary material available at Biostatistics online) and generated ten independent datasets for each setting. We evaluated the recovery of the hyper-parameters $\gamma_k$ and $\pi_k$ for each group and compared the predictive performance and computational complexity to those of related methods including ridge regression (Hoerl and Kennard, 1970), Lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), adaptive Lasso (Zou, 2006), sparse group Lasso, group Lasso (Friedman and others, 2010), GRridge (Wiel and others, 2016), varbvs (Carbonetto and others, 2017), and IPF-Lasso (Boulesteix and others, 2017). These methods were taken from the respective R packages provided by the authors, i.e., glmnet 2.0-16, SGL 1.1, grpreg 3.2-0, GRridge 1.7.1, varbvs 2.4-0, and ipflasso 0.1.
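For reference, a single simulation instance of this kind of design and two covariate-agnostic baselines can be produced along the following lines in R; the parameter values are arbitrary examples (the full grid of settings is given in Table S1), and cv.glmnet from glmnet is used for the Lasso and ridge fits.

```r
library(glmnet)
set.seed(1)
n <- 150; p <- 300; rho <- 0.5; G <- 6
Sigma  <- rho^abs(outer(1:p, 1:p, "-"))              # Toeplitz covariance
X      <- matrix(rnorm(n * p), n, p) %*% chol(Sigma) # correlated design
groups <- rep(1:G, each = p / G)
gamma_k <- rep(c(0.01, 1, 100), each = 2)            # example slab precisions per group
pi_k    <- rep(c(0.1, 0.9), times = 3)               # example sparsity levels per group
beta <- rbinom(p, 1, pi_k[groups]) * rnorm(p, sd = 1 / sqrt(gamma_k[groups]))
y    <- as.vector(X %*% beta + rnorm(n))

# train/test split and two covariate-agnostic baselines
train <- sample(n, 100)
rmse  <- function(a, b) sqrt(mean((a - b)^2))
fit_lasso <- cv.glmnet(X[train, ], y[train], alpha = 1)
fit_ridge <- cv.glmnet(X[train, ], y[train], alpha = 0)
c(lasso = rmse(predict(fit_lasso, X[-train, ]), y[-train]),
  ridge = rmse(predict(fit_ridge, X[-train, ]), y[-train]))
```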
3.1.1. Recovery of hyper-parameters
The algorithm accurately recovered the relative importance of different groups (encoded by $\gamma_k$) and the group-wise sparsity level (encoded by $\pi_k$) across a large range of settings, as shown in Figure S1 of the supplementary material available at Biostatistics online. The method failed to recover those parameters accurately only if the ratio between sample size and number of features was too small or the sparsity parameter $\nu$ was too close to 1. These settings were challenging for all methods, as can be seen in Section 3.1.2, where we evaluated estimation and prediction performance in comparison to other methods. In addition, the groups had to contain sufficiently many predictors to reliably estimate group-wise parameters, as seen in Figure S1b of the supplementary material available at Biostatistics online. We also noted that a low signal-to-noise ratio could impede the estimation of hyper-parameters, as can be seen for the groups with a very large $\gamma_k$ value (meaning low coefficient amplitudes, as in groups 5 and 6) combined with a low precision $\tau$ of the noise term.
3.1.2. Prediction and estimation performance
Next, we compared the estimation of the true model coefficients and the prediction accuracy on an independent test set. Overall, the method showed improved performance for a large range of sample sizes, correlations, numbers of features, noise variances, and numbers of active features, both in terms of the root mean squared error of the predicted response as well as for the estimation of $\beta$ (Figure 1). Among the non-sparse methods, graper with a non-factorized mean-field assumption clearly outperformed the factorized mean-field assumption as well as GRridge and group Lasso. The covariate-agnostic ridge regression performed worst in most cases. Sparse methods performed better in general in this simulation example, as the underlying model had a large fraction of zero coefficients. Here, we observed that graper was comparable to IPF-Lasso, which is the most closely related method. Only in settings with a very high number of active predictors or strong correlations between the predictors ($\rho$ close to one) was the method outperformed by IPF-Lasso.
Fig. 1.
Root mean squared error (RMSE) of the predicted response $\hat{y}$ (left) and of the estimate $\hat{\beta}$ (right) for different methods when varying one of the simulation parameters (a–e) as described in Table S1 of the supplementary material available at Biostatistics online. The prediction error is assessed on independent test samples. The line denotes the mean RMSE across 10 random instances of simulated data, with bars denoting standard errors. The two panels separate methods with sparse estimates of $\beta$ (right) from non-sparse methods (left). (Group Lasso is counted as a non-sparse method as it is not sparse within groups.)
3.1.3. Scalability
While the additional group-wise optimization comes at a computational cost, the variational approach runs inference in time complexity linear in the number of features $p$, samples $n$, and groups $G$. Only in the case of a multivariate variational distribution is the complexity quadratic in the larger of $n$ and $p$ and cubic in the smaller of the two. When varying the number of samples $n$, features $p$, and groups $G$, we observed run times comparable to those of Lasso (Figure 2). Differences were mainly observed for $p$: for larger $p$, graper required slightly longer times than Lasso. This difference was more pronounced when using a sparsity-promoting spike-and-slab prior, where additional parameters need to be inferred. As expected, the multivariate approach of graper became considerably slower for large $p$ and showed run times comparable to those of the sparse group Lasso. The number of groups mainly influenced the computation times of IPF-Lasso, which scales exponentially in the number of groups. Here, graper provided a by far more scalable approach (Figure 2, right panel).
Fig. 2.
Average run time (in minutes) for different methods when varying the number of samples $n$, features $p$, and groups $G$. Each parameter is varied at a time while holding the other two fixed. Shown are the average times across 50 random instances of simulated data, with error bars denoting one standard error.
3.2. Application to data from high-throughput biology
3.2.1. Drug response prediction in leukemia samples
Next, we exemplify the method's performance on real data by considering an application to biological data, where predictors were obtained from different assays. Using the assay type as external covariate, we applied the method to integrate data from the different assays (also referred to as omic types) in a study on chronic lymphocytic leukemia (Dietrich and others, 2018). This study combined drug response measurements with molecular profiling, including gene expression and methylation. Briefly, we used normalized RNA-Seq expression values of the 5000 most variable genes, the DNA methylation M-values at the 1% most variable CpG sites, as well as the ex-vivo cell viability after exposure to 61 drugs at five different concentrations, as predictors for the response to a drug (Ibrutinib) that was not included in the set of predictors. The data were obtained from the Bioconductor package MOFAdata 1.0.0 (Argelaguet and others, 2018). In total, this resulted in a model with $n$ patient samples and $p$ predictors spanning the three data modalities.
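Schematically, the design matrix and the covariate are obtained by concatenating the assay blocks column-wise and recording each column's assay of origin. The R sketch below uses dummy placeholder matrices of arbitrary size in place of the real drug, methylation, and expression blocks; only the construction of X and of the group annotation is the point.

```r
n_samples <- 50                                                   # placeholder sample size
drugs       <- matrix(rnorm(n_samples * 305),  n_samples, 305)    # dummy drug viabilities
methylation <- matrix(rnorm(n_samples * 1000), n_samples, 1000)   # dummy M-values
mrna        <- matrix(rnorm(n_samples * 5000), n_samples, 5000)   # dummy expression values

# concatenate blocks; the assay of origin of each column is the external covariate
X <- cbind(drugs, methylation, mrna)
annot <- factor(rep(c("drugs", "methylation", "mRNA"),
                    times = c(ncol(drugs), ncol(methylation), ncol(mrna))))
# y: ex-vivo viability after Ibrutinib treatment, held out of the predictor blocks
```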
We first applied the different regression methods to the data on their original scale. Since the features have different scales (e.g., the drug responses vary from around 1 (neutral) to 0 (completely toxic), the normalized expression values from 0 to 20, and the methylation M-values from −10 to 8), this ensures that the omic type information is an informative covariate: it results in larger effect sizes of the drug response data and smaller effect sizes of the methylation and expression data compared to scaled predictors. In this setting, incorporating knowledge on the assay type into the penalized regression showed clear advantages in terms of prediction performance: the covariate-aware methods (GRridge, IPF-Lasso, and graper) all improved upon the covariate-agnostic Lasso, ridge regression, or elastic net (Figure 3a). Also the group Lasso methods, which incorporate the group information but apply a single penalty parameter, could not adapt to the scale differences. The inferred hyper-parameters $\hat{\gamma}_k$ of graper highlighted the larger effect sizes of the drug response feature group, which was strongly favored by the penalization (Figure 3b).
Fig. 3.
Application to the chronic lymphocytic leukemia data with scale differences between assays. (a) Comparison of the root mean squared error (RMSE) for the prediction of samples' viability after treatment with Ibrutinib. Performance was evaluated in a 10-fold cross-validation scheme; the points denote the individual RMSE for each fold. (b) Inferred hyper-parameters in the different folds for the three different omic types ($\hat{\gamma}_k$ on the left and $\hat{\pi}_k$ on the right).
To address differences in feature scale, a common choice made by many implementations (e.g., glmnet (Friedman and others, 2010)) is to scale all features to unit variance. Indeed, for the data at hand, this transformation was particularly beneficial for the covariate-agnostic methods, and their prediction performances became more similar to those of the covariate-aware methods. However, for dense methods such as ridge regression the covariate information on the omic type remained important (Figure 4a). Sparse methods in general resulted in very good predictions as the response to Ibrutinib can be well explained by a very sparse model containing only few drugs with related mode of action. By learning weights for each omic type graper directly highlighted the importance of the drug data as predictors (Figure 4b).
Fig. 4.
Application to the chronic lymphocytic leukemia data with standardized predictors. (a) Comparison of the root mean squared error (RMSE) for the prediction of samples' viability after treatment with Ibrutinib. Performance was evaluated in a 10-fold cross-validation scheme; the points denote the individual RMSE for each fold. (b) Inferred hyper-parameters by graper (sparse) in the different folds for the three different omic types ($\hat{\gamma}_k$ on the left and $\hat{\pi}_k$ on the right).
In general, standardization of all features is unlikely to be an optimal choice, since in many applications there is a relation between information content and amplitude. For example, we often measure high-amplitude signals that are informative jointly with low-amplitude features that originate mainly from technical noise. Here, standardization can be harmful as it would drown the informative high-amplitude features and "blow up" the noisy low-amplitude features (Figure S2 of the supplementary material available at Biostatistics online). In particular, standardization does not distinguish between meaningful differences in variance (e.g., features that differ between two disease groups) and differences in variance due to the scale. While removal of the latter would be desirable, meaningful differences should be retained. Hence, the question of whether to scale the predictors or not is related to the question of whether the variance of a feature is an informative covariate: if the variance contains important information on the relevance of a predictor, standardization should not be applied, or information on the predictors' variance should be re-included via the covariate, e.g., by binning features based on their variance. This has been shown to be beneficial for marginal testing applications, where filtering or weighting by variance increased the power to detect true positives (Bourgon and others, 2010; Ignatiadis and others, 2016). A recent study on RNA-Seq data in the context of penalized regression found no strong effect of standardization compared to no standardization (Zwiener and others, 2014). However, an example where standardization can be harmful in applications to genomic data is binary mutation data. Here, features are all on the same scale, and standardization would favor mutations with lower frequencies, which in most applications is not desirable (see also Section 3.1 of the supplementary material available at Biostatistics online).
3.2.2. Age prediction from multi-tissue gene expression data
As a second example of a covariate in genomics, we considered the tissue type. Using data from the GTEx consortium (Lonsdale and others, 2013), we asked whether the tissue type is an informative covariate in the prediction of a person's age from gene expression. Briefly, using gene expression data provided by the Bioconductor package recount 1.7.6 (Collado-Torres and others, 2017), we chose the five tissues that were available for the largest number of donors and from each tissue considered the top 50 principal components of the RNA-Seq data after normalization and variance stabilization using DESeq2 1.21.25 (Love and others, 2014). In total, this gave us $p = 250$ predictors from the five tissues for $n$ donors.
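Schematically, the predictors and the tissue covariate can be assembled as follows in R; the list of variance-stabilized expression matrices (one per tissue, rows being the same donors in the same order) is filled with dummy placeholder data here, whereas in the real analysis it would come from recount and DESeq2 as described above.

```r
set.seed(1)
tissues  <- paste0("tissue", 1:5)      # placeholder tissue names
n_donors <- 100                        # placeholder donor count
# dummy variance-stabilized expression matrices (donors x genes), aligned on the same donors
expr_by_tissue <- setNames(
  lapply(tissues, function(t) matrix(rnorm(n_donors * 2000), n_donors, 2000)),
  tissues)

top_pcs <- function(mat, k = 50) prcomp(mat, center = TRUE)$x[, 1:k]
pc_list <- lapply(expr_by_tissue, top_pcs)                # top 50 PCs per tissue
X       <- do.call(cbind, pc_list)                        # 5 x 50 = 250 predictors
annot   <- factor(rep(names(expr_by_tissue), each = 50))  # tissue of origin as covariate
```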
We observed a small advantage for methods that incorporate the tissue type as a covariate (Figure 5a): GRridge, IPF-Lasso, and graper all had a smaller prediction error compared to covariate-agnostic methods. In particular, graper resulted in comparable prediction performance to IPF-Lasso, whilst requiring less than a second for training compared to 40 min for IPF-Lasso. The learnt relative penalization strength and sparsity levels of graper can again provide insights into the relative importance of the different tissue types. In particular, we found lower penalization for blood vessel and muscle and higher penalization for blood and skin (Figure 5b). This is consistent with previous studies on a per-tissue basis, where gene expression in blood vessel has been found to be a good predictor for age, while blood was found to be less predictive (Yang and others, 2015).
Fig. 5.
Application to the GTEx data. (a) Comparison of root mean squared error (RMSE) for the prediction of donor age (in years). Performance is evaluated in a 10-fold cross-validation scheme; the points denote the individual RMSE for each fold. (b) Inferred penalty parameters for the five tissues in graper in each fold.
4. Discussion
We propose a method that can use information from external covariates to guide penalization in regression tasks and that can provide a flexible and scalable alternative to approaches that were proposed recently (Wiel and others, 2016; Boulesteix and others, 2017). We illustrated in simulations and data from biological applications that if the covariate is informative of the effect sizes in the model, these approaches can improve upon commonly used penalized regression methods that are agnostic to such information. We investigated the use of important covariates in genomics such as omic type or tissue. The performance of our approach is in many cases comparable to the IPF-Lasso method (Boulesteix and others, 2017), while scalability is highly improved in terms of the number of feature groups, thereby extending the range of possible applications.
The variational inference framework provides improved scalability compared to Bayesian methods that are based on sampling strategies. Variational Bayes methods have already been employed in the setting of Bayesian regression with spike-and-slab priors (Carbonetto and Stephens, 2012; Carbonetto and others, 2017). However, these methods do not incorporate information from external covariates. A drawback of variational methods is the fact that they often result in approximations to the posterior distribution that are too concentrated and thereby underestimate the posterior variance. Nevertheless, they have been shown to provide reasonable point estimates in regression tasks (Carbonetto and Stephens, 2012), which we focused on here. Due to the mean-field assumption, strong correlations between active predictors can lead to suboptimal results of graper. Here, the multivariate variational approximation suggested as an alternative above can be of advantage; however, it comes at the price of higher computational costs. What is not addressed in our current implementation is the common problem of missing values in the data; if present, they would need to be imputed beforehand.
While our approach is related to methods that adapt the penalty function in order to incorporate structural knowledge, such as the group Lasso (Yuan and Lin, 2006), sparse group Lasso (Friedman and others, 2010), or fused Lasso (Tibshirani and others, 2005), these approaches apply the same penalty parameter to all the different groups and perform hard inclusion or exclusion of groups instead of the softer weighting proposed here. Alternatively, the loss function can be modified to incorporate prior knowledge based on a known set of "high-confidence predictors" as proposed by Jiang and others (2016). The existence and identity of such "high-confidence predictors," however, is often not clear.
In contrast to frequentist regression methods, the Bayesian approach provides direct posterior inclusion probabilities for each feature that can be useful for model selection. To obtain frequentist guarantees on the selected features, it could be promising to combine the approach with recently developed methods for controlling the false discovery rate, such as knockoffs (Candes and others, 2018). For this, feature statistics can be constructed based on the estimated coefficients or inclusion probabilities from our model, as long as the knockoffs are given the same covariate information as their true counterparts.
An interesting question that we have not addressed is the quest for rigorous criteria for when the inclusion of a covariate via differential penalization is advantageous. This question is not limited to the framework of penalized regression but affects the general setting of shrinkage estimation. While joint shrinkage of a set of estimates can be very powerful in producing more stable estimates with reduced variance, care needs to be taken as to which measurements to combine in such a shrinkage approach. As in the case of coefficients in the linear model setting, external covariates could be helpful for this decision and facilitate a more informed shrinkage. However, allowing for differential shrinkage reintroduces some degrees of freedom into the model and can only be advantageous if the covariate provides "sufficient" information to balance this. For future work, it would be of interest to find general conditions for when this is the case, thereby enabling an informed choice of covariates in practice.
We provide an open-source implementation of our method in the Bioconductor package graper. In addition, vignettes and scripts are made available that facilitate the comparison of graper with various related regression methods and can be used to reproduce all results contained in this work (https://git.embl.de/bvelten/graper_analyses).
5. Software
The method is implemented in the Bioconductor package graper, scripts for the analyses contained in this article can be found at https://git.embl.de/bvelten/graper_analyses.
Supplementary Material
Acknowledgments
We thank Bernd Klaus for providing useful comments on the manuscript. Conflict of Interest: None declared.
Funding
This work was supported by the European Union Horizon 2020 project SOUND [633974] and the EMBL International PhD Program.
References
- Alyass, A., Turcotte, M. and Meyre, D. (2015). From big data analysis to personalized medicine for all: challenges and opportunities. BMC Medical Genomics 8, 33.
- Andersen, M. R., Vehtari, A., Winther, O. and Hansen, L. K. (2017). Bayesian inference for spatio-temporal spike and slab priors. Journal of Machine Learning Research 18, 1–58.
- Argelaguet, R., Velten, B., Arnol, D., Dietrich, S., Zenz, T., Marioni, J. C., Buettner, F., Huber, W. and Stegle, O. (2018). Multi-omics factor analysis—a framework for unsupervised integration of multi-omic data sets. Molecular Systems Biology 14, e8124.
- Bergersen, L. C., Glad, I. K. and Lyng, H. (2011). Weighted Lasso with data integration. Statistical Applications in Genetics and Molecular Biology 10, 1–29.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
- Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112, 859–877.
- Boulesteix, A.-L., De Bin, R., Jiang, X. and Fuchs, M. (2017). IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Computational and Mathematical Methods in Medicine 2017, 1–14.
- Bourgon, R., Gentleman, R. and Huber, W. (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107, 9546–9551.
- Candes, E., Fan, Y., Janson, L. and Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 551–577.
- Carbonetto, P. and Stephens, M. (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis 7, 73–108.
- Carbonetto, P., Zhou, X. and Stephens, M. (2017). varbvs: fast variable selection for large-scale regression. arXiv preprint arXiv:1709.06597.
- Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009). Handling sparsity via the horseshoe. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5, 73–80.
- Chen, R. and Snyder, M. (2013). Promise of personalized omics to precision medicine. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 5, 73–82.
- Collado-Torres, L., Nellore, A., Kammers, K., Ellis, S. E., Taub, M. A., Hansen, K. D., Jaffe, A. E., Langmead, B. and Leek, J. T. (2017). Reproducible RNA-seq analysis using recount2. Nature Biotechnology 35, 319.
- Dietrich, S., Oleś, M., Lu, J., Sellner, L., Anders, S., Velten, B., Wu, B., Hüllein, J., da Silva Liberio, M., Walther, T. and others. (2018). Drug-perturbation-based stratification of blood cancer. The Journal of Clinical Investigation 128, 427–445.
- Dobriban, E., Fortney, K., Kim, S. K. and Owen, A. B. (2015). Optimal multiple testing under a Gaussian prior on the effect sizes. Biometrika 102, 753–766.
- Engelhardt, B. E. and Adams, R. P. (2014). Bayesian structured sparsity from Gaussian fields. arXiv preprint arXiv:1407.2235.
- Ferkingstad, E., Frigessi, A., Rue, H., Thorleifsson, G. and Kong, A. (2008). Unsupervised empirical Bayesian multiple testing with external covariates. The Annals of Applied Statistics 2, 714–735.
- Friedman, J., Hastie, T. and Tibshirani, R. (2010a). A note on the group Lasso and a sparse group Lasso. arXiv preprint arXiv:1001.0736.
- Friedman, J., Hastie, T. and Tibshirani, R. (2010b). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1.
- Hamburg, M. A. and Collins, F. S. (2010). The path to personalized medicine. The New England Journal of Medicine 363, 301–304.
- Hasin, Y., Seldin, M. and Lusis, A. (2017). Multi-omics approaches to disease. Genome Biology 18, 83.
- Hernández-Lobato, D., Hernández-Lobato, J. M. and Dupont, P. (2013). Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research 14, 1891–1945.
- Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
- Ignatiadis, N., Klaus, B., Zaugg, J. and Huber, W. (2016). Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods 13, 577.
- Jaakkola, T. S. and Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics and Computing 10, 25–37.
- Jiang, Y., He, Y. and Zhang, H. (2016). Variable selection with prior information for generalized linear models via the prior LASSO method. Journal of the American Statistical Association 111, 355–376.
- Lei, L. and Fithian, W. (2018). AdaPT: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 649–679.
- Li, A. and Barber, R. F. (2019). Multiple testing with the structure-adaptive Benjamini–Hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81, 45–74.
- Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters, G., Garcia, F., Young, N. and others. (2013). The genotype-tissue expression (GTEx) project. Nature Genetics 45, 580–585.
- Love, M. I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550.
- MacKay, D. J. C. (1996). Bayesian methods for backpropagation networks. In: Domany, E. and others (editors), Models of Neural Networks III. New York, NY: Springer, pp. 211–254.
- Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association 83, 1023–1032.
- Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association 103, 681–686.
- Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A. and Kim, D. (2015). Methods of integrating data to uncover genotype-phenotype interactions. Nature Reviews Genetics 16, 85.
- Rockova, V. and Lesaffre, E. (2014). Incorporating grouping information in Bayesian variable selection with applications in genomics. Bayesian Analysis 9, 221–258.
- Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.
- Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 91–108.
- Titsias, M. K. and Lázaro-Gredilla, M. (2011). Spike and slab variational inference for multi-task and multiple kernel learning. In: Shawe-Taylor, J. and others (editors), Advances in Neural Information Processing Systems. New York: Curran Associates, pp. 2339–2347.
- Veríssimo, A., Oliveira, A. L., Sagot, M.-F. and Vinga, S. (2016). DegreeCox—a network-based regularization method for survival analysis. BMC Bioinformatics 17, 449.
- Wiel, M. A., Lien, T. G., Verlaat, W., Wieringen, W. N. and Wilting, S. M. (2016). Better prediction by use of co-data: adaptive group-regularized ridge regression. Statistics in Medicine 35, 368–381.
- Wu, A., Park, M., Koyejo, O. O. and Pillow, J. W. (2014). Sparse Bayesian structure learning with dependent relevance determination priors. In: Ghahramani, Z. and others (editors), Advances in Neural Information Processing Systems. New York: Curran Associates, pp. 1628–1636.
- Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group Lasso. Bayesian Analysis 10, 909–936.
- Yang, J., Huang, T., Petralia, F., Long, Q., Zhang, B., Argmann, C., Zhao, Y., Mobbs, C. V., Schadt, E. E., Zhu, J. and others. (2015). Synchronized age-related gene expression changes across multiple tissues in human and the link to complex diseases. Scientific Reports 5, 15145.
- Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67.
- Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
- Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.
- Zwiener, I., Frisch, B. and Binder, H. (2014). Transforming RNA-Seq data to improve the performance of prognostic gene signatures. PLoS One 9, e85150.