Biostatistics (Oxford, England). 2019 Oct 9; 22(2): 348–364. doi: 10.1093/biostatistics/kxz034

Adaptive penalization in high-dimensional regression and classification with external covariates using variational Bayes

Britta Velten, Wolfgang Huber
PMCID: PMC8036004  PMID: 31596468

Summary

Penalization schemes like Lasso or ridge regression are routinely used to regress a response of interest on a high-dimensional set of potential predictors. Despite being decisive for the result, the relative strength of penalization is often glossed over and only implicitly determined by the scale of the individual predictors. At the same time, additional information on the predictors is available in many applications but left unused. Here, we propose to make use of such external covariates to adapt the penalization in a data-driven manner. We present a method that differentially penalizes feature groups defined by the covariates and adapts the relative strength of penalization to the information content of each group. Using techniques from the Bayesian tool-set, our procedure combines shrinkage with feature selection and provides a scalable optimization scheme. We demonstrate in simulations that the method accurately recovers the true effect sizes and sparsity patterns per feature group. Furthermore, it leads to improved prediction performance in situations where the groups differ strongly in dynamic range. In applications to data from high-throughput biology, the method enables re-weighting the importance of feature groups from different assays. Overall, using available covariates extends the range of applications of penalized regression, improves model interpretability, and can improve prediction performance.

Keywords: Classification, External covariates, Feature selection, Penalized regression, Variational Bayes

1. Introduction

We are interested in the setup where we observe a continuous or categorical response $y$ together with a vector of potential predictors, or features, $x \in \mathbb{R}^p$, and aim to find a relationship of the form $y \approx f(x)$. Two main questions are of potential interest in this setting. First, we want to obtain an $\hat{f}$ that yields good predictions of $y$ for a new observation $x$. Second, we aim at finding which components of $x$ are the "important ones" for the prediction.

A common and useful approach to this end are (generalized) linear regression methods, which assume that the distribution of $y$ depends on $x$ via the linear term $x^T\beta$. In order to cope with the high dimensionality of $x$ and avoid over-fitting, penalization on $\beta$ is employed, e.g., in ridge regression (Hoerl and Kennard, 1970), Lasso (Tibshirani, 1996), or elastic net (Zou and Hastie, 2005). By constraining the values of $\beta$, the complexity of the model is restricted, resulting in biased but less variable estimates and improved prediction performance. In addition, some choices of the penalty yield estimates with a relatively small number of non-zero components, thereby facilitating feature selection. An example is the $\ell_1$-penalty employed in Lasso or elastic net.

Commonly, penalization methods apply a penalty that is symmetric in the model coefficients. Real data, however, often consist of a collection of heterogeneous features, which such an approach does not account for. In particular, it ignores any additional information or structural differences that may be present in the features. Often we encounter feature vectors $x$ whose components comprise multiple data modalities and data qualities, e.g., measurement values from different assays. Other side-information on individual features could include temporal or spatial information, quality metrics associated with each measurement, or the features' sample variance, frequency, or signal-to-noise ratio. It has already been observed in multiple testing that the power of the analysis can be improved by making use of such external information (e.g., Ferkingstad and others, 2008; Dobriban and others, 2015; Ignatiadis and others, 2016; Li and Barber, 2019; Lei and Fithian, 2018). In current penalized regression models, however, this information is frequently ignored. Making use of it could on one hand improve prediction performance. On the other hand, it might yield important insight into the relationship of the external covariates to the features' importance. For example, if the covariate encodes different data modalities, insights into their relative importance could help cut costs by reducing future assays to the essential data modalities.

As a motivating example, we consider applications in molecular biology and precision medicine. Here, the aim is to predict phenotypic outcomes, such as treatment response, and to identify reliable disease markers based on molecular data. Nowadays, different high-throughput technologies can be combined to jointly measure thousands of molecular features from different biological layers (Ritchie and others, 2015; Hasin and others, 2017). Examples include genetic alterations, gene expression, methylation patterns, protein abundances, or microbiome occurrences. However, despite the increasing availability of molecular and clinical data, outcome prediction remains challenging (Hamburg and Collins, 2010; Chen and Snyder, 2013; Alyass and others, 2015). Common applications of penalized regression only make use of parts of the available data. For example, different assay types are simply concatenated or analyzed separately. In addition, available annotations on individual features are left unused, such as their chromosomal location or gene set and pathway membership. Incorporating side-information on the assay type and spatial or functional annotations could help to improve prediction performance. Furthermore, it could help prioritize feature groups, such as different assays or gene sets.

Here, we propose a method that incorporates external covariates in order to guide penalization and learns the relationship between the covariate and the features' effect sizes in a data-driven way. We introduce the method for linear models and extend it to classification. We demonstrate that this can improve prediction performance and yields insights into the relative importance of different feature sets, both on simulated data and in applications to high-throughput biology.

2. Methods

2.1. Problem statement

Assume we are given observations $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$ (possibly $p \gg n$) from a linear model, i.e.,

$$y = X\beta + \epsilon \qquad (2.1)$$

with $\epsilon \sim N(0, \sigma^2 I_n)$. In addition, we suppose that we have access to a covariate $\xi_j \in \Xi$ for each predictor $j = 1, \dots, p$. We hope, loosely speaking, that $\xi_j$ contains some sort of information on the magnitude of $\beta_j$. The question we want to address is: can we use the information from $\xi$ to improve upon estimation of $\beta$ and prediction of $y$?

In order to estimate $\beta$ from a finite sample $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, we can employ penalization on the negative log-likelihood of the model, i.e.,

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \mathrm{pen}(\beta_j), \qquad (2.2)$$

where $\mathrm{pen}$ denotes a penalty function on the model coefficients. For example, $\mathrm{pen}(\beta_j) = |\beta_j|^q$ leads to Lasso ($q = 1$) or ridge regression ($q = 2$). The parameter $\lambda \geq 0$ controls the amount of penalization and thereby the model complexity. Ideally, we would like to choose an optimal $\lambda$. For estimation this means minimizing the mean squared error $\mathbb{E}\,\|\hat{\beta} - \beta\|_2^2$; for prediction this means minimizing the expected prediction error. In practice, $\lambda$ is often chosen to minimize the cross-validated error.
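As a concrete illustration of this tuning step (not part of the original analysis), the following sketch selects the Lasso penalty by cross-validation on synthetic data using scikit-learn; all names and values are placeholders.

```python
# Minimal sketch: choosing the penalty strength lambda (called alpha in
# scikit-learn) by cross-validation on synthetic data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 2.0                              # only 10 active predictors
y = X @ beta + rng.standard_normal(n)

fit = LassoCV(cv=5, fit_intercept=False).fit(X, y)
print(f"CV-selected penalty: {fit.alpha_:.4f}, "
      f"non-zeros: {np.sum(fit.coef_ != 0)}")
```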

In most applications, the penalization is symmetric, i.e., for any permutation matrix $P$ we have $\mathrm{pen}(P\beta) = \mathrm{pen}(\beta)$. However, as we have external information on each feature given by $\xi$, we want to allow for differential penalization guided by $\xi$. For this, we will consider the following non-symmetric generalization, which still leads to a convex optimization problem in $\beta$ for convex penalty functions $\mathrm{pen}$, such as the absolute value or the square:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 + \sum_{j=1}^{p} \lambda(\xi_j)\, \mathrm{pen}(\beta_j). \qquad (2.3)$$

Instead of a constant $\lambda$, here $\lambda \colon \Xi \to [0, \infty)$ provides a mapping from the covariate $\xi_j$ to a non-negative penalty factor $\lambda(\xi_j)$. This additional flexibility compared to a single penalty parameter can be helpful if $\xi$ contains information on $\beta$. For example, in the simple case of ridge regression with a deterministic orthonormal design matrix, known noise variance $\sigma^2$, and "oracle covariate" $\xi_j = \beta_j$, the optimal choice is seen to be $\lambda(\xi_j) = \sigma^2 / \xi_j^2$. However, in practice the information in $\xi$ is not that explicit, and hence we do not know which $\lambda$ is optimal.
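For the $\ell_1$ case, the objective (2.3) with known penalty factors $\lambda_j = \lambda(\xi_j)$ reduces to a standard Lasso after rescaling each column of $X$ by $1/\lambda_j$ and mapping the solution back. The sketch below demonstrates this equivalence on synthetic data; the helper function is our own illustration, and R's glmnet exposes the same idea through its penalty.factor argument.

```python
# Hedged sketch: solving  min_b ||y - X b||^2 + alpha * sum_j lam_j |b_j|
# via a plain Lasso on the rescaled columns X_j / lam_j, with b_j = w_j / lam_j.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_with_penalty_factors(X, y, lam, alpha=1.0):
    """lam: length-p array of positive penalty factors lambda(xi_j)."""
    X_scaled = X / lam                       # rescale each column j by 1/lam_j
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X_scaled, y)
    return fit.coef_ / lam                   # map back to the original scale

rng = np.random.default_rng(1)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta = np.concatenate([rng.normal(0, 2, size=100), np.zeros(100)])
y = X @ beta + rng.standard_normal(n)

# Covariate-informed factors: penalize the (truly inactive) second group more.
lam = np.concatenate([np.ones(100), 10.0 * np.ones(100)])
beta_hat = lasso_with_penalty_factors(X, y, lam, alpha=0.1)
print("non-zeros per group:",
      int(np.sum(beta_hat[:100] != 0)), int(np.sum(beta_hat[100:] != 0)))
```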

If $\lambda(\cdot)$ takes values in a small set of discrete values, e.g., for a categorical covariate with few levels, cross-validation could be used to determine a suitable set of function values. This approach is employed by Boulesteix and others (2017), where categorical covariates encode different data modalities. However, cross-validation soon becomes prohibitive, as it requires a grid search exponential in the number of categories defined by $\xi$. Similarly, cross-validation can be employed with $\lambda(\cdot)$ parametrized by a small number of tuning parameters, using domain knowledge to come up with a suitable parametric form for $\lambda$ (Bergersen and others, 2011; Veríssimo and others, 2016). However, such an explicit form is often not available. In many situations, coming up with a helpful relationship between $\xi$ and $\beta$, and thereby knowing which values of the covariate call for more or less penalization, is itself a major problem. Therefore, we aim at finding $\lambda$ in a data-driven manner and with improved scalability compared to cross-validation.

2.2. Problem statement from a Bayesian perspective

There is a direct correspondence between estimates obtained from penalized regression and Bayesian estimates with penalization via corresponding priors on the coefficients. For example, the ridge estimate corresponds to the maximum a posteriori (MAP) estimate in a Bayesian regression model with a normal prior on $\beta$, and the Lasso estimate to a MAP estimate with a Laplace prior on $\beta$. This correspondence opens up alternative strategies using tools from the Bayesian mindset to approach the problem outlined above: differential penalization translates to introducing different priors on the components of $\beta$. Our belief that $\xi$ carries information on $\beta$ can be incorporated by using prior distributions whose parameters depend on $\xi$. Wiel and others (2016) used this idea to derive an empirical Bayes approach for finding group-wise penalty parameters in ridge regression. However, this approach does not obviously generalize to other penalties such as the Lasso.

Moving completely into the Bayesian mindset, we instead turn to explicit specification of priors to implement the penalization task. Different priors have been suggested (Mitchell and Beauchamp, 1988; MacKay, 1996; Park and Casella, 2008; Carvalho and others, 2009), and structural knowledge has been incorporated into the penalization by employing multivariate priors that encode the structure in the covariance, or non-exchangeable priors with different hyper-parameters (e.g., Hernández-Lobato and others, 2013; Engelhardt and Adams, 2014; Rockova and Lesaffre, 2014; Wu and others, 2014; Andersen and others, 2017; Xu and Ghosh, 2015 and references therein). Despite the possible gains in prediction performance when incorporating such structural knowledge, these methods have not been widely applied. A limiting factor has often been the lack of scalability to large datasets.

2.3. Setup and notation

From the linear model assumption we have

$$y \,\vert\, X, \beta, \tau \sim N\!\left(X\beta, \tau^{-1} I_n\right), \qquad (2.4)$$

where $\tau$ denotes the precision of the noise. Based on the external covariate $\xi$ we define a partition of the $p$ predictors into $K$ groups,

$$\{1, \dots, p\} = \bigcup_{k=1}^{K} G_k \quad \text{with } G_k \cap G_l = \emptyset \text{ for } k \neq l. \qquad (2.5)$$

For instance, a categorical covariate $\xi_j \in \{1, \dots, K\}$, such as the assay type, naturally defines such a partition. For continuous covariates, the groups $G_k$ can be defined based on suitable binning or clustering.

To achieve penalization in dependence of $\xi$, we consider a spike-and-slab prior (Mitchell and Beauchamp, 1988) on the model coefficients $\beta$ with a different slab precision $\gamma_k$ and mixing parameter $\pi_k$ for each group. We re-parametrize $\beta_j$ as $\beta_j = s_j b_j$ with

$$b_j \,\vert\, \gamma_k \sim N\!\left(0, \gamma_k^{-1}\right) \quad \text{for } j \in G_k, \qquad (2.6)$$
$$s_j \,\vert\, \pi_k \sim \mathrm{Ber}(\pi_k) \quad \text{for } j \in G_k. \qquad (2.7)$$

In the special case of $\pi_k = 1$, this yields a normal prior as in MacKay (1996), corresponding to ridge regression. With $\pi_k < 1$ we additionally promote sparsity on the coefficients, and the value of $\pi_k$ controls the number of active predictors in each group. The value of $\gamma_k$ controls the overall shrinkage per group. To learn the model hyper-parameters $\gamma_k$, $\pi_k$ and the noise precision $\tau$, we choose the following conjugate priors

$$\tau \sim \mathrm{Gamma}(r_\tau, d_\tau) \qquad (2.8)$$

and for each group $k = 1, \dots, K$

$$\gamma_k \sim \mathrm{Gamma}(r_\gamma, d_\gamma), \qquad (2.9)$$
$$\pi_k \sim \mathrm{Beta}(d_\pi, r_\pi), \qquad (2.10)$$

with fixed positive shape and rate parameters $r_\tau, d_\tau, r_\gamma, d_\gamma, d_\pi, r_\pi$. Hence, the joint probability of the model is given by

$$p(y, b, s, \gamma, \pi, \tau \,\vert\, X) = p(y \,\vert\, X, b, s, \tau)\; p(\tau) \prod_{k=1}^{K} \Big( p(\gamma_k)\, p(\pi_k) \prod_{j \in G_k} p(b_j \,\vert\, \gamma_k)\, p(s_j \,\vert\, \pi_k) \Big). \qquad (2.11)$$
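To make the generative model (2.4)–(2.11) explicit, the following sketch draws one dataset from it; the hyper-parameter values and group sizes are arbitrary choices for illustration, not those used in the package or the experiments.

```python
# Illustrative draw from the hierarchical model (2.4)-(2.11).
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 100, 60, 3
groups = np.repeat(np.arange(K), p // K)       # partition G_1, ..., G_K from xi

r_tau, d_tau = 1.0, 1.0                        # Gamma(shape r, rate d) priors
r_gamma, d_gamma = 1.0, 1.0
d_pi, r_pi = 1.0, 1.0

tau = rng.gamma(r_tau, 1.0 / d_tau)            # noise precision, Eq. (2.8)
gamma = rng.gamma(r_gamma, 1.0 / d_gamma, K)   # slab precisions, Eq. (2.9)
pi = rng.beta(d_pi, r_pi, K)                   # mixing weights, Eq. (2.10)

b = rng.normal(0.0, 1.0 / np.sqrt(gamma[groups]))   # slab values, Eq. (2.6)
s = rng.binomial(1, pi[groups])                     # spike indicators, Eq. (2.7)
beta = s * b                                        # beta_j = s_j * b_j

X = rng.standard_normal((n, p))
y = X @ beta + rng.normal(0.0, 1.0 / np.sqrt(tau), n)   # likelihood, Eq. (2.4)
```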

2.4. Inference using variational Bayes

The challenge now lies in inferring the posterior of the model parameters from the observed data $y$, $X$ and the covariate $\xi$. While Markov chain Monte Carlo methods are frequently used for this purpose, they do not scale well to large datasets. Here, we adopt a variational inference framework (Bishop, 2006; Blei and others, 2017) that has been used (in combination with importance sampling) for variable selection with exchangeable priors (Carbonetto and Stephens, 2012; Carbonetto and others, 2017). Denoting all unobserved model components by $\mathbf{x} = (b, s, \gamma, \pi, \tau)$, we approximate the posterior $p(\mathbf{x} \,\vert\, y)$ by a distribution $q(\mathbf{x})$ from a restricted class of distributions $\mathcal{Q}$, where the goodness of the approximation is measured in terms of the Kullback–Leibler (KL) divergence, i.e.,

$$\min_{q \in \mathcal{Q}} \; \mathrm{KL}\!\left( q(\mathbf{x}) \,\Vert\, p(\mathbf{x} \,\vert\, y) \right). \qquad (2.12)$$

A common and useful choice for the class $\mathcal{Q}$ is the mean-field approximation, i.e., distributions that factorize in their parameters. We consider

$$q(\mathbf{x}) = q(\tau) \prod_{k=1}^{K} q(\gamma_k)\, q(\pi_k) \prod_{j=1}^{p} q(b_j, s_j), \qquad (2.13)$$

where $b_j$ and $s_j$ are not factorized due to their strong dependencies (Titsias and Lázaro-Gredilla, 2011).

The variational approach leads to an iterative inference algorithm (Blei and others, 2017) by observing that minimizing the KL-divergence is equivalent to maximizing the evidence lower bound $\mathcal{L}(q)$ defined by

$$\log p(y) = \mathcal{L}(q) + \mathrm{KL}\!\left( q(\mathbf{x}) \,\Vert\, p(\mathbf{x} \,\vert\, y) \right). \qquad (2.14)$$

From this, we have

$$\mathcal{L}(q) = \mathbb{E}_q\!\left[ \log p(y, \mathbf{x}) \right] - \mathbb{E}_q\!\left[ \log q(\mathbf{x}) \right] \qquad (2.15)$$
$$\mathcal{L}(q) = \mathbb{E}_q\!\left[ \log p(y, \mathbf{x}) \right] + H(q), \qquad (2.16)$$

with $H$ denoting the differential entropy.

Variational methods are based on maximization of the functional $\mathcal{L}(q)$ with respect to $q$ in order to obtain a tight lower bound on the log model evidence and minimize the KL-divergence between the density $q$ and the true (intractable) posterior. Under a mean-field assumption $q(\mathbf{x}) = \prod_i q_i(x_i)$, the optimal factor $q_i$, keeping all other factors fixed, is given by

$$\log q_i^{*}(x_i) = \mathbb{E}_{q_{-i}}\!\left[ \log p(y, \mathbf{x}) \right] + \mathrm{const.} \qquad (2.17)$$

Iterative optimization of each factor results in Algorithm S1 of the supplementary material available at Biostatistics online. Further details on the variational inference and the updates can be found in Sections 1 and 2 of the supplementary material available at Biostatistics online. The method is implemented in the open-source Bioconductor package graper. From the obtained approximation $q$ of the posterior distribution, we obtain point estimates for the model parameters; in particular, we will use the posterior means of $\beta$, $\gamma_k$, and $\pi_k$.
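To convey the flavor of the resulting coordinate updates, here is a deliberately simplified sketch of the coordinate-ascent step for the factors $q(b_j, s_j)$, with a single group and the hyper-parameters $\tau$, $\gamma$, $\pi$ held fixed. Algorithm S1 additionally updates $q(\tau)$, $q(\gamma_k)$, and $q(\pi_k)$ per group, so this is an illustrative reduction in the spirit of Carbonetto and Stephens (2012), not the graper implementation.

```python
# Simplified CAVI for spike-and-slab regression with fixed tau, gamma, pi.
# q(b_j, s_j) is parametrized by alpha_j = q(s_j = 1) and the slab mean mu_j.
import numpy as np
from scipy.special import expit

def cavi_spike_slab(X, y, tau, gamma, pi, n_iter=100):
    n, p = X.shape
    d = np.sum(X ** 2, axis=0)              # diagonal of X^T X
    s2 = 1.0 / (tau * d + gamma)            # variance of q(b_j | s_j = 1)
    alpha, mu = np.full(p, 0.5), np.zeros(p)
    Xb = X @ (alpha * mu)                   # current E_q[X beta]
    for _ in range(n_iter):
        for j in range(p):
            old = alpha[j] * mu[j]
            # X_j^T (y - sum_{l != j} alpha_l mu_l X_l)
            r_j = X[:, j] @ (y - Xb) + d[j] * old
            mu[j] = tau * s2[j] * r_j
            logit = (np.log(pi / (1.0 - pi))
                     + 0.5 * np.log(s2[j] * gamma)
                     + 0.5 * mu[j] ** 2 / s2[j])
            alpha[j] = expit(logit)
            Xb += X[:, j] * (alpha[j] * mu[j] - old)
    return alpha, alpha * mu                # inclusion probs, E_q[beta]

rng = np.random.default_rng(5)
n, p = 150, 100
X = rng.standard_normal((n, p))
beta = np.concatenate([rng.normal(0, 2, 10), np.zeros(90)])
y = X @ beta + rng.standard_normal(n)
alpha, beta_hat = cavi_spike_slab(X, y, tau=1.0, gamma=0.25, pi=0.1)
print("features with alpha > 0.5:", np.flatnonzero(alpha > 0.5))
```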

Remark on the choice of the mean-field assumption.

An interesting deviation from the standard fully factorized mean-field assumption in Equation (2.13) is to use a multivariate variational distribution for the model coefficients. This is easily possible for the dense model ($\pi_k = 1$), where we can consider the factorization

$$q(\mathbf{x}) = q(b)\, q(\tau) \prod_{k=1}^{K} q(\gamma_k).$$

In particular, a multivariate distribution is kept for the model coefficients $b$ instead of factorizing over the individual $b_j$. Thereby, this approach can capture dependencies between model coefficients in the inferred posterior and is less approximative. We will show below that this can improve the prediction results. However, a drawback of this approach is its computational complexity, as it requires the calculation and inversion of a $p \times p$ covariance matrix in each step. While this can be reduced to quadratic complexity as described in Section 2.1 of the supplementary material available at Biostatistics online, it is still prohibitive for many applications. Therefore, we concentrate in the following on the fully factorized mean-field assumption but include comparisons to the multivariate approach in the Results.

2.5. Extension to logistic regression

The model of Section 2.3 can be flexibly adapted to other types of generalized linear regression with suitable link functions and likelihoods. However, the inference framework needs to be adapted due to the loss of conjugacy. Here, we extend the model to logistic regression with a binary response variable, where we assume that the response follows a Bernoulli likelihood with a logistic link function,

$$p(y_i = 1 \,\vert\, x_i, \beta) = \sigma\!\left(x_i^T \beta\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}. \qquad (2.18)$$

While the prior structure and the core of the variational inference are identical to the case of a linear model, additional approximations are necessary. For this purpose, we adopt work by Jaakkola and Jordan (2000) and approximate the likelihood using a lower bound on the logistic function. For an arbitrary $\zeta \in \mathbb{R}$ we have

$$\sigma(z) \geq \sigma(\zeta)\, \exp\!\left( \frac{z - \zeta}{2} - \lambda(\zeta)\left(z^2 - \zeta^2\right) \right) \qquad (2.19)$$

with $\lambda(\zeta) = \frac{1}{2\zeta}\left( \sigma(\zeta) - \frac{1}{2} \right)$. With this, the log-likelihood $\log p(y \,\vert\, X, \beta)$ can be bounded from below by

$$\sum_{i=1}^{n} \left[ \log \sigma(\zeta_i) + \frac{(2y_i - 1)\, x_i^T \beta - \zeta_i}{2} - \lambda(\zeta_i)\left( (x_i^T \beta)^2 - \zeta_i^2 \right) \right]. \qquad (2.20)$$

As this approximation restores a quadratic form in $\beta$, the remaining updates can be adopted from the case of a linear model above, with the additional variational parameters $\zeta_i$ (see Section 2.2 of the supplementary material available at Biostatistics online for details).
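The bound (2.19) is straightforward to verify numerically; the following small check (our own, for illustration) confirms that the quadratic lower bound stays below $\sigma(z)$ everywhere and touches it at $z = \pm\zeta$.

```python
# Numerical check of the Jaakkola-Jordan bound (2.19) on the logistic function.
import numpy as np
from scipy.special import expit                 # sigma(z) = 1 / (1 + e^{-z})

def jj_lower_bound(z, zeta):
    lam = (expit(zeta) - 0.5) / (2.0 * zeta)    # lambda(zeta) from Eq. (2.19)
    return expit(zeta) * np.exp((z - zeta) / 2.0 - lam * (z ** 2 - zeta ** 2))

z = np.linspace(-6, 6, 1001)
for zeta in (0.5, 2.0, 4.0):
    gap = expit(z) - jj_lower_bound(z, zeta)
    print(f"zeta={zeta}: min gap = {gap.min():.2e} (non-negative up to rounding), "
          f"gap at z=zeta = {expit(zeta) - jj_lower_bound(zeta, zeta):.1e}")
```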

3. Results

3.1. Results on simulated data

First, we evaluated the method on simulated data to test its ability to recover the model coefficients and hyper-parameters per group. For this, a random $n \times p$ design matrix $X$ was generated from a multivariate normal distribution with mean zero and a Toeplitz covariance structure $\Sigma_{ij} = \rho^{|i-j|}$, and the response was simulated from a linear model with normal error. The $p$ predictors were split into six groups of equal size, and the coefficients were simulated from the model as described in Equations (2.6) and (2.7) with fixed $\gamma_k$ and $\pi_k$ for each group. In particular, each pair of groups was assigned a common $\gamma$-value, with the largest $\gamma$ (i.e., the smallest coefficient amplitudes) for groups 5 and 6. For each pair of groups with the same $\gamma$-value, the sparsity level $\pi$ was varied between the two groups, ranging from 0 (sparse) to 1 (dense). We then varied the number of features $p$, the number of samples $n$, the correlation strength $\rho$, the noise precision $\tau$, and the sparsity level $\pi$ (Table S1 of the supplementary material available at Biostatistics online) and generated ten independent datasets for each setting. We evaluated the recovery of the hyper-parameters $\gamma_k$ and $\pi_k$ for each group and compared the predictive performance and computational complexity to those of related methods, including ridge regression (Hoerl and Kennard, 1970), Lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), adaptive Lasso (Zou, 2006), sparse group Lasso, group Lasso (Friedman and others, 2010), GRridge (Wiel and others, 2016), varbvs (Carbonetto and others, 2017), and IPF-Lasso (Boulesteix and others, 2017). These methods were used as implemented in the respective R packages provided by the authors, i.e., glmnet 2.0-16, SGL 1.1, grpreg 3.2-0, GRridge 1.7.1, varbvs 2.4-0, and ipflasso 0.1.
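For reference, a sketch of this simulation design is given below; the specific $\gamma$ and $\pi$ values are placeholders, since the exact settings are listed in Table S1 of the supplementary material.

```python
# Sketch of the simulation design: rows of X from N(0, Sigma) with Toeplitz
# covariance Sigma_ij = rho^|i-j|, six groups, pairs of groups sharing gamma,
# and a sparse/dense contrast in pi within each pair (values are placeholders).
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(3)
n, p, K, rho, tau = 200, 300, 6, 0.5, 1.0
Sigma = toeplitz(rho ** np.arange(p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

groups = np.repeat(np.arange(K), p // K)
gamma = np.repeat([0.1, 1.0, 100.0], 2)       # pairs of groups share gamma
pi = np.tile([0.1, 0.9], 3)                   # sparse vs. dense within a pair

beta = (rng.binomial(1, pi[groups])
        * rng.normal(0.0, 1.0 / np.sqrt(gamma[groups])))
y = X @ beta + rng.normal(0.0, 1.0 / np.sqrt(tau), n)
```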

3.1.1. Recovery of hyper-parameters

The algorithm accurately recovered the relative importance of different groups (encoded by $\gamma_k$) and the group-wise sparsity level (encoded by $\pi_k$) across a large range of settings, as shown in Figure S1 of the supplementary material available at Biostatistics online. The method failed to recover those parameters accurately only if the ratio between sample size and number of features was too small or the sparsity parameter $\pi$ was too close to 1. These settings were challenging for all methods, as can be seen in Section 3.1.2, where we evaluated estimation and prediction performance in comparison to other methods. In addition, the groups had to contain sufficiently many predictors to reliably estimate group-wise parameters, as seen in Figure S1b of the supplementary material available at Biostatistics online. We also noted that a low signal-to-noise ratio could impede the estimation of hyper-parameters, as can be seen from the groups with a very large $\gamma$-value (meaning low coefficient amplitudes, as in groups 5 and 6) combined with a low precision $\tau$ of the noise term.

3.1.2. Prediction and estimation performance

Next, we compared the estimation of the true model coefficients and the prediction accuracy on an independent test set. Overall, the method showed improved performance for a large range of sample sizes, correlations, numbers of features, noise variances, and numbers of active features, both in terms of the root mean squared error on the test response and for the estimation of $\beta$ (Figure 1). Among the non-sparse methods, graper with the multivariate (non-factorized) mean-field assumption clearly outperformed the fully factorized mean-field assumption as well as GRridge and group Lasso. The covariate-agnostic ridge regression performed worst in most cases. Sparse methods performed better in general in this simulation example, as the underlying model had a large fraction of zero coefficients. Here, we observed that graper was comparable to IPF-Lasso, which is the most closely related method. Only in settings with a very high number of active predictors or strong correlations between the predictors ($\rho$ close to one) was the method outperformed by IPF-Lasso.

Fig. 1. Root mean squared error (RMSE) of the predicted response (left) and the estimate of $\beta$ (right) for different methods when varying one of the simulation parameters (a–e) as described in Table S1 of the supplementary material available at Biostatistics online. The prediction error is assessed on independent test samples. The line denotes the mean RMSE across 10 random instances of simulated data with bars denoting standard errors. The two panels separate methods with sparse estimates of $\beta$ (right) from non-sparse methods (left). (Group Lasso is counted as a non-sparse method as it is not sparse within groups.)

3.1.3. Scalability

While the additional group-wise optimization comes at a computational cost, the variational approach runs inference in time complexity linear in the number of features $p$, samples $n$, and groups $K$. Only in the case of a multivariate variational distribution is the complexity quadratic in the larger of $n$ and $p$ and cubic in the smaller of the two. When varying the number of samples $n$, features $p$, and groups $K$, we observed run times comparable to those of Lasso (Figure 2). Differences were mainly observed in $p$: for larger $p$, graper required slightly longer times than Lasso. This difference was more pronounced when using a sparsity-promoting spike-and-slab prior, where additional parameters need to be inferred. As expected, the multivariate approach of graper became considerably slower for large $p$ and showed run times comparable to the sparse group Lasso. The number of groups mainly influenced the computation times of IPF-Lasso, which scales exponentially in the number of groups. Here, graper provided a by far more scalable approach (Figure 2, right panel).

Fig. 2. Average run time (in min) for different methods when varying the number of samples $n$, features $p$, and groups $K$. Each parameter is varied at a time while holding the others fixed to their default values. Shown are the average times across 50 random instances of simulated data with error bars denoting one standard error.

3.2. Application to data from high-throughput biology

3.2.1. Drug response prediction in leukemia samples

Next, we exemplify the method's performance on real data by considering an application to biological data, where predictors were obtained from different assays. Using the assay type as external covariate, we applied the method to integrate data from the different assays (also referred to as omic types) in a study on chronic lymphocytic leukemia (Dietrich and others, 2018). This study combined drug response measurements with molecular profiling, including gene expression and methylation. Briefly, we used normalized RNA-Seq expression values of the 5000 most variable genes, the DNA methylation M-values at the most variable CpG sites, as well as the ex-vivo cell viability after exposure to 61 drugs at five different concentrations as predictors for the response to a drug (Ibrutinib) that was not included in the set of predictors. The data were obtained from the Bioconductor package MOFAdata 1.0.0 (Argelaguet and others, 2018). In total, this resulted in a model with $n$ patient samples and $p$ predictors spanning the three omic types.

We first applied the different regression methods to the data on their original scale. Since the features have different scales (e.g., the drug responses vary from around 1 (neutral) to 0 (completely toxic), the normalized expression values from 0 to 20, and the methylation M-values from $-10$ to 8), this ensures that the omic type information is an informative covariate: it results in larger effect sizes for the drug response data and smaller effect sizes for the methylation and expression data compared to scaled predictors. In this setting, incorporating knowledge of the assay type into the penalized regression showed clear advantages in terms of prediction performance: the covariate-aware methods (GRridge, IPF-Lasso, and graper) all improved upon the covariate-agnostic Lasso, ridge regression, or elastic net (Figure 3a). Also the group Lasso methods, which incorporate the group information but apply a single penalty parameter, could not adapt to the scale differences. The hyper-parameters $\gamma_k$ inferred by graper highlighted the larger effect sizes of the drug response feature group, which was strongly favored by the penalization (Figure 3b).

Fig. 3. Application to the chronic lymphocytic leukemia data with scale differences between assays. (a) Comparison of the root mean squared error (RMSE) for the prediction of samples' viability after treatment with Ibrutinib. Performance was evaluated in a 10-fold cross-validation scheme; the points denote the individual RMSE for each fold. (b) Inferred hyper-parameters in the different folds for the three different omic types ($\gamma_k$ on the left and $\pi_k$ on the right).

To address differences in feature scale, a common choice made by many implementations (e.g., glmnet (Friedman and others, 2010)) is to scale all features to unit variance. Indeed, for the data at hand, this transformation was particularly beneficial for the covariate-agnostic methods, and their prediction performances became more similar to those of the covariate-aware methods. However, for dense methods such as ridge regression, the covariate information on the omic type remained important (Figure 4a). Sparse methods in general resulted in very good predictions, as the response to Ibrutinib can be well explained by a very sparse model containing only a few drugs with a related mode of action. By learning weights for each omic type, graper directly highlighted the importance of the drug data as predictors (Figure 4b).

Fig. 4. Application to the chronic lymphocytic leukemia data with standardized predictors. (a) Comparison of the root mean squared error (RMSE) for the prediction of samples' viability after treatment with Ibrutinib. Performance was evaluated in a 10-fold cross-validation scheme; the points denote the individual RMSE for each fold. (b) Inferred hyper-parameters by graper (sparse) in the different folds for the three different omic types ($\gamma_k$ on the left and $\pi_k$ on the right).

In general, standardization of all features is unlikely to be an optimal choice, since in many applications there is a relation between information content and amplitude. For example, we often measure informative high-amplitude signals alongside low-amplitude features that originate mainly from technical noise. Here, standardization can be harmful, as it would drown the informative high-amplitude features and "blow up" the noisy low-amplitude features (Figure S2 of the supplementary material available at Biostatistics online). In particular, standardization does not distinguish between meaningful differences in variance (e.g., features that differ between two disease groups) and differences in variance due to the scale. While removal of the latter would be desirable, meaningful differences should be retained. Hence, the question of whether to scale the predictors is related to the question of whether the variance of a feature is an informative covariate: if the variance contains important information on the relevance of a predictor, standardization should not be applied, or information on the predictors' variance should be re-included via the covariate, e.g., by binning features based on their variance. This has been shown to be beneficial in marginal testing applications, where filtering or weighting by variance increased the power to detect true positives (Bourgon and others, 2010; Ignatiadis and others, 2016). A recent study on RNA-Seq data in the context of penalized regression found no strong effect of standardization compared to no standardization (Zwiener and others, 2014). However, an example where standardization can be harmful in applications to genomic data is binary mutation data. Here, features are all on the same scale, and standardization would favor mutations with lower frequencies, which in most applications is not desirable (see also Section 3.1 of the supplementary material available at Biostatistics online).
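As a small illustration of the last point, re-including variance information as a grouping covariate can be as simple as quantile binning; the sketch below (our own, with arbitrary bin counts) constructs such a covariate $\xi$ from the feature variances instead of standardizing them away.

```python
# Sketch: turning feature variance into a grouping covariate by quantile binning.
import numpy as np

def variance_bins(X, n_bins=4):
    v = X.var(axis=0)
    edges = np.quantile(v, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(v, edges)               # xi_j in {0, ..., n_bins - 1}

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 1000)) * rng.uniform(0.1, 10.0, size=1000)
xi = variance_bins(X)
print(np.bincount(xi))                          # roughly equal-sized groups
```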

3.2.2. Age prediction from multi-tissue gene expression data

As a second example of a covariate in genomics, we considered the tissue type. Using data from the GTEx consortium (Lonsdale and others, 2013), we asked whether the tissue type is an informative covariate in the prediction of a person's age from gene expression. Briefly, using gene expression data provided by the Bioconductor package recount 1.7.6 (Collado-Torres and others, 2017), we chose five tissues that were available for the largest number of donors and from each tissue considered the top 50 principal components of the RNA-Seq data after normalization and variance stabilization using DESeq2 1.21.25 (Love and others, 2014). In total, this gave us 250 predictors from five tissues for each donor.

We observed a small advantage for methods that incorporate the tissue type as a covariate (Figure 5a): GRridge, IPF-Lasso, and graper all had a smaller prediction error compared to covariate-agnostic methods. In particular, graper resulted in comparable prediction performance to IPF-Lasso, whilst requiring less than a second for training compared to 40 min for IPF-Lasso. The learnt relative penalization strength and sparsity levels of graper can again provide insights into the relative importance of the different tissue types. In particular, we found lower penalization for blood vessel and muscle and higher penalization for blood and skin (Figure 5b). This is consistent with previous studies on a per-tissue basis, where gene expression in blood vessel has been found to be a good predictor for age, while blood was found to be less predictive (Yang and others, 2015).

Fig. 5. Application to the GTEx data. (a) Comparison of root mean squared error (RMSE) for the prediction of donor age (in years). Performance is evaluated in a 10-fold cross-validation scheme; the points denote the individual RMSE for each fold. (b) Inferred penalty parameters for the five tissues in graper in each fold.

4. Discussion

We propose a method that can use information from external covariates to guide penalization in regression tasks and that can provide a flexible and scalable alternative to approaches that were proposed recently (Wiel and others, 2016; Boulesteix and others, 2017). We illustrated in simulations and data from biological applications that if the covariate is informative of the effect sizes in the model, these approaches can improve upon commonly used penalized regression methods that are agnostic to such information. We investigated the use of important covariates in genomics such as omic type or tissue. The performance of our approach is in many cases comparable to the IPF-Lasso method (Boulesteix and others, 2017), while scalability is highly improved in terms of the number of feature groups, thereby extending the range of possible applications.

The variational inference framework provides improved scalability compared to Bayesian methods that are based on sampling strategies. Variational Bayes methods have already been employed in the setting of Bayesian regression with spike-and-slab priors (Carbonetto and Stephens, 2012; Carbonetto and others, 2017). However, these methods do not incorporate information from external covariates. A drawback of variational methods is the fact that they often result in overly concentrated approximations to the posterior distribution and thereby underestimate the posterior variance. Nevertheless, they have been shown to provide reasonable point estimates in regression tasks (Carbonetto and Stephens, 2012), which we focused on here. Due to the mean-field assumption, strong correlations between active predictors can lead to suboptimal results of graper. Here, the multivariate variational approximation suggested as an alternative above can be advantageous; however, it comes at the price of higher computational costs. What is not addressed in our current implementation is the common problem of missing values in the data; if present, they would need to be imputed beforehand.

Our approach is related to methods that adapt the penalty function in order to incorporate structural knowledge, such as the group Lasso (Yuan and Lin, 2006), sparse group Lasso (Friedman and others, 2010), or fused Lasso (Tibshirani and others, 2005). However, these approaches apply the same penalty parameter to all the different groups and perform hard inclusion or exclusion of groups instead of the softer weighting proposed here. Alternatively, the loss function can be modified to incorporate prior knowledge based on a known set of "high-confidence predictors," as proposed by Jiang and others (2016). The existence and identity of such "high-confidence predictors," however, is often not clear.

In contrast to frequentist regression methods, the Bayesian approach provides direct posterior inclusion probabilities for each feature that can be useful for model selection. To obtain frequentist guarantees on the selected features, it could be promising to combine the approach with recently developed methods for controlling the false discovery rate, such as knockoffs (Candes and others, 2018). For this, feature statistics can be constructed based on the estimated coefficients or inclusion probabilities from our model, as long as the knockoffs obtain the same covariate information as their true counterparts.

An interesting question that we have not addressed is the quest for rigorous criteria for when the inclusion of a covariate by differential penalization is advantageous. This question is not limited to the framework of penalized regression but affects the general setting of shrinkage estimation. While joint shrinkage of a set of estimates can be very powerful in producing more stable estimates with reduced variance, care needs to be taken as to which measurements to combine in such a shrinkage approach. As in the case of coefficients in the linear model setting, external covariates could be helpful for this decision and facilitate a more informed shrinkage. However, allowing for differential shrinkage reintroduces some degrees of freedom into the model and can only be advantageous if the covariate provides "sufficient" information to balance this. For future work, it would be of interest to find general conditions for when this is the case, thereby enabling an informed choice of covariates in practice.

We provide an open-source implementation of our method in the Bioconductor package graper. In addition, vignettes and scripts are made available that facilitate the comparison of graper with various related regression methods and can be used to reproduce all results contained in this work (https://git.embl.de/bvelten/graper_analyses).

5. Software

The method is implemented in the Bioconductor package graper, scripts for the analyses contained in this article can be found at https://git.embl.de/bvelten/graper_analyses.

Supplementary Material

kxz034_Supplementary_Data

Acknowledgments

We thank Bernd Klaus for providing useful comments on the manuscript. Conflict of Interest: None declared.

Funding

This work was supported by the European Union Horizon 2020 project SOUND [633974] and the EMBL International PhD Program.

References

1. Alyass, A., Turcotte, M. and Meyre, D. (2015). From big data analysis to personalized medicine for all: challenges and opportunities. BMC Medical Genomics 8, 33.
2. Andersen, M. R., Vehtari, A., Winther, O. and Hansen, L. K. (2017). Bayesian inference for spatio-temporal spike and slab priors. Journal of Machine Learning Research 18, 1–58.
3. Argelaguet, R., Velten, B., Arnol, D., Dietrich, S., Zenz, T., Marioni, J. C., Buettner, F., Huber, W. and Stegle, O. (2018). Multi-omics factor analysis—a framework for unsupervised integration of multi-omic data sets. Molecular Systems Biology 14, e8124.
4. Bergersen, L. C., Glad, I. K. and Lyng, H. (2011). Weighted Lasso with data integration. Statistical Applications in Genetics and Molecular Biology 10, 1–29.
5. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
6. Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112, 859–877.
7. Boulesteix, A.-L., De Bin, R., Jiang, X. and Fuchs, M. (2017). IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Computational and Mathematical Methods in Medicine 2017, 1–14.
8. Bourgon, R., Gentleman, R. and Huber, W. (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107, 9546–9551.
9. Candes, E., Fan, Y., Janson, L. and Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 551–577.
10. Carbonetto, P. and Stephens, M. (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis 7, 73–108.
11. Carbonetto, P., Zhou, X. and Stephens, M. (2017). varbvs: fast variable selection for large-scale regression. arXiv preprint arXiv:1709.06597.
12. Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009). Handling sparsity via the horseshoe. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5, 73–80.
13. Chen, R. and Snyder, M. (2013). Promise of personalized omics to precision medicine. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 5, 73–82.
14. Collado-Torres, L., Nellore, A., Kammers, K., Ellis, S. E., Taub, M. A., Hansen, K. D., Jaffe, A. E., Langmead, B. and Leek, J. T. (2017). Reproducible RNA-seq analysis using recount2. Nature Biotechnology 35, 319.
15. Dietrich, S., Oleś, M., Lu, J., Sellner, L., Anders, S., Velten, B., Wu, B., Hüllein, J., da Silva Liberio, M., Walther, T. and others. (2018). Drug-perturbation-based stratification of blood cancer. The Journal of Clinical Investigation 128, 427–445.
16. Dobriban, E., Fortney, K., Kim, S. K. and Owen, A. B. (2015). Optimal multiple testing under a Gaussian prior on the effect sizes. Biometrika 102, 753–766.
17. Engelhardt, B. E. and Adams, R. P. (2014). Bayesian structured sparsity from Gaussian fields. arXiv preprint arXiv:1407.2235.
18. Ferkingstad, E., Frigessi, A., Rue, H., Thorleifsson, G. and Kong, A. (2008). Unsupervised empirical Bayesian multiple testing with external covariates. The Annals of Applied Statistics 2, 714–735.
19. Friedman, J., Hastie, T. and Tibshirani, R. (2010a). A note on the group Lasso and a sparse group Lasso. arXiv preprint arXiv:1001.0736.
20. Friedman, J., Hastie, T. and Tibshirani, R. (2010b). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1.
21. Hamburg, M. A. and Collins, F. S. (2010). The path to personalized medicine. The New England Journal of Medicine 2010, 301–304.
22. Hasin, Y., Seldin, M. and Lusis, A. (2017). Multi-omics approaches to disease. Genome Biology 18, 83.
23. Hernández-Lobato, D., Hernández-Lobato, J. M. and Dupont, P. (2013). Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. The Journal of Machine Learning Research 14, 1891–1945.
24. Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
25. Ignatiadis, N., Klaus, B., Zaugg, J. and Huber, W. (2016). Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods 13, 577.
26. Jaakkola, T. S. and Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics and Computing 10, 25–37.
27. Jiang, Y., He, Y. and Zhang, H. (2016). Variable selection with prior information for generalized linear models via the prior LASSO method. Journal of the American Statistical Association 111, 355–376.
28. Lei, L. and Fithian, W. (2018). AdaPT: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 649–679.
29. Li, A. and Barber, R. F. (2019). Multiple testing with the structure-adaptive Benjamini–Hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81, 45–74.
30. Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters, G., Garcia, F., Young, N. and others. (2013). The genotype-tissue expression (GTEx) project. Nature Genetics 45, 580–585.
31. Love, M. I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550.
32. MacKay, D. J. C. (1996). Bayesian methods for backpropagation networks. In: Domany, E. and others (editors), Models of Neural Networks III. New York, NY: Springer, pp. 211–254.
33. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association 83, 1023–1032.
34. Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association 103, 681–686.
35. Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A. and Kim, D. (2015). Methods of integrating data to uncover genotype-phenotype interactions. Nature Reviews Genetics 16, 85.
36. Rockova, V. and Lesaffre, E. (2014). Incorporating grouping information in Bayesian variable selection with applications in genomics. Bayesian Analysis 9, 221–258.
37. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58, 267–288.
38. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 91–108.
39. Titsias, M. K. and Lázaro-Gredilla, M. (2011). Spike and slab variational inference for multi-task and multiple kernel learning. In: Shawe-Taylor, J. and others (editors), Advances in Neural Information Processing Systems. New York: Curran Associates, pp. 2339–2347.
40. Veríssimo, A., Oliveira, A. L., Sagot, M.-F. and Vinga, S. (2016). DegreeCox: a network-based regularization method for survival analysis. BMC Bioinformatics 17, 449.
41. Wiel, M. A., Lien, T. G., Verlaat, W., Wieringen, W. N. and Wilting, S. M. (2016). Better prediction by use of co-data: adaptive group-regularized ridge regression. Statistics in Medicine 35, 368–381.
42. Wu, A., Park, M., Koyejo, O. O. and Pillow, J. W. (2014). Sparse Bayesian structure learning with dependent relevance determination priors. In: Ghahramani, Z. and others (editors), Advances in Neural Information Processing Systems. New York: Curran Associates, pp. 1628–1636.
43. Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group Lasso. Bayesian Analysis 10, 909–936.
44. Yang, J., Huang, T., Petralia, F., Long, Q., Zhang, B., Argmann, C., Zhao, Y., Mobbs, C. V., Schadt, E. E., Zhu, J. and others. (2015). Synchronized age-related gene expression changes across multiple tissues in human and the link to complex diseases. Scientific Reports 5, 15145.
45. Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67.
46. Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
47. Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.
48. Zwiener, I., Frisch, B. and Binder, H. (2014). Transforming RNA-Seq data to improve the performance of prognostic gene signatures. PLoS One 9, e85150.
