An R package VIGoR for joint estimation of multiple linear learners with variational Bayesian inference

Akio Onogi; Aisaku Arakawa

Bioinformatics. 2022 May 16;38(12):3306–3309. doi: 10.1093/bioinformatics/btac328

Editor: Russell Schwartz

PMCID: PMC9191213 PMID: 35575313

Abstract

Summary

An R package that can implement multiple linear learners, including penalized regression and regression with spike and slab priors, in a single model has been developed. Solutions are obtained with fast minorize-maximization algorithms in the framework of variational Bayesian inference. This package helps to incorporate multimodal and high-dimensional explanatory variables in a single regression model.

Availability and implementation

The R package VIGoR (Variational Bayesian Inference for Genome-wide Regression) is available at the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/) and at GitHub (https://github.com/Onogi/VIGoR).

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

In modern biology, multi-omics data are often available for many samples (Hasin et al., 2017; Kim and Tagkopoulos, 2018), and environmental data can be recorded at fine scales (Gupta and Quan, 2018; Munandar et al., 2017). It is therefore often of interest to discover important variables in such high-dimensional and multimodal data (Wang et al., 2017) or to use them to predict phenotypes (Riedelsheimer et al., 2012). Modeling such variables in realistic time requires flexible and fast regression tools. High-dimensional variables can be included in regression models using penalties. Although various penalized regressions have been proposed (e.g. Tibshirani, 1996; Zou and Hastie, 2005), implementations of these methods usually assume unimodal explanatory variables (e.g. glmnet; Friedman et al., 2010). An extension to multiple learners is feasible in a Bayesian framework because each learner can be treated as a module. Indeed, the popular R package BGLR (Pérez and de los Campos, 2014) allows multimodal explanatory variables to be modeled using different Bayesian regression methods. However, calculation time can become an issue because the package depends on Markov chain Monte Carlo (MCMC) algorithms.

Here, we provide an R package, VIGoR (Variational Bayesian Inference for Genome-wide Regression), which can incorporate multimodal explanatory variables using different regression methods. Solutions are obtained with variational inference, which is more time-efficient than MCMC. The package was initially developed to provide variational Bayesian inference for linear regressions (Onogi and Iwata, 2016) and has been updated to incorporate multiple learners. The updates are summarized in Supplementary Methods. VIGoR implements multiple regression methods, including Bayesian lasso (BL) (Park and Casella, 2008), extended Bayesian lasso (EBL) (Mutshinda and Sillanpää, 2010), the Bayesian alphabet methods (BayesA, BayesB and BayesC), Bayesian ridge regression (BRR) and best linear unbiased prediction (BLUP). 'Bayesian alphabet' is jargon from quantitative genetics: BayesB and BayesC are regressions with spike and slab priors (Habier et al., 2011; Meuwissen et al., 2001), and BayesA uses t-distributions as the prior distributions of the regression coefficients (Meuwissen et al., 2001). These regression methods were selected because they have different properties. BayesB, BayesC and EBL are suitable for variable selection, whereas BRR and BLUP are suitable for problems in which many variables affect the response variable. BL and BayesA tend to show intermediate properties. See Supplementary Methods for the details of these methods. Users can add these learners simultaneously to a single model as

$$y_{i} = \sum_{m=1}^{M} f_{m}(x_{m,i}, θ_{m}) + ε_{i},$$

where $y_{i}$ is the response variable for sample $i$, $M$ is the number of learners in the model, $f_{m}$ denotes the $m$th learner, $x_{m,i}$ is the vector of explanatory variables for learner $m$ and sample $i$, $θ_{m}$ is the vector of parameters of learner $m$ and $ε_{i}$ is the residual. The theoretical background and the variational Bayesian algorithms are described in the PDF manual of the package, available at https://github.com/Onogi/VIGoR. Notably, multiple learners in a model can share the same explanatory variables (i.e. $x_{m,i} = x_{n,i}$ for some $m, n \in \{1, \dots, M\}$ with $m \neq n$). Such a model would behave like ensemble learning, which is expected to be robust against hyperparameter specification and data architecture (Knürr et al., 2013; Onogi et al., 2015).
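The additive structure of the model can be illustrated with a minimal sketch. It is written in Python for illustration only and does not use the VIGoR API. The names `linear_learner` and `fitted_value` and all numeric values are hypothetical, and each learner is assumed to be a plain linear predictor, so the sketch covers only the summation over learners and not the priors or the variational updates.

```python
from typing import Callable, List, Sequence

# A learner maps (explanatory variables, parameters) to a scalar contribution.
Learner = Callable[[Sequence[float], Sequence[float]], float]


def linear_learner(x: Sequence[float], theta: Sequence[float]) -> float:
    """Hypothetical learner f_m(x, theta) = sum_j x_j * theta_j."""
    if len(x) != len(theta):
        raise ValueError("x and theta must have the same length")
    return sum(xj * tj for xj, tj in zip(x, theta))


def fitted_value(
    learners: Sequence[Learner],
    xs: Sequence[Sequence[float]],
    thetas: Sequence[Sequence[float]],
) -> float:
    """Sum f_m(x_{m,i}, theta_m) over the M learners for one sample i."""
    if not (len(learners) == len(xs) == len(thetas)):
        raise ValueError("need one x and one theta per learner")
    return sum(f(x, th) for f, x, th in zip(learners, xs, thetas))


# Toy sample with M = 2 learners, each with its own type of variables.
x1: List[float] = [1.0, 0.0, 2.0]      # variables for learner 1
x2: List[float] = [2.0, -1.0]          # variables for learner 2
theta1: List[float] = [0.5, 0.0, 0.25]
theta2: List[float] = [1.0, 0.5]

y_hat = fitted_value([linear_learner, linear_learner], [x1, x2], [theta1, theta2])

# Two learners may also share the same explanatory variables (x_{1,i} = x_{2,i}).
theta2_shared: List[float] = [0.0, 0.3, 0.0]
y_hat_shared = fitted_value(
    [linear_learner, linear_learner], [x1, x1], [theta1, theta2_shared]
)

# The residual for an observed response y_i is y_i minus the fitted value.
y_obs = 3.0
residual = y_obs - y_hat
```

In the shared-variable case, the two parameter vectors act on the same inputs, which is what allows learners with different priors to be combined in an ensemble-like way.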

2 Experiments

Two simulation experiments (Exp. 1 and 2) were performed. The simulated data in Exp. 1 and 2 were bimodal and tetra-modal, respectively. The total number of explanatory variables was 2000, 6000, 10 000, 20 000, 60 000 or 100 000. Each type (mode) of variables was included in equal numbers and explained the same amount of the variance of the response variable. The numbers of training and testing samples were both 1000. In Exp. 1, both types of explanatory variables were generated from the standard normal distribution, and 1% and 0.1% of the variables of the two types, respectively, were assumed to have non-zero effects drawn from the standard normal distribution. BayesC with different shrinkage magnitudes was applied to both types of variables. In Exp. 2, one type of explanatory variable was assumed to be SNPs, and genotypes were generated assuming an allele frequency of 0.5. The other three types of variables were generated from the standard normal distribution. All SNPs were assumed to have non-zero effects, whereas 0.1% of the variables of each of the other types had non-zero effects. Bayesian ridge regression was applied to the SNPs, and BayesB was applied to the other types of variables.

BayesB and BayesC as implemented in VIGoR and BGLR differ in how they handle the inclusion probability of explanatory variables. The probability is fixed to a predefined value in VIGoR, whereas in BGLR it is inferred with a beta prior distribution, $\mathrm{Beta}(π_{0}, p_{0})$. Here, $π_{0}$ is the expectation and $p_{0}$ defines the variance as $π_{0} (1 - π_{0}) / (p_{0} + 1)$. For a fair comparison, $p_{0}$ was set to 1e+6 to prevent fluctuation of the probability during sampling. The predefined value of VIGoR and $π_{0}$ of BGLR were set to the true values of the simulation.

VIGoR and BGLR were compared in terms of calculation time, area under the curve (AUC) and prediction accuracy for testing samples. For SNPs in Exp. 2, the Pearson correlation between the true and estimated effects was used instead of AUC. Further details of these experiments are given in Supplementary Methods.
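To see why a large $p_{0}$ effectively fixes the inclusion probability, the variance formula above can be evaluated directly. The following Python sketch is illustrative and does not call BGLR. The function names are hypothetical, and the shape parameters $a = p_{0} π_{0}$ and $b = p_{0} (1 - π_{0})$ are derived here from the stated mean and variance rather than taken from the BGLR documentation.

```python
import math


def beta_prior_variance(pi0: float, p0: float) -> float:
    """Variance pi0 * (1 - pi0) / (p0 + 1) as stated in the text."""
    return pi0 * (1.0 - pi0) / (p0 + 1.0)


def beta_shapes(pi0: float, p0: float) -> tuple:
    """Shape parameters (a, b) with mean pi0 and a + b = p0 (derived assumption)."""
    return p0 * pi0, p0 * (1.0 - pi0)


def beta_variance_from_shapes(a: float, b: float) -> float:
    """Standard variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1.0))


pi0 = 0.01  # e.g. 1% of variables with non-zero effects

# Setting used in the experiments: p0 = 1e+6 gives a nearly degenerate prior.
var_tight = beta_prior_variance(pi0, 1e6)
sd_tight = math.sqrt(var_tight)   # on the order of 1e-4

# Hypothetical weak prior for contrast: the probability could move freely.
var_loose = beta_prior_variance(pi0, 10.0)
sd_loose = math.sqrt(var_loose)   # on the order of 1e-2

# The two parameterizations agree.
a, b = beta_shapes(pi0, 1e6)
var_check = beta_variance_from_shapes(a, b)
```

With $p_{0}$ = 1e+6 the prior standard deviation is about two orders of magnitude smaller than $π_{0}$ itself, so the sampled inclusion probability stays close to the value that VIGoR holds fixed.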

Calculation time depends on the convergence criterion. For VIGoR, three convergence criteria were compared. The iterative update of VIGoR was stopped when $‖θ^{*} - θ‖^{2} / ‖θ^{*}‖^{2} < t$, where $‖\cdot‖$ is the Euclidean norm, $θ$ is the vector containing all parameter values at the previous iteration, $θ^{*}$ is the vector of newly updated parameter values at the current iteration and $t$ is the convergence criterion, which was set to 1e−4 (loose), 1e−5 (moderate) or 1e−6 (strict). Because convergence of MCMC is difficult to verify, in particular when many parameters are involved, three chain lengths were compared: (i) nIte = 1500, burnIn = 500 and thin = 5 (the default setting of BGLR), (ii) nIte = 15 000, burnIn = 5000 and thin = 10 and (iii) nIte = 30 000, burnIn = 20 000 and thin = 10. Here, nIte, burnIn and thin denote the total number of iterations, the length of the burn-in and the sampling interval, respectively.
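The stopping rule can be sketched as follows. This is a minimal Python illustration rather than the VIGoR implementation: `toy_update` is a hypothetical fixed-point map that stands in for one sweep of variational updates, and only the relative squared change and its comparison with $t$ follow the text.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def relative_squared_change(
    theta_new: Sequence[float], theta_old: Sequence[float]
) -> float:
    """Return ||theta_new - theta_old||^2 / ||theta_new||^2."""
    if len(theta_new) != len(theta_old):
        raise ValueError("parameter vectors must have the same length")
    num = sum((a - b) ** 2 for a, b in zip(theta_new, theta_old))
    den = sum(a * a for a in theta_new)
    if den == 0.0:
        raise ValueError("squared norm of the updated vector is zero")
    return num / den


def iterate_until_converged(
    update: Callable[[List[float]], List[float]],
    theta0: Sequence[float],
    t: float,
    max_iter: int = 1000,
) -> Tuple[List[float], int]:
    """Apply `update` until the relative squared change drops below t."""
    theta = list(theta0)
    for n_iter in range(1, max_iter + 1):
        theta_new = update(theta)
        if relative_squared_change(theta_new, theta) < t:
            return theta_new, n_iter
        theta = theta_new
    raise RuntimeError("no convergence within max_iter iterations")


def toy_update(theta: List[float]) -> List[float]:
    """Hypothetical contraction with fixed point 2.0 in every coordinate."""
    return [0.5 * v + 1.0 for v in theta]


# Number of iterations needed under the three criteria from the text.
criteria = (("loose", 1e-4), ("moderate", 1e-5), ("strict", 1e-6))
iterations: Dict[str, int] = {
    name: iterate_until_converged(toy_update, [0.0, 0.0], t)[1]
    for name, t in criteria
}
```

Because the criterion is relative to the norm of the updated vector, it does not depend on the scale of the parameters, and a stricter $t$ can only increase the number of iterations.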

In Exp. 1, the AUC and prediction accuracy of VIGoR were almost the same among the convergence criteria (Fig. 1A). The AUC and prediction accuracy of BGLR improved as the chain length increased, but remained inferior to those of VIGoR despite the longer calculation time. In Exp. 2, VIGoR generally showed the best performance when t = 1e−6 (Fig. 1B). Calculation time under this strict criterion was still shorter than that of the shortest BGLR chain (nIte = 1500). Complete results with standard deviations are presented in Supplementary Figures S1 and S2.

In Exp. 3, we compared VIGoR with BGLR using a real soybean dataset (Onogi et al., 2021). Days to flowering of a major variety (ID V083) evaluated from 2005 to 2015 (N = 213) were predicted from records of the variety from 1967 to 2004 (N = 838). As explanatory variables, daily mean temperature, photoperiod and precipitation from the sowing date (1st day) to the 246th day were used. First-order interactions among these variables were also considered, resulting in 246 × 3 + 246 × 246 × 3 = 182 286 explanatory variables. These additive and interaction effects were modeled using BayesC with different shrinkage magnitudes. Specifically, 20% of the additive variables and 0.1% of the interaction variables were assumed to have non-zero effects. $p_{0}$ of BGLR was set to 1e+6. The convergence criteria of VIGoR and the chain lengths of BGLR followed those used in Exp. 1 and 2. The prediction accuracy of VIGoR was similar among the convergence criteria (Fig. 1C) and superior to that of BGLR at any chain length. The calculation time of VIGoR increased as the criterion became stricter, but was still shorter than that of the longest BGLR chain, which did not achieve accuracy equivalent to that of VIGoR.
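The variable count in Exp. 3 can be checked with a short calculation. The Python sketch below is illustrative only. It assumes that the factor 3 in the interaction term corresponds to the three unordered pairs of environmental factors, each crossed over all 246 × 246 day combinations. This reading is consistent with the stated total but is not spelled out in the text, and the names used are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Tuple

FACTORS = ("temperature", "photoperiod", "precipitation")
N_DAYS = 246  # from the sowing date (day 1) to day 246


def count_variables(factors: Sequence[str], n_days: int) -> Tuple[int, int]:
    """Return (number of additive variables, number of interaction variables)."""
    # One additive variable per factor and day.
    additive = len(factors) * n_days
    # Assumption: one interaction variable per pair of distinct factors
    # and per (day, day) combination.
    factor_pairs = list(combinations(factors, 2))
    interaction = len(factor_pairs) * n_days * n_days
    return additive, interaction


additive, interaction = count_variables(FACTORS, N_DAYS)
total = additive + interaction

# Expected numbers of non-zero effects implied by the inclusion probabilities
# used for BayesC (20% for additive, 0.1% for interaction variables).
expected_nonzero_additive = 0.20 * additive
expected_nonzero_interaction = 0.001 * interaction
```

Under these settings the interaction variables outnumber the additive ones by a factor of 246, which is why a much smaller inclusion probability is assigned to them.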

3 Conclusions

VIGoR offers fast and accurate solutions for linear models that incorporate multiple learners. The package enables users to model multimodal and high-dimensional explanatory variables, and will help them discover important variables or predict phenotypes. Although the current version is only applicable to quantitative response variables, future updates will allow application to categorical or censored data.

Funding

This work was supported by Ryukoku University.

Conflict of Interest: none declared.

Data availability

R scripts to reproduce the experiment results and pdf manual of the package are provided at https://github.com/Onogi/VIGoR.

Contributor Information

Akio Onogi, Department of Plant Life Science, Faculty of Agriculture, Ryukoku University, Otsu, Shiga 520-2194, Japan.

Aisaku Arakawa, Division of Animal Breeding and Reproduction Research, Institute of Livestock and Grassland Science, National Agriculture and Food Research Organization, Tsukuba, Ibaraki 305-0901, Japan.

References

Friedman J. et al. (2010) Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33, 1–22.
Gupta G.S., Quan V.M. (2018) Multi-sensor integrated system for wireless monitoring of greenhouse environment. In: 2018 IEEE Sensors Applications Symposium (SAS), Seoul, pp. 1–6.
Habier D. et al. (2011) Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics, 12, 186.
Hasin Y. et al. (2017) Multi-omics approaches to disease. Genome Biol., 18, 83.
Kim M., Tagkopoulos I. (2018) Data integration and predictive modeling methods for multi-omics datasets. Mol. Omics, 14, 8–25.
Knürr T. et al. (2013) Impact of prior specifications in a shrinkage-inducing Bayesian model for quantitative trait mapping and genomic prediction. Genet. Sel. Evol., 45, 1–16.
Meuwissen T.H. et al. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 1819–1829.
Munandar A. et al. (2017) Design of real-time weather monitoring system based on mobile application using automatic weather station. In: 2017 2nd International Conference on Automation, Cognitive Science, Optics, Micro Electro-Mechanical System, and Information Technology (ICACOMIT), Jakarta, pp. 44–47.
Mutshinda C.M., Sillanpää M.J. (2010) Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics, 186, 1067–1075.
Onogi A., Iwata H. (2016) VIGoR: variational Bayesian inference for genome-wide regression. J. Open Res. Softw., 4, e11.
Onogi A. et al. (2015) Exploring the areas of applicability of whole-genome prediction methods for Asian rice (Oryza sativa L.). Theor. Appl. Genet., 128, 41–53.
Onogi A. et al. (2021) A method for identifying environmental stimuli and genes responsible for genotype-by-environment interactions from a large-scale multi-environment data set. Front. Genet., 12, 803636.
Park T., Casella G. (2008) The Bayesian lasso. J. Am. Stat. Assoc., 103, 681–686.
Pérez P., de los Campos G. (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics, 198, 483–495.
Riedelsheimer C. et al. (2012) Genomic and metabolic prediction of complex heterotic traits in hybrid maize. Nat. Genet., 44, 217–220.
Sing T. et al. (2005) ROCR: visualizing classifier performance in R. Bioinformatics, 21, 3940–3941.
Tibshirani R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58, 267–288.
Wang C. et al. (2017) Metabolome-wide association study identified the association between a circulating polyunsaturated fatty acid variant rs174548 and lung cancer. Carcinogenesis, 38, 1147–1154.
Zou H., Hastie T. (2005) Regularization and variable selection via the elastic net. J. R. Stat. Soc. B, 67, 301–320.
