Author manuscript; available in PMC: 2023 Jul 31.
Published in final edited form as: Stat Med. 2022 Jun 5;41(20):3899–3914. doi: 10.1002/sim.9483

Spike-and-slab least absolute shrinkage and selection operator generalized additive models and scalable algorithms for high-dimensional data analysis

Boyi Guo 1, Byron C Jaeger 2, A K M Fazlur Rahman 1, D Leann Long 1, Nengjun Yi 1
PMCID: PMC10390213  NIHMSID: NIHMS1916760  PMID: 35665524

Abstract

Several proposals extend classical generalized additive models (GAMs) to accommodate high-dimensional data ($p \gg n$) using group sparse regularization. However, the sparse regularization may induce excess shrinkage when estimating smooth functions, damaging predictive performance. Moreover, most of these GAMs take an “all-in-all-out” approach to functional selection, which makes it difficult to determine whether nonlinear effects are necessary. While some Bayesian models can address these shortcomings, using Markov chain Monte Carlo algorithms for model fitting creates a new challenge: scalability. Hence, we propose Bayesian hierarchical generalized additive models as a solution: we consider the smoothing penalty for proper shrinkage of curve interpolation via reparameterization. A novel two-part spike-and-slab LASSO prior for smooth functions is developed to address the sparsity of signals while providing extra flexibility to select the linear or nonlinear components of smooth functions. A scalable and deterministic algorithm, EM-Coordinate Descent, is implemented in an open-source R package BHAM. Simulation studies and metabolomics data analyses demonstrate improved predictive and computational performance against state-of-the-art models. Functional selection performance suggests trade-offs exist regarding the effect hierarchy assumption.

Keywords: EM-coordinate descent, generalized additive models, high-dimensional data, predictive modeling, scalability, spike-and-slab priors

1 |. INTRODUCTION

Much modern biomedical research, for example, sequencing data analysis and electronic health record data analysis, requires special treatment of high dimensionality, commonly known as the $p \gg n$ problem. There is extensive literature on high-dimensional linear models via penalized models or Bayesian hierarchical models; see Mallick and Yi1 for a review. These models are built upon a restrictive and unrealistic assumption, linearity. In classical statistical modeling, many strategies and models have been proposed to relax the linearity assumption with various degrees of complexity. For example, variable categorization is a simple and common practice in epidemiology but suffers from power and interpretation issues. More complex models to address nonlinear effects include random forest and other so-called “black box” models.2 These models are useful for statistical prediction but do not estimate parameters relevant to the data generation process from which one can draw inferences. In addition, how to generalize these “black box” models to the high-dimensional setting remains unclear.

For their straightforward interpretation and flexibility, nonparametric regression models serve as great alternatives to the “black-box” models in prediction and variable selection. Among those, generalized additive models (GAMs), proposed in the seminal work of Hastie and Tibshirani,3 grew to be one of the most popular modeling tools. In a GAM, the response variable, which is assumed to follow some exponential family distribution, can be modeled with the summation of smooth functions. Nevertheless, the classical GAMs cannot fulfill the increasing analytic demands for high-dimensional data analysis.

There exist some proposals to generalize the classical GAM to accommodate high-dimensional applications. The regularized models, branching out from group regularized linear models, are used to fit GAMs by accounting for the structure introduced when expanding smooth functions. Ravikumar et al4 extended the group LASSO5 to additive models (AMs); Huang et al6 further developed adaptive group LASSO for additive models; Wang et al7 and Xue8 respectively applied group SCAD penalty9 to additive models. Bayesian hierarchical models are also used in the context of high-dimensional additive models, particularly within the spike-and-slab literature. Various group spike-and-slab priors10,11 combined with computationally intensive Markov chain Monte Carlo (MCMC) algorithms have been proposed, where the application to AMs is treated as a special case. Bai and co-authors12 were the first to apply a group spike-and-slab LASSO prior to Gaussian AMs using a fast optimization algorithm and further generalized the framework to GAMs.13 Because they focus on addressing sparsity, these methods can excessively penalize the bases of a smooth function and produce inaccurate predictions, particularly when complex signals are assumed and large numbers of knots are used.14 In addition, these methods adopt an “all-in-all-out” strategy, that is, either including or excluding the variable completely, leaving no room for bi-level selection. Scheipl et al15 proposed a spike-and-slab structure prior that addresses the bi-level selection. However, the model fitting relies on computationally intensive MCMC algorithms and creates scalability concerns. Developing a fast, flexible and accurate generalized additive model framework would be of special interest.

We propose a novel Bayesian hierarchical generalized additive model (BHAM) for outcome prediction in the context of high-dimensional data analysis. Specifically, we incorporate smoothing penalties, derived from the smoothing spline literature,16 via reparameterization of smooth functions to avoid excessive shrinkage on the bases. Smoothing penalties were also previously used in the spike-and-slab GAM15 and the sparsity-smoothness penalty.17 We then impose a new two-part spike-and-slab LASSO prior to address the signal sparsity. In addition, a scalable optimization-based algorithm, EM-coordinate descent (EM-CD) algorithm, is developed. While the primary focus of this model is to improve prediction, the proposed model also provides utility in functional selection. Notably, the two-part prior that follows the effect hierarchy principle motivates a bi-level selection, rendering one of three possibilities for each predictor: no effect, only linear effect, or linear and nonlinear effects. The proposed model is implemented in a publicly available R package BHAM via https://github.com/boyiguo1/BHAM.

The proposed framework, BHAM, differs from previous spike-and-slab based GAMs, that is, the spike-and-slab GAM15 and the sparse Bayesian GAM (SB-GAM),13 in three ways. Firstly, the proposed prior for smooth functions is a spike-and-slab LASSO type prior using independent mixture double exponential distribution, compared to spike-and-slab GAM that uses normal-mixture-of-inverse gamma prior. Spike-and-slab LASSO priors provide computational convenience during model fitting by using optimization algorithms instead of intensive sampling algorithms. They make fitting high-dimensional models more feasible without sacrificing performance in prediction and variable selection. Secondly, SB-GAM uses a group spike-and-slab LASSO prior with an EM-CD algorithm to fit the model. While both methods use the combination of expectation maximization algorithm and coordinate descent algorithm, there are subtle differences in the implementation due to the difference in prior specification. The proposed model sets up independent priors among basis coefficients after the reparameterization step, which provides some advantage in computation. Lastly, the proposed model addresses the incapability of bi-level selection in SB-GAM.

In Section 2, we establish the Bayesian hierarchical generalized additive model, introduce the proposed spike-and-slab spline priors, and describe the fast-fitting EM-CD algorithm. In Section 3, we compare the proposed framework to state-of-the-art models via Monte Carlo simulation studies. Analyses of two metabolomics datasets are presented in Section 4. Conclusion and discussions are given in Section 5.

2 |. BAYESIAN HIERARCHICAL ADDITIVE MODELS (BHAM)

We assume the response variable, $Y$, follows an exponential family distribution with density function $f(y)$, mean $\mu$ and dispersion parameter $\phi$. The mean of the response variable can be modeled as the summation of smooth functions, $B_j(\cdot)$, $j = 1, \dots, p$, of a given $p$-dimensional vector of predictors $\mathbf{x}$, written as

$$E(Y \mid \mathbf{x}) = g^{-1}\Big(\beta_0 + \sum_{j=1}^{p} B_j(x_j)\Big) = g^{-1}\Big(\beta_0 + \sum_{j=1}^{p} \boldsymbol{\beta}_j^T \mathbf{X}_j\Big), \tag{1}$$

where $g^{-1}(\cdot)$ is the inverse of a monotonic link function. Given $n$ data points $\{y_i, \mathbf{x}_i\}_{i=1}^{n}$, the data distribution is expressed as

$$f(\mathbf{Y} = \mathbf{y} \mid \boldsymbol{\beta}, \phi) = \prod_{i=1}^{n} f(Y = y_i \mid \boldsymbol{\beta}, \phi).$$

The basis function matrix, that is, the design matrix derived from the smooth function $B_j(x_j)$, is denoted $\mathbf{X}_j$ for the variable $x_j$. The dimension of the design matrix depends on the choice of the smooth function, and is denoted as $K_j$ for $x_j$. $\boldsymbol{\beta}_j$ denotes the basis function coefficients for the $j$th variable such that $B_j(x_j) = \boldsymbol{\beta}_j^T \mathbf{X}_j$. With slight abuse of notation, we denote vectors and matrices in bold font ($\boldsymbol{\beta}, \mathbf{X}$) with conformable dimensions, whereas scalars and random variables are denoted in regular font ($\beta, X$). The matrix transpose is denoted with a superscript $T$. Note that the proposed model can include parametric forms of variables, and hence includes general linear models and semiparametric regression models as special cases.

2.1 |. Smooth function reparameterization

To encourage proper smoothing of each additive function, we adopt the smoothing penalty from smoothing spline models.16 A smoothing penalty is the quadratic norm of the basis coefficients and allows different shrinkage on different bases, mathematically

$$\mathrm{pen}[B_j(x)] = \lambda_j \int B_j''(x)^2 \, dx = \lambda_j \boldsymbol{\beta}_j^T \mathbf{S}_j \boldsymbol{\beta}_j,$$

where $\mathbf{S}_j$ is a known smoothing penalty matrix and $\lambda_j$ denotes a smoothing parameter. A linear function can be modeled as $B_j(x_j) = x_j$ with the smoothing penalty matrix $\mathbf{S}_j = [0]$. Previous regularized methods either ignore the smoothing penalty completely or restrain it as a component of the sparse penalty, which leads to a more restrictive solution. In contrast, we consider an additional mechanism, paired with the proposed prior (described in Section 2.2), to address the smoothness and sparsity in signals such that the locally adaptive nature of the smoothing penalty is retained.

Marra and Wood18 proposed a reparameterization procedure to factor the smoothing penalty into the design matrix of each smooth function. Given that the smoothing penalty matrix $\mathbf{S}_j$ is symmetric and positive semi-definite for univariate smooth functions, we eigendecompose the penalty matrix $\mathbf{S} = \mathbf{U}\mathbf{D}\mathbf{U}^T$, where the matrix $\mathbf{D}$ is diagonal with the eigenvalues arranged in ascending order. Note that $\mathbf{D}$ can contain zeros on the diagonal, which are associated with the linear space of the smooth function. For the most popular smooth function, cubic splines, the dimension of the linear space is one. Hereafter, we focus on a uni-dimensional linear space for simplicity; however, the approach generalizes easily to cases where the linear space is multidimensional. We further write the orthonormal matrix $\mathbf{U} \equiv [\mathbf{U}^0 : \mathbf{U}^*]$ containing the eigenvectors as columns in the order corresponding to $\mathbf{D}$. That is, $\mathbf{U}^0$ contains the eigenvectors with zero eigenvalues, spanning the linear space, and $\mathbf{U}^*$ contains the eigenvectors (as columns) for the nonzero eigenvalues, spanning the nonlinear space. We multiply the basis function matrix $\mathbf{X}$ by the orthonormal matrix $\mathbf{U}$ to obtain the new design matrix $\mathbf{X}^{repa} = \mathbf{X}\mathbf{U} \equiv [\mathbf{X}^0 : \mathbf{X}^*]$. An additional scaling step is imposed on $\mathbf{X}^*$ by the nonzero eigenvalues of $\mathbf{D}$ such that the new basis function matrix $\mathbf{X}^*$ can receive a uniform penalty on each of its dimensions. With slight abuse of notation, we drop the superscript $repa$ and denote $\mathbf{X}_j \equiv [X_j^0 : \mathbf{X}_j^*]$ as the basis function matrix for the $j$th variable after the reparameterization. A spline function can be expressed in the matrix form

$$B_j(x_j) = B_j^0(x_j) + B_j^*(x_j) = \beta_j X_j^0 + \boldsymbol{\beta}_j^{*T} \mathbf{X}_j^*,$$

and the generalized additive model in Equation (1) now is

$$E(Y \mid \mathbf{x}) = g^{-1}\Big(\beta_0 + \sum_{j=1}^{p} B_j(x_j)\Big) = g^{-1}\Big(\beta_0 + \sum_{j=1}^{p} \boldsymbol{\beta}_j^T \mathbf{X}_j\Big) = g^{-1}\Big[\beta_0 + \sum_{j=1}^{p} \big(\beta_j X_j^0 + \boldsymbol{\beta}_j^{*T} \mathbf{X}_j^*\big)\Big], \tag{2}$$

where the coefficient vector $\boldsymbol{\beta}_j \equiv [\beta_j : \boldsymbol{\beta}_j^*]$ is an augmentation of the coefficient scalar $\beta_j$ of the linear space and the coefficient vector $\boldsymbol{\beta}_j^*$ of the nonlinear space.

To summarize, the reparameterization step provides three benefits. Firstly, the reparameterization integrates the smoothing penalty matrix into the design matrix, and encourages models to properly smooth the nonlinear function when a sparsity penalty exists. Secondly, the eigendecomposition of the smoothing penalty matrix allows the isolation of the linear space from the nonlinear space, improving the feasibility of bi-level functional selection. Lastly, the eigendecomposition facilitates the construction of an orthonormal design matrix, which makes imposing independent priors on the coefficients possible. This reduces the computational complexity compared to using multivariate priors, and improves the generalizability of the framework to be compatible with other choices of priors.
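The claim that the rescaled nonlinear basis receives a uniform penalty can be checked with a short derivation. This is a sketch under the assumption that the scaling step multiplies each column of the nonlinear basis by the inverse square root of its eigenvalue; the symbols $\mathbf{D}_+$ (the diagonal block of nonzero eigenvalues), $b^0$ and $\mathbf{b}^*$ are introduced here for illustration only.

```latex
\begin{aligned}
\mathbf{S} &= \mathbf{U}\mathbf{D}\mathbf{U}^T, \qquad
\mathbf{U} = [\mathbf{U}^0 : \mathbf{U}^*], \qquad
\mathbf{D} = \operatorname{diag}(0, \mathbf{D}_+), \\
\mathbf{X}\boldsymbol{\beta}
  &= \mathbf{X}\mathbf{U}\mathbf{U}^T\boldsymbol{\beta}
   = (\mathbf{X}\mathbf{U}^0)\, b^0
   + (\mathbf{X}\mathbf{U}^*\mathbf{D}_+^{-1/2})\, \mathbf{b}^*,
  \qquad b^0 = \mathbf{U}^{0T}\boldsymbol{\beta}, \quad
  \mathbf{b}^* = \mathbf{D}_+^{1/2}\mathbf{U}^{*T}\boldsymbol{\beta}, \\
\boldsymbol{\beta}^T\mathbf{S}\boldsymbol{\beta}
  &= \boldsymbol{\beta}^T\mathbf{U}^*\mathbf{D}_+\mathbf{U}^{*T}\boldsymbol{\beta}
   = \lVert \mathbf{b}^* \rVert_2^2 .
\end{aligned}
```

Under this assumption the fitted values are unchanged, the linear coefficient $b^0$ is unpenalized by the smoothing term, and every nonlinear coefficient in $\mathbf{b}^*$ carries the same unit weight in the quadratic penalty, which is what allows independent, identically scaled priors on the nonlinear coefficients.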

2.2 |. Two-part spike-and-slab lasso prior for smooth functions

The family of spike-and-slab regression models is one of the most commonly used models in high-dimensional data analysis for its utility in outcome prediction and variable selection. Among all the spike-and-slab priors, the spike-and-slab LASSO (SSL) prior19,20 is one of the most popular choices because it’s highly scalable. The SSL prior is composed of two double exponential distributions with mean 0 and different dispersion parameters, 0<s0<s1, mathematically,

$$\beta \mid \gamma \sim (1-\gamma)\,\mathrm{DE}(0, s_0) + \gamma\,\mathrm{DE}(0, s_1), \quad 0 < s_0 < s_1.$$

The latent binary variable $\gamma \in \{0,1\}$ indicates whether a variable $x$ is included in the model, while the dispersion parameters $s_0$ and $s_1$ control the shrinkage of the coefficient. Given that both double exponential distributions have a mean of 0 and the latent indicator $\gamma$ can only take the value of 0 or 1, the mixture double exponential distribution can be formulated as one single double exponential density,

$$\beta \mid \gamma \sim \mathrm{DE}\big(0, (1-\gamma)s_0 + \gamma s_1\big), \quad 0 < s_0 < s_1. \tag{3}$$
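The equivalence between the mixture form and the single-density form can be verified numerically. The following is a minimal sketch in Python (standard library only); the function names `de_density`, `ssl_mixture_form` and `ssl_single_form` are hypothetical and are not part of the BHAM package.

```python
import math


def de_density(beta, scale):
    """Double exponential (Laplace) density with mean 0 and scale `scale`."""
    return math.exp(-abs(beta) / scale) / (2.0 * scale)


def ssl_mixture_form(beta, gamma, s0, s1):
    """Mixture form: (1 - gamma) * DE(0, s0) + gamma * DE(0, s1)."""
    return (1 - gamma) * de_density(beta, s0) + gamma * de_density(beta, s1)


def ssl_single_form(beta, gamma, s0, s1):
    """Single-density form of Equation (3): DE(0, (1 - gamma) * s0 + gamma * s1)."""
    if gamma not in (0, 1):
        raise ValueError("gamma must be 0 or 1")
    if not 0 < s0 < s1:
        raise ValueError("scales must satisfy 0 < s0 < s1")
    return de_density(beta, (1 - gamma) * s0 + gamma * s1)
```

Because the indicator only takes the values 0 and 1, the two forms agree for every coefficient value; the spike ($\gamma = 0$) concentrates far more mass near zero than the slab ($\gamma = 1$).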

Compared to other priors for high-dimensional data analysis, SSL has the following advantages. First of all, the SSL prior provides a locally adaptive shrinkage when estimating the coefficients. Secondly, the SSL prior encourages a sparse solution, making variable selection straightforward. Thirdly, the SSL prior motivates a scalable algorithm, the EM-CD algorithm, for model fitting, and hence is more feasible for high-dimensional data analysis. We defer to Bai et al21 for a detailed discussion.

We introduce a novel SSL-based prior for smooth functions in GAMs. Given the reparameterized design matrix $\mathbf{X}_j = [X_j^0 : \mathbf{X}_j^*]$ for the $j$th variable, we impose a two-part SSL prior on the coefficients $\boldsymbol{\beta}_j = [\beta_j : \boldsymbol{\beta}_j^*]$. Specifically, the linear space coefficient has an SSL prior and the nonlinear space coefficients share a group SSL prior,

$$\beta_j \mid \gamma_j, s_0, s_1 \sim \mathrm{DE}\big(0, (1-\gamma_j)s_0 + \gamma_j s_1\big), \qquad \beta_{jk}^* \mid \gamma_j^*, s_0, s_1 \overset{iid}{\sim} \mathrm{DE}\big(0, (1-\gamma_j^*)s_0 + \gamma_j^* s_1\big), \quad k = 1, \dots, K_j, \tag{4}$$

where $\gamma_j \in \{0,1\}$ and $\gamma_j^* \in \{0,1\}$ are two latent indicator variables, indicating whether the model includes the linear effect and the nonlinear effect of the $j$th variable respectively. $s_0$ and $s_1$ are scale parameters, assumed given with $0 < s_0 < s_1$. These scale parameters can be treated as tuning parameters and optimized via cross-validation, as discussed in Section 2.4.

The proposed two-part SSL prior, particularly the group SSL prior of the nonlinear space coefficients, differs from previous group SSL priors,22,23 as the proposed prior follows the effect hierarchy principle. Effect hierarchy refers to the principle that “lower-order effects are more likely to be active than higher-order effects” defined by Chipman.24 To implement it, we let the shared latent indicator of the nonlinear coefficients, $\gamma_j^*$, depend on the value of the linear space latent indicator $\gamma_j$, while both latent indicators $\gamma_j$ and $\gamma_j^*$ follow a Bernoulli distribution. While the probability of including the linear effect is $\theta_j$, the probability of including the nonlinear effect is $\gamma_j\theta_j$,

$$\gamma_j \mid \theta_j \sim \mathrm{Bin}(1, \theta_j), \qquad \gamma_j^* \mid \gamma_j, \theta_j \sim \mathrm{Bin}(1, \gamma_j\theta_j).$$

That is, when the linear effect is not selected, the probability of including the nonlinear effect drops from $\theta_j$ to 0. For computational convenience, we analytically integrate $\gamma_j$ out such that $\gamma_j^* \mid \theta_j \sim \mathrm{Bin}(1, \theta_j^2)$ (see the derivation in the Supporting Information).
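The marginal inclusion probability of the nonlinear effect follows from the law of total probability applied to the two Bernoulli statements above. The following one-line calculation is a sketch of that step and may be organized differently from the derivation in the Supporting Information.

```latex
\Pr(\gamma_j^* = 1 \mid \theta_j)
  = \sum_{g \in \{0,1\}} \Pr(\gamma_j^* = 1 \mid \gamma_j = g, \theta_j)\,
    \Pr(\gamma_j = g \mid \theta_j)
  = (1 \cdot \theta_j)\,\theta_j + (0 \cdot \theta_j)\,(1 - \theta_j)
  = \theta_j^2 .
```

Since $\theta_j^2 \le \theta_j$ for $\theta_j \in [0,1]$, the nonlinear component is never more likely a priori to be included than the linear component, which is the effect hierarchy stated in probabilistic terms.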

To allow the shrinkage to self-adapt to the sparsity and smoothing pattern of the data, we further specify that the parameter $\theta_j$ follows a beta distribution with given shape parameters $a$ and $b$,

$$\theta_j \sim \mathrm{Beta}(a, b).$$

The beta distribution is a conjugate prior to the binomial distribution and hence provides some computational convenience. Having a prior distribution on $\theta_j$ enables the proposed prior to inherit the selective shrinkage property and self-adaptivity21 of the classical SSL prior. In other words, when a smooth function is significant, the coefficients of the smooth function escape the overall shrinkage and produce a more accurate estimate, particularly when paired with the smoothing penalty implicitly addressed via the reparameterization. Meanwhile, the hyperprior encourages information borrowing across coordinates and hence automatically adjusts for different levels of sparsity. Hereafter, we refer to the Bayesian hierarchical generalized additive model with the two-part spike-and-slab LASSO prior as BHAM, which is visually presented in Figure 1.

FIGURE 1.


Directed acyclic graph of the proposed Bayesian hierarchical additive model with parameter expansion. Ellipses are stochastic nodes, and rectangles are deterministic nodes

2.3 |. Scalable EM-coordinate descent algorithm

Despite their advantage in estimating posterior densities, MCMC algorithms are computationally prohibitive for fitting the proposed model and not feasible for high-dimensional data. Previous research shows the computational performance of MCMC algorithms for spike-and-slab models is bottlenecked for medium-sized data ($p = 25$),25 and substantially slows as $p$ increases modestly in the GAM context.14 Hence, we consider optimization algorithms that focus on the maximum a posteriori estimates at the cost of posterior inference. Specifically, we extend the EM-Coordinate Descent (EM-CD) algorithm to fit BHAMs. Similar to the EMVS algorithm26 for spike-and-slab models, the EM-CD algorithm is based on the expectation-maximization (EM) algorithm, integrating the Coordinate Descent algorithm in each iterative step to find the posterior mode. The EM-CD algorithm has been adapted to generalized linear models,27 Cox proportional hazards models,28 and their grouped counterparts.22,23 The EM-CD algorithm provides deterministic solutions, an increasingly valued property for reproducible research.

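For BHAMs, we define the parameters of interest as $\Theta = \{\boldsymbol{\beta}, \boldsymbol{\theta}, \phi\}$ and consider the latent binary indicators $\boldsymbol{\gamma}$ as nuisance parameters of the model, in other words, the “missing” data in the EM context. Our objective is to find the parameters $\Theta$ that maximize the posterior density function, or equivalently the logarithm of the density function,

$$\operatorname*{argmax}_{\Theta} \log f(\Theta, \boldsymbol{\gamma} \mid \mathbf{y}, \mathbf{X}) = \log f(\mathbf{y} \mid \boldsymbol{\beta}, \phi) + \sum_{j=1}^{p}\Big[\log f(\beta_j \mid \gamma_j) + \sum_{k=1}^{K_j} \log f(\beta_{jk}^* \mid \gamma_j^*)\Big] + \sum_{j=1}^{p}\Big[(\gamma_j + \gamma_j^*)\log\theta_j + (2 - \gamma_j - \gamma_j^*)\log(1-\theta_j)\Big] + \sum_{j=1}^{p}\log f(\theta_j),$$

where $f(\mathbf{y} \mid \boldsymbol{\beta}, \phi)$ is the data distribution and $f(\theta)$ is the $\mathrm{Beta}(a, b)$ density. We choose noninformative priors for the intercept $\beta_0$ and the dispersion parameter $\phi$; for example, $f(\beta_0 \mid \tau_0^2) = N(0, \tau_0^2)$ with $\tau_0^2$ set to a large value and $f(\log\phi) \propto 1$.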

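We use the EM algorithm to find the maximum a posteriori estimate of $\Theta$. That is, in the E-step, we calculate the expectation of the log-posterior density $\log f(\Theta, \boldsymbol{\gamma} \mid \mathbf{y}, \mathbf{X})$ with respect to the latent indicators $\boldsymbol{\gamma}$, conditioning on the parameter values from the previous iteration $\Theta^{(t-1)}$,

$$E_{\boldsymbol{\gamma} \mid \Theta^{(t-1)}} \log f(\Theta, \boldsymbol{\gamma} \mid \mathbf{y}, \mathbf{X}).$$

Hereafter, we use the shorthand notation $E(\cdot) \equiv E_{\boldsymbol{\gamma} \mid \Theta^{(t-1)}}(\cdot)$. In the M-step, we find the parameters $\Theta^{(t)}$ that maximize $E \log f(\Theta, \boldsymbol{\gamma} \mid \mathbf{y}, \mathbf{X})$. The parenthesized superscript $(t)$ denotes the parameter estimate at the $t$th iteration. The E- and M-steps are iterated until the algorithm converges.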

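Note that the log-posterior density of BHAMs (up to additive constants) can be written as a two-part equation

$$\log f(\Theta, \boldsymbol{\gamma} \mid \mathbf{y}, \mathbf{X}) = Q_1(\boldsymbol{\beta}, \phi) + Q_2(\boldsymbol{\gamma}, \boldsymbol{\theta}),$$

where

$$Q_1 \equiv Q_1(\boldsymbol{\beta}, \phi) = \log f(\mathbf{y} \mid \boldsymbol{\beta}, \phi) + \sum_{j=1}^{p}\Big[\log f(\beta_j \mid \gamma_j) + \sum_{k=1}^{K_j}\log f(\beta_{jk}^* \mid \gamma_j^*)\Big]$$

and

$$Q_2 \equiv Q_2(\boldsymbol{\gamma}, \boldsymbol{\theta}) = \sum_{j=1}^{p}\Big[(\gamma_j + \gamma_j^*)\log\theta_j + (2 - \gamma_j - \gamma_j^*)\log(1-\theta_j)\Big] + \sum_{j=1}^{p}\log f(\theta_j).$$

$Q_1$ and $Q_2$ are respectively the log posterior density of the coefficients $\boldsymbol{\beta}$ and the log posterior density of the probability parameters $\boldsymbol{\theta}$, conditioning on $\boldsymbol{\gamma}$. Meanwhile, conditioning on $\boldsymbol{\gamma}$, $Q_1$ and $Q_2$ are independent and can be maximized separately for $(\boldsymbol{\beta}, \phi)$ and $\boldsymbol{\theta}$. With the proposed two-part spike-and-slab LASSO prior, $Q_1$ can be treated as a penalized likelihood function, and the maximization of $E(Q_1)$ can be solved via the Coordinate Descent algorithm in each iteration. Coordinate descent is an optimization algorithm that offers substantial computational advantages, and is well known for its application in optimizing the $l_1$ penalized likelihood function. Maximization of $E(Q_2)$ can be solved via closed-form equations following the beta-binomial conjugate relationship.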

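The density function of the mixture double exponential prior of a coefficient $\beta$ can be written as

$$f(\beta \mid \gamma, s_0, s_1) = \frac{1}{2\big[(1-\gamma)s_0 + \gamma s_1\big]} \exp\Big(-\frac{|\beta|}{(1-\gamma)s_0 + \gamma s_1}\Big),$$

and $E(Q_1)$ can be expressed as a log-likelihood function with an $l_1$ penalty

$$E(Q_1) = \log f(\mathbf{y} \mid \boldsymbol{\beta}, \phi) - \sum_{j=1}^{p}\Big[E(S_j^{-1})\,|\beta_j| + \sum_{k=1}^{K_j} E(S_j^{*-1})\,|\beta_{jk}^*|\Big], \tag{5}$$

where $S_j = (1-\gamma_j)s_0 + \gamma_j s_1$ and $S_j^* = (1-\gamma_j^*)s_0 + \gamma_j^* s_1$. To calculate the two unknown quantities $E(S_j^{-1})$ and $E(S_j^{*-1})$, the posterior probabilities $p_j \equiv \Pr(\gamma_j = 1 \mid \Theta^{(t-1)})$ and $p_j^* \equiv \Pr(\gamma_j^* = 1 \mid \Theta^{(t-1)})$ are necessary, which can be derived via Bayes’ theorem. The calculation of $p_j^*$ is slightly different from that of $p_j$, as $p_j^*$ depends on the values of the vector $\boldsymbol{\beta}_j^*$ and $p_j$ only depends on the scalar $\beta_j$. The calculation follows the equations below,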

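$$p_j = \frac{\Pr(\gamma_j = 1 \mid \theta_j)\, f(\beta_j \mid \gamma_j = 1, s_1)}{\Pr(\gamma_j = 1 \mid \theta_j)\, f(\beta_j \mid \gamma_j = 1, s_1) + \Pr(\gamma_j = 0 \mid \theta_j)\, f(\beta_j \mid \gamma_j = 0, s_0)},$$

$$p_j^* = \frac{\Pr(\gamma_j^* = 1 \mid \theta_j) \prod_{k=1}^{K_j} f(\beta_{jk}^* \mid \gamma_j^* = 1, s_1)}{\Pr(\gamma_j^* = 1 \mid \theta_j) \prod_{k=1}^{K_j} f(\beta_{jk}^* \mid \gamma_j^* = 1, s_1) + \Pr(\gamma_j^* = 0 \mid \theta_j) \prod_{k=1}^{K_j} f(\beta_{jk}^* \mid \gamma_j^* = 0, s_0)},$$

where $\Pr(\gamma_j = 1 \mid \theta_j) = \theta_j$, $\Pr(\gamma_j = 0 \mid \theta_j) = 1 - \theta_j$, $\Pr(\gamma_j^* = 1 \mid \theta_j) = \theta_j^2$, $\Pr(\gamma_j^* = 0 \mid \theta_j) = 1 - \theta_j^2$, $f(\beta \mid \gamma = 1, s_1) = \mathrm{DE}(\beta \mid 0, s_1)$, and $f(\beta \mid \gamma = 0, s_0) = \mathrm{DE}(\beta \mid 0, s_0)$. It is trivial to show

$$E(\gamma_j) = p_j, \qquad E(\gamma_j^*) = p_j^*,$$

$$E(S_j^{-1}) = \frac{1 - p_j}{s_0} + \frac{p_j}{s_1}, \qquad E(S_j^{*-1}) = \frac{1 - p_j^*}{s_0} + \frac{p_j^*}{s_1}.$$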
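The E-step quantities for a single predictor can be sketched in a few lines of Python (standard library only). The names `log_de`, `inclusion_prob` and `e_step` are hypothetical, and working on the log scale is an implementation choice made here for numerical stability, not something the text prescribes.

```python
import math


def log_de(beta, scale):
    """Log density of a double exponential with mean 0 and scale `scale`."""
    return -math.log(2.0 * scale) - abs(beta) / scale


def inclusion_prob(betas, prior_incl, s0, s1):
    """Posterior probability that the indicator shared by `betas` equals 1.

    `prior_incl` is theta_j for the linear part and theta_j ** 2 for the
    nonlinear part; it must lie strictly between 0 and 1.
    """
    log_slab = math.log(prior_incl) + sum(log_de(b, s1) for b in betas)
    log_spike = math.log(1.0 - prior_incl) + sum(log_de(b, s0) for b in betas)
    diff = log_spike - log_slab
    if diff > 700.0:  # guard against overflow in exp()
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def e_step(beta_lin, beta_nonlin, theta, s0, s1):
    """Return (p_j, p_j*, E(S_j^-1), E(S_j*^-1)) for one predictor."""
    p_lin = inclusion_prob([beta_lin], theta, s0, s1)
    p_non = inclusion_prob(beta_nonlin, theta ** 2, s0, s1)
    inv_s_lin = (1.0 - p_lin) / s0 + p_lin / s1
    inv_s_non = (1.0 - p_non) / s0 + p_non / s1
    return p_lin, p_non, inv_s_lin, inv_s_non
```

For example, `e_step(0.0, [0.0, 0.0, 0.0], 0.5, 0.05, 1.0)` yields inclusion probabilities close to zero and penalties close to $1/s_0$, whereas large coefficients push the probabilities toward one and the penalties toward $1/s_1$; this is the adaptive shrinkage that the $l_1$ weights in Equation (5) encode.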

After replacing the calculated quantities, $E(Q_1)$ can be seen as an $l_1$ penalized likelihood function with the regularization parameter $\lambda = E(S^{-1})$, and hence can be optimized via the coordinate descent algorithm.29 Independently, the remaining parameters of interest $\boldsymbol{\theta}$ can be updated by maximizing $E(Q_2)$. As the beta distribution is a conjugate prior for the Bernoulli distribution, $\boldsymbol{\theta}$ can be easily updated with a closed-form equation,

$$\theta_j = \frac{p_j + p_j^* + a - 1}{a + b}. \tag{6}$$
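The closed form in Equation (6) can be recovered by collecting the terms of $E(Q_2)$ that involve $\theta_j$ and setting the derivative to zero. The following derivation is a sketch that uses only the $\mathrm{Beta}(a, b)$ log density, $(a-1)\log\theta_j + (b-1)\log(1-\theta_j)$ up to a constant, and the expectations $E(\gamma_j) = p_j$ and $E(\gamma_j^*) = p_j^*$.

```latex
\begin{aligned}
E(Q_2) &\ni (p_j + p_j^* + a - 1)\log\theta_j
          + (2 - p_j - p_j^* + b - 1)\log(1 - \theta_j), \\
0 &= \frac{p_j + p_j^* + a - 1}{\theta_j}
   - \frac{1 - p_j - p_j^* + b}{1 - \theta_j}
\;\;\Longrightarrow\;\;
\theta_j = \frac{p_j + p_j^* + a - 1}{(p_j + p_j^* + a - 1) + (1 - p_j - p_j^* + b)}
         = \frac{p_j + p_j^* + a - 1}{a + b}.
\end{aligned}
```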

In summary, the proposed EM-CD algorithm proceeds as follows:

  1. Choose starting values $\boldsymbol{\beta}^{(0)}$ and $\boldsymbol{\theta}^{(0)}$ for $\boldsymbol{\beta}$ and $\boldsymbol{\theta}$. For example, we can initialize $\boldsymbol{\beta}^{(0)} = \mathbf{0}$ and $\boldsymbol{\theta}^{(0)} = 0.5$,

  2. Iterate over the E-step and M-step until convergence

    E-step: calculate $E(\gamma_j)$, $E(\gamma_j^*)$ and $E(S_j^{-1})$, $E(S_j^{*-1})$ with estimates of $\Theta^{(t-1)}$ from the previous iteration,

    M-step:

    1. Update $\boldsymbol{\beta}^{(t)}$, and the dispersion parameter $\phi^{(t)}$ if it exists, using the coordinate descent algorithm with the penalized likelihood function in Equation (5),

    2. Update $\boldsymbol{\theta}^{(t)}$ using Equation (6).

We assess convergence by the criterion $|d^{(t)} - d^{(t-1)}| / (0.1 + |d^{(t)}|) < \epsilon$, where $d^{(t)} = -2\log f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}^{(t)}, \phi^{(t)})$ is the estimate of deviance at the $t$th iteration, and $\epsilon$ is a small value (say $10^{-5}$).
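The steps above can be sketched for a Gaussian outcome in plain Python. This is an illustrative sketch under stated assumptions, not the BHAM package implementation: the function name `em_cd_gaussian`, the `groups` layout, the defaults `s0=0.05`, `a=b=1` and `cd_sweeps=5`, the residual-variance update for the dispersion, and the clipping of the inclusion probabilities away from 0 and 1 are all choices made here for illustration.

```python
import math


def soft_threshold(z, t):
    """Soft-thresholding operator used by l1 coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def log_de(b, s):
    """Log density of a double exponential with mean 0 and scale s."""
    return -math.log(2.0 * s) - abs(b) / s


def incl_prob(betas, prior, s0, s1):
    """Posterior probability that the indicator shared by `betas` equals 1."""
    diff = (math.log(1.0 - prior) + sum(log_de(b, s0) for b in betas)
            - math.log(prior) - sum(log_de(b, s1) for b in betas))
    return 0.0 if diff > 700.0 else 1.0 / (1.0 + math.exp(diff))


def em_cd_gaussian(X, y, groups, s0=0.05, s1=1.0, a=1.0, b=1.0,
                   eps=1e-5, max_iter=100, cd_sweeps=5):
    """X: list of rows; groups[j] = (linear column, list of nonlinear columns)."""
    n, q = len(y), len(X[0])
    beta0, beta = sum(y) / n, [0.0] * q           # step 1: beta = 0
    theta = [0.5] * len(groups)                   # step 1: theta = 0.5
    resid = [yi - beta0 for yi in y]
    phi = max(sum(r * r for r in resid) / n, 1e-12)
    col_ss = [sum(row[k] ** 2 for row in X) for k in range(q)]
    dev_old, it = None, 0
    for it in range(1, max_iter + 1):
        # E-step: inclusion probabilities and expected inverse scales.
        lam, p_sum = [0.0] * q, []
        for j, (lin, non) in enumerate(groups):
            p_lin = incl_prob([beta[lin]], theta[j], s0, s1)
            p_non = incl_prob([beta[k] for k in non], theta[j] ** 2, s0, s1)
            lam[lin] = (1.0 - p_lin) / s0 + p_lin / s1
            for k in non:
                lam[k] = (1.0 - p_non) / s0 + p_non / s1
            p_sum.append(p_lin + p_non)
        # M-step 1: coordinate descent on the l1-penalized Gaussian likelihood.
        for _ in range(cd_sweeps):
            shift = sum(resid) / n                # unpenalized intercept
            beta0 += shift
            resid = [r - shift for r in resid]
            for k in range(q):
                if col_ss[k] == 0.0:
                    continue
                z = sum(row[k] * r for row, r in zip(X, resid)) + col_ss[k] * beta[k]
                new = soft_threshold(z, phi * lam[k]) / col_ss[k]
                delta = new - beta[k]
                if delta != 0.0:
                    resid = [r - delta * row[k] for row, r in zip(X, resid)]
                    beta[k] = new
        rss = sum(r * r for r in resid)
        phi = max(rss / n, 1e-12)                 # assumed dispersion update
        # M-step 2: closed-form update of theta, Equation (6), clipped.
        theta = [min(max((ps + a - 1.0) / (a + b), 1e-6), 1.0 - 1e-6)
                 for ps in p_sum]
        # Convergence check on the deviance.
        dev = n * math.log(2.0 * math.pi * phi) + rss / phi
        if dev_old is not None and abs(dev - dev_old) / (0.1 + abs(dev)) < eps:
            break
        dev_old = dev
    return {"beta0": beta0, "beta": beta, "theta": theta, "phi": phi,
            "iterations": it}
```

Calling `em_cd_gaussian(X, y, groups=[(0, [1, 2]), (3, [4, 5])])` treats columns 0 and 3 as the linear parts of two predictors and the remaining columns as their nonlinear bases. The sketch assumes the columns of `X` already hold the reparameterized bases; for non-Gaussian outcomes the coordinate descent step would need a weighted (iteratively reweighted) version, which is omitted here.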

2.4 |. Selecting optimal scale values

Our proposed model, BHAM, requires two preset scale parameters $(s_0, s_1)$. Hence, we need to find the optimal values for the scale parameters such that the model reaches its best prediction performance with respect to the criterion of preference. This could be achieved by constructing a two-dimensional grid consisting of different $(s_0, s_1)$ pairs. However, previous research suggests the value of the slab scale $s_1$ has less impact on the final model, and it is recommended to be set to a generally large value, for example, $s_1 = 1$, that provides no or weak shrinkage.20 As a result, we focus on examining different values of the spike scale $s_0$. Instead of the two-dimensional grid, we consider a sequence of $L$ increasing values $\{s_0^l\}: 0 < s_0^1 < s_0^2 < \cdots < s_0^L < s_1$. Increasing the spike scale $s_0$ tends to include more nonzero coefficients in the model. A measure of preference calculated with cross-validation (CV), for example, deviance (defined as the model log-likelihood times $-2$, that is, $-2\log f(\mathbf{y} \mid \hat{\boldsymbol{\beta}}, \hat{\phi})$), area under the curve (AUC), or mean squared error, can be used to facilitate the selection of a final model. The procedure is similar to the LASSO implementation in the widely used R package glmnet, which quickly fits LASSO models over a list of candidate values of $\lambda$ and gives a sequence of models for users to choose from.

3 |. SIMULATION STUDY

In this section, we compare the performance of the proposed model to six alternative models: linear LASSO models, component selection and smoothing operator (COSSO),30 adaptive COSSO,31 generalized additive models with automatic smoothing (referred to as mgcv hereafter),32 spike-and-slab GAM,15 and SB-GAM.13 We use linear LASSO models as the benchmark, examining the performance when the linearity assumption doesn’t hold. COSSO is one of the earliest smoothing spline models that consider sparsity-smoothness penalty. Adaptive COSSO improved upon COSSO by using adaptive weight for penalties such that the penalty of each functional component is different for extra flexibility. mgcv is one of the most popular models for nonlinear effect interpolation and prediction. Nevertheless, mgcv doesn’t support analyses when the number of parameters exceeds the sample size. Spike-and-slab GAM employs a spike-and-slab prior for GAM and uses an MCMC algorithm for model fitting. SB-GAM is the first spike-and-slab LASSO GAM. We implement linear LASSO model with R package glmnet 4.1-2, COSSO and adaptive COSSO with R package cosso 2.1-1, generalized additive models with automatic smoothing with R package mgcv 1.8-31, spike-and-slab GAM with R package spikeSlabGAM 1.1-15, and SB-GAM with R package sparseGAM 1.0. COSSO models and SB-GAM do not provide flexibility to define smooth functions, and hence use the default choices; mgcv, spikeSlabGAM and the proposed model allow customized smooth functions and we choose the cubic regression spline. We control the dimensionality of each smooth function, 10 bases, for all different choices of smooth functions. We use a 5-fold CV with the default selection criteria to select the final model for linear LASSO model, COSSO models, SB-GAM and the proposed model. Twenty default candidates of tuning parameters (s0 in BHAM, λ0 in SB-GAM) are examined for SB-GAM and the proposed model that allow user specification of tuning candidates. 
All computation was conducted on a high-performance 64-bit Linux platform with 48 cores of 2.70 GHz eight-core Intel Xeon E5-2680 processors and 24 G of RAM per core and R 3.6.2.33

Other related methods for high-dimensional GAMs also exist, notably the methods of sparse additive models by Ravikumar et al.4 However, we exclude these methods from the current simulation study because they demonstrated inferior predictive performance compared to mgcv.14

3.1 |. Monte Carlo simulation study

We follow the data generating process described in Bai13: we first generate $n = 500$ training data points with $p = 4, 10, 50, 100, 200$ predictors respectively, where the predictors $\mathbf{X}$ are simulated from a multivariate normal distribution $\mathrm{MVN}_{n \times p}(\mathbf{0}, \mathbf{I}_p)$. We then simulate the outcome $Y$ from two distributions, Gaussian and binomial, with the identity link and the logit link $g(x) = \log\frac{x}{1-x}$ respectively. The mean of each outcome is derived via the following function

$$E(Y) = g^{-1}\big(5\sin(2\pi x_1) - 4\cos(2\pi x_2 - 0.5) + 6(x_3 - 0.5) - 5(x_4^2 - 0.3)\big),$$

for Gaussian and binomial outcomes. Gaussian outcomes require specification of dispersion, where we set the dispersion parameter to be 1. In this data generating process, we have $x_1, x_2, x_3, x_4$ as the active predictors, while the remaining predictors are inactive, that is, $f_j(x_j) = 0$ for $j = 5, \dots, p$. Another independent sample of size $n_{test} = 1000$ is created following the same data generating process, serving as the testing data. We generate 50 independent pairs of training and testing datasets to evaluate the prediction and variable selection performance of the chosen models, where training datasets are used to fit the models and testing datasets are used to calculate metrics of interest. In addition, we consider the data generating process where all functional forms of the predictors are linear while keeping the rest of the simulation parameters the same. This additional set of linear simulations is designed to investigate the flexibility of the proposed model when nonlinear assumptions are not met.
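The nonlinear data generating process can be sketched in Python (standard library only). The function names `linear_predictor`, `inv_logit` and `simulate`, and the use of a seeded `random.Random` generator, are assumptions made for illustration; the simulations reported here were run in R.

```python
import math
import random


def linear_predictor(x):
    """Sum of the four active smooth functions; x[4:] are inactive."""
    return (5.0 * math.sin(2.0 * math.pi * x[0])
            - 4.0 * math.cos(2.0 * math.pi * x[1] - 0.5)
            + 6.0 * (x[2] - 0.5)
            - 5.0 * (x[3] ** 2 - 0.3))


def inv_logit(eta):
    """Numerically stable inverse of the logit link."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def simulate(n, p, family="gaussian", seed=None):
    """Draw n observations with p >= 4 independent standard normal predictors."""
    if p < 4:
        raise ValueError("p must be at least 4")
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    eta = [linear_predictor(row) for row in X]
    if family == "gaussian":
        # Identity link with dispersion (error variance) equal to 1.
        y = [e + rng.gauss(0.0, 1.0) for e in eta]
    elif family == "binomial":
        # Logit link: Bernoulli draws with success probability g^{-1}(eta).
        y = [1 if rng.random() < inv_logit(e) else 0 for e in eta]
    else:
        raise ValueError("family must be 'gaussian' or 'binomial'")
    return X, y
```

A training set for one replicate would then be `simulate(500, 50, "gaussian", seed=1)` and the matching test set `simulate(1000, 50, "gaussian", seed=2)`, with distinct seeds keeping the two samples independent.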

To evaluate the predictive performance of the models, the statistics, R2 for Gaussian model and AUC for binomial model calculated based on the testing datasets, are averaged across 50 simulations. To evaluate the variable selection performance of the models, we record the set of variables each method selects and calculate the averaged positive predictive value (precision), true positive rate (recall), and Matthews correlation coefficient (MCC),

$$\text{Precision} = \frac{TP}{PP},$$

$$\text{Recall} = \frac{TP}{TP + FN},$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where $TP$, $TN$, $FP$, $FN$, and $PP$ are true positives, true negatives, false positives, false negatives, and predicted positives ($PP = TP + FP$) respectively. For the methods that do not automatically achieve variable selection, we set the alpha level at 0.05 for mgcv, which relies on hypothesis testing, and a soft-threshold at 0.5 for spikeSlabGAM given the marginal inclusion probabilities. For the two methods, BHAM and spikeSlabGAM, that are capable of bi-level selection, we record the probability that the linear and nonlinear components of each predictor are selected in the models.
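The three selection metrics can be computed from the set of selected predictors and the set of truly active predictors. The sketch below uses a hypothetical helper `selection_metrics`, takes recall as the true positive rate $TP/(TP+FN)$, and assumes the common conventions of returning NaN for an undefined ratio and 0 for an MCC with a zero denominator.

```python
import math


def selection_metrics(selected, active, p):
    """Precision, recall and MCC for variable selection among p predictors."""
    selected, active = set(selected), set(active)
    tp = len(selected & active)      # active predictors that were selected
    fp = len(selected - active)      # inactive predictors that were selected
    fn = len(active - selected)      # active predictors that were missed
    tn = p - tp - fp - fn            # inactive predictors correctly left out
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    recall = tp / (tp + fn) if tp + fn > 0 else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"precision": precision, "recall": recall, "mcc": mcc}
```

For instance, with $p = 10$, active predictors `{0, 1, 2, 3}` and a method that selects `{0, 1, 2, 7}`, the counts are $TP = 3$, $FP = 1$, $FN = 1$, $TN = 5$, so precision and recall are both 0.75 and MCC is $14/24 \approx 0.58$.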

3.2 |. Simulation results

3.2.1 |. Prediction performance

Among the set of simulations where the functional forms of the predictors are nonlinear, the predictive performances have a consistent pattern across the two distributions of outcomes. For simplicity, we use Gaussian simulations to exemplify the improvement of BHAM and defer to Tables 1 and 2 for detailed statistics. The proposed model, BHAM, predicts as well as, if not better than, other high-dimensional additive models. Specifically, BHAM shows great improvement over COSSO methods, resulting in a median (interquartile range, IQR) 31% (131%) and 20% (129%) improvement over COSSO and adaptive COSSO in $R^2$ statistics respectively. The improvement over the spikeSlabGAM model is moderate, resulting in a median (IQR) 6% (10%) improvement in $R^2$. When compared to SB-GAM, BHAM performs better (median (IQR) 13% (8%) improvement in $R^2$) in lower dimensional cases ($p = 4, 10$), and similarly (median (IQR) 1% (9%) improvement in $R^2$) in high-dimensional cases ($p = 50, 100, 200$). As previously hypothesized, the linear LASSO model predicts less accurately than other flexible models across all scenarios; mgcv performs extremely well in low-dimensional cases ($p = 4, 10$), and deteriorates as the dimensionality increases until it is no longer applicable. Note that mgcv fits models but fails to converge within the default number of iterations when the number of coefficients to estimate approaches the sample size ($p = 50$), which leads to poor performance. Even though SB-GAM has a slight prediction advantage over the proposed model in some high-dimensional situations, BHAM has a substantial computational advantage over SB-GAM, resulting in a median (IQR) 64% (39%) reduction in computation time (measured in seconds) for Gaussian simulations, without sacrificing much of the prediction accuracy (see Table 3).

TABLE 1.

The average and standard deviation of the out-of-sample R2 measure for Gaussian outcomes over 50 iterations

P mgcv LASSO COSSO Adaptive COSSO BHAM SB-GAM spikeSlabGAM
4 0.90 (0.01) 0.33 (0.01) 0.71 (0.13) 0.72 (0.11) 0.90 (0.01) 0.79 (0.04) 0.80 (0.00)
10 0.90 (0.01) 0.33 (0.01) 0.66 (0.21) 0.77 (0.02) 0.89 (0.01) 0.79 (0.04) 0.79 (0.00)
50 0.86 (0.02) 0.32 (0.01) 0.46 (0.19) 0.57 (0.18) 0.80 (0.02) 0.78 (0.05) 0.78 (0.01)
100 - 0.32 (0.01) 0.41 (0.23) 0.48 (0.25) 0.79 (0.01) 0.79 (0.05) 0.77 (0.01)
200 - 0.32 (0.01) 0.39 (0.19) 0.40 (0.17) 0.79 (0.01) 0.78 (0.04) 0.75 (0.01)

Note: The models of comparison include the proposed Bayesian hierarchical additive model (BHAM), linear LASSO model (LASSO), component selection and smoothing operator (COSSO), adaptive COSSO, mgcv, sparse Bayesian generalized additive model (SB-GAM), and spikeSlabGAM model. mgcv does not provide estimates when the number of parameters exceeds the sample size, that is, p = 100, 200.

TABLE 2.

The average and standard deviation of the out-of-sample area under the curve measures for binomial outcomes over 50 iterations

P mgcv LASSO COSSO Adaptive COSSO BHAM SB-GAM spikeSlabGAM
4 0.94 (0.01) 0.83 (0.01) 0.90 (0.01) 0.90 (0.01) 0.92 (0.01) 0.92 (0.01) 0.90 (0.00)
10 0.92 (0.03) 0.83 (0.00) 0.86 (0.04) 0.86 (0.03) 0.92 (0.01) 0.92 (0.01) 0.90 (0.00)
50 0.76 (0.03) 0.83 (0.01) 0.83 (0.02) 0.84 (0.02) 0.90 (0.01) 0.92 (0.01) 0.89 (0.01)
100 - 0.83 (0.01) 0.83 (0.02) 0.81 (0.09) 0.90 (0.01) 0.92 (0.01) 0.88 (0.01)
200 - 0.83 (0.01) 0.81 (0.06) 0.82 (0.05) 0.88 (0.02) 0.92 (0.01) 0.87 (0.02)

Note: The models of comparison include the proposed Bayesian hierarchical additive model (BHAM), linear LASSO model (LASSO), component selection and smoothing operator (COSSO), adaptive COSSO, mgcv, sparse Bayesian generalized additive model (SB-GAM), and spikeSlabGAM model. mgcv does not provide estimates when the number of parameters exceeds the sample size, that is, p = 100, 200.

TABLE 3.

The average and standard deviation of computation time in seconds, including cross-validation and final model fitting, over 50 iterations

Distribution P mgcv COSSO Adaptive COSSO BHAM SB-GAM spikeSlabGAM
Binomial 4 0.18 (0.04) 3.16 (1.39) 5.51 (4.07) 2.73 (0.22) 347.17 (89.43) 8.41 (0.91)
Binomial 10 3.46 (11.06) 8.30 (1.70) 10.66 (5.30) 4.08 (0.29) 539.05 (135.55) 20.36 (2.16)
Binomial 50 660.31 (141.53) 103.41 (20.00) 118.82 (18.45) 14.22 (0.58) 1590.09 (142.19) 236.73 (14.83)
Binomial 100 - 662.61 (125.00) 672.65 (185.09) 31.61 (2.56) 2720.53 (250.43) 967.97 (186.85)
Binomial 200 - 5325.66 (995.60) 4963.93 (1482.11) 82.17 (3.29) 4788.76 (420.64) 3371.88 (194.02)
Gaussian 4 0.05 (0.01) 0.75 (0.09) 0.75 (0.11) 8.78 (1.57) 38.82 (2.74) 1.84 (0.18)
Gaussian 10 0.32 (0.39) 3.42 (0.24) 3.41 (0.23) 20.77 (3.95) 76.12 (5.55) 5.93 (0.57)
Gaussian 50 72.03 (57.99) 33.98 (2.88) 34.35 (2.86) 285.73 (12.53) 374.76 (23.79) 65.18 (8.12)
Gaussian 100 - 117.79 (3.33) 119.63 (3.66) 372.01 (56.92) 640.44 (21.91) 194.14 (8.09)
Gaussian 200 - 518.86 (40.78) 524.76 (39.15) 471.46 (72.23) 1300.70 (72.74) 738.52 (62.76)

Note: The models of comparison include the proposed Bayesian hierarchical additive model (BHAM), the linear LASSO model (LASSO), component selection and smoothing operator (COSSO), adaptive COSSO, mgcv, sparse Bayesian generalized additive model (SB-GAM), and spikeSlabGAM. mgcv does not provide estimates when the number of parameters exceeds the sample size, that is, p = 100, 200.
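As a rough illustration of how the computational advantage can be read off Table 3, the sketch below computes the relative reduction in mean computation time of BHAM versus SB-GAM for the Gaussian simulations. It works only from the tabulated means, so it is an approximation: the summary reported in the text is presumably computed across simulation iterations, and its interquartile range cannot be recovered from the table. The helper name `percent_reduction` is hypothetical.

```python
from statistics import median

# Mean computation times in seconds for the Gaussian simulations (Table 3).
bham_seconds = {4: 8.78, 10: 20.77, 50: 285.73, 100: 372.01, 200: 471.46}
sbgam_seconds = {4: 38.82, 10: 76.12, 50: 374.76, 100: 640.44, 200: 1300.70}


def percent_reduction(new: float, reference: float) -> float:
    """Relative reduction of `new` with respect to `reference`, in percent."""
    return 100.0 * (reference - new) / reference


# One reduction per dimension p, based on the tabulated means only.
reductions = {
    p: percent_reduction(bham_seconds[p], sbgam_seconds[p]) for p in bham_seconds
}
median_reduction = median(reductions.values())

for p, reduction in reductions.items():
    print(f"p={p:>3}: BHAM needs {reduction:.1f}% less time than SB-GAM")
print(f"median reduction across p: {median_reduction:.1f}%")
```

The median of these five table-based reductions is close to the 64% median reduction quoted in the text.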

We also examine the prediction performance when the functional forms of the predictors are linear (see Supporting Information Tables S1 and S2). The proposed model, BHAM, performs similarly to the linear LASSO model regardless of the outcome distribution. This observation demonstrates that BHAM is a flexible model with good prediction performance regardless of the underlying functional form of the predictors. spikeSlabGAM has prediction performance similar to that of BHAM. Surprisingly, SB-GAM does not perform well in the high-dimensional Gaussian outcome scenarios.

3.2.2 |. Variable selection performance

Among the simulations where the functional forms of the predictors are nonlinear, the proposed model, BHAM, behaves consistently across dimensions and outcome distributions (see Table 4 for Gaussian outcomes and Supporting Information Table S3 for binomial outcomes): it is conservative. Conservative variable selection shows up as high precision and low recall, where high precision means that a high percentage of the selected variables are true signals, and low recall means that the model selects only a small subset of the active predictors. In other words, BHAM tends to select a smaller set of variables that truly affect the outcome. We note that the variable selection performance of BHAM plummets and is not optimized when p=200. Upon further investigation, we found that this is because the generic sequence of s0 used to tune the model does not contain the optimal value. Overall, among all the models examined, SB-GAM has the best performance, with both high precision and high recall, and yields a high MCC.

TABLE 4.

The variable selection performance of Gaussian simulations, measured by positive predictive value (precision), true positive rate (recall), and Matthews correlation coefficient (MCC), for the high-dimensional methods averaged over 50 iterations

P Metric LASSO COSSO Adaptive COSSO BHAM SB-GAM spikeSlabGAM
4 Precision 1.00 1.00 1.00 1.00 1.00 1.00
10 Precision 0.76 0.84 0.93 0.62 0.86 1.00
50 Precision 0.48 0.70 0.69 0.88 0.75 1.00
100 Precision 0.43 0.61 0.59 0.99 0.79 0.99
200 Precision 0.36 0.61 0.47 0.28 0.75 0.99
4 Recall 0.53 0.49 0.53 0.88 0.99 0.51
10 Recall 0.40 0.52 0.52 0.83 1.00 0.50
50 Recall 0.35 0.40 0.48 0.37 1.00 0.50
100 Recall 0.33 0.36 0.40 0.30 0.99 0.50
200 Recall 0.32 0.33 0.35 0.52 1.00 0.50
10 MCC 0.32 0.49 0.57 0.46 0.86 0.61
50 MCC 0.31 0.47 0.53 0.50 0.83 0.69
100 MCC 0.32 0.41 0.45 0.53 0.87 0.70
200 MCC 0.28 0.41 0.38 0.36 0.85 0.70

Note: The models of comparison include the proposed Bayesian hierarchical additive model (BHAM), linear LASSO model (LASSO), component selection and smoothing operator (COSSO), adaptive COSSO, sparse Bayesian generalized additive model (SB-GAM), and spikeSlabGAM model. MCC is ill-defined in the p = 4 simulation (no true negatives) and is hence omitted for all methods.
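The three selection metrics in Table 4 can be sketched as follows, using the standard confusion-matrix definitions of precision, recall, and MCC. This is an illustrative sketch rather than the code used in the study: the function name `selection_metrics` and the predictor indices in the examples are hypothetical. The second example shows why MCC is undefined when every predictor is active.

```python
import math
from typing import Dict, Optional, Set


def selection_metrics(selected: Set[int], active: Set[int], p: int) -> Dict[str, Optional[float]]:
    """Precision, recall and MCC of a selected variable set against the truly active set.

    `p` is the total number of candidate predictors. A metric is None when its
    denominator is zero, that is, when it is ill-defined.
    """
    tp = len(selected & active)   # active predictors that were selected
    fp = len(selected - active)   # noise predictors that were selected
    fn = len(active - selected)   # active predictors that were missed
    tn = p - tp - fp - fn         # noise predictors correctly left out

    precision = tp / (tp + fp) if tp + fp > 0 else None
    recall = tp / (tp + fn) if tp + fn > 0 else None
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else None
    return {"precision": precision, "recall": recall, "mcc": mcc}


# Hypothetical example: 10 predictors, 4 active, a conservative model picks 3.
print(selection_metrics(selected={0, 1, 7}, active={0, 1, 2, 3}, p=10))

# With p = 4 and all predictors active there are no true negatives,
# so the MCC denominator is zero and the metric is reported as None.
print(selection_metrics(selected={0, 1}, active={0, 1, 2, 3}, p=4))
```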

The performance of another Bayesian model, spikeSlabGAM, deteriorates as the sparsity grows, particularly when p>50 or for binomial outcomes. The variable selection performance for the linear simulations matches the prediction performance: BHAM performs well in the Gaussian scenarios, while the performance of SB-GAM deteriorates.

Among the high-dimensional methods of comparison, two are capable of bi-level selection: the proposed BHAM and spikeSlabGAM. In the linear simulations, both methods accurately select the linear components and have a drastically lower probability, close to 0, of including the nonlinear components, as anticipated. Specifically, spikeSlabGAM has a smaller probability of including the nonlinear component in the model than BHAM. However, this advantage of spikeSlabGAM over BHAM is less pronounced in the nonlinear simulations. spikeSlabGAM performs better than BHAM when selecting components of functional forms that include only a linear or only a nonlinear component, for example, the functional forms for x3 and x4, but it tends to ignore variables that have more complex functional forms, for example, the functional forms for x1 and x2. In contrast, BHAM is more likely to include them in the model. This trade-off is determined by the assumption implicitly reflected in the prior hierarchy. We defer an in-depth discussion to Section 5.

4 |. METABOLOMICS DATA ANALYSIS

In this section, we apply the proposed model BHAM to analyze two published metabolomics datasets, where the outcomes are binary and continuous, respectively. We demonstrate improved prediction performance compared to the other Bayesian hierarchical additive model, SB-GAM,13 while remaining computationally efficient (see Table 5). In total computation time, BHAM requires roughly 4% to 12% of the time SB-GAM needs to fit the models.

TABLE 5.

Model fitting time in seconds for two metabolomics data analyses, from Emory Cardiovascular Biobank (ECB) and weight loss maintenance cohort (WLM)

Data BHAM CV BHAM Final BHAM Total SB-GAM CV SB-GAM Final SB-GAM Total
ECB 100.8 3.5 104.4 2,659.0 20.9 2,679.9
WLM 365.4 6.8 372.2 3,116.0 32.7 3,148.7

Note: The table reports the computation time for the cross-validation step (CV), the optimal model fitting step (Final), and the total (Total) for the proposed model BHAM and the comparison model SB-GAM.

4.1 |. Emory cardiovascular biobank

We use the proposed model BHAM to analyze a metabolomics dataset from a recently published study34 of the plasma metabolomic profile and three-year all-cause mortality among patients undergoing cardiac catheterization. The dataset is publicly available via Dryad.35 It contains a total of 776 subjects from two cohorts. As there is a large number of nonoverlapping features between the two cohorts, we use the cohort with the larger sample size (N=454). There are initially 6796 features in the dataset, which is too many to analyze in a practically meaningful way. Hence, we choose the 200 features with the largest variance. We fit 5-knot spline additive models for the binary outcome using two different models, the proposed BHAM and SB-GAM. 10-fold CV is used to choose the optimal tuning parameters of each framework with respect to the default selection criterion implemented in the software. Out-of-bag samples are used for prediction performance evaluation, where deviance, AUC, Brier score, defined as $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, and misclassification error, defined as $\frac{1}{n}\sum_{i=1}^{n} I(|y_i-\hat{y}_i|>0.5)$, are calculated. BHAM obtains superior AUC, Brier score, and misclassification error in the out-of-bag samples compared to SB-GAM (see Table 6). We plot the 33 features included in the final BHAM model in Figure 2.

TABLE 6.

Prediction performance of BHAM fitted with coordinate descent algorithm (BHAM) and SB-GAM models for Emory Cardiovascular Biobank by 10-fold cross-validation, including deviance, area under the curve (AUC), Brier score, and misclassification error (Misclass) where class labels are defined using threshold = 0.5

Methods Deviance AUC Brier Misclass
BHAM 510.99 0.61 0.19 0.24
SB-GAM 636.56 0.56 0.22 0.30
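The binary-outcome metrics reported in Table 6 can be sketched in a few lines. The Brier score and misclassification error follow the definitions given in the text, with $\hat{y}_i$ the predicted probability. The AUC below uses the pairwise (Mann-Whitney) formulation, which is a standard definition but is an assumption here, since the text does not state how AUC was computed. The function names and toy data are hypothetical.

```python
from typing import Sequence


def brier_score(y: Sequence[int], prob: Sequence[float]) -> float:
    """Mean squared difference between the 0/1 outcome and the predicted probability."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, prob)) / len(y)


def misclassification_error(y: Sequence[int], prob: Sequence[float]) -> float:
    """Share of observations with |y - prob| > 0.5, i.e. classified wrongly at threshold 0.5."""
    return sum(abs(yi - pi) > 0.5 for yi, pi in zip(y, prob)) / len(y)


def auc(y: Sequence[int], prob: Sequence[float]) -> float:
    """Probability that a random case is ranked above a random control (ties count 1/2)."""
    cases = [pi for yi, pi in zip(y, prob) if yi == 1]
    controls = [pi for yi, pi in zip(y, prob) if yi == 0]
    wins = sum((pc > pn) + 0.5 * (pc == pn) for pc in cases for pn in controls)
    return wins / (len(cases) * len(controls))


# Toy held-out predictions (hypothetical values, not study data).
y_true = [1, 0, 1, 0, 0]
y_prob = [0.8, 0.3, 0.4, 0.6, 0.1]

print(f"Brier score: {brier_score(y_true, y_prob):.3f}")
print(f"Misclassification error: {misclassification_error(y_true, y_prob):.2f}")
print(f"AUC: {auc(y_true, y_prob):.3f}")
```

Lower is better for the Brier score and misclassification error, and higher is better for AUC, which is how the comparison in Table 6 is read.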

FIGURE 2.

Plots of the functions for the 33 metabolites selected by BHAM in the Emory Cardiovascular Biobank data analysis

4.2 |. Weight loss maintenance cohort

We use the proposed model BHAM to analyze metabolomics data from a recently published study36 on the association between metabolic biomarkers and weight loss, where the dataset is publicly available.37 In this analysis, we focus on one of the three studies included, the weight loss maintenance cohort,38 due to the drastically different intervention effects. In the dataset, 765 metabolites in baseline plasma were profiled using liquid chromatography mass spectrometry. Quality control and natural log transformation were previously performed and documented by the study publishing team.36 The outcome of interest is the standardized percent change in insulin resistance, which is modeled using a Gaussian model. After removing missing data points and addressing outliers in the data, p=237 features remain in the analysis. 5-knot spline additive models for the Gaussian outcome are constructed using two different models, the proposed BHAM and SB-GAM. 10-fold CV is used to choose the optimal tuning parameters of each framework with respect to the default selection criterion implemented in the software. Out-of-bag samples are used for prediction performance evaluation, where deviance, R², mean squared error (MSE), defined as $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, and mean absolute error (MAE), defined as $\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$, are calculated. BHAM obtains superior R², MSE, and MAE in the out-of-bag samples compared to SB-GAM (see Table 7).

TABLE 7.

Prediction performance of BHAM fitted with coordinate descent algorithm (BHAM) and SB-GAM models for Weight Loss Maintenance Cohort by 10-fold cross-validation, including deviance, R², mean squared error (MSE), and mean absolute error (MAE)

Methods Deviance R² MSE MAE
BHAM 668.01 0.07 0.93 0.76
SB-GAM 666.83 0.03 0.98 0.77
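For the Gaussian outcome, MSE and MAE follow the definitions given in the text. The sketch below also computes R² as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the evaluation sample. That R² definition is an assumption, since the text does not spell out which out-of-sample R² was used. The function names and toy values are hypothetical.

```python
from typing import Sequence


def mean_squared_error(y: Sequence[float], pred: Sequence[float]) -> float:
    """Average squared prediction error."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


def mean_absolute_error(y: Sequence[float], pred: Sequence[float]) -> float:
    """Average absolute prediction error."""
    return sum(abs(yi - pi) for yi, pi in zip(y, pred)) / len(y)


def r_squared(y: Sequence[float], pred: Sequence[float]) -> float:
    """1 - SSE/SST, with SST taken around the mean of the evaluation outcomes (assumed definition)."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst


# Toy held-out outcomes and predictions (hypothetical values, not study data).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(f"MSE: {mean_squared_error(y_true, y_pred):.3f}")
print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R^2: {r_squared(y_true, y_pred):.2f}")
```

Under this definition, an outcome standardized to unit variance gives R² close to 1 − MSE, which roughly matches the BHAM row of Table 7 (R² of 0.07 with MSE of 0.93).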

5 |. DISCUSSION

In this article, we describe a novel high-dimensional generalized additive model with a Bayesian hierarchical prior for the purpose of predictive modeling. In particular, we introduce a two-part spike-and-slab LASSO prior for reparameterized smooth functions and derive a scalable EM-CD algorithm for model fitting. The proposed model provides a new angle to address the excess shrinkage of smooth functions to which previous regularized high-dimensional GAMs are commonly vulnerable, and hence improves predictive performance. The EM-CD algorithm, extended from previous spike-and-slab LASSO models, provides a computationally efficient alternative to computationally prohibitive MCMC algorithms, enhancing the scalability of spike-and-slab models. In addition, the two-part prior motivates bi-level selection of predictors, that is, the selection of linear and nonlinear components. In the simulation study and real-data analyses, the proposed model demonstrates improved prediction and a computational advantage compared to state-of-the-art models. For the purpose of variable selection, trade-offs exist among the methods of comparison. We implement the proposed model in an open-source R package BHAM, deposited at https://github.com/boyiguo1/BHAM. To maximize the flexibility of smooth function specification, we deploy the same programming grammar as the state-of-the-art package mgcv, in contrast to previous tools where smooth functions are limited to the default ones. Ancillary functions are provided for model specification in high-dimensional settings, curve plotting, and functional selection.

The proposed model shares many commonalities with SB-GAM,13 which was developed independently around the same time as the proposed work. Both frameworks emphasize computational efficiency by deploying group spike-and-slab LASSO type priors and optimization-based scalable algorithms. Bai provides the theoretical proof for the consistency of variable selection using the group spike-and-slab LASSO prior. The proposed model focuses on improving prediction performance for high-dimensional GAMs, with the capability of bi-level selection. Moreover, the proposed model can easily generalize to other families of priors and smooth functions if desired. Although this generalization is not the focus of this manuscript, it is described in the Supporting Information.

While designing and analyzing the simulation study, we made a couple of interesting observations. First, variable selection is a delicate topic in the context of predictive modeling. When prediction performance is used to tune a model, the model may include noise variables, as happens with LASSO and LASSO-based models.39 Moreover, bi-level selection is a more complex problem than variable selection. The complexity shows in the validity of the effect hierarchy principle. While most functional forms follow the principle that a nonlinear function contains a linear component, there are functions that do not, for example, x2. The proposed prior and spikeSlabGAM employ different structures: the proposed prior imposes effect hierarchy, while spikeSlabGAM treats the selection of linear and nonlinear components independently. The different prior setups lead to trade-offs for the purpose of bi-level selection. We recommend exercising more judgment in bi-level selection, either relying on heuristic knowledge to choose an appropriate prior or exploring multiple models when heuristic knowledge does not exist. Second, we find that the performance of the proposed model is more sensitive to the granularity of the s0 sequence in high-dimensional settings than in lower-dimensional settings. Even though the current default sequence of s0 results in reasonable performance in the simulation studies, we recommend fine-tuning the model with a more granular sequence of s0 to improve performance.

Our future efforts are directed toward uncertainty inference for the proposed model, survival analysis, and integrative analysis. The EM-CD algorithm used to fit the proposed BHAM cannot provide uncertainty inference. We derive the EM-Iterative Weighted Least Square algorithm (EM-IWLS, see the Supporting Information) as an alternative, in which the Iterative Weighted Least Square algorithm replaces the Coordinate Descent algorithm in the EM procedure. The EM-IWLS algorithm was previously used to fit Bayesian high-dimensional generalized linear models,40 and delivers estimates of the coefficient variance-covariance matrix. Due to the space limit, technical details will be explained in a future manuscript. While the proposed model addresses a great many analytic problems, the analysis of time-to-event outcomes remains unsolved. A naive approach would be to convert a time-to-event outcome to a Poisson outcome following Whitehead.41 However, it would be more efficient to directly fit Cox models via the penalized pseudo-likelihood function.42 Meanwhile, with the growing understanding of biological structure within the -omics field, it is appealing to integrate external biological information into the modeling process. The main motivation for integrative models is that biologically informed grouping of weak effects increases the power to detect true associations between features and the outcome,43 and stabilizes the analysis results for reproducibility purposes. Such integration can be achieved by setting up a structural hyperprior on the inclusion indicator of the smooth function null space γ0. A similar strategy has been used in Ferrari and Dunson.44
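To make the Poisson conversion mentioned above more concrete, the sketch below shows one common form of the idea: a piecewise-exponential data expansion, in which each subject contributes one record per time interval at risk. It is an assumption that this form is close to what a Whitehead-style analysis would use, and the exact construction in that reference may differ. The names `PersonPeriod`, `expand_to_poisson`, and the cut points are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class PersonPeriod(NamedTuple):
    subject: int     # index of the subject in the input
    interval: int    # index of the time interval [cuts[k], cuts[k + 1])
    exposure: float  # time at risk spent in the interval
    event: int       # 1 if the subject's event occurred in this interval, else 0


def expand_to_poisson(
    times: Sequence[float], events: Sequence[int], cuts: Sequence[float]
) -> List[PersonPeriod]:
    """Split each (time, event) pair into one record per interval the subject was at risk."""
    rows: List[PersonPeriod] = []
    for subject, (time, status) in enumerate(zip(times, events)):
        for k in range(len(cuts) - 1):
            start, stop = cuts[k], cuts[k + 1]
            if time <= start:
                break  # follow-up ended before this interval began
            exposure = min(time, stop) - start
            event = int(status == 1 and time <= stop)
            rows.append(PersonPeriod(subject, k, exposure, event))
    return rows


# Subject 0 has an event at t = 2.5; subject 1 is censored at t = 1.0.
records = expand_to_poisson(times=[2.5, 1.0], events=[1, 0], cuts=[0.0, 1.0, 2.0, 3.0])
for record in records:
    print(record)
```

Each record can then be treated as a Poisson observation with response `event`, an offset of log(`exposure`), and an interval-specific baseline term. The expanded dataset grows with the number of intervals, which is one reason fitting the Cox model directly can be more efficient.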

Supplementary Material

NIHMS1916760

Footnotes

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of this article.

DATA AVAILABILITY STATEMENT

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

REFERENCES

  • 1. Mallick H, Yi N. Bayesian methods for high dimensional linear models. J Biometr Biostat. 2013;205:1–27. doi: 10.4172/2155-6180.S1-005
  • 2. Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci. 2001;16(3):199–231.
  • 3. Hastie T, Tibshirani R. Generalized additive models: some applications. J Am Stat Assoc. 1987;82(398):371–386. doi: 10.1080/01621459.1987.10478440
  • 4. Ravikumar P, Lafferty J, Liu H, Wasserman L. Sparse additive models. J Royal Stat Soc Ser B (Stat Methodol). 2009;71(5):1009–1030. doi: 10.1111/j.1467-9868.2009.00718.x
  • 5. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J Royal Stat Soc Ser B (Stat Methodol). 2006;68(1):49–67.
  • 6. Huang J, Horowitz JL, Wei F. Variable selection in nonparametric additive models. Ann Stat. 2010;38(4):2282.
  • 7. Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007;23(12):1486–1494.
  • 8. Xue L. Consistent variable selection in additive models. Stat Sin. 2009;19(3):1281–1296.
  • 9. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96(456):1348–1360.
  • 10. Xu X, Ghosh M, others. Bayesian variable selection and estimation for group lasso. Bayesian Anal. 2015;10(4):909–936.
  • 11. Yang X, Narisetty NN, others. Consistent group selection with Bayesian high dimensional modeling. Bayesian Anal. 2020;15(3):909–935.
  • 12. Bai R, Moran GE, Antonelli JL, Chen Y, Boland MR. Spike-and-slab group lassos for grouped regression and sparse generalized additive models. J Am Stat Assoc. 2022;117(537):184–197.
  • 13. Bai R. Spike-and-slab group lasso for consistent estimation and variable selection in non-Gaussian generalized additive models. arXiv:2007.07021v5. Preprint posted online June 5, 2021. https://arxiv.org/abs/2007.07021.
  • 14. Scheipl F, Kneib T, Fahrmeir L. Penalized likelihood and Bayesian function selection in regression models. AStA Adv Stat Anal. 2013;97(4):349–385.
  • 15. Scheipl F, Fahrmeir L, Kneib T. Spike-and-slab priors for function selection in structured additive regression models. J Am Stat Assoc. 2012;107(500):1518–1532. doi: 10.1080/01621459.2012.737742
  • 16. Wood SN. Generalized Additive Models: An Introduction with R. 2nd ed. Boca Raton, FL: CRC Press/Taylor & Francis Group; 2017.
  • 17. Meier L, Van De Geer S, Bühlmann P. High-dimensional additive modeling. Ann Stat. 2009;37(6 B):3779–3821. doi: 10.1214/09-AOS692
  • 18. Marra G, Wood SN. Practical variable selection for generalized additive models. Comput Stat Data Anal. 2011;55(7):2372–2387.
  • 19. Ročková V. Bayesian estimation of sparse signals with a continuous spike-and-slab prior. Ann Stat. 2018;46(1):401–437. doi: 10.1214/17-AOS1554
  • 20. Ročková V, George EI. The Spike-and-Slab LASSO. J Am Stat Assoc. 2018;113(521):431–444. doi: 10.1080/01621459.2016.1260469
  • 21. Bai R, Rockova V, George EI. Spike-and-slab meets LASSO: a review of the spike-and-slab LASSO. arXiv:2010.06451. Preprint posted online July 1, 2021. https://arxiv.org/abs/2010.06451.
  • 22. Tang Z, Shen Y, Li Y, et al. Group spike-and-slab lasso generalized linear models for disease prediction and associated genes detection by incorporating pathway information. Bioinformatics. 2018;34(6):901–910. doi: 10.1093/bioinformatics/btx684
  • 23. Tang Z, Lei S, Zhang X, et al. Gsslasso Cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information. BMC Bioinform. 2019;20(1):1–15. doi: 10.1186/s12859-019-2656-1
  • 24. Chipman H. Prior distributions for Bayesian analysis of screening experiments. In: Dean A, Lewis S, eds. Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics. New York, NY: Springer; 2006:236–267.
  • 25. George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sin. 1997;7(2):339–373.
  • 26. Ročková V, George EI. EMVS: the EM approach to Bayesian variable selection. J Am Stat Assoc. 2014;109(506):828–846. doi: 10.1080/01621459.2013.869223
  • 27. Tang Z, Shen Y, Zhang X, Yi N. The spike-and-slab lasso generalized linear models for prediction and associated genes detection. Genetics. 2017;205(1):77–88. doi: 10.1534/genetics.116.192195
  • 28. Tang Z, Shen Y, Zhang X, Yi N. The spike-and-slab lasso Cox model for survival prediction and associated genes detection. Bioinformatics. 2017;33(18):2799–2807. doi: 10.1093/bioinformatics/btx300
  • 29. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1.
  • 30. Zhang HH, Lin Y. Component selection and smoothing for nonparametric regression in exponential families. Stat Sin. 2006;16(3):1021–1041.
  • 31. Storlie CB, Bondell HD, Reich BJ, Zhang HH. Surface estimation, variable selection, and the nonparametric oracle property. Stat Sin. 2011;21(2):679.
  • 32. Wood SN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J Royal Stat Soc (B). 2011;73(1):3–36.
  • 33. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; 2021. https://www.R-project.org/
  • 34. Mehta A, Liu C, Nayak A, et al. Untargeted high-resolution plasma metabolomic profiling predicts outcomes in patients with coronary artery disease. PLoS One. 2020;15(8):e237579.
  • 35. Mehta A, Liu C, Uppal K, Quyyumi A. Data from: metabolomics - Emory cardiovascular biobank. Dryad Dataset. https://datadryad.org/stash/dataset/doi:10.5061/dryad.866t1g1mt.
  • 36. Bihlmeyer NA, Kwee LC, Clish CB, et al. Metabolomic profiling identifies complex lipid species and amino acid analogues associated with response to weight loss interventions. PLoS One. 2021;16(5):e0240764.
  • 37. Bihlmeyer NA, Kwee LC, Clish CB, et al. Metabolomic profiling identifies complex lipid species and amino acid analogues associated with response to weight loss interventions. Zenodo. https://zenodo.org/record/4767969. Accessed August 2021.
  • 38. Svetkey LP, Stevens VJ, Brantley PJ, et al. Comparison of strategies for sustaining weight loss: the weight loss maintenance randomized controlled trial. JAMA. 2008;299(10):1139–1148.
  • 39. Wu J, Witten D. Flexible and interpretable models for survival data. J Comput Graph Stat. 2019;28(4):954–966. doi: 10.1080/10618600.2019.1592758
  • 40. Yi N, Ma S. Hierarchical shrinkage priors and model fitting for high-dimensional generalized linear models. Stat Appl Genet Mol Biol. 2012;11(6). doi: 10.1515/1544-6115.1803
  • 41. Whitehead J. Fitting Cox’s regression model to survival data using GLIM. J Royal Stat Soc Ser C (Appl Stat). 1980;29(3):268–275.
  • 42. Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1.
  • 43. Peterson CB, Stingo FC, Vannucci M. Joint Bayesian variable and graph selection for regression models with network-structured predictors. Stat Med. 2016;35(7):1017–1031.
  • 44. Ferrari F, Dunson DB. Identifying main effects and interactions among exposures using Gaussian processes. Ann Appl Stat. 2020;14(4):1743–1758.
