Summary
In omics experiments, estimation and variable selection can involve thousands of proteins/genes observed from a relatively small number of subjects. Many regression regularization procedures have been developed for estimation and variable selection in such high-dimensional problems. However, approaches have predominantly focused on linear regression models that ignore correlation arising from long sequences of repeated measurements on the outcome. Our work is motivated by the need to identify proteomic biomarkers that improve the prediction of rapid lung-function decline for individuals with cystic fibrosis (CF) lung disease. We extend four Bayesian penalized regression approaches for a Gaussian linear mixed effects model with non-stationary covariance structure to account for the complicated structure of longitudinal lung function data while simultaneously estimating unknown parameters and selecting important protein isoforms to improve predictive performance. Different types of shrinkage priors are evaluated to induce variable selection in a fully Bayesian framework. The approaches are studied with simulations. We apply the proposed method to real proteomics and lung-function outcome data from our motivating CF study, identifying a set of relevant clinical/demographic predictors and a proteomic biomarker for rapid decline of lung function. We also illustrate the methods on CD4 yeast cell-cycle genomic data, confirming that the proposed method identifies transcription factors that have been highlighted in the literature for their importance as cell cycle transcription factors.
Keywords: Bayesian regularization, MCMC, shrinkage priors, mixed effects models, Integrated Brownian motion, irregular longitudinal data
1 |. INTRODUCTION
Advanced statistical tools and computational algorithms are critical to discovering biomarkers related to biological processes or diseases.1 In these types of studies, it is common to have a large p (number of variables) but relatively small n (sample size). In omics and other experiments yielding high-dimensional data, only a small subset of potential predictors typically has an effect on the response variable; most covariates have minimal or zero effect.2 Correctly identifying this important yet sparse collection of variables from a large set is critical to producing stable models that generate accurate estimates and predictions. Identifying relevant predictors for different types of longitudinal models is an increasingly important area of study but has also become more challenging with the availability of big data.
Our work is motivated by an interest in identifying biomarkers of lung function decline in individuals with cystic fibrosis (CF), a hereditary lung disease that typically leads to respiratory failure.3 Biomarkers must be selected from thousands of expression levels of serum proteins, while accounting for demographic/clinical characteristics of individuals. The outcome of interest is longitudinal lung function, measured as the forced expiratory volume in 1 second of percent predicted (FEV1). The dataset, described subsequently in further detail, includes 5011 protein isoforms collected from the serum samples of 88 individuals. Identifying a set of prognostic CF-related biomarkers may aid early detection of precipitous drops in lung function, thereby enabling intervention prior to irreversible lung damage. Although thousands of proteins and several clinical/demographic features were collected, only a small proportion of them are expected to be associated with disease progression, yielding a sparse regression model.
A Gaussian linear mixed effects model with non-stationary covariance structure has been successfully utilized to predict changes in long sequences of longitudinal data. This model was first proposed by Diggle, Sousa and Asar4 for monitoring the progression of undiagnosed incipient renal failure and was later used for monitoring rapid CF disease progression by Szczesniak et al.5. Taylor et al.6 and Szczesniak et al.7 have shown in CF FEV1 analyses that longitudinal models with non-stationary covariance outperform traditional linear mixed effects models. The random intercept-and-slope models introduced early on by Laird and Ware8 are not flexible enough to capture the erratic shape of trajectories from long follow-up sequences6. Although parameter estimation procedures have been developed for this type of stochastic linear mixed effects model by Diggle, Sousa and Asar4 and Asar et al.9, existing work has not focused on variable selection.
For variable selection, one can use regularization (penalization) approaches. Commonly used examples are the lasso (least absolute shrinkage and selection operator)10, the adaptive lasso11 and the elastic net12. These methods are useful for identifying potential predictors with non-zero effects and achieve good out-of-sample predictive performance. The idea is to add a penalty to the loss function in order to shrink regression coefficients towards zero, thereby potentially yielding a sparse model. Both frequentist and Bayesian approaches have been proposed for inference with regularized regression, and recent efforts have focused on the latter; see for example the works of Kyung et al.13 and Mallick and Yi14. Bayesian regularization methods have several attractive features compared to frequentist methods, such as providing credibility intervals (CIs) for parameters of interest, estimating the penalty parameter by using an appropriate prior distribution, allowing one to model uncertainty by averaging over all possible models, and enabling easy consideration of different types of penalties. Bayesian methods perform similarly to, and sometimes better than, traditional penalization methods.15 A few works have considered Bayesian penalized regression approaches for longitudinal data. For instance, Li et al.16 postulated a Bayesian linear mixed effects model to simultaneously estimate the temporal trend and genetic effects of markers for QTL mapping of a set of wood species. They considered the random intercept-and-slope model and used a spike-and-slab prior for the coefficients; such priors play a role similar to that of the L1 norm penalty in the Bayesian lasso and induce variable selection. In another work, Li, Wang and Wu17 proposed a Bayesian group lasso method for variable selection in nonparametric varying-coefficient models for longitudinal genome-wide association data.
Li, Wang and Wu17 assumed a stationary first-order autoregressive, AR(1), covariance model, but this is not appropriate for longitudinal data with mistimed (irregularly spaced) measurements, such as those encountered in our CF example. Moreover, most variable selection methods for temporal data were proposed for studies with p < n; thus, they are not ideal for the high-dimensional variable selection needed in omics experiments.18
We propose flexible variable selection for a Gaussian linear mixed effects model with nonstationary covariance. We extend procedures for the Bayesian lasso, Bayesian adaptive lasso, Bayesian ridge, and Bayesian elastic net. Our proposed approach accounts for the complicated structure of longitudinal data while simultaneously estimating unknown parameters and selecting important proteins that explain decline in lung function. Our paper proceeds as follows. Section 2 describes the longitudinal model setup and notation. In Section 3, we present the proposed Bayesian regularization methods and prior specifications. In Section 4, we provide the details of the posterior computation. We then explore the properties of the proposed methods with simulation studies in Section 5 and report our results for the real data applications in Section 6. Section 7 provides sensitivity analysis and model comparison results. Finally, Section 8 concludes with a discussion of our findings and directions for further extending the work presented in this paper.
2 |. LONGITUDINAL MODEL AND NOTATION
We use the non-stationary Gaussian linear mixed effects model for modelling longitudinal FEV1 trajectories and include the protein expression levels as predictors, in addition to demographic and routinely collected clinical variables. This choice is supported by our exploratory analysis (e.g., variogram analysis), and the model was previously considered and suggested for CF applications by Szczesniak et al.5. The model may be expressed as:
Yij = μi(tij) + Ui + Wi(tij) + Zij,   (1)
where Yij denotes the FEV1 response for subject i (i = 1, …, N) at time tij (j = 1, …, ni) and μi(t) is the mean response function at time t. The Ui are independent N(0, ω2) random variables that represent time-constant differences between patients. The Wi(t) are independent copies of a continuous-time stochastic process, which we specify as an integrated Brownian motion (representing stochastic change in an individual patient’s lung function over time). The Zij terms are mutually independent, identically distributed N(0, τ2) random variables representing measurement error in the determination of Yij. Hereafter, we use the terms “random walk” and “Brownian motion” interchangeably.
Wi(t) is modeled as the integral of a continuous-time random walk,
Wi(t) = ∫0t Bi(u) du,   (2)
where Bi(t), the rate of change at time t, is Brownian motion. Set Bi(0) = 0 for all i. The conditional distribution of Bi(t) given Bi(s) for some s < t is Normal with mean Bi(s) and variance σ2(t − s). The marginal distribution of Bi(t) is N(0, σ2t) and Cov(Bi(s), Bi(t)) = σ2 min(s, t). Hence, E{Wi(t)} = 0, Var{Wi(t)} = σ2t3/3 and Cov(Wi(s), Wi(t)) = σ2s2(3t − s)/6 for s ≤ t. The bivariate process (Bi(s), Wi(t)) is bivariate Gaussian with mean 0 and cross-covariance Cov(Bi(s), Wi(t)) = σ2s(t − s/2) for s ≤ t. Note that Bi(t) and the bivariate process (Bi(t), Wi(t)) are Markov, whereas Wi(t) is non-Markov.
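These moment identities can be checked by simulation. The sketch below is our own illustration (not the authors' code; all names are hypothetical): it simulates Brownian motion paths with variance parameter σ2 on a fine grid, integrates them to approximate Wi(t), and compares the empirical variance of W(T) with σ2T3/3.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, T, n_steps, n_paths = 2.0, 4.0, 200, 5000
dt = T / n_steps

# Brownian motion increments: B(t + dt) - B(t) ~ N(0, sigma2 * dt)
dB = rng.normal(0.0, np.sqrt(sigma2 * dt), size=(n_paths, n_steps))
B = np.cumsum(dB, axis=1)          # B(t) at grid points dt, 2*dt, ..., T
W = np.cumsum(B, axis=1) * dt      # W(t) = integral of B (Riemann sum)

var_WT = W[:, -1].var()            # empirical Var{W(T)} across paths
theory = sigma2 * T**3 / 3         # theoretical variance of integrated BM
```

With 5000 paths the empirical variance should agree with the theoretical value up to Monte Carlo and discretization error of a few percent.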
The mean response function μi(tij) can be decomposed as Ψi(tij)α + Xi(tij)β, where Ψi(tij) denotes the set of covariates on which we do not perform variable selection and α denotes the corresponding set of population-averaged parameters not to be shrunk. Similarly, Xi(tij) denotes the set of covariates on which we perform variable selection and β denotes the corresponding set of population-averaged parameters to be shrunk.
The above model specification induces the following multivariate Normal distribution for Yi,
Yi ~ MVN(μi, Vi(ϕ)),   (3)
where Yi = (Yi1, …, Yini)T, μi = Ψiα + Xiβ, Ψi and Xi are the design matrices with rows Ψi(tij) and Xi(tij), and β = (β0, …, βp)T. Vi(ϕ) = ω2Ji + σ2Ri + τ2Ii, where ϕ = (ω2, σ2, τ2), Ji is an ni × ni matrix of ones, Ri is an ni × ni matrix with (j, k)th element min(tij, tik)2{3 max(tij, tik) − min(tij, tik)}/6, and Ii is an ni × ni identity matrix.
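The covariance matrix Vi can be assembled directly from these components. The sketch below is a minimal illustration (not the authors' implementation; names are our own), assuming the integrated-Brownian-motion form of Ri with σ2 factored out, i.e., Ri(j, k) = min(tij, tik)2{3 max(tij, tik) − min(tij, tik)}/6.

```python
import numpy as np

def make_Vi(t, omega2, sigma2, tau2):
    """Marginal covariance V_i = omega2*J + sigma2*R + tau2*I for one subject,
    where R(j, k) = min(t_j, t_k)^2 * (3*max(t_j, t_k) - min(t_j, t_k)) / 6
    is the integrated-Brownian-motion covariance with sigma2 factored out."""
    t = np.asarray(t, dtype=float)
    mn = np.minimum.outer(t, t)
    mx = np.maximum.outer(t, t)
    R = mn**2 * (3.0 * mx - mn) / 6.0
    n = len(t)
    return omega2 * np.ones((n, n)) + sigma2 * R + tau2 * np.eye(n)

# Example with three visit times and the variance values used later in Section 5
Vi = make_Vi([1.0, 2.5, 4.0], omega2=125.0, sigma2=6.0, tau2=77.0)
```

The diagonal entries equal ω2 + σ2t3/3 + τ2, and the resulting matrix is symmetric positive definite.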
Specifically, the model we apply to the CF data has the following form:
Yij = α0 + α1tij + ∑k=1p βkXijk + Ui + Wi(tij) + Zij.   (4)
In equation (4), our main interest is in shrinking the βk’s to identify potentially predictive variables (biomarkers and other explanatory variables). Shrinking the coefficient of the time variable is typically not of primary interest, especially in longitudinal models. If there are any additional variables whose coefficients we do not want to shrink, we can retain them in the model in a similar fashion. It can be of biological importance to simultaneously adjust the model for other variables while performing variable selection.
3 |. BAYESIAN REGULARIZATION METHODS
A variety of shrinkage priors are available; research in this area has mostly focused on comparing different shrinkage priors with one another and with their frequentist counterparts through simulations, mostly for linear regression models with uncorrelated errors. Kyung et al.13, Mallick and Yi14 and Erp, Oberski and Mulder15 all provide broad overviews of the shrinkage priors that have been proposed for different regularization methods. In the following sections, we consider the traditional Bayesian lasso, Bayesian adaptive lasso, Bayesian ridge, and Bayesian elastic net, primarily specifying available priors from the literature. We discuss prior choices and posterior computation for the Bayesian lasso in detail in the following subsection. Prior specification and posterior computation for the other Bayesian penalized models can be done similarly.
3.1 |. Bayesian Lasso
Tibshirani10 noted that lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace priors. Here we specify the prior distributions by following Park and Casella19 for the common parameters; the prior choices for the additional relevant parameters are discussed below. λ is a regularization parameter that controls the amount of shrinkage toward zero (larger λ implies a larger amount of shrinkage). For the frequentist lasso, the value of λ is determined by cross-validation, but here it can be adaptively determined from the data by assigning a diffuse prior. For convenience in the derivation of conditional posteriors, a gamma prior was suggested for λ2 by Park and Casella19; for the details of the motivation for this choice and alternative ways to estimate λ, see the original work.
For the measurement error τ2 in model (4), the non-informative scale-invariant prior 1/τ2 is used, which causes no issues with propriety, although any inverse-gamma prior would also be appropriate. Conditioning the prior of β on the measurement error τ2 guarantees unimodality of the posterior, which accelerates convergence of the MCMC sampler and makes point estimates more meaningful. The Bayesian specification of the model is as follows:
Yi | α, β, ϕ ~ MVN(Ψiα + Xiβ, Vi(ϕ)),
β | τ2, γ12, …, γp2 ~ MVN(0, τ2Dγ), Dγ = diag(γ12, …, γp2),
γj2 | λ2 ~ Exp(λ2/2) independently for each j,
λ2 ~ Gamma(r, δ),
π(τ2) ∝ 1/τ2,   (5)

where the γj2 are the latent scale parameters in the scale-mixture representation of the Laplace prior.
As discussed before, in model (4) we only aim to penalize the β’s, but not α0 and α1. Hence, α0 and α1 are modelled separately from the β’s and are assumed to have non-informative uniform priors, α0, α1 ~ Uni(−∞, ∞). Further, to simplify the computations we reparametrize the variance parameters as ω2 = ϕ1τ2 and σ2 = ϕ2τ2, so that Vi = τ2(ϕ1Ji + ϕ2Ri + Ii), where ϕ1, ϕ2 > 0. This reparametrization allows us to obtain a closed form for the full conditional posterior distribution of τ2, which speeds up our MCMC algorithm. The choice of priors for ϕ1 and ϕ2 is discussed below to complete the model specification.
The natural choice for these two parameters would be flat uniform priors, e.g. Uni(0, 100). Assigning a flat uniform prior on both ϕ1 and ϕ2 implies that ω2 | τ2 ~ Uni(0, 100τ2) and σ2 | τ2 ~ Uni(0, 100τ2). This choice can be shown to yield proper posterior distributions.
Another option is to assume that ϕ1/(1 + ϕ1) ~ Uni(0, 1), which induces the density p(ϕ1) = (1 + ϕ1)−2. The same can be done for ϕ2, giving p(ϕ2) = (1 + ϕ2)−2. Assigning these priors for ϕ1 and ϕ2 implies that ω2/(ω2 + τ2) ~ Uni(0, 1) and σ2/(σ2 + τ2) ~ Uni(0, 1). Priors of this type were previously used for the multiple testing problem by Scott and Berger20.
In our computations, we considered both prior choices for ϕ1 and ϕ2. Since the results were very similar under the two priors, we report only the results from the former. We also considered several other potential priors for ϕ1 and ϕ2, but they had undesirable features and were therefore not used in our computations.
3.2 |. Bayesian Adaptive Lasso
Fan and Li21 argued that the lasso yields biased parameter estimates even though its simultaneous variable selection performance is satisfactory. They stated that the lasso does not satisfy the oracle properties, and Zou11 introduced the adaptive lasso to achieve them. The Bayesian adaptive lasso (BAL) is obtained by assigning a separate λj to each coefficient being shrunk instead of a single common λ for all coefficients. Hence, one only needs to replace the corresponding priors in the model specification given for the Bayesian lasso; the priors for the remaining parameters are the same as in Section 3.1.
γj2 | λj2 ~ Exp(λj2/2) independently for each j,
λj2 ~ Gamma(r, δ) independently for each j.   (6)
3.3 |. Bayesian Ridge
Although the lasso is attractive due to its simultaneous variable selection and the sparse results it produces, it may not be desirable when the covariates are highly correlated. The lasso does not account for correlation among the covariates; it tends to keep one of a group of highly correlated variables in the final model and exclude all the others. Ridge regression, first introduced by Hoerl and Kennard22, minimizes the residual sum of squares (RSS) subject to a constraint on the L2 norm of the coefficients. Ridge regression addresses multicollinearity among predictors, applies continuous shrinkage and improves prediction performance through a bias-variance trade-off. However, ridge regression generally does not result in an interpretable model when the number of predictors is large. Moreover, it applies the same degree of shrinkage to both irrelevant and relevant predictors; thus, its variable selection performance is dominated by the lasso. Fan and Li21 pointed out that the essential difference between the lasso and ridge priors is that the lasso prior is not differentiable at 0 while the ridge prior is, so the lasso prior yields a sparse model while the ridge prior does not.
In the Bayesian context, the ridge estimator requires a Gaussian prior to be viewed as a Bayesian estimator, instead of the Laplace prior used in the Bayesian Lasso. We use the following prior for β and the priors for the remaining parameters are the same as aforementioned.
β | τ2, λ ~ MVN(0, (τ2/λ)I).   (7)
3.4 |. Bayesian Elastic Net
Zou and Hastie12 introduced the frequentist elastic net method, which merges features of both the lasso and the ridge, taking into account both sparsity and the group structure of the data. Hence, the elastic net simultaneously induces variable selection, by forcing the coefficients of redundant variables towards zero, and selection of groups of correlated covariates, and it handles multicollinear predictors. The frequentist elastic net is obtained by minimizing the RSS subject to a non-differentiable constraint that can be expressed as a compromise between the L1 and L2 norms of the coefficients, whereas the lasso is obtained by minimizing the RSS subject to a non-differentiable restriction expressed in terms of the L1 norm alone. For the elastic net and lasso priors, non-differentiability is needed for sparse solutions. The elastic net also overcomes the issue with p ≫ n and can be thought of as a stabilized version of the lasso, but it is less aggressive than the lasso in terms of predictor exclusion.23 Zou and Hastie12 discussed Bayesian connections to the elastic net and suggested a prior that is a compromise between the Gaussian and Laplace priors. Later, the Bayesian elastic net (BEN) was discussed extensively by Li and Lin23 and was also considered by Kyung et al.13 and Mallick and Yi14 in the context of linear regression modeling.
To obtain the Bayesian elastic net for model (4), we modify the model specification given in (5). The details of the prior choices for the unknown parameters are given below:
β | τ2, γ12, …, γp2, λ2 ~ MVN(0, τ2Dγ*), Dγ* = diag{(1/γj2 + λ2)−1},
γj2 | λ12 ~ Exp(λ12/2) independently for each j,
λ12 ~ Gamma(r1, δ1), λ2 ~ Gamma(r2, δ2),
π(τ2) ∝ 1/τ2,   (8)
where λs, rs, δs, τ2, ϕs > 0, and we assume that rs and δs are fixed for s = 1, 2 in our analysis.
In (8), λ1 and λ2 are tuning parameters, and λ1 plays the same role as λ in the Bayesian lasso. Hence, we consider a gamma prior on λ12. Kyung et al.13 suggested assigning a gamma prior directly on λ2, since it acts as a rate parameter in the model. In the above specification, setting λ2 = 0 returns the Bayesian lasso, while letting γj2 → ∞ for all j reduces the model to the Bayesian ridge. Also, under the above specification, the conditional distribution of β depends on λ2 through the covariance matrix, but this does not affect conjugacy. Again, the remaining parameters are assigned the same prior distributions as described in the previous sections.
4 |. POSTERIOR COMPUTATION
Unknown parameters and hyper-parameters of each model are sampled from their conditional posterior distributions through an MCMC algorithm. Given the priors and the data likelihood, the posterior distributions of all unknown parameters can be obtained by using Bayes’ theorem. Below, we omit the derivations of the conditional posterior distributions but present their final forms for the Bayesian lasso; they have similar forms under the other Bayesian regularized methods with different shrinkage priors.
The conditional posterior distributions of many of the unknown parameters are available in closed form due to conjugacy, which improves the convergence and reduces the computation time for our MCMC algorithm.
The joint posterior of all parameters θ = (β, α, λ2, γ12, …, γp2, τ2, ϕ1, ϕ2), where the γj2 are the latent scale parameters in the Laplace prior representation, can be written as:

p(θ | Y) ∝ [∏i f(Yi | α, β, τ2, ϕ1, ϕ2)] π(β | τ2, γ12, …, γp2) [∏j π(γj2 | λ2)] π(λ2) π(τ2) π(ϕ1) π(ϕ2).
Then, writing Vi* = Vi/τ2 = ϕ1Ji + ϕ2Ri + Ii, the full conditional posterior distribution of α is MVN(μα, Σα), where

Σα = τ2(∑i ΨiT(Vi*)−1Ψi)−1 and μα = (∑i ΨiT(Vi*)−1Ψi)−1 ∑i ΨiT(Vi*)−1(Yi − Xiβ).
Similarly, the full conditional posterior distribution of β is MVN(μ, Σ), where

Σ = τ2(∑i XiT(Vi*)−1Xi + Dγ−1)−1 and μ = (∑i XiT(Vi*)−1Xi + Dγ−1)−1 ∑i XiT(Vi*)−1(Yi − Ψiα),

with Vi* = Vi/τ2 = ϕ1Ji + ϕ2Ri + Ii and Dγ = diag(γ12, …, γp2).
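A conjugate draw of this kind can be implemented by accumulating the subject-level cross-products and then sampling through a Cholesky factor. The sketch below is our own illustration under these standard forms (not the authors' code; all names are hypothetical): Xs holds the Xi, Sis holds the scaled covariance matrices Vi* = Vi/τ2, resids holds Yi − Ψiα, and D_inv is the inverse of the prior scale matrix.

```python
import numpy as np

def draw_beta(Xs, Sis, resids, D_inv, tau2, rng):
    """One Gibbs draw of beta ~ MVN(mu, Sigma), where
    Sigma = tau2 * (sum_i X_i' S_i^{-1} X_i + D^{-1})^{-1} and
    mu = (sum_i X_i' S_i^{-1} X_i + D^{-1})^{-1} sum_i X_i' S_i^{-1} r_i,
    with S_i = V_i / tau2 and r_i = Y_i - Psi_i alpha."""
    p = D_inv.shape[0]
    A = D_inv.copy()               # accumulates the (scaled) posterior precision
    b = np.zeros(p)
    for X, S, r in zip(Xs, Sis, resids):
        SinvX = np.linalg.solve(S, X)   # S_i^{-1} X_i
        A += X.T @ SinvX
        b += SinvX.T @ r                # X_i' S_i^{-1} r_i (S_i is symmetric)
    mu = np.linalg.solve(A, b)
    L = np.linalg.cholesky(tau2 * np.linalg.inv(A))
    return mu + L @ rng.normal(size=p)

rng = np.random.default_rng(0)
beta_draw = draw_beta([np.eye(2)], [np.eye(2)], [np.array([1.0, 2.0])],
                      np.eye(2), 1.0, rng)
```

Note that τ2 cancels in the posterior mean because both the likelihood precision and the prior precision scale with 1/τ2.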
The full conditional posterior distribution of τ2 is a scaled inverse chi-square distribution, equivalently Inv-Gamma((∑i ni + p)/2, s/2),

where s = ∑i (Yi − Ψiα − Xiβ)T(Vi*)−1(Yi − Ψiα − Xiβ) + βTDγ−1β and Vi* = ϕ1Ji + ϕ2Ri + Ii.
The full conditional posterior distribution of 1/γj2 (where γj2 is the latent scale attached to βj) is Inv-Gaussian(√(λ2τ2/βj2), λ2), and the full conditional posterior distribution of λ2 is Gamma(p + r, ∑j γj2/2 + δ).
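These two conjugate updates can be sketched as follows. This is an illustrative, hypothetical implementation (not the authors' code); note that NumPy's wald sampler parameterizes the inverse-Gaussian by (mean, scale), and its gamma sampler takes a scale rather than a rate parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

def update_latent_scales(beta, tau2, lam2):
    """Draw 1/gamma_j^2 ~ Inv-Gaussian(sqrt(lam2*tau2/beta_j^2), lam2)
    for each j, then return the gamma_j^2 themselves."""
    mean = np.sqrt(lam2 * tau2 / beta**2)
    inv_gamma2 = rng.wald(mean, lam2)       # numpy's IG is (mean, scale)
    return 1.0 / inv_gamma2

def update_lambda2(gamma2, r, delta):
    """Draw lam2 ~ Gamma(shape = p + r, rate = sum_j gamma_j^2 / 2 + delta)."""
    shape = len(gamma2) + r
    rate = gamma2.sum() / 2.0 + delta
    return rng.gamma(shape, 1.0 / rate)     # numpy's gamma uses scale = 1/rate

beta = np.array([3.0, 2.0, 0.1])            # toy current values of beta
gamma2 = update_latent_scales(beta, tau2=77.0, lam2=1.0)
lam2 = update_lambda2(gamma2, r=1.0, delta=1.0)
```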
Finally, we derive the conditional posterior distributions for ϕ1 and ϕ2 from the joint posterior distribution. The conditional posterior distributions of ϕ1 and ϕ2 have the same form,

p(ϕl | ·) ∝ π(ϕl) ∏i |Vi*|−1/2 exp{−(1/(2τ2))(Yi − Ψiα − Xiβ)T(Vi*)−1(Yi − Ψiα − Xiβ)}, l = 1, 2,

with Vi* = ϕ1Ji + ϕ2Ri + Ii. This distribution is not available in closed form, so a Metropolis-Hastings algorithm is developed to update ϕl for l = 1, 2 based on this expression.
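A minimal random-walk Metropolis-Hastings step for ϕl on the log scale might look as follows. This is our own sketch, not the authors' implementation: it assumes a flat prior on ϕl > 0, toy inputs, and hypothetical names; resids holds Yi − Ψiα − Xiβ.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_cond_phi(phi1, phi2, resids, times, tau2):
    """Log kernel of the conditional density of (phi1, phi2):
    sum_i [-0.5*log|S_i| - r_i' S_i^{-1} r_i / (2*tau2)],
    with S_i = phi1*J_i + phi2*R_i + I_i (flat prior on phi_l > 0 assumed)."""
    total = 0.0
    for r, t in zip(resids, times):
        t = np.asarray(t, float)
        mn = np.minimum.outer(t, t)
        mx = np.maximum.outer(t, t)
        R = mn**2 * (3.0 * mx - mn) / 6.0       # integrated-BM covariance
        S = phi1 * np.ones((len(t),) * 2) + phi2 * R + np.eye(len(t))
        _, logdet = np.linalg.slogdet(S)
        total += -0.5 * logdet - r @ np.linalg.solve(S, r) / (2.0 * tau2)
    return total

def mh_update_phi(phi, l, resids, times, tau2, step=0.3):
    """One random-walk MH step for phi[l] on the log scale (keeps phi_l > 0)."""
    prop = phi.copy()
    prop[l] = phi[l] * np.exp(step * rng.normal())
    log_acc = (log_cond_phi(*prop, resids, times, tau2)
               - log_cond_phi(*phi, resids, times, tau2)
               + np.log(prop[l] / phi[l]))      # Jacobian of log-scale proposal
    return prop if np.log(rng.uniform()) < log_acc else phi

times = [[1.0, 2.0, 3.0], [0.5, 1.5]]           # toy visit times for 2 subjects
resids = [rng.normal(size=3), rng.normal(size=2)]
phi = np.array([1.0, 0.1])
phi = mh_update_phi(phi, 0, resids, times, tau2=1.0)
```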
After deriving all the necessary conditional posterior distributions, we use MCMC to estimate the posterior distribution of each parameter by drawing samples from the corresponding full conditional distribution, given the current values of all other parameters and the observed data. The described procedures and MCMC algorithms were implemented in R (R Core Team24), and the computer code is publicly available at https://github.com/egecili54/BPReg.
5 |. SIMULATION STUDIES
We conduct simulation studies to investigate the properties of the proposed Bayesian penalized regression approaches for selecting predictor variables. In our numerical experiments, we mimic our motivating research question and data by considering different simulation settings. We generate data following the proteomic setting based on model (4) with different numbers of protein isoforms: p = 20, 100, or 500. We specify the number of patients as n = 88. We assume that the number of measurements for each subject is the same as in our CF proteomic data (between 20 and 98), and that all individuals are aged from 6 to 18 years. The proteomic markers were generated from a p-dimensional multivariate normal distribution with mean 0 and compound-symmetric covariance matrix Ap×p, where the off-diagonal elements of A correspond to ρ = 0.1, 0.6, or 0.9 and the diagonal elements are 1. In our simulations we assume that 5 of the p proteins have non-zero effects and the remaining p − 5 proteins have zero effects. We assume the true coefficient vector β = {90, 3, 2, 2, 1, 1, 0, …, 0}p+1 to generate the response variable FEV1. By increasing p (with the number of signals fixed at 5), we were able to consider sparse scenarios, reflective of variable selection in omics problems. Here, the mean effect was fixed at 90, and we fixed σ2 = 6, ω2 = 125, and τ2 = 77 in Vi based on our empirical studies to simulate realistic values for the response variable.
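The marker-generation step can be sketched as follows (a minimal illustration with hypothetical names, shown with p = 20 for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, rho = 88, 20, 0.6

# Compound-symmetric covariance: unit variances, correlation rho off-diagonal
A = np.full((p, p), rho)
np.fill_diagonal(A, 1.0)
X = rng.multivariate_normal(np.zeros(p), A, size=n)   # n x p marker matrix

# True coefficients: mean effect 90 plus 5 weak signals, the rest zero
beta = np.zeros(p + 1)
beta[:6] = [90, 3, 2, 2, 1, 1]
```

The compound-symmetric matrix A is positive definite for these values of ρ (its eigenvalues are 1 + (p − 1)ρ and 1 − ρ).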
We then generate 100 simulated data sets. For each setting, we implement the MCMC algorithms described in Section 4. Once all posterior samples are obtained from the MCMC, the covariates whose 95% credible interval does not include zero are selected as relevant (non-zero) effects, while the remaining covariates are excluded from the final model as irrelevant effects. To summarize the results and evaluate the variable selection performance of the proposed procedures over the 100 replications, we report the average number of proteins with non-zero coefficients (signals) correctly included in the final model (true discoveries, TD) and the average number of proteins with zero coefficients incorrectly included in the final model (false discoveries, FD) in Table 1 for the different simulation settings. We also provide the rates of correctly and incorrectly included proteins in the final model. In evaluating performance, it is important to note that the non-zero coefficients used in our simulations are {3, 2, 2, 1, 1}; these signals are not strong (not far from zero), and correctly classifying them as signals is difficult because the signal-to-noise ratio is relatively low.
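The credible-interval selection rule can be sketched as follows (our own illustration; the draws here are synthetic Gaussian samples standing in for actual posterior draws):

```python
import numpy as np

def select_by_ci(samples, level=0.95):
    """Select covariates whose equal-tailed credible interval excludes zero.
    `samples` is an (n_draws, p) array of posterior draws of beta."""
    alpha = 1.0 - level
    lo = np.quantile(samples, alpha / 2, axis=0)
    hi = np.quantile(samples, 1 - alpha / 2, axis=0)
    return (lo > 0) | (hi < 0)                 # True = kept in final model

draws = np.column_stack([
    np.random.default_rng(4).normal(3.0, 0.5, 4000),   # clear signal
    np.random.default_rng(5).normal(0.0, 0.5, 4000),   # noise
])
kept = select_by_ci(draws)
```

Here the first column (centered at 3) is selected while the second (centered at 0) is excluded, mirroring the TD/FD tallies reported below.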
Based on our simulation studies, the proposed procedures are very effective in selecting relevant predictors through inspection of CI overlap with zero. The average number of correctly identified predictors out of 5, across all methods, ranges from 4.94 to 5 for ρ = 0.1, from 4.56 to 4.8 for ρ = 0.6, and from 2.52 to 3.8 for ρ = 0.9. The simulations also indicate that the proposed BEN and BAL procedures outperform the Bayesian lasso (BL) and Bayesian ridge (BR) in terms of variable selection. The average rates of correctly including covariates with non-zero effects in the final model are quite high and close to 1 when ρ is 0.1 or 0.6 (and still above 0.7 when ρ = 0.9 for BEN), while the average rates of incorrectly including covariates with zero coefficients are very low, typically below 0.05, for all procedures and all choices of p and ρ (see Table 1). As expected, BL is slightly more conservative than the other methods and yields sparser results. Increased ρ results in a sparser model, yielding smaller TD and FD. Hence, both the average rate of correctly including covariates with non-zero effects and the average rate of incorrectly including covariates with zero effects decrease as the correlation between the predictors (markers) increases.
TABLE 1.
Variable selection performance based on simulations where true values of 5 non-zero coefficients are {3, 2, 2, 1, 1}, ρ = {0.1, 0.6, 0.9} and variance terms in Vi are fixed as σ2 = 6, ω2 = 125, and τ2 = 77.
| Method | p | ρ = 0.1: TD | FD | RRA | RRN | ρ = 0.6: TD | FD | RRA | RRN | ρ = 0.9: TD | FD | RRA | RRN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BL | 20 | 4.97 | 0.55 | 0.994 | 0.037 | 4.7 | 0.45 | 0.94 | 0.03 | 2.9 | 0.5 | 0.58 | 0.03 |
| BL | 100 | 4.97 | 3.4 | 0.994 | 0.0358 | 4.62 | 2.74 | 0.924 | 0.028 | 2.72 | 1.8 | 0.54 | 0.019 |
| BL | 500 | 4.97 | 5.45 | 0.994 | 0.011 | 4.56 | 4.66 | 0.912 | 0.009 | 2.52 | 2.40 | 0.5 | 0.005 |
| BAL | 20 | 4.99 | 0.49 | 0.998 | 0.033 | 4.7 | 0.39 | 0.94 | 0.026 | 3.5 | 0.8 | 0.7 | 0.053 |
| BAL | 100 | 5 | 3.11 | 1 | 0.032 | 4.65 | 2.5 | 0.93 | 0.026 | 3.3 | 2.4 | 0.66 | 0.025 |
| BAL | 500 | 5 | 4.9 | 1 | 0.01 | 4.6 | 3.85 | 0.92 | 0.008 | 3.11 | 3.4 | 0.62 | 0.007 |
| BEN | 20 | 4.98 | 0.45 | 0.996 | 0.03 | 4.8 | 0.3 | 0.96 | 0.02 | 3.8 | 0.3 | 0.76 | 0.02 |
| BEN | 100 | 5 | 2.657 | 1 | 0.027 | 4.77 | 2.12 | 0.954 | 0.022 | 3.7 | 1.92 | 0.74 | 0.02 |
| BEN | 500 | 5 | 4.41 | 1 | 0.009 | 4.6 | 3.02 | 0.92 | 0.006 | 3.5 | 2.7 | 0.7 | 0.006 |
| BR | 20 | 4.96 | 0.62 | 0.992 | 0.041 | 4.75 | 0.6 | 0.95 | 0.04 | 3.4 | 1.2 | 0.68 | 0.08 |
| BR | 100 | 4.94 | 4 | 0.988 | 0.042 | 4.7 | 3.15 | 0.94 | 0.033 | 3.25 | 4.6 | 0.65 | 0.048 |
| BR | 500 | 5 | 12.97 | 1 | 0.026 | 4.6 | 9.3 | 0.92 | 0.019 | 3.1 | 8.25 | 0.62 | 0.017 |
RRA is the average rate of correctly including covariates with non-zero effects in the final model (i.e., rejecting the null hypothesis when the alternative is true). Similarly, RRN is the average rate of incorrectly including covariates with zero coefficients in the final model (i.e., rejecting a true null hypothesis). TD and FD stand for the average numbers of true and false discoveries, respectively.
6 |. APPLICATIONS WITH PROTEOMICS AND GENOMICS
6.1 |. Cystic Fibrosis Proteomic and Clinical Data Applications
Here, we use the proposed methods to select diagnostic and prognostic proteomic biomarkers and clinical variables for rapid lung function decline in CF. Cross-sectional serum samples were obtained from 88 CF participants (44 severe and 44 mild subjects) aged 6–18 years who presented with stable disease. This sub-cohort was part of the Early Pseudomonas Infection Control Observational (EPIC-Obs) cohort described by Treggiari et al.25. Banked serum samples for our analysis cohort were provided through the Cystic Fibrosis Foundation Biorepository. Samples from mild and severe subjects were selected with 1:1 matching between mild and severe subject pairs based on age, gender, pancreatic insufficiency status, F508del genotype and Pseudomonas aeruginosa infection status. Our analysis cohort had longitudinal information at each clinical encounter, comprising 6718 FEV1 observations and other clinical/demographic characteristics observed both before and after serum collection. The average (range) number of FEV1 observations per individual patient was 76.3 (31–172), with an average (range) follow-up time of 15.7 (11.7–17.9) years. The average (range) age at serum collection was 13.7 (11–18) years.
To identify potential proteomic CF biomarkers, liquid chromatography tandem mass spectrometry (LC MS/MS) was performed on the banked serum samples. This mass spectrometry technique allowed the simultaneous identification and quantification of thousands of proteins present in the samples. Although thousands of proteins were detected and quantified by LC MS/MS, only a small number of proteins are expected to be predictive of lung function, yielding a sparse regression model. Therefore, we apply our proposed regularization methods for model (4) to select relevant proteins. There were ~61,000 protein isoforms identified in at least half of the subjects from each of the mild and severe groups. Although mass spectrometry-based proteomics has the advantage of detecting thousands of proteins from a single experiment, it brings a serious missing data problem, with missingness rates that can exceed 50%.26 To overcome this problem, we first remove the proteins for which the proportion of missing values is greater than 20%. We then impute the missing values for the remaining 5,011 protein isoforms by performing predictive mean matching (pmm) imputation in R using the mice package.27 The proteomic data used in our analyses consist of protein expression levels, which were log2 transformed and included as regressors. We adjusted for age, gender (male=1; female=0) and genotype (homozygous=1; heterozygous=0). These proteomic markers can be included in the model in different ways, but we considered only the main effects. All regressors were standardized. In the first analysis, we implemented our derived variable selection approaches, aiming to select predictive markers among all 5011 biomarkers. However, we were not able to select any proteomic biomarkers based on the estimated parameters and CIs.
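The missingness filter and transformation steps described above can be sketched as follows. This is an illustrative Python analogue with toy data; the actual analysis used pmm imputation via the mice package in R, for which the simple column-mean fill below is only a stand-in.

```python
import numpy as np
import pandas as pd

def preprocess_proteins(df, max_missing=0.20):
    """Drop proteins with more than 20% missing values, log2-transform the
    remaining expression levels, impute, and standardize each column.
    (The column-mean fill is a placeholder for pmm imputation via mice in R.)"""
    keep = [c for c in df.columns if df[c].isna().mean() <= max_missing]
    out = np.log2(df[keep])
    out = out.fillna(out.mean())
    return (out - out.mean()) / out.std()

# Toy expression matrix: P1 has 20% missing (kept), P2 has 60% missing (dropped)
df = pd.DataFrame({
    "P1": [4.0, 8.0, 16.0, 32.0, np.nan],
    "P2": [2.0, np.nan, np.nan, np.nan, 4.0],
})
Z = preprocess_proteins(df)
```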
This was followed by a similar analysis in which only the top 500 or 1000 biomarkers (ordered by adjusted p-values obtained from differential expression analysis) were included in the initial model. Again, we were not able to select any proteomic biomarkers. We were motivated by the need for a novel approach to proteomic biomarker discovery in the CF data, but our findings suggest that there may be limited discoveries.
We performed a second analysis with a limited set of proteins, which were selected using marginal testing of the 5,011 proteins. Specifically, we used the likelihood ratio test to check individually whether including a given protein improves model fit, compared to the null model that does not include that protein. This first phase resulted in the selection of 7 proteins (P3366, P10599, P17302, P14746, P15993, P13849, and P15467). Still including the aforementioned demographic and clinical covariates, we again applied each regularization method to select from this subset of proteins (Figure 1). For this application, we identified P3366 from the list of 7 proteins with BL, BAL, and BEN; BR did not pick any of the proteins based on the 95% CIs for the coefficient estimates. We found that an increased log2-transformed expression level of P3366 was associated with more rapid decline in lung function. We did not shrink the coefficients of the clinical variables (age, gender, and genotype), since the goal was to perform protein selection with these variables already included in the model. Of the variables shown in Figure 1, we found a significant rate of lung function decline based on the coefficient for age, and having the F508del homozygous genotype corresponded to more rapid decline.
FIGURE 1.

Parameter estimates and CIs from all regularization methods for the CF proteomics data example with 7 proteins. The points represent posterior medians for coefficient estimates. Parameters where 50% CIs overlap 0 are indicated by ‘open’ circles. Parameters where 50% CIs don’t overlap 0 and 95% CIs overlap 0 are indicated by ‘closed’ gray circles. Parameters where 95% CIs don’t overlap 0 are indicated by ‘closed’ black circles. Thick lines represent 50% CIs while thin lines represent 95% CIs.
We performed a third analysis, implementing variable selection for a broader list of clinical and demographic features available for the same cohort. These demographic/clinical variables include time-varying variables such as diagnosis of CF-related diabetes (CFRD), infection with Pseudomonas aeruginosa (Pa), infection with Methicillin-resistant Staphylococcus aureus (MRSA), socioeconomic status (seslow), and rolling numbers of exacerbations and clinic visits in the year prior to each clinical encounter (numPEyr and numVisityr, respectively), in addition to gender and genotype. All of these variables and their interactions with age were included in the model, and variable selection was performed by applying the proposed methods. In this case, with age already included in the model, the covariates numPEyr, age*MRSA, age*Pa, and age*numPEyr were selected by BL, BAL, and BEN using the estimated 95% CIs. These three methods additionally picked age*numVisityr when using the estimated 90% CIs. We found that an increased frequency of pulmonary exacerbations and infection with Pa or MRSA were associated with more rapid decline. BR picked only the interactions age*MRSA and age*Pa. Figure 2 presents median estimates of the regression coefficients and CIs for all four proposed models, highlighting the coefficients selected based on 95% CIs. For an MCMC size of 20K, the computation time for BEN was 262h 20m for the proteomics data example with 5,011 proteins, 40m for the proteomics data example with 7 proteins, and 2h 10m for the CF data example with demographic and clinical variables. Computation times were similar across the four models, differing only slightly.
FIGURE 2.

Parameter estimates and CIs from all methods for the CF data with only demographic and clinical variables. The points represent posterior medians for coefficient estimates. Parameters where 50% CIs overlap 0 are indicated by ‘open’ circles. Parameters where 50% CIs don’t overlap 0 and 95% CIs overlap 0 are indicated by ‘closed’ gray circles. Parameters where 95% CIs don’t overlap 0 are indicated by ‘closed’ black circles. Thick lines represent 50% CIs while thin lines represent 95% CIs.
6.2 |. CD4 Yeast Cell-Cycle Genomic Data Analysis
We analyzed a subset of the yeast cell-cycle gene expression data that has been previously considered by others.28,2 These data were collected longitudinally in the CDC15 experiment performed by Spellman et al.,29 in which genome-wide mRNA levels of 6,178 yeast open reading frames were measured over a two-cell-cycle period at the M/G1-G1-S-G2-M stages. Identifying the transcription factors (TFs) that regulate the expression levels of cell cycle-regulated genes is critical to better understanding the mechanism underlying the cell-cycle process. The subset that we analyze in this application was obtained from the PGEE R package30 and consists of 297 cell cycle-regulated genes observed over 4 time points at the G1 stage, together with the standardized binding probabilities of p = 96 TFs obtained from the mixture model approach of Wang, Chen and Li.28 The response variable is the log-transformed gene expression level and the covariates are the binding probabilities of the 96 TFs. The proposed penalized regression models were applied to these data to identify the TFs that influence gene expression at the G1 stage of the cell-cycle process. Again only the βk's, the coefficients of the TFs, are penalized to induce variable selection.
In our analysis, BL identified 3 TFs (MBP1, STB1, and NDD1), BAL identified 4 TFs (MBP1, STB1, NDD1, and MTH1), and BEN identified 5 TFs (MBP1, STB1, NDD1, SWI6, and FKH2), while BR picked 14 TFs (ABF1, FKH2, GRF10.Pho2., HIR1, HIR2, MET4, MTH1, NDD1, SWI4, YAP6, CIN5, HSF1, MCM1, and SKN7). The computation time for BEN was 6h 45m for this example with an MCMC size of 20K, which was long enough for convergence. Although the subjects in this study do not have long sequences of repeated measurements, the proposed methods identified important TFs that have already been verified by biological experiments using genome-wide binding techniques. For example, MBP1 is a crucial transcription factor involved in cell-cycle progression from the G1 to the S stage; NDD1 regulates G2/M genes by binding to their promoters; FKH2 activates its M stage-specific target genes and acts as a cell-cycle activator during the G2 stage; and STB1 encodes a protein that contributes to the regulation of SBF and MBF target genes, with expression that is cell cycle-regulated in the G1 and S stages. BR yielded more discoveries, and all of these additional TFs have been reported as key cell-cycle TFs at different stages. We refer to the studies by Banerjee and Zhang31, Tsai, Lu and Li32, Wang, Chen and Li28, and Cheng et al.33 for additional context on these TFs.
7 |. SENSITIVITY ANALYSIS AND MODEL COMPARISON
For the sensitivity analysis, we focused on the BEN model, since it had the best variable selection performance among the four models in our simulation results. For the regularization parameter λ (λ1 for BEN), a gamma prior that allows conjugacy was used in all four models. The prior density for λ² should approach 0 sufficiently fast as λ² → ∞ (to avoid mixing problems) but should be relatively flat and place high probability near the maximum likelihood estimate.19 The hyperparameters of this gamma prior, δ and r, were set to 1 (the default choice) in the simulations and real data examples in our work. Under this specification, the full conditional distribution of λ² is another gamma with shape parameter p + r and rate parameter δ + Σk τk²/2, where the τk² are the latent variance components. For high-dimensional problems, the shape parameter is not much affected by r, since p is relatively large. For the genomic data, we ran our MCMC sampler with r = 1 and δ = 1, 2, 4, 8; the posterior median for λ was approximately 5.4, 3.9, 2.89, and 2.1, respectively. A similar sensitivity analysis was performed for the CF proteomic data example, and the same behaviour was observed: the posterior median of λ was 2.4, 1.76, 1.26, and 0.93, respectively. However, the posterior medians and 95% credible intervals for the regression coefficients were practically identical to those shown in Figures 2 and 3. In our experience, the variable selection results did not depend heavily on these hyperparameters, which need to be specified by users in advance; however, these parameters are less intuitive to specify than the credible interval level.
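The conjugate update described above can be sketched as a single Gibbs step. We assume a Park-and-Casella-style full conditional, λ² ~ Gamma(p + r, δ + Σk τk²/2), given the latent scale parameters τk²; the τk² values below are synthetic placeholders, not output from the paper's sampler.

```python
import random

# One Gibbs step for the squared tuning parameter lambda^2 under a
# Gamma(r, delta) prior (assumed Park-Casella-style full conditional).
def draw_lambda2(tau2, r=1.0, delta=1.0):
    shape = len(tau2) + r              # p + r: dominated by p when p is large
    rate = delta + sum(tau2) / 2.0
    return random.gammavariate(shape, 1.0 / rate)  # gammavariate takes a scale

random.seed(7)
tau2 = [0.5] * 5011                    # p = 5,011 synthetic latent variances
draws = [draw_lambda2(tau2) for _ in range(2000)]
post_mean = sum(draws) / len(draws)    # concentrates near (p + r) / rate
```

With p = 5,011, changing r from 1 to 8 barely moves the shape parameter, consistent with the robustness to r noted above; the sensitivity to δ enters through the rate.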
FIGURE 3.

Parameter estimates and CIs of coefficients of penalized regression models for the yeast cell-cycle gene expression data. The points represent posterior medians for coefficient estimates. Parameters where 50% CIs overlap 0 are indicated by ‘open’ circles. Parameters where 50% CIs don’t overlap 0 and 95% CIs overlap 0 are indicated by ‘closed’ gray circles. Parameters where 95% CIs don’t overlap 0 are indicated by ‘closed’ black circles. Thick lines represent 50% CIs while thin lines represent 95% CIs.
The variable selection performance of the proposed models was already compared in the previous sections using simulations, by computing the average numbers of true and false discoveries along with error rates such as RRA and RRN (see Table 1).
We now compare the models in terms of model fit and predictive accuracy. We also studied the proposed models with different covariance structures for the real data applications. In addition to integrated Brownian motion, we considered Brownian motion and an exponential covariance34,35,6 and compared the resulting models. Brownian motion is a non-stationary stochastic process that constitutes a continuous-time generalisation of a random walk, in which successive increments are independent of the history of the process; it wanders around an underlying process mean, which is a component of variation. An exponential covariance function for repeated measures within an individual induces a correlation that decays exponentially as the separation in time between measurements increases. The Watanabe-Akaike information criterion (WAIC) and the log pointwise predictive density (lppd)36,37 are used to compare all four models under the different covariance structures. Lower values of WAIC and −2 lppd indicate better model fit. The results presented in Table 2 show that the BEN model with integrated Brownian motion has the best fit and predictive accuracy for both real data examples.
TABLE 2.
Model fit metrics for different choices of covariance structures and models for the CF proteomic and Genomic data examples.
| Model | Metric | CF data: BM | CF data: IBM | CF data: Exp | Genomic data: BM | Genomic data: IBM | Genomic data: Exp |
|---|---|---|---|---|---|---|---|
| BL | −2 lppd | 24560 | 24236 | 24870 | 1504 | 1024 | 2302 |
| BL | WAIC | 35257 | 34976 | 35485 | 2803 | 2078 | 4354 |
| BEN | −2 lppd | 24554 | 24230 | 24858 | 1924 | 870 | 978 |
| BEN | WAIC | 35256 | 34973 | 35481 | 3547 | 1572 | 2171 |
| BAL | −2 lppd | 24562 | 24140 | 25080 | 1568 | 1006 | 2328 |
| BAL | WAIC | 35263 | 34970 | 35842 | 2890 | 2068 | 4373 |
| BR | −2 lppd | 24594 | 24520 | 25186 | 2306 | 1888 | 2440 |
| BR | WAIC | 35291 | 35339 | 35972 | 4564 | 3416 | 5818 |
BM= Brownian motion; IBM= integrated Brownian motion; Exp= exponential covariance; lppd= log pointwise predictive density; WAIC= Watanabe-Akaike information criterion.
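For reference, WAIC and −2 lppd can be computed from an S × n matrix of pointwise log-likelihood evaluations over the posterior draws; this generic sketch follows the formulation of Gelman et al.37 with synthetic numbers, not the paper's fitted models.

```python
import math

def waic(loglik):
    """loglik[s][i] = log p(y_i | theta_s) for draw s, observation i.
    Returns (WAIC, -2 * lppd)."""
    S, n = len(loglik), len(loglik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n):
        col = [loglik[s][i] for s in range(S)]
        m = max(col)
        # log of the posterior-mean density, computed stably (log-sum-exp)
        lppd += m + math.log(sum(math.exp(c - m) for c in col) / S)
        # effective number of parameters: posterior variance of the
        # pointwise log-likelihood, summed over observations
        mean = sum(col) / S
        p_waic += sum((c - mean) ** 2 for c in col) / (S - 1)
    return -2.0 * (lppd - p_waic), -2.0 * lppd

# 3 posterior draws, 2 observations (synthetic values)
loglik = [[-1.0, -2.0], [-1.2, -2.2], [-0.8, -1.8]]
w, neg2lppd = waic(loglik)
```

Because WAIC adds the complexity penalty p_waic to −2 lppd, WAIC is always at least as large as −2 lppd, as the entries of Table 2 reflect.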
Using empirical variograms, we also examined the covariance assumptions for both the CF proteomics and genomics data examples. For the CF proteomic data, the variogram of the raw residuals was calculated using a bin size corresponding to weekly intervals, representing the variance of the difference between within-patient residuals at time lags from 0 to 16 years (left panel, gray line, in Figure A1). This variogram was used to partition the total process variance into three components of variation: between-patient, within-patient, and residual error. The smoothed empirical variogram fit (black line) indicates that the correlation between paired lung function measures decreased as separation in time increased, which suggests non-stationarity. Although this function did not have an asymptote, its shape is similar to what has been obtained in longitudinal FEV1 modeling with the Danish and US CF registries6,5. The graph in the right panel (Figure A1) displays empirical variances across time bins; the increase in variance over time also suggests non-stationarity.
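The empirical variogram itself is simple to sketch: half the squared differences of within-subject residual pairs, averaged within time-lag bins. The data below are synthetic toy values; the actual analysis used weekly bins over lags of 0 to 16 years.

```python
def empirical_variogram(times, resids, bin_width=1.0):
    """Bin-averaged semivariances 0.5 * (r_j - r_k)^2 for one subject's
    residuals, keyed by integer lag bin."""
    bins = {}
    n = len(times)
    for j in range(n):
        for k in range(j + 1, n):
            lag = abs(times[j] - times[k])
            b = int(lag // bin_width)
            bins.setdefault(b, []).append(0.5 * (resids[j] - resids[k]) ** 2)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

# toy residual series for a single subject at four annual visits
times = [0.0, 1.0, 2.0, 3.0]
resids = [0.0, 1.0, 0.0, 2.0]
vg = empirical_variogram(times, resids)
```

Plotting these bin averages against lag, as in Figure A1, shows how within-subject variability changes with separation in time; a variogram that keeps rising without an asymptote is one diagnostic of non-stationarity.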
8 |. CONCLUSIONS
We incorporated a variety of Bayesian shrinkage approaches to perform variable selection for a Gaussian linear mixed effects model with non-stationary covariance structure, thereby incorporating the complicated structure of long sequences of repeated, mistimed measurements on the response variable. We demonstrated through simulation studies that our framework is well suited for identifying sparse predictors when the overarching candidate set is of high dimension. We illustrated our approaches using proteomic, clinical and genomic data sources.
After assessing a variety of scenarios through simulation studies, we found the proposed procedures to be very effective in selecting relevant predictors, correctly estimating the coefficients and producing CIs. Our simulations demonstrate that the proposed BEN and BAL procedures worked best at correctly identifying the covariates with non-zero coefficients and excluding the irrelevant covariates from the final model. The average rate of correctly including covariates with non-zero effects in the final model is quite high and close to 1 for all procedures across different choices of p and ρ (except when ρ = 0.9, where this rate was 0.7–0.76 for BEN and lower for the other models), while the average rate of incorrectly including covariates with zero coefficients is very low (predominantly below 0.05) for all procedures (Table 1).
Although BEN and BAL were superior in variable selection, each of the proposed regularization models may have advantages depending on the study goals and the nature of the data. Based on the real data applications performed in this work, the results from BEN are similar to those from BAL, while BL is more conservative and yields sparser results. Thus, BL could be useful if interest lies in narrowing the list of potential predictors and associations between predictors are not highly relevant. We additionally investigated the models with different covariance structures, such as Brownian motion and exponential covariance, for the real data applications. Based on our analyses, the BEN model with integrated Brownian motion had the best fit and predictive accuracy for both real data examples (it yields the lowest WAIC and −2 lppd among the models considered). For the CF proteomic data application, our second analysis identified a single proteomic biomarker, P3366, after adjusting for age, gender, and genotype. Although the initial application indicated that no proteins were predictive out of the larger analysis set, further investigation with other cohorts may be warranted. Our findings suggest that proteomic biomarker discoveries in CF may be limited. When we focused on the selection of clinical/demographic variables, we were able to identify additional markers. The purpose of this additional application was to assess the utility of the approaches when restricting the focus to routinely collected characteristics, as taking serum samples is not part of typical clinical care. For the genomic data example, our approaches corroborated TFs that have been identified previously in the literature.
Overall, we suggest the use of BEN over the others for general applications, since it had the best variable selection performance in our numerical experiments. Although BR is fastest in terms of computation time, BEN has the best convergence rate and is therefore computationally more efficient, which is another reason to prefer BEN among these four procedures. Although these methods are proposed for large-scale regression problems with p ≫ n, they can also be used for small regression problems (p < n), as we illustrated in our simulation studies with various sizes of p. Our work was motivated by omics experiments, but the methods can be applied to any dataset appropriate for model (4), such as large-scale health studies where many variables are collected frequently over time. Furthermore, model (4) includes some simpler models as special cases: it reduces to the uniform correlation model if ϕ2 = 0, and further reduces to the traditional linear model with independent errors if we additionally assume ϕ1 = 0. The first case can sometimes be a useful approximation, but the second is rarely reasonable in practice. Hence, the penalized regression models provided in our work include regularized versions of some commonly used simpler models. Although we assumed integrated Brownian motion for the stochastic process term Wi(t) in model (4), one can consider different types of stationary or non-stationary processes, whichever is appropriate for the problem of interest. Depending on the choice of process, one can use exactly the same Bayesian models or may need to assign priors to any additional parameters introduced by that choice.
For instance, if we consider Brownian motion instead of integrated Brownian motion, the model specification remains the same; only the covariance matrix Ri changes, with the (j, k)th element of Ri set to min(tij, tik), requiring no modification of the framework.
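Both kernels can be written down directly as a sketch of this swap (unit variance scale assumed): the min-based Brownian motion kernel is as given above, and for integrated Brownian motion we use the standard covariance Cov(s, t) = min²(3·max − min)/6.

```python
def bm_cov(s, t):
    """Brownian motion covariance: Cov(W(s), W(t)) = min(s, t)."""
    return min(s, t)

def ibm_cov(s, t):
    """Integrated Brownian motion covariance (standard form):
    Cov(W(s), W(t)) = min^2 * (3*max - min) / 6, so Var(W(t)) = t^3 / 3."""
    lo, hi = min(s, t), max(s, t)
    return lo * lo * (3.0 * hi - lo) / 6.0

def cov_matrix(times, kernel):
    """Build Ri for one subject's measurement times under a given kernel."""
    return [[kernel(s, t) for t in times] for s in times]

R_bm = cov_matrix([1.0, 2.0, 3.0], bm_cov)
R_ibm = cov_matrix([1.0, 2.0, 3.0], ibm_cov)
```

Swapping `bm_cov` for `ibm_cov` is the only change required, which is what makes comparing covariance structures (as in Table 2) straightforward in this framework.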
Given the breadth of our applications, some limitations were encountered. First, our model setup does not account for matching. In the CF study, participants were matched 1:1 between the mild and severe groups. Future work could extend model (2.2) by including appropriate nested effects to account for matching. Another limitation is the degree of missing data among the protein isoforms. The missing data mechanism in proteomics experiments is largely unexplored, because it is complicated by the way in which mass spectrometry detects protein isoforms and by the subsequent filtering the user must apply after running the experiment in order to catalog isoforms. While a wide range of isoforms can be cataloged, some isoforms have limited observed data from the samples, resulting in a high degree of missingness for a large portion of the isoforms. As a practical approach, we limited the analysis to isoforms observed in at least half of the subjects in each group. It is possible that proteins completely missing in a given group are related to severity, but this would require further study. We undertook various imputation strategies and found pmm to be the most realistic for the isoforms used in our application. Empirical studies and new strategies are needed to fully understand the impact of missing protein isoforms and expression levels and how to accommodate them.
FDR-controlling methods are alternative approaches commonly used in proteomics/genomics to identify important proteins or genes, but their results have been shown to disagree with more classical approaches, leading to incorrect exclusions of well-known proteins, including albumin (which makes up 94% of blood protein). This results in a large number of false negatives in protein identification.38
Since our findings for proteomic biomarker discovery were limited, the proposed methods could be further extended by incorporating pathway or network structures of the proteins/genes. In omics datasets, such as proteomics or genomics, proteins or genes in the same pathway or network form natural groups; these grouping structures should not be ignored.39 Others have noted the interest in developing variable selection procedures that account for pathway and network information in proteomic/microarray data to achieve improved predictive accuracy.40,41,42 For instance, a Bayesian group lasso for longitudinal GWAS data was considered by Li, Wang and Wu17, who showed improved variable selection and prediction performance. Hence, as a future development, one could expand these Bayesian penalized regression models to incorporate the grouping structure of the proteins by implementing a Bayesian group lasso informed by pathway or network analysis. This may improve the variable selection and estimation performance of the proposed approaches while still allowing for a response variable with complex longitudinal correlation and measurement error. Additional future work could implement Bayesian regularization methods for our linear mixed effects model using global-local approaches (e.g. the horseshoe prior), normal-gamma priors, and standard spike-and-slab priors.
ACKNOWLEDGMENTS
Clinical and demographic data access has been restricted due to an information use agreement with the Patient Registry Committee and Biomarker Consortium of the Cystic Fibrosis Foundation, and proteomics data have been restricted due to the sensitive nature of its contents. Anyone wishing to acquire the registry data may contact and request permission from the Cystic Fibrosis Foundation via: datarequests@cff.org; proteomics data requests may be sent to author AZ.
Financial disclosure
This work was supported by the National Institutes of Health (NIH) under Grants K25HL125954 and R01HL141286 (Szczesniak), and R01HL142210, U54HL119810 and R61HL154105 (Ziady); the Cystic Fibrosis Foundation (CFF) under Grant CFF ZIADY18P0 and ZIADY11A0 (Ziady), and GECILI20F0 (Gecili), and CLANCY15R0 (Clancy). The study of the cystic fibrosis data received approval for the Institutional Review Board at Cincinnati Children’s Hospital Medical Center (IRB Approval No. 2018-2859).
APPENDIX
FIGURE A1.

Variogram based on the raw residuals against separation in time (in years), with the dashed line representing the total process variance (456.4); the smooth black line is the empirical variogram function, with upper and lower dashed lines (460.4 and 41.64, respectively) marking partitions for between-patient variance, within-patient variance, and residual error. The graph in the right panel presents the variances of the raw residuals over follow-up time (in years) for the CF proteomic data.
Footnotes
Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
- 1.Cohen Freue GV, Kepplinger D, Salibián-Barrera M and Smucler E. Robust elastic net estimators for variable selection and identification of proteomic biomarkers. Ann. Appl. Stat 2019; 13: no. 4, 2065–2090. doi: 10.1214/19-AOAS1269 [DOI] [Google Scholar]
- 2.Wang L, Zhou J and Qu A. Penalized generalized estimating equations for high dimensional longitudinal data analysis. Biometrics 2012; 68: 353–360. [DOI] [PubMed] [Google Scholar]
- 3.Morgan WJ, Wagener JS, Yegin A, Pasta DJ, Millar SJ, Konstan MW,… coordinators of the Epidemiologic Study of Cystic Fibrosis. Probability of treatment following acute decline in lung function in children with cystic fibrosis is related to baseline pulmonary function. The Journal of pediatrics 2013;163(4): 1152–7.e2. doi: 10.1016/j.jpeds.2013.05.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Diggle PJ, Sousa I and Asar O. Real-time monitoring of progression towards renal failure in primary care patients. Biostatistics 2015; 16: 3, pp. 522–536. [DOI] [PubMed] [Google Scholar]
- 5.Szczesniak RD, Su W, Brokamp C, Keogh R, Pestian J, Seid M, Diggle PJ and Clancy JP. Dynamic predictive probabilities to monitor rapid cystic fibrosis disease progression. Statistics in Medicine. 2019; 10.1002/sim.8443. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Taylor RD, Whitehead M, Diderichsen F, Olesen HV, Pressler T, Smyth RL and Diggle P. Understanding the natural progression in FEV1 decline in patients with cystic fibrosis: a longitudinal study. Thorax 2012; 67: 860–866. DOI: 10.1136/thoraxjnl-2011-200953. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Szczesniak RD, McPhail GL, Duan LL, Macaluso M, Amin RS and Clancy JP. A semiparametric approach to estimate rapid lung function decline in cystic fibrosis. Ann Epidemiol 2013;23(12):771–777. doi: 10.1016/j.annepidem.2013.08.009 [DOI] [PubMed] [Google Scholar]
- 8.Laird NM and Ware JH. Random-effects models for longitudinal data. Biometrics 1982; 963–974. [PubMed] [Google Scholar]
- 9.Asar O, Bolin D, Diggle PJ and Wallin J. Linear mixed effects models for non-Gaussian continuous repeated measurement data. Appl.Statist 2020; 69: Part 5, pp. 1–39. [Google Scholar]
- 10.Tibshirani R Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996; 267–288. [Google Scholar]
- 11.Zou H The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc 2006; 101:476, 1418–1429. [Google Scholar]
- 12.Zou H and Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B 2005; 67: 301–320 [Google Scholar]
- 13.Kyung M, Gill J, Ghosh M and Casella G. Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 2010; 5(2):369–411. [Google Scholar]
- 14.Mallick H and Yi N. Bayesian Methods for High Dimensional Linear Models. J Biom Biostat. 2014; 1:005. doi: 10.4172/2155-6180.S1-005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Erp SV, Oberski DL and Mulder J. Shrinkage Priors for Bayesian Penalized Regression. Journal of Mathematical Psychology 2019; 89: 31–50. [Google Scholar]
- 16.Li Z, Hallingbäck HR, Abrahamsson S, Fries A, Gull BA, Sillanpää MJ and García-Gil MR. Functional multi-locus QTL mapping of temporal trends in Scots pine wood traits. G3 (Bethesda, Md.) 2014; 4(12), 2365–2379. doi: 10.1534/g3.114.014068 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Li J, Wang Z, Li R and Wu R. Nonparametric varying-coefficient models with application to functional genome-wide association studies. The annals of applied statistics 2015; 9(2), 640–664. doi: 10.1214/15-AOAS808 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Tsagris M, Lagani V and Tsamardinos I. Feature selection for high-dimensional temporal data. BMC Bioinformatics 2018; 19: 17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Park T and Casella G. The Bayesian Lasso. JASA 2008; 103: 681–686. [Google Scholar]
- 20.Scott J and Berger J. An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference. 2006; 136: 2144–2162. 10.1016/j.jspi.2005.08.031. [DOI] [Google Scholar]
- 21.Fan J and Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001; 96(456), 1348–1360. [Google Scholar]
- 22.Hoerl A and Kennard R. Ridge regression. In Encyclopedia of Statistical Sciences. 1988; 8: pp. 129–136. New York: Wiley. [Google Scholar]
- 23.Li Q and Nan L. The Bayesian elastic net. Bayesian Analysis 2010; 5(1):151–170. [Google Scholar]
- 24.R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2020; URL http://www.R-project.org/. [Google Scholar]
- 25.Treggiari MM, Rosenfeld M, Mayer-Hamblett N, et al. Early anti-pseudomonal acquisition in young patients with cystic fibrosis: rationale and design of the EPIC clinical trial and observational study. Clin Trials. 2009;30(3):256–268. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Karpievitch YV, Dabney AR, and Smith RD. Normalization and missing value imputation for label-free LC-MS analysis. BMC bioinformatics. 2012; 13 Suppl 1:S5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.van Buuren S and Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 2011; 45(3), 1–67. doi: 10.18637/jss.v045.i03 [DOI] [Google Scholar]
- 28.Wang L, Chen G and Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics 2007; 23: 1486–1494. [DOI] [PubMed] [Google Scholar]
- 29.Spellman PT, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998; 9(12):3273–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Inan G and Wang L. PGEE: An R Package for Analysis of Longitudinal Data with High-Dimensional Covariates. The R Journal 2017; 1 (9):393–402. 10.32614/RJ-2017-030 [DOI] [Google Scholar]
- 31.Banerjee N and Zhang MQ. Identifying cooperativity among transcription factors controlling the cell cycle in yeast. Nucleic Acids Res. 2003; 31, 7024–7031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Tsai HK, Lu HHS and Li WS. Statistical methods for identifying yeast cell cycle transcription factors. PNAS 2005; 102: 13532–13537. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Cheng C, Fu Y, Shen L and Garstein M. Identification of yeast cell cycle regulated genes based on genomic features. BMC System Biology. 2013; 7:70, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Robinson GK. Continuous time Brownian motion models for analysis of sequential data. Journal of the Royal Statistical Society: Series C (Applied Statistics). 59: 477–494. 10.1111/j.1467-9876.2009.00705.x [DOI] [Google Scholar]
- 35.Diggle P, Ghosh M, Liang KY. Analysis of Longitudinal Data. 2nd edn. Oxford: Oxford University Press, 2002. [Google Scholar]
- 36.Watanabe S Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research 2010; 11, 3571–3594. [Google Scholar]
- 37.Gelman A, Hwang J, and Vehtari A. Understanding predictive information criteria for Bayesian models. Stat Comput 2014; 24, 997–1016. 10.1007/s11222-013-9416-2 [DOI] [Google Scholar]
- 38.Marshall J, Bowden P, Schmit JC and Betsou F. Creation of a federated database of blood proteins: a powerful new tool for finding and characterizing biomarkers in serum. [DOI] [PMC free article] [PubMed]
- 39.Xu X and Ghosh M. Bayesian variable selection and estimation for group Lasso. Bayesian. Anal 2015; 10(4):909–936. [Google Scholar]
- 40.Stingo FC, Chen YA, Tadesse MG, and Vannucci M. Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes. The Annals of Applied Statistics. 2011; 5(3): 1978–2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Li J, Das K, Fu G, Li R and Wu R. The Bayesian lasso for genome-wide association studies. Bioinformatics. 2011; 27: 516–523. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Peterson CB, Stingo FC and Vannucci M. Joint Bayesian variable and graph selection for regression models with network-structured predictors. Stat Med. 2016. Mar 30;35(7):1017–31. doi: 10.1002/sim.6792. [DOI] [PMC free article] [PubMed] [Google Scholar]
