Author manuscript; available in PMC: 2012 Aug 1.
Published in final edited form as: Stat Probab Lett. 2011 Aug 1;81(8):1056–1062. doi: 10.1016/j.spl.2011.02.029

Sparse Variational Analysis of Linear Mixed Models for Large Data Sets

Artin Armagan 1,, David Dunson 2
PMCID: PMC3138673  NIHMSID: NIHMS281971  PMID: 21779136

Abstract

It is increasingly common to be faced with longitudinal or multi-level data sets that have large numbers of predictors and/or a large sample size. Current methods of fitting and inference for mixed effects models tend to perform poorly in such settings. When there are many variables, it is appealing to allow uncertainty in subset selection and to obtain a sparse characterization of the data. Bayesian methods are available to address these goals using Markov chain Monte Carlo (MCMC), but MCMC is very computationally expensive and can be infeasible in large p and/or large n problems. As a fast approximate Bayes solution, we propose a novel approximation to the posterior relying on variational methods. Variational methods are used to approximate the posterior of the parameters in a decomposition of the variance components, with priors chosen to obtain a sparse solution that allows selection of random effects. The method is evaluated through a simulation study and applied in an epidemiological study.

Keywords: Mixed-effects model, Variational approximations, Shrinkage estimation

1. Introduction

It is often of interest to fit a hierarchical model in settings involving large numbers of predictors (p) and/or large sample size (n). For example, in a large prospective epidemiology study, one may obtain longitudinal data for tens of thousands of subjects, while also collecting ~ 100 predictors. Even in more modest studies, involving thousands of subjects, the number of predictors collected is often large. Unfortunately, current methods for inference in mixed effects models are not designed to accommodate large p and/or large n. This article proposes a method for obtaining sparse approximate Bayes inferences in such problems using variational methods (Jordan et al., 1999; Jaakkola and Jordan, 2000).

For concreteness we focus on the linear mixed effects (LME) model (Laird and Ware, 1982), though the proposed methods can be applied directly in many other hierarchical models. When considering LMEs in settings involving moderate to large p, it is appealing to consider methods that encourage sparse estimation of the random effects covariance matrix. There are a variety of methods available in the literature, including approaches based on Bayesian methods implemented with MCMC (Chen and Dunson, 2003; Kinney and Dunson, 2007; Frühwirth-Schnatter and Tüchler, 2008) and methods based on fast shrinkage estimation (Foster et al., 2007).

Frequentist procedures encounter convergence problems (Pennell and Dunson, 2007), and MCMC-based methods tend to be computationally intensive and to scale poorly as p and/or n increases. Methods relying on stochastic search variable selection (SSVS) algorithms (George and McCulloch, 1997) face difficulties when p increases beyond ~30 in linear regression applications, with the computational burden substantially greater in hierarchical models involving random effects selection. Approaches have been proposed to make MCMC implementations of hierarchical models feasible in large data sets (Huang and Gelman, 2008; Pennell and Dunson, 2007). However, these approaches do not solve the large p problem or allow sparse estimation or selection of random effects covariances. In addition, the algorithms are still time-consuming to implement.

Model selection through shrinkage estimation has gained much popularity since the Lasso of Tibshirani (1996). Similar shrinkage effects were later obtained through hierarchical modeling of the regression coefficients in the Bayesian paradigm. A few examples of these are Tipping (2001); Bishop and Tipping (2000); Figueiredo (2003); Park and Casella (2008). Most approaches have relied on maximum a posteriori (MAP) estimation. MAP estimation produces a sparse point estimate with no measure of uncertainty, motivating MCMC and variational methods.

It would be appealing to have a fast approach that could be implemented much more rapidly in cases involving moderate to large data sets and numbers of variables, while producing sparse estimates and allowing approximate Bayesian inferences on predictor effects. In particular, it would be very appealing to have an approximation to the marginal posteriors instead of simply obtaining a point estimate. Basing inferences on point estimates does not account for uncertainty in the estimation process, which limits their usefulness in applications such as epidemiology.

One possibility is to rely on a variational approximation to the posterior distribution (Jordan et al., 1999; Jaakkola and Jordan, 2000; Bishop and Tipping, 2000). Within this framework, we develop a method for sparse covariance estimation relying on a decomposition and the use of heavy-tailed priors, in a manner related to Tipping (2001), though that work did not consider estimation of covariance matrices.

2. Variational inference

Except in very simple conjugate models, the marginal likelihood of the data is not available analytically. As an alternative to MCMC and Laplace approximations (Tierney and Kadane, 1986), a lower-bound on marginal likelihoods may be obtained via variational methods (Jordan et al., 1999) yielding approximate posterior distributions on the model parameters. Let θ be the vector of all unobserved quantities in the model and y be the observed data. Given a density q(θ), the marginal log-likelihood can be decomposed as

$$\log p(y) = \underbrace{\int q(\theta)\log\frac{p(y,\theta)}{q(\theta)}\,d\theta}_{L} + \mathrm{KL}(q\,\|\,p), \tag{1}$$

where p(y, θ) is the unnormalized posterior density of θ and KL(q‖p) denotes the Kullback-Leibler divergence between q(θ) and the true posterior p(θ|y). Since the divergence is strictly non-negative and equals 0 only when q(θ) = p(θ|y), the first term in (1) constitutes a lower-bound, L, on log p(y). Maximizing the first term on the right-hand side of (1) is equivalent to minimizing the second, so the maximizing q(θ) serves as an approximation to the posterior density p(θ|y).
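The decomposition (1) can be checked numerically on a toy conjugate model that is not from the paper: y ~ N(θ, 1) with θ ~ N(0, 1), so that p(θ|y) = N(y/2, 1/2) and p(y) = N(0, 2). For an arbitrary Gaussian q, the lower bound plus the KL gap recovers log p(y) exactly:

```python
# Numeric check of decomposition (1) on a fully conjugate toy model
# (illustrative example; the values of y, m, s are arbitrary).
import numpy as np

y, m, s = 1.0, 0.3, 0.8            # datum and an arbitrary q = N(m, s^2)
# ELBO = E_q[log p(y, theta)] + entropy of q
elbo = (-np.log(2 * np.pi)
        - ((y - m) ** 2 + s ** 2) / 2      # E_q[(y - theta)^2] / 2
        - (m ** 2 + s ** 2) / 2            # E_q[theta^2] / 2
        + 0.5 * np.log(2 * np.pi * np.e * s ** 2))
mu, sig2 = y / 2, 0.5                      # exact posterior N(y/2, 1/2)
kl = np.log(np.sqrt(sig2) / s) + (s ** 2 + (m - mu) ** 2) / (2 * sig2) - 0.5
log_py = -0.5 * np.log(2 * np.pi * 2) - y ** 2 / 4   # log N(y; 0, 2)
print(elbo + kl - log_py)                  # ~0: ELBO + KL = log p(y)
```

Since KL ≥ 0, the ELBO is indeed a lower bound on log p(y), with equality only when q matches the exact posterior.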

Following Bishop and Tipping (2000) we consider a factorized form

$$q(\theta) = \prod_i q_i(\theta_i), \tag{2}$$

where θi is a sub-vector of θ and there are no restrictions on the form of qi(θi). The lower-bound can then be maximized with respect to qi(θi), yielding the iterative procedure

$$q_i(\theta_i) = \frac{\exp\langle\log p(y,\theta)\rangle_{\theta_{j\neq i}}}{\int\exp\langle\log p(y,\theta)\rangle_{\theta_{j\neq i}}\,d\theta_i}, \tag{3}$$

where $\langle\cdot\rangle_{\theta_{j\neq i}}$ denotes the expectation with respect to the distributions $q_j(\theta_j)$ for $j \neq i$. Due to conjugacy, these expectations are easily evaluated, yielding standard distributions for qi(θi) with parameters expressed in terms of the moments of the θj. The procedure thus consists of initializing the required expectations and iterating through them, updating each expectation with respect to the densities given by (3).

Due to the factorized form in (2), information on the dependence structure among the θi is lost in these approximations. This is the trade-off for the computational advantages. It is this factorized form, together with the use of conjugate priors, that lets us identify the qi(θi) as well-known densities, making the required moments readily available. However, since the factorization is applied in block form, the dependence structure within each block θi is preserved. The approximation may become less accurate with increasing dimension if there are important dependencies among the θi.
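The cycle implied by (3) can be made concrete on a textbook conjugate example that is not from the paper: a normal model y_i ~ N(μ, 1/τ) with priors μ|τ ~ N(μ0, 1/(λ0 τ)) and τ ~ G(a0, b0), and a factorization q(μ, τ) = q(μ)q(τ). Each factor depends on the other only through a few moments, so the updates cycle exactly as described above (all variable names below are our own):

```python
# Mean-field variational cycle for a normal model with unknown mean and
# precision; each update is the exponentiated expected log joint, as in (3).
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=500)
N, ybar = len(y), y.mean()
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3   # vague conjugate priors

Etau = 1.0                                   # initialize <tau>
for _ in range(50):
    # q(mu) = N(m, s2): depends on q(tau) only through <tau>
    m = (lam0 * mu0 + N * ybar) / (lam0 + N)
    s2 = 1.0 / ((lam0 + N) * Etau)
    # q(tau) = G(a, b): depends on q(mu) through <mu> and <mu^2>
    a = a0 + (N + 1) / 2
    b = b0 + 0.5 * (np.sum((y - m) ** 2) + N * s2
                    + lam0 * ((m - mu0) ** 2 + s2))
    Etau = a / b                             # refresh the required moment

print(m, 1.0 / Etau)   # approx posterior mean of mu and of 1/tau
```

Because each factor is a recognizable density, the "required moments" are available in closed form, which is the property exploited throughout the paper.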

In the following text G, IG, IW, N, U, W will respectively denote gamma, inverse gamma, inverse Wishart, normal, uniform and Wishart distributions.

3. The Model

3.1. The Standard Model

Suppose there are n subjects under study, with ni observations on the ith subject. For subject i at observation j, let yij denote the response, and let xij and zij denote p × 1 and q × 1 vectors of predictors. A linear mixed effects model can then be written as

yi=Xiα+Ziβi+εi, (4)

where yi = (yi1, …, yini)′, Xi = (xi1, …, xini)′, Zi = (zi1, …, zini)′, α is a p × 1 vector of unknown fixed effects, βi is a q × 1 vector of unknown subject-specific random effects with βi ~ N(0, D), and the residual vector satisfies εi ~ N(0, σ2I).
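For concreteness, data from model (4) can be simulated as follows (the dimensions, parameter values, and variable names are illustrative choices of ours, not taken from the paper):

```python
# Simulating from model (4): y_i = X_i alpha + Z_i beta_i + eps_i.
import numpy as np

rng = np.random.default_rng(0)
n, ni, p, q = 100, 8, 3, 2
alpha = np.array([1.0, -0.5, 0.0])              # fixed effects
D = np.array([[1.0, 0.3], [0.3, 0.5]])          # random-effects covariance
sigma2 = 0.25                                   # residual variance

y, Xs, Zs = [], [], []
for i in range(n):
    Xi = np.column_stack([np.ones(ni), rng.normal(size=(ni, p - 1))])
    Zi = Xi[:, :q]                              # random effects on first q cols
    beta_i = rng.multivariate_normal(np.zeros(q), D)   # beta_i ~ N(0, D)
    y.append(Xi @ alpha + Zi @ beta_i + rng.normal(0, np.sqrt(sigma2), ni))
    Xs.append(Xi); Zs.append(Zi)
```

Observations within a subject share the same draw of beta_i, which is what induces the within-subject correlation the model is designed to capture.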

Given the formulation in (4), the joint density of the observations given the model parameters can be written as

$$p(y\mid\alpha,\beta,\sigma^2) = (2\pi\sigma^2)^{-\sum_{i=1}^n n_i/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\sum_{j=1}^{n_i}\left(y_{ij}-x_{ij}'\alpha-z_{ij}'\beta_i\right)^2\right\}. \tag{5}$$

For conditionally-conjugate priors, inference is straightforward both in exact (via MCMC) and approximate (via variational methods) cases. To save space, the variational approximations to the posteriors of the model parameters will be given only for the shrinkage model explained in the following section.

3.2. The Shrinkage Model

Let D = ΛBΛ, where B is a symmetric, positive-definite matrix and Λ = diag(λ1, …, λq) with each λk ∈ ℝ unrestricted. This decomposition is not unique, yet it suffices to guarantee the positive-semidefiniteness of D.

Let us re-write (4) as

yi=Xiα+ZiΛbi+εi, (6)

where bi ~ N(0, B) and λk acts as a scaling factor on the kth row and column of the random effects covariance D. Although the parameterization is redundant, redundant parameterizations are often useful for computational reasons and for inducing new classes of priors with appealing properties (Gelman, 2006). The incorporation of λk allows greater control over adaptive, predictor-dependent shrinkage: values λk ≈ 0 (along with small corresponding diagonals in B) lead to the kth predictor being effectively excluded from the random effects component of the model, by setting the values in the kth row and column of D close to zero while maintaining the positive-semidefinite constraint. One issue with the redundant parameterization is the lack of identifiability in a frequentist sense, i.e. the likelihood comprises multiple ridges along possible combinations of Λ and B. This does not create insurmountable difficulties for Bayesian procedures, as a non-flat prior takes care of the problem. When MCMC is used, the sampling of λk and B occurs along these ridges where the prior assigns positive density. Multiple modes exist because each λk may take either sign.
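The effect of the decomposition D = ΛBΛ is easy to verify numerically: setting λk = 0 zeroes the kth row and column of D while D remains positive semidefinite for any positive-definite B (the particular B and λ below are arbitrary illustrative values):

```python
# D = Lambda B Lambda: lambda_k = 0 excludes the k-th random effect
# while preserving positive semidefiniteness.
import numpy as np

rng = np.random.default_rng(2)
q = 4
A = rng.normal(size=(q, q))
B = A @ A.T + 1e-6 * np.eye(q)            # arbitrary symmetric positive-definite B
lam = np.array([1.3, 0.0, -0.7, 2.1])     # lambda_2 = 0: second effect excluded
Lam = np.diag(lam)
D = Lam @ B @ Lam
print(D[1])                               # zero row
print(np.linalg.eigvalsh(D).min())        # no negative eigenvalues
```

This is why shrinking the λk toward zero performs random effects selection without ever leaving the space of valid covariance matrices.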

The variational procedure will converge to one of multiple exchangeable modes lying on the aforementioned ridges in the posterior. Tracking the lower-bound plays an important role in stopping the iterative procedure. Once the lower-bound stops its monotonic increase (according to some preset criterion), we stop and assume that any further change in λk and B will not change inferences, as we are merely moving along one of these ridges.

The joint density of the observations now can be written as

$$p(y\mid\alpha,\Lambda,b,\sigma^2) = (2\pi\sigma^2)^{-\sum_{i=1}^n n_i/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\sum_{j=1}^{n_i}\left(y_{ij}-x_{ij}'\alpha-z_{ij}'\Lambda b_i\right)^2\right\}, \tag{7}$$

where b = (b1′, …, bn′)′.

3.2.1. Priors and Posteriors

After this decomposition the joint (conditional) density of the observations remains almost identical, with Λbi replacing βi in (5). Notice that λk and bik enter interchangeably, which allows us to model the λk as redundant random-effects coefficients.

We will use independent t priors for αk and λk due to their shrinkage-inducing behavior. This is accomplished through scale mixtures of normals (West, 1987), which preserve conjugacy: αk ~ N(0, ak) and λk ~ N(0, νk), where $a_k^{-1}, \nu_k^{-1} \sim G(\eta_0, \zeta_0)$. Under this setup, those αk and λk corresponding to insignificant fixed and random effects should shrink towards the neighborhood of zero. This allows us to obtain a much smaller set of fixed effects for prediction purposes as well as a much more compact covariance structure on the random-effects coefficients, which is especially useful in high-dimensional problems. We also set σ2 ~ IG(c0, d0) and B ~ IW(n0, Ψ0).
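That the normal scale mixture yields a t marginal can be checked by Monte Carlo: with αk | ak ~ N(0, ak) and ak−1 ~ G(η, ζ), the marginal of αk is t with 2η degrees of freedom and scale √(ζ/η), so its variance is ζ/(η − 1) when η > 1. The values of η and ζ below are chosen for this check only (larger than the vague values used later in the paper, so that the variance exists):

```python
# Monte Carlo check of the scale-mixture representation of the t prior:
# alpha | a ~ N(0, a), a^{-1} ~ Gamma(eta, rate=zeta)  =>  alpha ~ t_{2*eta}.
import numpy as np

rng = np.random.default_rng(3)
eta, zeta = 3.0, 2.0                    # illustrative values with eta > 1
M = 400_000
a_inv = rng.gamma(shape=eta, scale=1.0 / zeta, size=M)   # a^{-1} draws
alpha = rng.normal(0.0, np.sqrt(1.0 / a_inv))            # mix over scales
print(alpha.var(), zeta / (eta - 1.0))  # sample variance vs. t variance
```

The heavy tails of the resulting t prior are what allow large coefficients to escape shrinkage while small ones are pulled toward zero.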

The approximate marginal posterior distributions of the parameters, using (3), are obtained as $q(\alpha) = N(\hat{\alpha}, \hat{A})$, $q(\sigma^2) = IG(\hat{c}, \hat{d})$, $q(b_i) = N(\hat{b}_i, \hat{B}_i)$, $q(\lambda) = N(\hat{\lambda}, \hat{V})$, $q(B) = IW(\hat{n}, \hat{\Psi})$, $q(a_k^{-1}) = G(\hat{\eta}, \hat{\zeta}_k)$ and $q(\nu_k^{-1}) = G(\eta^*, \zeta_k^*)$, where

$$
\begin{aligned}
\hat{\alpha} &= \langle\sigma^{-2}\rangle\,\hat{A}\sum_{i=1}^n X_i'\big(y_i - Z_i\langle\Lambda\rangle\langle b_i\rangle\big), \qquad
\hat{A} = \langle\sigma^{-2}\rangle^{-1}\Big(\sum_{i=1}^n X_i'X_i + \langle\sigma^{-2}\rangle^{-1}\langle A^{-1}\rangle\Big)^{-1},\\
\hat{c} &= \sum_{i=1}^n n_i/2 + c_0,\\
\hat{d} &= \frac{1}{2}\sum_{i=1}^n\Big(y_i'y_i - 2\langle\alpha\rangle'X_i'y_i - 2\langle b_i\rangle'\langle\Lambda\rangle Z_i'y_i + \sum_{j=1}^{n_i} x_{ij}'\langle\alpha\alpha'\rangle x_{ij}\\
&\qquad\qquad + \sum_{j=1}^{n_i} z_{ij}'\big(\langle\lambda\lambda'\rangle\circ\langle b_ib_i'\rangle\big)z_{ij} + 2\langle\alpha\rangle'X_i'Z_i\langle\Lambda\rangle\langle b_i\rangle\Big) + d_0,\\
\hat{b}_i &= \langle\sigma^{-2}\rangle\,\hat{B}_i\langle\Lambda\rangle Z_i'\big(y_i - X_i\langle\alpha\rangle\big), \qquad
\hat{B}_i = \langle\sigma^{-2}\rangle^{-1}\Big(\langle\lambda\lambda'\rangle\circ(Z_i'Z_i) + \langle\sigma^{-2}\rangle^{-1}\langle B^{-1}\rangle\Big)^{-1},\\
\hat{\lambda} &= \langle\sigma^{-2}\rangle\,\hat{V}\sum_{i=1}^n \mathrm{diag}\big(\langle b_i\rangle\big)Z_i'\big(y_i - X_i\langle\alpha\rangle\big), \qquad
\hat{V} = \langle\sigma^{-2}\rangle^{-1}\Big(\sum_{i=1}^n \langle b_ib_i'\rangle\circ(Z_i'Z_i) + \langle\sigma^{-2}\rangle^{-1}\langle V^{-1}\rangle\Big)^{-1},\\
\hat{n} &= n + n_0, \qquad \hat{\Psi} = \sum_{i=1}^n \langle b_ib_i'\rangle + \Psi_0,\\
\hat{\eta} &= 1/2 + \eta_0, \qquad \hat{\zeta}_k = \langle\alpha_k^2\rangle/2 + \zeta_0, \qquad
\eta^* = 1/2 + \eta_0, \qquad \zeta_k^* = \langle\lambda_k^2\rangle/2 + \zeta_0,
\end{aligned}
$$

λ = diag(Λ), A = diag(ak : k = 1, …, p), V = diag(νk : k = 1, …, q), ∘ denotes the Hadamard product, and diag(·), depending on its argument, either builds a vector from the diagonal elements of a matrix or builds a diagonal matrix from the components of a vector.

The required moments are $\langle\alpha\rangle = \hat{\alpha}$, $\langle\alpha\alpha'\rangle = \hat{A} + \hat{\alpha}\hat{\alpha}'$, $\langle b_i\rangle = \hat{b}_i$, $\langle b_ib_i'\rangle = \hat{B}_i + \hat{b}_i\hat{b}_i'$, $\langle\sigma^{-2}\rangle = \hat{c}/\hat{d}$, $\langle\Lambda\rangle = \mathrm{diag}(\hat{\lambda})$, $\langle\lambda\lambda'\rangle = \hat{V} + \hat{\lambda}\hat{\lambda}'$, $\langle a_k^{-1}\rangle = \hat{\eta}/\hat{\zeta}_k$, $\langle\nu_k^{-1}\rangle = \eta^*/\zeta_k^*$ and $\langle B^{-1}\rangle = \hat{n}\hat{\Psi}^{-1}$. The iterative procedure implied by (3) is then to initialize these moments and cycle through the updates until convergence is reached.
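As an illustration, the following NumPy sketch runs the variational cycle of Section 3.2.1 on a small simulated data set. All variable names, the simulation settings, and the diffuse initialization are our own choices, and the expected residual sum of squares in the σ2 update is computed in an algebraically equivalent residual-plus-trace form; this is a simplified sketch, not the authors' MATLAB implementation:

```python
# Sketch of the variational cycle for the shrinkage model (Section 3.2.1).
import numpy as np

rng = np.random.default_rng(0)

# --- simulate a small data set from model (6): Z_i = X_i, q = p = 2 ---
n, ni, p, q = 200, 5, 2, 2
alpha_true = np.array([2.0, 0.0])          # second fixed effect inactive
X = [np.column_stack([np.ones(ni), rng.normal(size=ni)]) for _ in range(n)]
Z = X
beta = rng.multivariate_normal(np.zeros(q), np.diag([1.0, 0.0]), size=n)
y = [X[i] @ alpha_true + Z[i] @ beta[i] + 0.5 * rng.normal(size=ni)
     for i in range(n)]

# --- vague hyperparameters, as in Section 4 ---
c0 = eta0 = 0.1
d0 = zeta0 = 0.001
n0, Psi0 = q, np.eye(q)

# --- initialize the required moments ---
Es = 1.0                                   # <sigma^{-2}>
lam = np.ones(q); Ell = np.eye(q) + np.outer(lam, lam)   # <lambda lambda'>
b = np.zeros((n, q)); Ebb = np.tile(np.eye(q), (n, 1, 1))
Eainv = np.ones(p); Evinv = np.ones(q); EBinv = np.eye(q)

for it in range(100):
    # q(alpha)
    A_hat = np.linalg.inv(Es * sum(Xi.T @ Xi for Xi in X) + np.diag(Eainv))
    alpha_hat = Es * A_hat @ sum(
        X[i].T @ (y[i] - Z[i] @ (lam * b[i])) for i in range(n))
    EAA = A_hat + np.outer(alpha_hat, alpha_hat)
    # q(b_i), i = 1..n
    for i in range(n):
        Bi = np.linalg.inv(Es * Ell * (Z[i].T @ Z[i]) + EBinv)
        b[i] = Es * Bi @ (lam * (Z[i].T @ (y[i] - X[i] @ alpha_hat)))
        Ebb[i] = Bi + np.outer(b[i], b[i])
    # q(lambda)
    V_hat = np.linalg.inv(
        Es * sum(Ebb[i] * (Z[i].T @ Z[i]) for i in range(n)) + np.diag(Evinv))
    lam = Es * V_hat @ sum(
        b[i] * (Z[i].T @ (y[i] - X[i] @ alpha_hat)) for i in range(n))
    Ell = V_hat + np.outer(lam, lam)
    # q(sigma^2): residual form of the expected sum of squares
    c_hat = sum(len(yi) for yi in y) / 2 + c0
    d_hat = d0
    for i in range(n):
        r = y[i] - X[i] @ alpha_hat - Z[i] @ (lam * b[i])
        XtX, ZtZ = X[i].T @ X[i], Z[i].T @ Z[i]
        d_hat += 0.5 * (r @ r
                        + np.trace(EAA @ XtX) - alpha_hat @ XtX @ alpha_hat
                        + np.trace((Ell * Ebb[i]) @ ZtZ)
                        - (lam * b[i]) @ ZtZ @ (lam * b[i]))
    Es = c_hat / d_hat
    # q(B), q(a_k^{-1}), q(nu_k^{-1})
    EBinv = (n + n0) * np.linalg.inv(Ebb.sum(axis=0) + Psi0)
    Eainv = (0.5 + eta0) / (np.diag(EAA) / 2 + zeta0)
    Evinv = (0.5 + eta0) / (np.diag(Ell) / 2 + zeta0)

print(alpha_hat)   # first fixed effect near 2, second shrunk toward 0
```

On this simulated design the inactive fixed effect is shrunk toward zero while the active one is recovered, mirroring the behavior reported in the simulation study.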

After simplifications, the expression for the lower-bound, L, is given by

$$
\begin{aligned}
L = {}&\frac{1}{2}\Big\{-\sum_{i=1}^n n_i\log(2\pi) + q(n+1) + p + \log|\mathbb{V}(\alpha)| + \log|\mathbb{V}(\lambda)| + \sum_{i=1}^n\log|\mathbb{V}(b_i)|\\
&\quad + q(\hat{n}-n_0)\log 2 + \log\frac{|\Psi_0|^{n_0}}{|\hat{\Psi}|^{\hat{n}}}\Big\}
+ \log\frac{\Gamma_q(\hat{n}/2)}{\Gamma_q(n_0/2)} + \log\frac{d_0^{c_0}}{\hat{d}^{\hat{c}}} + \log\frac{\Gamma(\hat{c})}{\Gamma(c_0)}\\
&\quad + \sum_{j=1}^p\log\frac{\zeta_0^{\eta_0}}{\hat{\zeta}_j^{\hat{\eta}}} + \sum_{j=1}^q\log\frac{\zeta_0^{\eta_0}}{{\zeta_j^*}^{\eta^*}} + p\log\frac{\Gamma(\hat{\eta})}{\Gamma(\eta_0)} + q\log\frac{\Gamma(\eta^*)}{\Gamma(\eta_0)}. \tag{8}
\end{aligned}
$$

4. Simulations

Focusing on the model of Section 3.2, we assess the performance in estimating the fixed effects and the random effects covariance. We specify two alternatives for the number of subjects, n = {400, 2000}, three alternatives for the number of potential covariates, p = {4, 20, 60}, with q = p, and three alternatives for the underlying sparsity, p′ = {.75p, .50p, .25p}, with q′ = p′, where p′ and q′ denote the number of active covariates in the underlying model. We generate ni = 8, i = 1, …, n, observations per subject and set α1:p′ = 1 and α(p′+1):p = 0. In addition, we let xij1 = 1, xij(2:p) ~ Np−1(0, C) with C ~ W(p − 1, Ip−1); zij = xij; σ2 ~ U(1, 3); and D1:q′,1:q′ ~ W(q′, Iq′), with the remaining entries along dimensions (q′ + 1) : q set to 0.

We run the variational procedure for both the standard and shrinkage models on 100 independently generated data sets. For the standard model, the priors are specified as α ~ N(α0, A0), σ−2 ~ G(c0, d0) and D ~ IW(n0, Ψ0), where α0 = 0, A0 = 1000I (Chen and Dunson, 2003), c0 = 0.1, d0 = 0.001 (Smith and Kohn, 2002), n0 = q and Ψ0 = I, to reflect our vague prior information on α, σ−2 and D. All these priors are proper yet specify very vague information about the parameters relative to the likelihoods observed in this simulation. For the shrinkage model, we choose c0 = η0 = 0.1, d0 = ζ0 = 0.001, n0 = q and Ψ0 = I to express our vague information on σ−2, $a_k^{-1}$, $\nu_k^{-1}$ and B. It is important that we refrain from using improper priors on the higher-level parameters, i.e. $a_k^{-1}$ and $\nu_k^{-1}$, to retain meaningful marginal likelihoods: the limiting cases of these conjugate priors lead to an improper posterior, and consequently the decomposition in (1) loses its meaning.

The boxplots in Figure 1 (a) and (b) give the quadratic losses in the estimation of α and D respectively arising from the standard and the shrinkage models. As the dimension of the problem increases and the underlying model becomes sparser, the advantage of the shrinkage model is highly pronounced.

Figure 1. Quadratic loss for (a) α and (b) D. The vertical axis is given in log-scale.

5. Real Data Application

We apply the proposed method to US Collaborative Perinatal Project (CPP) data on maternal pregnancy smoking in relation to child growth (Chen et al., 2005). Chen et al. (2005) examine the relationship between maternal smoking habits during pregnancy and childhood obesity among n = 34866 children in the CPP using generalized estimating equations (GEE) (Liang and Zeger, 1986). The size of the data, in particular the number of subjects, hinders a random effects analysis (Pennell and Dunson, 2007).

After removing observations with missing values, we were left with 28211 subjects and 115811 observations. We set aside 211 observations across subjects as a hold-out sample to test the performance of our procedure, leaving 115600 observations to train the model. The main predictors considered in the model are child's age, recruitment center, race, whether ever breast-fed, number of prior live births by the mother, socioeconomic index, mother's BMI, number of cigarettes smoked per day, mother's age, mother's education, marital status of the mother at recruitment, and trimester at recruitment. Dummy variables were created for the categorical variables. We also include a quadratic term for age and interaction terms between all the predictors (except for center) and age. Our design matrix, X = Z (with a column of 1s for the intercept term), has 72 columns. Each column of X = Z is scaled to have unit length. The response was continuous: the weight of the children in kilograms. A detailed description of the data is available in Chen et al. (2005).

We apply our shrinkage model to the data. Figure 2 gives the 99% credible sets for the fixed effect coefficients and for the diagonals of the random effects covariance matrix. For both the fixed effects coefficients and the diagonals of the random effects covariance, many credible sets are concentrated around 0. Figure 3(a) gives the point estimates and 99% credible sets for the hold-out sample; the R2 on the test set was 94.8%.

Figure 2. Vertical lines represent 99% credible sets for (a) the fixed effect coefficients for the 72 predictors used in the model and (b) the diagonal elements of the random effects covariance matrix. (c) and (d) provide a closer look at the credible sets that are close to zero.

Figure 3. (a) Predicted vs. observed response (weight in kg): black circles represent the point estimates from the shrinkage model, the dashed line is the 45° line, and the solid line is the linear fit between the predicted and observed values. The shaded area gives the 99% point-wise credible region for the mean response. (b) tracks the lower-bound (upper panel) and log ψ (lower panel) over time; the vertical dashed line marks the iteration/time at which the pre-specified convergence criterion (ψ = 10−6) is reached.

The computational advantage of the procedure is undeniable. For the shrinkage model, the algorithm was implemented in MATLAB on a computer with a 2.8 GHz processor and 12 GB RAM. Figure 3(b) tracks the lower-bound and the relative error between two subsequent lower-bound values, ψ = |L(t) − L(t−1)|/|L(t)|, where L(t) denotes the lower-bound evaluated at iteration t. The preset value ψ = 10−6 is reached after 2485 iterations, which takes 2.4 days. The computational cost of one variational iteration is almost identical to that of one Gibbs sampling sweep, so in the same time a Gibbs sampler would have drawn only 2485 samples. Considering the burn-in period required for convergence and the thinning of the chain to obtain less correlated draws, this number is far from sufficient. Thus, with data sets this large or larger, MCMC is not a computationally feasible option.
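The stopping rule used above is a simple relative-change criterion on successive lower-bound values; a minimal sketch (the helper name is ours):

```python
# Relative-change stopping rule: stop when psi = |L_t - L_{t-1}| / |L_t|
# falls below a preset tolerance.
def converged(L_t, L_prev, tol=1e-6):
    """Return True once the lower bound has effectively stopped increasing."""
    return abs(L_t - L_prev) / abs(L_t) < tol

print(converged(-1000.0, -1000.5))     # psi = 5e-4: keep iterating
print(converged(-1000.0, -1000.0005))  # psi = 5e-7: stop
```

Because the bound increases monotonically under the updates, a small relative change indicates the procedure has settled on one of the exchangeable modes discussed in Section 3.2.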

6. Conclusion

Here we provided a fast approximate solution to fully Bayes inference to be used in the analysis of large longitudinal data sets. The proposed parameterization also allows for identifying the predictors that contribute as fixed and/or random effects. Although this parameterization leads to an unidentifiable likelihood, and would also cause the so-called label-switching problem with the application of Gibbs sampling, the variational approach allows us to converge to one of many solutions which lead to identical inferences. The utility of the new parameterization is justified through a simulation study. The application to a large epidemiological data set also demonstrates computational advantages obtained through the proposed method over conventional sampling techniques.

The computational advantages are contingent upon the availability of the required moments in analytical form, which are easily obtained with the conjugate priors we considered. When the likelihood is no longer normal, these moments may not be available analytically and we may have to resort to a numerical solution; in such cases the computational advantages may not be as pronounced. As a special case, the same computational advantages should be observed in linear mixed models with a binary outcome through the latent variable formulation of Albert and Chib (1993). A similar approach for shrinkage estimation in probit models was considered in Armagan (2009).

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Contributor Information

Artin Armagan, Email: artin@stat.duke.edu, Department of Statistical Science, Duke University, Durham, NC 27708.

David Dunson, Email: dunson@stat.duke.edu, Department of Statistical Science, Duke University, Durham, NC 27708.

References

1. Albert JH, Chib S. Bayesian Analysis of Binary and Polychotomous Response Data. Journal of the American Statistical Association. 1993;88:669–679.
2. Armagan A. Variational Bridge Regression. JMLR: W&CP. 2009;5:17–24.
3. Bishop CM, Tipping ME. Variational Relevance Vector Machines. In: UAI '00: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 2000. pp. 46–53.
4. Chen A, Pennell M, Klebanoff MA, Rogan WJ, Longnecker MP. Maternal smoking during pregnancy in relation to child overweight: follow-up to age 8 years. International Journal of Epidemiology. 2005. doi:10.1093/ije/dyi218.
5. Chen Z, Dunson DB. Random Effects Selection in Linear Mixed Models. Biometrics. 2003;59(4):762–769.
6. Figueiredo MAT. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003;25:1150–1159.
7. Foster SD, Verbyla AP, Pitchford WS. Journal of Agricultural, Biological and Environmental Statistics. 2007;12(2):300–314.
8. Frühwirth-Schnatter S, Tüchler R. Bayesian parsimonious covariance estimation for hierarchical linear mixed models. Statistics and Computing. 2008;18(1):1–13.
9. Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis. 2006;1(3):515–533.
10. George E, McCulloch R. Approaches for Bayesian variable selection. Statistica Sinica. 1997;7:339–373.
11. Huang Z, Gelman A. Sampling for Bayesian computation with large datasets. Technical Report, Department of Statistics, Columbia University. 2008.
12. Jaakkola TS, Jordan MI. Bayesian parameter estimation via variational methods. Statistics and Computing. 2000;10(1):25–37.
13. Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Machine Learning. 1999;37:183–233.
14. Kinney SK, Dunson DB. Fixed and Random Effects Selection in Linear and Logistic Models. Biometrics. 2007;63(3):690–698.
15. Laird NM, Ware JH. Random-Effects Models for Longitudinal Data. Biometrics. 1982;38(4):963–974.
16. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
17. Park T, Casella G. The Bayesian Lasso. Journal of the American Statistical Association. 2008;103:681–686.
18. Pennell ML, Dunson DB. Fitting semiparametric random effects models to large data sets. Biostatistics. 2007;8(4):821–834.
19. Smith M, Kohn R. Parsimonious Covariance Matrix Estimation for Longitudinal Data. Journal of the American Statistical Association. 2002;97:1141–1153.
20. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological). 1996;58(1):267–288.
21. Tierney L, Kadane JB. Accurate Approximations for Posterior Moments and Marginal Densities. Journal of the American Statistical Association. 1986;81(393):82–86.
22. Tipping ME. Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research. 2001;1:211–244.
23. West M. On Scale Mixtures of Normal Distributions. Biometrika. 1987;74(3):646–648.
