Author manuscript; available in PMC: 2022 Jun 22.
Published in final edited form as: Biometrika. 2020 Aug 26;108(2):269–282. doi: 10.1093/biomet/asaa068

Approximating posteriors with high-dimensional nuisance parameters via integrated rotated Gaussian approximation

W VAN DEN BOOM 1, G REEVES 2, D B DUNSON 3
PMCID: PMC9216391  NIHMSID: NIHMS1815596  PMID: 35747172

Summary

Posterior computation for high-dimensional data with many parameters can be challenging. This article focuses on a new method for approximating posterior distributions of a low- to moderate-dimensional parameter in the presence of a high-dimensional or otherwise computationally challenging nuisance parameter. The focus is on regression models and the key idea is to separate the likelihood into two components through a rotation. One component involves only the nuisance parameters, which can then be integrated out using a novel type of Gaussian approximation. We provide theory on approximation accuracy that holds for a broad class of forms of the nuisance component and priors. Applying our method to simulated and real data sets shows that it can outperform state-of-the-art posterior approximation approaches.

Keywords: Bayesian statistics, Dimensionality reduction, Marginal inclusion probability, Nuisance parameter, Posterior approximation, Support recovery, Variable selection

1. Introduction

Consider the regression model

$$y \sim N(X\beta + \eta, \sigma^2 I_n), \tag{1}$$

where y is an n-dimensional vector of observations, X is an n × p design matrix, β is a p-dimensional parameter of interest, η is an n-dimensional nuisance parameter, and σ² is the error variance. The nuisance parameter can for instance capture the effect of a large set of covariates not included in X, or of non-Gaussian errors. Our goal is Bayesian inference on the model in (1) when p is of moderate size such that p ≪ n − p, with the focus on the posterior

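$$\pi(\beta \mid y) = \int \pi(\beta, \eta \mid y)\, d\eta = \frac{1}{\pi(y)} \int \pi(y \mid \beta, \eta)\, \pi(\beta, \eta)\, d\eta. \tag{2}$$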

The integrals in (2) and π(y) are intractable to approximate accurately for certain priors π(β, η), with direct approximations such as Laplace’s method producing inaccurate results and Monte Carlo sampling being daunting computationally. Our key idea is to transform the hard problem with nuisance parameter η in a principled way to a p-dimensional one which can be written as a linear model including only β. Then, a low-dimensional inference technique can be applied to this p-dimensional model. The transformation uses a novel type of Gaussian approximation using a data rotation to integrate out η from (1).

Section 3 discusses special cases of the model in (1). Applications include epidemiology studies in which y is a health outcome, X consists of exposures and key clinical or demographic factors of interest, and η is the effect of high-dimensional biomarkers. The goal is inference on the effect of the exposures and the clinical or demographic covariates, but adjusting for the high-dimensional biomarkers. For example, η may result from genetic factors, such as single-nucleotide polymorphisms (SNPs), and we want to control for these in identifying an environmental main effect. It is often impossible to isolate the impact of individual genetic factors so we consider these effects as nuisance parameters. Another use of (1) is computation of posterior inclusion probabilities in high-dimensional Bayesian variable selection as detailed in § 3·2.

Data with a complex component η that is not of primary interest, together with only a moderate number p of parameters of interest, are increasingly common. Unfortunately, the complexity of η can make accurate approximation of π(β | y) in (2) challenging even when p = 1. One naive approach is to ignore the nuisance parameter η by setting it to zero. The result can be problematic as omitting η changes the interpretation of the parameter of interest β, which therefore might take on a different value. For example, η might capture the effect of covariates that must be adjusted for to avoid misleading conclusions on β.

Many posterior approximation methods exist, including Monte Carlo (George & McCulloch, 1993, 1997; O’Hara & Sillanpää, 2009), variational Bayes (Carbonetto & Stephens, 2012; Ormerod et al., 2017), integrated nested Laplace approximations (Rue et al., 2009), and expectation propagation (Hernández-Lobato et al., 2015). However, these methods can be computationally expensive, do not apply to our setting, or lack theoretical results regarding approximation accuracy. A notable exception to the latter is the fast posterior approximation algorithm of Huggins et al. (2017) which comes with bounds on the approximation error under conditions on the prior such as log-concavity, Gaussianity, and smoothness. The class of priors that we allow on β and η is much larger. Our method and its analysis for instance apply to dimensionality reduction and shrinkage priors such as spike-and-slab, horseshoe, and Laplace distributions.

The main computational bottleneck of our method is calculation of the mean and variance of a nuisance term, for which one can choose any suitable algorithm. As a result, the computational cost of our method is comparable to that of the fast algorithm chosen for this step.

2. Integrated rotated Gaussian approximation

2·1. Notation and assumptions

Denote the multivariate Gaussian distribution with mean μ and covariance Σ by N(μ, Σ), and its density function evaluated at a by N(a | μ, Σ). Denote the distribution of a conditional on b by Π(a | b) and its density, with respect to some dominating measure, evaluated at a by π(a | b). We assume that β and η are a priori independent so that Π(β, η) = Π(β)Π(η). We treat X and σ² as known constants unless otherwise noted. Assume that p ≤ n. We assume that X is full rank to simplify the exposition, but our method also applies to rank-deficient X.

2·2. Description of the method

We integrate out η from (1) by splitting the model into two parts, one of which does not involve β. A data rotation provides such a model split. Specifically, consider as rotation matrix the n × n orthogonal matrix Q from the QR decomposition of X. Define the n × p matrix M and the n × (n − p) matrix S by (M, S) = Q. Then, the columns of M form an orthonormal basis for the column space of X since X is full rank by assumption (Golub & Van Loan, 1996, § 5.2). Since Q is orthogonal, the columns of S form an orthonormal basis for the orthogonal complement of the column space of X. Therefore, $S^{T}X = 0_{(n-p)\times p}$, an (n − p) × p matrix of zeros, which can also be derived from the fact that $Q^{T}X$ is upper triangular.

By the rotational invariance of the Gaussian distribution and $Q^{T}Q = I_n$, the model $Q^{T}y \sim N(Q^{T}X\beta + Q^{T}\eta, \sigma^2 I_n)$ is distributionally equivalent to (1). This rotated model splits as

$$M^{T}y \sim N(M^{T}X\beta + M^{T}\eta, \sigma^2 I_p), \tag{3a}$$
$$S^{T}y \sim N(S^{T}\eta, \sigma^2 I_{n-p}), \tag{3b}$$

using $S^{T}X = 0_{(n-p)\times p}$. This transformation motivates a two-stage approach in which one first computes $\Pi(\eta \mid S^{T}y)$ from submodel (3b) and then uses this distribution as an updated prior for the projected nuisance term $M^{T}\eta$ in submodel (3a). Following this approach, the posterior of β can be expressed as $\Pi(\beta \mid y) \propto \Pi(\beta) \int N(M^{T}y \mid M^{T}X\beta + M^{T}\eta, \sigma^2 I_p)\, d\Pi(M^{T}\eta \mid S^{T}y)$.
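The rotation can be checked numerically. The following is a minimal sketch, assuming NumPy and small illustrative sizes that are not taken from the paper; the variable names `StX`, `Mty`, and `Sty` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 2                       # illustrative sizes, not from the paper
X = rng.standard_normal((n, p))   # full-rank design (almost surely)
eta = rng.standard_normal(n)      # stand-in nuisance vector
beta = np.array([1.0, -2.0])
eps = 0.5 * rng.standard_normal(n)
y = X @ beta + eta + eps

# The complete QR decomposition returns the full n x n orthogonal Q = (M, S).
Q, R = np.linalg.qr(X, mode="complete")
M, S = Q[:, :p], Q[:, p:]

# S spans the orthogonal complement of the column space of X.
StX = S.T @ X

# Rotated data: only the first p coordinates involve beta.
Mty = M.T @ y   # submodel (3a): mean M^T X beta + M^T eta
Sty = S.T @ y   # submodel (3b): mean S^T eta, free of beta
```

Because `StX` vanishes up to round-off, `Sty` depends on the data only through the nuisance term and the noise, which is what allows submodel (3b) to be analysed without reference to β.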

In practice, $\Pi(M^{T}\eta \mid S^{T}y)$ may be intractable to compute exactly because of the complexity of Π(η). To alleviate this challenge, we consider an approximation $\hat{\Pi}(M^{T}\eta \mid S^{T}y)$, which then leads to an approximation for the posterior of β:

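$$\hat{\Pi}(\beta \mid y) \propto \Pi(\beta) \int N(M^{T}y \mid M^{T}X\beta + M^{T}\eta, \sigma^2 I_p)\, d\hat{\Pi}(M^{T}\eta \mid S^{T}y). \tag{4}$$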

All distributions, densities, and probabilities resulting from this approximation carry a hat to distinguish them from their exact counterparts.

A Gaussian approximation is analytically convenient:

$$\hat{\Pi}(M^{T}\eta \mid S^{T}y) = N(\hat{\mu}, \hat{\Sigma}), \tag{5}$$

where $\hat{\mu}$ and $\hat{\Sigma}$ are estimates of the mean and covariance of $\Pi(M^{T}\eta \mid S^{T}y)$, respectively. In this case, (4) simplifies to

$$\hat{\Pi}(\beta \mid y) \propto \Pi(\beta)\, N(M^{T}y \mid M^{T}X\beta + \hat{\mu}, \sigma^2 I_p + \hat{\Sigma}). \tag{6}$$

Only β is unknown in (6), so the computational problems with (1) resulting from the complexity of Π(η) have been resolved. Furthermore, (6) is equivalent to a Gaussian linear model with observations $M^{T}y - \hat{\mu}$, design matrix $M^{T}X$, error covariance $\sigma^2 I_p + \hat{\Sigma}$, and parameter β. We have reduced a model with a potentially challenging nuisance parameter to a low-dimensional one with the nuisance integrated out while controlling for the effect of the nuisance parameter in a principled manner. Algorithm 1 summarizes our method when the Gaussian approximation from (5) is used.

Algorithm 1.

Integrated rotated Gaussian approximation.

Input: Data (y, X)
1. Compute the QR decomposition of X to obtain the rotation matrix Q = (M, S).
2. Compute the estimates $\hat{\mu}$ and $\hat{\Sigma}$ for the mean and covariance of $\Pi(M^{T}\eta \mid S^{T}y)$ based on submodel (3b) using an algorithm of choice.
3. Approximate the posterior $\Pi(\beta \mid y)$ according to (6).
Output: The approximate posterior $\hat{\Pi}(\beta \mid y)$
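The three steps can be sketched in a toy setting where every quantity is available in closed form. The sketch below assumes NumPy, a nuisance term η = Zα with a Gaussian prior α ~ N(0, ψ I_q), and a Gaussian prior β ~ N(0, τ I_p); these priors and all sizes are illustrative assumptions chosen so that Step 2 is exact conjugate regression, not the priors studied in the paper. In this special case the output of the rotated procedure should coincide with the exact posterior obtained by integrating α out of the unrotated model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 30, 2, 5                    # illustrative sizes
sigma2, psi, tau = 1.0, 2.0, 4.0      # noise, nuisance-prior and beta-prior variances
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, q))
y = X @ np.array([1.0, -1.0]) + Z @ rng.standard_normal(q) + rng.standard_normal(n)

# Step 1: rotation from the complete QR decomposition of X.
Q, _ = np.linalg.qr(X, mode="complete")
M, S = Q[:, :p], Q[:, p:]

# Step 2: mean and covariance of M^T eta given S^T y. With a Gaussian prior
# on alpha, submodel (3b) is a conjugate linear regression.
SZ, MZ = S.T @ Z, M.T @ Z
Psi = np.linalg.inv(np.eye(q) / psi + SZ.T @ SZ / sigma2)   # cov of alpha | S^T y
xi = Psi @ SZ.T @ (S.T @ y) / sigma2                        # mean of alpha | S^T y
mu_hat = MZ @ xi
Sigma_hat = MZ @ Psi @ MZ.T

# Step 3: Gaussian linear model (6) with observations M^T y - mu_hat.
B = M.T @ X
C_inv = np.linalg.inv(sigma2 * np.eye(p) + Sigma_hat)
cov_rotated = np.linalg.inv(np.eye(p) / tau + B.T @ C_inv @ B)
mean_rotated = cov_rotated @ B.T @ C_inv @ (M.T @ y - mu_hat)

# Reference: integrate alpha out of the unrotated model directly.
V_inv = np.linalg.inv(sigma2 * np.eye(n) + psi * Z @ Z.T)
cov_exact = np.linalg.inv(np.eye(p) / tau + X.T @ V_inv @ X)
mean_exact = cov_exact @ X.T @ V_inv @ y
```

With non-Gaussian priors on η, Step 2 would instead call an approximate algorithm, and the agreement with the exact posterior would hold only approximately.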

2·3. Relation to other methods

Algorithm 1 has resemblances with other approximation methods. Integrated nested Laplace approximations (Rue et al., 2009) also approximate a nested part of a Bayesian model by a Gaussian distribution but with important differences. A Laplace approximation is applied without a data rotation and is done at two, rather than one, nested levels of the model. Moreover, a Laplace approximation matches the mode and curvature of the approximating Gaussian while (5) matches the moments. Laplace’s method (Tierney & Kadane, 1986) requires a continuous target distribution and integrated nested Laplace approximations assume a conditionally Gaussian prior on some parameters. Our Gaussian approximation needs no such conditions on priors but assumes a Gaussian error distribution. For instance, § 3 considers examples of priors on η that are not continuous or are non-Gaussian.

The approximation in (5) aims to match the first two moments of the exact $\Pi(M^{T}\eta \mid S^{T}y)$. Such matching is the principle behind expectation consistent inference (Opper & Winther, 2005). Our method matches moments for the nuisance parameter but not for the parameter of interest β. This differs from applications of the expectation consistent framework in which moment matching is pervasive such as in expectation propagation (Hernández-Lobato et al., 2015). Implementations of expectation propagation are usually not able to capture dependence among dimensions of the posterior while our method allows for dependence in the p-dimensional β.

Effectively, our method integrates out the nuisance parameter η approximately. Integrating out nuisance parameters from the likelihood is not new (Berger et al., 1999), including doing so approximately (Severini, 2011). Previous approximations, however, do not apply a data rotation and consider cases where the distribution on the nuisance parameter is regular enough so that a Laplace approximation can be applied. Our method does not need such regularity conditions.

The rotation Q is similar to the projection in the Frisch-Waugh-Lovell theorem (Stachurski, 2016, Theorem 11.2.1) for least-squares estimation of a parameter subset. Our method applies beyond least squares. Also, our estimation of the nuisance parameter through the rotation is merely an intermediate step for inference on β. Our method reduces to the algorithm from van den Boom et al. (2015) when considering the example in § 3·2 with p = 1.

2·4. Estimating σ2 and hyperparameters

So far, we have treated σ² as fixed and known. In practice, σ² usually needs to be estimated, as well as any unknown parameters in the prior on η. This estimation fits naturally into Step 2 of Algorithm 1 as the methods that can be used there frequently come with such estimation procedures; see, for instance, § S5·3 and § S6 of the Supplementary Material. The resulting estimates can then be plugged into Step 3. By doing so, only the (n − p)-dimensional submodel (3b) informs the estimates of these parameters and not the p-dimensional submodel (3a). We expect (3b) to contain the vast majority of information on the unknown parameters if (n − p) ≫ p, which is often the case in scenarios of interest.

3. Examples of nuisance parameters η

3·1. Adjusting for high-dimensional covariates

Section 3 provides examples of the general setting of model (1) that demonstrate the utility of the integrated rotated Gaussian approximation in Algorithm 1. As a first example, consider η = Zα with Z a known n × q feature matrix and α an unknown q-dimensional parameter with q > n. Then, the model in (1) becomes $y \sim N(X\beta + Z\alpha, \sigma^2 I_n)$, so that we are adjusting for high-dimensional covariates Z in performing inference on the coefficients β on the predictors X of interest. One way to deal with the fact that the number of covariates q exceeds the number of observations n is by inducing sparsity in α via its prior Π(α). We consider the spike-and-slab prior, $\alpha_j \sim \lambda N(0, \psi) + (1 - \lambda)\delta(0)$ independently for j = 1, …, q, where λ = pr(αj ≠ 0) is the prior inclusion probability, ψ the slab variance, and δ(0) a point mass at zero. By specifying Π(α), we have also defined Π(η) = Π(Zα). Since each Π(αj) is a mixture of a point mass and a Gaussian, Π(α) and thus Π(η) are mixtures of $2^q$ Gaussians. As a result, computation of π(β | y) in (2) involves summing over these $2^q$ components. This is infeasible for large q.
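To make the exponential cost concrete, the sketch below enumerates all mixture components for a tiny problem. It assumes NumPy, drops β so that the model is y ~ N(Zα, σ² I_n), and uses simulated data with hypothetical sizes; the helper `log_marginal` is introduced here for illustration only. Each of the 2^q inclusion patterns contributes one Gaussian marginal likelihood.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, q = 50, 6                          # tiny q so that 2^q = 64 is enumerable
lam, psi, sigma2 = 0.5, 1.0, 1.0
Z = rng.standard_normal((n, q))
alpha_true = np.array([3.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = Z @ alpha_true + rng.standard_normal(n)

def log_marginal(included):
    """log N(y | 0, sigma2 I + psi Z_g Z_g^T) for the columns in `included`."""
    Zg = Z[:, list(included)]
    V = sigma2 * np.eye(n) + psi * Zg @ Zg.T
    _, logdet = np.linalg.slogdet(V)
    quad = y @ np.linalg.solve(V, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# One term per inclusion pattern: the sum has 2^q components.
models = list(itertools.product([0, 1], repeat=q))
log_post = np.array([
    log_marginal([j for j in range(q) if g[j]])
    + sum(g) * np.log(lam) + (q - sum(g)) * np.log(1 - lam)
    for g in models
])
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()

# Posterior inclusion probability of each alpha_j.
inclusion = np.array(models).T @ weights
```

Doubling q squares the number of terms, which is why exact enumeration stops being practical well before q reaches the sizes considered in the applications.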

Algorithm 1 provides an approximation $\hat{\Pi}(\beta \mid y)$ while avoiding the exponential computational cost. Step 2 in Algorithm 1 requires choice of an estimation algorithm. Substituting η = Zα into (3b) yields $S^{T}y \sim N(S^{T}Z\alpha, \sigma^2 I_{n-p})$, which is a linear model with n − p observations and design matrix $S^{T}Z$. As such, methods for linear regression with spike-and-slab priors can produce an approximation to $\Pi(\alpha \mid S^{T}y)$ and thus the estimates $\hat{\mu}$ and $\hat{\Sigma}$ in (5). We choose vector approximate message passing (Rangan et al., 2017), detailed in § S5 of the Supplementary Material, to approximate $\Pi(\alpha \mid S^{T}y)$ because of its computational scalability and accuracy. Its computational cost nonetheless limits the size of q. For instance, § 5·3 considers a subset of q = 10,000 SNPs as using all SNPs was computationally infeasible. As a more scalable alternative, we consider the debiased lasso (Javanmard & Montanari, 2013) in § 5·2 as it can also approximate $\Pi(\alpha \mid S^{T}y)$ as detailed in § S6 of the Supplementary Material. A q in the millions is feasible with embarrassingly parallel split-and-merge strategies (Song & Liang, 2014). The q-dimensional distribution $\Pi(\alpha \mid S^{T}y)$ is possibly highly non-Gaussian, being a mixture of Gaussians. At the same time, the p-dimensional distribution $\Pi(M^{T}Z\alpha \mid S^{T}y) = \Pi(M^{T}\eta \mid S^{T}y)$ can be nearly Gaussian such that the approximation in (5) is accurate as discussed in § 4·2.

3·2. Bayesian variable selection

For a second application of (1), consider the linear model $y \sim N(A\theta, \sigma^2 I_n)$ where A is a known n × r design matrix and θ an unknown r-dimensional parameter. Variable selection is the problem of determining which entries of θ are non-zero. Modeling the data in a Bayesian fashion provides a natural framework to evaluate statistical evidence via the posterior Π(θ | y). A standard variable selection prior Π(θ) is the spike-and-slab prior defined by $\theta_j \sim \lambda N(0, \psi) + (1 - \lambda)\delta(0)$ independently for j = 1, …, r. As in § 3·1, the cost of computing the exact posterior with a spike-and-slab prior grows exponentially in r. Therefore, computation of Π(θ | y) is infeasible for r beyond moderate size. A variety of approximation methods exist for larger r including Monte Carlo (George & McCulloch, 1993, 1997; O’Hara & Sillanpää, 2009), variational Bayes (Carbonetto & Stephens, 2012; Ormerod et al., 2017), and expectation propagation (Hernández-Lobato et al., 2015).

Monte Carlo methods do not scale well with the number of predictors r. For r even moderately large, the number $2^r$ of possible non-zero subsets of θ is so large that there is no hope of visiting more than a vanishingly small proportion of models. The result is high Monte Carlo error in estimating posterior probabilities, with almost all models assigned zero probability as they are never visited. As an alternative to Monte Carlo sampling, fast approximation approaches for Bayesian variable selection include variational Bayes (Carbonetto & Stephens, 2012; Ormerod et al., 2017) and expectation propagation (Hernández-Lobato et al., 2015). Their accuracy, however, does not come with theoretical guarantees. Our method, which applies to variable selection as detailed in the next paragraph, allows for theoretical analysis as § 4 shows.

In variable selection, often the main question asked is whether θj ≠ 0 (j = 1, …, r) as measured by the posterior inclusion probability pr(θj ≠ 0 | y). Algorithm 1 can estimate pr(θj ≠ 0 | y): let p < r elements from θ constitute β and let the other q = r − p elements in θ constitute α. Then, Aθ = Xβ + Zα where X and Z consist of the respective columns in A, and Π(α, β) = Π(α)Π(β) since $\Pi(\theta) = \prod_{j=1}^{r} \Pi(\theta_j)$. This set-up is the same as in § 3·1 and Algorithm 1 approximates Π(β | y) as in § 3·1. Assuming θj is contained in β, an approximation of Π(θj | y) can be obtained as a marginal distribution of $\hat{\Pi}(\beta \mid y)$. Repeating Algorithm 1 with different splits of θ into β and α provides estimates of all pr(θj ≠ 0 | y) (j = 1, …, r). Computations for these different splits can run in parallel.

The approximation accuracy is not very sensitive to how θ is split into α and β, nor to p, per § S9 of the Supplementary Material. We therefore use simple sequential splitting, where the first p elements of θ constitute β in the first split, and recommend choosing p based on computational complexity. Assume that the number of CPU cores is less than the number of variables r. Then, computation time to obtain all $\widehat{\mathrm{pr}}(\theta_j \neq 0 \mid y)$ is a trade-off between the length p of β, which affects the cost of each execution of Algorithm 1, and the number r/p of executions of Algorithm 1. The order of r is limited by the order of q, which is again limited by the algorithm chosen for Step 2 of Algorithm 1 as discussed in § 3·1. The complexity in terms of p and r of computing all $\widehat{\mathrm{pr}}(\theta_j \neq 0 \mid y)$ is $O(r^2 \log^2 r)$ if p = O(log r) and vector approximate message passing is used as detailed in the next paragraph.
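Sequential splitting amounts to partitioning the indices of θ into consecutive blocks. A minimal sketch follows, using the values r = 64 and p = 4 from § 5·2; the function name `sequential_splits` is hypothetical and zero-based indices are used for convenience.

```python
def sequential_splits(r, p):
    """Partition indices 0..r-1 into consecutive blocks of length at most p."""
    return [list(range(start, min(start + p, r))) for start in range(0, r, p)]

r, p = 64, 4
blocks = sequential_splits(r, p)

# In each run of Algorithm 1, one block plays the role of beta and the
# remaining r - p indices play the role of alpha.
complements = [sorted(set(range(r)) - set(block)) for block in blocks]
```

Each (block, complement) pair defines an independent run, so the runs can be distributed across CPU cores.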

Step 1 of Algorithm 1 is the QR decomposition of an n × p matrix which has complexity $O(np^2)$ (Golub & Van Loan, 1996, § 5.2). Step 2 involves vector approximate message passing on n − p observations and q parameters, which has a complexity of $O\{(n - p + K)\, q \min(n - p, q)\}$ where K is the number of message passing iterations as detailed in § S5·2 of the Supplementary Material. Additionally for Step 2, computation of $S^{T}y$ and $S^{T}Z$, which are the observations and design matrix in (3b), and computing $\hat{\mu}$ and $\hat{\Sigma}$ in (5) from the message passing output is $O(n^2 q)$. Computing Step 3 with the spike-and-slab prior Π(β) is $O(2^p p^3)$, ignoring dependence on n. The complexity of obtaining all $\widehat{\mathrm{pr}}(\theta_j \neq 0 \mid y)$ by applying Algorithm 1 r/p times is thus $O\{(r/p)(q + 2^p p^3)\} = O\{(r/p)(r - p + 2^p p^3)\}$, ignoring dependence on n and K. For p = O(log r), this complexity reduces to $O(r^2 \log^2 r)$.
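The trade-off in p can be illustrated by evaluating the leading-order expression directly. This is only a rough sketch: it drops all constants and the dependence on n and K, so the numbers indicate growth rates rather than running times, and the function name `total_cost` is hypothetical.

```python
def total_cost(r, p):
    """Leading-order count (r/p) * (q + 2^p * p^3) with q = r - p."""
    q = r - p
    return (r / p) * (q + 2 ** p * p ** 3)

# Evaluate for r = 64, the number of predictors in the diabetes example.
costs = {p: total_cost(64, p) for p in (1, 2, 4, 8, 16)}
```

For small p the number of runs r/p dominates, while for large p the enumeration term 2^p p^3 from Step 3 takes over, which is why p should grow no faster than log r.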

3·3. Non-parametric adjustment for covariates

As a last example, let $\eta_i = (g \circ f)(z_i)$ (i = 1, …, n) where $g: \mathbb{R} \to \mathbb{R}$ is a known, differentiable, non-linear function, $f: \mathbb{R}^q \to \mathbb{R}$ is an unknown function, $g \circ f: \mathbb{R}^q \to \mathbb{R}$ is g composed with f, and zi is a q-dimensional feature vector. Then, ηi provides a non-parametric adjustment for the covariate zi in performing inferences on the effect of xi on yi. Take f’s prior as a Gaussian process that induces a prior Π(η). Algorithm 1 applies if a Gaussian approximation $\hat{\Pi}(M^{T}\eta \mid S^{T}y)$ is available: submodel (3b) reduces to $S^{T}y \sim N\{S^{T}G(F), \sigma^2 I_{n-p}\}$ where $F = \{f(z_1), \ldots, f(z_n)\}^{T}$ and $G(F) = \{g(F_1), \ldots, g(F_n)\}^{T}$, which is a non-linear Gaussian model as studied in Steinberg & Bonilla (2014). Linearizing G using a first-order Taylor series yields a Gauss-Newton algorithm for a Laplace approximation of $\Pi(F \mid S^{T}y)$ as detailed in § S7 of the Supplementary Material. Based on that approximation, compute $\hat{\mu}$ and $\hat{\Sigma}$ in (5), for instance by sampling F from a Laplace approximation $\hat{\Pi}(F \mid S^{T}y)$ and computing the sample mean and covariance of $M^{T}G(F)$ since $M^{T}\eta = M^{T}G(F)$.

4. Analysis of integrated rotated Gaussian approximation

4·1. Approximation accuracy

This section provides theoretical guarantees on the accuracy of our posterior approximation framework. We begin with a general upper bound in terms of the accuracy of the approximation for the projected nuisance parameter. For this, denote the distribution of the p-dimensional a + b where $b \sim N(0, \sigma^2 I_p)$ by $\Pi(a) * N_{\sigma^2}$. Define the Kullback-Leibler divergence from Π(b) to Π(a) as $D\{\Pi(a) \,\|\, \Pi(b)\} = \int \log\{\pi(a)/\pi(b)\}\, d\Pi(a)$.

At a high level, it is clear that the accuracy of the approximation $\hat{\Pi}(\beta \mid y)$ defined in (4) depends on the accuracy of the approximation $\hat{\Pi}(M^{T}\eta \mid S^{T}y)$. The following result quantifies the nature of this dependence in the setting where the data are generated from the prior predictive distribution. This result applies generally for any approximation $\hat{\Pi}(M^{T}\eta \mid S^{T}y)$ and thus includes the Gaussian approximation (5) used in Algorithm 1 as a special case.

Theorem 1. Let y be distributed according to the model in (1) with β ~ Π(β) and η ~ Π(η) distributed according to their priors. Conditional on any realization $S^{T}y$, the posterior approximation $\hat{\Pi}(\beta \mid y)$ described in (4) satisfies

$$E\big[D\{\Pi(\beta \mid y) \,\|\, \hat{\Pi}(\beta \mid y)\} \mid S^{T}y\big] \le D\{\Pi(M^{T}\eta \mid S^{T}y) * N_{\sigma^2} \,\|\, \hat{\Pi}(M^{T}\eta \mid S^{T}y) * N_{\sigma^2}\},$$

where the expectation on the left is with respect to the conditional distribution of y given $S^{T}y$.
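The role of the Gaussian smoothing in the upper bound can be illustrated in one dimension. The sketch below assumes, purely for illustration, that both the exact and the approximate laws of the projected nuisance term are scalar Gaussians, so that convolution with N(0, σ²) simply adds σ² to each variance; the function names and numerical values are hypothetical.

```python
import math

def kl_gauss_1d(m1, v1, m2, v2):
    """KL divergence D{N(m1, v1) || N(m2, v2)} for scalar Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def smoothed_kl(m1, v1, m2, v2, sigma2):
    """Same divergence after convolving both laws with N(0, sigma2)."""
    return kl_gauss_1d(m1, v1 + sigma2, m2, v2 + sigma2)

# Mismatched mean and variance, before and after smoothing with sigma2 = 1.
raw = kl_gauss_1d(0.0, 1.0, 0.5, 2.0)
smooth = smoothed_kl(0.0, 1.0, 0.5, 2.0, sigma2=1.0)
```

In this example the smoothed divergence is smaller than the raw one, so errors in the approximation of the nuisance term are damped by the observation noise before they enter the bound.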

A particularly useful property of Theorem 1 is that the upper bound does not depend in any way on the prior Π(β). This differs from some of the related work on posterior approximation, such as Huggins et al. (2017), which requires additional smoothness constraints, and thus excludes certain priors such as the spike-and-slab prior in § 3·1. Another useful property of Theorem 1 is that it does not require any assumptions about the extent to which the exact posterior $\Pi(M^{T}\eta \mid S^{T}y)$ is concentrated about the ground truth. As a consequence, this result is relevant for non-asymptotic settings where there may be high uncertainty about η.

4·2. Accuracy of the Gaussian approximation

Next, we provide theoretical justification for a Gaussian approximation to $\Pi(M^{T}\eta \mid S^{T}y)$ by showing that such an approximation can be accurate even when the prior on η is highly non-Gaussian. Without loss of generality, we focus on the set-up of § 3·1 where the nuisance term has the form η = Zα with a known n × q feature matrix Z and unknown parameter vector α. In this setting, the projected nuisance parameter $M^{T}\eta$ can be expressed as $M^{T}Z\alpha$ where $M^{T}Z$ is a p × q matrix with p ≪ q. There are no constraints on the dimension n other than n ≥ p.

As motivation for a Gaussian approximation to the projected nuisance term, consider the special case where the conditional distribution $\Pi(\alpha \mid S^{T}y)$ is a product measure with uniformly bounded second moments. Under regularity assumptions on the columns of $M^{T}Z$, the multivariate central limit theorem combined with the assumption p ≪ q implies that the distribution of the projection $M^{T}Z\alpha$ is close to the Gaussian distribution with the same mean and covariance. By contrast, the unprojected n-dimensional nuisance term η = Zα can be very far from Gaussian, particularly if n is of a similar order to q.
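The central limit effect can be quantified through excess kurtosis in the product-measure special case. The sketch below assumes, for illustration only, that the entries of α are independent draws from a zero-mean spike-and-slab law with inclusion probability λ, and that the projection weights are one row of a Gaussian matrix; the function names and the values q = 1000 and λ = 0.1 are hypothetical. It uses the fact that fourth cumulants of independent terms add, so no simulation of α is needed.

```python
import random

def excess_kurtosis_spike_slab(lam):
    """Excess kurtosis of lam * N(0, psi) + (1 - lam) * delta(0); free of psi."""
    return 3.0 / lam - 3.0

def excess_kurtosis_projection(weights, lam):
    """Excess kurtosis of sum_j w_j alpha_j for independent spike-and-slab alpha_j."""
    s2 = sum(w * w for w in weights)
    s4 = sum(w ** 4 for w in weights)
    return excess_kurtosis_spike_slab(lam) * s4 / s2 ** 2

random.seed(0)
q, lam = 1000, 0.1
weights = [random.gauss(0.0, 1.0) for _ in range(q)]   # one row of a Gaussian matrix

single = excess_kurtosis_spike_slab(lam)               # one entry of alpha
projected = excess_kurtosis_projection(weights, lam)   # one coordinate of the projection
```

A Gaussian has excess kurtosis zero, so `projected` being much closer to zero than `single` reflects how a low-dimensional projection of a highly non-Gaussian vector can be nearly Gaussian.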

More realistically, one may envision settings where the entries of α under $\Pi(\alpha \mid S^{T}y)$ are not independent but are weakly correlated on average. In this case, the usual central limit theorem does not hold because one can construct counterexamples in which the normalized sum of dependent but uncorrelated variables is far from Gaussian. Nevertheless, a classic result due to Diaconis & Freedman (1984) suggests that these counterexamples are atypical. Specifically, if one considers a weighted linear combination of the entries in α, then approximate Gaussianity holds for most choices of the weights, where most is quantified with respect to the uniform measure on the sphere. The implications of this phenomenon have been studied extensively in the context of statistical inference (Hall & Li, 1993; Leeb, 2013), and Meckes (2012) and Reeves (2017) provide approximation bounds for the setting of multidimensional linear projections.

In the context of our approximation framework, these results imply that a Gaussian approximation is accurate for most, but not necessarily all, instances of the p × q feature matrix $M^{T}Z$. To make this statement mathematically precise, we consider the expected behavior when the rows of Z are drawn independently from the q-dimensional Gaussian distribution N(0, Λ) where Λ is positive definite. As in the rest of the paper, we assume that X is fixed and arbitrary. Under these assumptions, the rows of the projected matrices $M^{T}Z$ and $S^{T}Z$ are independent with the same distribution as the rows of Z.

Our results depend on certain properties of the conditional distribution $\Pi(\alpha \mid S^{T}y, S^{T}Z)$. Let ξ and Ψ denote the mean and covariance of $\Pi(\alpha \mid S^{T}y, S^{T}Z)$, respectively. Define

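$$m_1 = E\left\{\left|\frac{\|\Lambda^{1/2}(\alpha - \xi)\|^2}{\mathrm{tr}(\Lambda\Psi)} - 1\right| \,\middle|\, S^{T}y, S^{T}Z\right\}, \qquad m_2 = \frac{\mathrm{tr}\{(\Lambda\Psi)^2\}}{\mathrm{tr}(\Lambda\Psi)^2}.$$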

The term m1 provides a measure of the concentration of $\|\Lambda^{1/2}(\alpha - \xi)\|^2$ about its mean and satisfies 0 ≤ m1 ≤ 2. The term m2 provides a measure of the average correlation between the entries of $\Lambda^{1/2}\alpha$ and satisfies 1/q ≤ m2 ≤ 1 with equality on the left when ΛΨ is proportional to the identity matrix and equality on the right when ΛΨ has rank one.
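The two extreme cases for m2 can be verified numerically. A minimal sketch, assuming NumPy and taking Λ to be the identity for simplicity; the function name `m2` and the example matrices are illustrative choices.

```python
import numpy as np

def m2(Lam, Psi):
    """Compute tr{(Lam Psi)^2} / tr(Lam Psi)^2."""
    LP = Lam @ Psi
    return np.trace(LP @ LP) / np.trace(LP) ** 2

q = 5
Lam = np.eye(q)

# Lam Psi proportional to the identity: uncorrelated entries, lower limit 1/q.
m2_identity = m2(Lam, 2.0 * np.eye(q))

# Lam Psi of rank one: perfectly correlated entries, upper limit 1.
u = np.arange(1.0, q + 1.0)
m2_rank_one = m2(Lam, np.outer(u, u))
```

Values of m2 near 1/q therefore indicate weak average correlation, which is the regime in which the bound of Theorem 2 is small.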

Given estimates $\hat{\xi}$ and $\hat{\Psi}$ that are functions of $S^{T}y$ and $S^{T}Z$, we consider the Gaussian approximation

$$\hat{\Pi}(M^{T}\eta \mid S^{T}y, S^{T}Z) = N\{M^{T}Z\hat{\xi}, \mathrm{tr}(\Lambda\hat{\Psi})\, I_p\}. \tag{7}$$

The covariance is chosen independently of $M^{T}Z$ and depends only on a scalar summary of the estimated covariance. The following result bounds the accuracy of this approximation in terms of m1 and m2 and the accuracy of the estimated mean and covariance.

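Theorem 2. Conditional on any $S^{T}y$ and $S^{T}Z$, the Gaussian approximation in (7) satisfies

$$E_{M^{T}Z}\big[D\{\Pi(M^{T}\eta \mid S^{T}y, S^{T}Z) * N_{\sigma^2} \,\|\, \hat{\Pi}(M^{T}\eta \mid S^{T}y, S^{T}Z) * N_{\sigma^2}\}\big] \le \delta_1 + \delta_2,$$

where the expectation is with respect to $M^{T}Z$ and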

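$$\delta_1 = 3p\left[m_1 \log\left\{1 + \frac{\mathrm{tr}(\Lambda\Psi)}{\sigma^2}\right\} + m_2^{1/4} + m_2^{1/2}\left\{1 + \frac{3\,\mathrm{tr}(\Lambda\Psi)}{\sigma^2}\right\}^{p/4}\right],$$
$$\delta_2 = \frac{p\,\|\Lambda^{1/2}(\xi - \hat{\xi})\|^2}{2\sigma^2} + \frac{p}{2\sigma^2}\left\{\mathrm{tr}(\Lambda\Psi)^{1/2} - \mathrm{tr}(\Lambda\hat{\Psi})^{1/2}\right\}^2.$$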

This result is meaningful when p ≪ q and the noise variance is non-negligible compared to the covariance of the nuisance term, such that the ratio tr(ΛΨ)/σ² is bounded from above. Then, δ1 converges to zero as m1 and m2 become small. The term δ2 quantifies the effect of mismatch between the first and second moments of $\Pi(\alpha \mid S^{T}y, S^{T}Z)$ and their approximations. The dependence on the second moments appears only in the terms tr(ΛΨ) and $\mathrm{tr}(\Lambda\hat{\Psi})$. Thus, this bound can be small even if the approximation $\hat{\Psi}$ is very different from the true covariance Ψ.

To illustrate the significance of our results, consider two scaling regimes. First, if n ≪ q then the same arguments used in the proof of Theorem 2 can be used to show that the distribution of the n-dimensional nuisance term η is also approximately Gaussian. Then, our approximation framework is well motivated, but does not differ fundamentally from existing approaches that apply a Laplace approximation directly on the unrotated data. The second, and more interesting, regime occurs when n is of the same order as q or n ≫ q. Then, the n-dimensional nuisance term is non-Gaussian in general, because there exists a near isometry between η and α. Our approximation framework can provide significant gains by taking this non-Gaussianity into account when estimating the mean and covariance. Moreover, combining Theorems 1 and 2 provides an upper bound on the error of the approximation to the posterior of β described in Algorithm 1. In particular, if the approximations of the mean and covariance are accurate enough, then this approximation error converges to zero as the terms m1 and m2 become small.

4·3. Variable selection consistency

Finally, we provide guarantees for variable selection consistency of (6), which only considers β in contrast to § 3·2. Let the set γ ⊂ {1, …, p} contain all indices j such that βj ≠ 0. Define γ0 analogously for a non-random β0. Variable selection consistency as in Fernández et al. (2001) and Liang et al. (2008) means that, for $y \sim N(X\beta_0 + \eta_0, \sigma^2 I_n)$, the posterior probability of the true model γ0 converges to one, pr(γ = γ0 | y) → 1 as n → ∞ where p does not change with n. It is desirable for a posterior approximation to inherit this property. Monte Carlo approximations do, but only if they are run for an infinite amount of time. Our approximation bypasses the need for such sampling, instead requiring mean and variance estimation for (5), while inheriting the consistency property if $\hat{\Pi}(M^{T}\eta \mid S^{T}y)$ concentrates appropriately. Relatedly, Ormerod et al. (2017) established such consistency for their variational Bayes algorithm. More recently, K. Ray and B. Szabó (arXiv:1904.07150) showed optimal convergence rates of variable selection using variational Bayes with different priors than we consider here.

Let the |γ|-dimensional vector βγ consist of the elements in β with indices in γ, and the n × |γ| matrix Xγ consist of the columns of X with indices in γ. Then, specifying Π(γ) and Π(βγ | γ) defines Π(β). We consider g-priors (Zellner, 1986):

$$\beta_\gamma \mid \gamma \sim N\{0, \sigma^2 g_n (X_\gamma^{T} X_\gamma)^{-1}\}, \qquad g_n \in (0, \infty). \tag{8}$$

Liang et al. (2008) showed variable selection consistency for priors of this form. Our approximation inherits this property under the additional assumption (9) on $\hat{\Pi}(M^{T}\eta \mid S^{T}y)$ and $g_n$. This is an assumption on $g_n$ and σ² jointly since $\hat{\Pi}(M^{T}\eta \mid S^{T}y)$ depends on σ². Otherwise, the sensitivity to σ² is limited since the property considers the asymptotic regime n → ∞, when the signal-to-noise ratio goes to infinity regardless of σ².
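A well-known consequence of the g-prior covariance in (8) is that, in a plain Gaussian linear model without a nuisance term, the posterior mean is the least-squares estimate shrunk by the factor g/(1 + g). The sketch below checks this numerically. It assumes NumPy, a fixed value of g, simulated data, and a model containing only the selected columns; these are illustrative simplifications rather than the setting of Theorem 3.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 3                       # k = |gamma|, the size of the selected model
sigma2, g = 1.0, 25.0
Xg = rng.standard_normal((n, k))   # columns of X with indices in gamma
y = Xg @ np.array([1.0, 0.5, -1.0]) + rng.standard_normal(n)

XtX = Xg.T @ Xg
prior_cov = sigma2 * g * np.linalg.inv(XtX)   # covariance of the g-prior in (8)

# Conjugate update: posterior precision = prior precision + X^T X / sigma2.
post_prec = np.linalg.inv(prior_cov) + XtX / sigma2
post_mean = np.linalg.solve(post_prec, Xg.T @ y / sigma2)

ols = np.linalg.solve(XtX, Xg.T @ y)          # least-squares estimate
shrinkage = g / (1.0 + g)
```

As g grows, the shrinkage factor tends to one, which is consistent with the requirement that $g_n$ diverge in the consistency result.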

Theorem 3. Let $\Pi(\beta_\gamma)$ be the g-prior on $\beta_\gamma$ from (8). Assume that $g_n$ in (8), Π(γ), and X satisfy pr(γ = γ0) > 0, $\lim_{n\to\infty} \|\{I_n - X_\gamma (X_\gamma^{T} X_\gamma)^{-1} X_\gamma^{T}\} X\beta_0\|^2 / n > 0$ for any γ not containing γ0, $g_n \to \infty$, and $\log(g_n)/n \to 0$, which are standard assumptions used in Fernández et al. (2001) and Liang et al. (2008) as detailed in § S4 of the Supplementary Material. Let y be distributed according to the data-generating model in (1) with β and η fixed to β0 and η0, respectively. Assume that $\hat{\Pi}(M^{T}\eta \mid S^{T}y)$ concentrates appropriately in that

$$\|M^{T}\eta - M^{T}\eta_0\|^2 / \log g_n \to 0, \tag{9}$$

in probability with respect to $M^{T}\eta \sim \hat{\Pi}(M^{T}\eta \mid S^{T}y)$ and y. Let $\hat{\Pi}(\beta \mid y)$ be as in (4). Then, $\widehat{\mathrm{pr}}(\gamma = \gamma_0 \mid y) \to 1$ in probability with respect to y as n → ∞.

5. Simulation studies and applications

5·1. Non-parametric adjustment for covariates

Consider the set-up from § 3·3 with g(a) = a² and q = 1. We assign $f: \mathbb{R} \to \mathbb{R}$ a zero-mean Gaussian process prior with a squared exponential covariance function such that $\mathrm{cov}\{f(z_i), f(z_j)\} = \exp\{-(z_i - z_j)^2/10\}$ (i, j = 1, …, n), and β ~ N(0, 16 I_p). Set n = 100, p = 3, and σ² = 1. We draw the rows of X independently from $N(0_{p\times 1}, \Phi)$ where Φ is a Toeplitz matrix defined so that its first row equals $(0.9^0, \ldots, 0.9^{p-1})$. Then, the columns of X are correlated. The feature zi (i = 1, …, n) equals the ith element of the first column of X. Generate y according to (1) with f equal to a draw from its prior distribution and β = (4, −4, 4)^T.
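The data-generating process of this simulation can be sketched as follows. The code assumes NumPy, an arbitrary seed, and a p × p Toeplitz matrix Φ whose first row is (0.9^0, 0.9^1, 0.9^2); it is an illustrative reconstruction of the described set-up, not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(4)     # arbitrary seed
n, p, sigma2 = 100, 3, 1.0
beta = np.array([4.0, -4.0, 4.0])

# Toeplitz covariance with entries 0.9^|i - j|.
idx = np.arange(p)
Phi = 0.9 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Phi, size=n)

z = X[:, 0]                                            # features z_i
K = np.exp(-(z[:, None] - z[None, :]) ** 2 / 10.0)     # squared exponential kernel

# Draw f ~ N(0, K) through an eigendecomposition. K is numerically rank
# deficient, so negative round-off eigenvalues are clipped to zero.
w, V = np.linalg.eigh(K)
f = V @ (np.sqrt(np.clip(w, 0.0, None)) * rng.standard_normal(n))

eta = f ** 2                                           # g(a) = a^2
y = X @ beta + eta + np.sqrt(sigma2) * rng.standard_normal(n)
```

Because g(a) = a² makes η non-negative, setting η to zero would bias the fit, which is the effect the comparison in this section examines.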

We approximate the posterior Π(β | y) using a random walk Metropolis-Hastings algorithm on f with 10,000 burn-in and 90,000 recorded iterations. We marginalize out β since Π(β | f, y) is analytically available, allowing approximation of π(β | y) with samples from Π(f | y). Algorithm 1 also provides $\hat{\Pi}(\beta \mid y)$ per § 3·3. Lastly, ignoring the non-parametric nuisance parameter by setting $\eta_i = (g \circ f)(z_i) = 0$ yields a simpler approximation. The Metropolis-Hastings algorithm took 6 minutes while our method finished in 2 seconds. The resulting posterior density estimates for βj (j = 1, …, p) are in Fig. 1. Taking the Metropolis-Hastings estimate as the gold standard, our method yields an approximation that matches the location and spread of the posterior better than the result from ignoring the non-parametric nuisance term η.

Fig. 1.


Marginal posterior density estimates from the simulation in § 5·1 with the solid line representing the Metropolis-Hastings estimate of π(βj | y), the thick dotted line the estimate $\hat{\pi}(\beta_j \mid y)$ from Algorithm 1, and the thin dotted line the estimate resulting from ignoring the nuisance parameter.

5·2. Bayesian variable selection

We consider the diabetes data from Efron et al. (2004) as it is a popular example of variable selection with collinear predictors (Park & Casella, 2008; Polson et al., 2013). The outcome y measures disease progression one year after baseline for n = 442 patients with diabetes. The r = 64 predictors come from 10 covariates with their squares and interactions. The outcome and predictors are standardized to have zero mean and unit norm. Consider the variable selection set-up from § 3·2 with prior inclusion probability λ = 1/2 and ψ = 1. Usually, one would not use scalable approximations for such a moderate-dimensional problem as a Gibbs sampler can provide accurate estimates. We include this example precisely because these accurate estimates enable assessment of the approximation accuracy of scalable methods.

We estimate the posterior inclusion probabilities $\mathrm{pr}(\theta_j \neq 0 \mid y)$ $(j = 1, \ldots, r)$ using five methods: 1) a Gibbs sampler with 10,000 burn-in and 90,000 recorded iterations; 2) Algorithm 1 as described in § 3·2 with vector approximate message passing in Step 2; 3) Algorithm 1 with the debiased lasso in Step 2; 4) expectation propagation as in Hernández-Lobato et al. (2015); and 5) variational Bayes as in Carbonetto & Stephens (2012). For Algorithm 1, we set p = 4, as suggested by p = O(log r), and parallelize across 8 CPU cores. To implement expectation propagation and variational Bayes, we used the R code from https://jmhl.org/publications/ dated January 2010 and the R package ‘varbvs’ version 2.5-7, respectively. Results from the variational Bayes algorithm by Ormerod et al. (2017) are omitted as the method from Carbonetto & Stephens (2012) outperforms it in the scenarios that we consider. Since the error variance is unknown, we assign it the prior $1/\sigma^2 \sim \mathrm{Ga}(1, 1)$, a gamma distribution with unit shape and rate parameter. The Gibbs sampler incorporates this prior. Algorithm 1 estimates $\sigma^2$ as described in § 2·4, and § S5·3 and § S6 of the Supplementary Material. Expectation propagation estimates $\sigma^2$ by maximizing approximate evidence (Hernández-Lobato et al., 2015). The R package ‘varbvs’ (Carbonetto & Stephens, 2012) uses approximate maximum likelihood for $\sigma^2$ within the variational Bayes method.

As discussed in § 3·2, determining whether posterior inclusion probabilities from a Gibbs sampler are accurate is non-trivial. Overlapping batch means (Flegal & Jones, 2010, § 3) estimates their average Monte Carlo standard error as 0.0015 in this application.
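As an illustration of how such a Monte Carlo standard error can be obtained, the sketch below implements the overlapping batch means estimator for the mean of a single chain, for instance the 0/1 indicators of θ_j ≠ 0 across Gibbs iterations. The function name and the default batch size of ⌊√n⌋ are assumptions made here for illustration; the batch size used in the article is not stated in this section.

```python
import numpy as np


def obm_standard_error(chain, batch_size=None):
    """Overlapping batch means estimate of the Monte Carlo standard error
    of the sample mean of a one-dimensional chain."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    # ASSUMPTION: default batch size floor(sqrt(n)), a common choice.
    b = int(np.floor(np.sqrt(n))) if batch_size is None else int(batch_size)

    # Means of all n - b + 1 overlapping batches, via cumulative sums.
    csum = np.concatenate(([0.0], np.cumsum(x)))
    batch_means = (csum[b:] - csum[:-b]) / b

    # Estimate of the asymptotic variance in the Markov chain CLT.
    scale = n * b / ((n - b) * (n - b + 1))
    var_hat = scale * np.sum((batch_means - x.mean()) ** 2)
    return np.sqrt(var_hat / n)


# Toy illustration on independent 0/1 draws (not output from the article).
rng = np.random.default_rng(0)
indicators = (rng.random(90_000) < 0.5).astype(float)
se = obm_standard_error(indicators)
```

Averaging the resulting standard errors over the r predictors gives a single summary of the kind reported above.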

Table 1 focuses on the errors in the posterior inclusion probability estimates. An approximation error of 0.01 is worse when the inclusion probability is 0.01 versus 0.5. We therefore transform the probabilities to log odds. Our method with vector approximate message passing outperforms expectation propagation and variational Bayes as its error is lowest in Table 1, though at a higher computational cost. Our method is slowest but still considerably faster than the Gibbs sampler which took 11 minutes to run. Since the debiased lasso yielded the worst approximation, we do not consider it in the remainder of this article.
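The effect of the log odds transformation can be checked with a short calculation. The helper names below are hypothetical. The example compares the same absolute error of 0.01 at a reference probability of 0.01 and at 0.5.

```python
import math


def log_odds(prob):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(prob / (1.0 - prob))


def log_odds_error(estimate, reference):
    """Absolute difference between two probabilities on the log odds scale."""
    return abs(log_odds(estimate) - log_odds(reference))


# The same absolute error of 0.01 at two different reference probabilities.
error_near_zero = log_odds_error(0.02, 0.01)
error_near_half = log_odds_error(0.51, 0.50)
```

On the log odds scale the first error is roughly 0.70 and the second roughly 0.04, which reflects the point made above that an error of 0.01 matters more for small inclusion probabilities. The log odds are undefined for estimates of exactly 0 or 1.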

Table 1.

Summary statistics of the absolute difference between the Gibbs sampler estimates and the approximations of the posterior log odds of inclusion for the application in § 5·2 with computation times. IRGA and VAMP stand for integrated rotated Gaussian approximation and vector approximate message passing, respectively.

| Method | Min | Q1 | Median | Q3 | Max | Mean | Computation time (seconds) |
|---|---|---|---|---|---|---|---|
| IRGA with VAMP | 0.003 | 0.036 | 0.076 | 0.133 | 10.7 | 0.599 | 4.1 |
| IRGA with the debiased lasso | 0.003 | 0.100 | 0.142 | 0.199 | 7.85 | 0.470 | 3.8 |
| Expectation propagation | 0.003 | 0.061 | 0.109 | 0.168 | 11.9 | 0.666 | 0.8 |
| Variational Bayes | 0.002 | 0.093 | 0.124 | 0.166 | 11.6 | 0.667 | 1.0 |

5·3. Controlling for single-nucleotide polymorphisms

The Geuvadis dataset from Lappalainen et al. (2013), available at https://www.ebi.ac.uk/Tools/geuvadis-das, contains gene expression data from lymphoblastoid cell lines of n = 462 individuals from the 1000 Genomes Project along with roughly 38 million SNPs. We focus on the gene E2F2, Ensembl ID ENSG00000007968, as it plays a key role in the cell cycle (Attwooll et al., 2004). Our focus is on assessing whether expression differs between populations, even after adjusting for genetic variation between individuals. Specifically, we compare people of British descent with the four other populations given in Table 2. If such differences occur, they can be presumed to be due to environmental factors that differ between these populations and that relate to E2F2 expression. We therefore consider the set-up from § 3·1 with y the E2F2 gene expressions, X demographic factors, and Z containing SNPs we would like to control for.

Table 2.

Posterior inclusion probabilities for the demographic factors from the application in § 5·3. IRGA stands for integrated rotated Gaussian approximation.

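| Method | Gender | Population: Utahn of European ancestry | Population: Finnish | Population: Tuscan | Population: Yoruba |
|---|---|---|---|---|---|
| IRGA | 0.83 | 0.96 | 0.96 | 0.92 | 0.00 |
| Ignoring the SNPs | 0.73 | 0.07 | 0.04 | 0.20 | 0.49 |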

The demographics in X are gender and the 4 populations with British as the reference group. The matrix X thus has p = 5 columns. The covariates Z consist of q = 10,000 SNPs selected using sure independence screening (Fan & Lv, 2008) as vector approximate message passing on all 38 million SNPs was infeasible. We standardize y and the columns of X and Z to have zero mean and unit variance. To complete the set-up from § 3·1, set λ = n/(10q) and ψ = 1/n for the spike-and-slab prior on α while Π(β) is a spike-and-slab with prior inclusion probability 1/2 and slab variance 1 such that, a priori, the SNPs do not capture more variation in the outcome than the demographic factors. This may provide a reasonable default for SNP data, but in other settings, hyperparameter values should be reconsidered. Vector approximate message passing estimates $\sigma^2$ using the prior $1/\sigma^2 \sim \mathrm{Ga}(1, 1)$ and employs damping to achieve convergence in this application, as described in § S5·3 and § S5·4 of the Supplementary Material, respectively.
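The screening step can be illustrated as follows. Sure independence screening in the sense of Fan & Lv (2008) ranks covariates by the magnitude of their marginal correlation with the outcome and retains the top-ranked ones. The sketch below is a minimal dense-matrix version with a hypothetical function name; it is not the authors' code, and screening 38 million SNPs in practice would require processing the columns in chunks.

```python
import numpy as np


def sure_independence_screening(Z, y, num_keep):
    """Return the sorted indices of the num_keep columns of Z with the
    largest absolute marginal correlation with y."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()

    # Column norms; constant columns receive a score of zero.
    norms = np.linalg.norm(Zc, axis=0)
    norms[norms == 0] = np.inf

    # Proportional to |corr(Z_j, y)|; the norm of yc is a common factor
    # that does not affect the ranking.
    scores = np.abs(Zc.T @ yc) / norms

    keep = np.argsort(scores)[::-1][:num_keep]
    return np.sort(keep)


# Toy illustration with simulated covariates (not the Geuvadis data).
rng = np.random.default_rng(1)
Z_toy = rng.standard_normal((200, 50))
y_toy = 3.0 * Z_toy[:, 7] - 2.0 * Z_toy[:, 21] + 0.1 * rng.standard_normal(200)
selected = sure_independence_screening(Z_toy, y_toy, num_keep=2)
```

In the application above, `num_keep` would equal q = 10,000 and the retained columns would form the matrix Z.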

Table 2 contains the resulting posterior inclusion probabilities for the demographic factors, both with and without controlling for the SNPs. The results differ markedly depending on whether SNPs are controlled for, with more evidence of a difference in the expression of gene E2F2 by population when controlling for SNPs using Algorithm 1. Section S10 of the Supplementary Material contains additional comparisons with other high-dimensional inference methods.

Section S8 of the Supplementary Material contains additional simulation studies. They further show that integrated rotated Gaussian approximation outperforms variational Bayes and either beats or is on par with expectation propagation in terms of approximation accuracy. This improved accuracy comes with increased computational cost for our method in certain scenarios.

6. Discussion

Although our focus was Bayesian inference, our method marginalizes out nuisance parameters from the likelihood for β as an intermediate step. This approximate likelihood from (6) can be useful in frequentist inference. It is well known that priors used in Bayesian inference correspond to penalties in frequentist inference. One can think of the log prior for the nuisance parameter η as a penalty on η. An L2 penalty might not be ideal due to the complex or high-dimensional nature of η. Instead, one might want to use sparsity-inducing penalties, such as L1 or the non-convex smoothly clipped absolute deviation, which come with attractive theoretical properties (Pötscher & Leeb, 2009) but can be computationally challenging. Our method obtains the marginal likelihood for β with such penalties on η, resolving the main computational bottleneck for frequentist inference on β in the model of interest in (1).

Acknowledgment

This work was partially supported by the National Institute of Environmental Health Sciences of the U.S. National Institutes of Health, the Singapore Ministry of Education Academic Research Fund, and the Laboratory for Analytic Sciences. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Laboratory for Analytic Sciences and/or any agency or entity of the United States Government.

Footnotes

Supplementary material

Supplementary material available at Biometrika online includes the proofs for § 4, a corollary to Theorem 1, details of vector approximate message passing and the Laplace approximation for § 3·3, and additional simulation studies. The R code for the numerical results is available at https://github.com/willemvandenboom/IRGA.

Contributor Information

W. VAN DEN BOOM, Yale-NUS College, National University of Singapore, 16 College Avenue West #01-220, Singapore 138527, Singapore

G. REEVES, Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708, U.S.A.

D. B. DUNSON, Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708, U.S.A.

References

  1. Attwooll C, Denchi EL & Helin K (2004). The E2F family: Specific functions and overlapping interests. The EMBO Journal 23, 4709–4716.
  2. Berger JO, Liseo B & Wolpert RL (1999). Integrated likelihood methods for eliminating nuisance parameters. Statistical Science 14, 1–28.
  3. Carbonetto P & Stephens M (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 7, 73–108.
  4. Diaconis P & Freedman D (1984). Asymptotics of graphical projection pursuit. Ann. Stat. 12, 793–815.
  5. Efron B, Hastie T, Johnstone I & Tibshirani R (2004). Least angle regression. Ann. Stat. 32, 407–499.
  6. Fan J & Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Statist. Soc. B 70, 849–911.
  7. Fernández C, Ley E & Steel MF (2001). Benchmark priors for Bayesian model averaging. J. Econom. 100, 381–427.
  8. Flegal JM & Jones GL (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. Ann. Stat. 38, 1034–1070.
  9. George EI & McCulloch RE (1993). Variable selection via Gibbs sampling. J. Am. Statist. Assoc. 88, 881–889.
  10. George EI & McCulloch RE (1997). Approaches for Bayesian variable selection. Stat. Sin. 7, 339–374.
  11. Golub GH & Van Loan CF (1996). Matrix Computations. Baltimore: Johns Hopkins University Press, 3rd ed.
  12. Hall P & Li K-C (1993). On almost linearity of low dimensional projections from high dimensional data. Ann. Stat. 21, 867–889.
  13. Hernández-Lobato JM, Hernández-Lobato D & Suárez A (2015). Expectation propagation in linear regression models with spike-and-slab priors. Mach. Learn. 99, 437–487.
  14. Huggins J, Adams RP & Broderick T (2017). PASS-GLM: Polynomial approximate sufficient statistics for scalable Bayesian GLM inference. In Advances in Neural Information Processing Systems 30. pp. 3611–3621.
  15. Javanmard A & Montanari A (2013). Confidence intervals and hypothesis testing for high-dimensional statistical models. In Advances in Neural Information Processing Systems 26. pp. 1187–1195.
  16. Lappalainen T, Sammeth M, Friedländer MR, ’t Hoen PAC, Monlong J, Rivas MA, González-Porta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlöf J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, Sultan M, Bertier G, MacArthur DG, Lek M, Lizano E, Buermans HPJ, Padioleau I, Schwarzmayr T, Karlberg O, Ongen H, Kilpinen H, Beltran S, Gut M, Kahlem K, Amstislavskiy V, Stegle O, Pirinen M, Montgomery SB, Donnelly P, McCarthy MI, Flicek P, Strom TM, Lehrach H, Schreiber S, Sudbrak R, Carracedo Á, Antonarakis SE, Häsler R, Syvänen A-C, van Ommen G-J, Brazma A, Meitinger T, Rosenstiel P, Guigó R, Gut IG, Estivill X & Dermitzakis ET (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511.
  17. Leeb H (2013). On the conditional distributions of low-dimensional projections from high-dimensional data. Ann. Stat. 41, 464–483.
  18. Liang F, Paulo R, Molina G, Clyde MA & Berger JO (2008). Mixtures of g priors for Bayesian variable selection. J. Am. Statist. Assoc. 103, 410–423.
  19. Meckes E (2012). Projections of probability distributions: A measure-theoretic Dvoretzky theorem. In Lecture Notes in Mathematics. Berlin: Springer, pp. 317–326.
  20. O’Hara RB & Sillanpää MJ (2009). A review of Bayesian variable selection methods: What, how and which. Bayesian Anal. 4, 85–117.
  21. Opper M & Winther O (2005). Expectation consistent approximate inference. J. Mach. Learn. Res. 6, 2177–2204.
  22. Ormerod JT, You C & Müller S (2017). A variational Bayes approach to variable selection. Electron. J. of Stat. 11, 3549–3594.
  23. Park T & Casella G (2008). The Bayesian lasso. J. Am. Statist. Assoc. 103, 681–686.
  24. Polson NG, Scott JG & Windle J (2013). The Bayesian bridge. J. R. Statist. Soc. B 76, 713–733.
  25. Pötscher BM & Leeb H (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. J. of Multivar. Anal. 100, 2065–2082.
  26. Rangan S, Schniter P & Fletcher AK (2017). Vector approximate message passing. In IEEE International Symposium on Information Theory. pp. 1588–1592.
  27. Reeves G (2017). Conditional central limit theorems for Gaussian projections. In IEEE International Symposium on Information Theory. pp. 3045–3049.
  28. Rue H, Martino S & Chopin N (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Statist. Soc. B 71, 319–392.
  29. Severini TA (2011). Frequency properties of inferences based on an integrated likelihood function. Stat. Sin. 21, 433–447.
  30. Song Q & Liang F (2014). A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression. J. R. Statist. Soc. B 77, 947–972.
  31. Stachurski J (2016). A Primer in Econometric Theory. Cambridge: MIT Press.
  32. Steinberg DM & Bonilla EV (2014). Extended and unscented Gaussian processes. In Advances in Neural Information Processing Systems 27. pp. 1251–1259.
  33. Tierney L & Kadane JB (1986). Accurate approximations for posterior moments and marginal densities. J. Am. Statist. Assoc. 81, 82–86.
  34. van den Boom W, Dunson D & Reeves G (2015). Quantifying uncertainty in variable selection with arbitrary matrices. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). pp. 385–388.
  35. Zellner A (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian inference and decision techniques: Essays in Honor of Bruno de Finetti, Goel PK & Zellner A, eds. Amsterdam: North-Holland/Elsevier, pp. 233–243.
