Author manuscript; available in PMC: 2026 May 14.
Published in final edited form as: Biometrics. 2026 Apr 9;82(2):ujag071. doi: 10.1093/biomtc/ujag071

Scalable Gaussian Process Regression via Median Posterior Inference for Estimating the Health Effects of an Environmental Mixture

Aaron Sonabend-W 1, Jiangshan Zhang 2, Edgar Castro 3, Joel Schwartz 4, Brent A Coull 5, Junwei Lu 6
PMCID: PMC13164939  NIHMSID: NIHMS2158246  PMID: 42118800

Summary:

Humans are exposed to complex mixtures of environmental pollutants rather than single chemicals, necessitating methods to quantify the health effects of such mixtures. Research on environmental mixtures provides insights into realistic exposure scenarios, informing regulatory policies that better protect public health. However, statistical challenges, including complex correlations among pollutants and nonlinear multivariate exposure-response relationships, complicate such analyses. A popular Bayesian semi-parametric Gaussian process regression framework (Coull et al., 2015) addresses these challenges by modeling exposure-response functions with Gaussian processes and performing feature selection to manage high-dimensional exposures while accounting for confounders. Originally designed for small to moderate-sized cohort studies, this framework does not scale well to massive datasets. To address this, we propose a divide-and-conquer strategy, partitioning data, computing posterior distributions in parallel, and combining results using the generalized median. While we focus on Gaussian process models for environmental mixtures, the proposed distributed computing strategy is broadly applicable to other Bayesian models with computationally prohibitive full-sample Markov Chain Monte Carlo fitting. We apply this method to estimate associations between a mixture of ambient air pollutants and 650,000 birthweights recorded in Massachusetts during 2001–2012. Our results reveal negative associations between birthweight and traffic pollution markers, including elemental and organic carbon and PM2.5, and positive associations with ozone and vegetation greenness.

Keywords: BKMR, Median Posterior, Multi-Pollutant Mixtures, Scalable Bayesian Inference, Semi-parametric regression

1. Introduction

Ambient air pollution consists of a heterogeneous mixture of multiple chemical components, with these components being generated by different sources. Therefore, quantification of the health effects of this mixture can yield important evidence on the source-specific health effects of air pollution, which has the potential to provide evidence to support targeted regulations for ambient pollution levels.

As is now well documented, several statistical challenges arise in estimating the health effects of an environmental mixture. First, the relationship between health outcomes and multiple pollutants can be complex, potentially involving non-linear and non-additive effects. Second, pollutant levels can be highly correlated, but only some may impact health. Therefore, models inducing sparsity are often advantageous. Feature engineering, such as basis expansions to allow interaction terms, can lead to high-dimensional inference. Alternatively, parametric models can be used; however, they require the analyst to impose a functional form, which can yield biased estimates in the likely case that the model is misspecified.

Several methods address the issues discussed above (Billionnet et al., 2012). A common approach to modelling the complex relationship between pollutants and outcomes is to use flexible models such as random forests, which have been shown to be consistent (Scornet et al., 2015), or universal approximators such as neural networks (Schmidhuber, 2015). These are useful but yield results that are hard to interpret: one cannot report the directionality or magnitude of the feature effect on the outcome. In this context, our interest lies in both prediction and interpretation. Another possible way to incorporate flexible multi-pollutant modelling is by clustering pollution-exposure levels and including clusters as covariates in parametric models. This approach essentially stratifies exposure levels, which results in an important loss of information. It ultimately forces the analyst to adapt the question of interest into one that can be solved by available tools, instead of tackling the relevant questions. A common approach to address the high dimensionality of multi-pollutant effects is to posit a generalized additive model. This allows one to estimate the association between a health outcome and a single pollutant, which can be repeated for every exposure of interest (Stieb et al., 2012). Flexible modelling such as quantile regression can be employed to deal with outliers and account for possible differences in associations across the health outcome (Fong et al., 2019). However, the clear downside is that incorporating multi-pollutant mixtures quickly makes this approach computationally infeasible. Alternatively, other parametric models (Gaskins et al., 2019; Joubert et al., 2022; Yu et al., 2022) have been used to evaluate the associations of interest, with the downside of imposing a functional form.

To enforce sparsity on the feature space, variable selection methods such as the least absolute shrinkage and selection operator (LASSO) penalty can be used (Tibshirani, 1996); however, to use such methods one must specify a parametric model, which brings us back to the likely misspecification scenario, in which estimated associations and causal effects may be biased.

A popular approach to simultaneously addressing these issues on small-scale data is the use of a semi-parametric Gaussian process model, often referred to as Bayesian kernel machine regression (BKMR) (Bobb et al., 2015; Coull et al., 2015). The pollutants-health outcome relationship is modelled through a Gaussian process, which allows for a flexible functional relationship between the pollutants and the outcome of interest. The model allows for feature selection among the pollutants to discard those with no estimable health effect and to account for high correlation among those with and without a health effect. This framework allows the incorporation of linear effects of baseline covariates, yielding an interpretable model.

Even though this framework is frequently employed in the multi-pollutant context, large datasets make it prohibitively slow as it involves Bayesian posterior calculation. To address this scalability challenge, we propose a divide-and-conquer approach in which we split the sample into subsets, compute the posterior distribution on each subset, and then combine the subset posteriors using the generalized median. This method allows capturing small effects from large datasets in little time. Our distributed algorithm is based on aggregating the median of the posteriors computed in parallel on the distributed datasets. Such a strategy is applicable not only to Gaussian process regression but also to a wider range of Bayesian methods. We provide theoretical guarantees for the convergence of the core Gaussian process component of the model, flexible to different function spaces. We then apply this scalable method to a challenging, large-scale dataset of 650,000 birthweights in Massachusetts to quantify the health effects of a mixture of ambient air pollutants.

2. Method

2.1. Semi-parametric Regression

Suppose we observe a sample of $n$ independent, identically distributed (i.i.d.) random vectors $S_n = \{D_i\}_{i=1}^n$, where $D_i = (Y_i, X_i, Z_i) \sim P_0$ with $X_i \in \mathcal{X} \subseteq \mathbb{R}^p$ a vector of possible confounders, and $Z_i \in \mathcal{Z} \subseteq \mathbb{R}^q$ a vector of environmental exposure levels. We will assume health outcome $Y$ has a linear relationship with confounders $X$ and a non-parametric relationship with exposure vector $Z$. In particular, for $D_i$ we assume the following semi-parametric relationship:

$$Y_i = X_i^\top \beta_0 + h_0(Z_i) + e_i, \tag{1}$$

where $e_i \sim \mathcal{N}(0, \sigma^2)$, and $h_0 : \mathcal{Z} \to \mathbb{R}$ is an unknown function which we allow to incorporate non-linearity and interaction among the pollutants. We require $h_0$ to be in an $\alpha$-Hölder space or to be infinitely differentiable. We formalize this in Section 3.

2.2. Prior Specification

To perform inference on h0, we will use a re-scaled Gaussian process prior (Williams and Rasmussen, 2019). In particular, we will use a squared exponential process equipped with an inverse Gamma bandwidth. That is, we will use prior

$$h_0(Z) \sim \mathcal{N}(0, K),$$

where $\mathrm{Cov}(Z, Z') = K(Z, Z'; \rho) = \exp\left\{-\frac{1}{\rho^2}\|Z - Z'\|_2^2\right\}$, and $1/\rho^q$ is a Gamma distributed random variable. We choose this kernel as it is flexible enough to approximate smooth functions, more so when the bandwidth parameter $\rho$ can be estimated from the data.

Alternatively, we can also augment the Gaussian kernel to allow for sparse solutions on the number of pollutants that contribute to the outcome (Bobb et al., 2015). Let the augmented covariance function be $\mathrm{Cov}(Z, Z') = K(Z, Z'; r) = \exp\left\{-\sum_{j=1}^q r_j (Z_j - Z_j')^2\right\}$. To select pollutants we assume a “slab-and-spike” prior on the selection variables $r_j \sim g_{r\mid\eta}$, with

$$g_{r\mid\eta}(r, \eta) = \eta f_1(r) + (1 - \eta) f_0, \qquad \eta \sim \mathrm{Bernoulli}(\pi),$$

where $f_1$ has support on $\mathbb{R}^+$ and $f_0$ is the point mass at 0. The random variables $\eta_j \sim \mathrm{Bernoulli}(\pi_j)$ can then be interpreted as indicators of whether exposure $Z_j$ is associated with the health outcome, and the variable importance for each exposure given the data is reflected by the posterior probability that $\eta_j = 1$. Finally, for simplicity, we will assume an improper flat prior on the linear component: $\beta \sim 1$. This linear component will capture the effects of confounders. We further use a Gamma prior distribution for the error term variance $\sigma^2$.
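As a minimal illustration of the two covariance functions above, the following NumPy sketch evaluates the augmented kernel on a small exposure matrix. The function name `augmented_kernel` and the toy values are illustrative assumptions, not part of the original analysis. Setting every $r_j = 1/\rho^2$ recovers the isotropic squared exponential kernel, and setting $r_j = 0$ removes exposure $j$ from the kernel, which is the role played by the point mass $f_0$.

```python
import numpy as np

def augmented_kernel(Z, r):
    """Return the matrix with entries exp(-sum_j r[j] * (Z[i, j] - Z[l, j])**2)."""
    Z = np.asarray(Z, dtype=float)
    r = np.asarray(r, dtype=float)
    diff = Z[:, None, :] - Z[None, :, :]            # pairwise differences, shape (n, n, q)
    return np.exp(-np.einsum("ilj,j->il", diff ** 2, r))

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 4))                     # 5 toy observations, q = 4 exposures

# Isotropic case: r_j = 1 / rho**2 for every exposure.
rho = 1.5
K_iso = augmented_kernel(Z, np.full(4, 1.0 / rho ** 2))

# Sparse case: exposures 3 and 4 are switched off (r_j = 0).
K_sparse = augmented_kernel(Z, np.array([1.0, 0.5, 0.0, 0.0]))
K_two = augmented_kernel(Z[:, :2], np.array([1.0, 0.5]))
```

In the sparse case the kernel matrix is identical to the one built from the first two exposures only, so the excluded exposures cannot influence the fitted surface.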

2.3. Estimation

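Let $h = (h_0(Z_1), \ldots, h_0(Z_n))^\top$. Liu et al. (2007) have shown that model (1) can be expressed as

$$Y_i \sim \mathcal{N}\left(h_0(Z_i) + X_i^\top \beta_0,\ \sigma^2\right), \qquad h \sim \mathcal{N}(0, \tau K).$$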

This will allow us to simplify our inference procedure and split the problem into tractable posterior estimation (Bobb et al., 2015) for each component of interest. In particular, we can use Gibbs steps to sample the conditionals for $\beta$, $\sigma^2$ and $h$ analytically. Letting $\lambda = \tau/\sigma^2$, we use a Metropolis-Hastings step for the parameters whose conditionals are not available in closed form. The full set of posteriors is given in equation (2).

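$$
\begin{aligned}
\beta \mid \sigma^2, \lambda, r, Y &\sim N\left(V_\beta X^\top V_{\lambda,Z,r}^{-1} Y,\ \sigma^2 V_\beta\right),\\
\sigma^2 \mid \beta, \lambda, r, Y &\sim \mathrm{Gamma}\left(\alpha_\sigma + \tfrac{n}{2},\ b_\sigma + \tfrac{1}{2} WSS_{\beta,\lambda,r}\right),\\
h \mid \beta, \sigma^2, \lambda, r, Y, X, Z &\sim N\left(\lambda K_{Z,r} V_{\lambda,Z,r}^{-1} (Y - X\beta),\ \sigma^2 \lambda K_{Z,r} V_{\lambda,Z,r}^{-1}\right),\\
f(r, \eta \mid \beta, \sigma^2, \lambda, y) &\propto \Gamma\Big(\sum_j \eta_j + a_\pi\Big)\, \Gamma\Big(q - \sum_j \eta_j + b_\pi\Big) \prod_{j=1}^q f(r_j \mid \eta_j),\\
f(\lambda \mid \beta, r, \eta, Y, X, Z) &\propto \left|V_{\lambda,Z,r}\right|^{-1/2} \exp\left\{-\frac{1}{2\sigma^2} WSS_{\beta,\lambda,r}\right\} \mathrm{Gamma}(\lambda \mid a_\lambda, b_\lambda),
\end{aligned}
\tag{2}
$$

where $V_{\lambda,Z,r} = I_n + \lambda K_{Z,r}$, $V_\beta = \left(X^\top V_{\lambda,Z,r}^{-1} X\right)^{-1}$, and $WSS_{\beta,\lambda,r} = (Y - X\beta)^\top V_{\lambda,Z,r}^{-1} (Y - X\beta)$.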
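The closed-form conditionals for $\beta$ and $h$ in (2) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`beta_conditional`, `h_conditional`, `draw_gaussian`), the toy data, and the fixed values of $\lambda$ and $\sigma^2$ are all hypothetical, and the Metropolis-Hastings updates are omitted.

```python
import numpy as np

def beta_conditional(Y, X, V, sigma2):
    """Mean and covariance of beta | sigma2, lambda, r, Y from (2)."""
    V_beta = np.linalg.inv(X.T @ np.linalg.solve(V, X))   # (X^T V^{-1} X)^{-1}
    mean = V_beta @ (X.T @ np.linalg.solve(V, Y))
    return mean, sigma2 * V_beta

def h_conditional(Y, X, K, V, lam, sigma2, beta):
    """Mean and covariance of h | beta, sigma2, lambda, r, Y, X, Z from (2)."""
    # K and V = I + lam * K are symmetric and commute, so K V^{-1} = (V^{-1} K)^T.
    A = lam * np.linalg.solve(V, K).T
    return A @ (Y - X @ beta), sigma2 * A

def draw_gaussian(mean, cov, rng):
    """Draw from N(mean, cov) via an eigendecomposition, clipping round-off negatives."""
    cov = (cov + cov.T) / 2.0
    w, U = np.linalg.eigh(cov)
    scale = np.sqrt(np.clip(w, 0.0, None))
    return mean + U @ (scale * rng.standard_normal(len(mean)))

# Toy data; every value below is hypothetical.
rng = np.random.default_rng(0)
n, lam, sigma2 = 30, 2.0, 0.5
Z = rng.standard_normal((n, 2))
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Y = X @ np.array([1.0, 2.0]) + np.sin(Z[:, 0]) + rng.normal(0.0, np.sqrt(sigma2), n)
K = np.exp(-((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
V = np.eye(n) + lam * K                                    # V_{lambda,Z,r}

b_mean, b_cov = beta_conditional(Y, X, V, sigma2)
beta = draw_gaussian(b_mean, b_cov, rng)                   # Gibbs draw for beta
h_mean, h_cov = h_conditional(Y, X, K, V, lam, sigma2, beta)
h = draw_gaussian(h_mean, h_cov, rng)                      # Gibbs draw for h
```

Each call solves linear systems in the $n \times n$ matrix $V_{\lambda,Z,r}$, which is the step that becomes expensive as $n$ grows.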

To perform inference for functions of interest in (2), we will use Markov Chain Monte Carlo (MCMC) techniques. Furthermore, even though function h has a closed-form posterior, large samples will require large matrix inversions.

Posterior sampling can be challenging for this model. This is particularly true for sampling $h$, as the posterior Gaussian process is $n$-dimensional. First, in practice the number of burn-in samples required before the chain reaches the true posterior increases significantly with dimension. Second, the problem worsens as the sample size grows. The computational cost for each iteration is $\mathcal{O}(n^3)$, since to sample from the posterior of $h$ we need to compute the inverse of an $n \times n$ kernel matrix indicated in (2). This renders the method prohibitively slow for real applications on large data sets. On the other hand, these are precisely the data sets needed, as they can actually shed light on the small effects of environmental mixtures on health outcomes. This predicament motivates the development of a computationally fast version of inference for (2), and particularly for $h$.

2.4. Fast Inference on Posteriors via Sub-sampling

In order to make posterior sampling computationally feasible, we propose a sample splitting technique which is guaranteed to satisfy the needed theoretical properties. Our approach consists of computing multiple noisy versions of the posteriors we are interested in, and using the median of these as a proxy for the full data posterior.

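First, we randomly split the entire data set, without replacement, into several disjoint subsets of roughly equal sample size. Let $\{S_k\}_{k=1}^K$ denote a random partition of $S_n$ into $K$ disjoint subsets of size $n_k = n/K$ with index sets $\{\mathcal{I}_k\}_{k=1}^K$. Then for each subset $S_k$, we run a modified version of the estimation approach described in Section 2.3 using sub-sampling sketching matrices $S_k \in \mathbb{R}^{n \times n_k}$ (with a slight abuse of notation, $S_k$ denotes both the subset and its sketching matrix). This will yield $K$ posterior distributions for each parameter and function in (2).

We define the $n \times n_k$ sketching matrix $S_k$ with its $i$th column $S_{k,i} = \sqrt{K}\, p_i$, where $p_i$ is uniformly sampled from the columns of the identity matrix. Using $S_k$ we denote by $\tilde{V}_k$ and $\tilde{A}_k$ the transformation of any vector $V$ and matrix $A$, respectively: $\tilde{V}_k = S_k^\top V$, $\tilde{A}_k = S_k^\top A S_k$. Specifically, $\tilde{h}_k = S_k^\top h$, $\tilde{Y}_k = S_k^\top Y$, $\tilde{X}_k = S_k^\top X$, and $\tilde{K}_k = S_k^\top K S_k$. We can then redefine model (1) for sample $S_k$ as

$$\tilde{Y}_k \sim N\left(\tilde{h}_k + \tilde{X}_k \beta,\ \sigma^2 S_k^\top S_k\right), \qquad \tilde{h}_k \sim N\left(0, \tau \tilde{K}_k\right). \tag{3}$$

We then apply our inference from Section 2.3 to the above by using $\tilde{V}_\beta^{(k)} = \left(\tilde{X}_k^\top \big(\tilde{V}_{\lambda,Z,r}^{(k)}\big)^{-1} \tilde{X}_k\right)^{-1}$, $\tilde{V}_{\lambda,Z,r}^{(k)} = S_k^\top S_k + \lambda \tilde{K}_k$, and $\widetilde{WSS}_{\beta,\lambda,r}^{(k)} = \big(\tilde{Y}_k - \tilde{X}_k \beta\big)^\top \big(\tilde{V}_{\lambda,Z,r}^{(k)}\big)^{-1} \big(\tilde{Y}_k - \tilde{X}_k \beta\big)$ in (2), and sample from each of the $K$ posteriors.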
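To make the sub-sampling sketch concrete, the snippet below builds sketching matrices for a random partition and checks that applying them is the same as selecting and rescaling rows and columns. It is a hedged sketch: the function name `subsampling_sketches` is hypothetical, and the column scale is exposed as a parameter whose default of $\sqrt{K}$ is an assumption about the intended normalisation.

```python
import numpy as np

def subsampling_sketches(n, K, rng, scale=None):
    """Randomly partition {0, ..., n-1} into K index sets and build one
    n x n_k sketching matrix per set, with columns scale * e_i."""
    scale = np.sqrt(K) if scale is None else scale         # assumed normalisation
    index_sets = np.array_split(rng.permutation(n), K)
    sketches = []
    for idx in index_sets:
        S = np.zeros((n, len(idx)))
        S[idx, np.arange(len(idx))] = scale                # column i picks observation idx[i]
        sketches.append(S)
    return index_sets, sketches

rng = np.random.default_rng(0)
n, K = 12, 3
Zs = rng.standard_normal((n, 2))
Kmat = np.exp(-((Zs[:, None, :] - Zs[None, :, :]) ** 2).sum(-1))   # toy kernel matrix
Y = rng.standard_normal(n)

index_sets, sketches = subsampling_sketches(n, K, rng)
idx, S = index_sets[0], sketches[0]
Y_tilde = S.T @ Y                   # sketched outcome for the first subset
K_tilde = S.T @ Kmat @ S            # sketched kernel matrix for the first subset
```

In practice one would index the arrays directly rather than form $S_k$ explicitly; the explicit matrices are shown only to mirror the notation.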

Let $\Theta$ be a parameter space and let $\{P_\theta : \theta \in \Theta\}$ be a set of probability measures on $\mathbb{R}^q$ indexed by $\Theta$, each absolutely continuous with respect to the Lebesgue measure $\mu$. For any i.i.d. random vectors $D_1, \ldots, D_n \sim P_0$, let $P_0 \equiv P_{\theta_0}$. Bayesian inference usually consists of specifying a prior distribution $\Pi$ for $\theta$, and using sample $S_n$ to compute a posterior distribution for $\theta$ defined as

$$\Pi_n(\theta \mid S_n) = \frac{\prod_{i=1}^n p_\theta(D_i)\, \Pi(\theta)}{\int_\Theta \prod_{i=1}^n p_\theta(D_i)\, \Pi(\theta)},$$

where $p_\theta\, d\mu = dP_\theta$.

Note that this definition is general enough that $\theta$ can be any parameter in (2) as well as the function $h_0$, in which case the prior $\Pi(\theta)$ is a Gaussian process. Thus, we compute $\Pi_k(\theta \mid \mathcal{I}_k)$ for each split $k = 1, \ldots, K$ and each parameter of interest. Naturally, the posterior $\Pi_k(\theta \mid \mathcal{I}_k)$ will be a noisy approximation of $\Pi_n(\theta \mid S_n)$. We denote $\Pi_k = \Pi_k(\theta \mid \mathcal{I}_k)$ for notational simplicity. We combine the $\Pi_k$ using the geometric median. This aggregation framework is general and not restricted to parametric models; it can be applied to probability measures on abstract spaces, including non-parametric function spaces. To formalize this, we first define the geometric median, which is a multi-dimensional generalization of the univariate median (Minsker, 2015).

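To construct the geometric median posterior, first define $\psi$ to be the Hellinger metric on $\Theta$ such that, for $\theta_1, \theta_2 \in \Theta$,

$$\psi(\theta_1, \theta_2) = \sqrt{\frac{1}{2} \int_{\mathbb{R}^q} \left(\sqrt{p_{\theta_1}(x)} - \sqrt{p_{\theta_2}(x)}\right)^2 d\mu(x)},$$

and let $(\Theta, \psi)$ be a separable metric space. Now, let $(\mathbb{H}, \langle \cdot, \cdot \rangle_{\mathbb{H}})$ be the Reproducing Kernel Hilbert Space of functions $f : \Theta \to \mathbb{R}$ with characteristic reproducing kernel $\kappa : \Theta \times \Theta \to \mathbb{R}$ defined as the $L_2$ inner product of $p_{\theta_1}$ and $p_{\theta_2}$ with respect to $\mu$. Then, let $\mathbb{H}_1$ be defined as the unit ball in $\mathbb{H}$, that is, $\mathbb{H}_1 = \{f \in \mathbb{H} : \|f\|_{\mathbb{H}} \le 1\}$.

Now, let $\mathcal{P}$ be the space of probability measures over $\Theta$ and denote the space

$$\mathcal{P}_\kappa = \left\{\Pi \in \mathcal{P} : \int_\Theta \sqrt{\kappa(x, x)}\, d\Pi(x) < \infty\right\}.$$

We define the distance between probability measures $\Pi_1, \Pi_2 \in \mathcal{P}_\kappa$ by

$$\|\Pi_1 - \Pi_2\|_{\mathbb{H}_1} = \left\|\int_\Theta \kappa(x, \cdot)\, d(\Pi_1 - \Pi_2)(x)\right\|_{\mathbb{H}}.$$

With this notion of distance for priors and posteriors on $\Theta$, we now define the geometric median by

$$\Pi_n = \arg\min_{\Pi \in \mathcal{P}_\kappa} \sum_{k=1}^K \|\Pi - \Pi_k\|_{\mathbb{H}_1}. \tag{4}$$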

It is important to note that this definition operates on a general parameter space Θ, which in our case is a non-parametric function space. Our approach applies this general, non-parametric aggregation theory directly to the posterior measures for the function h0.

As shown in Minsker et al. (2017), the geometric median $\Pi_n$ is robust to outlier observations, so our estimation procedure inherits this robustness.

The convergence rate will be in terms of $n_k$; however, this rate improves geometrically with $K$ with respect to the rate at which the $K$ estimators are weakly concentrated around the true parameter (Minsker et al., 2017).

As the median function $\Pi_n$ is generally analytically intractable, an achievable solution is to estimate $\Pi_n$ from samples of the subset posteriors. We can approximate the median function by assuming that subset posterior distributions are empirical measures and that their atoms can be simulated from the subset posteriors by a sampler (Srivastava et al., 2015). Let $\theta_{k1}, \ldots, \theta_{kN}$ be $N$ samples of parameters $\theta$ obtained from the subset posterior distribution $\Pi_n^{(k)}$. In our method, samples can be directly generated from the subset posteriors by using an MCMC sampler. Then we can approximate $\Pi_n^{(k)}$ by the empirical measure corresponding to $\theta_{k1}, \ldots, \theta_{kN}$, which is defined as:

$$\hat{\Pi}_n^{(k)}(\cdot) = \sum_{i=1}^N \frac{1}{N}\, \delta_{\theta_{ki}}(\cdot), \qquad (k = 1, \ldots, K), \tag{5}$$

where $\delta_{\theta_{ki}}(\cdot)$ is the Dirac measure concentrated at $\theta_{ki}$. In order to approximate the subset posterior accurately, we need to make $N$ large enough. Then the empirical probability measure of the median function, $\hat{\Pi}_{n,g}$, can be approximated by estimating the geometric median of the empirical probability measures of the subset posteriors. Using samples from the subset posteriors, the empirical probability measure of the median function is defined as:

$$\hat{\Pi}_{n,g}(\cdot) = \sum_{k=1}^K \sum_{i=1}^N a_{ki}\, \delta_{\theta_{ki}}(\cdot), \qquad 0 \le a_{ki} \le 1, \qquad \sum_{k=1}^K \sum_{i=1}^N a_{ki} = 1, \tag{6}$$

where $a_{ki}$ are unknown weights of atoms. Here the problem of combining subset posteriors to give a probability measure is reduced to estimating $a_{ki}$ in (6) for all the atoms across all subset posteriors. Fortunately, $a_{ki}$ can be estimated by solving the optimization problem in (4) via the kernel trick, with posterior distributions restricted to the atomic forms in (5) and (6). There are several different algorithms to solve this, such as Bose et al. (2003) and Cardot et al. (2013). We use an efficient algorithm developed by Minsker et al. (2017), which is summarized as its Algorithm 2 and is based on Weiszfeld’s algorithm.
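The weight estimation described above can be sketched as a Weiszfeld-type iteration over subset-level weights. The code below is a simplified illustration and not the algorithm of Minsker et al. (2017) verbatim: it assumes a Gaussian (RBF) kernel on the atoms with a user-chosen `bandwidth` standing in for the kernel $\kappa$, it assigns one weight $w_k$ per subset so that $a_{ki} = w_k / N$, and the function names are hypothetical.

```python
import numpy as np

def mean_rbf(A, B, bandwidth):
    """Average RBF kernel value between two sets of atoms (rows are draws)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2)).mean()

def median_posterior_weights(samples, bandwidth=1.0, n_iter=200, tol=1e-8, eps=1e-12):
    """Weiszfeld-type iteration for subset weights w_k of the median measure."""
    K = len(samples)
    G = np.empty((K, K))                      # G[j, k] = <embedding of Pi_j, embedding of Pi_k>
    for j in range(K):
        for k in range(j, K):
            G[j, k] = G[k, j] = mean_rbf(samples[j], samples[k], bandwidth)
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # Squared RKHS distance between Q = sum_j w_j Pi_j and each Pi_k.
        d2 = w @ G @ w - 2.0 * (G @ w) + np.diag(G)
        inv_d = 1.0 / np.sqrt(np.maximum(d2, eps))
        w_new = inv_d / inv_d.sum()           # reweight inversely to distance
        converged = np.abs(w_new - w).sum() < tol
        w = w_new
        if converged:
            break
    return w

# Toy example: four similar subset posteriors and one outlying subset posterior.
rng = np.random.default_rng(1)
subset_draws = [rng.normal(0.0, 1.0, size=(200, 2)) for _ in range(4)]
subset_draws.append(rng.normal(10.0, 1.0, size=(200, 2)))
w = median_posterior_weights(subset_draws)
atom_weights = np.repeat(w / 200, 200)        # a_ki = w_k / N for every atom
```

In this toy run the outlying subset receives the smallest weight, which illustrates the robustness property of the median posterior.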

Algorithm 1.

Fast Posterior Inference Via Sub-sampling.

REQUIRE: observed sample $S_n = \{D_i\}_{i=1}^n$, number of subsets $K$, parameter sample size $N$
 Randomly partition $S_n$ without replacement into $K$ subsets $\{S_k\}_{k=1}^K$ of size $n_k$
 FOR each $S_k \in \{S_k\}_{k=1}^K$ DO
  1. Get the index set $\mathcal{I}_k$ for $S_k$
  2. Get the sub-sampling sketching matrix $S_k$
  3. Run MCMC sampling on the modified model described by (2) and (3) to generate parameter samples $\theta_{k1}, \ldots, \theta_{kN}$
 Solve the optimization problem in (4) using Weiszfeld’s algorithm with (5)–(6)
 RETURN the empirical approximation of the posterior median function, $\hat{\Pi}_{n,g}$

Algorithm 1 provides a sample splitting approach to decrease computational complexity for posterior inference. Sampling $\beta$, $\sigma^2$, $\lambda$ and $h$ requires computing the kernel matrix $K$ and inverting $V_{\lambda,Z,r}$, which translates into $O(n^2 q)$ and $O(n^3)$ operations respectively per iteration. For $\lambda$ we also need to compute the determinant $|V_{\lambda,Z,r}|$, which is $O(n^3)$. Bobb et al. (2015) recommend using at least $10^4$ iterations, which translates into $O(10^4 n^3)$ operations (assuming $q \ll n$). There is a clear trade-off between a large number of splits $K$, which decreases computational complexity, and using the whole sample ($K = 1$), which yields better inference. For example, choosing $K = n^{1/2}$ gives subsets of size $n_k = n^{1/2}$ and a computational complexity of $O(10^4 n^{3/2})$ per subset for Algorithm 1. Next, in Section 3, we discuss the posterior convergence rate in a simpler, special case, and its dependence on the number of splits in detail.
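A short arithmetic sketch may help fix the orders of magnitude. It uses $n_k^3$ as a proxy for the per-iteration cost and ignores constants and the $10^4$ iteration factor; the function name `cubic_cost` and the round numbers are illustrative assumptions.

```python
def cubic_cost(n, K=1):
    """Return (cost proxy for one subset, total proxy if the K subsets run sequentially)."""
    n_k = n // K                  # subset size
    per_subset = n_k ** 3         # cubic cost of one n_k x n_k inversion
    return per_subset, K * per_subset

n = 1_000_000
full, _ = cubic_cost(n)                          # K = 1: 10**18
per_subset, sequential = cubic_cost(n, K=1000)   # K = n**0.5: 10**9 per subset, 10**12 in total
```

With $K = n^{1/2}$ the per-subset cost is $n^{3/2}$; running the subsets one after another costs $n^2$ in total, while running them in parallel leaves the wall-clock cost at $n^{3/2}$.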

3. Theoretical Results

In this section we present the assumptions needed for our theoretical results, state our main theorem and discuss its implications. Our results focus on estimation of $h_0$, as this is the main function of interest and our main contribution, and do not extend to the variable selection parameters $r_1, \ldots, r_q$. A complete theoretical analysis of the convergence for the distributed “slab-and-spike” model remains a non-trivial challenge for future work. In the rest of this section, we will consider the revised model obtained by removing the variable selection parameters $r_1, \ldots, r_q$ in (2) and focus on the estimation of $h_0$ only.

The proof structure logically layers two distinct nonparametric theories: first, we use established results on Gaussian Process posterior contraction rates for a single subset (e.g., van der Vaart and van Zanten (2009)); second, we use these rates as inputs for the general, nonparametric theory of median posterior convergence (Minsker et al., 2017) to derive the final rate for our combined estimator. By proving the convergence of the nonparametric component h0, our theorem provides the necessary theoretical guarantee for the semi-parametric model.

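We first assume that the confounder and pollution exposure spaces $\mathcal{X}$ and $\mathcal{Z}$, respectively, are compact sets. This is easily satisfied in practice. Next we define two function spaces. Letting $\alpha > 0$, we define $\mathcal{C}^\alpha[0,1]^q$ to be the Hölder space of smooth functions $h : [0,1]^q \to \mathbb{R}$ which have uniformly bounded derivatives up to order $\lfloor \alpha \rfloor$, and whose highest partial derivatives are Lipschitz of order $\alpha - \lfloor \alpha \rfloor$. More precisely, we define the vector of $q$ integers $v = (v_1, \ldots, v_q)$ and

$$D^v h(z) = \frac{\partial^{\sum_i v_i} h(z)}{\partial z_1^{v_1} \cdots \partial z_q^{v_q}}.$$

Then for function $h$ we define

$$\|h\|_\alpha = \max_{\sum_i v_i \le \lfloor \alpha \rfloor} \sup_z \left|D^v h(z)\right| + \max_{\sum_i v_i = \lfloor \alpha \rfloor} \sup_{z \ne z'} \frac{\left|D^v h(z) - D^v h(z')\right|}{\|z - z'\|^{\alpha - \lfloor \alpha \rfloor}}.$$

With the above we say that

$$\mathcal{C}^\alpha[0,1]^q = \left\{h : [0,1]^q \to \mathbb{R} : \|h\|_\alpha < M\right\}$$

(Vaart, 1996).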

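Note that $\mathcal{C}^\alpha[0,1]^q$ might be too large a space, as it is highly flexible in terms of differentiability restrictions. In light of this, if we only consider smooth functions, we introduce the following space.

Let the Fourier transform be $\hat{h}(\lambda) = \frac{1}{(2\pi)^q} \int e^{i\langle \lambda, t \rangle} h(t)\, dt$ and define

$$\mathcal{A}^{\gamma,r}(\mathbb{R}^q) = \left\{h : \mathbb{R}^q \to \mathbb{R} : \int e^{\gamma \|\lambda\|^r} \left|\hat{h}(\lambda)\right|^2 d\lambda < \infty\right\}.$$

The set $\mathcal{A}^{\gamma,r}(\mathbb{R}^q)$ contains infinitely differentiable functions which are increasingly smooth as $\gamma$ or $r$ increase (van der Vaart and van Zanten, 2009).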

Theorem 1: Let $\delta_0(h_0)$ be the Dirac measure concentrated at $h_0$. For any $\delta \in (0,1)$, there exists a constant $C_1$ such that if we choose the number of splits $K \ge C_1 \log(1/\delta)$, then with probability at least $1 - \delta$, we have

$$
\left\|\hat{\Pi}_{n,g} - \delta_0(h_0)\right\| \le
\begin{cases}
C_2\, (n/\delta)^{-\frac{\alpha}{2\alpha+q}} \left(\log(n/\delta)\right)^{\frac{4\alpha+q}{4\alpha+2q}} & \text{if } h_0 \in \mathcal{C}^\alpha[0,1]^q,\\
C_3\, (n/\delta)^{-\frac{1}{2}} \left(\log(n/\delta)\right)^{q+1+q/(2r)} & \text{if } h_0 \in \mathcal{A}^{\gamma,r}(\mathbb{R}^q) \text{ and } r < 2,\\
C_3\, (n/\delta)^{-\frac{1}{2}} \left(\log(n/\delta)\right)^{q+1} & \text{if } h_0 \in \mathcal{A}^{\gamma,r}(\mathbb{R}^q) \text{ and } r \ge 2,
\end{cases}
$$

where $C_2, C_3$ are sufficiently large constants.

The proof follows from results on convergence of the posterior median and properties of the scaled squared exponential Gaussian process. We defer the proof to the Supplementary Material. The rate in Theorem 1 is achieved for all levels of regularity $\alpha$ simultaneously. If $h_0 \in \mathcal{C}^\alpha[0,1]^q$, then the adaptive rate is $\tilde{\mathcal{O}}\left((n/\delta)^{-\alpha/(2\alpha+q)}\right)$; however, if we further assume $h_0$ is infinitely differentiable, then $h_0 \in \mathcal{A}^{\gamma,r}(\mathbb{R}^q)$ and we recover the usual $\tilde{\mathcal{O}}(n^{-1/2})$ rate. Intuitively, understanding $\alpha$ as the number of derivatives of $h_0$, this $n^{-1/2}$ rate is obtained by letting $\alpha \to \infty$. Theorem 1 sheds light on the trade-off involved in choosing the number of splits $K$: a large $K$ negatively impacts the statistical rate as it slows down convergence, but it reduces computational complexity. Finally, dimension $q$ affects the rate on a logarithmic scale if $h_0$ is infinitely differentiable; in the case that $h_0 \in \mathcal{C}^\alpha[0,1]^q$, $q$ has a larger effect on the rate. This trade-off is further illustrated in Section 4.

4. Simulation Results

To study our method’s empirical performance in finite samples, we evaluated it in several simulation settings. The simulated data were generated with the following procedure. We generated data sets with $n$ observations, $S_n = \{D_i\}_{i=1}^n$, $D_i = (y_i, x_i, z_i)$, where $z_i = (z_{i1}, \ldots, z_{iq})$ is the profile for observation $i$ with $q$ mixture components and $x_i$ is a confounder of the mixture profile generated by $x_i \sim N(3\cos(z_{i1}), 2)$. The outcomes were generated by $y_i \sim N(\beta_0 x_i + h_0(z_i), \sigma^2)$. Given that the main focus is on how the algorithm performs for large $n$, we considered $q = 4$ exposures, and each exposure vector in $\{z_i\}_{i=1}^n$ was obtained by sampling each component $z_{i1}, \ldots, z_{i4}$ from the standard normal distribution $N(0,1)$.

We considered the mixture-response function $h_0(\cdot)$ as a non-linear and non-additive function of only $(z_{i1}, z_{i2})$ with an interaction. In particular, let $\phi(x) = 1/(1 + e^{-x})$. We generated $h_0$ as

$$h_0(z_i) = 4\phi\left(\frac{5}{6}\left\{z_{i1} + z_{i2} + \frac{1}{2} z_{i1} z_{i2}\right\}\right).$$

We set $\beta_0 = 2$ and $\sigma^2 = 0.5$. We considered the total number of samples $n \in \{512, 1024, 2048, 4096\}$, and the number of splits $K = n^t$, with $t \in \{0, 0.05, 0.1, 0.15, \ldots, 0.7\}$. Note that for $n = 512$, the subset sample size is not large enough to run the MCMC sampler when $t = 0.7$. Therefore, the simulation under the combination $n = 512$, $t = 0.7$ was not performed. Each simulation setting is replicated 300 times. All computations are performed on a server with 2.6GHz 10-core compute nodes, with 15000MB memory.
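The data-generating process above can be reproduced with a short NumPy sketch. Two readings are assumptions here and are flagged in the comments: the second argument of each normal distribution is treated as a variance, and the factor 5/6 is taken to multiply the whole bracket inside $\phi$. The function names are hypothetical.

```python
import numpy as np

def phi(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def h0(Z):
    """Mixture-response surface; depends only on the first two exposures.
    Assumption: 5/6 scales the full term z1 + z2 + 0.5 * z1 * z2."""
    z1, z2 = Z[:, 0], Z[:, 1]
    return 4.0 * phi((5.0 / 6.0) * (z1 + z2 + 0.5 * z1 * z2))

def simulate(n, q=4, beta0=2.0, sigma2=0.5, seed=0):
    """Generate (y, x, Z) following the simulation design described above."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, q))                         # exposures ~ N(0, 1)
    # Assumption: the "2" in N(3 cos(z_i1), 2) is a variance, so sd = sqrt(2).
    x = rng.normal(3.0 * np.cos(Z[:, 0]), np.sqrt(2.0))
    y = rng.normal(beta0 * x + h0(Z), np.sqrt(sigma2))      # outcome
    return y, x, Z

y, x, Z = simulate(512)
```

Since $\phi$ takes values in $(0, 1)$, the surface $h_0$ lies in $(0, 4)$ and equals 2 at the origin.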

4.1. Assessing Performance Across Split Levels

Figures 1 and 2 show the method’s performance in approximating $h_0$ by the median posterior. To evaluate performance, we ran a linear regression of the true $h_0$ on the posterior mean $\hat{h}$, i.e. $h_0(z_i) = \gamma_0 + \gamma_1 \hat{h}(z_i) + \epsilon_i$, $i = 1, \ldots, n$, and plot the estimated intercept $\hat{\gamma}_0$, slope $\hat{\gamma}_1$ and $R^2$ while varying the number of sample splits $K$. A good $\hat{h}(\cdot)$ would yield $\hat{\gamma}_0 = 0$, $\hat{\gamma}_1 = 1$, $R^2 = 1$, as $h_0(z) \approx \hat{h}(z)$. As the figures show, as the number of splits increases with $t$, inference on $h_0$ starts to lose precision. This is natural; although the median improves geometrically on the rate in terms of $n_k$, as splits increase each subset posterior becomes noisier. However, near $t = 1/2$ the median performance for $\hat{h}$ is close to the performance when the entire sample is used ($t = 0$), as measured by $\hat{\gamma}_0$, $\hat{\gamma}_1$, $R^2$, with significant gains in computation time. Figure 2 shows computing time for inference on $h_0$ through the posterior median. There is a trade-off between sampling from a high-dimensional Gaussian process posterior of $n$ samples and a large number of data splits, which require almost equivalent computation power to sample. However, this can be mitigated when datasets are sampled or collected in a distributed manner. Results suggest that splits with $t \in [1/4, 1/2]$ decrease the computational burden significantly. On the other hand, Figures 1 and 2 and the theoretical results in Section 3 suggest that $t \le 1/2$ offers a good approximation to the full-data posterior. Taken together, they suggest that choosing $t \in [1/4, 1/2]$, with $t \to 1/2$ as $n$ increases, will optimize the trade-off between computation cost and precision. In addition, the mean squared error (MSE) of the estimated response $h$ is reported as Figure A4 in the supplementary, and leads to the same conclusion as Figures 1 and 2.
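The calibration regression used as the performance measure above can be computed with ordinary least squares. The sketch below is illustrative only; `calibration_summary` is a hypothetical helper name, and the inputs are toy vectors rather than posterior means from the simulation.

```python
import numpy as np

def calibration_summary(h_true, h_hat):
    """Regress the true h0 on the posterior mean h_hat; return (intercept, slope, R^2)."""
    h_true = np.asarray(h_true, dtype=float)
    h_hat = np.asarray(h_hat, dtype=float)
    A = np.column_stack([np.ones_like(h_hat), h_hat])      # design matrix [1, h_hat]
    coef, *_ = np.linalg.lstsq(A, h_true, rcond=None)
    fitted = A @ coef
    ss_res = np.sum((h_true - fitted) ** 2)
    ss_tot = np.sum((h_true - h_true.mean()) ** 2)
    return coef[0], coef[1], 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
h_true = rng.standard_normal(100)
perfect = calibration_summary(h_true, h_true)              # ideal estimate
biased = calibration_summary(h_true, 2.0 * h_true + 1.0)   # systematically distorted estimate
```

A perfect estimate gives intercept 0, slope 1 and $R^2 = 1$; the distorted estimate keeps $R^2 = 1$ but shifts the intercept and slope away from their ideal values, which is why all three summaries are reported.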

Figure 1:

Regression summary results for $h_0 = \gamma_0 + \gamma_1 \hat{h}$ across different sample sizes $n$ and data set splits. The number of subsets is set to $n^t$, as described above. We show (A) intercept: $\hat{\gamma}_0$, (B) slope: $\hat{\gamma}_1$.

Figure 2:

(A) Regression $R^2$ for $h_0 = \gamma_0 + \gamma_1 \hat{h}$ and (B) logarithmic runtime for the proposed method across different sample sizes $n$ and data set splits. The number of subsets is set to $n^t$, as described above.

To assess credible intervals for the estimation of $\hat{h}$, we report the mean coverage probability (MCP) of 95% credible intervals of $\hat{h}$, which is calculated as $\mathrm{MCP} = \frac{1}{U} \frac{1}{n} \sum_{u=1}^U \sum_{i=1}^n I\big(\hat{h}_i^{(u)}\big)$, where $I(\cdot)$ is the indicator function for whether $h_i$ falls within the estimated credible interval in the $u$th of $U$ simulation replicates. As shown in Table 1, for all the sample size settings, as the number of splits increases with $t$ near 1/2, the MCP remains close to the nominal 95%. Once $t > 1/2$, coverage deteriorates sharply. The result is consistent with the performance shown in Figure 1. Moreover, as $n$ increases the estimated standard deviations shrink, which amplifies the under-coverage for $t > 1/2$; in our largest-$n$ settings, MCP can approach zero under this regime.

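Table 1:

Mean Coverage Probability of 95% Credible Interval of $\hat{h}$

| n | t = 0 | t = 0.1 | t = 0.2 | t = 0.3 | t = 0.4 | t = 0.5 | t = 0.6 | t = 0.7 |
|---|---|---|---|---|---|---|---|---|
| 512 | 0.9628 | 0.9863 | 0.9808 | 0.9648 | 0.6015 | 0.3937 | 0.0937 | NA |
| 1024 | 0.9267 | 0.9218 | 0.9541 | 0.9414 | 0.8769 | 0.6943 | 0.0802 | 0.0003 |
| 2048 | 0.9877 | 0.9431 | 0.9609 | 0.9897 | 0.8442 | 0.8334 | 0.0522 | 0.0004 |
| 4096 | 0.9652 | 0.9522 | 0.9248 | 0.9609 | 0.9306 | 0.9324 | 0.0576 | 0.0000 |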
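The mean coverage probability defined above is a simple average of coverage indicators over replicates and observations. The following sketch shows the computation on toy interval bounds; the function name and the array layout (replicates in rows, observations in columns) are assumptions made for illustration.

```python
import numpy as np

def mean_coverage_probability(h_true, lower, upper):
    """Fraction of (replicate, observation) pairs whose interval contains the true h_i.

    h_true: shape (n,); lower and upper: shape (U, n) interval bounds per replicate.
    """
    h_true = np.asarray(h_true, dtype=float)
    covered = (np.asarray(lower) <= h_true) & (h_true <= np.asarray(upper))
    return covered.mean()

h_true = np.array([0.0, 1.0, 2.0])
lower = np.array([[-1.0, 0.0, 3.0],
                  [-1.0, 2.0, 1.0]])
upper = np.array([[1.0, 2.0, 4.0],
                  [1.0, 3.0, 3.0]])
mcp = mean_coverage_probability(h_true, lower, upper)   # 4 of the 6 intervals cover
```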

Variable selection performance of our method was also evaluated with the mean posterior inclusion probabilities (PIPs). We provide the mean PIPs for X3 and X4 across sample sizes and split exponents $t$ as Table A2 in the supplementary. The mean PIPs for both variables remain near zero when $t \le 1/2$. When $t > 1/2$, the precision of PIPs decreases. The same qualitative pattern holds for X1 and X2: PIPs are exactly 1 when $t$ is small, and they decline modestly to around 0.8 once $t > 1/2$.

The performance of the method under a more realistic exposure setting where exposures are highly correlated is reported in the supplementary material.

4.2. Assessing Performance Across Heterogeneous Subsetting Scheme

Another concern about the algorithm's consistency is whether the subsetting scheme affects performance, and specifically how heterogeneous splitting affects estimation accuracy. To examine variability in subset sample sizes, we assess model performance via the following additional simulation. We considered the same simulation setting as Section 4.1, with a total number of samples $n = 2048$. The number of splits $K = n^t$ is fixed with $t \in \{0.1, \ldots, 0.4\}$. Instead of randomly partitioning the sample evenly into $K$ subsets, we considered the sample size for each subset to be randomly sampled from $[(1 - \Delta) n/K, (1 + \Delta) n/K]$, with $\Delta \in \{0, 0.1, \ldots, 0.7\}$. Each simulation setting is replicated 300 times. We report the MSE of the posterior mean $\hat{h}$ in Figure A5 in the supplementary. As the figure shows, when the number of splits is relatively small, inference on $h_0$ keeps the same level of precision as variability is added to the subset sample sizes. The performance is mainly affected by the smallest subset, and it can be seen that when Δ>0.7 and $t = 0.4$, estimation accuracy decreases significantly. We thus still recommend ensuring that the smallest subset has a sample size of at least $n^{0.5}$. A more extreme uneven partition scenario is investigated and reported in the supplementary material.
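One way to generate such uneven partitions is sketched below. The source does not say how the drawn subset sizes are made to sum to $n$, so the rescaling and largest-remainder rounding used here are assumptions, as are the function names and the rounding of $K = n^t$ to the nearest integer.

```python
import numpy as np

def heterogeneous_sizes(n, K, delta, rng):
    """Draw K subset sizes around n/K with relative spread delta, summing exactly to n."""
    raw = rng.uniform((1 - delta) * n / K, (1 + delta) * n / K, size=K)
    scaled = raw * n / raw.sum()                 # assumption: rescale so sizes sum to n
    sizes = np.floor(scaled).astype(int)
    deficit = n - sizes.sum()                    # units lost to flooring
    order = np.argsort(scaled - sizes)[::-1]     # largest fractional parts first
    sizes[order[:deficit]] += 1
    return sizes

def partition_indices(n, sizes, rng):
    """Randomly assign the n observations to disjoint subsets of the given sizes."""
    perm = rng.permutation(n)
    return np.split(perm, np.cumsum(sizes)[:-1])

rng = np.random.default_rng(0)
n, t = 2048, 0.2
K = int(round(n ** t))                           # assumption: round n**t to an integer
sizes_even = heterogeneous_sizes(n, K, 0.0, rng)
sizes_uneven = heterogeneous_sizes(n, K, 0.5, rng)
subsets = partition_indices(n, sizes_uneven, rng)
```

After rescaling, individual sizes can fall slightly outside the nominal interval; monitoring `sizes.min()` is a direct way to check the recommendation on the smallest subset.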

To examine variability in the subset exposure covariate distributions, the model's variable selection performance was assessed via the following additional simulation. We considered the same simulation setting as Section 4.1, with a total number of samples $n = 2048$. The number of splits $K = n^t$ is fixed with $t = 0.2$. To include scale heterogeneity for covariates, for the first $K/2$ subsets we randomly select one of the four exposure variables and shrink its marginal variance by a factor $c_v$, with $c_v \in \{0.1^2, 0.3^2, 0.5^2, 0.7^2\}$. For the remaining $K/2$ subsets, exposure variances are unchanged. Each simulation setting is replicated 300 times. The method's performance was compared with the traditional BKMR fitted to the whole dataset ($t = 0$). We report the mean overall PIPs and the mean across-subset PIP standard deviations (SD) for the exposure variables in Table 2. Across all scenarios, PIPs were highly stable for X1 and X2. While the mean PIPs and mean across-subset PIP SDs of the two noise exposure variables X3 and X4 increase as the level of covariate heterogeneity increases, they still present reliable variable selection behavior with conventional thresholds (e.g., 0.5). Overall, variable selection is thus robust to moderate subsetting-induced heterogeneity in exposure distributions when a reasonable number of subsets is used.

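Table 2:

Mean estimated PIP and, in parentheses, mean across-subset standard deviations of exposure variables across different shrinkage factors $c_v$ of exposure variance.

| Exposure | $c_v$ = 0.1, t = 0 | $c_v$ = 0.1, t = 0.2 | $c_v$ = 0.3, t = 0 | $c_v$ = 0.3, t = 0.2 | $c_v$ = 0.5, t = 0 | $c_v$ = 0.5, t = 0.2 | $c_v$ = 0.7, t = 0 | $c_v$ = 0.7, t = 0.2 | $c_v$ = 1, t = 0 | $c_v$ = 1, t = 0.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 1 (0.000) | 1 | 1 (0.000) | 1 | 1 (0.000) | 1 | 1 (0.000) | 1 | 1 (0.000) |
| X2 | 1 | 1 (0.000) | 1 | 1 (0.000) | 1 | 1 (0.000) | 1 | 1 (0.000) | 1 | 1 (0.000) |
| X3 | 0.088 | 0.291 (0.143) | 0.076 | 0.284 (0.125) | 0.022 | 0.182 (0.090) | 0.008 | 0.159 (0.076) | 0.004 | 0.006 (0.046) |
| X4 | 0.084 | 0.251 (0.127) | 0.042 | 0.173 (0.094) | 0.012 | 0.112 (0.075) | 0.008 | 0.055 (0.053) | 0.004 | 0.004 (0.024) |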

5. Application: Major Particulate Matter Constituents and Greenspace on Birthweight in Massachusetts

To further evaluate our method on a real data set, we considered data from a study of major particulate matter constituents, greenspace and birthweight. The data consisted of the outcome, exposure and confounder information from 907,766 newborns in Massachusetts born from January 2001 through 31 December 2012 (Fong et al., 2019). After excluding records with missing data, there were $n = 685{,}857$ observations used for analysis. As exposures, we included exposure data averaged over the pregnancy period on address-specific normalized Ozone, PM2.5, EC, OC, nitrate, and sulfate, as well as the normalized difference vegetation index (NDVI), in the nonparametric part of the model, and confounders including maternal characteristics in the linear component of the model. These pollutant exposures were estimated using a high-resolution exposure model based on remote-sensing satellite data, land use and meteorologic variables (Di et al., 2016). We randomly split the sample into $K = 686$ splits (using $t \approx 1/2$), each containing 1000 samples. For each split, we ran the MCMC sampler for 1,000 iterations after 1,000 burn-in iterations, and every fifth sample was kept for further inference; thus we retain $N = 200$ posterior samples for each split. Further details on the confounders included in the analysis can be found in the Supplementary Material. In addition, the correlation matrix for these data is shown in Figure A1 in the Supplementary Material.

Figure 3 shows the estimated exposure-response relationships between the mean outcome and each exposure, with all other exposures fixed at their median values. The results suggest that PM2.5, EC, and OC exposure levels are negatively associated with mean birthweight. In contrast, Ozone, nitrate, and NDVI levels are positively associated with mean birthweight, and there is no association between birthweight and maternal exposure to sulfate. Among the negatively associated constituents, EC and the remaining PM2.5 constituents have stronger linear negative associations than OC. Among the positive associations, NDVI and Ozone appear to have strong linear relationships with birthweight. For nitrate, however, the association is positive when its concentration is below +1 standard deviation and negative when its concentration is more than 1 standard deviation above the mean. This suggests the possibility of effect modification.
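A univariate exposure-response curve of this kind is typically obtained by evaluating the fitted surface on a grid in which one exposure varies while the others are held at their medians. The sketch below shows that construction under loud assumptions: the exposure matrix `Z` is synthetic, `h_toy` is a hypothetical surface, and `h_draws` is simulated noise standing in for posterior draws of h on the grid. None of these come from the paper or its software.

```python
import numpy as np

def univariate_grid(Z, j, num=50):
    """Vary exposure j over its observed range; hold the others at their medians."""
    grid = np.tile(np.median(Z, axis=0), (num, 1))
    grid[:, j] = np.linspace(Z[:, j].min(), Z[:, j].max(), num)
    return grid

rng = np.random.default_rng(1)
Z = rng.standard_normal((500, 7))  # synthetic stand-in for 7 normalized exposures

def h_toy(z):
    """Hypothetical exposure-response surface, for illustration only."""
    return -10.0 * z[:, 0] + 5.0 * z[:, 1] ** 2

grid = univariate_grid(Z, j=0)

# Stand-in for posterior draws of h on the grid: shape (draws, grid points).
h_draws = h_toy(grid) + rng.normal(scale=2.0, size=(200, grid.shape[0]))

curve = h_draws.mean(axis=0)                        # pointwise estimate
band = np.percentile(h_draws, [2.5, 97.5], axis=0)  # pointwise 95% band
```

Plotting `curve` and `band` against `grid[:, 0]` gives one panel of a figure in this style; repeating over `j` gives the remaining panels.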

Figure 3:

Univariate estimated effects on birthweight per standard deviation increase in PM2.5, EC, OC, nitrate, sulfate, NDVI, and Ozone. The 95% confidence bands of the estimates are shown in gray. All other mixture components are fixed at their 50th percentile when investigating the effect of a single mixture component on birthweight. We show h(Z): difference in birthweight, in grams, relative to the mean birthweight of the sample; Pollutant(Z): change in each major constituent, measured in standard deviations of that constituent.

Figure 4 investigates the bivariate relationship between pairs of exposures and birthweight, with the other exposures fixed at their median levels. The results suggest varying degrees of nonlinearity in the relationships between constituent concentrations and birthweight. Unlike the pattern for sulfate shown in Figure 3, there is a strong inverted U-shaped relationship between sulfate and mean birthweight when the nitrate concentration is around −1 standard deviation. A similar relationship is visible between nitrate and mean birthweight when the sulfate concentration is higher than +0.5 standard deviation. Moreover, PM2.5 shows no association with birthweight when its concentration is below 0 standard deviations and the sulfate concentration is below −1 standard deviation.

Figure 4:

Bivariate estimated effects on birthweight per standard deviation increase between PM2.5, EC, OC, nitrate, sulfate, NDVI, and Ozone. All other mixture components are fixed at their 50th percentile when investigating bivariate mixture effects on birthweight. We show h_{z_i,z_j}: difference in birthweight, in grams, relative to the mean birthweight of the sample; Pollutant z_i and Pollutant z_j: change in each major constituent, measured in standard deviations of that constituent.

6. Discussion

As industry and governments invest in new technologies that ameliorate their environmental and pollution impacts, the need to quantify the effects of environmental mixtures on health is of increasing interest. In parallel, electronic data registries such as the Massachusetts birth registry are increasingly being used in environmental health studies. These rich data sets allow measuring small, potentially nonlinear effects of pollutant mixtures that impact public health. To the best of our knowledge, we propose the first semi-parametric Gaussian process regression framework that can be used to estimate effects using large datasets. In particular, we model the exposure-health outcome surface with a Gaussian process, and our applied model utilizes priors that allow for feature selection. Additionally, we use a linear component to incorporate confounder effects. Previous approaches to similar analyses had to either assume a parametric relationship or use a single pollutant per regression to estimate the effects of interest (Fong et al., 2019).

To ameliorate the computational burden of computing the Bayesian posteriors of the Gaussian process, we propose a divide-and-conquer approach. Our method consists of splitting samples into subsets, computing the posterior distribution for each data split, and then combining the samples using a generalized median based on the order 1 Wasserstein distance (Minsker et al., 2017).
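For intuition about the combining step, the generalized median has a simple closed form in one special case. For a scalar quantity, the order 1 Wasserstein distance between two distributions equals the L1 distance between their quantile functions, so the distribution minimizing the summed distance to K subset posteriors has, at each quantile level, the median of the K subset quantiles. The sketch below applies this to simulated draws. It is a one-dimensional illustration under these assumptions, not the authors' implementation, which must handle posterior draws of the full model and may compute the median differently; the function names are hypothetical.

```python
import numpy as np

def w1_distance_1d(x, y):
    """W1 distance between two equal-size samples of a scalar quantity."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def w1_median_1d(subset_draws):
    """W1 generalized median of K empirical subset posteriors on the real line.

    subset_draws has shape (K, N): N posterior draws from each of K subsets.
    Returns the N equally weighted atoms of the median distribution.
    """
    quantiles = np.sort(subset_draws, axis=1)  # row k: quantiles of subset k
    return np.median(quantiles, axis=0)        # median at each quantile level

# Simulated subset posteriors: six agree, one is corrupted (assumption).
rng = np.random.default_rng(0)
K, N = 7, 200
draws = rng.normal(loc=0.0, scale=1.0, size=(K, N))
draws[-1] += 50.0

median_atoms = w1_median_1d(draws)
mean_atoms = np.sort(draws, axis=1).mean(axis=0)  # naive averaging, for contrast

print(median_atoms.mean(), mean_atoms.mean())
print(sum(w1_distance_1d(median_atoms, d) for d in draws))
```

In this toy setting the median stays close to the six agreeing subset posteriors, whereas averaging the sorted draws is pulled toward the corrupted subset, which is the robustness property that motivates using a median to combine subset posteriors.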

We tailor the method to incorporate a squared exponential kernel and provide theoretical guarantees for the convergence of this foundational Gaussian process component, which validates the soundness of our divide-and-conquer strategy for the non-parametric function. Our convergence results accommodate different assumptions for the underlying space of the true feature-response function. We provide theoretical and empirical results which illustrate a trade-off for the optimal number of splits. As the number of data splits increases, the posterior computation of the small data subsets will be faster; however, these posteriors will be noisy. In other words, there is a tension between computational cost and obtaining precise estimates. We propose using K=n^{1/2} sample splits to efficiently approximate the posterior in a relatively short time.
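The computational side of this trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that the dominant cost of each fit is cubic in the number of observations, which is the usual cost of exact Gaussian process computations but is an assumption here rather than a statement from this section. Under that assumption, K subset fits of size n/K cost K(n/K)^3 in total, a fraction 1/K^2 of one full-sample fit. The helper names are hypothetical, and the calculation says nothing about the statistical side of the trade-off (noisier subset posteriors).

```python
import math

def sqrt_split(n):
    """Return (K, m): K = n^(1/2) rounded to an integer, m = largest subset size."""
    k = round(math.sqrt(n))
    m = math.ceil(n / k)
    return k, m

def relative_cost(k):
    """Total cost of K subset fits relative to one full fit, assuming cubic cost.

    K * (n / K)^3 / n^3 = 1 / K^2
    """
    return 1.0 / k**2

n = 685_857                # sample size of the birthweight application
k, m = sqrt_split(n)
print(k, m)                # square-root rule for this n
print(relative_cost(k))    # total work relative to a single full-sample fit
print(relative_cost(686))  # same ratio for the K used in the application
```

If the subsets are fitted in parallel, the wall-clock cost is further divided by the number of workers, up to K.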

To illustrate the benefit of our method, we analyzed the association between a mixture of pollution constituents and greenspace and birthweights in the Massachusetts birth registry. To our knowledge, a Gaussian process regression analysis, commonly known as Bayesian kernel machine regression in the environmental literature, of the Massachusetts birth registry data would not have been possible using existing fitting algorithms. Our analysis found the strongest adverse associations for traffic-related particles and for PM2.5 not accounted for by the other pollutants. We observed the strongest positive associations with birthweight for Ozone and greenspace. A possible explanation for the Ozone finding is that Ozone is typically negatively correlated with NO2, another traffic pollutant, which was not included in this analysis. The greenspace finding is consistent with other analyses showing potentially beneficial effects of maternal exposure to greenness.

Our work also has some theoretical limitations. While our primary focus was on the computational scalability and the estimation of the nonlinear function h_0, we did not include an analysis of the posterior distributions for the variable selection parameters (r_j). A full validation of the distributed model’s variable selection properties is an important next step, but was beyond the scope of this paper, which is focused on solving the computational bottleneck of the GP component. In terms of applications, future efforts will also explore other multi-pollutant analyses both in this dataset and in other large-scale birth registry databases.

Supplementary Material

Web Appendices, Tables, and Figures referenced in Sections 3, 4, and 5, are available with this paper at the Biometrics website on Oxford Academic. The simulation code is available online, and the corresponding package is available at https://github.com/junwei-lu/fbkmr.

Acknowledgments

The first two authors contributed equally to this work. We thank Dr Luke Duttweiler for his work that improved the overall quality of the paper. We thank the referees and the associate editor for their constructive comments.

Funding

Brent Coull was supported by the National Institutes of Health (NIH) under grant R01ES035735. Joel Schwartz was supported by the NIH under grant R01ES032418. Junwei Lu was supported by the National Science Foundation (NSF) under grant DMS-2434664, the William F. Milton Fund, and NIH R01ES032418.

Contributor Information

Aaron Sonabend-W, Google Research, 1600 Amphitheatre Parkway, Mountain View, U.S.A.

Jiangshan Zhang, Department of Statistics, University of California, Davis, One Shields Avenue, Davis, U.S.A.

Edgar Castro, Department of Environmental Health, Harvard University, 677 Huntington Ave, Boston, U.S.A.

Joel Schwartz, Department of Environmental Health, Harvard University, 677 Huntington Ave, Boston, U.S.A.

Brent A. Coull, Department of Biostatistics, Harvard University, 677 Huntington Ave, Boston, U.S.A.

Junwei Lu, Department of Biostatistics, Harvard University, 677 Huntington Ave, Boston, U.S.A.

Data Availability Statement

Due to data privacy, we share only the source code, which is available in the GitHub repository at https://github.com/junwei-lu/fbkmr. Access to the real data is available from the author, Joel Schwartz, upon reasonable request at jschwrtz@hsph.harvard.edu.

References

  1. Billionnet C, Sherrill D, and Annesi-Maesano I (2012). Estimating the health effects of exposure to multi-pollutant mixture. Annals of epidemiology 22, 126–141.
  2. Bobb JF, Valeri L, Claus Henn B, Christiani DC, Wright RO, Mazumdar M, Godleski JJ, and Coull BA (2015). Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. Biostatistics 16, 493–508.
  3. Bose P, Maheshwari A, and Morin P (2003). Fast approximations for sums of distances, clustering and the Fermat-Weber problem. Comput. Geom 24, 135–146.
  4. Cardot H, Cénac P, and Zitt P-A (2013). Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli (Andover.) 19, 18–43.
  5. Coull BA, Bobb JF, Wellenius GA, Kioumourtzoglou M-A, Mittleman MA, Koutrakis P, and Godleski JJ (2015). Part 1. Statistical learning methods for the effects of multiple air pollution constituents. Research report - Health Effects Institute; page 5.
  6. Di Q, Koutrakis P, and Schwartz J (2016). A hybrid prediction model for PM2.5 mass and components using a chemical transport model and land use regression. Atmospheric environment 131, 390–399.
  7. Fong KC, Di Q, Kloog I, Laden F, Coull BA, Koutrakis P, and Schwartz JD (2019). Relative toxicities of major particulate matter constituents on birthweight in Massachusetts. Environmental epidemiology 3, e047.
  8. Fong KC, Kosheleva A, Kloog I, Koutrakis P, Laden F, Coull BA, and Schwartz JD (2019). Fine particulate air pollution and birthweight: Differences in associations along the birthweight distribution. Epidemiology (Cambridge, Mass.) 30, 617–623.
  9. Gaskins AJ, Mínguez-Alarcón L, Fong KC, Abu Awad Y, Di Q, Chavarro JE, Ford JB, Coull BA, Schwartz J, Kloog I, Attaman J, Hauser R, and Laden F (2019). Supplemental folate and the relationship between traffic-related air pollution and livebirth among women undergoing assisted reproduction. American journal of epidemiology 188, 1595–1604.
  10. Joubert BR, Kioumourtzoglou M-A, Chamberlain T, Chen HY, Gennings C, Turyk ME, Miranda ML, Webster TF, Ensor KB, Dunson DB, and Coull BA (2022). Powering research through innovative methods for mixtures in epidemiology (PRIME) program: Novel and expanded statistical methods. International Journal of Environmental Research and Public Health 19.
  11. Liu D, Lin X, and Ghosh D (2007). Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics 63, 1079–1088.
  12. Minsker S (2015). Geometric median and robust estimation in Banach spaces. Bernoulli : official journal of the Bernoulli Society for Mathematical Statistics and Probability 21, 2308–2335.
  13. Minsker S, Srivastava S, Lin L, and Dunson D (2017). Robust and scalable Bayes via a median of subset posterior measures. Journal Of Machine Learning Research 18.
  14. Schmidhuber J (2015). Deep learning in neural networks: An overview. Neural networks 61, 85–117.
  15. Scornet E, Biau G, and Vert J-P (2015). Consistency of random forests. The Annals of statistics 43, 1716–1741.
  16. Srivastava S, Cevher V, Tran Dinh Q, and Dunson DB (2015). WASP: Scalable Bayes via barycenters of subset posteriors. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics 38, 912–920.
  17. Stieb DM, Chen L, Eshoul M, and Judek S (2012). Ambient air pollution, birth weight and preterm birth: A systematic review and meta-analysis. Environmental research 117, 100–111.
  18. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, Methodological 58, 267–288.
  19. Vaart AW (1996). Weak Convergence and Empirical Processes : With Applications to Statistics. Springer Series in Statistics. Springer New York : Imprint: Springer, New York, NY.
  20. van der Vaart AW and van Zanten JH (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. The Annals of Statistics 37, 2655–2675.
  21. Williams CKI and Rasmussen CE (2019). Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning series. The MIT Press.
  22. Yu L, Liu W, Wang X, Ye Z, Tan Q, Qiu W, Nie X, Li M, Wang B, and Chen W (2022). A review of practical statistical methods used in epidemiological studies to estimate the health effects of multi-pollutant mixture. Environmental Pollution 306, 119356.
