ABSTRACT
Environmental epidemiologic studies routinely utilize aggregate health outcomes to estimate effects of short-term (eg, daily) exposures that are available at increasingly fine spatial resolutions. However, areal averages are typically used to derive population-level exposure, which cannot capture the spatial variation and individual heterogeneity in exposures that may occur within the spatial and temporal unit of interest (eg, within a day or ZIP code). We propose a general modeling approach to incorporate within-unit exposure heterogeneity in health analyses via exposure quantile functions. Furthermore, by viewing the exposure quantile function as a functional covariate, our approach provides additional flexibility in characterizing associations at different quantile levels. We apply the proposed approach to an analysis of air pollution and emergency department (ED) visits in Atlanta over 4 years. The analysis utilizes daily ZIP code-level distributions of personal exposures to 4 traffic-related ambient air pollutants simulated from the Stochastic Human Exposure and Dose Simulator. Our analyses find that effects of carbon monoxide on respiratory and cardiovascular disease ED visits are more pronounced with changes in lower quantiles of the population’s exposure. Software for implementation is provided in the R package nbRegQF.
Keywords: air pollution, Bayesian hierarchical modeling, functional data analysis, quantile process
1. INTRODUCTION
Environmental epidemiological studies routinely utilize aggregate health data to assess short-term health effects of environmental exposures. For example, citywide daily counts of deaths, emergency department (ED) visits, and preterm births have been linked to short-term exposure to air pollution and extreme temperature (Alhanti et al., 2016; Guo et al., 2017; Bekkar et al., 2020; Yoo et al., 2021). The use of aggregate health data represents an ecological design, where the true exposure corresponds to the distribution of individual-level exposure across the at-risk population (Richardson and Best, 2003; Sheppard, 2003). However, individual environmental exposures cannot be practically measured in large population-based studies. Instead, exposure surrogates that reflect summaries of individual-level exposure (eg, the mean) are used. This can result in exposure misclassification that leads to biases and incorrect characterization of uncertainties associated with the health effect estimates (Dominici et al., 2000; Richmond-Bryant and Long, 2020).
In the context of air pollution, ambient pollutant levels are frequently the exposure of interest so that regulatory policies may be developed. Within short-term timescales of interest, there can be considerable exposure heterogeneity within the population since (1) air pollutants can exhibit spatial variation and (2) individuals spend most of their time in different indoor environments, each with unique outdoor to indoor pollution infiltration characteristics. To address the first challenge, spatial-temporal models have been developed to estimate pollutant concentrations at fine spatial resolutions that can provide complete spatial-temporal coverage (Jerrett et al., 2001; 2005). In contrast, less work has been done to address the second challenge. With the development of wearable devices, personal exposure to environmental contaminants can be more easily measured (Steinle et al., 2015; Sugg et al., 2018). While the cost is prohibitive for population-based epidemiological studies, findings from small exposure assessment studies have been used to develop probabilistic models to simulate population-level distributions of personal exposures. Examples of probabilistic models include pCNEM and the Stochastic Human Exposure and Dose Simulation (SHEDS) (Burke et al., 2001; Zidek et al., 2005).
With simulated personal exposures from such probabilistic models, previous studies of aggregate health outcomes have used exposure distribution summary statistics (eg, daily mean or median) as the covariate in the health model (Calder et al., 2008; Chang et al., 2012; Sarnat et al., 2013). However, no within-unit exposure heterogeneity is considered when the daily mean of personal exposures is used. Within-unit exposure heterogeneity may be important for understanding the association between the exposures and the aggregate health outcome (Richardson and Best, 2003). For example, in the special case where personal exposures are normally distributed, inclusion of exposure variance in the health model can result in unbiased estimates (Sheppard, 2003; Reich et al., 2009) compared to solely including mean exposure. However, the population distribution of environmental exposure is often skewed and poorly approximated by a normal distribution (Leiva et al., 2008; Huang et al., 2018). Furthermore, all previous methods using population summary statistics implicitly assume that the health effect of air pollution can be entirely characterized by the summary statistic selected. In other words, health risks only depend on changes in that selected summary statistic. However, health risks may also depend on changes in the exposure distribution that cannot be reflected by changes in the selected summary statistic. For example, consider a scenario where a short-term increase in air pollution may be more detrimental to individuals who are typically exposed to lower pollution levels. In such a scenario, an increase in the lower tail of population-level exposure results in larger increases in the risk of adverse health events, and solely including the population-level mean exposure is insufficient to fully characterize the exposure-response relationship.
In this work, we propose a general modeling framework to incorporate within-unit population-level exposure heterogeneity via exposure quantile functions. Instead of only using unit-level population-average or variance, exposure quantile functions comprehensively summarize the entire within-unit exposure distribution throughout the study. In our framework, the exposure quantile function is viewed as a functional covariate with respect to quantile levels. Therefore, we further allow effects of exposure at different quantile levels to vary. Estimation and inference are carried out under a Bayesian hierarchical modeling framework that also propagates uncertainties associated with the estimation of exposure quantile functions into the health effect estimate.
Our proposed scalar-on-quantile-function approach is in the same spirit as scalar-on-function regression in the functional data analysis literature, which regresses a scalar outcome against functional covariates. In this literature, the functional covariate defined over a continuous domain is often measured over time and/or space (Brockhaus et al., 2015; Morris, 2015). The proposed modeling framework is novel in that the scalar outcome is an aggregate count, while we treat the quantile function of environmental exposures that reflects within-unit heterogeneity as the functional covariate. Compared to functions defined over time and/or space, the quantile function is restricted to be nondecreasing. To ensure this property, we propose a 2-stage Bayesian estimation procedure. At the first stage, exposure quantile functions are estimated using an existing semiparametric Bayesian method (Reich, 2012). The focus of the first-stage estimation also differs from typical quantile function regression in the functional data analysis context (Li et al., 2022; Yang et al., 2020). Specifically, instead of studying prespecified percentiles conditional on a set of covariates (eg, demographic variables), we focus on obtaining continuous and smooth quantile functions to serve as a comprehensive summary of unit-specific exposure distributions based on multiple exposures collected for each unit. Those estimated continuous and smooth quantile functions are then treated as functional covariates to study health risks associated with changes in exposure distributions.
2. MOTIVATING DATA AND APPLICATION
Ambient air pollution exposure has been identified as a risk factor for various diseases (Landrigan, 2017), contributing significantly to global disease burden (Boogaard et al., 2019). Here, we studied short-term associations between daily ED visits and ambient air pollution exposures in a time-series design (Bhaskaran et al., 2013). We obtained daily counts of ED records during the period January 1, 1999 to December 31, 2002 in Atlanta. Counts were further stratified by ZIP code tabulation area (ZCTA), of which there were 40. We analyzed 3 causes for ED visits: (1) respiratory disease, (2) a subset of respiratory diseases, which only includes asthma or wheeze, and (3) cardiovascular disease. The International Classification of Diseases 9th Revision (ICD-9) diagnosis codes used for identifying those ED visits can be found in Table S1.
We examined 4 traffic-related air pollutants: particulate matter with aerodynamic diameter $\leq 2.5$ microns (PM$_{2.5}$), carbon monoxide (CO), nitrogen oxides (NOx), and elemental carbon (EC), a constituent of PM$_{2.5}$. Separately for each pollutant, population-level distributions of personal exposure were obtained from the SHEDS model (Burke et al., 2001). This model is a stochastic simulator producing daily personal exposure at the census tract level (Jenkins, 1996; Özkaynak et al., 1996; Dionisio et al., 2013). To estimate the personal exposure distributions of ambient concentrations of an air pollutant, the SHEDS model first simulates exposures for multiple hypothetical individuals for each census tract that reflect the demographic characteristics (eg, age, sex, work locations) of the at-risk population using census data. The amount of time each hypothetical individual spends in various microenvironments is obtained by randomly assigning an activity diary from the US Environmental Protection Agency (EPA)’s Consolidated Human Activity Database based on their demographics. Their daily personal exposure is then computed by summing time-weighted average exposure across all 13 microenvironments, which are categorized into 4 types (outdoors, vehicle, residential indoors, and nonresidential indoors microenvironments). For this analysis, the personal exposure distributions at the census tract level were then aggregated to the ZCTA level.
For the health analysis, several meteorological variables were obtained from Daymet to account for potential confounding by meteorology (Thornton et al., 2016). These include daily minimum temperature, maximum temperature, and dew-point temperature.
3. METHODS
For studies focusing on aggregate outcomes, the conventional model commonly includes the mean of the group-specific exposures as the covariate. A group can be formed by geographical areas or a time interval. In this work, we propose a model treating exposure quantile functions as functional covariates to capture effects of the entire exposure distribution and to allow effects to vary by quantile levels.
3.1. Model for count outcome and exposure distribution
Let $Y_i$ denote the number of events (eg, ED visits, hospital admissions, deaths) observed for group $i$, and let $\mathbf{X}_i = (X_{i1}, \ldots, X_{in_i})^\top$ denote a vector of exposures (eg, personal exposures to air pollution) collected from $n_i$ individuals for group $i$. The proposed scalar-on-quantile-function over-dispersed Poisson log-linear regression model (ie, negative-binomial regression model) is given as

$$Y_i \sim \text{NB}(\xi, p_i), \qquad \text{logit}(p_i) = \eta_i = \int_0^1 \beta(\tau)\, Q_i(\tau)\, d\tau + \mathbf{Z}_i^\top \boldsymbol{\gamma} + \epsilon_i, \tag{1}$$

where $Q_i(\tau)$, $\tau \in [0, 1]$, denotes the exposure quantile function of a continuous exposure for group $i$, which can be obtained based on $\mathbf{X}_i$; $\beta(\tau)$ represents the effect of the exposure’s $100\tau$th percentile on the mean of the health outcome; $\mathbf{Z}_i$ is a vector of other covariates with regression coefficients $\boldsymbol{\gamma}$ (including an intercept); $\xi$ controls the amount of over-dispersion; and $\epsilon_i$ represents a mean-zero spatial/temporal residual process. Using this parametrization, the log of the expectation of $Y_i$ equals $\eta_i + \log \xi$. We note one special case where effects of exposure are the same at different quantile levels (ie, $\beta(\tau)$ is a constant in $\tau$). Then, the proposed model (1) will reduce to the conventional model because $\int_0^1 Q_i(\tau)\, d\tau$ represents the mean exposure.
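The reduction to the conventional model rests on the fact that a quantile function integrated over (0, 1) gives the mean. The following minimal sketch checks this numerically for an empirical quantile function and contrasts a constant coefficient function with a decreasing one that has the same integral. The function names, the toy sample, and the midpoint-rule integration are illustrative assumptions and are not taken from nbRegQF.

```python
import math

def empirical_quantile(sample):
    """Return the left-continuous empirical quantile function of a sample."""
    xs = sorted(sample)
    n = len(xs)

    def q(tau):
        # Index of the smallest order statistic whose empirical CDF is >= tau.
        idx = min(n - 1, max(0, math.ceil(tau * n) - 1))
        return xs[idx]

    return q

def functional_term(beta, q, grid_size=1000):
    """Midpoint-rule approximation of the integral of beta(tau) * Q(tau) over (0, 1)."""
    h = 1.0 / grid_size
    return sum(beta((g + 0.5) * h) * q((g + 0.5) * h) for g in range(grid_size)) * h

# Hypothetical personal exposures for one group (right skewed).
exposures = [0.5, 1.2, 2.0, 3.1, 7.4]
q = empirical_quantile(exposures)
mean_exposure = sum(exposures) / len(exposures)

# Constant coefficient function: the term collapses to 0.5 * mean exposure.
constant_effect = functional_term(lambda tau: 0.5, q)

# Decreasing coefficient function with the same integral (0.5) that
# puts more weight on the lower quantiles.
lower_tail_effect = functional_term(lambda tau: 1.0 - tau, q)
```

Both coefficient functions imply the same effect for a uniform 1-unit shift of the distribution, yet they produce different linear-predictor contributions for this skewed sample, which is the kind of heterogeneity a mean-only covariate cannot represent.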
To flexibly characterize the association between health outcome and the exposure, the coefficient $\beta(\tau)$ is assumed to be a smooth function of quantile levels and is modeled via a finite number of basis functions. Specifically, $\beta(\tau)$ is specified as

$$\beta(\tau) = \sum_{k=0}^{p} \theta_k B_k(\tau), \tag{2}$$

where $B_k(\tau)$, $k = 0, \ldots, p$, denotes the orthonormal Bernstein polynomials of degree $p$ defined over the interval $[0, 1]$ (Bellucci, 2014). With this basis expansion, the linear predictor $\eta_i$ of the proposed model in Equation 1 can be written as

$$\eta_i = \boldsymbol{\theta}^\top \int_0^1 \mathbf{B}(\tau)\, Q_i(\tau)\, d\tau + \mathbf{Z}_i^\top \boldsymbol{\gamma} + \epsilon_i, \tag{3}$$

where $\boldsymbol{\theta} = (\theta_0, \ldots, \theta_p)^\top$ is a vector of basis coefficients and $\mathbf{B}(\tau) = (B_0(\tau), \ldots, B_p(\tau))^\top$ is a vector of basis functions. We note that the domain of Bernstein polynomials coincides with the domain of exposure quantile functions, which could facilitate the estimation of $\beta(\tau)$, compared to B-splines or Gaussian kernels. The Bernstein polynomials are chosen also because they have been shown to accurately approximate various smooth function forms with a small number of basis functions (Bellucci, 2014).
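To make the basis expansion concrete, the sketch below builds the covariate vector that multiplies the basis coefficients, that is, the integral of each basis function times the quantile function. As a deliberate simplification it uses ordinary (non-orthonormal) Bernstein polynomials rather than the orthonormal version of Bellucci (2014); the quantile function `q`, the coefficient values, and all function names are hypothetical.

```python
from math import comb

def bernstein_basis(tau, degree):
    """Ordinary Bernstein polynomials of the given degree evaluated at tau in [0, 1]."""
    return [comb(degree, k) * tau**k * (1.0 - tau)**(degree - k)
            for k in range(degree + 1)]

def basis_covariates(q, degree, grid_size=2000):
    """Midpoint-rule approximation of the integrals of B_k(tau) * Q(tau) over (0, 1)."""
    h = 1.0 / grid_size
    covariates = [0.0] * (degree + 1)
    for g in range(grid_size):
        tau = (g + 0.5) * h
        q_tau = q(tau)
        for k, b in enumerate(bernstein_basis(tau, degree)):
            covariates[k] += b * q_tau * h
    return covariates

def q(tau):
    """Hypothetical nondecreasing quantile function with mean 1 + 2/3."""
    return 1.0 + 2.0 * tau**2

# Degree 2 gives 3 basis functions, hence 3 scalar covariates per group.
covariates = basis_covariates(q, degree=2)

# Hypothetical basis coefficients; the functional term is a plain dot product.
theta = [0.8, 0.5, 0.2]
functional_term = sum(t * c for t, c in zip(theta, covariates))
```

Because ordinary Bernstein polynomials sum to 1 at every quantile level, these covariates add up to the mean exposure, so setting all coefficients equal recovers the mean model in this simplified basis.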
3.2. Estimation and inference
3.2.1. Quantile functions are known
In studies that cover a large geographic area and/or a long time period, one might assume exposure quantile functions are known to facilitate estimation and inference. For example, parametric distributions with parameters estimated from observed exposure data can be used to approximate exposure distributions. With known quantile functions, Equation 3 can be reparametrized as a regular scalar-on-scalar model with a covariate vector $\int_0^1 \mathbf{B}(\tau)\, Q_i(\tau)\, d\tau$, where $\mathbf{B}(\tau)$ is the vector of basis functions and $Q_i(\tau)$ is the exposure quantile function for group $i$. An efficient fully Bayesian inference procedure is available for the over-dispersed Poisson regression by introducing latent Polya-Gamma random variables (Polson et al., 2013; Neelon, 2019).
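The Polya-Gamma approach works with a negative-binomial likelihood whose probability parameter is tied to the linear predictor through a logit link. The sketch below assumes that parametrization (a common convention in the Polya-Gamma literature, which may differ in detail from the paper's Equation 1) and numerically confirms that the mean equals the over-dispersion parameter times the exponentiated linear predictor. The names `nb_log_pmf` and `nb_mean` and the parameter values are illustrative.

```python
from math import exp, lgamma, log

def nb_log_pmf(y, eta, xi):
    """Log pmf of NB(xi, p) with logit(p) = eta:
    Gamma(y + xi) / (Gamma(xi) * y!) * (1 - p)^xi * p^y."""
    log_p = eta - log(1.0 + exp(eta))      # log p
    log_1mp = -log(1.0 + exp(eta))         # log (1 - p)
    return (lgamma(y + xi) - lgamma(xi) - lgamma(y + 1)
            + xi * log_1mp + y * log_p)

def nb_mean(eta, xi, y_max=2000):
    """Mean obtained by summing y * pmf(y) over a long truncated support."""
    return sum(y * exp(nb_log_pmf(y, eta, xi)) for y in range(y_max + 1))

# Hypothetical linear predictor and over-dispersion parameter.
eta, xi = -1.0, 3.0
numerical_mean = nb_mean(eta, xi)

# Closed form under this parametrization: E[Y] = xi * exp(eta),
# so log E[Y] = eta + log(xi).
closed_form_mean = xi * exp(eta)
```

Under this assumed parametrization the logarithm of the over-dispersion parameter acts as an offset to the intercept, which is why regression coefficients keep a log-relative-risk interpretation.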
3.2.2. Quantile functions are unknown
For the more realistic case of unknown quantile functions, we propose a 2-stage Bayesian estimation procedure. In the first stage, the quantile function $Q_i(\tau)$ for each group $i$ is again modeled using basis expansion and estimated from individual-level exposures (eg, SHEDS simulations in our application). In the second stage, the health model (1) is fitted with estimated quantile functions while accounting for the statistical uncertainties associated with the first-stage estimation.
Following previous semiparametric Bayesian approaches for modeling quantile processes for continuous variables (Reich, 2012), the quantile function for exposures in group i is expanded as
![]() |
(4) |
where
is the lth basis function, and
are basis coefficients. Choices of basis function
include piecewise Gaussian or piecewise Gamma functions, and their expressions are provided in Supplementary Materials. Both choices permit us to flexibly characterize the potentially skewed distribution of exposures. However, for exposures that are strictly positive (eg, ambient air pollution concentration), piecewise Gamma functions are recommended. With the use of piecewise Gaussian or Gamma functions,
represents the median of the ith group exposure distribution and
for
characterize the shape of the distribution. The quantile function uniquely defines the density. Let
denote the exposure level measured for the jth individual within the ith group. When
is assumed to follow a distribution corresponding to a quantile function
, the likelihood for a set of individual-level exposures is given by
![]() |
(5) |
where
,
, and
are vectors of personal exposures collected for group i of
individuals. It is important to note that quantile functions have to be nondecreasing. To ensure this property,
should hold for any i and l (Reich, 2012). In the first-stage estimation, this constraint is imposed by introducing an unconstrained latent variable
. Specifically,
, where
is a small constant (eg, 0.01). For spatial or time-series data, one can easily introduce spatial or temporal dependence for
and
in the Bayesian hierarchical modeling framework to allow quantile processes to vary by time or locations. Estimation of all model parameters is carried out via Markov Chain Monte Carlo (MCMC) algorithms.
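The monotonicity constraint can be illustrated with a small sketch. It assumes the latent-variable construction takes a thresholding form, in which a coefficient equals its latent value when that value exceeds the small constant and equals the constant otherwise. It also swaps in simple piecewise-linear basis functions for the piecewise Gaussian or Gamma functions given in the Supplementary Materials. All names and values are hypothetical.

```python
def constrain(latent, eps=0.01):
    """Map unconstrained latent values to coefficients bounded below by eps (assumed form)."""
    return [v if v > eps else eps for v in latent]

def piecewise_linear_basis(tau, knots):
    """Stand-in basis: each function rises linearly on its own knot interval, flat elsewhere."""
    return [min(max(tau - lo, 0.0), hi - lo) for lo, hi in zip(knots[:-1], knots[1:])]

def quantile_function(tau, median, coefs, knots):
    """Quantile function centered so that Q(0.5) equals the median parameter."""
    basis = piecewise_linear_basis(tau, knots)
    basis_at_median = piecewise_linear_basis(0.5, knots)
    return median + sum(c * (b - m) for c, b, m in zip(coefs, basis, basis_at_median))

# Hypothetical latent values; the negative one is lifted to eps.
latent = [-0.3, 0.5, 1.2, 2.0]
coefs = constrain(latent)
knots = [0.0, 0.25, 0.5, 0.75, 1.0]

# Evaluate on a grid; positive coefficients on nondecreasing basis
# functions yield a nondecreasing quantile function.
grid = [g / 100 for g in range(101)]
values = [quantile_function(tau, 2.0, coefs, knots) for tau in grid]
```

The design point is that sampling happens on the unconstrained latent scale, while the quantile function is always evaluated with the constrained coefficients, so every draw is a valid quantile function.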
In the second stage, the health model is fitted while accounting for the uncertainties associated with estimating exposure quantile functions. When the exposure quantile function is expanded using basis functions, the health model in Equation 3 becomes
![]() |
(6) |
where
and
. Since
is prespecified, uncertainties of estimated
contribute to uncertainties of the estimation of quantile functions.
To appropriately propagate uncertainties resulting from the first-stage estimation, we adopt an approach that is commonly used in environmental health studies for incorporating uncertainties in estimated exposures (Carroll et al., 2006; Lee et al., 2017). Specifically, a multivariate normal (MVN) prior is assumed for
with mean and variance-covariance matrix computed from its posterior predictive distribution obtained from the first-stage estimation. Similarly to the case in which quantile functions are known, we view
as a random covariate vector. With MVN prior assumed for
, the prior of
is also MVN. It is worth noting that posterior distributions of
are correlated across groups (eg, time points) when quantile processes are assumed to vary by groups. As the number of groups increases, this approach becomes computationally expensive since the MCMC algorithm requires sampling from a high-dimensional MVN distribution. In the simulation study and real data analysis, we ignore the correlation between groups to facilitate the computation. We find that this does not meaningfully impact inference for the health effect association, as also recently demonstrated in Comess et al. (2022). As in the case of known quantile functions, the estimation of coefficients
and
is carried out by Gibbs sampling with normal priors specified for those coefficients using the Polya-Gamma method. A Metropolis-Hastings (MH) algorithm is implemented to estimate the over-dispersion parameter
using a uniform prior. For hyperparameters defining the mean-zero spatial/temporal residual process
, vague conjugate inverse-Gamma priors are used for variance parameters. When spatial/temporal random effects follow the proper conditional autoregressive (CAR) prior, the parameter controlling spatial/temporal dependency is updated using an MH algorithm with a discrete prior. Details of the MCMC algorithms for the estimation of quantile functions, and for estimation with known and unknown quantile functions, can be found in Supplementary Materials.
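The first-stage summary that feeds the MVN prior can be sketched as follows: given posterior draws of one group's basis coefficients, compute their mean vector and variance-covariance matrix. The draws below are made-up numbers and the helper name is hypothetical; in practice the draws would come from the first-stage MCMC output.

```python
def mean_and_covariance(draws):
    """Sample mean vector and unbiased covariance matrix of a list of draw vectors."""
    n, d = len(draws), len(draws[0])
    mean = [sum(row[j] for row in draws) / n for j in range(d)]
    cov = [[sum((row[j] - mean[j]) * (row[k] - mean[k]) for row in draws) / (n - 1)
            for k in range(d)]
           for j in range(d)]
    return mean, cov

# Hypothetical posterior draws of two basis coefficients for a single group.
posterior_draws = [[1.0, 0.2], [1.2, 0.1], [0.8, 0.3], [1.0, 0.2]]

# These moments would parameterize the group's MVN prior in the second stage.
prior_mean, prior_cov = mean_and_covariance(posterior_draws)
```

Summarizing each group separately, as here, corresponds to ignoring the between-group correlation, which keeps every MVN low-dimensional.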
4. SIMULATION STUDIES
We conducted simulation studies to examine the impact of not accounting for exposure heterogeneity motivated by our application and evaluate the performance of the 2-stage estimation procedure. A variety of associations between exposures and outcomes are considered by specifying different forms of coefficient regression function
.
The simulation study assumes that health outcome is collected over
time points and exposure quantile functions are temporally correlated. The temporal dependence was introduced by defining a first-order Gaussian Markov random field process for the unconstrained latent basis coefficients. The true health model and exposure quantile function are
![]() |
where
are piecewise Gamma functions with 4 basis functions; W is a symmetric matrix with the
entry equal to 1 if time point i and
are adjacent, and 0 otherwise; and
is a diagonal matrix with the ith diagonal element equal to the row sum of ith row of W. For each time point i, individual exposure data with a sample size of 100 were generated using the quantile function
. To mimic the right-skewed distribution of PM
observed in the motivating data, we set
,
for
, and parameters controlling temporal correlations of quantile processes
equal to
.
We examined 6 coefficient regression functions displayed in Figure S1: (1)
, (2)
, (3)
, (4)
, (5)
, and (6)
. These 6 functions include 3 types of effects (constant, increasing, and decreasing). We note that
equals 0.5 under all scenarios, and this quantity is interpreted as the effect of exposures associated with the exposure distribution shifted to the right by 1 unit (ie, the mean is increased by 1 unit). Based on simulated quantile functions
, 100 health datasets were generated for each of different forms of
while fixing
. The conventional “mean” model using the mean of individual exposures at each time point as the covariate was also fitted for the comparison.
Three quantities were used for performance evaluation: (1) the effect associated with the exposure distribution shifted to the right by 1 unit, which is measured by
for the proposed model, and the regression coefficient of the mean exposure (ie,
) in the conventional “mean” model denoted as
; (2) predictive values of the exposure, which is measured by
and
for the proposed and mean models, respectively; and (3) the total number of exposure-attributable events, which is computed as
and
for the proposed and mean models, respectively. For these quantities, their relative bias, mean squared error (MSE) or integrated mean squared error (IMSE), and 95% coverage probability (CP) were computed.
For
and the total number of exposure-attributable events, their relative bias, MSE, and CP were computed by averaging over simulations. However, for predictive values, we averaged over both simulations and time points to calculate IMSE. Specifically, when exposure quantile functions were estimated, the IMSE of predictive values of the exposure was calculated as
![]() |
(7) |
for the proposed model, where subscript d is the index for the simulation with
and
;
,
is the vector of estimated basis coefficients defined in Equation 3;
,
are estimated basis coefficients defined in Equation 6. The computation of the relative bias and CP follows a similar manner.
Additionally, the estimation performance of
was examined with respect to its bias, IMSE, and CP by averaging over simulations and quantile levels
. For example, IMSE of
is
. Relative bias was not computed, since
can take value of 0.
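The scalar performance measures can be sketched in a few lines. The helper names, the toy estimates, and the toy intervals are hypothetical; the true value of 0.5 mirrors the effect of a 1-unit shift used in the simulation design.

```python
def relative_bias(estimates, truth):
    """(Average estimate - truth) / truth, averaged over simulations."""
    return (sum(estimates) / len(estimates) - truth) / truth

def mean_squared_error(estimates, truth):
    """Average squared deviation of the estimates from the truth."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def coverage_probability(intervals, truth):
    """Percentage of interval estimates (lower, upper) that contain the truth."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return 100.0 * hits / len(intervals)

# Hypothetical point estimates and 95% intervals from 4 simulated datasets.
truth = 0.5
estimates = [0.52, 0.49, 0.53, 0.50]
intervals = [(0.45, 0.60), (0.40, 0.55), (0.51, 0.62), (0.41, 0.52)]

rb = relative_bias(estimates, truth)
mse = mean_squared_error(estimates, truth)
cp = coverage_probability(intervals, truth)
```

The integrated versions average the same quantities over an additional index, such as time points or a grid of quantile levels.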
In this simulation study, we modeled
using orthonormal Bernstein polynomials of degree 2 (ie, 3 basis functions) and specified vague normal priors N(0, 100) for the corresponding basis coefficients. The over-dispersion parameter included in the health model was updated using MH algorithms with a uniform prior. For the estimation of exposure quantile functions, we used 4 piecewise Gamma functions to expand quantile functions. Coefficients
and
were assigned N(0, 100) priors and were updated using MH algorithms;
and
were given InvGamma(0.1, 0.1) priors and were updated using Gibbs sampling; discrete priors (ie, 1000 equally spaced values between 0 and 1) were assigned for
and
, which were updated using MH algorithms. We generated 10 000 MCMC samples and discarded the first 5000 samples as burn-in when estimating quantile functions; while 5000 samples were generated and the first 2500 samples were discarded as burn-in when estimating health effects regardless of using true or estimated quantile functions.
Simulation results from using different exposure covariates (true mean exposures, true exposure quantile functions, and estimated quantile functions) are summarized in Table 1. We first focus on the case where exposure quantile functions and mean exposures are assumed to be known. Under the scenario where
is a constant (ie, S1), results from using quantile functions and mean of exposures are similar. Since
was modeled using the basis expansion, a larger IMSE was observed for the proposed model. For other scenarios, the proposed model resulted in empirically unbiased estimates of health effects. However, we observed biased estimates of health effects when using mean exposures. Specifically, we found positive biases when the upper tail of exposure distributions has larger effects (eg, S2 and S3), and negative biases when the effects decrease as the quantile level increases (eg, S5 and S6). The reverse pattern was observed for the total number of events attributed to exposures and predictive values of exposures. In S4, where the health effect first increases and then becomes a constant, the model using mean exposure happened to lead to a nearly unbiased estimate with conservative 95% CIs for
. However, this model failed to characterize effects of exposures as indicated by the larger MSE and severe under-coverage associated with predictive values of exposures. It is also worth noting that the proposed model performed slightly worse in this scenario, for example, the MSE of
was higher than S2, S3, and S5, likely because
in S4 is approximated less well by the basis expansion.
TABLE 1.
Simulation results using different exposure covariates.

| | | $\int_0^1 \beta(\tau)\, d\tau$ ^b | | | $\beta(\tau)$ | | | Predictive values of exposures | | | Exposure-attributable events | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scenario | Covariate^a | Relative bias | Relative MSE^c | CP (%) | Bias | IMSE | CP (%) | Relative bias | Relative IMSE^c | CP (%) | Relative bias | Relative IMSE^c | CP (%) |
| S1 | Mean | 0.004 | 1.00 | 94 | – | – | – | 0.004 | 1.000 | 94.00 | 0.001 | 1.000 | 95 |
| | Quantile | 0.005 | 1.10 | 96 | 0.002 | 0.016 | 93.70 | −0.002 | 3.889 | 93.39 | 0.000 | 1.069 | 95 |
| | Quantile with errors | 0.011 | 1.47 | 92 | 0.004 | 0.031 | 95.48 | 0.004 | 6.475 | 96.41 | 0.001 | 1.101 | 94 |
| S2 | Mean | 0.009 | 1.00 | 89 | – | – | – | −0.122 | 1.000 | 0.01 | −0.008 | 1.000 | 83 |
| | Quantile | −0.002 | 0.80 | 96 | −0.001 | 0.015 | 92.17 | −0.002 | 0.052 | 93.66 | −0.000 | 0.610 | 93 |
| | Quantile with errors | 0.010 | 1.07 | 91 | 0.003 | 0.032 | 93.13 | 0.010 | 0.108 | 95.50 | 0.000 | 0.620 | 93 |
| S3 | Mean | 0.019 | 1.00 | 75 | – | – | – | −0.174 | 1.000 | 0.00 | −0.010 | 1.000 | 82 |
| | Quantile | −0.001 | 0.39 | 97 | −0.000 | 0.013 | 92.92 | −0.003 | 0.023 | 91.74 | 0.000 | 0.408 | 96 |
| | Quantile with errors | 0.014 | 0.74 | 90 | 0.005 | 0.036 | 90.47 | 0.017 | 0.056 | 94.12 | 0.001 | 0.398 | 96 |
| S4 | Mean | 0.003 | 1.00 | 100 | – | – | – | −0.081 | 1.000 | 0.04 | −0.006 | 1.000 | 89 |
| | Quantile | −0.001 | 1.01 | 98 | −0.001 | 0.014 | 93.61 | 0.003 | 0.127 | 93.56 | 0.000 | 0.751 | 95 |
| | Quantile with errors | 0.008 | 1.38 | 98 | 0.001 | 0.039 | 90.57 | 0.015 | 0.302 | 94.82 | 0.001 | 0.759 | 94 |
| S5 | Mean | −0.010 | 1.00 | 95 | – | – | – | 0.182 | 1.000 | 0.00 | 0.017 | 1.000 | 77 |
| | Quantile | 0.002 | 0.80 | 95 | 0.001 | 0.018 | 95.90 | −0.003 | 0.057 | 95.73 | 0.000 | 0.361 | 95 |
| | Quantile with errors | 0.003 | 0.85 | 97 | 0.000 | 0.033 | 97.12 | −0.002 | 0.107 | 96.02 | −0.000 | 0.381 | 95 |
| S6 | Mean | −0.011 | 1.00 | 94 | – | – | – | 0.163 | 1.000 | 0.02 | 0.014 | 1.000 | 80 |
| | Quantile | 0.000 | 0.75 | 95 | 0.000 | 0.016 | 95.74 | −0.001 | 0.075 | 95.67 | −0.001 | 0.537 | 95 |
| | Quantile with errors | 0.002 | 0.83 | 96 | 0.000 | 0.042 | 94.42 | −0.003 | 0.127 | 95.89 | −0.002 | 0.586 | 93 |

^a Mean = true mean exposures, Quantile = true exposure quantile functions, and Quantile with errors = estimated exposure quantile functions.
^b For the model using mean exposure as the exposure covariate, the regression coefficient of the mean exposure was reported.
^c Relative MSE/IMSE was computed by treating the model using the true mean exposure as the reference.
We also evaluated the decision to select the proposed model over the conventional “mean” exposure model informed by the widely applicable information criterion (WAIC) (Watanabe and Opper, 2010). Among 100 simulations, the proportion of simulations in which the WAIC favored the mean model is 82% under S1, where the mean model is a valid and better choice. For the other scenarios, the WAIC favored the proposed model over
of the time.
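For readers who want to reproduce this kind of comparison, WAIC can be computed from a matrix of pointwise log-likelihood values evaluated at each posterior draw. The sketch below uses the common variance-based penalty on the deviance scale; it is a generic implementation with a made-up input, not code from nbRegQF.

```python
from math import exp, log
from statistics import variance

def waic(loglik):
    """WAIC on the deviance scale from loglik[s][i] (posterior draw s, observation i).

    lppd sums the log of the posterior-mean likelihood per observation;
    the penalty sums the posterior variance of the log-likelihood per observation.
    """
    n_draws, n_obs = len(loglik), len(loglik[0])
    lppd, penalty = 0.0, 0.0
    for i in range(n_obs):
        column = [loglik[s][i] for s in range(n_draws)]
        m = max(column)  # log-sum-exp shift for numerical stability
        lppd += m + log(sum(exp(v - m) for v in column) / n_draws)
        penalty += variance(column)
    return -2.0 * (lppd - penalty)

# Hypothetical log-likelihoods: 3 posterior draws, 2 observations.
loglik_example = [[-1.0, -2.1], [-1.2, -1.9], [-0.9, -2.0]]
waic_example = waic(loglik_example)
```

Lower values indicate better expected predictive performance, so two candidate models are compared by computing this quantity for each on the same outcome data.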
When exposure quantile functions were estimated from the simulated dataset containing individual-level exposures, we applied the 2-stage estimation procedure to account for uncertainties of estimating exposure quantile functions. Figure 1(A) shows simulated exposure data over 9 consecutive days where exposure distributions are right skewed and change smoothly over time. Estimated quantile functions traced the truth well, as shown in Figure 1(B). Compared to the case where true quantile functions were used, MSE/IMSE of those 4 quantities increased (Table 1). However, across all scenarios, coverage probabilities still achieved or were close to the nominal level. We found that ignoring estimation uncertainties by plugging in the posterior means of the estimated quantile functions can lead to bias and under-coverage (results are not shown).
FIGURE 1.
(A) Boxplots of exposures observed on 9 consecutive days in the simulated data, (B) true/estimated exposure quantile functions for 9 consecutive days.
To assess the impact of the first-stage estimation on the estimation of health effects in the second stage, we further carried out simulations in which 6 and 8 piecewise Gamma functions were applied to estimate exposure quantile functions. Corresponding simulation results are presented in Table S3. As expected, when including more piecewise Gamma functions than needed, larger IMSE were observed for
, predictive values of exposures, and exposure-attributable events. We noted that the WAIC is an adequate tool for selecting the number of basis functions that is sufficient for estimating exposure quantile functions in the first stage. Specifically, simulation results showed that the WAIC computed for the first-stage and second-stage estimation favored cases where 4 piecewise Gamma functions (ie, the number of functions used to generate exposure data) were used.
5. REAL DATA ANALYSIS
In this section, we analyzed the motivating data introduced in Section 2 using both the proposed scalar-on-quantile-function model and the conventional model using average concentrations of air pollutants as the exposure.
5.1. Estimation of daily exposure quantile functions
Distributions of 4 air pollutants from SHEDS at 1 representative ZCTA are presented in Figures 2(A)-(D). We observed different degrees of skewness in exposure distributions for all 4 air pollutants. We noted that distributions of CO and EC vary much more across days than those of NOx and PM$_{2.5}$. Hence, for estimating quantile functions, quantile functions of CO and EC were assumed to be independent across days and ZCTAs, while temporal correlations were introduced for quantile functions of NOx and PM$_{2.5}$. In this analysis, we used the same priors, number of MCMC iterations, and burn-in as in Section 4. Figures 2(E)-(H) show the corresponding empirical and estimated quantile functions from using 4 piecewise Gamma functions. When we increased the number of piecewise Gamma functions to 6 and 8, the estimated quantile functions were similar (results are not shown). Thus, results from using 4 basis functions are reported. Overall, our use of basis functions sufficiently captures exposure distributions, with larger uncertainties in the lower and upper tails as expected.
FIGURE 2.
(A-D) Boxplots of concentrations of air pollutants obtained from SHEDS at ZCTA 30032 on 7 representative days and (E-H) empirical/estimated quantile functions with 95% CIs of air pollutants obtained from SHEDS at ZCTA 30032 on 7 representative days (empirical quantile functions are denoted by solid points).
5.2. Estimation of health effects for ED visits
Associations between ED visits and same-day air pollution concentrations were examined using the proposed model with estimated quantile functions and uncertainty propagation, and the conventional model using the sample mean of individual exposures at the ZCTA level. Following the previous health analysis (Sarnat et al., 2013), the following confounders were controlled for in all models: a nonlinear effect of year-specific temporal trends using day of year, a nonlinear effect of same-day dew-point temperature, indicators of day of week, an indicator of federal holidays, and a nonlinear effect of the 3-day moving average of minimum temperature (maximum temperature was controlled for in the cardiovascular disease ED visit models). All nonlinear effects were modeled with natural cubic splines with 4 degrees of freedom. Exchangeable ZCTA-specific random intercepts were included in all models. We varied the number of basis functions used for modeling the quantile-level effect function $\beta(\tau)$ from 2 to 3, and results from the model with the lower WAIC are reported. The same priors as in Section 4 were used for the basis coefficients and the over-dispersion parameter; the variance parameter of the spatial random effects was assumed to have an InvGamma(0.1, 0.1) prior. We ran the MCMC algorithm for 7500 iterations, discarding the first 3500 as burn-in.
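The WAIC-based choice between models described above can be illustrated with a minimal sketch. It assumes that pointwise log-likelihood values have been saved for each retained posterior draw; the function name `waic`, the input layout, and the toy numbers are hypothetical and are not taken from nbRegQF. The penalty used here is the variance-based form of the criterion, which is one common choice.

```python
import math
from statistics import fmean, variance


def waic(log_lik):
    """Compute WAIC on the deviance scale (lower is better).

    log_lik[s][i] is the log-likelihood of observation i evaluated at
    posterior draw s (hypothetical layout; at least 2 draws are required).
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        col = [log_lik[s][i] for s in range(n_draws)]
        m = max(col)
        # log of the posterior-mean likelihood, computed stably (log-sum-exp)
        lppd += m + math.log(fmean([math.exp(v - m) for v in col]))
        # posterior (sample) variance of the log-likelihood for observation i
        p_waic += variance(col)
    return -2.0 * (lppd - p_waic)


# Toy log-likelihood draws for two candidate models (illustrative values only)
model_a = [[-1.2, -0.8, -2.0], [-1.1, -0.9, -2.2], [-1.3, -0.7, -1.9]]
model_b = [[-1.6, -1.0, -2.5], [-1.4, -1.2, -2.6], [-1.7, -0.9, -2.4]]
preferred = "a" if waic(model_a) < waic(model_b) else "b"
```

Under this sketch, the same routine would be applied to the draws from the 2-basis and 3-basis fits, and to the quantile-function and mean-exposure models, with the smaller value indicating the preferred model.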
Figure 3 plots the estimated $\beta(\tau)$ with 95% CIs for different combinations of air pollutants and causes of ED visits. The WAIC clearly favors the proposed model in some cases (eg, panels (B), (C), (E), (F), and (H) of Figure 3). For example, Figure 3(E) shows that the lower and upper tails of the distribution of EC concentrations have larger effects on respiratory disease ED visits. The percent increase in risk associated with a 1-unit rightward shift of the distribution of EC concentrations is estimated to be 2.58 (95% CI: 0.82-4.31) and 0.56 (95% CI: 0.14 to 1.26) using the proposed and the conventional models, respectively. In other cases, the WAIC indicates that the model utilizing quantile functions and the model using the sample mean of exposures fit the data equally well. For example, Figure 3(J) shows that estimates of the health effects of PM2.5 on asthma or wheeze ED visits remain essentially the same across quantile levels, matching the estimate obtained from the mean model. Estimates of the percent increase in risk for the remaining combinations are presented in Table S2.
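The percent increase for a 1-unit rightward shift can be written out as a short derivation. This is a sketch that assumes a log-linear health model in which the exposure enters through an integral of the effect function against the exposure quantile function; the symbols $\mu_{it}$, $Q_{it}(\tau)$, and $\bar{x}_{it}$ are notation introduced here for illustration, and all other covariates are suppressed.

```latex
\begin{align*}
% Assumed model form: Q_{it}(\tau) is the exposure quantile function
% for ZCTA i on day t; confounders and random effects are omitted.
\log \mu_{it} &= \beta_0 + \int_0^1 \beta(\tau)\, Q_{it}(\tau)\, d\tau + \cdots \\
% Shift the whole distribution right by 1 unit: Q^{*}_{it}(\tau) = Q_{it}(\tau) + 1.
\log \mu^{*}_{it} - \log \mu_{it} &= \int_0^1 \beta(\tau)\, d\tau \\
\text{percent increase} &= 100 \left[ \exp\!\left( \int_0^1 \beta(\tau)\, d\tau \right) - 1 \right] \\
% If \beta(\tau) = \beta is constant, the functional term reduces to the mean
% model, because the integral of a quantile function is the mean exposure.
\int_0^1 \beta\, Q_{it}(\tau)\, d\tau &= \beta\, \bar{x}_{it}
\end{align*}
```

The last line is consistent with the observation that an effect function that is flat across quantile levels reproduces the estimate from the mean model.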
FIGURE 3.
Estimates of $\beta(\tau)$ with 95% credible intervals from analyzing the motivating data (as_whz = asthma or wheeze ED visits, resp = respiratory disease ED visits, cvd = cardiovascular disease ED visits).
Figures 4(A)-(D) display estimated quantile functions whose exposure medians are at the 25th, 50th, 75th, and 95th percentiles of the estimated medians across all ZCTAs and days. Dashed horizontal lines mark the mean exposure calculated from the SHEDS data used for estimating the quantile functions. These quantile functions were selected to represent different exposure contrasts across ZCTAs and days. Estimates of relative risks associated with changes in the selected estimated quantile functions from the proposed and the conventional models are shown in Figures 4(E)-(H). Two contrasts were selected to represent different exposure effects: (1) comparing quantile functions with medians at the 75th and 25th percentiles to represent a typical exposure contrast and (2) comparing quantile functions with medians at the 95th and 50th percentiles to represent a more extreme exposure contrast.
FIGURE 4.
Short-term associations between ED visits and 2 exposure contrasts defined by the exposure median (75th versus 25th and 95th versus 50th percentiles of the distribution of estimated medians across all ZCTAs and days). Results obtained from models using the mean concentrations as the exposure are also shown (as_whz = asthma or wheeze, resp = respiratory disease, and cvd = cardiovascular disease).
For short-term associations between ambient concentrations of CO and respiratory disease and cardiovascular disease ED visits, as shown in Figures 3(B) and (C), the proposed model including quantile functions was preferred, and we found that effects of CO at lower quantile levels are more pronounced. The corresponding short-term associations for the 2 selected contrasts are presented in Figure 4(E). We observed that simply using average concentrations of CO as the covariate in the health model underestimated the relative risk of cardiovascular disease and respiratory disease ED visits associated with CO, compared with the proposed model that considers the entire exposure distribution. Differences in the estimated number of ED visits attributable to air pollution exposure are also present. For example, Figures 5(A) and (B) illustrate differences in the number of exposure-attributable ED visits by ZCTA when effects of air pollutants vary by quantile level. Specifically, the total number of cardiovascular ED visits attributed to CO exposure was estimated to be 798 (95% CI: 74-3240) using the proposed model, while the mean model yielded an estimate of 640 (95% CI: -80 to 1343). In contrast, for PM2.5 and asthma or wheeze ED visits, where health effects are invariant with respect to quantile levels, the proposed model and the conventional model led to similar results (Figure 5(C)).
FIGURE 5.
Relative differences of the ZCTA-level number of exposure-attributable ED visits between using the proposed exposure-quantile model and the conventional mean-exposure model (as_whz = asthma or wheeze ED visits, resp = respiratory disease ED visits, cvd = cardiovascular disease ED visits).
6. DISCUSSION
In this work, we propose a scalar-on-quantile-function approach to fully characterize effects of environmental exposures on aggregate health outcomes by treating exposure quantile functions as functional covariates. Compared to methods that include only summary statistics of personal exposures as the exposure metric of interest, our approach accounts for within-group exposure heterogeneity and allows more flexible associations between exposures and aggregate health outcomes. In addition, parametric assumptions on exposure distributions are not necessary in our approach. With the proposed Bayesian 2-stage estimation procedure, estimates of health effects can be obtained while incorporating uncertainties in the estimation of exposure quantile functions. This 2-stage procedure also alleviates the computational burden when associations between 1 exposure and multiple health outcomes are examined, compared to approaches that jointly model exposure and health data.
Applying the proposed model to the motivating ED visits data in Atlanta, we identified novel short-term associations between ambient air pollution concentrations and ED visits, which were masked when daily population average exposures were linked to ED visits. For example, results suggest that effects of ambient concentrations of CO on respiratory disease ED visits and cardiovascular disease ED visits are highest at lower quantile levels. These new findings may be important for identifying subpopulations most vulnerable to ambient air pollution. For the majority of pollutant and outcome pairs, we found robust and positive associations comparable to the conventional approach, which strengthens prior evidence on the negative health effects of air pollution.
In the real data application, we noted that the widths of the 95% CIs of $\beta(\tau)$ vary by quantile level. For example, in Figure 3(D), (F), and (J), the intervals are narrower around the median. This likely relates to a property of the orthonormal Bernstein polynomials used to expand $\beta(\tau)$. Empirically, we observed that the 2 estimated basis coefficients were negatively correlated when orthonormal Bernstein polynomials of degree 1 were used (ie, $\beta(\tau)$ was expanded using 2 basis functions). We further found that the maximum of the product of the 2 basis functions occurred around the median. It is easy to show that, when the 2 basis coefficients are negatively correlated, the variance of the estimated $\beta(\tau)$ decreases as this product increases. This explains why the intervals were narrowest around the median. Similar arguments could apply when more than 2 basis functions are used to expand $\beta(\tau)$.
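The variance argument in this paragraph can be made explicit with a short worked identity. The notation is introduced here for illustration only: $\phi_1(\tau)$ and $\phi_2(\tau)$ denote the 2 basis functions and $\hat b_1$, $\hat b_2$ the estimated basis coefficients.

```latex
\begin{align*}
\hat\beta(\tau) &= \hat b_1\, \phi_1(\tau) + \hat b_2\, \phi_2(\tau), \\
\operatorname{Var}\{\hat\beta(\tau)\}
  &= \phi_1(\tau)^2 \operatorname{Var}(\hat b_1)
   + \phi_2(\tau)^2 \operatorname{Var}(\hat b_2)
   + 2\, \phi_1(\tau)\, \phi_2(\tau) \operatorname{Cov}(\hat b_1, \hat b_2).
\end{align*}
```

When $\operatorname{Cov}(\hat b_1, \hat b_2) < 0$ and the product $\phi_1(\tau)\phi_2(\tau)$ is positive, the cross term is negative and grows in magnitude with the product, so, other terms being comparable, the pointwise variance is smallest where the product is largest. The first 2 terms also change with $\tau$, so this identity describes the mechanism rather than the exact location of the narrowest interval.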
It is noteworthy that the interpretation of $\beta(\tau)$ requires extra caution. Specifically, we suggest not interpreting $\beta(\tau)$ directly as the change in the log mean of the aggregate outcome associated with a 1-unit increase at the $\tau$th percentile while keeping all other percentiles the same. This is because, in practice, it is not realistic for the quantile function to change only at a specific percentile level while all other percentile levels remain unchanged. To obtain interpretable results from the proposed model, we recommend computing the relative risk associated with a change between 2 representative quantile functions, as presented in Figures 4(E)-(H).
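The recommended contrast between 2 representative quantile functions can be sketched numerically. The example below assumes a log-linear model in which the exposure term is the integral of the effect function times the quantile function, so that the relative risk is the exponential of the integrated difference. The function names, the effect function, and the 2 quantile functions are hypothetical placeholders, not estimates from the analysis.

```python
import math


def trapezoid(y, x):
    """Trapezoidal rule for values y evaluated on grid x."""
    return sum(0.5 * (y[k] + y[k + 1]) * (x[k + 1] - x[k])
               for k in range(len(x) - 1))


def relative_risk(beta, q_high, q_low, n_grid=1001):
    """RR = exp( integral over [0, 1] of beta(tau) * [q_high(tau) - q_low(tau)] ).

    beta, q_high, and q_low are callables on [0, 1]; the integral is
    approximated on an evenly spaced grid of quantile levels.
    """
    taus = [k / (n_grid - 1) for k in range(n_grid)]
    integrand = [beta(t) * (q_high(t) - q_low(t)) for t in taus]
    return math.exp(trapezoid(integrand, taus))


# Hypothetical effect function that is larger at lower quantile levels
def beta_example(tau):
    return 0.05 * (1.0 - tau)


# Hypothetical quantile functions for a "low" and a "high" exposure day
def q_low_example(tau):
    return 0.4 + 0.6 * tau


def q_high_example(tau):
    return 0.7 + 0.9 * tau


rr = relative_risk(beta_example, q_high_example, q_low_example)
```

Because the contrast uses 2 complete quantile functions, it avoids the unrealistic scenario of changing a single percentile while holding all others fixed.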
The proposed model was motivated by the SHEDS data, which provide simulated personal exposures to different air pollutants in Atlanta over 4 years. While personal exposures are not widely available in large-scale population-based epidemiological studies, the proposed model can be applied to more general settings. Specifically, the proposed approach is applicable to scenarios where within-group exposure heterogeneity exists and can be characterized using quantile functions. For example, when environmental exposures are predicted using spatial-temporal models at a finer spatial resolution than that of the health data, our proposed approach can be applied with exposure quantile functions derived from predicted exposures and population density. In addition, the health outcome does not have to be restricted to aggregated counts. For example, in the analysis of birthweight data in North Carolina conducted by Berrocal et al. (2011), the effects of an individual's exposure distribution could be assessed using our approach.
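For the setting mentioned above, in which exposures are predicted on a finer grid than the health data, one possible way to obtain a within-unit exposure quantile function is a population-weighted empirical quantile. The sketch below is an assumption about how this could be done and is not a procedure described in the paper; the function name and the grid-cell values are hypothetical.

```python
def weighted_quantile_function(values, weights):
    """Return Q(tau) for a weighted empirical distribution.

    values  : predicted exposures for the grid cells inside one areal unit
    weights : population counts (or densities) for the same cells
    Q(tau) is the smallest value whose cumulative weight share is >= tau.
    """
    if len(values) != len(weights) or len(values) == 0:
        raise ValueError("values and weights must be non-empty and of equal length")
    pairs = sorted(zip(values, weights))
    total = float(sum(w for _, w in pairs))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def q(tau):
        if not 0.0 < tau <= 1.0:
            raise ValueError("tau must be in (0, 1]")
        cum = 0.0
        for value, weight in pairs:
            cum += weight
            if cum / total >= tau:
                return value
        return pairs[-1][0]  # guard against floating-point round-off

    return q


# Hypothetical grid cells within one ZCTA on one day
predicted = [0.8, 1.4, 0.5, 2.1]      # predicted concentrations
population = [1200, 300, 2500, 1000]  # residents per grid cell
q = weighted_quantile_function(predicted, population)
deciles = [q(k / 10) for k in range(1, 10)]
```

Weighting by population makes the quantile function describe the distribution of exposure across residents rather than across land area; the resulting quantiles could then serve as input to the quantile-function estimation stage.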
In this work, we focus on a single environmental exposure. One further extension of the proposed approach is to simultaneously examine effects of multiple exposures on an aggregate health outcome. A possible strategy is to incorporate a quantile surface of exposures as the functional predictor in the health model, which might account for correlations between exposures. Additionally, in our real data application, only same-day health effects were explored. However, in practice, environmental exposures can exhibit delayed effects. This warrants future research extending the proposed model to incorporate exposure quantile functions at different lags. One could consider adding an extra time dimension to both $\beta(\tau)$ and the exposure quantile function. In the real data analysis, we found that the estimation of exposure quantile functions is sensitive to outliers. As a result, larger uncertainties of estimated quantile functions are observed in the distribution tails, and the estimation of health effects may be affected. To mitigate the impact of outliers on the exposure quantile estimation, one could consider introducing parametric methods for characterizing the tails of exposure distributions (Zhou et al., 2012).
Supplementary Material
Tables referenced in Sections 2, 4, and 5.2, the basis functions introduced in Section 3.2.2 used for expanding exposure quantile functions, and details of MCMC algorithms referenced in Sections 3 and 4 are available with this paper at the Biometrics website on Oxford Academic. The proposed method is implemented in an R package nbRegQF, which can be accessed via the GitHub site: https://github.com/YZHA-yuzi/nbRegQF. In addition, R codes for the simulation studies included in Section 4 are posted online with this paper. They are also available on the following GitHub link: https://github.com/YZHA-yuzi/nbRegQF_reproducibility.
Acknowledgement
We thank Lisa Baxter and Kathie Dionisio from the US EPA for providing the SHEDS exposure data. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Yuzi Zhang: Formal analysis, Methodology, Software, Validation, Writing - original draft. Howard H. Chang: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing - review & editing. Joshua L. Warren: Writing - review & editing. Stefanie T. Ebelt: Writing - review & editing.
Contributor Information
Yuzi Zhang, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, United States.
Howard H Chang, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, United States.
Joshua L Warren, Department of Biostatistics, Yale University, New Haven, CT 06511, United States.
Stefanie T Ebelt, Gangarosa Department of Environmental Health, Emory University, Atlanta, GA 30322, United States.
FUNDING
The work is supported by grants R01ES027892 and R01ES028346 from the National Institutes of Environmental Health.
CONFLICT OF INTEREST
None declared.
DATA AVAILABILITY
The real data underlying the results presented in this article cannot be shared publicly because of restrictions that apply to protected health information (PHI).
References
- Alhanti B. A., Chang H. H., Winquist A., Mulholland J. A., Darrow L. A., Sarnat S. E. (2016). Ambient air pollution and emergency department visits for asthma: a multi-city assessment of effect modification by age. Journal of Exposure Science & Environmental Epidemiology, 26, 180–188.
- Bekkar B., Pacheco S., Basu R., DeNicola N. (2020). Association of air pollution and heat exposure with preterm birth, low birth weight, and stillbirth in the US: a systematic review. JAMA Network Open, 3, e208243.
- Bellucci M. A. (2014). On the explicit representation of orthonormal Bernstein polynomials. arXiv:1404.2293 (Accessed 25 January 2024).
- Berrocal V. J., Gelfand A. E., Holland D. M., Burke J., Miranda M. L. (2011). On the use of a PM2.5 exposure simulator to explain birthweight. Environmetrics, 22, 553–571.
- Bhaskaran K., Gasparrini A., Hajat S., Smeeth L., Armstrong B. (2013). Time series regression studies in environmental epidemiology. International Journal of Epidemiology, 42, 1187–1195.
- Boogaard H., Walker K., Cohen A. J. (2019). Air pollution: the emergence of a major global health risk factor. International Health, 11, 417–421.
- Brockhaus S., Scheipl F., Hothorn T., Greven S. (2015). The functional linear array model. Statistical Modelling, 15, 279–300.
- Burke J. M., Zufall M. J., Özkaynak H. (2001). A population exposure model for particulate matter: case study results for PM2.5 in Philadelphia, PA. Journal of Exposure Science & Environmental Epidemiology, 11, 470–489.
- Calder C. A., Holloman C. H., Bortnick S. M., Strauss W., Morara M. (2008). Relating ambient particulate matter concentration levels to mortality using an exposure simulator. Journal of the American Statistical Association, 103, 137–148.
- Carroll R. J., Ruppert D., Stefanski L. A., Crainiceanu C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. New York: Chapman and Hall/CRC.
- Chang H. H., Fuentes M., Frey H. C. (2012). Time series analysis of personal exposure to ambient air pollution and mortality using an exposure simulator. Journal of Exposure Science & Environmental Epidemiology, 22, 483–488.
- Comess S., Chang H. H., Warren J. L. (2024). A Bayesian framework for incorporating exposure uncertainty into health analyses with application to air pollution and stillbirth. Biostatistics, 25, 20–39.
- Dionisio K. L., Isakov V., Baxter L. K., Sarnat J. A., Sarnat S. E., Burke J. et al. (2013). Development and evaluation of alternative approaches for exposure assessment of multiple air pollutants in Atlanta, Georgia. Journal of Exposure Science & Environmental Epidemiology, 23, 581–592.
- Dominici F., Zeger S. L., Samet J. M. (2000). A measurement error model for time-series studies of air pollution and mortality. Biostatistics, 1, 157–175.
- Guo Y., Gasparrini A., Armstrong B. G., Tawatsupa B., Tobias A., Lavigne E. et al. (2017). Heat wave and mortality: a multicountry, multicommunity study. Environmental Health Perspectives, 125, 087006.
- Huang G., Lee D., Scott E. M. (2018). Multivariate space-time modelling of multiple air pollutants and their health effects accounting for exposure uncertainty. Statistics in Medicine, 37, 1134–1148.
- Jenkins P. (1996). Personal exposure to airborne particles and metals: results from the particle team study in Riverside, California. Journal of Exposure Analysis and Environmental Epidemiology, 6, 57.
- Jerrett M., Arain A., Kanaroglou P., Beckerman B., Potoglou D., Sahsuvaroglu T. et al. (2005). A review and evaluation of intraurban air pollution exposure models. Journal of Exposure Science & Environmental Epidemiology, 15, 185–204.
- Jerrett M., Burnett R. T., Kanaroglou P., Eyles J., Finkelstein N., Giovis C. et al. (2001). A GIS–environmental justice analysis of particulate air pollution in Hamilton, Canada. Environment and Planning A, 33, 955–973.
- Landrigan P. J. (2017). Air pollution and health. The Lancet Public Health, 2, e4–e5.
- Lee D., Mukhopadhyay S., Rushworth A., Sahu S. K. (2017). A rigorous statistical framework for spatio-temporal pollution prediction and estimation of its long-term impact on health. Biostatistics, 18, 370–385.
- Leiva V., Barros M., Paula G. A., Sanhueza A. (2008). Generalized Birnbaum-Saunders distributions applied to air pollutant concentration. Environmetrics: The Official Journal of the International Environmetrics Society, 19, 235–249.
- Li M., Wang K., Maity A., Staicu A.-M. (2022). Inference in functional linear quantile regression. Journal of Multivariate Analysis, 190, 104985.
- Morris J. S. (2015). Functional regression. Annual Review of Statistics and Its Application, 2, 321–359.
- Neelon B. (2019). Bayesian zero-inflated negative binomial regression based on Pólya-gamma mixtures. Bayesian Analysis, 14, 829.
- Özkaynak H., Xue J., Weker R., Butler D., Koutrakis P. (1996). Particle team (pteam) study: analysis of the data. Final report, Vol. 3, Technical report. Boston, MA: School of Public Health, Harvard University.
- Polson N. G., Scott J. G., Windle J. (2013). Bayesian inference for logistic models using Pólya–gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
- Reich B. J. (2012). Spatiotemporal quantile regression for detecting distributional changes in environmental processes. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61, 535–553.
- Reich B. J., Fuentes M., Burke J. (2009). Analysis of the effects of ultrafine particulate matter while accounting for human exposure. Environmetrics: The Official Journal of the International Environmetrics Society, 20, 131–146.
- Richardson S., Best N. (2003). Bayesian hierarchical models in ecological studies of health-environment effects. Environmetrics: The Official Journal of the International Environmetrics Society, 14, 129–147.
- Richmond-Bryant J., Long T. C. (2020). Influence of exposure measurement errors on results from epidemiologic studies of different designs. Journal of Exposure Science & Environmental Epidemiology, 30, 420–429.
- Sarnat S. E., Sarnat J. A., Mulholland J., Isakov V., Özkaynak H., Chang H. H. et al. (2013). Application of alternative spatiotemporal metrics of ambient air pollution exposure in a time-series epidemiological study in Atlanta. Journal of Exposure Science & Environmental Epidemiology, 23, 593–605.
- Sheppard L. (2003). Insights on bias and information in group-level studies. Biostatistics, 4, 265–278.
- Steinle S., Reis S., Sabel C. E., Semple S., Twigg M. M., Braban C. F. et al. (2015). Personal exposure monitoring of PM2.5 in indoor and outdoor microenvironments. Science of the Total Environment, 508, 383–394.
- Sugg M. M., Fuhrmann C. M., Runkle J. D. (2018). Temporal and spatial variation in personal ambient temperatures for outdoor working populations in the southeastern USA. International Journal of Biometeorology, 62, 1521–1534.
- Thornton P., Thornton M., Mayer B. W., Wei Y., Devarakonda R., Vose R. S. et al. (2016). Daymet: Daily Surface Weather Data on a 1-km Grid for North America, Version 3. Oak Ridge, Tennessee, USA: ORNL DAAC. 10.3334/ORNLDAAC/1328 (Accessed 15 July 2021).
- Watanabe S., Opper M. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.
- Yang H., Baladandayuthapani V., Rao A. U., Morris J. S. (2020). Quantile function on scalar regression analysis for distributional data. Journal of the American Statistical Association, 115, 90–106.
- Yoo E.-H., Eum Y., Roberts J. E., Gao Q., Chen K. (2021). Association between extreme temperatures and emergency room visits related to mental disorders: a multi-region time-series study in New York, USA. Science of The Total Environment, 792, 148246.
- Zhou J., Chang H. H., Fuentes M. (2012). Estimating the health impact of climate change with calibrated climate model output. Journal of Agricultural, Biological, and Environmental Statistics, 17, 377–394.
- Zidek J. V., Shaddick G., White R., Meloche J., Chatfield C. (2005). Using a probabilistic model (pCNEM) to estimate personal exposure to air pollution. Environmetrics: The Official Journal of the International Environmetrics Society, 16, 481–493.