Dissecting heritability, environmental risk, and air pollution causal effects using > 50 million individuals in MarketScan

Daniel McGuire; Havell Markus; Lina Yang; Jingyu Xu; Austin Montgomery; Arthur Berg; Qunhua Li; Laura Carrel; Dajiang J Liu; Bibo Jiang

doi:10.1038/s41467-024-49566-6

. 2024 Jun 25;15:5357. doi: 10.1038/s41467-024-49566-6

Dissecting heritability, environmental risk, and air pollution causal effects using > 50 million individuals in MarketScan

Daniel McGuire ^1,^#, Havell Markus ^2,^3,^4,^#, Lina Yang ¹, Jingyu Xu ¹, Austin Montgomery ², Arthur Berg ¹, Qunhua Li ⁵, Laura Carrel ⁶, Dajiang J Liu ^1,^✉, Bibo Jiang ^1,^✉

PMCID: PMC11199552 PMID: 38918381

Abstract

Large national-level electronic health record (EHR) datasets offer new opportunities for disentangling the role of genes and environment through deep phenotype information and approximate pedigree structures. Here we use the approximate geographical locations of patients as a proxy for spatially correlated community-level environmental risk factors. We develop a spatial mixed linear effect (SMILE) model that incorporates both genetics and environmental contribution. We extract EHR and geographical locations from 257,620 nuclear families and compile 1083 disease outcome measurements from the MarketScan dataset. We augment the EHR with publicly available environmental data, including levels of particulate matter 2.5 (PM_2.5), nitrogen dioxide (NO₂), climate, and sociodemographic data. We refine the estimates of genetic heritability and quantify community-level environmental contributions. We also use wind speed and direction as instrumental variables to assess the causal effects of air pollution. In total, we find PM_2.5 or NO₂ have statistically significant causal effects on 135 diseases, including respiratory, musculoskeletal, digestive, metabolic, and sleep disorders, where PM_2.5 and NO₂ tend to affect biologically distinct disease categories. These analyses showcase several robust strategies for jointly modeling genetic and environmental effects on disease risk using large EHR datasets and will benefit upcoming biobank studies in the era of precision medicine.

Subject terms: Genome-wide association studies, Epidemiology

Large national-level electronic health record datasets offer new opportunities for disentangling the roles of genes and environment in human diseases. Here, the authors propose a spatial mixed linear effect model (SMILE) to dissect genetic and environmental risk factors for diseases and assess the causality of air pollutants in an insurance claim dataset with 50 million individuals.

Introduction

It is widely known that most complex traits are jointly influenced by genetics and environment. Yet, the extent to which genetic or environmental factors contribute to complex traits is much less understood and often subject to contentious debate¹, possibly due to the lack of large and high-quality datasets containing both genetic data and environmental exposures. Quantifying the genetic and environmental contributions to human disease is critical to understanding the underlying biology, performing accurate risk predictions, and designing effective preventive and therapeutic interventions.

Traditionally, family-based studies and variance components models have been used to partition phenotypic variance into genetic and environmental components, where health outcome similarities among relatives are regressed over measures of genetic relatedness and shared environmental exposure^2,3. In these studies, unmeasured environmental exposures across different families are often assumed to be independent. However, a number of environmental risk factors are shared between different families in the same geographic area and are spatially correlated⁴. Examples include air pollution, climate, and sociodemographic characteristics such as average levels of education and income^5,6. Unmodeled community-level environment effects could lead to upwardly biased estimates of genetic heritability, as within-family phenotypic correlation due to shared community-level environment may be falsely attributed to genetics⁴. Twin studies may also be subject to the impact of environmental confounding, and often do not estimate the contribution of community-level environment.

Genome-wide association studies (GWAS) and linear mixed models with unrelated individuals have been used to estimate narrow-sense heritability that is captured by the genotyped SNPs (chip heritability)⁷. GWAS using unrelated individuals often achieve much larger sample sizes and are less likely to be confounded by shared environment compared to family-based studies⁸. However, recent research has shown that geographical confounding can bias estimates of chip heritability as well⁹. Besides, these chip heritability estimates are conceptually different from that of family-based studies and are sensitive to the assumptions on the allele frequencies, effect sizes, and levels of linkage disequilibrium between genotyped SNPs and the true causal variants⁸. Standard linear mixed models in GWAS often do not model community-level shared environmental variance and hence do not quantify its contributions to disease. There are also existing works that seek to estimate environmental impacts on diseases but do not account for genetic relatedness, and hence do not provide joint estimates of heritability and environmental contribution either^3,10–15.

In this work, to address the aforementioned challenges and fill in the knowledge gap, we describe a spatial mixed linear effect (SMILE) model that jointly estimates genetic heritability and environmental components of disease risk using geospatial locations of the study participants as a proxy for community-level environmental risk factors. The SMILE model helps characterize geographic variation in disease risk, control for additional sources of within-family correlation, and reduce the bias of estimated heritability. We apply SMILE to the MarketScan dataset¹⁶, a large insurance billing database with electronic health records (EHR) from more than 50 million individuals to assess the contribution of genetic and environmental factors to 1083 human diseases. To further assess if environmental risk factors are causally linked to disease, we augmented the MarketScan dataset with publicly available environmental data, including levels of particulate matter 2.5 (PM_2.5) and nitrogen dioxide (NO₂), climate, and socioeconomic status. We apply a rigorous causal inference framework to assess the roles of pollutants PM_2.5 and NO₂ for the phenome using wind speed and direction as instrumental variables.

Results data overview

We used the IBM MarketScan health insurance claims database to assemble a large quality-filtered cohort of 257,620 nuclear families with parents and children (Methods). The MarketScan database is a de-identified compilation of patient billing code records from employer-based health insurance policies in the United States. Family structure was inferred based on the relationship of each member to the primary enrollee on the policy. Members indicated as either “Employee” or “Spouse” were deemed as parents and those indicated as “Child/Other” were deemed as children in the family¹⁷. We analyze nuclear families who were enrolled in the database for at least 6 years between 2005 and 2017, and for whom all children are at least 10 years old at the time of entry into the database. As we demonstrate in the Results and Supplementary Methods, the impact of mis-specified familial relationship and the length of enrollment in the database have minimal impact on the estimates of variance components.

A summary of the available demographic characteristics of ascertained families is provided in Table 1. We then mapped the inpatient and outpatient (International Classification of Diseases version 9 and 10 (ICD-9/ICD-10) billing code records between 2005 to 2017 to PheWAS codes¹³, which represent biologically/medically more meaningful phenotypes. We also include several individual-level covariates including the year of birth and sex, as well as the approximate location in terms of U.S. County or Metropolitan Statistical Area (MSA) in our analyses.

Table 1.

Demographic description of ascertained nuclear families in the Marketscan database

Total # of families	257,620
Unique family locations (County or MSA)	3,229
Median number of families per location (IQR)	15 (5,50)
Median months enrolled (IQR)	84 (72,102)
Median age at first year of enrollment (IQR)
Overall	23.5 (15,46)
Father	47 (44,51)
Mother	45 (42,49)
Children	15 (13,17)
Median age difference (IQR)
Father – Mother	1.907 (0,4)
Oldest Sibling – Youngest Sibling	3.061 (2,4)
% Primary enrollee by parent
Father	67.3%
Mother	32.7%
Fraction of smokers
Overall	66,723 (6.5%)
Father	23,457 (9.1%)
Mother	17,643 (6.8%)
Children	25,623 (5%)

Open in a new tab

IQR inter-quartile range

Multiple external datasets of community-level risk factors were also incorporated, which were assigned to each individual based on their location. These include demographic data extracted from the 2015 American Community Survey 5-year estimates¹⁸, satellite-derived measurements of air pollution including particulate matter 2.5 (PM_2.5)^19,20 (Supplementary Fig. 1) and nitrogen dioxide (NO₂) (Supplementary Fig. 2-3)^21,22. We also integrated wind direction and wind speed data²³ as instrumental variables for causal inference (see Methods for details).

Statistical methodology overview

In the SMILE model, we incorporate random effects to capture phenotypic variation attributable to genetic relatedness, shared family environment, and shared community-level environment. The full SMILE model is specified by

Y = X π + u_{g} + Z_{s} u_{s} + Z_{par} u_{par} + Z_{child} u_{child} + ϵ

In this model:

$Y$ is the vector of 0-1’s for disease status, with 1 being the diseased and 0 being the normal.
$X$ is the design matrix for fixed effect individual-level covariates with effects $π$ .
$u_{g}$ is the vector of genetic random effects, whose correlation is determined by genetic relatedness. Individuals in different families are assumed to be genetically unrelated.
$u_{par}$ and $u_{child}$ are the vectors of random effects for the shared parental and children-level family environment. Even though they live in the same household, parents and children may share distinct environment, including diet patterns, exercise levels, hours of sleep, and exposures at work or school. $u_{par}$ and $u_{child}$ each capture the distinct environmental exposures that are shared between parents and between children.
$Z_{par}$ and $Z_{child}$ are the design matrices that link each individual to their corresponding within-family parent- or child-shared environment random effect.
$u_{s}$ is the random effect for the community-level shared environment. Families from the same location share the same random effect. We assume that the random effects from neighboring locations follow an independent normal distribution (IND), or conditional autoregressive (CAR), or simultaneous autoregressive (SAR) distribution. We chose these model specifications as they cover a wide range of scenarios and are computationally feasible for large-scale datasets.
$Z_{s}$ is the design matrix linking each individual to his/her corresponding spatial random effects.

We also extended the model in a two-stage regression framework (SMILE-2) to assess the causal effects of air pollution on the phenome, which is conceptualized in Supplementary Fig. 4. We use wind speed and direction as instrumental variables²⁴. Wind speed and direction have previously been used as instrumental variables for various pollution exposures^25–28. They are unlikely to have any direct causal effect on a disease phenotype, but are strong predictors of local pollution levels, as the pollution level in a given location is a mixture of locally produced and transported air pollution by the wind from its original source^28,29.

It is easy to verify that the wind speed and direction satisfy the three primary assumptions of instrumental variables in two-stage regression analyses²⁴:

Instrument relevance: the instrument is correlated with the pollutant $P$ , i.e., $cov (Z, P) \neq 0$ , where $Z$ collectively denotes for wind speed and direction.
Instrument exogeneity: The instrument (averaged wind-direction) is uncorrelated with other confounders (measured or unmeasured) in the second stage model.
The averaged wind instruments $Z$ have no direct effect on the disease phenotype $Y$ .

In the first stage regression, we regress the pollution levels over the instrumental variables. In the second stage model, the SMILE-2 model tests for the causal effect of pollution ( $β$ ) using the predicted pollution level ( $\tilde{P}$ ) from the first stage model as input, i.e.,

Y = \tilde{P} β + X π_{2} + u_{g} + Z_{par} u_{par} + Z_{child} u_{child} + ϵ

More details on SMILE and SMILE-2 models can also be found in Methods and the Supplementary Methods.

SMILE model yields more accurate heritability and environmental variance component estimates

We conducted extensive simulations to assess the accuracy of the variance component estimates when models are correctly- or mis-specified (Methods). To make sure that our simulation reproduces realistic spatial distributions of family locations, disease prevalence, risk factors, and confounders, we sampled nuclear families with replacement from the available locations in the MarketScan dataset based on the 257,620 families used in data analysis. We varied the values of different variance components in the simulation and used CAR covariance structure for simulating spatial random effects as it best fits the data (shown in the sections below). For each combination of variance components, we simulated the underlying liability and created binary phenotypes under the liability threshold model with varying disease prevalence. We compared sub-models with different combinations of random effects in the simulation. A total of eight models were fitted in the analysis, including SMILE (GPC + S), GP + S, GC + S, GPC, PC + S, PC, G + S, and S, where we use G, P, C, and S to denote the genetic, parental, children’s, or spatial community-level variance components. We used Bayesian Information Criterion (BIC) to determine the best fitting model.

We found that BIC chose the model with correctly specified variance components in 71.1% of the replicates with 50,000 quad-families, and 81.8% of replicates with 250,000 quad-families (Fig. 1A). The estimates became more precise as sample size and prevalence increase (Supplementary Data 1). We found that in the presence of community-level spatial effects, the models that do not account for community-level effects produce upwardly biased heritability estimates, as family members who live in the same location have additional phenotypic correlation which may be falsely attributed to genetics (Fig. 1B, Supplementary Data 1). Our results show that the extent of upward bias in heritability increases with the size of the spatial variance components. Importantly, the full SMILE model yielded minimal or near-minimal bias and mean squared error (MSE) for all variance component estimates regardless of the underlying model (Fig. 1C, Supplementary Data 1). For instance, if the true model does not contain parental shared family environment, SMILE still gave unbiased estimates for all variance components. For this reason, we used the full model with GPC + S variance components in our analysis of MarketScan data, as it eases the computational burden of estimating multiple models without compromising the estimation accuracy. We also verified that our chosen parameters for simulations reflect the parameters estimated from the MarketScan and that our simulation setting is realistic (Supplementary Data 2). We lastly conducted simulations with family relationship errors (e.g., when stepchildren and adopted children are coded as biological children for both parents) to assess the robustness of heritability estimates (see Supplementary Methods for more details). Overall, we observed that the heritability estimates from SMILE models are robust. With noisy familial relationships, the confidence intervals of the estimated variance components still overlap the true values and the bias is small (Supplementary Fig. 5 and Supplementary Data 3).

Fig. 1 — We conducted comprehensive simulations to evaluate the SMILE model for estimating heritability and environmental variance components under different model specifications. We simulated data under models with different combinations of genetic (G), parental (P), children (C) and spatial community-level (S) environment variance components. When analyzing simulated data, Bayesian information criterion (BIC) was used to determine the best model. In a, we display the fraction of replicates where each model was selected, as reflected by the size of the data point. In b, we compare GPC and GPC + S models as the amount of true community-level spatial variance increases. Each dot represents the difference in estimated genetic heritability and true simulated heritability across 5 simulation replicates under 60 different scenarios. In c we showed the accuracy for the heritability estimates from SMILE under different true models with varying numbers of variance components. Each dot represents the estimated genetic heritability across 5 simulation replicates under 60 different scenarios. In b and c, minima and maxima values (excluding outliers) are represented by the lower- and upper-bound of the whiskers. Median value is represented by the bold line in the middle. First and third quartiles are represented by the lower- and upper-bound of the box. All panels represent simulations using 250,000 nuclear families.

SMILE-2 model more powerfully identifies pollution causal effects

Similar to variance component simulation, we sampled with replacement families and their covariates from the MarketScan dataset, so that all confounders and the covariates of interest (i.e., pollution levels) maintain their realistic correlations. To assess the type-I error and power for causal inference, we simulated the pollution effects of either NO₂ or PM_2.5, varying the relative risk (RR). Based on estimated variance components and assumed causal effects, we simulated disease liabilities, which were dichotomized to get binary disease status. For each PheWAS code, we simulated six replicates and ensured that each PheWAS code-based phenotype was selected under both the null (RR = 1) and alternative (RR > 1) hypotheses at least once.

To illustrate the power for SMILE-2, we also compared the results of SMILE-2 with a standard two-stage regression model of independent individuals with fixed effects only (IND-FE).

Both IND-FE and SMILE-2 Models produced calibrated Type-I error rates under the null hypothesis for each significance threshold (Fig. 2A). The SMILE-2 model was substantially more powerful than IND-FE models, particularly when disease prevalence is low, or when the disease prevalence is moderate and the causal effect (measured by RR) is large (Fig. 2A), as SMILE-2 incorporates additional related samples. MSEs of the estimated log-odds ratios are also lower for SMILE-2 compared to IND-FE (Fig. 2B). These results indicate that modeling genetically related individuals in insurance claim data using the SMILE-2 model enlarges sample size and improves the accuracy of causal effect estimates and the power of causal inference. We also conducted simulations to assess how the causal effects of pollution may be affected if the pollution and wind measurements are noisy (see Supplementary Methods for more details). We observed that causal effect estimates by SMILE-2 remain unbiased when noisy pollution and wind measurements were used (Supplementary Fig. 6) and the power is only minimally reduced.

Estimation and robustness of genetic heritability and spatial variance components for 1083 traits

We used the SMILE model to analyze 1083 binary diseases as defined by the PheWAS code (Supplementary Data 2). Comparing the GPC and SMILE models, we found that BIC values were generally smaller for the full SMILE model (with spatial random effects) in 1021/1,083(94.3%) of phenotypes, which suggests that modeling community-level environment improves model fitting. We compared estimated genetic heritability from the best models chosen for each phenotype with or without spatial random effects (Fig. 3A). When the spatial variance component was added to the model, the estimated heritability generally decreases, with the median decrease being 0.03 and inter-quartile range (IQR) being (0.018, 0.051). This verifies that many complex traits are influenced by the shared community-level environment, and that the failure to model the shared community-level environment could lead to inflated heritability estimates.

Fig. 3 — In a, we labeled the top 30 phenotypes with largest likelihood ratio increase after the addition of a community-location level spatial variance component. As expected from the theory, heritability estimates decrease on average by 4.3% after accounting for shared community-level environment. The models with the best BIC values are labeled and the size of the data points represents the magnitude of community-level environmental variance. In b and c, we compared variance estimates under independent normal distribution (IND), conditional autoregressive (CAR), and simultaneous autoregressive (SAR) covariance structures. We showed that regardless of the spatial covariance matrix used, (b) the spatial variance explained is very similar and (c) the estimated heritability is nearly identical.

Among the 1,021 traits for which the full SMILE model was chosen as the best model, a CAR covariance structure was selected for 783 (76.7%) traits, compared to 203 (19.9%) for SAR and 35 (3.4%) for uncorrelated covariance structures. We compared the heritability and spatial variance component estimates for SAR and IND against the CAR models for each phenotype (Fig. 3B, C). Interestingly, the estimated heritability was virtually identical regardless of the correlation structure of the spatial random effect (mean absolute difference compared to CAR = 0.002, with standard deviation 0.003). Differences in the spatial variance component estimates were also small (mean absolute difference compared to CAR = 0.002 with standard deviation 0.002). This is an indication that the estimated heritability and community-level environmental effects are robust to mis-specified correlation structure of spatial effects, similar to what was observed for heritability estimates in standard linear mixed models³⁰.

We also investigated whether the length of enrollment of study participants influences our phenotype definitions and the estimates of variance components in the SMILE model (see Supplementary Methods for more details). In brief, we compare the variance components estimates using families enrolled 6-7 years (149,710 families) and using families enrolled for 10–12 years (39,247 families) for all 1083 phenotypes. Overall, we observed a strong correlation for all variance components (Supplementary Fig. 7 and Supplementary Data 4) suggesting reduced length of enrollment has minimal impact on variance component estimates in the MarketScan data.

Landscape of heritability and community-level environment

We further stratified the diseases into 16 categories based upon their biological functions as designated by the PheWAS code mapping¹³ (Fig. 4). We examined the distribution of heritability and explainable community-level spatial variation for traits within each category.

We highlighted the diseases with the largest genetic and community-level environment variance components in each category (Fig. 4A, B). Hematopoietic traits and congenital anomalies had the highest average heritability compared to other trait categories, which are concordant with other genetic studies on these traits³¹. Traits with the highest community location-level spatial variance components included diseases related to parasitic infections (e.g., Lyme disease), and allergic reactions (e.g., contact dermatitis due to plants and dermatitis due to solar radiation). For some of these diseases, the connections to spatial environment and community are clear. For example, the areas with a high incidence of Lyme disease occur primarily in the upper Midwest and northeastern regions of the United States that are more rural³². Allergic reactions may be triggered by air pollution or pollen, both of which are spatially correlated and can be captured by spatial random effects³³.

We compared our heritability estimates for diseases defined by PheWAS codes to the heritability estimates from several previously published studies (Table 2):

An independent study using MarketScan database MS1¹⁴, but analyzed without modeling the shared community-level environment;
A study that repurposed EHR data from the New York State NY¹⁵;
A study CaTCH³ that analyzed twins from EHR data to estimate genetic and environmental contributions, and
Heritability estimates based upon GWAS summary statistics from UK Biobank (LDSC-UKB³⁴).

Table 2.

Correlations of SMILE results with heritability estimates from published studies

Cohort	Scenario	N	H² correlation between SMILE and published study (P-value)		Percentage of traits with overlapping conference Interval		Mean squared differences between SMILE and published estimates of H²		Median difference in H² (SMILE – published estimates)
Cohort	Scenario	N	GPC	SMILE	GPC	SMILE	GPC	SMILE	GPC	SMILE
CaTCH³	All	540	0.11 (0.014)	0.12 (0.0040)	0.43	0.42	0.060	0.053	0.002	−0.031
CaTCH³	K > 1%	405	0.19 (8.7 × 10⁻⁵)	0.21(2.3 × 10⁻⁵)	0.42	0.41	0.049	0.044	−0.013	−0.045
LDSC-UKB^34,79**	All	68	0.18 (0.14)	0.16 (0.20)	0.75	0.79	0.11	0.099	0.189	0.169
LDSC-UKB^34,79**	K > 1%	44	0.33 (0.029)	0.30 (0.050)	0.66	0.73	0.12	0.098	0.205	0.169
MS¹⁴	All	63	0.79 (1.7 × 10⁻¹⁴)	0.83 (2.7 × 10⁻¹⁷)	0.44	0.26	0.016	0.020	−0.062	−0.101
MS¹⁴	K > 1%	52	0.76 (5.77 × 10⁻¹¹)	0.79 (4.7 × 10⁻¹²)	0.41	0.2	0.017	0.021	−0.064	−0.106
NY¹⁵	All	33	0.57 (0.00051)	0.52 (0.0015)	0.7	0.65	0.029	0.032	−0.009	−0.049
NY¹⁵	K > 1%	31	0.48 (0.0069)	0.48 (0.0069)	0.68	0.68	0.028	0.030	−0.009	−0.053

Open in a new tab

( $K > 0.01$ indicates traits with an observed prevalence greater than 1%.) All p-values are for two-sided hypothesis tests.

^**The NY and LDSC-UKB studies are based on ICD-9/10 codes. It is difficult to directly compare ICD code defined and PheWAS code defined phenotypes. For this reason, we restricted our comparisons to ICD-9/10 codes which could be mapped to a unique PheWAS code.

The MS1, NY, and CaTCH studies are family-based and estimate narrow-sense heritability while LDSC-UKB estimates chip heritability. We found the SMILE estimates are significantly correlated with published studies, but generally yielded smaller estimates of heritability than the GPC model and the other family-based studies, i.e., NY, MS1, and CaTCH. This is consistent with our simulation results, indicating the shared community-level environmental risk, when left unaccounted for, could add to the upward bias in the heritability estimates from family-based studies. More details on the comparison can be found in the Supplementary Methods.

For diseases with strong environmental contributions, our SMILE model likely offers much refined estimates of heritability. For example, our heritability estimates for type 2 diabetes (T2D) decreased from 37.7% to 28.4% after accounting for spatial community effects. Even after redefining T2D cases using both ICD diagnostic codes and T2D medication codes³⁵ (see Supplementary Methods for more details), the heritability estimate only increased to 31% (Supplementary Data 5-6). Both estimates are lower than a majority of previous estimates from family studies of T2D, i.e., several previous studies have produced heritability estimates in the range of (0.26–0.69)^36–40. On the other hand, the result is more concordant with a recent large-scale study analyzing UK Biobank participants³⁴ which consists of primarily unrelated individuals. It obtained heritability estimates ranging from 19.6% to 33.2% depending on model specification using whole-genome data and whether rare or low-frequency SNPs were included for estimation⁴¹. Along similar lines, our heritability estimate for obesity decreased from 53.1% to 46.3% when adjusted for spatial community effects, while classical twin studies reported estimates as high as 70%⁴². The spatial community-level random effects for T2D showed a strong correlation with those for obesity $(\hat{ρ} = 0.67)$ , and a number of other lipid metabolism-related traits (i.e., Hyperlipidemia $\hat{ρ} = 0.75$ , Hypercholesterolemia $\hat{ρ} = 0.61$ ). Obesity is well known to be the number one leading risk factor for T2D⁴³. The correlations in community-level environment effects underscore the well-known shared etiology of T2D and obesity attributable to environmental factors^44,45, and increase our confidence in the validity of the effects captured by the community-level spatial variance component.

Extensive correlation between spatial random effects and environmental risk factors

Spatial random effects can capture a wide range of environmental risk factors, including many that are not often measured or controlled for in genetic studies. To gain a better understanding of what underlies our estimates of community-level risk, we integrated potential community-level environmental risk factors (CLERF) from external data sources into the MarketScan dataset according to the county or MSA locations. The additional CLERF variables include averaged minimum and maximum monthly temperature and precipitation levels, averaged PM_2.5 and NO₂ air pollution, as well as sociodemographic variables of median income, population density, poverty rates, education levels, and racial distributions at the county or MSA level from the 2015 ACS community survey¹⁸ (Supplementary Data 7-8). We calculated the total community-level environmental contribution for each disease at each MarketScan location using the best linear unbiased predictor (BLUP) of the spatial random effect from the SMILE model. As the external CLERF variables are not included as covariates in the SMILE model, we regressed the BLUPs over these risk factors to assess their impact.

We calculated the correlation between CLERF and BLUPs for 115 diseases with a prevalence of at least 2% and with an estimated spatial variance of at least 2% (Fig. 4C). For a majority of diseases, we observed that increased disease risk is correlated with indicators of lower socioeconomic status (SES), such as lower median income, the percentage of individuals with high school as the highest education, and poverty rate. Examples of lower SES-associated diseases included obesity, diabetes, chronic liver disease, chronic obstructive pulmonary disease (COPD), influenza, and fever. Interestingly, several traits were observed to be associated with higher SES, including benign neoplasms of the skin, hemorrhoids, and adjustment reaction (a more severe reaction than expected following a stressful event). We speculate that these findings may be attributable to disparities in education and access to healthcare for lower SES groups. For example, previous research has found that low SES is associated with more advanced melanoma at diagnosis, and that individuals with lower SES were less likely to be concerned about melanoma risks, or seeking screening and treatment by their physicians^46,47, explaining why higher SES groups would be more likely to have higher reported neoplasm incidences. Multiple prospective studies have noted that low SES is associated with poor mental health outcomes following stressful events^48,49. However, just as other observation studies, our sample is an observational scan of EHR databases, the inverse relationship we observe in our study between SES and adjustment reaction may be due to ascertainment. Similar explanations may underlie the association between higher SES and hemorrhoids, which has been noted in previous studies⁵⁰.

Estimating causal effects of air pollution across 1083 phenotypes

We used SMILE-2 to assess the causal effects of PM_2.5 and NO₂ air pollution on 1083 PheWAS disease codes with an observed prevalence of at least 0.1%.

The distribution for satellite-inferred estimates of PM_2.5 and NO₂ air pollution at the centroids of each MarketScan location were shown in Supplementary Figs. 1–3 (Methods), which was based upon averages across all years. The long-term averaged wind speed and direction information at each location were shown in Supplementary Fig. 8, which were used as instrumental variables for pollution to reduce the correlation between pollution levels and any unobserved confounding effects (Methods).

For each disease, we first applied SMILE-2 to analyze PM_2.5 and NO₂ air pollutants, and then also analyzed a constructed total pollutant level based upon the sum of the standardized PM_2.5 and NO₂ levels $({\tilde{P}}_{S U M})$ . After Bonferroni correction, we found 135/1083 (12.5%) of the phenotypes to have a significant association with either PM_2.5 (Fig. 5A), NO₂ (Fig. 5B), or ${\tilde{P}}_{S U M}$ (Fig. 5C) air pollution. The estimated causal effect is positive for 105 of 135 significant traits (77.8%), indicating elevated pollution levels increase disease risk.

Fig. 5 — We plotted the estimated odds ratios against −log₁₀(p-values) for the causal effects of (a) PM_2.5, (b) NO₂, and (c) the sum of the two pollutants ( ${\tilde{P}}_{S U M}$ ). All p-values are for two-sided hypothesis tests and are unadjusted for multiple comparisons. Dashed horizontal line represents Bonferroni significance threshold for testing 1083 diseases and three pollutants.

Among the significant causal effects, we found that the two pollutants affect different classes of diseases. For example, diseases significantly associated with PM_2.5 but not with NO₂ included multiple sleep disorders (hypersomnia ( ${\hat{O R}}_{P M_{2.5}} = 1.13, P = 5.4 \times 10^{- 17}$ ), obstructive sleep apnea ( ${\hat{O R}}_{P M_{2.5}} = 1.04, P = 1 . \times 10^{- 8}$ ), parasomnia ( ${\hat{O R}}_{P M_{2.5}} = 1.07, P = 1.2 \times 10^{- 6}$ ), narcolepsy ( ${\hat{O R}}_{P M_{2.5}} = 1.13, P = 5.3 \times 10^{- 9}$ )), respiratory infections⁵¹ (acute sinusitis ( ${\hat{O R}}_{P M_{2.5}} = 1.07, P = 3.8 \times 10^{- 24}$ ), acute bronchitis and bronchiolitis ( ${\hat{O R}}_{P M_{2.5}} = 1.05, P = 1.1 \times 10^{- 18}$ )), ear infections⁵² (otitis media ( ${\hat{O R}}_{P M_{2.5}} = 1.04, P = 1.2 \times 10^{- 23}$ ), and attention deficit hyperactivity disorder (ADHD)⁵³ ( ${\hat{O R}}_{P M_{2.5}} = 1.04, P = 1.4 \times 10^{- 7}$ ), (Fig. 5A).

At the same time, diseases associated with NO₂ pollution but not with PM_2.5 highlighted distinct symptomatology including multiple gastro-intestine-related disorders⁵⁴ (Gastritis ( ${\hat{O R}}_{N O_{2}} = 1.1, P = 4.4 \times 10^{- 7}$ ), IBS ( ${\hat{O R}}_{N O_{2}} = 1.1, P = 8.1 \times 10^{- 9}$ )), as well as both type I and type 2 diabetes⁵⁵ ( ${\hat{O R}}_{N O_{2}} = 1.15, 1.17$ $and$ $P = 2.5 \times 10^{- 6}, 2.3 \times 10^{- 6}$ respectively) (Fig. 5B). Additionally, we found several lipid metabolism associated diseases are causally linked with NO₂ (e.g., hyperlipidemia ${\hat{O R}}_{{N O}_{2}} = 1.09, P = 1.3 \times 10^{- 7}$ and hypercholesterolemia ${\hat{O R}}_{{N O}_{2}} = 1.10, P = 7.5 \times 10^{- 6}$ ). This is concordant with discoveries from several previous studies in the Chinese populations^56,57, with other research indicating that NO₂ may play a role in the regulation of lipid metabolism and may promote the formation of fatty plaque in arteries^58–60. Compared to NO₂, PM_2.5 may have a more damaging direct effect on lung function, and it has been shown that PM_2.5 can cause inflammation and a weakened immune-system defense, leaving the respiratory system prone to infection⁶¹.

We also rediscovered associations with diseases of low prevalence, which have primarily only been studied in relation to air pollution for specific subpopulations. For example, previous studies showed that cystic fibrosis patients exposed to high levels of PM_2.5 air pollution are more likely to develop methicillin-resistant Staphylococcus aureus (MRSA), which is an antibiotic-resistant infection⁶². In our study, we recapitulate the causal relationship between MRSA and PM_2.5 in the general population as well $({\hat{O R}}_{P M 2.5} = 1.05, P = 5.1 \times 10^{- 6})$ . The causal effect of PM_2.5 on MRSA in the general population underscores the important link between air pollution and infectious diseases from a public health perspective.

Discussion

In this article, we develop the SMILE model to jointly quantify the contributions of genetics and correlated community level shared environment on disease phenotype variation. We applied the method to analyze insurance claim data using the MarketScan dataset with more than 50 million individuals. We refined the estimates of genetic heritability and community-level environmental variance components. We also quantified the causal effects of air pollutants PM_2.5 and NO₂ for 1,083 diseases.

The refined heritability estimates by SMILE may help reconcile the discrepancy between heritability estimates from family studies and GWAS using unrelated individuals⁶³, as it helps correct for the upward bias induced by the correlated community-level environment in family-based variance components models. SMILE does not need genotype data as input, making it uniquely suitable for analyzing insurance claim data without genetic information. It also differs from genome wide interaction studies (GWIS), which uses explicitly measured environmental variables and genetic information to identify genetic variants interacting with environment. As it is virtually impossible to measure all environmental risk factors to the trait variation, GWIS may share the same limitations of GWAS where unmeasured environmental risk factors may confound heritability estimates. GWIS is also not applicable to insurance claim data as it needs genotype information. In contrast, SMILE captures the contributions of spatially correlated environmental risk factors to the trait variation without having to explicitly measure each environmental risk factor individually. Thus, SMILE complements GWIS and is essential for deriving more accurate variance components estimates in the presence of unmeasured environmental exposures.

Our comprehensive catalog of heritability estimates derived from EHR-based phenotypes offers a unique reference to quantify environmental contribution and assess the “missing heritability” for complex diseases. To ensure the validity of our results, we have conducted comprehensive robustness analyses and simulations that suggest our results are robust against different phenotype definitions, mis-specified pedigrees, and measurement errors in wind/pollution levels. These robustness analyses ensures the usefulness of SMILE and its extensions to insurance claim data and national EHR-based biobanks, such as UK Biobank⁶⁴ and All of Us⁶⁵.

Another contribution of our work is that we showed that different pollutants in the air may have distinct causal effects on different diseases. This contrasts many epidemiological studies, where different sources of ambient air pollution (i.e., PM_2.5, NO₂, ground-level ozone, and carbon monoxide) are combined into an aggregate measure of air quality, not distinguishing the specific mechanisms by which individual pollutants impact disease. These results would be useful for generating hypotheses for follow-up analyses.

There are several aspects of our analyses that warrant discussion. First, MarketScan database only includes individuals with employer sponsored insurance policies. As such, low-income families may not be well-represented¹⁴. While the results are valid for the population the dataset represents, it is important to exercise caution when extrapolating our findings to different populations.

Second, as an insurance claim database, MarketScan data only includes medical information during a limited period of time¹⁴. For example, the data on children is only up to the age of 26, which is the maximum age a child can be covered by their parent’s health insurance in the US. Thus, the prevalence for late-onset conditions may be lower in children when compared to parents. We account for this by (1) limiting our analysis to families enrolled in the database for at least 6 years, (2) limiting our analysis to families where all children are at least 10 years old at the time of entry into the database, and (3) excluding families where the age at enrollment of the youngest family member was less than the 5th percentile of the age of diagnosis for the phenotype of interest and (4) including age and age² as covariates in both SMILE and SMILE-2 to account for the impact of age. Despite the limitation of the datasets, our heritability estimates are comparable to estimates obtained from other data types (Table 2), which demonstrate the effectiveness of our filtering criteria and the validity of our results.

Third, EHR-derived phenotypes may not be completely accurate. For example, substance abuse disorder cases may be under-represented if a substantial proportion of those affected do not seek medical treatment. For some diseases, the presence of a medical diagnostic code may highlight the differences in healthcare-seeking behavior rather than a true representation of disease prevalence. Poor coding practices for various traits may have negative net impacts on public health research, and some research has provided evidence that EHR documentation is heterogeneous across medical providers, practices, and physicians, including the documentation of diagnostic codes⁶⁶. In this regard, spatial random effects may be viewed as a potentially effective way of controlling for these biases, similar to the use of a linear mixed model in GWAS to account for unexplained population structure⁶⁷.

Fourth, causal inference results should be carefully interpreted. The Bradford-Hill criterion⁶⁸ is a standard benchmark for assessing causality in epidemiological studies, and should be considered for all causal inferences. Temporality is one assumption, which states that the pollution exposure precedes the disease. Here, we used a single long-term average of PM_2.5 and NO₂ air pollution at MarketScan participant locations as the measurement of pollution exposure⁶⁹. Therefore, the analysis implicitly assumes that the average level of pollution is representative of the pollution exposure for individuals at each location. While this assumption is valid during the period where our data was collected, environmental factors may experience transient changes. For example, during the COVID-19 pandemic, NO₂ levels were significantly reduced but PM_2.5 levels remained similar to other times. This is because NO₂ is emitted during fuel combustion of all motor vehicles and airplanes, but PM_2.5 is mainly emitted by diesel commercial vehicles and remained largely unchanged by quarantine restrictions⁷⁰. Understanding how the transient changes in individual pollutants impact diseases in future works may lead to more effective and better-informed environmental policies and air quality regulations.

When causal effects are observed, it is also helpful to further assess potential explanations and biological feasibility. We observed several significant associations where air pollution was negatively associated with diseases that potentially originate from sexually transmitted infections (painful urination, viral hepatitis) (Fig. 5). This correlation is consistent with increased sexually transmitted infections across rural areas in the U.S. compared to more heavily polluted regions, which may stem more directly from the lack of access to public health resources and social conservatism⁷¹.

We also envision several exciting areas for future research which exploit the pedigree structures, deep phenotyping, and massive sample sizes of EHR datasets. For one, computational constraints often require dichotomizing diseases into binary traits indicating whether the trait was “ever” or “never” observed for a patient, yet EHR records are inherently longitudinal in nature. Modeling strategies that can account for time-to-event outcomes and/or recurrent events (such as common colds, broken bones, or infections) may yield greater insights into the etiology of certain diseases. Similarly, modeling air pollution changes over time could also yield additional findings. Incorporating spatial random effects and modeling correlated community-level environment could also lead to applications outside the scope of this paper, e.g., improving the power of genetic association tests in national biobanks⁷².

Together, our methods for modeling spatially dependent community-level environmental risk open new venues to analyze national biobanks and explore the genetic architecture of complex traits. Our improved estimates of heritability, environmental contribution, and causal effects for air pollution across the phenome offer a valuable foundation upon which future studies may be built.

Methods

Here we describe the SMILE model for quantifying the genetic and community-level environment contribution. The extension of the SMILE-2 model for assessing the causal effect for air pollution, the description of the datasets used, and the simulation analyses are left to the Supplementary Methods.

SMILE model of genetic heritability, family environment, and community location random effects

We developed the SMILE model to jointly characterize the genetic, family-level, and community-level environmental variance components. We also refer to the model as GPC + S according to the variance components included. We included age, the number of months enrolled in the dataset, and the indicator variables for sex, and the first year of enrollment as individual-level fixed-effect covariates. A total of $N_{F}$ nuclear families (with $N$ individuals) were used in the analysis. The full SMILE model (i.e., GPC + S) may then be specified as

Y = X π + u_{g} + Z_{s} u_{s} + Z_{par} u_{par} + Z_{child} u_{child} + ϵ

where $Y = (y_{1}, \dots, y_{N})$ is the vector of case-control status. To facilitate the presentation of the method, we assume that $Y$ is arranged by families, and within each family, the phenotypes are arranged in the order of father, mother, and children. $X$ denotes the design matrix for the fixed effect individual-level covariates with effect $π$ . $u_{g}, u_{par}, u_{child}$ , and $u_{s}$ are respectively the genetic, shared parental, and children’s environmental random effects, and community-level spatial random effects. $Z_{s}, Z_{par}$ , and $Z_{child}$ denote the indicator matrices, mapping each individual to their corresponding random effects in $u_{s}, u_{par}$ , and $u_{child}$ .

More specifically, $u_{par} = (u_{p a r, 1,1} \dots, u_{p a r, N_{F}, 1}, u_{p a r, N_{F}, 2}, \dots)$ and $u_{child} = (u_{c h i l d, f, 1}, \dots, u_{c h i l d, N_{F}, 1}, u_{c h i l d, N_{F}, 2}, \dots)$ are vectors of independent and identically distributed normal random variables. In family $f$ , the parents share random effect $u_{p a r, f, 1}$ and the children have random effects $u_{p a r, f, 2}$ , $u_{p a r, f, 3}$ , etc. Similarly, in family $f$ , children share random effects $u_{c h i l d, f, 3}$ , while parents have random effects $u_{c h i l d, f, 1}$ and $u_{c h i l d, f, 2}$ , as children and parents may have different environmental exposures.

Within each family, the correlation between genetic random effects is determined by kinship matrix $G$ . In the example of a quad family (nuclear family with 2 children), the kinship matrix is given by:

G = (\begin{matrix} 1 & 0 & 0.5 & 0.5 \\ 0 & 1 & 0.5 & 0.5 \\ 0.5 & 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 0.5 & 1 \end{matrix})

where the first two rows represent parents and the last two rows represent children in the family. Each entry in the matrix represents the genetic kinship between corresponding individuals in the family.

The genetic random effects across all families satisfy:

u_{g} ~ N (0, σ_{g}^{2} blk_diag (G_{1}, \dots, G_{N_{F}}))

where $blk_diag$ represents block diagonal matrices.

The community-level spatial random effect $u_{s} = (u_{s, 1}, \dots, u_{s, L})$ is a vector of length $L$ , with individuals located in location $l$ having random effect $u_{s, l}$ and $L$ being the number of unique MarketScan county or MSA locations). To model the spatial dependence between families, we considered conditional autoregressive (CAR), simultaneous autoregressive (SAR), and independent (IND) covariance matrices for community-level spatial random effects, as they cover a range of scenarios, and it is computationally feasible to apply them to large datasets. Specifically, under CAR, SAR or IND models, the spatial random effects follow:

CAR : u_{s} ~ MVN (0, Σ_{CAR})

SAR : u_{s} ~ MVN (0, Σ_{SAR})

IND : u_{s} ~ MVN (0, σ_{s}^{2} I)

To describe the covariance matrix of the spatial random effects (i.e., $Σ_{CAR}$ and $Σ_{SAR}$ ), we first define the weight matrix $W$ as a $L \times L$ symmetric matrix. $W$ has diagonal entries of 0 and off-diagonal entries of 1 for pairs of locations that share a common border. MSA’s were considered as ‘sharing borders’ with the counties they encompass. Adjacencies between counties were identified using the R package spdep⁷³. $W_{+}$ is obtained from $W$ by standardizing its rows, so that the entries from each row add up to 1 (according to the definition of $W$ , the normalizing factor $\sum_{j} W_{i j}$ equals the number of ‘neighbors’ to location $i$ ). $M$ is a diagonal matrix with diagonal elements $M_{i i} = {(\sum_{j} W_{i j})}^{- 1}$ .

Using the defined weight matrices, the covariance matrices for CAR and SAR models are specified as

Σ_{CAR} = σ_{s}^{2} {(M^{- \frac{1}{2}} (I - ρ M^{- \frac{1}{2}} W_{+} M^{\frac{1}{2}}) M^{- \frac{1}{2}})}^{- 1}

Σ_{SAR} = σ_{s}^{2} {((I - ρ M^{- \frac{1}{2}} W_{+} M^{\frac{1}{2}}) M^{- 1} (I - ρ M^{\frac{1}{2}} W_{+}^{T} M^{- \frac{1}{2}}))}^{- 1}

The covariance matrices for CAR and SAR models have unequal diagonal elements. For a given estimated parameter $σ_{s}^{2}$ , the spatial random effects explain different amounts of phenotypic variance for individuals at different locations. In order to better quantify the phenotypic variance explained by spatial random effects, we calculate a Gower factor^74–76

{\tilde{σ}}_{s}^{2} = \frac{t r [Z_{s} (I - {\frac{1}{N} 11}^{T}) Σ_{S} Z_{s}^{T}]}{N - 1} 8

The Gower factor can be considered as the averaged variance of spatial random effects across individuals.

For all traits, we report ${\tilde{σ}}_{s}^{2}$ as the phenotypic variance contributed by spatially-correlated community-level environment.

Conversion of variance components from observed scale to liability scale

All disease outcomes are binary. We use linear regression models to estimate variance components on the observed scale. This has been a widely used approach in human genetics and is computationally efficient compared to generalized linear mixed models for large datasets⁷⁷. To facilitate the comparison of estimates across diseases with different prevalence, we will convert them to liability scale. The details for the conversion are given in the Supplementary Methods. As we demonstrate in Results and Fig. 1, the conversion yields unbiased results across different scenarios.

Fitting the model with laplace approximation

We make extensive use of the R package TMB⁷⁸ to estimate model parameters, which relies on Automatic Differentiation software to calculate the gradients of the objective function obtained by Laplace approximation. More details can be found in the Supplementary Methods.

Two stage regression model for causal inference of PM_2.5 and NO₂ air pollution

We extend SMILE model into a two-stage regression model SMILE-2 for assessing the causality of air pollution levels using wind speed and direction as instrument variables. Details are shown in the Supplementary Methods.

Ethical approval

This study is deemed non-human subject research and approved by Penn State College of Medicine IRB.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Supplementary Information^{(5.3MB, pdf)}

Peer Review File^{(352.9KB, pdf)}

41467_2024_49566_MOESM3_ESM.pdf^{(41.4KB, pdf)}

Description of Additional Supplementary Files

Supplementary Data 1-8^{(2.5MB, xlsx)}

Reporting Summary^{(2.3MB, pdf)}

Acknowledgements

Some of the materials employed in this work were provided by the Center for Applied Studies in Health Economics (CASHE) at Penn State University College of Medicine. H.M. is funded by Computation, Bioinformatics, and Statistics (CBIOS) NIH-sponsored Research Training Grant (5T32GM102057-10) and NIH F30 Ruth L. Kirschstein National Research Service Award Individual Predoctoral MD/PhD Fellowship Award by the National Institute of General Medical Sciences (F30GM151848). D.J.L. is also supported by NIH grants R01ES036042, R01HG011035, and R01AI174108 and by the Artificial Intelligence and Biomedical Informatics pilot funding program from the Penn State College of Medicine.

Author contributions

D.M., D.J.L., and B.J. conceived the study. D.M. and H.M. led the data analysis. D.M., H.M., L.Y., J.X., and A.M. conducted analyses. A.B., Q.L., L.C., D.J.L., and B.J. helped with data interpretation. D.M., H.M., D.J.L., and B.J. prepared the manuscript. All authors contributed to manuscript editing and approved the manuscript.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Data availability

We provide all the data that support the findings of this study in this published article (and its supplementary information files). The raw data from Truven MarketScan are available for licensed users. A user license could be obtained by following the instructions at https://marketscan.truvenhealth.com/marketscanportal/. Multiple external datasets of community-level risk factors were incorporated in this study. This included demographic data from the 2015 American Community Survey 5-year estimates¹⁸, satellite-derived measurements of air pollution including PM_2.5^19,20 and NO₂^21,22, and wind direction and wind speed data²³.

Code availability

The software package implementing the SMILE and SMILE-2 model is available at https://github.com/dan11mcguire/smile and the linked Zenodo repository (10.5281/zenodo.11081928).

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Daniel McGuire, Havell Markus.

Contributor Information

Dajiang J. Liu, Email: dajiang.liu@psu.edu

Bibo Jiang, Email: bjiang@phs.psu.edu.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-024-49566-6.

References

1.Caspi A, Moffitt TE. Gene-environment interactions in psychiatry: joining forces with neuroscience. Nat. Rev. Neurosci. 2006;7:583–590. doi: 10.1038/nrn1925. [DOI] [PubMed] [Google Scholar]
2.Falconer DS. The inheritance of liability to diseases with variable age of onset, with particular reference to diabetes mellitus. Ann. Hum. Genet. 1967;31:1–20. doi: 10.1111/j.1469-1809.1967.tb02015.x. [DOI] [PubMed] [Google Scholar]
3.Lakhani CM, et al. Repurposing large health insurance claims data to estimate genetic and environmental contributions in 560 phenotypes. Nat. Genet. 2019;51:327–334. doi: 10.1038/s41588-018-0313-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Tenesa A, Haley CS. The heritability of human disease: estimation, uses and abuses. Nat. Rev. Genet. 2013;14:139–149. doi: 10.1038/nrg3377. [DOI] [PubMed] [Google Scholar]
5.Braveman P, Gottlieb L. The social determinants of health: it’s time to consider the causes of the causes. Pub. Health Rep. 2014;129:19–31. doi: 10.1177/00333549141291S206. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Kivimäki M, et al. Association between socioeconomic status and the development of mental and physical health conditions in adulthood: a multi-cohort study. Lancet Pub. Health. 2020;5:e140–e149. doi: 10.1016/S2468-2667(19)30248-8. [DOI] [PubMed] [Google Scholar]
7.Yang J, et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Evans LM, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet. 2018;50:737–745. doi: 10.1038/s41588-018-0108-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Abdellaoui, A., Verweij, K. J. H., and Nivard, M. G. Geographic Confounding in Genome-Wide Association Studies. bioRxiv, 2021: 2021.03.18.435971.
10.Khan A, et al. Environmental pollution is associated with increased risk of psychiatric disorders in the US and Denmark. PLOS Biol. 2019;17:e3000353. doi: 10.1371/journal.pbio.3000353. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Kim E, et al. The Evolving Use of Electronic Health Records (EHR) for Research. Semin Radiat. Oncol. 2019;29:354–361. doi: 10.1016/j.semradonc.2019.05.010. [DOI] [PubMed] [Google Scholar]
12.Nordo AH, et al. Use of EHRs data for clinical research: Historical progress and current applications. Learn Health Syst. 2019;3:e10076. doi: 10.1002/lrh2.10076. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Denny JC, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 2013;31:1102–1110. doi: 10.1038/nbt.2749. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Wang K, et al. Classification of common human diseases derived from shared genetic and environmental determinants. Nat. Genet. 2017;49:1319–1325. doi: 10.1038/ng.3931. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Polubriaginof FCG, et al. Disease Heritability Inferred from Familial Relationships Reported in Medical Records. Cell. 2018;173:1692–1704.e11. doi: 10.1016/j.cell.2018.04.032. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Quint, J. B. Health research data for the real world: the MarketScan databases. Ann Arbor, MI: Truven Health Analytics (2015).
17.Truven Health Analytics. Commercial Claims and Encounters Medicare Supplemental. 2016; Available from: https://theclearcenter.org/wp-content/uploads/2020/01/IBM-MarketScan-User-Guide.pdf.
18.U. S. Census Bureau, American Community Survey 5-Year Estimates, in tidycensus: Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames. R package version 0.9.9.5. 2015: https://CRAN.R-project.org/package=tidycensus.
19.van Donkelaar A, et al. Global Estimates of Fine Particulate Matter Using a Combined Geophysical-Statistical Method with Information from Satellites. Environ. Sci. Technol. 2016;50:3762. doi: 10.1021/acs.est.5b05833. [DOI] [PubMed] [Google Scholar]
20.van Donkelaar, A., et al., Global Annual PM2.5 Grids from MODIS, MISR and SeaWiFS Aerosol Optical Depth (AOD) with GWR, 1998-2016. 2018, NASA Socioeconomic Data and Applications Center (SEDAC): Palisades, NY.
21.Geddes JA, et al. Long-term Trends Worldwide in Ambient NO2 Concentrations Inferred from Satellite Observations for Exposure Assessment. Environ. Health Perspect. 2016;124:281–289. doi: 10.1289/ehp.1409567. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Geddes, J. A., et al., Global 3-Year Running Mean Ground-Level Nitrogen Dioxide (NO2) Grids from GOME, SCIAMACHY and GOME-2. 2017, NASA Socioeconomic Data and Applications Center (SEDAC): Palisades, NY.
23.National Oceanic and Atmospheric Administration, U.S. Wind Climatology U-Component, V-Component, Mean Wind Speed Monthly Datasets. National Oceanic and Atmospheric Administration: ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/.
24.Baiocchi M, Cheng J, Small DS. Instrumental variable methods for causal inference. Stat. Med. 2014;33:2297–2340. doi: 10.1002/sim.6128. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Anderson, M. L. As the Wind Blows: The Effects of Long-Term Exposure to Air Pollution on Mortality. J. Eur. Economic Association.18, 1886–1927 (2019). [DOI] [PMC free article] [PubMed]
26.Herrnstadt, E. & Muehlegger, E. Air Pollution and Criminal Activity: Evidence from Chicago Microdata. National Bureau of Economic Research Working Papers. 21787, 1–41 (2015).
27.Schlenker W, Walker WR. Airports, Air Pollution, and Contemporaneous Health. Rev. Economic Stud. 2015;83:768–809. doi: 10.1093/restud/rdv043. [DOI] [Google Scholar]
28.Deryugina T, et al. The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction. Am. Econ. Rev. 2019;109:4178–4219. doi: 10.1257/aer.20180279. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Zhang Q, et al. Transboundary health impacts of transported global air pollution and international trade. Nature. 2017;543:705–709. doi: 10.1038/nature21712. [DOI] [PubMed] [Google Scholar]
30.Jiang, J., Li, C., Paul, D., Yang, C. & Zhao, H. On high-dimensional misspecified mixed model analysis in genome-wide association study. Ann. Statist.44, 2127–60 (2016).
31.Bao EL, Cheng AN, Sankaran VG. The genetics of human hematopoiesis and its disruption in disease. EMBO Mol. Med. 2019;11:e10316. doi: 10.15252/emmm.201910316. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Kugeler KJ, et al. Geographic Distribution and Expansion of Human Lyme Disease, United States. Emerg. Infect. Dis. 2015;21:1455–1457. doi: 10.3201/eid2108.141878. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Gilles S, et al. The role of environmental factors in allergy: A critical reappraisal. Exp. Dermatol. 2018;27:1193–1200. doi: 10.1111/exd.13769. [DOI] [PubMed] [Google Scholar]
34.Sudlow C, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Upadhyaya SG, et al. Automated Diabetes Case Identification Using Electronic Health Record Data at a Tertiary Care Facility. Mayo Clin. Proc. Innov. Qual. Outcomes. 2017;1:100–110. doi: 10.1016/j.mayocpiqo.2017.04.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Almgren P, et al. Heritability and familiality of type 2 diabetes and related quantitative traits in the Botnia Study. Diabetologia. 2011;54:2811–2819. doi: 10.1007/s00125-011-2267-5. [DOI] [PubMed] [Google Scholar]
37.Kaprio J, et al. Concordance for type 1 (insulin-dependent) and type 2 (non-insulin-dependent) diabetes mellitus in a population-based cohort of twins in Finland. Diabetologia. 1992;35:1060–1067. doi: 10.1007/BF02221682. [DOI] [PubMed] [Google Scholar]
38.Poulsen P, et al. Heritability of type II (non-insulin-dependent) diabetes mellitus and abnormal glucose tolerance–a population-based twin study. Diabetologia. 1999;42:139–145. doi: 10.1007/s001250051131. [DOI] [PubMed] [Google Scholar]
39.Pilia G, et al. Heritability of cardiovascular and personality traits in 6,148 Sardinians. PLoS Genet. 2006;2:e132. doi: 10.1371/journal.pgen.0020132. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Lohmueller KE, et al. Whole-exome sequencing of 2,000 Danish individuals and the role of rare coding variants in type 2 diabetes. Am. J. Hum. Genet. 2013;93:1072–1086. doi: 10.1016/j.ajhg.2013.11.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Xue A, et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat. Commun. 2941;9:2018. doi: 10.1038/s41467-018-04951-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.O’Rahilly S, Farooqi IS. Human obesity: a heritable neurobehavioral disorder that is highly sensitive to environmental conditions. Diabetes. 2008;57:2905–2910. doi: 10.2337/db08-0210. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Barnes AS. The epidemic of obesity and diabetes: trends and treatments. Tex. Heart Inst. J. 2011;38:142–144. [PMC free article] [PubMed] [Google Scholar]
44.Agardh E, et al. Type 2 diabetes incidence and socio-economic position: a systematic review and meta-analysis. Int J. Epidemiol. 2011;40:804–818. doi: 10.1093/ije/dyr029. [DOI] [PubMed] [Google Scholar]
45.Gary-Webb TL, Suglia SF, Tehranifar P. Social epidemiology of diabetes and associated conditions. Curr. Diab. Rep. 2013;13:850–859. doi: 10.1007/s11892-013-0427-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Pollitt RA, et al. Examining the pathways linking lower socioeconomic status and advanced melanoma. Cancer. 2012;118:4004–4013. doi: 10.1002/cncr.26706. [DOI] [PubMed] [Google Scholar]
47.Wich LG, et al. Impact of socioeconomic status and sociodemographic factors on melanoma presentation among ethnic minorities. J. Community Health. 2011;36:461–468. doi: 10.1007/s10900-010-9328-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Lantz PM, House JS, Mero RP, Williams DR. Stress, life events, and socioeconomic disparities in health. J. Health Soc. Behav. 2005;46:274–288. doi: 10.1177/002214650504600305. [DOI] [PubMed] [Google Scholar]
49.Reiss F, et al. Socioeconomic status, stressful life situations and mental health problems in children and adolescents: Results of the German BELLA cohort-study. PLoS one. 2019;14:e0213700. doi: 10.1371/journal.pone.0213700. [DOI] [PMC free article] [PubMed] [Google Scholar]
50.Johanson JF, Sonnenberg A. The prevalence of hemorrhoids and chronic constipation: an epidemiologic study. Gastroenterology. 1990;98:380–386. doi: 10.1016/0016-5085(90)90828-O. [DOI] [PubMed] [Google Scholar]
51.Ciencewicki J, Jaspers I. Air pollution and respiratory viral infection. Inhal. Toxicol. 2007;19:1135–1146. doi: 10.1080/08958370701665434. [DOI] [PubMed] [Google Scholar]
52.Bowatte, G. et al. Air Pollution and Otitis Media in Children: A Systematic Review of Literature. Int. J. Environ. Res. Pub. Health.15, 257 (2018). [DOI] [PMC free article] [PubMed]
53.Donzelli, G. et al. Particulate Matter Exposure and Attention-Deficit/Hyperactivity Disorder in Children: A Systematic Review of Epidemiological Studies. Int. J. Environ. Res. Pub. Health. 17, 67 (2019). [DOI] [PMC free article] [PubMed]
54.Beamish LA, Osornio-Vargas AR, Wine E. Air pollution: An environmental factor contributing to intestinal disease. J. Crohn’s Colitis. 2011;5:279–286. doi: 10.1016/j.crohns.2011.02.017. [DOI] [PubMed] [Google Scholar]
55.Meo SA, et al. Effect of environmental air pollution on type 2 diabetes mellitus. Eur. Rev. Med Pharm. Sci. 2015;19:123–128. [PubMed] [Google Scholar]
56.Li J, et al. Ambient Air Pollution Is Associated With HDL (High-Density Lipoprotein) Dysfunction in Healthy Adults. Arterioscler Thromb. Vasc. Biol. 2019;39:513–522. doi: 10.1161/ATVBAHA.118.311749. [DOI] [PubMed] [Google Scholar]
57.Mao S, et al. Long-term effects of ambient air pollutants to blood lipids and dyslipidemias in a Chinese rural population. Environ. Pollut. 2020;256:113403. doi: 10.1016/j.envpol.2019.113403. [DOI] [PubMed] [Google Scholar]
58.Takano H, et al. Nitrogen dioxide air pollution near ambient levels is an atherogenic risk primarily in obese subjects: a brief communication. Exp. Biol. Med. 2004;229:361–364. doi: 10.1177/153537020422900411. [DOI] [PubMed] [Google Scholar]
59.Chen Z, et al. Near-roadway air pollution exposure and altered fatty acid oxidation among adolescents and young adults–The interplay with obesity. Environ. Int. 2019;130:104935. doi: 10.1016/j.envint.2019.104935. [DOI] [PMC free article] [PubMed] [Google Scholar]
60.Sansbury BE, Hill BG. Regulation of obesity and insulin resistance by nitric oxide. Free Radic. Biol. Med. 2014;73:383–399. doi: 10.1016/j.freeradbiomed.2014.05.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Xing YF, et al. The impact of PM2.5 on the human respiratory system. J. Thorac. Dis. 2016;8:E69–E74. doi: 10.3978/j.issn.2072-1439.2016.01.19. [DOI] [PMC free article] [PubMed] [Google Scholar]
62.Psoter KJ, et al. Air pollution exposure is associated with MRSA acquisition in young U.S. children with cystic fibrosis. BMC Pulm. Med. 2017;17:106. doi: 10.1186/s12890-017-0449-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
63.Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]
64.Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
65.Murray J. The “All of Us” Research Program. N. Engl. J. Med. 2019;381:1884. doi: 10.1056/NEJMc1912496. [DOI] [PubMed] [Google Scholar]
66.Cohen GR, et al. Variation in Physicians’ Electronic Health Record Documentation and Potential Patient Harm from That Variation. J. Gen. Intern Med. 2019;34:2355–2367. doi: 10.1007/s11606-019-05025-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
67.Yang J, et al. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 2014;46:100–106. doi: 10.1038/ng.2876. [DOI] [PMC free article] [PubMed] [Google Scholar]
68.Hill AB. THE ENVIRONMENT AND DISEASE: ASSOCIATION OR CAUSATION? Proc. R. Soc. Med. 1965;58:295–300. [PMC free article] [PubMed] [Google Scholar]
69.Environmental Protection Agency. Our Nation’s Air. Air Quality Improves as America Grows. [cited 2020; Available from: https://gispub.epa.gov/air/trendsreport/2020.
70.Archer, C. L. et al. Changes in air quality and human mobility in the USA during the COVID-19 pandemic. Bulletin Atm. Sci.Technol. 1, 491–514 (2020). [DOI] [PMC free article] [PubMed]
71.Pinto CN, et al. Chlamydia and gonorrhea acquisition among adolescents and young adults in Pennsylvania: A Rural and urban Comparison. Sexually Transmitted Dis. 2018;45:99–102. doi: 10.1097/OLQ.0000000000000697. [DOI] [PubMed] [Google Scholar]
72.Hujoel MLA, et al. Liability threshold modeling of case-control status and family history of disease increases association power. Nat. Genet. 2020;52:541–547. doi: 10.1038/s41588-020-0613-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
73.Bivand RS, Wong DWS. Comparing implementations of global and local indicators of spatial association. TEST. 2018;27:716–748. doi: 10.1007/s11749-018-0599-x. [DOI] [Google Scholar]
74.Arnol D, et al. Modeling Cell-Cell Interactions from Spatial Molecular Data with Spatial Variance Component Analysis. Cell Rep. 2019;29:202–211.e6. doi: 10.1016/j.celrep.2019.08.077. [DOI] [PMC free article] [PubMed] [Google Scholar]
75.Kostem E, Eskin E. Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. Am. J. Hum. Genet. 2013;92:558–564. doi: 10.1016/j.ajhg.2013.03.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
76.Searle, S. R. and Khuri, A. I., Matrix algebra useful for statistics. 2017: John Wiley & Sons.
77.Lee SH, et al. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 2011;88:294–305. doi: 10.1016/j.ajhg.2011.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
78.Kristensen, K., et al., TMB: Automatic Differentiation and Laplace Approximation. 70: 21 (2016).
79.[cited 2020 June 22]; Available from: http://www.nealelab.is/uk-biobank.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information^{(5.3MB, pdf)}

Peer Review File^{(352.9KB, pdf)}

41467_2024_49566_MOESM3_ESM.pdf^{(41.4KB, pdf)}

Description of Additional Supplementary Files

Supplementary Data 1-8^{(2.5MB, xlsx)}

Reporting Summary^{(2.3MB, pdf)}

Data Availability Statement

The software package implementing the SMILE and SMILE-2 model is available at https://github.com/dan11mcguire/smile and the linked Zenodo repository (10.5281/zenodo.11081928).

[CR1] 1.Caspi A, Moffitt TE. Gene-environment interactions in psychiatry: joining forces with neuroscience. Nat. Rev. Neurosci. 2006;7:583–590. doi: 10.1038/nrn1925. [DOI] [PubMed] [Google Scholar]

[CR2] 2.Falconer DS. The inheritance of liability to diseases with variable age of onset, with particular reference to diabetes mellitus. Ann. Hum. Genet. 1967;31:1–20. doi: 10.1111/j.1469-1809.1967.tb02015.x. [DOI] [PubMed] [Google Scholar]

[CR3] 3.Lakhani CM, et al. Repurposing large health insurance claims data to estimate genetic and environmental contributions in 560 phenotypes. Nat. Genet. 2019;51:327–334. doi: 10.1038/s41588-018-0313-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Tenesa A, Haley CS. The heritability of human disease: estimation, uses and abuses. Nat. Rev. Genet. 2013;14:139–149. doi: 10.1038/nrg3377. [DOI] [PubMed] [Google Scholar]

[CR5] 5.Braveman P, Gottlieb L. The social determinants of health: it’s time to consider the causes of the causes. Pub. Health Rep. 2014;129:19–31. doi: 10.1177/00333549141291S206. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Kivimäki M, et al. Association between socioeconomic status and the development of mental and physical health conditions in adulthood: a multi-cohort study. Lancet Pub. Health. 2020;5:e140–e149. doi: 10.1016/S2468-2667(19)30248-8. [DOI] [PubMed] [Google Scholar]

[CR7] 7.Yang J, et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Evans LM, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet. 2018;50:737–745. doi: 10.1038/s41588-018-0108-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Abdellaoui, A., Verweij, K. J. H., and Nivard, M. G. Geographic Confounding in Genome-Wide Association Studies. bioRxiv, 2021: 2021.03.18.435971.

[CR10] 10.Khan A, et al. Environmental pollution is associated with increased risk of psychiatric disorders in the US and Denmark. PLOS Biol. 2019;17:e3000353. doi: 10.1371/journal.pbio.3000353. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Kim E, et al. The Evolving Use of Electronic Health Records (EHR) for Research. Semin Radiat. Oncol. 2019;29:354–361. doi: 10.1016/j.semradonc.2019.05.010. [DOI] [PubMed] [Google Scholar]

[CR12] 12.Nordo AH, et al. Use of EHRs data for clinical research: Historical progress and current applications. Learn Health Syst. 2019;3:e10076. doi: 10.1002/lrh2.10076. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Denny JC, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 2013;31:1102–1110. doi: 10.1038/nbt.2749. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Wang K, et al. Classification of common human diseases derived from shared genetic and environmental determinants. Nat. Genet. 2017;49:1319–1325. doi: 10.1038/ng.3931. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Polubriaginof FCG, et al. Disease Heritability Inferred from Familial Relationships Reported in Medical Records. Cell. 2018;173:1692–1704.e11. doi: 10.1016/j.cell.2018.04.032. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR16] 16.Quint, J. B. Health research data for the real world: the MarketScan databases. Ann Arbor, MI: Truven Health Analytics (2015).

[CR17] 17.Truven Health Analytics. Commercial Claims and Encounters Medicare Supplemental. 2016; Available from: https://theclearcenter.org/wp-content/uploads/2020/01/IBM-MarketScan-User-Guide.pdf.

[CR18] 18.U. S. Census Bureau, American Community Survey 5-Year Estimates, in tidycensus: Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames. R package version 0.9.9.5. 2015: https://CRAN.R-project.org/package=tidycensus.

[CR19] 19.van Donkelaar A, et al. Global Estimates of Fine Particulate Matter Using a Combined Geophysical-Statistical Method with Information from Satellites. Environ. Sci. Technol. 2016;50:3762. doi: 10.1021/acs.est.5b05833. [DOI] [PubMed] [Google Scholar]

[CR20] 20.van Donkelaar, A., et al., Global Annual PM2.5 Grids from MODIS, MISR and SeaWiFS Aerosol Optical Depth (AOD) with GWR, 1998-2016. 2018, NASA Socioeconomic Data and Applications Center (SEDAC): Palisades, NY.

[CR21] 21.Geddes JA, et al. Long-term Trends Worldwide in Ambient NO2 Concentrations Inferred from Satellite Observations for Exposure Assessment. Environ. Health Perspect. 2016;124:281–289. doi: 10.1289/ehp.1409567. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.Geddes, J. A., et al., Global 3-Year Running Mean Ground-Level Nitrogen Dioxide (NO2) Grids from GOME, SCIAMACHY and GOME-2. 2017, NASA Socioeconomic Data and Applications Center (SEDAC): Palisades, NY.

[CR23] 23.National Oceanic and Atmospheric Administration, U.S. Wind Climatology U-Component, V-Component, Mean Wind Speed Monthly Datasets. National Oceanic and Atmospheric Administration: ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/.

[CR24] 24.Baiocchi M, Cheng J, Small DS. Instrumental variable methods for causal inference. Stat. Med. 2014;33:2297–2340. doi: 10.1002/sim.6128. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] 25.Anderson, M. L. As the Wind Blows: The Effects of Long-Term Exposure to Air Pollution on Mortality. J. Eur. Economic Association.18, 1886–1927 (2019). [DOI] [PMC free article] [PubMed]

[CR26] 26.Herrnstadt, E. & Muehlegger, E. Air Pollution and Criminal Activity: Evidence from Chicago Microdata. National Bureau of Economic Research Working Papers. 21787, 1–41 (2015).

[CR27] 27.Schlenker W, Walker WR. Airports, Air Pollution, and Contemporaneous Health. Rev. Economic Stud. 2015;83:768–809. doi: 10.1093/restud/rdv043. [DOI] [Google Scholar]

[CR28] 28.Deryugina T, et al. The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction. Am. Econ. Rev. 2019;109:4178–4219. doi: 10.1257/aer.20180279. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR29] 29.Zhang Q, et al. Transboundary health impacts of transported global air pollution and international trade. Nature. 2017;543:705–709. doi: 10.1038/nature21712. [DOI] [PubMed] [Google Scholar]

[CR30] 30.Jiang, J., Li, C., Paul, D., Yang, C. & Zhao, H. On high-dimensional misspecified mixed model analysis in genome-wide association study. Ann. Statist.44, 2127–60 (2016).

[CR31] 31.Bao EL, Cheng AN, Sankaran VG. The genetics of human hematopoiesis and its disruption in disease. EMBO Mol. Med. 2019;11:e10316. doi: 10.15252/emmm.201910316. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Kugeler KJ, et al. Geographic Distribution and Expansion of Human Lyme Disease, United States. Emerg. Infect. Dis. 2015;21:1455–1457. doi: 10.3201/eid2108.141878. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR33] 33.Gilles S, et al. The role of environmental factors in allergy: A critical reappraisal. Exp. Dermatol. 2018;27:1193–1200. doi: 10.1111/exd.13769. [DOI] [PubMed] [Google Scholar]

[CR34] 34.Sudlow C, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] 35.Upadhyaya SG, et al. Automated Diabetes Case Identification Using Electronic Health Record Data at a Tertiary Care Facility. Mayo Clin. Proc. Innov. Qual. Outcomes. 2017;1:100–110. doi: 10.1016/j.mayocpiqo.2017.04.005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] 36.Almgren P, et al. Heritability and familiality of type 2 diabetes and related quantitative traits in the Botnia Study. Diabetologia. 2011;54:2811–2819. doi: 10.1007/s00125-011-2267-5. [DOI] [PubMed] [Google Scholar]

[CR37] 37.Kaprio J, et al. Concordance for type 1 (insulin-dependent) and type 2 (non-insulin-dependent) diabetes mellitus in a population-based cohort of twins in Finland. Diabetologia. 1992;35:1060–1067. doi: 10.1007/BF02221682. [DOI] [PubMed] [Google Scholar]

[CR38] 38.Poulsen P, et al. Heritability of type II (non-insulin-dependent) diabetes mellitus and abnormal glucose tolerance–a population-based twin study. Diabetologia. 1999;42:139–145. doi: 10.1007/s001250051131. [DOI] [PubMed] [Google Scholar]

[CR39] 39.Pilia G, et al. Heritability of cardiovascular and personality traits in 6,148 Sardinians. PLoS Genet. 2006;2:e132. doi: 10.1371/journal.pgen.0020132. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.Lohmueller KE, et al. Whole-exome sequencing of 2,000 Danish individuals and the role of rare coding variants in type 2 diabetes. Am. J. Hum. Genet. 2013;93:1072–1086. doi: 10.1016/j.ajhg.2013.11.005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR41] 41.Xue A, et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat. Commun. 2941;9:2018. doi: 10.1038/s41467-018-04951-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR42] 42.O’Rahilly S, Farooqi IS. Human obesity: a heritable neurobehavioral disorder that is highly sensitive to environmental conditions. Diabetes. 2008;57:2905–2910. doi: 10.2337/db08-0210. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR43] 43.Barnes AS. The epidemic of obesity and diabetes: trends and treatments. Tex. Heart Inst. J. 2011;38:142–144. [PMC free article] [PubMed] [Google Scholar]

[CR44] 44.Agardh E, et al. Type 2 diabetes incidence and socio-economic position: a systematic review and meta-analysis. Int J. Epidemiol. 2011;40:804–818. doi: 10.1093/ije/dyr029. [DOI] [PubMed] [Google Scholar]

[CR45] 45.Gary-Webb TL, Suglia SF, Tehranifar P. Social epidemiology of diabetes and associated conditions. Curr. Diab. Rep. 2013;13:850–859. doi: 10.1007/s11892-013-0427-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR46] 46.Pollitt RA, et al. Examining the pathways linking lower socioeconomic status and advanced melanoma. Cancer. 2012;118:4004–4013. doi: 10.1002/cncr.26706. [DOI] [PubMed] [Google Scholar]

[CR47] 47.Wich LG, et al. Impact of socioeconomic status and sociodemographic factors on melanoma presentation among ethnic minorities. J. Community Health. 2011;36:461–468. doi: 10.1007/s10900-010-9328-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR48] 48.Lantz PM, House JS, Mero RP, Williams DR. Stress, life events, and socioeconomic disparities in health. J. Health Soc. Behav. 2005;46:274–288. doi: 10.1177/002214650504600305. [DOI] [PubMed] [Google Scholar]

[CR49] 49.Reiss F, et al. Socioeconomic status, stressful life situations and mental health problems in children and adolescents: Results of the German BELLA cohort-study. PLoS one. 2019;14:e0213700. doi: 10.1371/journal.pone.0213700. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR50] 50.Johanson JF, Sonnenberg A. The prevalence of hemorrhoids and chronic constipation: an epidemiologic study. Gastroenterology. 1990;98:380–386. doi: 10.1016/0016-5085(90)90828-O. [DOI] [PubMed] [Google Scholar]

[CR51] 51.Ciencewicki J, Jaspers I. Air pollution and respiratory viral infection. Inhal. Toxicol. 2007;19:1135–1146. doi: 10.1080/08958370701665434. [DOI] [PubMed] [Google Scholar]

[CR52] 52.Bowatte, G. et al. Air Pollution and Otitis Media in Children: A Systematic Review of Literature. Int. J. Environ. Res. Pub. Health.15, 257 (2018). [DOI] [PMC free article] [PubMed]

[CR53] 53.Donzelli, G. et al. Particulate Matter Exposure and Attention-Deficit/Hyperactivity Disorder in Children: A Systematic Review of Epidemiological Studies. Int. J. Environ. Res. Pub. Health. 17, 67 (2019). [DOI] [PMC free article] [PubMed]

[CR54] 54.Beamish LA, Osornio-Vargas AR, Wine E. Air pollution: An environmental factor contributing to intestinal disease. J. Crohn’s Colitis. 2011;5:279–286. doi: 10.1016/j.crohns.2011.02.017. [DOI] [PubMed] [Google Scholar]

[CR55] 55.Meo SA, et al. Effect of environmental air pollution on type 2 diabetes mellitus. Eur. Rev. Med Pharm. Sci. 2015;19:123–128. [PubMed] [Google Scholar]

[CR56] 56.Li J, et al. Ambient Air Pollution Is Associated With HDL (High-Density Lipoprotein) Dysfunction in Healthy Adults. Arterioscler Thromb. Vasc. Biol. 2019;39:513–522. doi: 10.1161/ATVBAHA.118.311749. [DOI] [PubMed] [Google Scholar]

[CR57] 57.Mao S, et al. Long-term effects of ambient air pollutants to blood lipids and dyslipidemias in a Chinese rural population. Environ. Pollut. 2020;256:113403. doi: 10.1016/j.envpol.2019.113403. [DOI] [PubMed] [Google Scholar]

[CR58] 58.Takano H, et al. Nitrogen dioxide air pollution near ambient levels is an atherogenic risk primarily in obese subjects: a brief communication. Exp. Biol. Med. 2004;229:361–364. doi: 10.1177/153537020422900411. [DOI] [PubMed] [Google Scholar]

[CR59] 59.Chen Z, et al. Near-roadway air pollution exposure and altered fatty acid oxidation among adolescents and young adults–The interplay with obesity. Environ. Int. 2019;130:104935. doi: 10.1016/j.envint.2019.104935. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR60] 60.Sansbury BE, Hill BG. Regulation of obesity and insulin resistance by nitric oxide. Free Radic. Biol. Med. 2014;73:383–399. doi: 10.1016/j.freeradbiomed.2014.05.016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR61] 61.Xing YF, et al. The impact of PM2.5 on the human respiratory system. J. Thorac. Dis. 2016;8:E69–E74. doi: 10.3978/j.issn.2072-1439.2016.01.19. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR62] 62.Psoter KJ, et al. Air pollution exposure is associated with MRSA acquisition in young U.S. children with cystic fibrosis. BMC Pulm. Med. 2017;17:106. doi: 10.1186/s12890-017-0449-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR63] 63.Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR64] 64.Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR65] 65.Murray J. The “All of Us” Research Program. N. Engl. J. Med. 2019;381:1884. doi: 10.1056/NEJMc1912496. [DOI] [PubMed] [Google Scholar]

[CR66] 66.Cohen GR, et al. Variation in Physicians’ Electronic Health Record Documentation and Potential Patient Harm from That Variation. J. Gen. Intern Med. 2019;34:2355–2367. doi: 10.1007/s11606-019-05025-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR67] 67.Yang J, et al. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 2014;46:100–106. doi: 10.1038/ng.2876. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR68] 68.Hill AB. THE ENVIRONMENT AND DISEASE: ASSOCIATION OR CAUSATION? Proc. R. Soc. Med. 1965;58:295–300. [PMC free article] [PubMed] [Google Scholar]

[CR69] 69.Environmental Protection Agency. Our Nation’s Air. Air Quality Improves as America Grows. [cited 2020; Available from: https://gispub.epa.gov/air/trendsreport/2020.

[CR70] 70.Archer, C. L. et al. Changes in air quality and human mobility in the USA during the COVID-19 pandemic. Bulletin Atm. Sci.Technol. 1, 491–514 (2020). [DOI] [PMC free article] [PubMed]

[CR71] 71.Pinto CN, et al. Chlamydia and gonorrhea acquisition among adolescents and young adults in Pennsylvania: A Rural and urban Comparison. Sexually Transmitted Dis. 2018;45:99–102. doi: 10.1097/OLQ.0000000000000697. [DOI] [PubMed] [Google Scholar]

[CR72] 72.Hujoel MLA, et al. Liability threshold modeling of case-control status and family history of disease increases association power. Nat. Genet. 2020;52:541–547. doi: 10.1038/s41588-020-0613-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR73] 73.Bivand RS, Wong DWS. Comparing implementations of global and local indicators of spatial association. TEST. 2018;27:716–748. doi: 10.1007/s11749-018-0599-x. [DOI] [Google Scholar]

[CR74] 74.Arnol D, et al. Modeling Cell-Cell Interactions from Spatial Molecular Data with Spatial Variance Component Analysis. Cell Rep. 2019;29:202–211.e6. doi: 10.1016/j.celrep.2019.08.077. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR75] 75.Kostem E, Eskin E. Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. Am. J. Hum. Genet. 2013;92:558–564. doi: 10.1016/j.ajhg.2013.03.010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR76] 76.Searle, S. R. and Khuri, A. I., Matrix algebra useful for statistics. 2017: John Wiley & Sons.

[CR77] 77.Lee SH, et al. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 2011;88:294–305. doi: 10.1016/j.ajhg.2011.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR78] 78.Kristensen, K., et al., TMB: Automatic Differentiation and Laplace Approximation. 70: 21 (2016).

[CR79] 79.[cited 2020 June 22]; Available from: http://www.nealelab.is/uk-biobank.

PERMALINK

Dissecting heritability, environmental risk, and air pollution causal effects using > 50 million individuals in MarketScan

Daniel McGuire

Havell Markus

Lina Yang

Jingyu Xu

Austin Montgomery

Arthur Berg

Qunhua Li

Laura Carrel

Dajiang J Liu

Bibo Jiang

Abstract

Introduction

Results data overview

Table 1.

Statistical methodology overview

SMILE model yields more accurate heritability and environmental variance component estimates

Fig. 1. Variance component model simulations.

SMILE-2 model more powerfully identifies pollution causal effects

Fig. 2. SMILE-2 controls type I error and improves power over methods that ignore genetic relatedness.

Estimation and robustness of genetic heritability and spatial variance components for 1083 traits

Fig. 3. Comparison of estimated heritability with and without accounting for community effects.

Landscape of heritability and community-level environment

Fig. 4. Distribution of genetic and community effects by phenotype category.

Table 2.

Extensive correlation between spatial random effects and environmental risk factors

Estimating causal effects of air pollution across 1083 phenotypes

Fig. 5. Volcano plots of causal effect estimates from SMILE-2 model for pollution on diseases.

Discussion

Methods

SMILE model of genetic heritability, family environment, and community location random effects

Conversion of variance components from observed scale to liability scale

Fitting the model with laplace approximation

Two stage regression model for causal inference of PM2.5 and NO2 air pollution

Ethical approval

Reporting summary

Supplementary information

Acknowledgements

Author contributions

Peer review

Peer review information

Data availability

Code availability

Competing interests

Footnotes

Contributor Information

Supplementary information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Two stage regression model for causal inference of PM_2.5 and NO₂ air pollution