Abstract
Gene-environment interactions (GxE) can be fundamental in applications ranging from functional genomics to precision medicine and are a conjectured source of substantial heritability. However, unbiased methods to profile GxE genome-wide are nascent and, as we show, cannot accommodate general environment variables, modest sample sizes, heterogeneous noise, and binary traits. To address this gap, we propose a simple, unifying mixed model for gene-environment interaction (GxEMM). In simulations and theory, we show that GxEMM can dramatically improve estimates and eliminate false positives when the assumptions of existing methods fail. We apply GxEMM to a range of human and model organism datasets and find broad evidence of context-specific genetic effects, including GxSex, GxAdversity, and GxDisease interactions across thousands of clinical and molecular phenotypes. Overall, GxEMM is broadly applicable for testing and quantifying polygenic interactions, which can be useful for explaining heritability and invaluable for determining biologically relevant environments.
Keywords: GxE, heritability, genetic heterogeneity, psychiatric disease, disease subtypes, linear mixed model, G-E correlation, heteroskedasticity
Introduction
Many examples of gene-environment interaction (GxE) have been documented in humans at the level of individual genetic variants. In functional genomics, variants can have effects on expression that depend on external context,1, 2, 3, 4, 5 age,6 tissue,7 or cell type.8,9 In complex traits, some genetic variants are known to interact with air pollution,10 microbe exposure,11 or sex.12, 13, 14, 15 Variants can also interact with medical interventions and can render certain treatments ineffective.16, 17, 18
GxE tests can explain novel biology along two distinct, complementary axes. First, GxE can identify unappreciated genetic effects that elude linear models, which can increase genome-wide association study (GWAS) power and has recently received attention as a partial answer to the missing heritability question.19, 20, 21 Second, GxE tests can demonstrate that an environmental measurement is biologically trait relevant and quantify its impact, which can be important for public health and can illuminate intrinsic trait biology. More generally, genetic interaction tests can be used to assess the biological significance of any sample stratification, including (putative) phenotypic subtypes, quantitative covariates like age, or even other genetic variants.
However, investigation of GxE in nominally unrelated humans has been limited by the fact that individual variant effects are typically small. In the additive model context, genome-wide Genomic Restricted Maximum Likelihood (GREML) has been an invaluable complement to variant-based tests, powerfully characterizing aggregate polygenic effect sizes without attempting to resolve causal variants.22 Here, we propose an analogous mixed model for GxE, GxE Mixed Model (GxEMM), to characterize the aggregate polygenic contributions of GxE. Relative to single-variant GxE tests, GxEMM has higher power (for polygenic traits) but lower resolution, a tradeoff that makes it particularly useful for characterizing the biological relevance of environmental measures.
While complex in some settings, GxEMM is easily visualized for discrete environments (Figure 1). GxEMM incorporates three key, interpretable models. First, the Hom model is equivalent to standard GREML and models only the mean effects of the environment. We call the second model IID, as it allows environment-specific genetic variance and noise but assumes that these values are constant across every environment. We call the final model Free, because it allows the genetic and noise levels to freely vary between environments. We propose these three core models as a parsimonious, biologically informative representation of possible GxE models.
Although GxEMM is simplest in the case of two discrete environments, GxEMM extends naturally to accommodate arbitrary environmental covariates. This is important because previous approaches require discrete environments (e.g., MV-GREML20) or univariate environments (e.g., MRNM23). A second significant component of GxEMM is the ability to model binary traits, like case/control disease studies. Third, GxEMM accommodates modest sample sizes, which can be important for applications like functional genomics. To obtain this flexibility, we implement three methods to estimate GxEMM parameters, each extending an existing approach to GREML: REML, for large sample sizes and continuous traits; phenotype-correlation-genotype-correlation (PCGC), for binary traits;24,25 and Haseman-Elston regression (HE), for modest sample sizes, e.g., hundreds of samples.26
We also make three important theoretical contributions. First, we demonstrate that noise heterogeneity must be modeled to avoid false positive GxE (Appendix A). Second, we provide a formally identified model description, which is essential for clear inference and future extensions (Appendix B). Third, we correct a misconception that G-E correlation generally causes GxE bias in the polygenic setting (Appendix C).
In this paper, we first provide a broad overview of GxEMM and the concept of polygenic interaction and then describe the details of the GxEMM method, our simulation study, and the three real datasets we analyze. Next, we perform a series of simulations and theoretical analyses that demonstrate that GxEMM is broadly reliable. We then apply GxEMM to study sex-specific heritability in 115 outbred rat traits and find strong, widespread signals of genetic and non-genetic heterogeneity, especially in bone and glucose-tolerance traits. Next, we find significant polygenic interactions with several stress indices and quantitative environments for major depression. Finally, we analyze RNA-seq data from postmortem brain and find strong evidence for bipolar- and schizophrenia-specific cis-heritability, both on average over the transcriptome and in nine known SCZ-associated genes. We conclude with a discussion on important caveats as well as future applications and extensions.
Material and Methods
Overview of GxEMM
GxEMM builds on GREML, an additive polygenic model that estimates heritability distributed across the genome. GxEMM additionally captures the GxE-based heritability due to polygenic gene-environment interactions. Assuming discrete environments, the model for phenotype $y_i$ in environment k (i.e., $z_i = k$) is:
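$$
y_i = X_i \alpha + G_i \beta + G_i \gamma_k + \epsilon_i, \qquad \beta_s \sim \mathcal{N}\!\left(0, \tfrac{\sigma_g^2}{L}\right), \quad \gamma_{sk} \sim \mathcal{N}\!\left(0, \tfrac{v_k}{L}\right), \quad \epsilon_i \sim \mathcal{N}\!\left(0, \sigma_e^2 + w_k\right), \quad s = 1, \dots, L
\tag{Equation 1}
$$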
In this model, X are covariates with fixed effects α, like age or genetic PCs. G is the genotype matrix, with additive effects β. As in GREML, we assume that β and the noise, ϵ, are i.i.d. normal with mean zero, and the heritability is determined by the genetic and noise variances, $\sigma_g^2$ and $\sigma_e^2$. GxEMM additionally captures SNP-environment interaction effects, γ, which are also assumed i.i.d. normal with mean zero. Further, GxEMM allows environment-specific genetic ($v_k$) and noise ($w_k$) variances. Although we have assumed that z is discrete to simplify Equation 1 and Figure 1, GxEMM extends to general z, e.g., quantitative environments or proportional membership across discrete environments (see LMMs for Polygenic Interaction Effects (GxEMM)).
To unpack the model, imagine studying genetic effects on height across males and females. A SNP s that equally increases height in both sexes has a homogeneous effect ($\beta_s \neq 0$) but has no sex-specific effects ($\gamma_{sf} = \gamma_{sm} = 0$), so s contributes to $\sigma_g^2$ but not $v_f$ or $v_m$. Conversely, a SNP that increases height only in females has $\beta_s = 0$ and $\gamma_{sf} \neq 0$, so it contributes to $v_f$ but not $\sigma_g^2$ or $v_m$. Finally, $w_f > w_m$ means that females have higher non-genetic height variance.
We consider three distinct models for v: Hom, where all $v_k = 0$; IID, where $v_k = v$ for all k; and Free, where v is unconstrained. We use the same models for w, although the IID model is not identified in the case of discrete z. In the sex-height example, the IID model allows sex-specific genetic effects but assumes that, on balance, male- and female-specific effect sizes are identical; the Free model eliminates this assumption. In this paper, we assume that the genetic effects are exchangeable across environments, i.e., we assume that $\gamma_{sk}$ and $\gamma_{sk'}$ are independent for $k \neq k'$. We also theoretically describe a Full model allowing arbitrary correlations between environment-specific effects, which is richer but statistically and computationally challenging (see LMMs for Polygenic Interaction Effects (GxEMM)).
Of the many possible statistical tests available for GxEMM, we focus on a parsimonious set of biologically meaningful tests to characterize polygenic GxE:
• Hom versus Null (i.e., is $\sigma_g^2 \neq 0$?) tests whether there is any heritability.
• IID versus Hom (i.e., is $v \neq 0$?) tests polygenic GxE under homoscedasticity.
• Free versus Hom (i.e., is $v_k \neq 0$ for some k?) tests polygenic GxE allowing heteroscedasticity.
We implement both LR and Wald tests.
In the Free model with discrete z, each environment can have different levels of heritability and total variance. Assuming the kinship matrix is normalized, the heritability in environment k is:
$$
h_k^2 = \frac{\sigma_g^2 + v_k}{\sigma_g^2 + v_k + \sigma_e^2 + w_k}
\tag{Equation 2}
$$
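As a small illustration of environment-specific heritability, the following Python sketch evaluates the ratio of genetic to total variance in each environment, assuming the Free-model form $h_k^2 = (\sigma_g^2 + v_k) / (\sigma_g^2 + v_k + \sigma_e^2 + w_k)$ with a normalized kinship matrix. The function name and the numeric values are hypothetical and chosen only for illustration; they are not estimates from this paper.

```python
def env_heritability(sigma2_g, sigma2_e, v, w):
    """Per-environment heritability under the Free model (normalized kinship assumed)."""
    h2 = []
    for v_k, w_k in zip(v, w):
        genetic = sigma2_g + v_k            # homogeneous + environment-specific genetic variance
        total = genetic + sigma2_e + w_k    # add homogeneous + environment-specific noise
        h2.append(genetic / total)
    return h2

# Hypothetical sex-height style example: extra genetic and noise variance in females only.
h2_f, h2_m = env_heritability(sigma2_g=0.3, sigma2_e=0.5, v=[0.2, 0.0], w=[0.1, 0.0])
print(round(h2_f, 3), round(h2_m, 3))
```

Under these made-up inputs the two environments differ in both heritability and total variance, which is exactly the situation the Free model is meant to describe.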
Most previous approaches for polygenic GxE in unrelated samples assume discrete environments (Table 1). The GCTA GxE model fits an IID v.27 The MV-GREML model fits Full v and Free w (and low-rank, polynomial sub-models, called RR-GREML).20 iSet fits a similar model and can even fit Full w in the case that all samples are measured in all environments.28 MetaSex fits the Hom model independently in each of two discrete environments, which implicitly resembles Free v and w.29 Finally, StructLMM allows general (and potentially high-dimensional) environments, but only fits the IID model and a single SNP.21 Complementary to GxEMM, StructLMM powerfully identifies specific SNPs with environment-dependent effects but has low resolution to determine the biological significance of specific environments.
Table 1.
Publication | GxE | IxE | Binary y | Allowed Z | Notes |
---|---|---|---|---|---|
GCTA27 | IID | unbiased | no | discrete | – |
iSet28 | full | unbiased | no | discrete | design-specific |
MetaSex29 | free | unbiased | no | discrete | not MLE |
MV-GREML20 | full | unbiased | no | discrete | – |
StructLMM21 | IID | unbiased | no | general | SNP |
RNM23 | free | biased | no | bivariate | models G-E corr. |
MRNM23 | IID | unbiased | no | univariate | models G-E corr. |
GxEMM (us) | free | unbiased | yes | general | – |
GxE gives the richest fitted genetic heterogeneity model, and IxE indicates whether the method is biased under noise heterogeneity (Appendix A). Methods without specific notes are displayed with a dash (–).
Most recently, Ni et al.23 developed a Free model for v and w that allows gene-environment (G-E) correlation. In practice, Ni et al.23 implement simpler submodels, and particularly focus on RNM and MRNM. RNM can fit up to two environments but ignores G-E correlation and noise heterogeneity (Appendix A). MRNM can fit only one environment, where the IID and Free models coincide. These constraints resolve computational and statistical limitations common to all GREML-based approaches, including GxEMM. The constraints also resolve non-identification issues specific to the model formulation in Ni et al.23 By contrast, the formal model underlying GxEMM is well identified and naturally accommodates multivariate environments and genetic correlations between environments (Appendix B). For completeness, we extend the GxEMM model to accommodate G-E correlation in theory (see LMMs for Polygenic Interaction Effects (GxEMM)). Nonetheless, we theoretically demonstrate that this is unlikely to cause bias for GxEMM (Appendix C); moreover, no evidence to date has shown that polygenic G-E correlation causes GxE bias or test inflation in practice, in contrast to the single-variant case.30
Overview of GxEMM with Binary Traits
Continuous trait models are inappropriate for binary traits like disease status. To address this, generalized linear models assume the binary trait is driven by an underlying quantitative liability. Binary trait heritability is then naturally defined as the heritability of this liability. To our knowledge, no other method has been developed to estimate polygenic GxE for binary traits.
One prominent approach to binary trait heritability estimation is to treat the 0/1 binary label as continuous in REML and then rescale post hoc. This transformation is exact without covariates31 and approximately extends to mild case ascertainment.24,32,33 We attempt to extend this idea to GxE and evaluate the variance components both with and without rescaling (see GxEMM for Binary Traits).
Another method for binary trait GREML, phenotype-correlation-genotype-correlation (PCGC), uses moment-matching. When covariates are absent, PCGC amounts to comparing phenotypic and genotypic correlations, similar to HE regression for quantitative traits. PCGC extends this to incorporate covariates and preferential case ascertainment using an analytic approximation.24,25 We can directly fit GxEMM with PCGC as it allows multiple relatedness matrices.
LMMs for Polygenic Effects (GREML)
The standard linear mixed model (LMM) for GREML assumes a quantitative trait measured on N samples, $y \in \mathbb{R}^N$. We allow Q background covariates in a matrix $X \in \mathbb{R}^{N \times Q}$ and L SNPs that we collect into the genotype matrix $G \in \mathbb{R}^{N \times L}$. The homogeneous LMM is:
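$$
y = X\alpha + G\beta + \epsilon
\tag{Equation 3}
$$

$$
\beta_s \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\left(0, \tfrac{\sigma_g^2}{L}\right), \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\left(0, \sigma_e^2\right)
\tag{Equation 4}
$$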
We assume $Q \ll N$ and estimate α as a fixed effect. In contrast, we allow $L \gg N$ and model β as a random effect. This can be motivated as a genuine prior that all SNPs have small, nonzero, i.i.d. effects, but GREML accurately estimates heritability under more realistic architectures.34,35 Columns of G are demeaned and scaled based on some assumed relationship between MAF and effect size.36,37
Marginalizing out β gives a simpler and equivalent (or almost equivalent, according to Steinsaltz et al.38) formulation of GREML:
$$
y \sim \mathcal{N}\!\left(X\alpha,\; \sigma_g^2 K + \sigma_e^2 I_N\right)
\tag{Equation 5}
$$
defining $I_N$ as the N-dimensional identity matrix and $K := \frac{1}{L} G G^\top$ as the kinship matrix, a natural estimator of genetic similarity. Equation 5, and most expressions in this paper, assume the total phenotypic variance is 1, so that $h^2 = \sigma_g^2$ and $\sigma_e^2 = 1 - h^2$.
The GREML model in Equation 5 is commonly fit by restricted maximum likelihood (REML), which projects out X from K and y and then fits variance components by maximum likelihood. However, moment-matching methods are increasingly common, especially LD score regression39 for very large N or meta-analyses—where implementing REML becomes challenging—or HE regression40 for small N—where REML is biased35 and computationally unstable.26
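To make the moment-matching idea concrete, here is a minimal NumPy sketch of Haseman-Elston regression for the homogeneous model: it regresses products of standardized phenotypes, $y_i y_j$, on kinship entries, $K_{ij}$, over pairs $i < j$. The function name, the no-intercept choice, and all toy sizes are assumptions made for illustration; this is not the implementation used in the paper.

```python
import numpy as np

def he_regression(y, K):
    """Regress phenotype products y_i * y_j on kinship K_ij over pairs i < j (no intercept)."""
    upper = np.triu_indices(len(y), k=1)     # indices of all pairs i < j
    products = np.outer(y, y)[upper]
    kinship = K[upper]
    return float(kinship @ products / (kinship @ kinship))

# Toy data: sizes, seed, and allele frequency are arbitrary illustrative choices.
rng = np.random.default_rng(0)
N, L, h2_true = 400, 2000, 0.5
G = rng.binomial(2, 0.5, size=(N, L)).astype(float)
G = (G - G.mean(axis=0)) / G.std(axis=0)      # center and scale each SNP
K = G @ G.T / L                               # kinship matrix
beta = rng.normal(0.0, np.sqrt(h2_true / L), size=L)
y = G @ beta + rng.normal(0.0, np.sqrt(1 - h2_true), size=N)
y = (y - y.mean()) / y.std()                  # phenotype variance 1, as assumed in the text
h2_hat = he_regression(y, K)
print(h2_hat)
```

With unrelated simulated samples the off-diagonal kinship entries are small, so a single estimate at this sample size is noisy; the sketch is meant to show the estimating equation rather than its precision.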
LMMs for Polygenic Interaction Effects (GxEMM)
We now assume a matrix of P environmental variables, $Z \in \mathbb{R}^{N \times P}$. As for X, we allow arbitrary binary and/or continuous variables in Z and assume $P \ll N$. The GxE mixed model (GxEMM) adds environment main effects, polygenic interactions, and environment-specific noise to GREML:
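$$
y_i = X_i \alpha + G_i \beta + \sum_{p=1}^{P} Z_{ip}\, G_i \gamma_{\cdot p} + \sum_{p=1}^{P} Z_{ip}\, \delta_{ip} + \epsilon_i
\tag{Equation 6}
$$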
X should include the main effect of Z. We define $\gamma_{sp}$ as the interaction effect of SNP s and environment p, and $\delta_{ip}$ is the noise contributed by environment p to person i.
We assume β and ϵ are random, as in Equation 4, and also model γ and δ as random effects:
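$$
\gamma_{s \cdot} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\left(0, \tfrac{1}{L} V\right), \qquad \delta_{i \cdot} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\!\left(0, W\right)
\tag{Equation 7}
$$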
The interactions are independent between SNPs, as in β. For β, this means entries are independent, and for γ it means rows are independent. However, V allows the interaction effects to correlate across environments. Intuitively, $\sigma_g^2$ captures the homogeneous effects (β) and $V_{pp}$ captures the environment p-specific effects ($\gamma_{\cdot p}$). Off-diagonal terms, $V_{pp'}$, account for genetic effects shared between environments p and p′ in excess of the homogeneous sharing across all environments, $\sigma_g^2$.
W is interpreted analogously, where $W_{pp}$ indicates environment p-specific noise and $W_{pp'}$ indicates covariance between the noise contributed by environments p and p′. In discrete environments, $W_{pp'}$ is not identified for $p \neq p'$ and the $W_{pp}$ can be assumed zero mean WLOG (Appendix B).
The model in Equation 6 can be simplified using ∗, the column-wise Khatri-Rao product:
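$$
y = X\alpha + G\beta + \left(Z^\top * G^\top\right)^\top \mathrm{vec}(\gamma) + \left(Z^\top * I_N\right)^\top \mathrm{vec}(\delta) + \epsilon
\tag{Equation 8}
$$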
Just as the random β and ϵ can be marginalized for additive LMMs, giving Equation 5 from Equation 4, the random β, γ, δ, and ϵ in the GxEMM model in Equation 8 can be marginalized:
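$$
y \sim \mathcal{N}\!\left(X\alpha,\; \sigma_g^2 K + \left(Z V Z^\top\right) \circ K + \left(Z W Z^\top\right) \circ I_N + \sigma_e^2 I_N\right)
\tag{Equation 9}
$$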
The GxE variance component nicely decomposes into the Hadamard product of environmental similarity (ZVZT) and genetic similarity (K).
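The Hadamard decomposition can be checked numerically. The NumPy sketch below builds the interaction design matrix whose i-th row is the Kronecker product of sample i's environment and genotype rows, and compares the covariance it implies with $(ZVZ^\top) \circ K$. It assumes that rows of γ are i.i.d. with covariance $V/L$ and that $K = GG^\top/L$; the dimensions and random inputs are arbitrary, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, P = 6, 5, 2                        # tiny, arbitrary dimensions
G = rng.normal(size=(N, L))              # stand-in for a scaled genotype matrix
Z = rng.normal(size=(N, P))              # stand-in for general environments
A = rng.normal(size=(P, P))
V = A @ A.T                              # an arbitrary symmetric positive semidefinite V

# Interaction design matrix: row i equals kron(z_i, g_i), giving an N x (P*L) matrix.
M = (Z[:, :, None] * G[:, None, :]).reshape(N, P * L)

# Covariance of vec(gamma) (columns stacked) when rows of gamma are i.i.d. N(0, V / L).
cov_gamma = np.kron(V, np.eye(L)) / L

K = G @ G.T / L
lhs = M @ cov_gamma @ M.T                # covariance implied by the linear model
rhs = (Z @ V @ Z.T) * K                  # Hadamard product of the two similarity matrices
identity_holds = np.allclose(lhs, rhs)
print(identity_holds)
```

The agreement is exact for any G, Z, and V, because each entry factorizes as $(z_i^\top V z_j)(g_i^\top g_j / L)$.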
Equation 9 defines a complex likelihood. But commuting the matrix multiplications in ZVZT and ZWZT through the (linear) Hadamard product simplifies the expression:
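$$
y \sim \mathcal{N}\!\left(X\alpha,\; \sigma_g^2 K + \sum_{p, p'} V_{pp'} \left(Z_p Z_{p'}^\top\right) \circ K + \sum_{p, p'} W_{pp'} \left(Z_p Z_{p'}^\top\right) \circ I_N + \sigma_e^2 I_N\right)
\tag{Equation 10}
$$

Here $Z_p$ denotes the p-th column of Z.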
This phenotypic covariance matrix is easily visualized for discrete environments (Figure 1, with $v_p = V_{pp}$ and $w_p = W_{pp}$). Since K and Z are known, Equation 10 is a standard variance component estimation problem. We typically fit the model with REML using LDAK.37 For small sample sizes we also use HE regression, which is more computationally stable and unbiased.
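As a sketch of what this variance component problem looks like for the Free model, the NumPy snippet below assembles the phenotypic covariance from known K and Z, using one genetic similarity matrix $(Z_p Z_p^\top) \circ K$ and one diagonal noise matrix per environment. The function name, the toy kinship matrix, and the parameter values are hypothetical.

```python
import numpy as np

def free_covariance(K, Z, sigma2_g, sigma2_e, v, w):
    """Phenotypic covariance under the Free model (diagonal V and W)."""
    N, P = Z.shape
    cov = sigma2_g * K + sigma2_e * np.eye(N)
    for p in range(P):
        zp = Z[:, p]
        cov += v[p] * np.outer(zp, zp) * K     # environment-specific genetic similarity
        cov += w[p] * np.diag(zp ** 2)         # environment-specific noise
    return cov

# Four samples in two discrete environments; K is a made-up kinship matrix.
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
K = np.array([[1.0, 0.2, 0.1, 0.0],
              [0.2, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.3],
              [0.0, 0.1, 0.3, 1.0]])
cov = free_covariance(K, Z, sigma2_g=0.2, sigma2_e=0.4, v=[0.3, 0.1], w=[0.0, 0.2])
print(cov)
```

For discrete environments, pairs of samples in different environments share only the homogeneous term $\sigma_g^2 K_{ij}$, while pairs within environment p share $(\sigma_g^2 + v_p) K_{ij}$, matching the block picture described for Figure 1.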
Interestingly, this marginalization from Equation 8 to Equations 9 and 10 holds if and only if γ has the covariance in Equation 7 (under large N or random G, Appendix D). This matters because it links the simple visual intuition (Figure 1) underlying the variance component model in Equation 10 to the concrete linear model in Equation 8; prima facie, distinct linear models could give equivalent variance component models. This bridge also enables GxEMM to naturally generalize to arbitrary environments, as they are seamlessly accommodated by the linear regression model in Equation 8. Conversely, the pictorial representation in Figure 1 does not easily adapt to general environments, nor do models that directly start from variance components rather than the underlying linear model.
We use Wald and LR tests for GxEMM, with asymptotic standard errors derived from the information matrix and the delta method. We allow negative variance component estimates, both to reduce bias (which is important for aggregating estimates across traits38) and because non-negative total heritability does not imply either that $\sigma_g^2 \geq 0$ or that V is positive definite. We test individual HE model fits using permutations. In psychENCODE, we permute within-disease to mitigate violations due to non-exchangeability across individuals by preserving any noise heterogeneity across the disease groups.41,42
In this paper, we do not fit GxEMM with a general V or W (which we call the Full model) because it has $O(P^2)$ parameters, which is computationally and statistically difficult for the range of N where basic REML methods are computationally feasible. Instead, we consider several restricted models for V (and W): Hom, where V = 0; IID, where $V = v I_P$; and Free, where $V = \mathrm{diag}(v_1, \dots, v_P)$. Both IID and Free ignore correlation between genetic effects across environments (beyond the homogeneous sharing captured in $\sigma_g^2$), which will perform well when environments are approximately exchangeable rather than structured, e.g., city or hospital indicators, but not discretized height. IID further assumes that genetics explain equal variance in each environment.
Ni et al.23 develop two models similar to Free GxEMM, RNM and MRNM. For computational and statistical reasons, they are implemented only for special cases. RNM only fits bivariate environments and assumes Hom noise (which is unreliable, Appendix A). MRNM fits only univariate environments, where the Free and IID models coincide (Appendix B). A distinct strength of MRNM is modeling gene-environment correlation. In our model, this can be phrased as learning correlations between the genetic main and interaction effects (between $\beta_s$ and $\gamma_{sp}$ for each SNP s). In the future, we will investigate G-E correlation with GxEMM by adding the corresponding cross-covariance terms to our model in Equation 10.
Nonetheless, G-E correlation simulations in Ni et al.23 suggest the bias is not likely to be severe in practice for GxEMM. Specifically, GxE estimates were unbiased in simulations so long as Z was adjusted as a fixed effect—which is well known, and GxEMM always includes Z as a fixed effect. Ni et al.23 do observe that adjusting for Z changes the homogeneous variance components estimates, but this is entirely expected—cf. the distinction between “Marginal” and “Conditional” variances in Weissbrod et al.25 Overall, generally fitting Full GxEMM with G-E correlation would be useful in large datasets, but this remains computationally and statistically unrealized in unrelated samples. Randomized methods may address this gap in the future.43,44
GxEMM for Binary Traits
We extend two binary trait heritability estimation methods from the GREML context to GxEMM. The first uses REML and treats the 0/1 disease label as a quantitative trait, and then rescales the REML estimates post hoc:
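$$
\hat{h}^2_{\text{liab}} = \hat{h}^2 \cdot \frac{K^2 (1 - K)^2}{P (1 - P)\, \varphi\!\left(\Phi^{-1}(K)\right)^2}
\tag{Equation 11}
$$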
where $\hat{h}^2$ is any of the REML-based GxEMM heritability estimates; K is the disease prevalence in the population and P is the sample prevalence (in this equation only, K and P do not denote the kinship matrix or the number of environments); and φ and Φ are the standard Gaussian density and distribution functions. This scaling approach is well established for the Hom model31 and works well when covariates and ascertainment are modest,32 but otherwise can fail badly.24
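The post hoc rescaling can be sketched with the Python standard library. The snippet assumes the standard observed-to-liability transformation, $\hat{h}^2 \cdot K^2(1-K)^2 / [P(1-P)\,\varphi(t)^2]$ with $t$ the liability threshold implied by prevalence K; the function and argument names are ours, and the inputs are made up.

```python
import math
from statistics import NormalDist

def liability_scale(h2_observed, pop_prevalence, sample_prevalence):
    """Rescale an observed-scale (0/1) heritability estimate to the liability scale."""
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - pop_prevalence)   # liability cutoff for disease
    density = std_normal.pdf(threshold)                     # Gaussian density at the cutoff
    k, p = pop_prevalence, sample_prevalence
    return h2_observed * (k * (1.0 - k)) ** 2 / (p * (1.0 - p) * density ** 2)

# Hypothetical inputs: 20% population prevalence, with and without case oversampling.
print(liability_scale(0.10, pop_prevalence=0.2, sample_prevalence=0.2))
print(liability_scale(0.10, pop_prevalence=0.2, sample_prevalence=0.5))
```

When population and sample prevalence are both 50%, the factor reduces to π/2, which gives a quick sanity check on any implementation.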
The second approach to fit GREML for binary traits, PCGC, directly models the binary nature of the trait and estimates parameters with the method of moments. PCGC incorporates covariates and ascertainment (approximately), making it far more robust than the REML rescaling approach for the Hom model.24 We directly fit PCGC-based GxEMM using its LDAK implementation. We estimate standard errors using resampling25 and the delta method.
The LDAK implementation discards on-diagonal entries in the variance component moment estimating equation. Although this is efficient for standard genetic similarity matrices, which have near-constant diagonals, this ignores important information for Free noise. A subtler and more significant problem, though, is that this off-diagonal approach treats the homogeneous noise variance component asymmetrically, as it captures both $\sigma_e^2$ and the residual moment estimating error. This caused estimates of w based on including K − 1 noise heterogeneity matrices to behave strangely, with the implicit component (the K-th, say) being upward biased relative to the other K − 1 as it combines both $w_K$ and the overall PCGC approximation error. To mitigate this issue, we use all K environmental noise similarity matrices in PCGC, which has a unique method-of-moments solution, enforces symmetry across environments, and provides identified estimates of $w_K$. We note this issue is relevant only for discrete environments (Appendix B).
Simulation Details
We tested GxEMM with simulations using real genotypes from CONVERGE (see CONVERGE Data). For each simulation, we randomly choose S = 1,000 SNPs to have both additive and interaction effects. Because causal SNPs are chosen uniformly at random, we use the standard GRM (the Gram matrix of G after column centering and scaling) inside GxEMM in these simulations. We do not use the causal relatedness matrix because it is unknown in practice. We independently draw 200 datasets per parameter set, except for the N = 1,000 simulations where we draw 500 datasets. We note that REML failed for a small number of simulations.
We define Z by assigning each sample to one of two discrete environments uniformly at random. We define X to include Z and the CONVERGE fixed effects (see CONVERGE Data) and draw α i.i.d. Gaussian with mean zero and variance such that X explains 10% of the variation in the raw phenotype (note, however, that α is irrelevant after residualizing on X in HE and REML). We scale columns of X and G to mean zero, variance one so that the effect sizes are easily interpretable.
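We vary $\sigma_g^2$, v, and w depending on the simulation setting, with different choices corresponding to either the Hom, IID, or Free GxEMM models. We always set the homogeneous noise level so that the phenotype has (residual) variance 1, which for two equally likely environments gives:

$$
\sigma_e^2 = 1 - \sigma_g^2 - \tfrac{1}{2}(v_1 + v_2) - \tfrac{1}{2}(w_1 + w_2)
$$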
Given these parameter and data choices, we draw traits from the GxEMM model in Equation 6 (Figure 2).
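A minimal NumPy sketch of such a simulation with two discrete environments is shown below. It uses randomly generated genotypes instead of CONVERGE data, omits the fixed effects X, and assumes that the homogeneous noise is chosen as one minus the homogeneous genetic variance and the average environment-specific variances (our reading of the residual-variance-1 constraint under equal-sized environments). All sizes and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 2000, 500                                     # arbitrary toy sizes
sigma2_g = 0.2
v = np.array([0.3, 0.1])                             # environment-specific genetic variances
w = np.array([0.0, 0.2])                             # environment-specific noise variances
sigma2_e = 1 - sigma2_g - v.mean() - w.mean()        # assumed constraint: total variance 1

G = rng.binomial(2, 0.3, size=(N, L)).astype(float)  # simulated stand-in genotypes
G = (G - G.mean(axis=0)) / G.std(axis=0)             # mean zero, variance one per SNP
env = rng.integers(0, 2, size=N)                     # uniform assignment to environment 0 or 1

beta = rng.normal(0.0, np.sqrt(sigma2_g / L), size=L)     # homogeneous SNP effects
gamma = rng.normal(0.0, np.sqrt(v / L), size=(L, 2))      # one column of effects per environment
noise = rng.normal(0.0, np.sqrt(sigma2_e + w[env]))       # heteroskedastic noise

# Each sample receives the homogeneous effect plus its own environment's interaction effect.
y = G @ beta + np.einsum("il,il->i", G, gamma[:, env].T) + noise
print(y.var())
```

Setting `v` and `w` to zero recovers the Hom model, and setting both entries of `v` equal recovers the IID model.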
To simulate binary traits, we treat the generated quantitative trait, $\tilde{y}$, as a disease liability and then threshold to generate the disease label:

$$
y_i = \mathbb{1}\!\left(\tilde{y}_i > \tau\right)
$$

where $\tau$ is the 80th percentile of $\tilde{y}$, so that y represents a disease with prevalence 20%.
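The thresholding step can be sketched in a few lines of standard-library Python. The function name is hypothetical, and the liability here is plain Gaussian noise rather than a trait drawn from the GxEMM model.

```python
import random

def threshold_liability(liability, prevalence=0.2):
    """Label the top `prevalence` fraction of liabilities as cases (1), the rest as controls (0)."""
    n_cases = round(prevalence * len(liability))
    if not 0 < n_cases < len(liability):
        raise ValueError("prevalence must leave at least one case and one control")
    cutoff = sorted(liability)[-n_cases - 1]      # largest liability among controls
    return [1 if value > cutoff else 0 for value in liability]

rng = random.Random(3)
liability = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # stand-in for a simulated liability
y = threshold_liability(liability)
print(sum(y) / len(y))
```

Thresholding at the empirical percentile fixes the sample prevalence exactly, in contrast to thresholding at a theoretical Gaussian quantile.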
We also perform a specific simulation study of large-effect covariates on binary traits (Figure 3, left). Rather than choose α randomly, we set all its entries to 0 except for the main effect of environment 1, i.e., the entry in α corresponding to the first column of Z. This term controls the relative prevalence of the binary trait between environments. We also assess a variant where the disease has low population prevalence but a 50/50 case/control cohort is ascertained by preferentially measuring diseased samples (Figure 3, right). In these simulations, we simulate a population of size 500,000 and then ascertain 5,000 cases and controls; this strategy is computationally limited to modest population prevalences. We draw random SNPs i.i.d. with 50% frequency in these ascertainment simulations.
Finally, we performed simulations specifically designed to compare GxEMM to RNM (Figure S1). To mirror the simulations in Ni et al.,23 we assume a continuous, centered, and scaled univariate environment and exclude other fixed effects. We then define the liability by drawing a 35% heritable trait with variance 1 and then adding the main environmental effect. We vary the size of the main environmental effect to evaluate a range of realistic settings. We note that large environmental main effects are plausible in practice (e.g., Peterson et al.19), especially because candidate environments for GxE are generally proposed based on their direct trait relevance.
Rat Data
We studied 1,407 rats with genotype and partial phenotype information. We use the same 115 traits that we previously used,45 which were deemed suitable for mixed model analysis in the original study.46 We excluded one wound-healing trait where REML struggled to converge, and 24 traits with <1,000 observed samples.
The studied rats are an outbred mixture of eight inbred strains, a strategy designed to increase genetic association mapping power. The outbreeding strategy is not simple to describe, but it is designed to maintain the eight founder alleles at modest frequencies, avoid extreme inbreeding, and ensure mixing between founder strains.
We used the same kinship matrix, covariates, and trait transformations as in Baud et al.,46 which we have summarized in Table S1. In particular, this involves adjusting for trait-relevant covariates—always including sex—and using trait-specific transformations.
To be conservative, we do not impute the phenotypes, though genetics-unaware imputation seems unlikely to substantially bias GxEMM.
CONVERGE Data
We studied 9,303 samples with genotype and covariate information and pairwise kinship <.05, following Peterson et al.19 We studied 14 binary stressors and 10 quantitative measurements as environmental covariates (Table S2). Two stress questionnaire items were very rare and excluded.
For each choice of environment variable (E), we used an intercept, age, ten genetic PCs (from an LDAK relatedness matrix37), and E as fixed effects. We also always include the interactions between these terms and E, which can be important for reducing bias in GxE testing in large samples.47
When fitting mixed models, we use the GCTA-based genetic relatedness matrix (GRM) in simulations and the LDAK-based GRM in real data. We previously found the difference between the LDAK and ordinary GRMs to be qualitatively minimal.19
To analyze a single quantitative covariate z, we linearly scaled it to have minimum 0 and maximum 1, and then defined $Z := (1 - z, z)$. This is analogous to the discrete environment Z matrix: they exactly coincide when the covariate z takes only two values, and in general z can be interpreted as proportional membership between two stylized groups, the min and max values of z. In turn, $v_1$ ($w_1$) describes genetic (non-genetic) variance specific to low z, $v_2$ ($w_2$) describes the genetic (non-genetic) variance specific to high z, and $\sigma_g^2$ describes heritability that is independent of z. This construction of Z also allows naturally extending our definition of discrete environment-specific heritabilities (Equation 2) to quantitative environments: $h_1^2$ ($h_2^2$) is the heritability for samples with the minimum (maximum) observed environmental value. We feel this coding of the environmental effect makes the interaction random effects more interpretable, but we emphasize that the simpler model which evaluates only z (instead of Z) is perfectly valid when its assumptions are met.
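The two-column coding of a quantitative environment can be sketched as follows, assuming the construction is min-max scaling followed by pairing each scaled value with its complement. The function name and the example values are hypothetical.

```python
def two_column_coding(z):
    """Scale z to [0, 1] and return rows (1 - z, z): membership in the 'low' and 'high' groups."""
    lo, hi = min(z), max(z)
    if lo == hi:
        raise ValueError("environment must take at least two distinct values")
    scaled = [(value - lo) / (hi - lo) for value in z]
    return [(1.0 - s, s) for s in scaled]

# A quantitative covariate: the middle value is split evenly between the two groups.
print(two_column_coding([10, 20, 15]))

# A binary covariate reduces to the usual one-hot discrete environment matrix.
print(two_column_coding([0, 1, 1, 0]))
```

Each row sums to one, so a sample's two entries can be read as proportional membership in the minimum-value and maximum-value groups.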
For higher-dimensional environments, the per-environment Z matrices could be concatenated column-wise. The environment-specific heritabilities from Equation 2 are not easily extensible to this setting, however.
psychENCODE Data
We analyzed the processed genotype and prefrontal cortex gene expression data from Gandal et al.48 and Wang et al.49 These data have already been adjusted for nonlinearity in raw expression measurements and large, non-genetic covariate effects, including biological factors like age, technical factors like RIN, and latent confounders as estimated by expression Surrogate Variables.50 We restricted only to measurements from the prefrontal cortex, as cerebellum and temporal cortex had much lower sample sizes.
We restricted only to autosomal SNPs and genes, and we only used SNPs with MAF > 5%, missingness < 10%, and Hardy-Weinberg p > .001. We then filtered our samples such that no pair had genome-wide relatedness above 0.05.22 After these filters, we obtained a sample size of N = 931 with both genotype and gene expression data.
For each autosomal gene, we extracted cis SNPs, which we define here as SNPs within 1 Mb of the gene transcription start or end site. This yielded 24,905 total genes. The median number of cis-SNPs per gene was 2,189. We then constructed cis kinship matrices for each gene by centering and scaling each SNP.
For each gene, we performed Hom, IID, and Free GxEMM with HE, due to the modest sample size. For comparison, we also fit two REML tools: REML-based GxEMM, and the Hom and IID models as implemented in GCTA.27 As expected, GCTA and GxEMM with REML obtained essentially identical estimates for both Hom and IID (Figure S2). And, as expected, the HE estimates were also highly correlated but noisier. We note that GCTA had substantially higher average heritability estimates than REML-based GxEMM (5.8% versus 4.2%), but this was driven almost entirely by GCTA's lower rate of convergence: when restricting to genes where both methods converge, the methods gave similar results (5.8% versus 5.7%). This is redolent of the bias induced by restricting to genes with positive heritability estimates.38
We assess significance in HE by using 10,000 permutations of samples within-disease class (Table S3). Even under noise heterogeneity, this approach is an exact permutation test when genetics are completely null because samples remain exchangeable within-disease class. More generally, our permutations are in line with approximate permutation tests used widely (e.g., for FEATHER42 or HE regression26).
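A within-class permutation scheme of this kind can be sketched with the standard library. The snippet permutes sample indices only among samples sharing a class label, and computes a one-sided permutation p-value with the common +1 correction; the correction, the function names, and the class labels are assumptions for illustration rather than details taken from the paper.

```python
import random

def permute_within_groups(groups, rng):
    """Return a permutation of sample indices that only swaps samples sharing a group label."""
    members = {}
    for index, label in enumerate(groups):
        members.setdefault(label, []).append(index)
    permutation = list(range(len(groups)))
    for indices in members.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        for target, source in zip(indices, shuffled):
            permutation[target] = source
    return permutation

def permutation_p_value(observed, null_stats):
    """One-sided p-value: fraction of null statistics at least as large as the observed one."""
    exceed = sum(stat >= observed for stat in null_stats)
    return (exceed + 1) / (len(null_stats) + 1)

groups = ["SCZ", "control", "BD", "control", "SCZ", "BD", "control"]   # hypothetical labels
perm = permute_within_groups(groups, random.Random(4))
print(perm)
```

In use, the returned indices would reorder one side of the analysis (for example, the genotype rows) before refitting the model, so that any variance differences between classes are preserved in every permuted dataset.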
Results
Quantitative Trait Simulations
We simulate data from increasingly complex polygenic interaction models to assess GxEMM and to compare the Hom, IID, and Free models. We first simulate from the purely additive Hom model, equivalent to GREML, by varying $\sigma_g^2$ and fixing the heterogeneity terms to 0, i.e., $v_k = w_k = 0$. As expected, all GxEMM models performed well, as illustrated by the roughly unbiased Hom estimates of $\sigma_g^2$ (Figures 2A and 2B, gray points). IID and Free also gave unbiased estimates for the total heritability, and their heterogeneity tests were appropriately null (Figure 2A, orange and blue lines).
Second, we draw from the IID model by varying the single heterogeneity parameter, v, and holding $\sigma_g^2$ fixed. As expected, fitting the IID model provides roughly unbiased estimates of both $\sigma_g^2$ and v (Figures 2C and 2D), as does Free GxEMM. However, the Hom model underestimates total heritability (gray points, Figure 2D) and also gives the false impression that genetic factors are shared between environments.
Third, we simulate from the Free model by specifying different levels of genetic variance in each environment, varying v1 and then setting v2 = 1 − v1; we keep = 0 and Hom noise . For all v1, Free GxEMM accurately estimates environment-specific heritabilities (dark orange and green lines in Figures 2E and 2F), but Hom GxEMM underestimates total heritability.
Our fourth setting again draws from the Free model, except now genetics are homogeneous (v1 = v2 = 0) and we instead change the distribution of noise per environment by varying w1 and setting w2 = 1.65 − w1. Free GxEMM again performs well, correctly avoiding GxE false positives (Figure 2G, solid blue) and providing unbiased estimates (Figure 2H). Also, note that allowing Free noise increases Hom power (dotted gray, Figure 2G), showing that GxEMM can be useful even in the absence of GxE.
In this specific simulation, we also evaluate a variant of the Free model allowing Free GxE but assuming Hom noise. Related models have been used before.23,51 However, this approach is susceptible to dramatic, replicating false positives under noise heterogeneity (Figure 2G, dotted blue lines) as well as severe bias: for example, consistently estimating a 10% heritable trait to have 100% heritability (dashed dark orange and green lines with diamonds, Figure 2H). We provide a theoretical characterization of this bias in Appendix A. Such GxE models that do not allow noise heterogeneity should almost never be used in practice.
We used Wald tests in these simulations, but results were similar when using LRT (Figure S3). Also, although we used two discrete environments here for simplicity, simulations with bivariate quantitative environments gave similar results (Figure S4). Finally, we also found that the GxEMM standard error estimates were accurate or slightly conservative (Figure S5).
We report the runtimes for the three GxEMM models on this simulated dataset, other simulated datasets, and the three real datasets in Table S4. Broadly, GxEMM runs in tens of minutes for thousands of samples, with higher costs for richer models and larger sample sizes.
Binary Trait Simulations
We next examine binary traits by simulating a 20% prevalence disease with and a discrete, binary environment with a 25%/75% split (Material and Methods), roughly based on the CONVERGE data (below). We focus on simulations from the Hom model to assess which approaches yield calibrated genetic heterogeneity tests. We compare fitting GxEMM either with REML, using the standard liability scale adjustment, or with PCGC, which directly models the binary nature of the trait.
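To make the liability threshold setup concrete, the sketch below computes the threshold that yields a target overall prevalence when the two environments have different mean liabilities. It is only an illustration under stated assumptions: liabilities are taken to be normal with unit variance within each environment, and the function names and the bisection solver are hypothetical choices, not the simulation code used in the paper.

```python
from statistics import NormalDist

def env_prevalence(threshold, mean_liability):
    """Fraction of an environment above the liability threshold,
    assuming unit-variance normal liability within the environment."""
    return 1.0 - NormalDist(mean_liability, 1.0).cdf(threshold)

def overall_prevalence(threshold, env_means, env_fractions):
    """Prevalence in the mixture of environments."""
    return sum(f * env_prevalence(threshold, m)
               for m, f in zip(env_means, env_fractions))

def solve_threshold(target, env_means, env_fractions, lo=-10.0, hi=10.0):
    """Bisection for the threshold giving the target overall prevalence.
    Prevalence decreases as the threshold increases."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if overall_prevalence(mid, env_means, env_fractions) > target:
            lo = mid  # too many cases: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: 20% prevalence, 25%/75% environment split, lower mean liability
# in the smaller environment (the value -1.0 is an arbitrary illustration).
t = solve_threshold(0.2, [-1.0, 0.0], [0.25, 0.75])
prev_env1 = env_prevalence(t, -1.0)
prev_env2 = env_prevalence(t, 0.0)
```

With equal mean liabilities the threshold reduces to the 80th percentile of the standard normal; shifting the mean in one environment creates the differential prevalence examined in the simulations.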
We first assessed the impact of differential prevalence between environments by varying , the mean liability in environment 1 (Figure 3, left). For , both estimators (PCGC and REML) and all GxEMM models (Hom, IID, and Free) perform well. PCGC continued to perform well for all tested , giving calibrated heterogeneity tests and unbiased heritability estimates for the larger group, ; however, the heritability in the smaller group, , was conservative (or very noisy at , as basically no cases are observed in environment 1).
On the other hand, REML breaks down in several ways for . First, the estimate is downward biased, as expected.24 Second, becomes upward biased, and the IID heterogeneity test becomes inflated. Third, Free GxEMM gives severely biased environment-specific heritability estimates, and often badly inflated heterogeneity tests.
To assess power, we repeated these simulations under Free genetic heterogeneity and found that IID GxEMM had power to detect genetic heterogeneity for both REML and PCGC (Figure S6, left). However, Free GxEMM had low power, reflecting the loss of information from the liability thresholding process. Second, under Hom genetics and Free noise, PCGC was unbiased or slightly conservative while REML was again upwardly biased (Figure S6, right). Third, we performed binary trait simulations mirroring the quantitative trait simulations in LMMs for Polygenic Effects (GREML) to broadly test PCGC-based GxEMM (Figure S7). PCGC performed similarly to REML for quantitative traits, except that power is lower, as expected, and PCGC performed poorly under extreme variance heterogeneity, likely because the underlying first-order approximation breaks down in extreme settings.
Disease studies often preferentially ascertain disease cases to increase power. As ascertainment causes bias in GREML,24 we assessed its impact on GxEMM through simulations. We fixed (similar to CONVERGE) and ascertained 50/50 case/control cohorts (Material and Methods). Although GxEMM with REML breaks down under ascertainment—in particular, Free/IID GxEMM tests have roughly 50%/25% false positive rate for 1% prevalence—PCGC remains calibrated (Figure 3, right).
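Case/control ascertainment of the kind described above can be sketched as a simple resampling step applied to a simulated population. The function name and arguments are hypothetical, and the paper's exact ascertainment procedure is given in its Material and Methods, not here.

```python
import random

def ascertain_case_control(status, n_cases, n_controls, seed=0):
    """Sample a fixed number of cases and controls from a population.

    `status` is a list of 0/1 disease indicators for the full simulated
    population; the return value is the sorted list of sampled indices.
    """
    rng = random.Random(seed)
    cases = [i for i, s in enumerate(status) if s == 1]
    controls = [i for i, s in enumerate(status) if s == 0]
    if len(cases) < n_cases or len(controls) < n_controls:
        raise ValueError("population too small for the requested cohort")
    return sorted(rng.sample(cases, n_cases) + rng.sample(controls, n_controls))
```

For a 1% prevalence disease, a 50/50 cohort requires a simulated population roughly 50 times larger than the number of cases retained, which is why ascertained cohorts differ so strongly from random population samples.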
Finally, we evaluated the RNM model, a recent REML-based polygenic GxE model implemented in the MTG2 software package.52 To streamline comparison, we performed simplified versions of the above non-ascertained simulations with a continuous, univariate environment and no additional covariates. Consistent with our above results (Figure 3, left), we found that PCGC broadly performed well, as did REML-based GxEMM and RNM when the environment had no main effect (Figure S1). However, both RNM and REML-based GxEMM become biased as the environment main effect grows. In particular, RNM obtains roughly 90% false positive rates when the environment explains roughly 50% of the liability-scale phenotypic variation.
In summary, repurposing standard polygenic GxE tools designed for quantitative traits is not viable for binary traits. On the other hand, PCGC-based inference provides calibrated heterogeneity tests and approximately unbiased GxEMM parameter estimates. We caution, though, that PCGC estimator variances are large and, in particular, the Free model adds negligible value beyond the IID model for binary traits at these sample sizes.
Phenome-wide Sex-Specific Genetics in Outbred Rats
Our first application is to sex-specific genetic effects across 115 phenotypes in 1,407 outbred rats from the Rat Genome Sequencing and Mapping Consortium46 (Material and Methods). These samples have high genetic relatedness, which aids mixed model power. Although many traits are known to be sexually dimorphic, it is not generally well known to what extent, and for which traits, sex differences are driven by autosomal genetic variation.
First, we fit Hom GxEMM and found 105/115 traits were heritable at a Bonferroni threshold of p < .05/115, with average (gray violin, Figure 4A). We next fit IID GxEMM, testing for GxSex interaction, and found that the number of heritable traits increased to 112/115 (rust violin). The average IID heritability was 73.8%, which is 9.7% larger than the GREML heritability. The sex-specific component of the IID model explained 30.3% of trait variance on average and was Bonferroni significant for 12 traits (gold violin), including measurements of bone density, hemoglobin, platelets, serum composition, and glucose trial response (Figure S8, Table S1).
On average across all traits, the Free heritability estimates were similar, for both sexes, to the IID model ( orange violin and green violin, Figure 4A). Nonetheless, the Free model uncovered many traits with sex-specific genetic variance, including eight that were Bonferroni significant (orange vF and green vM violins, Figure 4B). All eight had significant in the IID analysis, but the Free analysis more precisely characterizes the genetic heterogeneity. For example, 3/8 of these traits have significantly different genetic effect sizes between sexes (Wald p < .05/115; 7/8 have p < .05): female-specific genetic effects drove two bone measurements, white blood cell count, and serum chloride; male-specific genetic effects drove two platelet traits, glucose response, and serum potassium (Figure S8). The Free model also found sex-specific noise levels for 5/115 traits (p < .05/115, pink violin), including three of the genetically heterogeneous traits and an additional bone density trait.
To further explore the data and illustrate the GxEMM model, we plot results for two of the traits with sex-specific heritability. The first is a femur size measurement (distal femur cortical area, Figure 4C). In Hom GxEMM, this trait has high heritability, roughly 75%. Nonetheless, heritability is increased to roughly 85% by the IID model and, moreover, the homogeneous component of IID GxEMM is near zero. Finally, the Free model goes further by revealing that the sex-specific genetic effects are primarily active in females. This analysis shows that GxEMM can uncover dramatically non-additive genetic architecture even when ordinary GREML explains substantial heritability.
An approximate mirror image is obtained for a glucose tolerance trait (area under glycemia curve over baseline during intraperitoneal glucose tolerance test, Figure 4D): the Hom heritability estimate is large, yet homogeneous heritability largely vanishes after accounting for sex-specific heritability in IID GxEMM, which is in turn revealed by the Free model to be driven largely by male-specific genetic effects. This result is consistent with a previously reported sex-APOE interaction effect for glucose.53 This rich characterization of sex-specific architecture has clear implications for genetic association studies: power can be maximized by studying the sex with greater heritability, and associations can be interpreted in light of known average differences between sexes.
Polygenic Stress Interaction in Major Depression
We next apply GxEMM to major depression (MD), a moderately heritable disease that is likely genetically and environmentally heterogeneous.54 We analyzed the CONVERGE cohort, which recruited about 10,000 Han Chinese women between 30 and 60 years old. Women were chosen because they have higher MD heritability than men (i.e., 42% versus 29%55), and clinically ascertained MD-affected case subjects were selected for the same reason.56 This strategy successfully led to replicated GWAS hits for MD54 and later yielded three SNP effects specific to people without major lifetime stress.19 Here, we extend these SNP heterogeneity analyses to the polygenic level with GxEMM using our robust approach for binary traits based on PCGC.
We first fit Hom GxEMM with PCGC to MD and found (SE 4.5%) assuming an 8.8% population prevalence (Figure 5).19 Importantly, we adjusted for interactions between genetic PCs and the environment, which can be essential for avoiding bias and false positives.47
We then fit 14 IID GxEMM models with each of 14 different binary stress measures (CONVERGE Data). IID modestly increased heritability on average (, Table S2), broadly supporting polygenic stress interactions. One stress measure (divorced/separated/widowed status) was Bonferroni significant (p = 0.0020 < .05/14). This measure had relatively high prevalence and MD effect severity, which both increase interaction test power; more severe measures (e.g., “Child Abuse”) were less frequent, and more frequent measures (e.g., “Natural Disaster”) were less severe. Altogether, the IID model supports polygenic stress interaction for MD, but large standard errors and high correlations between measures prevent conclusions about which specific major life stresses drive the interactions.
The Free model performed reasonably for the non-stress groups, finding slightly higher heritability relative to IID (43.4% versus 41.2%) but not finding any non-stress-specific genetic variance at Bonferroni significance. More importantly, however, the Free model performed badly in the smaller stress groups, consistent with simulations (Figure S6, left). Similarly, differential noise estimates had large standard errors and none were significantly different from 0.
Although we primarily focus on stress interactions due to prior knowledge that they are relevant to MD, we also investigated polygenic interaction across each of ten quantitative environmental measures (Material and Methods). IID results were qualitatively similar to the results from the binary stress measures, with average heritability increasing to (Figure S9) and one measure, “Cold Mother,” that was nearly Bonferroni significant for polygenic interaction (p = 0.0079 > .05/10; “Cold Father” point estimates were similar and had p = .0198). As for the binary stress measures, the Free model was not informative. Overall, IID GxEMM had power in this dataset and provided evidence for genetic heterogeneity, but larger sample sizes seem necessary for Free GxEMM to be useful.
Brain Expression Heterogeneity in Psychiatric Disease
Finally, we apply GxEMM to brain gene expression data from psychENCODE48,49 to test for differential genetic and non-genetic factors governing transcriptional regulation in psychiatric disease relative to control subjects. In this analysis, disease state plays the role of the “environment” in GxEMM. After quality control (Material and Methods), the data include 931 samples, including 356 schizophrenia-affected case subjects (SCZ), 158 bipolar disorder (BPD)-affected case subjects, and 417 control subjects (CTRL), each measured on 24,905 genes. In this analysis, we focus on fitting GxEMM with HE regression because REML is computationally and statistically unstable at smaller sample sizes, especially for richer models.26 (We note, though, that REML gave broadly consistent results when it converged [Figure S2].) For each gene, we estimate local cis-heritability by using relatedness matrices built specifically from SNPs within 1 Mb of the gene’s transcription start site.
We first fit Hom GxEMM and found heritability of 4.06% (SE 0.07%) on average across the genome (Table 2). This is in line with previous transcriptome-wide heritability estimates from whole blood.26,57 Here and below, we test transcriptome-wide averages with a Wald test, where standard errors are calculated assuming each gene is independent, which is an established approximation (e.g., Hernandez et al.26).
Table 2.
Model | Hom | IID | Free (BPD) | Free (SCZ) | Free (CTRL) |
Mean (%) | 4.06 | 4.12 | 4.32 | 4.34 | 3.82 |
SE (%) | 0.07 | 0.07 | 0.08 | 0.08 | 0.07 |
Wald p for heritability > 0 | <2 × 10−16 | <2 × 10−16 | <2 × 10−16 | <2 × 10−16 | <2 × 10−16 |
Wald p for increase over Hom | – | 1.15 × 10−11 | 1.93 × 10−9 | <2 × 10−16 | 1.00 |
Wald p for increase over IID | – | – | 3.83 × 10−6 | <2 × 10−16 | 1.00 |
Because N is modest, we fit GxEMM with HE regression. We compare the models using Wald tests, estimating the standard errors of the transcriptome-wide averages assuming independence across genes. Dash (–) indicates that the comparison is not meaningful.
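The Wald test on a transcriptome-wide average can be sketched as below. This is a minimal illustration assuming independent genes, a normal reference distribution, and a one-sided alternative; the function names are hypothetical and the one-sidedness is an assumption of the sketch.

```python
import math

def mean_and_se(estimates):
    """Mean of per-gene estimates and its standard error, treating
    genes as independent."""
    n = len(estimates)
    mean = sum(estimates) / n
    var = sum((x - mean) ** 2 for x in estimates) / (n - 1)
    return mean, math.sqrt(var / n)

def one_sided_wald_p(mean, se, null_value=0.0):
    """P value for H1: true mean > null_value, using a normal reference."""
    z = (mean - null_value) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Using the Hom summary values reported above (mean 4.06%, SE 0.07%):
p_hom = one_sided_wald_p(4.06, 0.07)
```

Here the z statistic is 4.06 / 0.07 = 58, so the p value is far below the 2 × 10−16 reporting floor. Comparisons between two models fit to the same genes would additionally need the standard error of the per-gene differences, which this sketch does not compute.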
To investigate transcriptome-wide patterns of disease-dependent genetic architecture, we next fit the IID model. We estimated transcriptome-wide average heritability of 4.12% (SE 0.07%; Table 2), which slightly (1.5%) but statistically significantly (p < 2 × 10−16, paired t test) increased heritability over the homogeneous GREML model. We then fit the richer Free model, which estimated disease-specific heritabilities of 4.32% (SE 0.08%), 4.34% (SE 0.08%), and 3.82% (SE 0.07%) for BPD, SCZ, and CTRL, respectively. For BPD and SCZ, the increase over the Hom estimate is significant (both p < 2 × 10−16); the increases over the IID estimate are also statistically significant (p = 3.83 × 10−6 and p < 2 × 10−16). Conversely, the CTRL estimate is lower than the Hom estimate, suggesting that GREML actually overestimates the heritability that is common across all disease states, consistent with simulations (Figures 2D and 2F). Overall, GxEMM shows statistically significant interaction between psychiatric disease state and genetic control of gene expression in the prefrontal cortex, and the Free model further suggests an increased role of genetics specifically active in the disease states.
To confirm that our results were not primarily driven by confounding from population structure, we performed the same analysis after restricting our dataset to samples from the “European” population, which was the most numerous population in our psychENCODE data (N = 872). We found that transcriptome-wide average estimates were essentially unchanged (Table S5).
We next tested for genetic heterogeneity at individual genes, which is less powerful but more precise than transcriptome-wide average tests. Based on simulations, we have low power for genetic heterogeneity tests across all genes at this sample size (Figure S10). Instead, we test 63 genes with known significant gene-based genetic associations for SCZ.58 We test for significance using within-disease group permutations (Material and Methods).
IID GxEMM does not find any gene with significant polygenic-disease interaction at 25% FDR (Benjamini-Hochberg, Figure 6A, Table S3). However, Free GxEMM detects 9 associations across the three subgroup-specific genetic variances (3 are Bonferroni significant). These results are consistent with our transcriptome-wide average tests, where we found that the Free model had considerably higher power than IID.
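The two multiple-testing rules used in this section, Benjamini-Hochberg FDR control and Bonferroni correction, can be sketched as follows. The function names are illustrative, and the p values in the usage line are placeholders apart from the stress-measure value quoted earlier in the text.

```python
def benjamini_hochberg(p_values, fdr=0.25):
    """Benjamini-Hochberg step-up procedure.
    Returns a list of booleans marking rejected hypotheses."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            cutoff_rank = rank  # largest rank passing its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Reject when p < alpha / (number of tests)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# One test at p = 0.0020 among 14 tests, as in the stress-measure analysis;
# the other 13 p values are placeholders.
stress_hits = bonferroni([0.0020] + [0.5] * 13)
```

Bonferroni controls the family-wise error rate and is the stricter of the two, which is why some associations can pass the 25% FDR threshold without being Bonferroni significant.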
Figures 6B and 6C show the GxEMM estimates for two of these nine genes. First, SNX19 is an SCZ-associated protein coding gene that also has known experimentally validated functional eQTL that overlap SCZ risk loci.59, 60, 61, 62, 63 This gene has high cis-heritability in the Hom model (roughly 75%). Next, the well-established IID model fails to find even suggestive genetic heterogeneity. However, the rich Free model finds substantial heterogeneity that is largely driven by SCZ-specific genetics. This is consistent with simulation results showing low power for the IID GxE test when most of the specific heritability is concentrated in one group.
The second significant gene we illustrate, FURIN, is a protease that has been robustly associated with neurodevelopment in zebrafish and human neural progenitor cell development in vitro.64 FURIN also has a strong cis-eQTL that colocalizes with GWAS-significant SCZ SNPs,65 and this eQTL was experimentally shown to modify FURIN expression and, in turn, brain-derived neurotrophic factor.66 In psychENCODE, we found that FURIN has essentially zero homogeneous or IID heritability, in contrast with SNX19. On the other hand, the Free model is able to reveal significant heritability specifically in SCZ. These stories both support the links between genotype, expression, and disease and demonstrate how GxEMM can be used to uncover genetic variants with disease-specific functional genomic effects. We also found Bonferroni-significant genetic heterogeneity for CHRNA2, a previously identified “high confidence” gene supported by association with SCZ and Hi-C interaction with nearby eQTL.49
Finally, we tested for differential expression variance between disease groups transcriptome-wide. We find evidence for pervasive differences in expression variance across disease states: for example, 2,459 genes (roughly 10%) have Levene p < .001. This heterogeneity was driven primarily by higher expression variance in SCZ: of these 2,459 genes, 97.1% had higher variance in SCZ than CTRL. We also note that 92.2% of genes had higher variance in SCZ than BPD, but this is at least partially because SCZ has higher sample size (and thus power). Adding confidence in the Levene tests, we permuted disease labels and found no evidence of test inflation (e.g., 10% of genes have Levene p < .099 after permuting), suggesting disease state truly correlates with expression variance. Although this test does not establish genetic heterogeneity, it does show that Free GxEMM is likely more powerful than IID: in simulations, variance heterogeneity increased Free power over IID regardless of whether it was genetic (Figure 2E) or non-genetic (Figure 2G).
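For readers unfamiliar with the Levene test, the statistic can be sketched in a few lines. The version below centers each group at its median (the Brown-Forsythe variant); the text does not say which centering was used, so that choice and the function name are assumptions of this sketch.

```python
from statistics import median

def levene_statistic(groups):
    """Levene-type statistic for equality of variances across groups.

    `groups` is a list of lists of expression values, one list per
    disease group. Uses median centering (an assumption).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations of each observation from its group median.
    deviations = []
    for g in groups:
        center = median(g)
        deviations.append([abs(y - center) for y in g])
    group_means = [sum(z) / len(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total
    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(deviations, group_means))
    within = sum((x - m) ** 2
                 for z, m in zip(deviations, group_means) for x in z)
    if within == 0.0:
        raise ValueError("no within-group variation in absolute deviations")
    return (n_total - k) / (k - 1) * between / within
```

Under the null, the statistic is conventionally compared with an F distribution with (k − 1, N − k) degrees of freedom; permuting group labels, as done above, offers a distribution-free check of calibration.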
Discussion
Gene-environment interactions (GxE) and polygenicity are separately well documented. We help bring these concepts together with a linear mixed model for polygenic GxE that we call GxEMM. GxEMM consists of three key models: Hom, IID, and Free. Each pairwise comparison adds biological interpretation: comparing Hom to IID can demonstrate the existence of GxE, and comparing IID to Free can demonstrate and appropriately adapt to heteroscedasticity across environments. Finally, comparing Hom and Free models can add power for detecting GxE over the IID model; for example, the Free model, but not the IID model, had power to show GxE for SNX19 and FURIN in psychENCODE. Generally, GxEMM can be used for any covariate putatively interacting with the genome, including genetic variants,67 study indicators for meta-analysis68,69 or genotype quality control,70 or phenotypic subtypes.71
We have introduced several methods to fit GxEMM, based on REML, HE, and PCGC; all are implemented as an R wrapper of the LDAK software package. In practice, only one method is recommended for a given dataset: HE for small sample sizes, PCGC for binary traits, and REML for large sample sizes and continuous traits. These same concepts apply to standard homogeneous heritability estimates.
GxEMM unifies, formalizes, and solves several important biases in recent approaches to estimate GxE-based heritability. In particular, only GxEMM can accommodate general environments, noise heterogeneity, modest sample size, and binary traits, which we support with theory and simulation.
There are several methodological limitations to GxEMM. First, like most LMM methods, GxEMM is computationally intensive, as it fits several variance components to individual-level data. In special cases, the problem can be rewritten to roughly reduce the computational complexity from to .21,28 In the future, we will investigate larger sample sizes using randomized method of moments to fit GxEMM43,44 (Material and Methods).
A second limitation is our assumption of Gaussian random effects—although the central limit theorem does support this for polygenic effects. But non-normality is not obviously more salient for GxEMM than for ordinary GREML, and we used quantile normalization in psychENCODE and trait-specific transformations for the rat traits to mitigate the impact of outliers.26,42,72 We note that these normalizations, too, have downsides, and in particular reduce power for Levene tests for variance heterogeneity.
A third limitation to GxEMM is that we do not test or correct for gene-environment (G-E) correlation. This is a well-known source of bias in the fixed effect case.30 Surprisingly, however, Ni et al.23 did not observe inflation for polygenic interaction tests in simulations. In fact, we prove that this is due to a fundamental difference between the polygenic and single-SNP case (Proposition 3): absent systemic coordination between the main genetic effects on the environment and trait, the per-SNP biases average out over the genome. While such coordination would be intrinsically interesting, it has not been documented in real data and is not likely to be large in practice, and thus polygenic G-E correlation is unlikely to cause false positives for polygenic GxE tests. Nonetheless, we describe how to incorporate G-E correlation within the GxEMM model framework in the Material and Methods, for completeness.
Fourth, we did not fit Full GxEMM, which can learn structured genetic correlations across environments. Such models are flexible but complex, with many degrees of freedom and non-trivial identification concerns. However, at biobank-scale sample sizes, there will likely be sufficient power to fit the Full model.
A fifth limitation is that GxEMM does not currently allow random effects beyond the core Hom and GxE variance components. However, in the context of family studies (e.g., Diego et al.73) or high-dimensional covariates (e.g., Moore et al.21), it can be valuable to model background confounding variables as random effects in GxE tests. In future work, GxEMM could in principle be easily extended to accommodate such additional variance components, e.g., Zaitlen et al.74 jointly fit pedigree- and additive-relatedness matrices in the homogeneous setting. Nonetheless, in practice, we use and recommend standard filters on kinship and population structure corrections when analyzing natural populations, as is standard in GREML analyses.22 In the case of the rat experimental data, we did not perform these steps because confounding by cryptic relatedness is very unlikely;45,46 however, additional random effects may still be helpful, e.g., for cage effects.
The results from our rat analysis show widespread sex-specific heritability. The Free model, specifically, has strong statistical support for several traits even at stringent thresholds. This power derives from the high levels of genetic relatedness in these model organisms. Likewise, GxEMM will be powerful in other model organism and family studies, which can help build broad intuition about causal genetic architecture that can inform human disease studies. This is particularly true for autosome-sex interactions, which may be important for many complex human diseases.75
There are several limitations to our analysis of the psychENCODE data. First, our LDAK-based REML approach often failed to converge, so we used HE regression, which is robust but less powerful than REML. Second, cell type proportions are known to vary between people and disease states, causing bulk differential expression across disease groups for any gene that is simply differentially expressed across cell types. In future work, we will extend GxEMM to partition expression heritability into cell type proportion effects and within-cell type effects.
GxEMM may also be useful for GWAS applications. An IID-based approach has been shown to improve power and calibration over the standard Hom model in some cases,47 suggesting that Free GxEMM will likely yield further improvements when the IID assumption fails.
Declaration of Interests
The authors declare no competing interests.
Acknowledgments
We would like to thank Doug Speed for adding GxEMM to LDAK, which will enable GxEMM analyses on tens of thousands of samples by increasing speed and reducing memory requirements relative to our R implementation that calls LDAK.
N.Z. is supported by NIH K25HL121295, U01HG009080, R01HG006399, R01CA227237, R03DE025665, R01ES029929, and DoD W81XWH-16-2-0018. A.D. is supported by U01HG009080 and R01HG006399.
Published: January 2, 2020
Footnotes
Supplemental Data can be found online at https://doi.org/10.1016/j.ajhg.2019.11.015.
Contributor Information
Andy Dahl, Email: andywdahl@gmail.com.
Noah Zaitlen, Email: noah.zaitlen@ucsf.edu.
Appendix A: Unmodelled Heteroscedasticity Causes Bias in Polygenic Interaction Estimates
In this appendix we quantify the bias under the Free GxEMM model when environmental heterogeneity is ignored. We assume discrete environments and that the modeler fits the similarity matrices , where Kk is the kinship matrix K with entries outside of environment k zeroed out. We assume that for all k. We also assume that the true variance of y is:
(Equation A1) |
where Ek is the identity matrix with entries outside environment k zeroed out—i.e., we assume there is no genetic variance but that the noise variance may differ across environments when for some j and k. For simplicity, we also ignore the homogeneous genetic component and covariates. For identifiability, we assume ; under this constraint, is the average phenotypic variance across all samples. Finally, we assume approximate independence between genetics and environment. It is worth noting how simple this model is—pure heteroscedastic noise.
We analytically study the behavior of the HE regression, which is more tractable than the MLE.
Proposition 1. Assume for all k that and that and for all k. Under the pure heteroscedastic noise model (Equation A1), the bias in the Free genetic heterogeneity estimate with Hom noise is:
(Equation A2) |
Proof. The moment estimating problem is:
Note that modeling the diagonal terms, rather than regressing only on the strict lower triangular entries, is important when variance may be heterogeneous across samples.
The OLS solution for the parameters v and is given by:
where n is the vector of environment-specific sample sizes, , and D is diagonal (because for ) with . In expectation:
The last equality derives from our identification assumption on . Likewise, the expected values of the quadratic forms in S are:
Let . The expected variance component estimates are (using block inversion in the second equality):
The last line follows from some algebra and nicely emphasizes that the are unbiased when wk = 0 for all k, i.e., under noise variance homogeneity.
The dk terms can be simplified assuming that the environments are independent of the genotypes and that each , so that:
The approximation in the last line uses the assumption that the same a and b values can be used across k, which is implied by our assumed gene-environment independence and large nk.
To approximate dk, we assume that nk = n0 is constant across all groups, giving:
The last equality derives from our identification assumption on . This approximation could be expanded to capture modest variability across nj by Taylor approximation around n0.
□
In the case of noise homogeneity, wk = 0 for all k, hence the HE estimate is unbiased, as is well known. The bias is also small in the case where the kinship matrix is very far from the identity (either because off-diagonals are large, increasing b, or because on-diagonals are highly variable, increasing a). However, in all other settings the expected estimate for environment-k-specific heritability vk will be proportional to the true environment-k-specific noise (wk), attenuated by a factor measuring the distance between K and I.
We show that this bias approximation can be accurate in practice by evaluating the HE regression model fits from the quantitative simulations performed in the main text. Specifically, we fit HE regression ignoring noise heterogeneity to the simulations with homogeneous genetics but heterogeneous noise (Figures 2G and 2H) and compare the estimated genetic heterogeneity with our bias approximation (Figure 7). We find that our bias approximation is accurate even though these simulations include nonzero heritability and fit a homogeneous genetic variance component, unlike our simplified theoretical analysis in this appendix—supporting the practical relevance of our bias approximation.
Appendix B: Identification of V/W and Standardizing G/Z
Parameter Identification
The Free and IID assumptions resolve an identification problem that arises for the genetic variance parameters of the Full model, , when Z is discrete. Under the Full model, any constant λ can be passed between and V without changing the likelihood by , with a vector of 1 s. Conceptually, this is equivalent to the fact that the population mean and the environmental main effects are not jointly identified in linear regression models with discrete environments.
Free and IID GxEMM break this symmetry because Vij and cannot both be 0 for nonzero λ. Intuitively, the Free and IID models prioritize pairs where the mean effect across environments is absorbed into (so ). We also note that some parameters in the model of Ni et al.23 are not identified for this reason, and the problem grows as the number of environments increases.
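The identification problem can be checked numerically. The sketch below assumes that the genetic covariance under the Full model has the form K ∘ (σ²g 11ᵀ + Z V Zᵀ), with Z a one-hot encoding of discrete environments; this form, the symbol names, and the helper functions are assumptions made for illustration. Because K multiplies every entry, it suffices to compare the bracketed weight matrices.

```python
def genetic_weights(sigma2_g, V, env):
    """Entries of sigma2_g * 11' + Z V Z' for one-hot Z.
    env[i] is the environment index of sample i."""
    n = len(env)
    return [[sigma2_g + V[env[i]][env[j]] for j in range(n)]
            for i in range(n)]

def shift(sigma2_g, V, lam):
    """Pass a constant lam from V into the homogeneous variance."""
    return sigma2_g + lam, [[v - lam for v in row] for row in V]

def is_diagonal(V, tol=1e-12):
    k = len(V)
    return all(abs(V[i][j]) <= tol
               for i in range(k) for j in range(k) if i != j)

env = [0, 0, 1, 2, 1]
V_full = [[0.3, 0.1, 0.0], [0.1, 0.5, 0.2], [0.0, 0.2, 0.4]]
s2_new, V_new = shift(0.2, V_full, 0.1)
same = genetic_weights(0.2, V_full, env) == genetic_weights(0.2, V_full, env)

# A diagonal (Free) V does not stay diagonal after the shift, so the
# shifted parameters fall outside the Free model.
V_free = [[0.3, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.4]]
_, V_free_shifted = shift(0.2, V_free, 0.1)
```

The shifted pair produces the same weight matrix under the Full model, so the two parameter settings are indistinguishable from data, whereas restricting V to be diagonal rules out every nonzero shift.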
W is also not fully identified for discrete Z. First, off-diagonal entries cannot be estimated, so the Full and Free models are identical. Second, , which eliminates a degree of freedom from the diagonal of W. In particular, the IID and Hom noise models coincide.
Covariate Scale Identification
Columns of G can be assumed mean zero WLOG because the restricted likelihood is invariant under the mapping for all :
where the equivalence is modulo the projection orthogonal to the fixed effects, which include 1N and Z ( stacks γ into a K × L matrix column-wise).
Z, however, cannot be centered without changing the likelihood, because G is not projected out as a fixed effect. Mapping , now for , the initial covariance parameters are equivalent to the new parameters iff
where JK is a matrix of 1s. So adding μ to the columns of Z comes WLOG if and only if there always exists a such that the implied above remains in the set of allowed covariance matrices. In particular, demeaning the columns of Z comes WLOG only under the Full model, because the LHS in the final equation is non-diagonal for .
Fundamentally, this asymmetry between Z and G is because we treat the former’s main effect as fixed (α) and the latter’s as random (β).
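The asymmetry between shifting G and shifting Z can be illustrated on a single realization of the random effects. The sketch below assumes the genetic part of the phenotype is Gβ + Σ_k Z_k ∘ (Gγ_k) and that the fixed effects are 1_N and Z; the function name `genetic_term` and all dimensions are hypothetical, and the check compares realized genetic terms after projection rather than evaluating the restricted likelihood itself.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, K_env = 50, 30, 3

G = rng.standard_normal((N, L))
Z = np.eye(K_env)[rng.integers(0, K_env, size=N)]  # one-hot environments
beta = rng.standard_normal(L)                       # homogeneous SNP effects
gamma = rng.standard_normal((K_env, L))             # row k: effects specific to environment k

def genetic_term(G, Z):
    """Assumed model: G beta + sum_k Z_k o (G gamma_k), 'o' elementwise."""
    return G @ beta + np.sum(Z * (G @ gamma.T), axis=1)

# Projection orthogonal to the fixed effects [1_N, Z]; pinv handles the
# collinearity between the intercept and one-hot Z.
X = np.column_stack([np.ones(N), Z])
P = np.eye(N) - X @ np.linalg.pinv(X)

mu_G = rng.standard_normal(L)      # shift added to each column of G
mu_Z = rng.standard_normal(K_env)  # shift added to each column of Z

base = P @ genetic_term(G, Z)
shift_G = P @ genetic_term(G + mu_G, Z)  # broadcasting adds mu_G[l] to column l
shift_Z = P @ genetic_term(G, Z + mu_Z)  # [1_N, Z + shift] spans the same space as X

g_invariant = np.allclose(base, shift_G)
z_invariant = np.allclose(base, shift_Z)
print(g_invariant, z_invariant)
```

Shifting G only adds terms in the span of 1_N and Z, which the projection removes. Shifting Z adds a term of the form G(Γᵀμ), which is not a fixed effect and so must be absorbed by re-parameterizing the covariance, consistent with the statement that this is possible only under the Full model.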
The scale of columns of Z is irrelevant for Free GxEMM because the feasible set for V is closed under conjugation by diagonal matrices: multiplying column k of Z by λ can be counterbalanced by multiplying row k and column k of V by $1/\lambda$. However, the column scaling of Z can be very important for IID GxEMM. By default, we do not scale or center Z when it measures discrete environments in GxEMM. We do, nonetheless, require that , which comes without loss of generality and permits simpler formulas for heritability. We note that Moore et al.21 and common practice in many machine learning tasks center and scale columns of Z; when environments are discrete, this corresponds to an assumption that rarer environments harbor larger specific genetic effects.
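A small numerical sketch of the scaling argument follows. It compares the environment kernel Z V Zᵀ before and after rescaling one column of Z, under a diagonal V (Free) and under V = vI (IID); the kernel form, the continuous Z, and all values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K_env = 40, 3
Z = rng.standard_normal((N, K_env))  # continuous environments, for illustration
k, lam = 1, 2.5

Z_scaled = Z.copy()
Z_scaled[:, k] *= lam                # rescale column k of Z by lam

# Free: V is diagonal. Scale row k and column k of V by 1/lam.
V_free = np.diag([0.3, 0.1, 0.4])
D = np.eye(K_env)
D[k, k] = 1.0 / lam
V_adj = D @ V_free @ D               # still diagonal, so still a valid Free V
free_ok = np.allclose(Z @ V_free @ Z.T, Z_scaled @ V_adj @ Z_scaled.T)

# IID: V = v * I. Find the best-fitting v' after rescaling and test for equality.
v = 0.3
target = v * (Z @ Z.T)
A = Z_scaled @ Z_scaled.T
v_best = np.sum(A * target) / np.sum(A * A)  # least-squares fit of target by v' * A
iid_ok = np.allclose(target, v_best * A)

print(free_ok, iid_ok)
```

Under Free GxEMM the rescaling is absorbed exactly, whereas no single scalar v' recovers the original kernel under IID, which is why column scaling matters there.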
In the Full model, the columns of Z become identified only up to their span. This has pros and cons: a benefit is invariance to linear transformations, but a con is that individual elements of V become meaningless. Conversely, individual elements of the diagonal V fit by Free GxEMM are meaningful, as the explicit assumption of diagonal V identifies a basis for that span.
Appendix C: Gene-Environment Correlation and GxE Bias in Polygenic Setting
It is well known that gene-environment (G-E) correlation can cause bias in gene-environment interaction (GxE) tests if uncorrected. This is a general statistical issue in fixed effect regression, and main covariate effects should almost always be adjusted when testing interactions. This has been emphasized recently in the GxE literature for testing individual genetic variants; e.g., Dudbridge and Fletcher30 provide a worked example analyzing the simplest case of a single causal genetic variant.
Ni et al.23 recently introduced a polygenic GxE model that incorporates G-E correlation. The motivation for modeling G-E correlation was by analogy to the fixed effect setting, where the bias is not in dispute. However, Ni et al.23 provide no reason this bias would persist in the polygenic setting, nor simulations where G-E correlation causes false-positive GxE.
Here, we provide a theoretical analysis that answers these questions. It is possible for polygenic G-E to bias GxE, but only under a very specific condition that, to our knowledge, has never been evaluated in reality. In particular, we feel that G-E correlation is unlikely to be a significant source of false GxE in the polygenic setting.
Specifically, the condition for G-E correlation to bias GxE estimates is that the GxE effects for each SNP (γ) must correlate with the product of its direct effect on the phenotype (β) and its effect on the environment (α):
For example, in the case of GxSmoking for BMI, the fact that G-Smoking correlation exists is insufficient to cause bias. However, the polygenic GxSmoking estimates will be biased if SNPs that directly increase both smoking likelihood and BMI tend to have a positive interaction ($\gamma > 0$), indicating positive epistasis between main BMI effects ($\beta > 0$) and indirect effects mediated through smoking ($\alpha > 0$). If, instead, we retain $\alpha, \beta > 0$ but assume $\gamma < 0$, then these two pathways to BMI are instead negatively epistatic, meaning the total SNP effect is less than the sum of its direct and smoking-mediated effects.
Intuitively, the bias from G-E correlation on GxE per-SNP averages out to 0 across all SNPs. For a small number of causal SNPs, however, the contributions will not perfectly cancel. In the simplest, most studied case—a single causal SNP—the problem is at its most extreme.
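The averaging-out intuition can be illustrated with a toy simulation. The sketch below assumes, purely for illustration, that each SNP contributes a term proportional to α_l β_l γ_l and that the three effects are independent standard normal draws, so that the expected contribution is zero; the function name `mean_coordination` and the replicate counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_coordination(n_snps, n_reps=2000):
    """For each replicate, average alpha_l * beta_l * gamma_l over n_snps SNPs."""
    alpha = rng.standard_normal((n_reps, n_snps))  # SNP effects on the environment
    beta = rng.standard_normal((n_reps, n_snps))   # direct SNP effects on the trait
    gamma = rng.standard_normal((n_reps, n_snps))  # SNP-by-environment effects
    return (alpha * beta * gamma).mean(axis=1)

# Spread of the per-SNP average across replicates, by number of causal SNPs.
spread = {n: mean_coordination(n).std() for n in (1, 10, 100, 1000)}
for n, s in spread.items():
    print(n, round(float(s), 3))
```

Under these assumptions the spread shrinks roughly like one over the square root of the number of causal SNPs, so a single causal SNP is the worst case and highly polygenic traits sit near zero.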
Proposition 2. Assume the GxE model with G-E correlation23 defined by:
Assume also that both and are i.i.d., and also independent of each other, where are the GxE effects for SNP l. Then G-E correlation causes false positive inflation in the HE regression test for polygenic GxE if and only if:
where are the GxE interaction effects for environment p and is the SNP l-by-environment p interaction effect.
Proof. First, note that the fixed effects μ and ω are eliminated by residualizing covariates; we do not track the impact of this finite-sample projection going forward, however, which is not quite formal but is very standard. The model simplifies to:
In expectation—over the random variables e, ϵ, α, β, and γ—the quadratic form in y for is:
defining and using the notation f(i,j)2 = f(i,j) + f(j,i) for any two-argument term f.
The final term simplifies because the expectation is zero unless , giving:
(Equation A3)
where is an arbitrary dummy index (these terms are i.i.d. over ) and where we define .
We allow the conditional variance to depend on α. But the GxE term is still proportional to , because loci are assumed independent. Each diagonal entry is:
Together, these reductions give:
(Equation A4)
This decomposes the expectation into the appropriate first two terms, for the homogeneous and GxE effects, and a third bias term. This bias is clearly zero when $\eta = 0$. The second term, for GxE, also appropriately disappears under the GxE null, when γ is deterministically zero.
□
A large-N argument could likely also be made that under certain genotype matrix structures.
η is closely related to the concept of coordinated interaction that we are developing in related work. Conceptually, η measures the sum of coordinated interactions between the primary polygenic effect (β) and the indirect pathways through the environment (α).
Our proposition does not claim unbiasedness in general for the estimated GxE variance components, nor does it control any overdispersion of the estimated GxE parameter due to G-E correlation. Instead, we show that the GxE estimate is unbiased when $\eta = 0$ under the null GxE hypothesis ($\gamma = 0$). This is precisely the criterion used in Dudbridge and Fletcher30 to argue that G-E correlation does cause bias for GxE. Our result differs substantially, however, because we model the full spectrum of polygenic SNP effects rather than individual SNPs.
This result also relates to calculations in Sulc et al.,76 who focus on a related problem where the “E” is unmeasured. However, they assume (1) and (2) that E is univariate, which together make the math more straightforward.76 They also do not link their results to the equivalent assumptions on the SNP effect sizes, which is crucial for biological interpretation. In particular, it is not clear from their analysis that significant interaction results are obtained if and only if there is coordinated interaction via .
In summary, we provide a necessary and sufficient condition for G-E correlation to bias GxE interaction tests in the polygenic setting. This demonstrates that the bias is likely negligible in practice and provides a proper theoretical characterization of the relationship between G-E correlation and GxE bias under polygenicity. This was previously unknown: for example, Ni et al.23 did not demonstrate GxE bias in simulations with G-E correlation.
Appendix D: Marginalizing Random Interaction Coefficients
This section shows that the set of Gaussian GxE coefficients (γ) with Kronecker-structured covariance (V for the environmental covariance, D for the SNP covariance) describes essentially the same distributions as the set of linear mixed models with Hadamard kinship matrices $(GDG^\top) \circ (ZVZ^\top)$. The former is the natural model for polygenic GxE; the latter gives intuitive (for discrete environments) estimates of environment-specific heritability and is easily fit by REML. In practice, D is usually taken to be the identity matrix.
Proposition 3. Assume that , that has continuous, random entries, and that has rank P, with P < N. Write for some fixed D. Then
where ∗ is the column-wise Khatri-Rao product; ∘ is Hadamard; and ⊗ is Kronecker.
Proof. First, note the identity:
Now, right to left is easy:
The other direction assumes decomposition of the variance into Hadamard products holds, so
Using the standard identity $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$, where $\mathrm{vec}$ concatenates columns of a matrix, the above equality can be written as
By assumption, this identity holds and almost all genotypes . The span of such pure four-way tensors is , so its kernel has dimension zero and thus .
□
A similar result can be obtained if N grows large with fixed L.
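The easy direction of the proposition, together with the vec identity used in the proof, can be verified numerically. In the sketch below, the interaction design matrix is built so that row i equals kron(z_i, g_i) and the coefficient covariance is taken as V ⊗ D; this stacking convention, the helper name `rowwise_kron`, and the dimensions are assumptions chosen so that the shapes line up, and may differ from the ordering used in the original equations.

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, P = 8, 5, 3

G = rng.standard_normal((N, L))  # genotypes
Z = rng.standard_normal((N, P))  # environments

def rowwise_kron(Z, G):
    """N x (P*L) matrix whose i-th row is kron(Z[i], G[i])."""
    return np.einsum('ip,il->ipl', Z, G).reshape(Z.shape[0], -1)

# Random positive semidefinite covariances for environments (V) and SNPs (D).
A = rng.standard_normal((P, P))
B = rng.standard_normal((L, L))
V, D = A @ A.T, B @ B.T

# Covariance of the interaction term when gamma ~ N(0, V kron D).
X = rowwise_kron(Z, G)
cov_marginal = X @ np.kron(V, D) @ X.T

# Hadamard form: entry (i, j) is (g_i^T D g_j) * (z_i^T V z_j).
cov_hadamard = (G @ D @ G.T) * (Z @ V @ Z.T)
cov_match = np.allclose(cov_marginal, cov_hadamard)

# vec(A X B) = (B^T kron A) vec(X), with vec stacking columns (Fortran order).
M1 = rng.standard_normal((4, 3))
M2 = rng.standard_normal((3, 6))
M3 = rng.standard_normal((6, 2))
lhs = (M1 @ M2 @ M3).flatten(order='F')
rhs = np.kron(M3.T, M1) @ M2.flatten(order='F')
vec_match = np.allclose(lhs, rhs)

print(cov_match, vec_match)
```

The first check holds for any Z and G because each entry of the marginal covariance factors into an environment part and a genotype part; the converse direction is the one that requires the genericity assumptions stated in the proposition.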
We note that the easy direction appears implicitly in many human genetics papers (e.g., Robinson et al.,20 Ni et al.,23 Yang et al.27). We do not know where to find a clear proof, however, and are unaware of any discussion of the other direction.
Web Resources
GxEMM (free R implementation), https://github.com/andywdahl/gxemm
GxEMM scripts to reproduce the simulations and rat analyses, https://github.com/andywdahl/gxemm-scripts
GxEMM scripts to reproduce the PsychENCODE analyses, https://github.com/nguyenkhiemv/GxEMM
Supplemental Data
References
- 1. Barreiro L.B., Tailleux L., Pai A.A., Gicquel B., Marioni J.C., Gilad Y. Deciphering the genetic architecture of variation in the immune response to Mycobacterium tuberculosis infection. Proc. Natl. Acad. Sci. USA. 2012;109:1204–1209. doi: 10.1073/pnas.1115761109.
- 2. Gagneur J., Stegle O., Zhu C., Jakob P., Tekkedil M.M., Aiyar R.S., Schuon A.K., Pe’er D., Steinmetz L.M. Genotype-environment interactions reveal causal pathways that mediate genetic effects on phenotype. PLoS Genet. 2013;9:e1003803. doi: 10.1371/journal.pgen.1003803.
- 3. Lee M.N., Ye C., Villani A.C., Raj T., Li W., Eisenhaure T.M., Imboywa S.H., Chipendo P.I., Ran F.A., Slowikowski K. Common genetic variants modulate pathogen-sensing responses in human dendritic cells. Science. 2014;343:1246980. doi: 10.1126/science.1246980.
- 4. Fairfax B.P., Humburg P., Makino S., Naranbhai V., Wong D., Lau E., Jostins L., Plant K., Andrews R., McGee C., Knight J.C. Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression. Science. 2014;343:1246949. doi: 10.1126/science.1246949.
- 5. Knowles D.A., Davis J.R., Edgington H., Raj A., Favé M.J., Zhu X., Potash J.B., Weissman M.M., Shi J., Levinson D.F. Allele-specific expression reveals interactions between genetic variation and environment. Nat. Methods. 2017;14:699–702. doi: 10.1038/nmeth.4298.
- 6. Glass D., Viñuela A., Davies M.N., Ramasamy A., Parts L., Knowles D., Brown A.A., Hedman A.K., Small K.S., Buil A., UK Brain Expression consortium. MuTHER consortium Gene expression changes with age in skin, adipose tissue, blood and brain. Genome Biol. 2013;14:R75. doi: 10.1186/gb-2013-14-7-r75.
- 7. Battle A., Brown C.D., Engelhardt B.E., Montgomery S.B., GTEx Consortium Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213.
- 8. Zhernakova D.V., Deelen P., Vermaat M., van Iterson M., van Galen M., Arindrarto W., van ’t Hof P., Mei H., van Dijk F., Westra H.J. Identification of context-dependent expression quantitative trait loci in whole blood. Nat. Genet. 2017;49:139–145. doi: 10.1038/ng.3737.
- 9. Kang H.M., Subramaniam M., Targ S., Nguyen M., Maliskova L., McCarthy E., Wan E., Wong S., Byrnes L., Lanata C.M. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 2018;36:89–94. doi: 10.1038/nbt.4042.
- 10. Favé M.J., Lamaze F.C., Soave D., Hodgkinson A., Gauvin H., Bruat V., Grenier J.C., Gbeha E., Skead K., Smargiassi A. Gene-by-environment interactions in urban populations modulate risk phenotypes. Nat. Commun. 2018;9:827. doi: 10.1038/s41467-018-03202-2.
- 11. Jostins L., Ripke S., Weersma R.K., Duerr R.H., McGovern D.P., Hui K.Y., Lee J.C., Schumm L.P., Sharma Y., Anderson C.A., International IBD Genetics Consortium (IIBDGC) Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature. 2012;491:119–124. doi: 10.1038/nature11582.
- 12. Myers R.A., Scott N.M., Gauderman W.J., Qiu W., Mathias R.A., Romieu I., Levin A.M., Pino-Yanes M., Graves P.E., Villarreal A.B., GRAAD Genome-wide interaction studies reveal sex-specific asthma risk alleles. Hum. Mol. Genet. 2014;23:5251–5259. doi: 10.1093/hmg/ddu222.
- 13. Mitra I., Tsang K., Ladd-Acosta C., Croen L.A., Aldinger K.A., Hendren R.L., Traglia M., Lavillaureix A., Zaitlen N., Oldham M.C. Pleiotropic Mechanisms Indicated for Sex Differences in Autism. PLoS Genet. 2016;12:e1006425. doi: 10.1371/journal.pgen.1006425.
- 14. Klein S.L., Flanagan K.L. Sex differences in immune responses. Nat. Rev. Immunol. 2016;16:626–638. doi: 10.1038/nri.2016.90.
- 15. Small K.S., Todorčević M., Civelek M., El-Sayed Moustafa J.S., Wang X., Simon M.M., Fernandez-Tajes J., Mahajan A., Horikoshi M., Hugill A. Regulatory variants at KLF14 influence type 2 diabetes risk via a female-specific effect on adipocyte size and body composition. Nat. Genet. 2018;50:572–580. doi: 10.1038/s41588-018-0088-x.
- 16. Exner D.V., Dries D.L., Domanski M.J., Cohn J.N. Lesser response to angiotensin-converting-enzyme inhibitor therapy in black as compared with white patients with left ventricular dysfunction. N. Engl. J. Med. 2001;344:1351–1357. doi: 10.1056/NEJM200105033441802.
- 17. Mega J.L., Simon T., Collet J.P., Anderson J.L., Antman E.M., Bliden K., Cannon C.P., Danchin N., Giusti B., Gurbel P. Reduced-function CYP2C19 genotype and risk of adverse clinical outcomes among patients treated with clopidogrel predominantly for PCI: a meta-analysis. JAMA. 2010;304:1821–1830. doi: 10.1001/jama.2010.1543.
- 18. Riaz N., Havel J.J., Kendall S.M., Makarov V., Walsh L.A., Desrichard A., Weinhold N., Chan T.A. Recurrent SERPINB3 and SERPINB4 mutations in patients who respond to anti-CTLA4 immunotherapy. Nat. Genet. 2016;48:1327–1329. doi: 10.1038/ng.3677.
- 19. Peterson R.E., Cai N., Dahl A.W., Bigdeli T.B., Edwards A.C., Webb B.T., Bacanu S.A., Zaitlen N., Flint J., Kendler K.S. Molecular Genetic Analysis Subdivided by Adversity Exposure Suggests Etiologic Heterogeneity in Major Depression. Am. J. Psychiatry. 2018;175:545–554. doi: 10.1176/appi.ajp.2017.17060621.
- 20. Robinson M.R., English G., Moser G., Lloyd-Jones L.R., Triplett M.A., Zhu Z., Nolte I.M., van Vliet-Ostaptchouk J.V., Snieder H., Esko T., LifeLines Cohort Study Genotype-covariate interaction effects and the heritability of adult body mass index. Nat. Genet. 2017;49:1174–1181. doi: 10.1038/ng.3912.
- 21. Moore R., Casale F.P., Jan Bonder M., Horta D., Franke L., Barroso I., Stegle O., BIOS Consortium A linear mixed-model approach to study multivariate gene-environment interactions. Nat. Genet. 2019;51:180–186. doi: 10.1038/s41588-018-0271-0.
- 22. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608.
- 23. Ni G., van der Werf J., Zhou X., Hyppönen E., Wray N.R., Lee S.H. Genotype-covariate correlation and interaction disentangled by a whole-genome multivariate reaction norm model. Nat. Commun. 2019;10:2239. doi: 10.1038/s41467-019-10128-w.
- 24. Golan D., Lander E.S., Rosset S. Measuring missing heritability: inferring the contribution of common variants. Proc. Natl. Acad. Sci. USA. 2014;111:E5272–E5281. doi: 10.1073/pnas.1419064111.
- 25. Weissbrod O., Flint J., Rosset S. Estimating SNP-Based Heritability and Genetic Correlation in Case-Control Studies Directly and with Summary Statistics. Am. J. Hum. Genet. 2018;103:89–99. doi: 10.1016/j.ajhg.2018.06.002.
- 26. Hernandez R.D., Uricchio L.H., Hartman K., Ye C., Dahl A., Zaitlen N. Ultrarare variants drive substantial cis heritability of human gene expression. Nat. Genet. 2019;51:1349–1355. doi: 10.1038/s41588-019-0487-7.
- 27. Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011.
- 28. Casale F.P., Horta D., Rakitsch B., Stegle O. Joint genetic analysis using variant sets reveals polygenic gene-context interactions. PLoS Genet. 2017;13:e1006693. doi: 10.1371/journal.pgen.1006693.
- 29. Kang E.Y., Lee C.H., Furlotte N.A., Joo J.W.J., Kostem E., Zaitlen N., Eskin E., Han B. An Association Mapping Framework To Account for Potential Sex Difference in Genetic Architectures. Genetics. 2018;209:685–698. doi: 10.1534/genetics.117.300501.
- 30. Dudbridge F., Fletcher O. Gene-environment dependence creates spurious gene-environment interaction. Am. J. Hum. Genet. 2014;95:301–307. doi: 10.1016/j.ajhg.2014.07.014.
- 31. Dempster E.R., Lerner I.M. Heritability of Threshold Characters. Genetics. 1950;35:212–236. doi: 10.1093/genetics/35.2.212.
- 32. Lee S.H., Wray N.R., Goddard M.E., Visscher P.M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 2011;88:294–305. doi: 10.1016/j.ajhg.2011.02.002.
- 33. Zhou X., Carbonetto P., Stephens M. Polygenic modeling with bayesian sparse linear mixed models. PLoS Genet. 2013;9:e1003264. doi: 10.1371/journal.pgen.1003264.
- 34. Jiang J., Li C., Paul D., Yang C., Zhao H. On high-dimensional misspecified mixed model analysis in genome-wide association study. Ann. Stat. 2016;44:2127–2160.
- 35. Steinsaltz D., Dahl A., Wachter K.W. Statistical properties of simple random-effects models for genetic heritability. Electron. J. Stat. 2018;12:321–356. doi: 10.1214/17-EJS1386.
- 36. Stephens M., Balding D.J. Bayesian statistical methods for genetic association studies. Nat. Rev. Genet. 2009;10:681–690. doi: 10.1038/nrg2615.
- 37. Speed D., Hemani G., Johnson M.R., Balding D.J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 2012;91:1011–1021. doi: 10.1016/j.ajhg.2012.10.010.
- 38. Steinsaltz D., Dahl A., Wachter K.W. On Negative Heritability and Negative Estimates of Heritability. bioRxiv. 2018 doi: 10.1534/genetics.120.303161.
- 39. Bulik-Sullivan B.K., Loh P.R., Finucane H.K., Ripke S., Yang J., Patterson N., Daly M.J., Price A.L., Neale B.M., Schizophrenia Working Group of the Psychiatric Genomics Consortium LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211.
- 40. Chen G.B. Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman-Elston regression. Front. Genet. 2014;5:107. doi: 10.3389/fgene.2014.00107.
- 41. Abney M. Permutation testing in the presence of polygenic variation. Genet. Epidemiol. 2015;39:249–258. doi: 10.1002/gepi.21893.
- 42. Schweiger R., Fisher E., Weissbrod O., Rahmani E., Müller-Nurasyid M., Kunze S., Gieger C., Waldenberger M., Rosset S., Halperin E. Detecting heritable phenotypes without a model using fast permutation testing for heritability and set-tests. Nat. Commun. 2018;9:4919. doi: 10.1038/s41467-018-07276-w.
- 43. Wu Y., Sankararaman S. A scalable estimator of SNP heritability for biobank-scale data. Bioinformatics. 2018;34:i187–i194. doi: 10.1093/bioinformatics/bty253.
- 44. Pazokitoroudi A., Wu Y., Burch K.S., Hou K., Pasaniuc B., Sankararaman S. Scalable multi-component linear mixed models with application to SNP heritability estimation. bioRxiv. 2019
- 45. Dahl A., Iotchkova V., Baud A., Johansson Å., Gyllensten U., Soranzo N., Mott R., Kranis A., Marchini J. A multiple-phenotype imputation method for genetic studies. Nat. Genet. 2016;48:466–472. doi: 10.1038/ng.3513.
- 46. Baud A., Hermsen R., Guryev V., Stridh P., Graham D., McBride M.W., Foroud T., Calderari S., Diez M., Ockinger J., Rat Genome Sequencing and Mapping Consortium Combined sequence-based and genetic mapping analysis of complex traits in outbred rats. Nat. Genet. 2013;45:767–775. doi: 10.1038/ng.2644.
- 47. Sul J.H., Bilow M., Yang W.Y., Kostem E., Furlotte N., He D., Eskin E. Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models. PLoS Genet. 2016;12:e1005849. doi: 10.1371/journal.pgen.1005849.
- 48. Gandal M.J., Zhang P., Hadjimichael E., Walker R.L., Chen C., Liu S., Won H., van Bakel H., Varghese M., Wang Y., PsychENCODE Consortium Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder. Science. 2018;362:9–999. doi: 10.1126/science.aat8127.
- 49. Wang D., Liu S., Warrell J., Won H., Shi X., Navarro F.C.P., Clarke D., Gu M., Emani P., Yang Y.T., PsychENCODE Consortium Comprehensive functional genomic resource and integrative model for the human brain. Science. 2018;362:eaat8464. doi: 10.1126/science.aat8464.
- 50. Leek J.T., Storey J.D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3:1724–1735. doi: 10.1371/journal.pgen.0030161.
- 51. Jarquín D., Crossa J., Lacaze X., Du Cheyron P., Daucourt J., Lorgeou J., Piraux F., Guerreiro L., Pérez P., Calus M. A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor. Appl. Genet. 2014;127:595–607. doi: 10.1007/s00122-013-2243-1.
- 52. Lee S.H., Van der Werf J.H.J. Using dominance relationship coefficients based on linkage disequilibrium and linkage with a general complex pedigree to increase mapping resolution. Genetics. 2006;174:1009–1016. doi: 10.1534/genetics.106.060806.
- 53. Hanson A.J., Banks W.A., Hernandez Saucedo H., Craft S. Apolipoprotein E Genotype and Sex Influence Glucose Tolerance in Older Adults: A Cross-Sectional Study. Dement. Geriatr. Cogn. Disord. Extra. 2016;6:78–89. doi: 10.1159/000444079.
- 54. CONVERGE consortium Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature. 2015;523:588–591. doi: 10.1038/nature14659.
- 55. Kendler K.S., Gatz M., Gardner C.O., Pedersen N.L. A Swedish national twin study of lifetime major depression. Am. J. Psychiatry. 2006;163:109–114. doi: 10.1176/appi.ajp.163.1.109.
- 56. McGuffin P., Katz R., Watkins S., Rutherford J. A hospital-based twin register of the heritability of DSM-IV unipolar depression. Arch. Gen. Psychiatry. 1996;53:129–136. doi: 10.1001/archpsyc.1996.01830020047006.
- 57. Price A.L., Patterson N., Hancks D.C., Myers S., Reich D., Cheung V.G., Spielman R.S. Effects of cis and trans genetic ancestry on gene expression in African Americans. PLoS Genet. 2008;4:e1000294. doi: 10.1371/journal.pgen.1000294.
- 58. Huckins L.M., Dobbyn A., Ruderfer D.M., Hoffman G., Wang W., Pardiñas A.F., Rajagopal V.M., Als T.D., T Nguyen H., Girdhar K., CommonMind Consortium. Schizophrenia Working Group of the Psychiatric Genomics Consortium. iPSYCH-GEMS Schizophrenia Working Group Gene expression imputation across multiple brain regions provides insights into schizophrenia risk. Nat. Genet. 2019;51:659–674. doi: 10.1038/s41588-019-0364-4.
- 59. Zhu Z., Zhang F., Hu H., Bakshi A., Robinson M.R., Powell J.E., Montgomery G.W., Goddard M.E., Wray N.R., Visscher P.M., Yang J. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 2016;48:481–487. doi: 10.1038/ng.3538.
- 60. Fullard J.F., Giambartolomei C., Hauberg M.E., Xu K., Voloudakis G., Shao Z., Bare C., Dudley J.T., Mattheisen M., Robakis N.K. Open chromatin profiling of human postmortem brain infers functional roles for non-coding schizophrenia loci. Hum. Mol. Genet. 2017;26:1942–1951. doi: 10.1093/hmg/ddx103.
- 61. Zhu Z., Zheng Z., Zhang F., Wu Y., Trzaskowski M., Maier R., Robinson M.R., McGrath J.J., Visscher P.M., Wray N.R., Yang J. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat. Commun. 2018;9:224. doi: 10.1038/s41467-017-02317-2.
- 62. Wu Y., Zeng J., Zhang F., Zhu Z., Qi T., Zheng Z., Lloyd-Jones L.R., Marioni R.E., Martin N.G., Montgomery G.W. Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits. Nat. Commun. 2018;9:918. doi: 10.1038/s41467-018-03371-0.
- 63. Ma L., Semick S.A., Chen Q., Li C., Tao R., Price A.J., Shin J.H., Jia Y., Brandon N.J., Cross A.J., BrainSeq Consortium Schizophrenia risk variants influence multiple classes of transcripts of sorting nexin 19 (SNX19) Mol. Psychiatry. 2019 doi: 10.1038/s41380-018-0293-0. Published online January 11, 2019.
- 64. Fromer M., Roussos P., Sieberts S.K., Johnson J.S., Kavanagh D.H., Perumal T.M., Ruderfer D.M., Oh E.C., Topol A., Shah H.R. Gene expression elucidates functional impact of polygenic risk for schizophrenia. Nat. Neurosci. 2016;19:1442–1453. doi: 10.1038/nn.4399.
- 65. Dobbyn A., Huckins L.M., Boocock J., Sloofman L.G., Glicksberg B.S., Giambartolomei C., Hoffman G.E., Perumal T.M., Girdhar K., Jiang Y., CommonMind Consortium Landscape of Conditional eQTL in Dorsolateral Prefrontal Cortex and Co-localization with Schizophrenia GWAS. Am. J. Hum. Genet. 2018;102:1169–1184. doi: 10.1016/j.ajhg.2018.04.011.
- 66. Hou Y., Liang W., Zhang J., Li Q., Ou H., Wang Z., Li S., Huang X., Zhao C. Schizophrenia-associated rs4702 G allele-specific downregulation of FURIN expression by miR-338-3p reduces BDNF production. Schizophr. Res. 2018;199:176–180. doi: 10.1016/j.schres.2018.02.040.
- 67. Crawford L., Zeng P., Mukherjee S., Zhou X. Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLoS Genet. 2017;13:e1006869. doi: 10.1371/journal.pgen.1006869.
- 68. Han B., Eskin E. Random-effects model aimed at discovering associations in meta-analysis of genome-wide association studies. Am. J. Hum. Genet. 2011;88:586–598. doi: 10.1016/j.ajhg.2011.04.014.
- 69. Wen X., Stephens M. Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene–environment interactions. Ann. Appl. Stat. 2014;8:176–203. doi: 10.1214/13-AOAS695.
- 70. Speed D., Cai N., Johnson M.R., Nejentsev S., Balding D.J., UCLEB Consortium Reevaluation of SNP heritability in complex human traits. Nat. Genet. 2017;49:986–992. doi: 10.1038/ng.3865.
- 71. Dahl A., Cai N., Ko A., Laakso M., Pajukanta P., Flint J., Zaitlen N. Reverse GWAS: Using genetics to identify and model phenotypic subtypes. PLoS Genet. 2019;15:e1008009. doi: 10.1371/journal.pgen.1008009.
- 72. Liu X., Mefford J.A., Dahl A., Subramaniam M., Battle A., Price A.L., Zaitlen N. GBAT: a gene-based association method for robust trans-gene regulation detection. bioRxiv. 2018 doi: 10.1186/s13059-020-02120-1.
- 73. Diego V.P., Rainwater D.L., Wang X.L., Cole S.A., Curran J.E., Johnson M.P., Jowett J.B., Dyer T.D., Williams J.T., Moses E.K. Genotype x adiposity interaction linkage analyses reveal a locus on chromosome 1 for lipoprotein-associated phospholipase A2, a marker of inflammation and oxidative stress. Am. J. Hum. Genet. 2007;80:168–177. doi: 10.1086/510497.
- 74. Zaitlen N., Kraft P., Patterson N., Pasaniuc B., Bhatia G., Pollack S., Price A.L. Using extended genealogy to estimate components of heritability for 23 quantitative and dichotomous traits. PLoS Genet. 2013;9:e1003520. doi: 10.1371/journal.pgen.1003520.
- 75. Traglia M., Bseiso D., Gusev A., Adviento B., Park D.S., Mefford J.A., Zaitlen N., Weiss L.A. Genetic Mechanisms Leading to Sex Differences Across Common Diseases and Anthropometric Traits. Genetics. 2017;205:979–992. doi: 10.1534/genetics.116.193623.
- 76. Sulc J., Mounier N., Felix G., Winkler T., Wood A.R., Frayling T.M. Maximum likelihood method quantifies the overall contribution of gene-environment interaction to complex traits: an application to obesity traits. bioRxiv. 2019