Abstract
In genetic association analysis of complex traits, detection of interaction (either GxG or GxE) can help to elucidate the genetic architecture and biological mechanisms underlying the trait. Detection of interaction in a genome-wide interaction study (GWIS) can be methodologically challenging for various reasons, including a high burden of multiple comparisons when testing for epistasis between all possible pairs of a set of genomewide variants, as well as heteroscedasticity effects occurring in the presence of GxG or GxE interaction. In this paper, we address the problem of an even more striking phenomenon that we call the “feast or famine” effect that occurs when testing interaction in a genomewide context. We show that in any given GxE GWIS, the type 1 error of standard interaction tests performed genomewide can vary widely from the nominal level, where the actual type 1 error in any given GWIS varies as a predictable function of the observed trait and environmental values. Using standard methods, some GWISs will have systematically underinflated p-values (“feast”), and others will have systematically overinflated p-values (“famine”), which can lead to false detection of interaction, reduced power, inconsistent results across studies, and failure to replicate true signal. This startling phenomenon is specific to detection of interaction in a GWIS, and it may partly explain why such detection has often proved challenging and difficult to replicate. 
We show that the feast or famine effect occurs across a wide range of GxE analysis methods, including but not limited to (1) testing interaction in a linear or linear mixed model (LMM) using standard approaches such as t-tests/Wald tests, likelihood ratio tests, or score tests; (2) doing a combined interaction-association test in a linear model or LMM using standard approaches; (3) testing interaction with multiple environments or multiple SNPs, where these are modeled as random effects in a LMM using standard approaches; (4) performing tests of interaction in a GWIS where significance is assessed using permutation of the trait residuals. We show theoretically that the key cause of this phenomenon is which variables are conditioned on in the analysis. Using this insight, we have developed (i) a diagnostic ratio to detect which GWASs are subject to a strong “feast or famine” effect and (ii) the TINGA method to adjust the interaction test statistics to make their p-values approximately uniform under the null hypothesis. In simulations we show that TINGA both controls type 1 error and improves power. TINGA allows for covariates and population structure through use of a linear mixed model and accounts for heteroscedasticity. We apply TINGA to detection of epistasis in a study of flowering time in Arabidopsis thaliana.
Author summary
Testing for interactions in GWAS can lead to insight into biological mechanisms, but poses greater challenges than ordinary genetic association GWAS. When testing for interaction in a GWAS setting with one fixed SNP or environmental variable, the standard test statistics may not have the expected statistical properties under the null hypothesis, which can lead to false detection of interaction, inconsistent results across studies, reduced power, and failure to replicate true signal. We propose the TINGA method to adjust the test statistics so that the null distribution of their p-values is closer to uniform. Through simulations and real data analysis, we illustrate the problems with the standard analysis and the improvement of our proposed method.
Introduction
It is well-known that the effects of a genetic variant on a trait can be different for individuals with different environments, such as age [1], sex [2–5], lifestyle [6] and other exposures [7]. The genetic effects can also depend on other variants, either from the same genome [8,9] or the genome of another species (such as pathogen and host [10], mother and offspring [11]). Detection of such interaction effects can enhance the ability to identify genetic effects that would otherwise be reduced or masked [12]; they are considered as one of the reasons why results of marginal association studies are sometimes hard to replicate [13]; they are believed to account for a large part of missing heritability [14–16] ; and they can elucidate the genetic architecture of complex traits and diseases [12,17,18] and benefit many areas such as public health [19] and agriculture [20,21]. Much previous work has been done to develop appropriate methods for detecting interactions in GWAS, aiming to improve computational efficiency, reduce false positives and increase power [4,22–29].
One challenge specific to epistasis detection is that, because of the large number of tests, exhaustive search for epistatic effects in a GWAS context has a larger computational burden and lower statistical power than ordinary trait-variant association studies. To deal with this issue, various methods have been developed that correct for multiple testing while still remaining powerful [30,31]. Another option is to reduce the number of tests by a two-stage approach: first select a subset of SNPs that are more likely to be involved in interaction and then test for interaction among them [22,26,32,33].
Previous work [34–36] has found that it can be hard to replicate interactions in GWAS. This can occur for a variety of reasons. For example, in some cases, an apparent epistatic effect that is detected could be due to an unsequenced causal variant [34,37,38]. Another important issue that has been identified is heteroscedasticity [39–41] that can result under the null model when, for example, interaction is present between one of the two tested variables and some other variable not included in the model or when the null model is misspecified in some other way. If not accounted for, this heteroscedasticity can lead to excess type 1 error [39–41].
Many scenarios of testing for GxG or GxE in a GWAS context involve fixing one genetic variant or environmental factor and performing an interaction GWAS by testing the fixed variable for interaction with each genetic variant across the genome. Systematically inflated or deflated p-values in such an interaction GWAS have been previously reported, based on both data and simulations [38–40]. Even under simplified assumptions, in the absence of problems such as heteroscedasticity, it has been noted that type 1 error rates and genomic control inflation factors are highly variable across such interaction GWASs [39,40]. In this paper, we develop a deeper and more detailed understanding of this phenomenon, which we call the “feast or famine” effect in interaction GWAS. We frame this problem as resulting from the choice of variables to condition on and show how changing this choice has the potential to resolve the problem. Our framework also explains clearly why the “feast or famine” effect only occurs in interaction GWAS, not in ordinary association GWAS. We implement our ideas in a method we call TINGA (Testing INteraction in GWAS with test statistic Adjustment), in which we adjust the t-statistic for interaction by re-centering and re-scaling it using the null conditional mean and conditional variance of its numerator, with a more appropriate choice of conditioning variables. In simulations, we demonstrate the ability of TINGA to greatly reduce or eliminate the “feast or famine” effect while controlling type 1 error and increasing power. We also develop a useful diagnostic that accurately predicts the magnitude and direction of the “feast or famine” effect in any given data set. We apply the methods to detect epistasis in a GWAS for flowering time in Arabidopsis thaliana.
Materials and methods
We consider the problem of testing for interaction, either GxG or GxE, in a GWAS context. In a sample of $n$ individuals, let $Y$ be an $n \times 1$ trait vector, and let $G$ be an $n \times m$ matrix of genotypes for a set of $m$ genome-wide variants. Let $Z$ be an $n \times 1$ vector that, in the case of GxE testing, represents the environmental variable that we wish to test interaction with and, in the case of GxG testing, represents the genotype at a particular variant that we wish to test interaction with (where we assume that $Z$ is removed from the matrix $G$ in that case). In addition, we can allow for an $n \times k$ matrix $U$ of covariates (including intercept), where these are implicitly taken as fixed and are conditioned on throughout the analysis. By “testing interaction in a GWAS context,” we mean that for each $j$ in $\{1, \ldots, m\}$, we test for interaction between $Z$ and $G_j$ in a linear or linear mixed model (LMM) for $Y$, where $G_j$ is the $j$th column of $G$.
In this section, we first describe what we call the “feast or famine” effect for testing interaction in a GWAS context. We explain how the “feast or famine” effect can result in some GWASs having systematically overinflated interaction p-values, reducing power, while others have systematically underinflated p-values, resulting in excess type 1 error. In what follows, we focus our exposition on the t-statistic for testing interaction, but the “feast or famine” effect is very general. We show that the feast or famine effect occurs across a wide range of GxE analysis methods, including but not limited to (1) testing interaction in a linear or linear mixed model (LMM) using standard approaches such as t-tests/Wald tests, likelihood ratio tests, or score tests; (2) doing a combined interaction-association test in a linear model or LMM using standard approaches; (3) testing interaction with multiple environments or multiple SNPs, where these are modeled as random effects in a LMM using standard approaches [22,28]; (4) performing tests of interaction in a GWIS where significance is assessed using permutation of the trait residuals. We show that the “feast or famine” effect does not occur in ordinary GWAS for testing association between a trait and each genetic variant, but only when testing interaction in a GWAS context. Next we describe our TINGA method to correct the interaction test statistics to greatly reduce or eliminate this effect.
In the simplest setting in which there are no covariates and no population sub-structure, we let $T_j$ denote the t-statistic for testing interaction between $Z$ and $G_j$, i.e., for testing $H_0: \beta_{ZG} = 0$, in the following linear model:
$$Y = \beta_0 \mathbf{1} + \beta_Z Z + \beta_G G_j + \beta_{ZG}\,(Z \circ G_j) + \epsilon, \tag{1}$$
where $\mathbf{1}$ is a vector of length $n$ with every entry equal to 1, $\beta_0$, $\beta_Z$, $\beta_G$, and $\beta_{ZG}$ are unknown scalar parameters, $\epsilon \sim N(0, \sigma^2 I)$, where $\sigma^2$ is unknown and $I$ is the $n \times n$ identity matrix, and where, for any two vectors $a$ and $b$, both of length $n$, we define $a \circ b$ to be the vector of length $n$ with $i$th element $(a_i - \bar{a})(b_i - \bar{b})$, where, e.g., $\bar{a} = n^{-1}\sum_{i=1}^{n} a_i$. (Note that the test statistics would remain exactly the same if we replaced $Z \circ G_j$ in (1) by the element-wise product of the vectors $Z$ and $G_j$, but choosing to center the variables before multiplying them has various advantages such as reducing potential collinearity and making the coefficients more interpretable.)
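The invariance noted in the parenthetical above can be checked numerically. The following is a minimal sketch, not part of the authors' software: the helper name `interaction_t`, the simulated data, and the sample size are illustrative assumptions. It fits the four-column design (intercept, two main effects, product term) by ordinary least squares and compares the t-statistic for the product term with and without centering.

```python
import numpy as np

def interaction_t(y, z, g, center=True):
    """t-statistic for the product term in y ~ 1 + z + g + product (OLS).

    Hypothetical helper for illustration only.
    """
    n = len(y)
    prod = (z - z.mean()) * (g - g.mean()) if center else z * g
    X = np.column_stack([np.ones(n), z, g, prod])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])  # residual variance on n - 4 d.f.
    return beta[3] / np.sqrt(sigma2 * XtX_inv[3, 3])

rng = np.random.default_rng(0)
n = 500                                        # illustrative sample size
z = rng.binomial(1, 0.2, n).astype(float)      # binary "environment"
g = rng.binomial(1, 0.4, n).astype(float)      # binary genotype
y = rng.normal(size=n)                         # null trait

t_centered = interaction_t(y, z, g, center=True)
t_raw = interaction_t(y, z, g, center=False)
print(t_centered, t_raw)
```

The two values agree up to floating-point error because the centered product differs from the raw product only by a linear combination of the intercept and the two main-effect columns, which are already in the model.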
The “feast or famine” effect: what we thought we knew about testing interaction in a GWAS context was wrong
For simplicity, we first focus the exposition on GxE interaction testing. An essential feature of testing GxE interaction in a GWAS context is that we obtain a set of test statistics $T_1, \ldots, T_m$, where $T_j = T(Y, Z, G_j)$, with the same $Y$ and $Z$ used in all the test statistics and only $G_j$ varying. As a thought experiment, imagine the simplest possible null scenario in which $Y$, $Z$ and the columns of $G$ are mutually independent, with the elements of $Y$ drawn as i.i.d. $N(\mu, \sigma^2)$, the elements of $Z$ drawn as i.i.d. from some distribution $F_Z$, and the elements of $G_j$ drawn as i.i.d. from some distribution $F_j$, for $j = 1, \ldots, m$. What would be the distribution of $T_1, \ldots, T_m$ in this case? It is well-known that for any given $j$, the distribution of $T_j$ in this case is the (central) Student’s $t$ distribution on $n-4$ degrees of freedom, which we denote by $t_{n-4}$. Thus, it is tempting to assume that $T_1, \ldots, T_m$ must be approximately i.i.d. draws from $t_{n-4}$, but that is (perhaps surprisingly) incorrect.
In this simple scenario, we show that it is most appropriate to think of $T_1, \ldots, T_m$ as i.i.d. draws from some distribution whose mean is 0 and whose variance is a function of $(Y, Z)$. For some choices of $(Y, Z)$, the variance of the resulting $T_j$’s is larger than 1 (where 1 is the approximate variance of $t_{n-4}$ for large $n$), while for other choices of $(Y, Z)$, the variance of the resulting $T_j$’s is smaller than 1. Thus, if we used $t_{n-4}$ to calculate p-values for $T_1, \ldots, T_m$, respectively, which would be the standard approach, then in one GWAS these p-values might be systematically too big on average, in a second GWAS these p-values might be systematically too small on average, and in a third GWAS, they might be about right (where by “about right” we mean approximately i.i.d. uniform under the null).
This can easily be observed in simulations (see also [39,40]). Fig 1 shows four histograms, each of which depicts the p-values for a GWAS obtained as described above, where $n$ is 1,000, $m$ is 5,000, $F_Z$ is taken to be Bernoulli(.2), and $F_j$ is taken to be Bernoulli($f_j$) for $j = 1, \ldots, m$, where $f_1, \ldots, f_m$ are drawn as i.i.d. Unif(.1, .9), to mimic unlinked genotypes from a haploid organism or an inbred line. In Panel A of Fig 1, the p-values are seen to be systematically overinflated, while in Panel B of Fig 1, the p-values are seen to be systematically underinflated. The information in Table 1 supports this conclusion, where we can see that for Panel A, the s.d. of the interaction t-statistics is < 1 and the genomic control inflation factor is < 1, while for Panel B the opposite holds. We repeated this experiment 400 times, and in each replicate, we tested whether the 5,000 p-values were i.i.d. Uniform(0,1) distributed under the null hypothesis (which is equivalent to testing whether the 5,000 interaction t-statistics are i.i.d. $t_{n-4}$ distributed) using the two-sided equal local levels (ELL) test as implemented in qqconf [42]. (See S1 Text for an R script to perform this test.) In 190 out of 400, i.e., 47.5%, of the replicates, the two-sided ELL test for uniformity was rejected at level .05, clearly showing that the t-statistics for interaction in a GWAS are not i.i.d. $t_{n-4}$ distributed under the null hypothesis.
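The null simulation described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' script: the function names are hypothetical, the sizes ($n = 500$, 1,000 variants) are reduced from those in the text for speed, and the spread of the t-statistics is summarized by their sample s.d. rather than by the ELL test. For each of two replicates of the trait and environment, two independent genotype sets are tested, mirroring the Panel A/C and Panel B/D pairing.

```python
import numpy as np

def interaction_t_stats(y, z, G):
    """Interaction t-statistic for every column of G, with y and z held fixed.

    Each fit is OLS of y on (1, z, g, centered z * centered g).
    """
    n, m = G.shape
    zc = z - z.mean()
    t = np.empty(m)
    for j in range(m):
        g = G[:, j]
        X = np.column_stack([np.ones(n), z, g, zc * (g - g.mean())])
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ (X.T @ y)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 4)
        t[j] = beta[3] / np.sqrt(sigma2 * XtX_inv[3, 3])
    return t

def simulate_genotypes(n, m, rng):
    """Unlinked Bernoulli genotypes with frequencies drawn from Unif(.1, .9)."""
    freqs = rng.uniform(0.1, 0.9, size=m)
    return rng.binomial(1, freqs, size=(n, m)).astype(float)

rng = np.random.default_rng(1)
n, m = 500, 1000          # assumption: smaller than in the text, for speed
sd_by_replicate = []
for rep in range(2):
    y = rng.normal(size=n)                           # null trait
    z = rng.binomial(1, 0.2, size=n).astype(float)   # Bernoulli(.2) environment
    sds = []
    for _ in range(2):                               # two independent genotype sets
        t = interaction_t_stats(y, z, simulate_genotypes(n, m, rng))
        sds.append(float(t.std(ddof=1)))
    sd_by_replicate.append(sds)
    print(f"replicate {rep}: s.d. of t-statistics = {sds[0]:.3f}, {sds[1]:.3f}")
```

If the effect described in the text is present, the two s.d. values within a replicate should tend to resemble each other more than they resemble the values from the other replicate, since the trait and environment are shared within a replicate.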
Fig 1. Histograms of p-values for t-tests for interaction in a GWAS when the null hypothesis is true.
Each histogram is based on a replicate of $(Y, Z)$ and 5,000 genotypes, $G_1, \ldots, G_{5000}$. In each histogram, interaction is tested between $Z$ and $G_j$ in the linear model in (1) for $j = 1, \ldots, 5000$, as described in the text, and the 5,000 p-values are computed using the $t_{n-4}$ distribution and are displayed in the histogram. Panels A and B represent two different replicates of a null simulation as described in the text. In Panel C, the same $(Y, Z)$ replicate is used as in Panel A, and a new set of 5,000 genotypes is simulated and used in the interaction tests. Similarly, in Panel D, the same $(Y, Z)$ replicate is used as in Panel B, and a new set of 5,000 genotypes is simulated and used in the interaction tests.
Table 1. Summary statistics for the examples in Fig 1.
| Panel | $T_j$ mean | $T_j$ s.d. | genomic control | ELL p-value |
|---|---|---|---|---|
| A | .015 | .93 | .88 | 2.2e-10 |
| B | −.002 | 1.09 | 1.19 | 3.6e-12 |
| C | .013 | .94 | .92 | 9.5e-9 |
| D | −.010 | 1.09 | 1.16 | 3.5e-12 |
For each panel of Fig 1, “$T_j$ mean” is the mean and “$T_j$ s.d.” is the s.d. of the interaction t-statistics whose p-values are displayed in the panel. The genomic control is based on the squares of the interaction t-statistics in each panel. The ELL p-value is the p-value for testing the hypothesis that the interaction p-values are uniformly distributed, as described in [42].
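For readers who want to reproduce the “genomic control” column, here is a minimal sketch. It assumes the usual definition of the genomic control inflation factor (the median of the squared test statistics divided by the median of a chi-squared distribution with 1 degree of freedom); the text does not spell out its exact formula, and the function name `genomic_control` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def genomic_control(t_stats):
    """Genomic control inflation factor from a vector of t-statistics.

    Assumed definition: median(t^2) / median(chi-square, 1 d.f.).
    The chi-square(1) median equals the squared 0.75 quantile of N(0, 1),
    roughly 0.455.
    """
    chi2_median = NormalDist().inv_cdf(0.75) ** 2
    return float(np.median(np.square(t_stats)) / chi2_median)

# Illustrative use on simulated statistics (not the data behind Table 1).
rng = np.random.default_rng(2)
gc_null = genomic_control(rng.standard_normal(5000))         # variance 1
gc_wide = genomic_control(1.09 * rng.standard_normal(5000))  # s.d. 1.09
print(gc_null, gc_wide)
```

Under this definition, multiplying every statistic by a constant $c$ multiplies the factor by $c^2$, which is why an s.d. of about 1.09 goes with a factor near 1.19 and an s.d. of about .93 with a factor near .87.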
This effect seems to be very general and also occurs when, e.g., $Z$ and $G_j$ are taken to be Gaussian or Binomial, as we show later. Furthermore, if instead of a t-test for interaction, we apply a likelihood ratio chi-squared test or F-test for interaction to the same simulated data sets, we get essentially indistinguishable histograms to those in Fig 1 (which is perhaps not surprising since they are asymptotically equivalent tests), and the same 190 replicates out of 400 are rejected by the ELL test for uniformity of the p-values, showing that the likelihood ratio chi-squared test and F-test for interaction are also subject to the “feast or famine” effect.
Many standard methods are affected by the “feast or famine” effect. In Figs. S4–S8, we show that the feast or famine effect occurs across a wide range of GxE analysis methods, including but not limited to (1) testing interaction in a linear or linear mixed model (LMM) using standard approaches such as t-tests/Wald tests, likelihood ratio tests, or score tests; (2) doing a combined interaction-association test in a linear model or LMM using standard approaches; (3) testing interaction with multiple environments or multiple SNPs, where these are modeled as random effects in a LMM using standard approaches; (4) performing tests of interaction in a GWIS where significance is assessed using permutation of the trait residuals.
A deeper understanding
We want to emphasize that we are not simply saying that the interaction p-values from a given GWAS are positively correlated. A further key point is that for a particular GWAS, i.e., for a particular choice of $(Y, Z)$, it is, in principle, predictable based on $(Y, Z)$ whether the p-values will be systematically too large, systematically too small or about right. For example, in Fig 1, when we keep the same $(Y, Z)$ as in Panel A and simulate a completely new and independent set of genotypes for testing interaction, as in Panel C, we again see overinflation of the p-values. Similarly, when we keep the same $(Y, Z)$ as in Panel B and simulate a completely new and independent set of genotypes for testing interaction, as in Panel D, we again see underinflation of the p-values. This is further supported by the information in Table 1. Thus, use of standard methods would be expected to result in loss of power (“famine”) in some GWASs (e.g., the $(Y, Z)$ used in Panels A and C) and excessive type 1 error (“feast”) in other GWASs (e.g., the $(Y, Z)$ used in Panels B and D).
To understand why this happens, it is helpful to think about which variables we are conditioning on. The ordinary t-statistic for interaction was developed in a non-GWAS context in which it made sense to condition on $Z$ and $G_j$ and treat $Y$ as random, and in that case, the null conditional distribution of $T_j$ can be proven to be $t_{n-4}$ in the simple setting described above. As a direct consequence of this, it is also true that the unconditional distribution of $T_j$ is $t_{n-4}$. In other words, if we randomly choose a GWAS (i.e., randomly choose $(Y, Z)$), and then randomly choose a null SNP $G_j$ from that GWAS, then $T_j$ has distribution $t_{n-4}$. However, in any particular GWAS, $Y$ and $Z$ are fixed, and only $G_j$ is varying, so it is more appropriate to consider the null conditional distribution of the t-statistic for interaction where we condition on $Y$ and $Z$ and treat $G_j$ as random [39]. We show that even in the simple case described above, conditional on $(Y, Z)$, the distribution of $T_j$ depends on $(Y, Z)$ and is not $t_{n-4}$. In fact, in the slightly more general null hypothesis scenario when $G_j$ has some marginal effect on $Y$ but no interaction with $Z$, we show that not only the null conditional variance of $T_j$ but even its null conditional mean depends on $(Y, Z)$.
These same ideas apply to testing GxG interaction in a GWAS context if we think of setting $Z$ to be the genotype of one particular variant, we exclude $Z$ from the columns of $G$, and we consider a GxG GWAS in which we test for interaction between $Z$ and $G_j$ for $j = 1, \ldots, m$ in model (1) using a t-test for interaction. The upshot is that for some GxG or GxE GWASs, i.e., for some realizations of $(Y, Z)$, use of a $t_{n-4}$ distribution to assess significance of interaction will systematically overstate the evidence for interaction (“feast”), while for other GxG or GxE GWASs, it will systematically understate the evidence for interaction (“famine”). Whether there is feast or famine will depend on the luck of what value of $(Y, Z)$ is observed. This statistical phenomenon could be an important explanation of the difficulty in detecting and replicating epistasis and gene-environment interaction that has long been observed.
With this conditioning explanation in mind, one way of thinking of the “feast or famine” effect is that if we average across many interaction GWASs, then the t-statistic for interaction has correct type 1 error, but its false positives are excessively concentrated in some GWASs, and its false negatives are excessively concentrated in some other GWASs. The good news is that, as we show below, (i) we can accurately predict, based on the observed $(Y, Z)$, whether the GWAS will be “feast” or “famine” or neither, and (ii) our conditioning explanation implies that by doing conditional calculations, such as we describe below, we should in principle be able to alleviate or entirely eliminate this effect.
Why doesn’t ordinary (non-interaction) GWAS have the “feast or famine” phenomenon?
We have argued that when testing interaction in a GWAS context, we are actually conditioning on $Y$ and $Z$ and letting $G_j$ be random, and that the t-statistic for interaction does not have a t-distribution under the null hypothesis when we condition on $(Y, Z)$. By a similar argument, we could point out that in an ordinary (non-interaction) GWAS, we are conditioning on $Y$ and letting $G_j$ be random, rather than the reverse. Does this also cause a problem for the t-statistic for association? The answer is no. The problem we describe does not occur for ordinary (non-interaction) GWAS, but is specific to interaction GWAS, as we now explain.
First, consider the t-statistic for association in an ordinary GWAS. We consider a slightly more general scenario than before in which there may be additional covariates $U$ in the model (where $U$ includes an intercept). Suppose the model we use for testing association is
$$Y = U\alpha + \beta G_j + \epsilon, \tag{2}$$
where $Y$ is $n \times 1$, $G_j$ is $n \times 1$, and $U$ is $n \times k$, all as defined before, $\alpha$ is an (unknown) $k \times 1$ vector, $\beta$ is the unknown scalar parameter of interest, and $\epsilon \sim N(0, \sigma^2 I)$, where $\sigma^2$ is unknown.
Define $P_U = I - U(U^T U)^{-1}U^T$, an $n \times n$ symmetric matrix. We note that the t-statistic for testing $H_0: \beta = 0$ in the model in (2) can be written as
$$T = \frac{\sqrt{n-k-1}\; Y^T P_U G_j}{\sqrt{(Y^T P_U Y)(G_j^T P_U G_j) - (Y^T P_U G_j)^2}}. \tag{3}$$
From this formula, it is clear that the t-statistic is symmetric in $Y$ and $G_j$. The symmetry between $Y$ and $G_j$ in the ordinary (non-interaction) t-statistic for association means that in large samples, the distribution of the t-statistic under the null hypothesis of no association would be approximately the same regardless of whether we conditioned on $G_j$ and let $Y$ be random or conditioned on $Y$ and let $G_j$ be random. The only difference would be that $G_j$ would typically be a Binomial or Bernoulli random variable (genotype) and $Y$ might commonly be a conditionally normal random variable (phenotype). In very small sample sizes, the difference between the underlying distributions of $Y$ and $G_j$ would change the conditional distribution of the t-statistic for association depending on which one you conditioned on, but in typical GWAS sample sizes, the central limit theorem will take effect, and the conditional distribution of the t-statistic for association will be approximately the same in both cases.
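The symmetry claim can be verified numerically. The sketch below is illustrative only: the helper names and the simulated data are assumptions, and the closed form used in `symmetric_t` is the standard partial-correlation expression for the OLS t-statistic after projecting out the covariates (written here for a covariate matrix with $k$ columns including the intercept). It compares the t-statistic from regressing the trait on the genotype, the t-statistic from regressing the genotype on the trait, and the closed form.

```python
import numpy as np

def ols_t_last(response, covariates, predictor):
    """OLS t-statistic for `predictor` in response ~ covariates + predictor."""
    X = np.column_stack([covariates, predictor])
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ response)
    resid = response - X @ beta
    sigma2 = resid @ resid / (n - p)
    return beta[-1] / np.sqrt(sigma2 * XtX_inv[-1, -1])

def symmetric_t(y, g, U):
    """Closed form that treats y and g identically (projection onto U removed)."""
    n, k = U.shape
    P = np.eye(n) - U @ np.linalg.solve(U.T @ U, U.T)
    ypg, ypy, gpg = y @ P @ g, y @ P @ y, g @ P @ g
    return np.sqrt(n - k - 1) * ypg / np.sqrt(ypy * gpg - ypg ** 2)

rng = np.random.default_rng(3)
n = 300                                            # illustrative sample size
covariate = rng.normal(size=n)
U = np.column_stack([np.ones(n), covariate])       # intercept plus one covariate
g = rng.binomial(1, 0.3, size=n).astype(float)     # binary genotype
y = 0.5 * covariate + rng.normal(size=n)           # trait with no genotype effect

t_y_on_g = ols_t_last(y, U, g)    # trait as response
t_g_on_y = ols_t_last(g, U, y)    # roles of trait and genotype swapped
t_formula = symmetric_t(y, g, U)
print(t_y_on_g, t_g_on_y, t_formula)
```

All three numbers coincide up to rounding, which is the algebraic symmetry the text relies on. No analogous identity holds for the interaction t-statistic, because the product term involves the genotype but not the trait.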
This difference between ordinary (non-interaction) GWAS and interaction GWAS can be seen in simulations. We performed 5,000 replicates of a null simulation similar to that in the previous subsection, except that instead of $F_Z$ being Bernoulli(.2), we made $F_Z$ Bernoulli($q_r$) in replicate $r$, where $q_1, \ldots, q_{5000}$ are i.i.d. Unif(.1, .9). In replicate $r$, we tested interaction between $Z$ and $G_j$ in Model (1) for $j = 1, \ldots, 5000$, obtaining interaction t-statistics $T_1, \ldots, T_{5000}$. We also tested association between $Y$ and $G_j$ in a model with no other covariates except intercept, obtaining ordinary association t-statistics as in (3). We obtain the interaction p-values using the $t_{n-4}$ distribution and the ordinary association p-values using the $t_{n-2}$ distribution. In this simulation, when we apply the two-sided ELL test for uniformity at level .05 to the interaction p-values from each replicate, we reject 29.3% of the 5,000 replicates as being significantly non-uniform. In contrast, when we apply the same ELL test to the ordinary association p-values from each replicate, we reject just 4.8% of the 5,000 replicates, which is not significantly different from the nominal 5% rate. This verifies that the ordinary GWAS p-values are showing the expected behavior, while the “feast or famine” effect is only showing up in the interaction p-values. This can be seen also in Fig 2 Panel A, which depicts a histogram of the genomic control inflation factors (GCIFs) for each replicate for the interaction GWASs in red and for the ordinary (non-interaction) GWASs in blue. The narrower blue histogram reflects the expected sampling variability of the GCIF based on 5,000 i.i.d. test statistics. In contrast, the wider red histogram reflects the additional spread due to the “feast or famine” effect, i.e., the fact that conditional on $(Y, Z)$ the p-values may be systematically over- or under-inflated compared to uniform. Fig 2 Panel B is similar but for a simulation in which $F_Z$ is Binomial(2, $q_r$) in replicate $r$ instead of Bernoulli($q_r$) and $F_j$ is Binomial(2, $f_j$) instead of Bernoulli($f_j$).
In S1 Text, a similar pair of histograms can be seen for the case when both $Z$ and $G_j$ are normally distributed.
Fig 2. Histograms of GCIFs for interaction GWAS and for ordinary, non-interaction GWAS.

Each panel is based on 5,000 simulated null GWASs in which $Y$, $Z$ and $G_1, \ldots, G_{5000}$ are simulated independently, with the elements of $Y$ i.i.d. normal. For each GWAS, two different GCIFs are calculated, each based on 5,000 test statistics. The GCIF for ordinary (non-interaction) GWAS uses the genetic association tests between $Y$ and the $G_j$s, and the GCIF for interaction GWAS uses the interaction tests based on Model (1). In each panel, the blue histogram represents the resulting GCIFs for ordinary (non-interaction) association testing, and the red histogram represents the resulting GCIFs for interaction testing. In Panel A, both $Z$ and the $G_j$s are Bernoulli distributed, and in Panel B, both $Z$ and the $G_j$s are Binomial(2) distributed.
Consider the case when $Y$ follows an LMM, i.e., the model is as in (2) except that $\epsilon \sim N(0, \sigma^2 \Sigma)$ with $\Sigma = h^2 K + (1-h^2) I$, where $K$ is a GRM and $h^2$ is the heritability. In this framework, it is also true that the Wald test statistic for association (i.e., the Wald test for $\beta = 0$) is symmetric between $Y$ and $G_j$ when $\Sigma$ is known. Thus, in this case also, ordinary GWAS association testing is essentially not affected by whether we condition on $G_j$ and let $Y$ be random or condition on $Y$ and let $G_j$ be random.
TINGA method for correcting t-statistics for interaction in a GWAS
To address the “feast or famine” effect in interaction GWAS, we propose to correct the interaction t-statistics for a given GWAS by subtracting off the null conditional means of their numerators and dividing by the conditional s.d.s given the $(Y, Z)$ observed for that GWAS. We call this approach TINGA for “Testing INteraction in GWAS with test statistic Adjustment.”
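In the most general case, we consider testing for interaction in the model
$$Y = U\alpha + \beta_Z Z + \beta_G G_j + \beta_{ZG}\,(Z \circ G_j) + \epsilon, \tag{4}$$
where $Z \circ G_j$ and $U$ are as defined before, $\alpha$ is a vector of unknown coefficients, and $\epsilon \sim N(0, \sigma^2 \Sigma)$, where either $\Sigma = I$ in the case of a linear model, or else $\Sigma = h^2 K + (1-h^2) I$, where $K$ is as defined before and $h^2$ is an unknown heritability parameter, in the case of an LMM, and where $\sigma^2$ is an unknown parameter. Then the t-statistic for interaction can be written as
$$T_j = \frac{\sqrt{n-k-3}\; (Z \circ G_j)^T P_M Y}{\sqrt{(Y^T P_M Y)\,\big[(Z \circ G_j)^T P_M (Z \circ G_j)\big] - \big[(Z \circ G_j)^T P_M Y\big]^2}}, \tag{5}$$
where the “M” in $P_M$ stands for “marginal”, and $P_M$ is a symmetric matrix that removes the marginal effects of $Z$, $G_j$ and $U$, where in the simplest case $U$ represents just the intercept, but it may contain additional covariates as needed. We let $X_j$ be the matrix whose columns are $Z$, $G_j$ and the columns of $U$. Then in the case of a linear model, we have $P_M = I - X_j(X_j^T X_j)^{-1}X_j^T$, and in the case of an LMM, we have $P_M = \hat{\Sigma}^{-1} - \hat{\Sigma}^{-1}X_j(X_j^T \hat{\Sigma}^{-1} X_j)^{-1}X_j^T\hat{\Sigma}^{-1}$, where $\hat{\Sigma}$ is $\Sigma$ with the estimated value of $h^2$ plugged in.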
In the LMM context, the test based on $T_j$ is commonly called the “Wald test.” In fact, the ordinary t-test for interaction is also a Wald test, so this term is not a useful way of distinguishing the LMM-based test from the ordinary one. We refer to the test based on $T_j$ as the “t-test” in both cases, and, when needed, we specify whether it is performed in an LMM or a linear model.
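For both the linear and LMM cases, we define the numerator of the t-statistic to be
$$N_j = (Z \circ G_j)^T P_M Y. \tag{6}$$
Then the interaction t-statistic in (5) can be rewritten as
$$T_j = \frac{N_j - E_0(N_j \mid Z, G_j)}{\sqrt{\widehat{\mathrm{Var}}_0(N_j \mid Z, G_j)}}, \tag{7}$$
where both $E_0$ and $\mathrm{Var}_0$ are calculated based on Model (4), $H_0$ has the additional assumption $\beta_{ZG} = 0$, and $\widehat{\mathrm{Var}}$ denotes estimated variance. For testing interaction in a GWAS context, we propose to replace $T_j$ by a “corrected” statistic
$$T_j^{\mathrm{corr}} = \frac{N_j - \hat{E}_0(N_j \mid Y, Z)}{\sqrt{\widehat{\mathrm{Var}}_0(N_j \mid Y, Z)}}, \tag{8}$$
where the difference from Eq (7) is that we condition on $(Y, Z)$ instead of on $(Z, G_j)$. The remaining challenge of the methods development is to obtain appropriate estimators $\hat{E}_0(N_j \mid Y, Z)$ and $\widehat{\mathrm{Var}}_0(N_j \mid Y, Z)$. We perform the following steps: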
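1. We approximate $N_j$ by $\tilde{N}_j$, where $\tilde{N}_j$ is quadratic in $G_j$.
2. We calculate $E_0(\tilde{N}_j \mid Y, Z)$ and approximate $\mathrm{Var}_0(\tilde{N}_j \mid Y, Z)$ as functions of $E_0(G_j \mid Y, Z)$ and $\mathrm{Var}_0(G_j \mid Y, Z)$.
3. We calculate $E_0(G_j \mid Y, Z)$ and $\mathrm{Var}_0(G_j \mid Y, Z)$ theoretically based on a suitable model.
4. We obtain estimates $\hat{E}_0(G_j \mid Y, Z)$ and $\widehat{\mathrm{Var}}_0(G_j \mid Y, Z)$ for the quantities in step 3.
5. We plug the estimates from step 4 into the expressions for $E_0(\tilde{N}_j \mid Y, Z)$ and $\mathrm{Var}_0(\tilde{N}_j \mid Y, Z)$ from step 2 to obtain $\hat{E}_0(N_j \mid Y, Z)$ and $\widehat{\mathrm{Var}}_0(N_j \mid Y, Z)$, respectively, and calculate $T_j^{\mathrm{corr}}$ in (8).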
The quadratic approximation in step 1 is based on an asymptotic approximation and is detailed in S1 Text. The calculation of the null conditional mean of the approximated numerator, $E_0(\tilde{N}_j \mid Y, Z)$, in step 2 is completely straightforward. To approximate its null conditional variance, $\mathrm{Var}_0(\tilde{N}_j \mid Y, Z)$, we perform a variance calculation that is exact for the case when $G_j \mid Y, Z$ has a normal distribution and is otherwise approximate (see S1 Text). Step 5 is completely straightforward given the other steps. Here, we give more details on steps 3 and 4.
For the conditional moment calculations in step 3, to model , we consider two different modeling approaches: a normal approximation and a discrete model. For the normal approximation, we assume a normal regression model for , i.e., we take , where , or, more generally, where consists of the intercept and any confounding covariates that are in , we take , with , , and unknown. For the discrete model approach, we instead assume a discrete model for , where we assume that conditional on , the entries of the vector , call them , are independent with for all choices of , where these may also depend on as needed. Since is a genotype, we will have when the genotypes are from a diploid organism or when the genotypes are from a haploid organism or inbred line. For the latter case, we can use a logistic regression model for , and for the former case a binomial regression model. In the A. thaliana dataset we analyze, both and are binary and there are no additional confounding covariates, in which case the discrete model can simply be specified in terms of the two parameters and , without the need for a logistic model.
For the conditional moment calculations in step 3, we also consider two different modeling approaches for $Y \mid G_j, Z$. The first approach is to assume that Model (4) holds, which we call the homoscedastic model. The second approach assumes a more general and robust version of Model (4) in which we allow a specific type of heteroscedasticity, namely, we allow $\mathrm{Var}(Y \mid G_j, Z)$ to depend quadratically on $Z$, and we call this the heteroscedastic model. In an interaction GWAS, it can potentially be important to consider this specific type of heteroscedasticity, because it arises naturally in a model in which $Z$ interacts with some other variable in a linear model or LMM for $Y$, even if it doesn’t interact with $G_j$ [39–41,43]. That is, suppose the true model for $Y$ could be written
$$Y = U\alpha + \beta_Z Z + \beta_G G_j + \beta_W W + \beta_{ZW}\,(Z \circ W) + \epsilon, \tag{9}$$
where $Y$, $U$, $Z$, $G_j$, $\alpha$, $\beta_Z$, $\beta_G$, and $\epsilon$ are as before, $\beta_W$ and $\beta_{ZW}$ are unknown scalar coefficients, and $W$ is some additional variable that might or might not be observed, is independent of $G_j$, and that interacts with $Z$. In other words, from the point of view of testing for interaction between $Z$ and $G_j$, this is a null model, but it allows for the possibility that $Z$ does interact with some other variable, $W$, such as a SNP on another chromosome, or a non-genetic variable. Then in this model, if we calculate $\mathrm{Var}(Y \mid Z, G_j)$, we find that it depends on $Z$ quadratically. In other words, we have the specific type of heteroscedasticity described above. This motivates the heteroscedastic model for $Y \mid G_j, Z$.
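The quadratic dependence can be seen in a short worked calculation. The following is a sketch under simplifying assumptions that are not stated in the text: a linear model with independent errors of variance $\sigma^2$, an unobserved variable $W$ with i.i.d. entries of mean $\mu_W$ and variance $\sigma_W^2$ that are independent of the fixed variable $Z$, the genotype $G_j$, and the errors, and centering of $W$ at $\mu_W$ rather than at its sample mean. The coefficient names $\beta_W$ and $\beta_{ZW}$ are illustrative notation for the main effect of $W$ and for its interaction with $Z$.

```latex
% Contribution of W to individual i's trait value (all other terms are
% fixed given Z and G_j, apart from the error \epsilon_i):
%   \beta_W W_i + \beta_{ZW} (Z_i - \bar{Z})(W_i - \mu_W)
%     = [\beta_W + \beta_{ZW}(Z_i - \bar{Z})] W_i - \beta_{ZW}(Z_i - \bar{Z})\mu_W .
\begin{align*}
\mathrm{Var}(Y_i \mid Z, G_j)
  &= \mathrm{Var}\!\big( [\beta_W + \beta_{ZW}(Z_i - \bar{Z})]\, W_i \big)
     + \mathrm{Var}(\epsilon_i) \\
  &= [\beta_W + \beta_{ZW}(Z_i - \bar{Z})]^2 \sigma_W^2 + \sigma^2 \\
  &= \sigma^2 + \beta_W^2 \sigma_W^2
     + 2\beta_W\beta_{ZW}\sigma_W^2 (Z_i - \bar{Z})
     + \beta_{ZW}^2 \sigma_W^2 (Z_i - \bar{Z})^2 .
\end{align*}
```

The residual variance is therefore a quadratic function of $Z_i$ whenever $\beta_{ZW} \neq 0$, and it reduces to a constant when $\beta_{ZW} = 0$, which recovers the homoscedastic case.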
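Given the modeling assumptions described above, we now consider the calculation of $E_0(G_j \mid Y, Z)$ and $\mathrm{Var}_0(G_j \mid Y, Z)$ in step 3. When the normal approximation is used for $G_j \mid Z$, then with either the homo- or heteroscedastic model for $Y \mid G_j, Z$, we obtain a multivariate normal distribution for $G_j \mid Y, Z$, from which $E_0(G_j \mid Y, Z)$ and $\mathrm{Var}_0(G_j \mid Y, Z)$ can be easily computed using standard properties of multivariate normal. When a discrete model is used for $G_j \mid Z$, then with either the homo- or heteroscedastic model for $Y \mid G_j, Z$, we can apply a Bayes rule calculation to obtain the discrete distribution $P(G_j \mid Y, Z)$. For example, if we assume unrelated individuals, then conditional on $(Y, Z)$, the entries $G_{1j}, \ldots, G_{nj}$ of $G_j$ are independent with
$$P(G_{ij} = g \mid Y_i, Z_i) = \frac{f(Y_i \mid G_{ij} = g, Z_i)\, P(G_{ij} = g \mid Z_i)}{\sum_{g'} f(Y_i \mid G_{ij} = g', Z_i)\, P(G_{ij} = g' \mid Z_i)}, \tag{10}$$
where $f$ is a univariate normal density function.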
Approximate null conditional mean and variance of interaction t-statistic numerator
To better understand the surprising behavior of the t-statistic for interaction in a GWAS setting under the null hypothesis, it can be helpful to examine approximate analytical formulas for the null conditional mean and variance of the t-statistic numerator given , where and are the variables that remain fixed for the GWAS. If we instead took the more common approach of conditioning on , we would obtain zero for the null conditional mean and for the null conditional variance.
When we use the normal approximation for , use a linear model for instead of an LMM, and assume no covariates, then it becomes possible to obtain approximate analytical formulas for and . We obtain
| (11) |
where , , , , and where for any 3 vectors and of length , we define and .
The motivation for this notation is that “p” denotes “parameters”, and and are functions only of parameters; “d” denotes data, and the subscript “2” in denotes that is a function of only the observed sample second moments of and not of any parameters. The subscript “3” in and denotes that they are functions of only the observed sample third and second moments of and not of any parameters. In the special case when , we get
| (12) |
These approximate formulas can serve as useful heuristics about when the null conditional expectation of the interaction t-statistic might or might not be approximately zero. From this approximation, we get that if both and are 0, which would happen if is independent of , then the null conditional expectation should be 0. We can also see that if (1) is multivariate normal with arbitrary correlation or (2) and are independent (with any distribution), then in sufficiently large samples, and will both be close to 0, so we expect the null conditional mean to be close to 0 in sufficiently large samples. However, if is heteroscedastic with respect to , or if has a skewed distribution and and are correlated, then the null conditional mean could be non-zero when is correlated with or , even in large samples.
Using the normal approximation, the approximate null conditional variance is
| (13) |
where (with as defined in Eq (11)) is a function of only parameters; , , and are functions of only the observed sample fourth and second moments of , with , , and ; and , , and are “mixed” terms that are functions of both parameters and data, but that depend on the data only through the observed sample second moments of , with , and . When , we further get
| (14) |
which is almost exclusively a function of the observed second and fourth sample moments of , except for the parameter .
The above formulas can be useful as heuristics, but when has a discrete distribution, we instead use a discrete model for , and the null conditional mean and conditional variance based on that do not lend themselves to a simple closed-form expression. Furthermore, with covariates or in a LMM, the results are also more involved. Finally, the variance expression we give above is the one we obtain in the special case when we assume , and, more generally, we usually prefer to do a Wald test, in which case we need an estimate of the conditional variance under the alternative model, which is also a more involved calculation.
Estimation step
In step 4, we need to obtain estimates and of the quantities we derived theoretically in step 3. In the case that the model for , is the linear, homoscedastic model with , then when we use the normal approximation for , we can fit ordinary least squares (OLS) regression of on and use the fitted values as and as . Similarly, when , is the linear, homoscedastic model with , is the discrete model, and is binary (or binomial), we can use logistic (or binomial) regression of on to obtain and . However, to obtain under the alternative (which can allow us to do a more powerful Wald-type test instead of a score-type test), and for all other modeling cases, we instead use some version of the Bayes rule calculation, where we fit the model for to obtain its parameters, fit the model for , to obtain its parameters, and then plug the estimated parameters into the Bayes rule calculation. To allow for heteroscedasticity in step 4, we first regress out of to obtain , and then we replace by in the model for interaction, and allow for heteroscedasticity in the model, where the variance can depend on and .
Diagnostic ratio for identifying GWISs in which the “feast or famine” effect occurs
When we consider how to predict the likely “feast or famine” effect for a GWAS based on a given observed in data, a natural starting point is to consider the mean and variance adjustments to the t-statistic obtained using the TINGA method. These adjustments change with the variant as we scan the genome, and they depend on some modeling choices. For the diagnostic, we could choose rather simple modeling assumptions (normal regression models for on and for on ) and consider a hypothetical that is completely independent of , so that the mean adjustment reduces to 0, and we are left with only a variance adjustment that is now a function of various sample second and fourth central moments (including cross-moments) of . In this case, the ratio of to the square of the denominator of the t-statistic, call it , would seem to be a natural measure of the feast or famine effect (and the TINGA adjustment to the test statistic would be to multiply the t-statistic by ), where large would be indicative of “feast” and small of “famine”. However, the denominator of the t-statistic also involves , so we approximate it by taking a first-order approximation to its conditional expectation given under the simple modeling assumptions we chose above.
The resulting ratio is given by
This ratio turns out to have a direct interpretation as one of the 5 components of co-kurtosis between and , where denotes the vector of residuals from simple linear regression of on . If we abbreviate as , then we have that . Informally, the sample co-kurtosis is generally considered to measure the extent to which the variables and are observed to have extreme observations or outliers, and whether the extreme observations for and tend to co-occur in the same individuals. However, as we show in the Results section, even just the ordinary sampling variability in this ratio that occurs in different replicates is highly predictive of the FoF effect for an interaction GWAS using the given . Note that the diagnostic ratio does not contain any information on interaction between and and any specific .
For the case when there are additional covariates in the linear model for , beyond just intercept, , and , we can extend the definition of to
where , where is the residual of after regressing out where consists of the intercept and any additional covariates but does not include or , and where .
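To make the co-kurtosis interpretation concrete, the sketch below computes the symmetric (2,2) component of the sample co-kurtosis between a variable and the residuals from a simple linear regression on that variable. It is an assumption of this sketch that this component, with this normalization, corresponds to the diagnostic ratio described above; the function name `cokurtosis_22` and the argument names are hypothetical.

```python
import numpy as np

def cokurtosis_22(e, y):
    """Symmetric (2,2) sample co-kurtosis between e and the residuals of y on e.

    Assumed illustrative form: mean(e_c^2 * r^2) / (mean(e_c^2) * mean(r^2)),
    where e_c is centered e and r is the residual from regressing y on (1, e).
    """
    e = np.asarray(e, dtype=float)
    y = np.asarray(y, dtype=float)
    ec = e - e.mean()
    yc = y - y.mean()
    slope = np.dot(ec, yc) / np.dot(ec, ec)
    r = yc - slope * ec  # residuals of y after regressing out intercept and e
    return np.mean(ec**2 * r**2) / (np.mean(ec**2) * np.mean(r**2))

rng = np.random.default_rng(0)
e_demo = rng.standard_normal(100_000)
y_demo = 0.2 * e_demo + rng.standard_normal(100_000)
# For independent residuals and e, this quantity is close to 1 in large samples.
ratio_demo = cokurtosis_22(e_demo, y_demo)
print(ratio_demo)
```

In this form, values above 1 arise when large squared deviations of the two inputs tend to co-occur in the same individuals, and values below 1 when they tend not to.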
Additional methodological considerations
In the special case when at least one of and is discrete, it is natural to place certain constraints on when one would or would not perform any sort of interaction test. For example, if both and are binary and are perfectly correlated, then there would typically be zero information in the data on interaction between them as a predictor of , and if they are almost perfectly correlated, then the amount of information available on interaction would be quite low. In the case when and are both binary, we can think of constructing a 2 × 2 table of counts of the four possible observed values of () in the data, and we require the minimum cell count (MCC), i.e., the smallest of the counts of the four possible observed values, to be at least 5 in order to perform the interaction t-test.
Step 4 of the TINGA method requires some additional parameter estimation compared to the interaction t-test. If all variables were continuous, then with typical GWAS sample sizes, the estimation of a handful of additional parameters would pose little problem for the inference. When and are both binary, however, then we require MCC ≥ 20 in order to perform the additional estimation in step 4. Therefore, our TINGA method uses a mixed strategy in that case, in which, when 5 ≤ MCC < 20 we use the interaction t-test, and when MCC ≥ 20, we use the adjustment strategy. All the TINGA results for the case when both and are binary use this “mixed” strategy.
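The minimum cell count rule and the mixed strategy can be written down directly. This is a minimal sketch for two binary variables coded 0/1; the function names and the returned labels are hypothetical, while the thresholds 5 and 20 are the ones stated above.

```python
from collections import Counter

def min_cell_count(g, e):
    """Smallest count among the four cells of the 2 x 2 table of (g, e)."""
    counts = Counter(zip(g, e))
    return min(counts.get((a, b), 0) for a in (0, 1) for b in (0, 1))

def choose_test(g, e, t_min=5, tinga_min=20):
    """Mixed strategy: skip, plain interaction t-test, or TINGA adjustment."""
    mcc = min_cell_count(g, e)
    if mcc < t_min:
        return "skip"      # too little information to test interaction
    if mcc < tinga_min:
        return "t-test"    # 5 <= MCC < 20: unadjusted interaction t-test
    return "TINGA"         # MCC >= 20: adjusted test

# Example: cells (0,0)=50, (0,1)=30, (1,0)=25, (1,1)=8, so MCC = 8.
g = [0] * 80 + [1] * 33
e = [0] * 50 + [1] * 30 + [0] * 25 + [1] * 8
print(min_cell_count(g, e), choose_test(g, e))
```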
Specifically for the problem of epistasis detection, it has been noted that in the presence of an untyped causal variant, two typed variants in strong linkage disequilibrium that form a haplotype that tags the untyped variant could exhibit false epistasis [34]. Therefore, in detection of epistasis, we only test for epistasis between variants and if their sample correlation is close to 0. (In our data analysis we use a cut-off of .1 for absolute value of correlation.)
For the problem of epistasis detection, for a given pair of SNPs, there are two possible adjustments, one based on conditioning the test on and the other based on conditioning the test on . We propose the strategy of conditioning on the less polymorphic of and , because that should result in more information available for the statistical test leading to a more powerful test. We test this strategy in simulations.
Simulations
Type 1 error simulations I
For the first set of type 1 error simulations, we simulate 10⁵ replicate GWASs each with individuals. In each replicate, we simulate the elements of as i.i.d. draws from Bin(2, 0.2), then center and standardize them, and we assume that explains 1% of the variance of the trait. Among a much larger set of SNPs in the GWAS, we assume that there are 149 associated SNPs, consisting of 49 SNPs that each explain 1% of the variance of the trait and 100 SNPs that each explain 0.5% of the variance of the trait. Conditional on allele frequency , the genotype of the th SNP is simulated as i.i.d. draws from , where are i.i.d. Uniform(0.2,0.8), and where SNPs are taken to be unlinked. Then each is centered and standardized based on (i.e., is subtracted off and the result divided by ). Then is simulated as
| (15) |
where , exactly 49 of the ’s have magnitude equal to with a random sign (positive or negative) given to each , exactly 100 of the ’s have magnitude with a random sign, and the remaining ’s are 0. (The value of the overall mean has no impact and can be taken to be 0.) In each replicate, we calculate the diagnostic ratio based on , and for each , , we test for interaction between and in the model of Eq 1 using (i) the usual t-test for interaction (ii) a TINGA-adjusted t-test in which we condition on and (iii) the HC3 method [44], which is a heteroscedasticity-corrected t-test.
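A stripped-down version of one null replicate, with the usual interaction t-test applied SNP by SNP, might look as follows. This is only a sketch under simplifying assumptions: the sample size `n = 1000`, the number of SNPs `m = 200`, and the omission of the 149 associated SNPs are choices made here for brevity, and standardizing a Bin(2, p) genotype by its theoretical mean 2p and standard deviation sqrt(2p(1-p)) is assumed. All function names are hypothetical.

```python
import numpy as np

def standardized_binomial(rng, n, p):
    """Draw Bin(2, p) genotypes and standardize with the theoretical mean and sd."""
    g = rng.binomial(2, p, size=n).astype(float)
    return (g - 2 * p) / np.sqrt(2 * p * (1 - p))

def interaction_t(y, e, g):
    """OLS t-statistic for the e*g coefficient in y ~ 1 + e + g + e*g."""
    X = np.column_stack([np.ones_like(y), e, g, e * g])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    sigma2 = resid @ resid / df
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[3] / np.sqrt(cov[3, 3])

rng = np.random.default_rng(1)
n, m = 1000, 200                       # illustrative sizes, not the paper's
e = standardized_binomial(rng, n, 0.2)
# Null trait: e explains 1% of the variance, no SNP effects, no interaction.
y = 0.1 * e + np.sqrt(0.99) * rng.standard_normal(n)
freqs = rng.uniform(0.2, 0.8, size=m)  # one allele frequency per SNP
t_stats = np.array(
    [interaction_t(y, e, standardized_binomial(rng, n, p)) for p in freqs]
)
print(t_stats.mean(), t_stats.var())
```

Because `y` and `e` are held fixed while the SNPs vary, the spread of `t_stats` within one replicate is the quantity whose departure from 1 reflects the "feast or famine" effect.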
Power simulations I
For the first set of power simulations, we simulate 10⁶ replicate GWASs each with individuals. In each replicate, we simulate and as above. Then is simulated as
| (16) |
where , and are as above, is the index of one of the SNPs for which , and . In each replicate, we calculate the diagnostic ratio based on , and we test for interaction between and in the model of Eq 1 using (i) the usual t-test for interaction and (ii) a TINGA-adjusted t-test in which we condition on .
Replicability
To estimate replicability as a function of the diagnostic ratio, we first binned the observed diagnostic ratios from the type 1 error and power simulations. For simulated GWASs with diagnostic ratio in the ith bin, we calculated replicability as
where is the probability that a tested interaction is a true null, which we took to be 0.9998, is the empirical type 1 error at level 10⁻⁴ observed for the simulated GWASs with diagnostic ratio in the ith bin, is the empirical power observed for the simulated GWASs with diagnostic ratio in the ith bin, T1E is the overall average type 1 error at level 10⁻⁴, and pow is the overall average power. Thus, represents the probability that an interaction detected for a GWAS with diagnostic ratio in the ith bin would be replicated in another randomly chosen GWAS.
“Feast” effect due to being associated with a causal variant: example of Hemani et al. 2014 and 2021
Hemani et al. 2014 [45] studied epistasis in expression traits in humans, identifying 501 significant pairwise interactions between common SNPs affecting expression of 238 genes. In the case of some of these significant pairwise interactions, it was later noted that a third SNP associated with the trait could explain all of the pairwise interaction [37]. While this type of effect is well-known [34,37] in the case when the two putatively interacting SNPs are proximal and in LD with one another, Hemani et al. 2021 [38] found that this effect had occurred in a large number of cases in which one of the two putatively interacting SNPs was cis to the gene whose expression was being influenced, the other putatively interacting SNP was on a different chromosome, and the third SNP whose effect could explain the interaction was cis to the gene and in LD with the first SNP, a type of effect which was unexpected [38]. Motivated by these findings, Hemani et al. 2021 performed simulations in which an untyped is causal for , and an observed SNP that is in LD with and is conditionally independent of given is tested for interaction with SNPs , , none of which are on the same chromosome as . In the simulations, they find the effect we have called the “feast” effect, i.e., the genomic control inflation factor and type 1 error of the GWIS varied across replicates and were on average inflated. They report, “Here we show... a previously unrecognized property of the gold-standard statistical test to detect interactions, namely that the presence of imperfectly tagged additive causal variants can lead to phantom epistasis between unlinked markers. Therefore, the false positive rate in studies that use the test may not be sufficiently controlled and, to our knowledge, no current statistical fix exists for this problem.”
We perform simulations similar to those of Hemani et al. 2021. Specifically, we simulate 100 GWAS replicates, each based on individuals. In each replicate, we take the haplotype frequencies of to be those of (rs67903230, rs13069559) observed in the data of Hemani et al. 2014, and we take . Following Hemani et al. 2021, we sample the proportion of variance of explained by from Uniform (0, .5) and test for interaction between and each , . For each replicate, we calculate the diagnostic ratio based on , the p-values for interaction based on the ordinary t-test, and the p-values for interaction based on the TINGA-adjusted t-test.
Type 1 error and power simulations II
In the next set of simulations, we allow for a linear mixed model for instead of a linear model, and we consider the case where both and are Bernoulli distributed, because that is the situation in the A. thaliana dataset. We compare the performance of the t-test and our methods in 3 simulation settings.
Non-GRM case
In each replicate, we simulate a Bernoulli and Bernoulli for independent individuals, and simulate under the alternative model (17)
| (17) |
where , , , have marginal effects on and only has interactive effect with on (setting 3 in S1 Text).
GRM case 1: unrelated individuals; accounting for additive polygenic effects
In this case, ’s are simulated in the same way as in the Non-GRM case. is simulated from the same model (17), except with a GRM as an extra variance component, giving model (18)
| (18) |
(setting 4 in S1 Text).
GRM case 2: population structure with 3 sub-populations
We also consider a setting with population structure in which there are 3 sub-populations (setting 11 in S1 Text). Both and ’s are still Bernoulli distributed and are simulated within the 3 sub-populations. , , , and have effects on , and the model for also includes indicators of the sub-populations as covariates.
Results
Simulations
Diagnostic ratio predicts the “feast or famine” effect
In Fig 3, the type 1 error of the ordinary t-test and of the TINGA-corrected t-test are plotted against the diagnostic ratio. The plot clearly demonstrates the “feast or famine” effect, i.e., the type 1 error of the t-test varies systematically and significantly across GWISs, with some GWISs having type 1 error that is too small and others having type 1 error that is too large, where this effect can be accurately predicted in advance by the diagnostic ratio calculated from , without using any information on the genotypes.
Fig 3. Empirical type 1 error vs. diagnostic ratio for the standard t-test for interaction and the TINGA adjusted t-test for interaction.
10⁵ simulated GWASs are divided into bins based on their observed diagnostic ratio, and the average type 1 error in the bin is plotted against the average diagnostic ratio in the bin, for the standard t-test and the TINGA adjusted t-test that conditions on . Each vertical line segment represents a 95% confidence interval for the type 1 error in the bin, for the given testing method. The y-axis is empirical type 1 error at nominal level 10⁻⁴ divided by 10⁻⁴, so, e.g., a y-value of 2 corresponds to an empirical type 1 error of 2e-04. The y-axis is logarithmically scaled.
“Feast or famine” adversely affects type 1 error, power, and replicability
In Fig 4, power of the t-test and of the TINGA-adjusted t-test are plotted against the diagnostic ratio, while in Fig 5, the replicability of significant interaction results detected using the t-test and detected using the TINGA-adjusted t-test are plotted against the diagnostic ratio. From the results, we can see that for the ordinary t-test for interaction, the “feast or famine” effect results in systematic inflation or deflation of type 1 error across SNPs within a GWAS (Fig 3), reduced power in the famine GWASs (Fig 4), and lack of replicability of interaction detected in the feast GWASs (Fig 5), where an excessively low diagnostic ratio predicts deflated type 1 error and reduced power for the GWAS and an excessively high diagnostic ratio predicts inflated type 1 error and reduced replicability of the detected interaction.
Fig 4. Power vs. diagnostic ratio for the standard t-test for interaction and the TINGA adjusted t-test for interaction.
10⁶ simulated GWASs are divided into bins based on their observed diagnostic ratio, and the average observed power at level 1e-04 in the bin is plotted against the average diagnostic ratio in the bin, for the standard t-test and the TINGA adjusted t-test that conditions on . Each vertical line represents a 95% confidence interval for the average power of the GWASs with diagnostic ratios in the given bin, based on the given testing method.
Fig 5. Replicability vs. diagnostic ratio for the standard t-test for interaction and the TINGA adjusted t-test for interaction.
10⁶ simulated GWASs are divided into bins based on their observed diagnostic ratio, and the average replicability at level 10⁻⁴ in the bin is plotted against the average diagnostic ratio in the bin, for the standard t-test and the TINGA adjusted t-test that conditions on . Replicability for a given testing method in a given GWAS is defined as the probability that a significant result at level 10⁻⁴ in the given GWAS would be replicated at level 10⁻⁴ in an independent GWAS. Each vertical line represents a 95% confidence interval for the average replicability of GWASs with diagnostic ratios in the given bin, based on the given testing method.
Conditioning on can correct the “feast or famine” effect
In Fig 3, we can see that the type 1 error of the TINGA-adjusted t-test does not vary significantly from the nominal level across GWISs. Similarly, the power (Fig 4) and replicability (Fig 5) appear to stay approximately constant across GWISs when the TINGA-adjusted t-test is used. This shows that using appropriate conditioning (i.e., conditioning on instead of on ) in the analysis effectively eliminates the “feast or famine” effect, as we predicted theoretically.
“Feast” effect due to being associated with a causal variant: example of Hemani et al. 2014 and 2021
The example of Hemani et al. 2014 and 2021 demonstrates a type of model misfit that can lead to a particularly strong “feast” effect for the standard interaction t-test. In Fig 6, we can see that the smallest p-values based on the t-test can be more than 6 orders of magnitude too small. In Figs 7 and 8, we can see that the genomic control inflation factor based on the t-test can range all the way up to 3.5 in our simulations. In Fig 9, we can see that the type 1 error (at nominal level 10⁻³) for the t-test in a given GWIS can range as high as 0.07 or higher across the simulated GWISs. These results are in close agreement with previous work [38].
Fig 6. QQ-plot of p-values for the standard t-test for interaction and the TINGA adjusted t-test for interaction when is associated with a causal variant.
Each QQ-plot is based on 10⁶ interaction p-values from 100 GWISs simulated under the null hypothesis of no interaction. The shaded region represents the 95% acceptance region based on equal local levels [42] for a test of the null hypothesis that the p-values are i.i.d. Unif(0,1) under the null hypothesis.
Fig 7. Histogram of genomic control inflation factors for the standard t-test for interaction and the TINGA adjusted t-test for interaction when is associated with a causal variant.
Each histogram is based on 100 GWISs simulated under the null hypothesis of no interaction.
Fig 8. Genomic control inflation factor vs. diagnostic ratio for the standard t-test for interaction and the TINGA adjusted t-test for interaction when is associated with a causal variant.
Results from 100 simulated GWISs are plotted under the null hypothesis of no interaction. The lines and are shown in black.
Fig 9. Type 1 error vs. diagnostic ratio for the standard t-test for interaction and the TINGA adjusted t-test for interaction when is associated with a causal variant.
Results from 100 simulated GWISs are plotted under the null hypothesis of no interaction. Type 1 error for each GWIS is estimated based on 10⁴ unassociated SNPs. In each case, the vertical line represents the 95% confidence interval for type 1 error. The nominal type 1 error of 10⁻³ is represented by a horizontal black line.
Furthermore, we can see that the magnitude of the “feast” effect is very accurately predicted by the diagnostic ratio, with the diagnostic ratio itself appearing to be an unbiased estimate of the genomic control inflation factor of the t-test (Fig 8), while the type 1 error of the t-test also appears to be a monotonic function of the diagnostic ratio (Fig 9).
Finally, the results show clearly that conditioning on corrects the feast effect. In Fig 6, the TINGA-adjusted t-statistic can be seen to have correctly calibrated p-values under the null hypothesis. In Fig 7, the genomic control inflation factor of the TINGA-adjusted t-statistic appears to have a symmetric distribution closely centered on 1, which is the expected behavior under the null hypothesis. In Fig 8, the genomic control inflation factor of the TINGA-adjusted t-statistic appears to be close to 1 for all observed values of the diagnostic ratio. In Fig 9, the type 1 error of the TINGA-adjusted t-statistic is seen to be not significantly different from the nominal level, across all observed values of the diagnostic ratio.
Type 1 error and power simulations II
We run each of the 3 simulation settings (Non-GRM, GRM case 1 and GRM case 2) 5000 times independently to mimic 5000 independent GWASs. For , , , we test at level 0.05. For 5000 replicates, the 95% confidence interval for the empirical type I error of a correctly calibrated test is (0.0440, 0.0560). The results are in Table 2. Since the type I error rates are obtained across multiple GWASs, both the uncorrected and corrected methods have reasonable type I error.
Table 2.
Type I error at level 0.05
| | Unadjusted | TINGA |
|---|---|---|
| Non-GRM | 0.0574* | 0.0544 |
| GRM case1 | 0.0560 | 0.0516 |
| GRM case2 | 0.0418* | 0.0488 |
| Non-GRM | 0.0526 | 0.0516 |
| GRM case1 | 0.0548 | 0.0548 |
| GRM case2 | 0.0426* | 0.0502 |
| Non-GRM | 0.0536 | 0.0518 |
| GRM case1 | 0.0560 | 0.0544 |
| GRM case2 | 0.0494 | 0.0590* |
Type I error of testing for the interaction between and , , , over 5000 replicates. Both , ’s are Bernoulli, , , , and () have effects on . Methods are the Bernoulli version.
* indicates a type 1 error that is significantly different from the nominal level at level .05.
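The interval (0.0440, 0.0560) and the asterisks in Table 2 follow from the normal approximation to the binomial distribution of the number of rejections. A minimal sketch of that arithmetic is given below; the function names are hypothetical and the critical value 1.96 is the usual two-sided 95% normal quantile.

```python
import math

def nominal_rate_interval(alpha=0.05, n_rep=5000, z=1.96):
    """Normal-approximation 95% interval for the empirical rejection rate
    of a test whose true type I error equals alpha."""
    half_width = z * math.sqrt(alpha * (1 - alpha) / n_rep)
    return alpha - half_width, alpha + half_width

def differs_from_nominal(rate, alpha=0.05, n_rep=5000):
    """True if an observed rejection rate falls outside the interval."""
    lo, hi = nominal_rate_interval(alpha, n_rep)
    return rate < lo or rate > hi

lo, hi = nominal_rate_interval()
print(f"{lo:.4f} {hi:.4f}")  # 0.0440 0.0560

# First block of Table 2, unadjusted column: 0.0574*, 0.0560, 0.0418*
print([differs_from_nominal(r) for r in (0.0574, 0.0560, 0.0418)])
# [True, False, True]
```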
Table 3 compares the power of the uncorrected and corrected methods for detecting interaction between and . Fig 10 shows the power curves for the first two simulation settings. We can see that TINGA consistently has higher power than the unadjusted approach. For the results when and have other distributions, see S1 Text.
Table 3.
Power at different p-value cutoffs
| p-value cutoff 10⁻⁵ | Unadjusted | TINGA |
|---|---|---|
| Non-GRM | 0.7046 | 0.7346 |
| GRM case1 | 0.7064 | 0.7322 |
| GRM case2 | 0.7222 | 0.8168 |
| p-value cutoff 10⁻⁶ | | |
| Non-GRM | 0.5216 | 0.5748 |
| GRM case1 | 0.5308 | 0.5810 |
| GRM case2 | 0.5566 | 0.6526 |
Power of testing for the interaction between and , over 5000 replicates. Both , ’s are Bernoulli, , , , and () have effects on . Methods are the Bernoulli version.
Fig 10. Power curves.
x-axis is the type I error rates for testing , y-axis is the power for testing . Both and are independent Bernoulli. , , , and have effects on . (a) non-GRM case (b) GRM case 1
Simulation under null: check p-values within a GWAS
In this part, we consider the distribution of the GCIF within a GWAS. For each replicate GWAS, the sample size is 1000, and we simulate and ’s independently. We consider the following cases: (1) Both and ’s are Bernoulli, and is simulated under a linear model (setting 1 in S1 Text). (2) Both and ’s are Binomial(2), and is simulated under a linear model (setting 1 in S1 Text). (3) Both and the ’s are normal, and is simulated under a linear model. (4) Both and ’s are Bernoulli, and is simulated under a LMM (setting 2 in S1 Text). Then for every GWAS, we calculate a GCIF based on the p-values for interaction. Fig 11 gives the histograms of the resulting GCIFs. From this we can see that our methods make the GCIF much more concentrated around 1.
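For readers who want to reproduce a GCIF from a vector of interaction p-values, the sketch below uses the standard definition: the median of the 1-df chi-square statistics implied by the p-values, divided by the median of the chi-square distribution with 1 degree of freedom. It is assumed here that this standard definition is the one intended; the function name is hypothetical, and only the Python standard library is used.

```python
from statistics import NormalDist, median

def genomic_control_inflation(pvalues):
    """Genomic control inflation factor from two-sided p-values in (0, 1]."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to the chi-square(1) statistic z^2,
    # where z is the normal quantile at p / 2.
    chi2_stats = [nd.inv_cdf(p / 2) ** 2 for p in pvalues]
    # Median of chi-square(1) is the squared 0.75 normal quantile (about 0.455).
    chi2_median = nd.inv_cdf(0.75) ** 2
    return median(chi2_stats) / chi2_median

# Evenly spread p-values mimic a well-calibrated null and give a GCIF near 1.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
print(genomic_control_inflation(uniform_p))

# P-values that are systematically too small ("feast") give a GCIF above 1.
print(genomic_control_inflation([p**2 for p in uniform_p]))
```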
Fig 11. Uncorrected vs. corrected GCIF under the null.
Genomic control inflation factors of interaction tests between and each of ’s where is the outcome; and the ’s are independent; (a) Both and ’s are Bernoulli distributed; linear model for ; 500 replicates (b) Both and are binomial; linear model for ; 500 replicates; (c) Both and are normal; linear model for ; 500 replicates (d) Both and ’s are Bernoulli; LMM for ; 200 replicates
Switch role of
As described above, for the case of detecting epistasis between a pair of genetic variants, there could be two possible ways to apply TINGA. We have proposed the strategy of conditioning on the less polymorphic of the two variants (i.e., the one with the smaller minor allele frequency), because we expect that it should result in more information available for the statistical test, leading to a more powerful test. We design a simulation to test this idea. In each of 1,000 replicates, we simulate one variant with MAF .07 and another with MAF .25, independently, and we simulate according to simulation setting 5 in S1 Text. We then test interaction between the two variants using TINGA with (1) being the variant with smaller MAF and being the variant with larger MAF and (2) the reverse ( being the variant with larger MAF and being the variant with smaller MAF). Fig 12 is a scatterplot of the resulting p-values on the −log10 scale. This verifies our intuition that it is a more powerful strategy to condition on the variant with the smaller MAF, so we employ this strategy in the data analysis.
Fig 12. Scatter plot of −log10-scaled p-values for the two possible TINGA analyses of interaction between a pair of genetic variants.
The x-axis is the −log10 p-value for the case when is taken to be the variant with the larger MAF, and the y-axis is the −log10 p-value for the same pair when is taken to be the variant with the smaller MAF. Both variants are Bernoulli distributed. The two MAFs used to generate the data are 0.07 and 0.25.
Analysis of flowering time in A. thaliana
Data Description
We apply our methods to a data set on flowering time in Arabidopsis thaliana that has been previously analyzed [46]. We use the number of days between germination and flowering at 10°C as the phenotype, and we include samples from 931 selected accessions from different regions. The SNPs were filtered based on minor allele frequency (MAF) ≥ 0.03 [47]. LD pruning was done to remove variants with pairwise LD of [47]. After filtering, there are 865,350 SNPs remaining. We use a LMM for the phenotype, where the GRM is computed based on all available SNPs with allele frequency ≥ 0.05.
Strategy for detecting epistasis
Step 1: Select 865 variants with smallest marginal p-values
Due to the large number of SNPs (865,350 after filtering) and the fact that we use a LMM for , it is computationally impractical to do a pairwise search over all possible pairs of SNPs for epistasis. Therefore, we start by identifying the 0.1% of SNPs with the smallest p-values from the ordinary GWAS based on the LMM for , which results in 865 SNPs selected. For each of these 865 SNPs, we test for interaction with each of the 865,350 other SNPs in the genome (subject to constraints on informativeness and the constraint that the SNPs have , as described in the Methods section).
Step 2: Perform fast, approximate Wald tests in an LMM for testing interaction between each of the 865 selected SNPs and each of the 865,350 other SNPs in the genome
Even with the number of tests reduced by a factor of more than 500, we still need a fast computation strategy because we are performing interaction tests based on an LMM. We take a two-stage approach, where we first apply a fast, approximate Wald test. Then we only perform more time-consuming and accurate calculations for p-values that are small based on the fast, approximate Wald test, and we content ourselves with the coarser approximation for the p-values that are large. The key idea of the fast approximate Wald test is to regress out all variables aside from the interaction term step by step using matrix operations, so that we can avoid looping over the SNPs. We have adapted this method to LMM. (See S1 Text for details.)
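The idea of regressing out the non-interaction terms with matrix operations, rather than looping over SNPs, can be illustrated in the simpler ordinary linear model setting. The sketch below is not the LMM procedure of S1 Text: it assumes the model y ~ 1 + e + g + e·g with no further covariates and no variance components, and the function name `fast_interaction_t` is hypothetical. It applies the Frisch–Waugh–Lovell theorem in stages so that the interaction t-statistics for all SNPs are obtained from column-wise array operations.

```python
import numpy as np

def fast_interaction_t(y, e, G):
    """Interaction t-statistics for every column of G, without a Python loop.

    y: (n,) trait; e: (n,) fixed variable; G: (n, m) genotype matrix.
    Linear-model sketch only (no GRM, no extra covariates).
    """
    n, m = G.shape
    # Stage 1: project the intercept and e out of y, each SNP, and each product.
    Q, _ = np.linalg.qr(np.column_stack([np.ones(n), e]))

    def resid(A):
        return A - Q @ (Q.T @ A)

    y0, G0, H0 = resid(y), resid(G), resid(G * e[:, None])
    # Stage 2: project each SNP's own main effect out, column by column.
    gg = np.einsum("ij,ij->j", G0, G0)
    y1 = y0[:, None] - G0 * ((G0.T @ y0) / gg)
    H1 = H0 - G0 * (np.einsum("ij,ij->j", G0, H0) / gg)
    # Stage 3: simple regression of residualized y on residualized interaction.
    hh = np.einsum("ij,ij->j", H1, H1)
    beta = np.einsum("ij,ij->j", H1, y1) / hh
    rss = np.einsum("ij,ij->j", y1, y1) - beta**2 * hh
    se = np.sqrt(rss / (n - 4) / hh)  # 4 fitted coefficients per SNP
    return beta / se

rng = np.random.default_rng(0)
n, m = 500, 50
e = rng.binomial(1, 0.4, size=n).astype(float)
G = rng.binomial(1, 0.3, size=(n, m)).astype(float)
y = 0.5 * e + rng.standard_normal(n)
t_fast = fast_interaction_t(y, e, G)
print(t_fast[:5])
```

In this linear-model case the result is exact rather than approximate, so it can be checked against a per-SNP least-squares fit; the trade-off is that the intermediate arrays are of size n × m, which may require processing SNPs in blocks for genome-scale data.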
Step 3: Perform more accurate p-value calculation only for those pairs with fast approximate Wald p-value < 10⁻⁴
Both the p-value for interaction in a LMM and the TINGA method are applied only to those pairs with fast, approximate Wald p-value < 10⁻⁴. Furthermore, for some pairs, interaction was not tested at all because informativeness constraints were not met (we required MCC ≥ 5) or our constraint on correlation was not met (we required ). After these filtering steps (based on MCC, r² and fast approximate Wald p-value), there are 71,863 pairs of SNPs remaining, with 762 of the originally chosen SNPs having at least one pair, and these 71,863 are the pairs for which we calculate the interaction t-statistic and TINGA statistic.
Step 4: Significance under Bonferroni correction
When applying the Bonferroni correction, we arguably only need to correct for the number of pairs that have at least one of the two SNPs in the selected set of 865 associated variants and that satisfy MCC ≥ 5 and . However, if we are being very conservative, we could consider that we are potentially searching over all distinct pairs with MCC ≥ 5 and , of which there are 2.7 × 10¹¹. Taking into account that two tests were performed, the Bonferroni correction level could be very conservatively taken to be .
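The arithmetic behind this very conservative threshold can be sketched as follows. The family-wise level `alpha = 0.05` is an assumption made for illustration, as is the function name; the number of pairs (2.7 × 10¹¹) and the factor of 2 for the two tests are taken from the paragraph above, and the p-values are the TINGA column of Table 4.

```python
def bonferroni_threshold(alpha, n_pairs, n_methods):
    """Per-test significance level after Bonferroni correction."""
    return alpha / (n_pairs * n_methods)

# alpha = 0.05 is an assumed family-wise level, used here only as an example.
threshold = bonferroni_threshold(alpha=0.05, n_pairs=2.7e11, n_methods=2)
print(f"{threshold:.2e}")  # 9.26e-14

# TINGA p-values reported in Table 4.
tinga_p = [1.3e-14, 5.6e-15, 3.2e-16, 4.6e-15, 8.7e-15]
print(all(p < threshold for p in tinga_p))  # True
```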
Findings
Table 4 contains information on the pairs that are significant after Bonferroni correction. Table 5 lists the corresponding genes. Among the identified SNPs, Chr5:18607017 has also been detected in studies of association with plant dry weight [13], average growth rate [13], and flowering time [46,48]; Chr5:20430580 has also been detected in a study of association with leaf margin serration [49]. The other SNPs have not been directly reported in other studies, but the genes in which they are located, such as AT5G10140 [46,49,50], have been found to be related to flowering time in many other studies.
Table 4. Significant pairs.
| SNP 1 | MAF | Mar p | SNP 2 | MAF | Mar p | MCC | Wald p | TINGA p |
|---|---|---|---|---|---|---|---|---|
| Chr5:3176549 | 0.41 | 2.0e-4 | Chr1:21470240 | 0.061 | 0.070 | 26 | 3.6e-6 | 1.3e-14 |
| Chr5:3198884 | 0.44 | 2.3e-4 | Chr5:1921009 | 0.064 | 0.072 | 29 | 1.6e-7 | 5.6e-15 |
| Chr5:18607017 | 0.27 | 1.8e-5 | Chr4:4835999 | 0.052 | 0.47 | 20 | 8.7e-7 | 3.2e-16 |
| Chr5:20430580 | 0.31 | 2.4e-4 | Chr5:25047282 | 0.084 | 0.14 | 33 | 1.0e-7 | 4.6e-15 |
| Chr5:12406770 | 0.22 | 0.0076 | Chr5:25333255 | 0.083 | 1.5e-4 | 22 | 9.1e-8 | 8.7e-15 |
“Mar p”: Marginal p-value of corresponding SNP in the gene-phenotype association test.
Table 5. Significant pairs.
The genes that the SNPs are in (black) or near (red).
| SNP | Gene | SNP | Gene |
|---|---|---|---|
| Chr5:3176549 | AT5G10140 | Chr1:21470240 | AT1G58030 |
| Chr5:3198884 | AT5G10190 | Chr5:1921009 | AT5G06290 |
| Chr5:18607017 | AT5G45870 | Chr4:4835999 | AT4G08025 |
| Chr5:20430580 | AT5G50180 | Chr5:25047282 | AT5G62370 |
| Chr5:12406770 | AT5G05055 | Chr5:25333255 | AT5G63160 |
Example QQ-Plot for a given choice of
Of course, in real data we do not know the truth. However, it can be interesting to consider how the QQ-plot is affected by the TINGA correction for a given SNP that does not appear to show evidence of interaction. We consider SNP Chr5_18593622 (MAF 0.28), which has a relatively small p-value for SNP-trait association but shows little evidence of interaction. For this particular SNP, in addition to performing the 2-stage process described above, we calculate both its Wald t-test and TINGA interaction p-values in an LMM for each of the 696,396 SNPs in the genome with which it satisfies the r² constraint and has MCC ≥ 20 (skipping the step of filtering by the fast, approximate Wald test). Fig 13 displays the (differenced) QQ-plots of the p-values from these methods, with simultaneous 95% acceptance regions for i.i.d. uniform p-values outlined in red, computed using the method of [42]. (In a differenced QQ-plot, the y-axis depicts the difference between observed and expected p-values, which is particularly helpful for visualization when the plot contains a large number of points.) We can see that for this particular SNP, the distribution of p-values is much closer to uniform after TINGA adjustment.
Fig 13. Differenced QQ-plots of p-values for interaction of SNP Chr5_18593622 with 696,396 genomewide SNPs, using (A) the t-test for interaction in an LMM and (B) TINGA. The expected quantile is plotted on the x-axis, and the difference between the observed and expected quantiles is plotted on the y-axis. The red lines give the boundaries of the 95% simultaneous acceptance region for i.i.d. uniform p-values.
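The coordinates of a differenced QQ-plot can be computed with a few lines of code. The sketch below is illustrative only: the function name `differenced_qq` is hypothetical, the plotting positions i/(n + 1) for the expected uniform quantiles are an assumption, and the simultaneous acceptance region of [42] is not reproduced.

```python
def differenced_qq(pvalues):
    """Return (expected, difference) coordinates for a differenced QQ-plot.

    expected   : expected quantiles of n i.i.d. uniform p-values,
                 taken here as i / (n + 1) for i = 1..n (an assumption)
    difference : observed sorted p-value minus expected quantile
    """
    observed = sorted(pvalues)
    n = len(observed)
    expected = [(i + 1) / (n + 1) for i in range(n)]
    difference = [obs - exp for obs, exp in zip(observed, expected)]
    return expected, difference
```

For p-values that are close to uniform, the differences stay near zero across the whole x-axis; a run of negative differences indicates observed p-values smaller than expected, and a run of positive differences indicates p-values larger than expected.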
Discussion
Identifying interaction, either GxG or GxE, can give insight into both genetic effects on a complex trait and underlying biological mechanisms, and it can also help to clarify the role of environment in the case of GxE testing. For testing interaction in a genomewide context, we have identified and described the “feast or famine” effect, in which different GWISs have fundamentally different null distributions. For example, if we consider GWISs in which the assumed null model holds and there is no interaction under the null hypothesis, then on average over different GWISs standard methods have correct type 1 error overall, but false positives are overly concentrated in certain GWISs (“feast” GWISs) and false negatives are overly concentrated in certain other GWISs (“famine” GWISs). If the environmental variable does interact with some predictors (either genetic variants or non-observed covariates), then the type 1 error disparity for non-interacting variants is even more extreme. In the example of Hemani et al. 2014 and 2021 [38,45], in which the variable is associated with a causal SNP and there is model misspecification in the standard null model, the feast effect can be quite extreme, and the overall type 1 error very inflated. We show that whether a given GWIS will be a “feast” or “famine” GWIS is a reproducible property that can be predicted as a function of the observed trait and environmental values. We show that the “feast or famine” effect applies for different types of variables, including normal, binomial or binary.
We show that the “feast or famine” effect occurs across a wide range of GxE analysis methods, including but not limited to (1) testing interaction in a linear or linear mixed model (LMM) using standard approaches such as t-tests/Wald tests, likelihood ratio tests, or score tests; (2) doing a combined interaction-association test in a linear model or LMM using standard approaches; (3) testing interaction with multiple environments or multiple SNPs, where these are modeled as random effects in an LMM using standard approaches; (4) performing tests of interaction in a GWIS where significance is assessed using permutation of the trait residuals. We show that the “feast or famine” effect affects only interaction GWAS, not ordinary association GWAS. The “feast or famine” effect can lead to excess type 1 error, reduced power, inconsistent results across studies, and failure to replicate true signal. Furthermore, we show that whether a given GWAS will be a “feast” or “famine” GWAS is a reproducible property, and that it can be corrected for.
We develop the TINGA method, which corrects the test statistic for interaction by choosing different conditioning variables that are more appropriate for a GWAS than the standard choice. TINGA also allows for covariates and population structure through an LMM, and it accounts for heteroscedasticity. In simulations we show that TINGA can greatly reduce or eliminate the “feast or famine” effect while preserving the overall type 1 error, which we show can result in higher power.
Furthermore, we have developed a diagnostic ratio, based only on the observed trait and environmental values, which summarizes the degree of “feast” or “famine” that would be expected for that interaction GWAS. When the diagnostic indicates only a weak effect, the researcher could more reasonably proceed with a standard analysis, whereas when the diagnostic indicates a strong effect, it will be clear that use of a more sophisticated statistical method is critical both to improve power and to guard against excess false positive results. The corresponding diagnostic result could also be reported alongside any detected interaction signal, as a way of addressing concerns about statistical validity. In the context of epistasis GWAS, having a diagnostic can in addition be important for computational reasons, with sophisticated statistical methods such as TINGA reserved for choices of variants for which a strong “feast or famine” effect is predicted.
We apply TINGA to a GWAS for flowering time in A. thaliana. Using TINGA we detect 5 significant interactions after Bonferroni correction, where all the detected interactions involve loci identified in previous studies as associated with flowering time. This demonstrates the potential of the TINGA method for detecting interaction in a GWAS.
For epistasis detection in a GWAS, there is a computational challenge in testing epistasis for all possible pairs of variants. When the model for the trait is an LMM, as in our data analysis, this computational challenge is made much greater, even for the usual LMM-based t-test for interaction without any correction. We have developed a fast approximate version of the LMM-based t-test for interaction, and we use it as part of an adaptive approach to genomewide testing, where more accurate but time-consuming methods are applied only if the approximate p-value is sufficiently small. In other words, our strategy is to spend more computational time on small p-values and to be content with coarse approximations to large p-values. In future work, there could be further scope for making faster algorithms for all aspects of interaction testing with an LMM in a GWAS context.
Acknowledgments
This study was supported by NIH grant R01 HG001645 (to M.S.M.). We thank Peter Laurin for suggesting the data set and for help with preliminary data analysis.
References
- 1. Glass D, Viñuela A, Davies MN, et al. Gene expression changes with age in skin, adipose tissue, blood and brain. Genome Biol. 2013;14(7):R75. doi: 10.1186/gb-2013-14-7-r75.
- 2. Myers R, Scott N, Gauderman W, et al. Genome-wide interaction studies reveal sex-specific asthma risk alleles. Hum Mol Genet. 2014;23(19):5251–5259. doi: 10.1093/hmg/ddu222.
- 3. Mitra I, Tsang K, Ladd-Acosta C, Croen L, Aldinger K, Hendren R, et al. Pleiotropic Mechanisms Indicated for Sex Differences in Autism. PLoS Genet. 2016;12(11):e1006425. doi: 10.1371/journal.pgen.1006425.
- 4. Small K, Todorčević M, Civelek M, et al. Regulatory variants at KLF14 influence type 2 diabetes risk via a female-specific effect on adipocyte size and body composition. Nat Genet. 2018;50(4):572–580. doi: 10.1038/s41588-018-0088-x.
- 5. Leite J, Soler J, Horimoto A, Alvim R, Pereira A. Heritability and Sex-Specific Genetic Effects of Self-Reported Physical Activity in a Brazilian Highly Admixed Population. Hum Hered. 2019;84(3):151–158. doi: 10.1159/000506007.
- 6. Laville V, Majarian T, Sung Y, et al. Gene-lifestyle interactions in the genomics of human complex traits. Eur J Hum Genet. 2022;30(6):730–739. doi: 10.1038/s41431-022-01045-6.
- 7. Carbone M, Arron ST, Beutler B, et al. Tumour predisposition and cancer syndromes as models to study gene–environment interactions. Nat Rev Cancer. 2020;20(9):533–549. doi: 10.1038/s41568-020-0265-y.
- 8. Evans D, Spencer C, Pointon J, et al. Interaction between ERAP1 and HLA-B27 in ankylosing spondylitis implicates peptide handling in the mechanism for HLA-B27 in disease susceptibility. Nat Genet. 2011;43(8):761–767. doi: 10.1038/ng.873.
- 9. Moutsianas L, Jostins L, Beecham A, et al. Class II HLA interactions modulate genetic risk for multiple sclerosis. Nat Genet. 2015;47(10):1107–1113. doi: 10.1038/ng.3395.
- 10. Wang M, Roux F, Bartoli C, Huard-Chauveau C, Meyer C, Lee H, et al. Two-way mixed-effects methods for joint association analysis using both host and pathogen genomes. Proc Natl Acad Sci U S A. 2018;115(24):E5440–E5449. doi: 10.1073/pnas.1710980115.
- 11. Clark M, Chazara O, Sobel E, et al. Human Birth Weight and Reproductive Immunology: Testing for Interactions between Maternal and Offspring KIR and HLA-C Genes. Hum Hered. 2016;81(4):181–193. doi: 10.1159/000456033.
- 12. Evans L, Arehart C, Grotzinger A, Mize T, Brasher M, Stitzel J, et al. Transcriptome-wide gene-gene interaction associations elucidate pathways and functional enrichment of complex traits. PLoS Genet. 2023;19(5):e1010693. doi: 10.1371/journal.pgen.1010693.
- 13. Vasseur F, Exposito-Alonso M, Ayala-Garay OJ, Wang G, Enquist BJ, Vile D, et al. Adaptive diversification of growth allometry in the plant Arabidopsis thaliana. Proc Natl Acad Sci U S A. 2018;115(13):3416–3421. doi: 10.1073/pnas.1709141115.
- 14. Visscher P, Brown M, McCarthy M, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90(1):7–24. doi: 10.1016/j.ajhg.2011.11.029.
- 15. Eichler E, Flint J, Gibson G, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11(6):446–450. doi: 10.1038/nrg2809.
- 16. Robinson M, English G, Moser G, et al. Genotype-covariate interaction effects and the heritability of adult body mass index. Nat Genet. 2017;49(8):1174–1181. doi: 10.1038/ng.3912.
- 17. Mackay T. Epistasis and quantitative traits: using model organisms to study gene-gene interactions. Nat Rev Genet. 2014;15(1):22–33. doi: 10.1038/nrg3627.
- 18. Roth C, Murray D, Scott A, Fu C, Averette A, Sun S, et al. Pleiotropy and epistasis within and between signaling pathways defines the genetic architecture of fungal virulence. PLoS Genet. 2021;17(1):e1009313. doi: 10.1371/journal.pgen.1009313.
- 19. Ritz B, Chatterjee N, Garcia-Closas M, et al. Lessons Learned From Past Gene-Environment Interaction Successes. Am J Epidemiol. 2017;186(7):778–786. doi: 10.1093/aje/kwx230.
- 20. Lopez-Cruz M, Aguate F, Washburn J, et al. Leveraging data from the Genomes-to-Fields Initiative to investigate genotype-by-environment interactions in maize in North America. Nat Commun. 2023;14(1):6904. doi: 10.1038/s41467-023-42687-4.
- 21. Alipour H, Abdi H, Rahimi Y, Bihamta M. Dissection of the genetic basis of genotype-by-environment interactions for grain yield and main agronomic traits in Iranian bread wheat landraces and cultivars. Sci Rep. 2021;11(1):17742. doi: 10.1038/s41598-021-96576-1.
- 22. Crawford L, Zeng P, Mukherjee S, Zhou X. Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLoS Genet. 2017;13(7):e1006869. doi: 10.1371/journal.pgen.1006869.
- 23. Dahl A, Nguyen K, Cai N, Gandal M, Flint J, Zaitlen N. A Robust Method Uncovers Significant Context-Specific Heritability in Diverse Complex Traits. Am J Hum Genet. 2020;106(1):71–91. doi: 10.1016/j.ajhg.2019.11.015.
- 24. Tang D, Freudenberg J, Dahl A. Factorizing polygenic epistasis improves prediction and uncovers biological pathways in complex traits. Am J Hum Genet. 2023;110(11):1875–1887. doi: 10.1016/j.ajhg.2023.10.002.
- 25. Greene CS, Penrod NM, Kiralis J, et al. Spatially Uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions. BioData Min. 2009;2(1):5. doi: 10.1186/1756-0381-2-5.
- 26. Emily M, Mailund T, Hein J, Schauser L, Schierup M. Using biological networks to search for interacting loci in genome-wide association studies. Eur J Hum Genet. 2009;17(10):1231–1240. doi: 10.1038/ejhg.2009.15.
- 27. Lippert C, Listgarten J, Davidson R, et al. An exhaustive epistatic SNP association analysis on expanded Wellcome Trust data. Sci Rep. 2013;3:1099. doi: 10.1038/srep01099.
- 28. Moore R, Casale FP, Jan Bonder M, et al. A linear mixed-model approach to study multivariate gene–environment interactions. Nat Genet. 2019;51(1):180–186. doi: 10.1038/s41588-018-0271-0.
- 29. Sheppard B, Rappoport N, Loh PR, Sanders SJ, Zaitlen N, Dahl A. A model and test for coordinated polygenic epistasis in complex traits. Proc Natl Acad Sci U S A. 2021;118(15):e1922305118. doi: 10.1073/pnas.1922305118.
- 30. Dudbridge F, Koeleman B. Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies. Am J Hum Genet. 2004;75(3):424–435. doi: 10.1086/423738.
- 31. Li J, Ji L. Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity (Edinb). 2005;95(3):221–227. doi: 10.1038/sj.hdy.6800717.
- 32. Evans D, Marchini J, Morris A, Cardon L. Two-Stage Two-Locus Models in Genome-Wide Association. PLoS Genet. 2006;2(9):e157. doi: 10.1371/journal.pgen.0020157.
- 33. Marderstein AR, Davenport ER, Kulm S, Van Hout CV, Elemento O, Clark AG. Leveraging phenotypic variability to identify genetic interactions in human phenotypes. Am J Hum Genet. 2021;108(1):49–67. doi: 10.1016/j.ajhg.2020.11.016.
- 34. Wei W, Hemani G, Haley C. Detecting epistasis in human complex traits. Nat Rev Genet. 2014;15(11):722–733. doi: 10.1038/nrg3747.
- 35. Ahmad S, Varga T, Franks P. Gene × environment interactions in obesity: the state of the evidence. Hum Hered. 2013;75(2–4):106–115. doi: 10.1159/000351070.
- 36. McAllister K, Mechanic L, Amos C, et al. Current Challenges and New Opportunities for Gene-Environment Interaction Studies of Complex Diseases. Am J Epidemiol. 2017;186(7):753–761. doi: 10.1093/aje/kwx227.
- 37. Wood A, Tuke M, Nalls M, et al. Another explanation for apparent epistasis. Nature. 2014;514(7520):E3–E5. doi: 10.1038/nature13691.
- 38. Hemani G, Powell J, Wang H, et al. Phantom epistasis between unlinked loci. Nature. 2021;596(7871):E1–E3. doi: 10.1038/s41586-021-03765-z.
- 39. Voorman A, Lumley T, McKnight B, Rice K. Behavior of QQ-Plots and Genomic Control in Studies of Gene-Environment Interaction. PLoS ONE. 2011;6(5):e19416. doi: 10.1371/journal.pone.0019416.
- 40. Rao T, Province M. A Framework for Interpreting Type I Error Rates from a Product-Term Model of Interaction Applied to Quantitative Traits. Genet Epidemiol. 2016;40(2):144–153. doi: 10.1002/gepi.21944.
- 41. Zhang T, Sun L. Beyond the traditional simulation design for evaluating type 1 error control: From the “theoretical” null to “empirical” null. Genet Epidemiol. 2019;43(2):166–179. doi: 10.1002/gepi.22172.
- 42. Weine E, McPeek MS, Abney M. Application of equal local levels to improve Q-Q plot testing bands with R package qqconf. J Stat Softw. 2023;106(10):1–31. doi: 10.18637/jss.v106.i10.
- 43. Pare G, Cook NR, Ridker PM, Chasman DI. On the Use of Variance per Genotype as a Tool to Identify Quantitative Trait Interaction Effects: A Report from the Women’s Genome Health Study. PLoS Genet. 2010;6(6):e1000981. doi: 10.1371/journal.pgen.1000981.
- 44. Long T, Ervin L. Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model. Am Stat. 2000;54(3):217–224.
- 45. Hemani G, et al. Detection and replication of epistasis influencing transcription in humans. Nature. 2014;508:249–253. [Retracted]
- 46. The 1001 Genomes Consortium. 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana. Cell. 2016. doi: 10.1016/j.cell.2016.05.063.
- 47. Zan Y, Carlborg O. A Polygenic Genetic Architecture of Flowering Time in the Worldwide Arabidopsis thaliana Population. Mol Biol Evol. 2019;36(1):141–154. doi: 10.1093/molbev/msy203.
- 48. Grimm D, Roqueiro D, Salomé P, et al. easyGWAS: A Cloud-Based Platform for Comparing the Results of Genome-Wide Association Studies. Plant Cell. 2017;29(1):5–19. doi: 10.1105/tpc.16.00551.
- 49. Atwell S, Huang Y, Vilhjálmsson B, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature. 2010;465(7298):627–631. doi: 10.1038/nature08800.
- 50. Li Y, Huang Y, Bergelson J, Nordborg M, Borevitz J. Association mapping of local climate-sensitive quantitative trait loci in Arabidopsis thaliana. Proc Natl Acad Sci U S A. 2010;107(49):21199–21204. doi: 10.1073/pnas.1007431107.