Author manuscript; available in PMC: 2015 Jul 18.
Published in final edited form as: Hum Hered. 2014 Jul 18;78(2):81–90. doi: 10.1159/000363347

Incorporating gene-environment interaction in testing for association with rare genetic variants

Han Chen a,b, James B Meigs c,d, Josée Dupuis a,e
PMCID: PMC4169076  NIHMSID: NIHMS596001  PMID: 25060534

Abstract

Objectives

The incorporation of gene-environment interactions could improve the ability to detect genetic associations with complex traits. For common genetic variants, single marker interaction test and joint test of genetic main effects and gene-environment interaction have been well established and used to identify novel association loci for complex diseases and continuous traits. For rare genetic variants, however, single marker tests are severely underpowered due to the low minor allele frequency, and only a few gene-environment interaction tests have been developed. We aim at developing powerful and computationally efficient tests for gene-environment interaction with rare variants.

Methods

In this paper, we propose interaction and joint tests for testing gene-environment interaction of rare genetic variants. Our approach is a generalization of existing gene-environment interaction tests for multiple genetic variants under certain conditions.

Results

We show in our simulation studies that our interaction and joint tests have correct type I errors, and that the joint test is a powerful approach for testing genetic association, allowing for gene-environment interaction. We also illustrate our approach in a real data example from the Framingham Heart Study.

Conclusion

Our approach can be applied to both binary and continuous traits, and is powerful and computationally efficient.

Keywords: rare variant analysis, gene-environment interaction, sequence kernel association test, joint test, generalized linear mixed model

Introduction

Traditional genome-wide association studies (GWAS) have been successfully applied to identify a large number of genetic markers associated with complex diseases and related continuous traits. However, for most complex diseases and continuous traits, all genetic markers identified so far only explain a small proportion of the heritability in these traits, suggesting that a lot of genetic determinants are still undiscovered. Eichler et al. [1] suggested that gene-environment interaction and rare genetic variants may both account for some of the unexplained heritability.

Statistical methods to detect gene-environment interaction have been well established in the context of single marker tests [2–4]. To determine if a common genetic variant interacts with an environmental variable, we can either test the interaction effect only, or jointly test both the genetic main effect and the gene-environment interaction effect. With the first approach, we are usually interested in detecting gene-environment interaction, regardless of the presence of a significant genetic main effect. With the second approach, we test whether the genetic marker is associated with the trait of interest, allowing for gene-environment interaction. These methods, combined with multivariate meta-analysis, have led to the discovery of novel common loci associated with fasting insulin level by incorporating gene by body mass index (BMI) interactions [5]. However, the power of single marker tests greatly depends on the minor allele frequency (MAF) of the genetic marker tested. As a result, single marker tests have little power in testing gene-environment interaction involving rare genetic variants.

On the other hand, rare genetic variant analysis has become a popular research field in genetic association studies, and many statistical methods for rare variant analysis have been proposed [6–12]. Of these methods, the Sequence Kernel Association Test (SKAT) [12] has been shown to be a powerful approach in various scenarios. However, all of these methods focus on the main effect association analysis of rare genetic variants.

Compared with common variant analysis, rare variant analysis often requires a larger sample size to attain comparable power. Compared with main effect analysis, interaction analysis also needs a larger sample size. Thus, little attention has been paid to interaction analysis for rare genetic variants, possibly due to the limited sample size in many cohort studies. Tzeng et al. [13] proposed a gene-trait similarity regression approach (SIMreg) to test gene-environment interaction of rare variants. This approach is flexible in its hypothesis testing: the main effect test, the test of interaction and the joint test of main effect and interaction are all possible. It is available for continuous and binary traits, although the original reference only includes development for continuous traits. However, in the joint test, it implicitly assumes that the variance component parameters for genetic main effects and gene-environment interactions are of the same magnitude under the alternative hypothesis, which may not be true. In a first attempt to specifically consider gene-environment interactions for rare variants, Kazma et al. [14] proposed three joint tests: 1) minimum p-value; 2) CAST [6]; and 3) SKAT [12]. However, they considered only the scenario where the environmental variable is binary when computing SKAT weights, which limits the applicability of their SKAT approach. They used a single kernel for both genetic main effects and gene-environment interactions, and pointed out that it would be valuable to model genetic main effects and gene-environment interactions separately. Recently, Lin et al. [15] proposed the gene-environment set association test (GESAT) as an interaction test in a generalized linear model framework. Although GESAT was originally developed for common variants, it can be easily applied to rare variants by incorporating weights, which we name GESAT with weights (GESAT-W). This test is applicable to both binary and continuous traits, but the joint test was not provided.

In this paper, we propose a general approach for testing gene-environment interaction of rare variants, including two interaction tests and a joint test. Our approach is flexible and works for both binary and continuous traits. Genetic main effects and gene-environment interactions are modeled separately and no assumption is required on the magnitude of variance component parameters for genetic main effects and gene-environment interactions. We evaluate the performance of our methods in simulation studies and we also illustrate our approaches in testing gene by BMI interaction on fasting glucose and fasting insulin levels, adjusting for age, sex, cohort and 10 principal components (PC) [16], using an unrelated subset of individuals from the Framingham Heart Study.

Methods

Interaction test: fixed genetic main effects

Our notation is as follows: assuming a sample size of $n$, let $Y_i$ be the phenotype of individual $i$ ($1 \le i \le n$), which follows an exponential family distribution with $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \phi\nu(\mu_i)$. Let $X_i = (1, X_{i1}, X_{i2}, \ldots, X_{i(p-1)})$ be a row vector of length $p$, consisting of an intercept and $(p-1)$ covariates; let $E_i$ be one of these $(p-1)$ covariates, centered to have mean 0; and let $G_i = (G_{i1}, G_{i2}, \ldots, G_{iq})$ be a row vector of $q$ genetic variants.

Assuming independent observations, we can write the generalized linear mixed model for testing the gene by $E_i$ interaction as

$$g(\mu_i) = X_i\beta + G_iW_1\gamma_1 + E_iG_iW_2\gamma_2,$$

where $g(\cdot)$ is the link function, and $W_1$ and $W_2$ are $q \times q$ diagonal matrices whose elements are the weights for genetic main effects and gene-environment interaction effects, respectively. Here $\beta$ is a vector of fixed-effect parameters for the intercept and $(p-1)$ covariates, $\gamma_1$ is a vector of fixed-effect parameters for genetic main effects, and $\gamma_2$ is a vector of random effects for gene-environment interaction, assumed to have mean 0 and covariance matrix $\tau_2 I_q$. The interaction test can be constructed as $H_0: \tau_2 = 0$ versus $H_1: \tau_2 > 0$. Let $Y = (Y_1, Y_2, \ldots, Y_n)^T$ be the phenotype vector, $\mu = (\mu_1, \mu_2, \ldots, \mu_n)^T$ be the mean vector, $X$ be an $n \times p$ covariate matrix with elements $X_{ij}$ where $1 \le i \le n$, $0 \le j \le p-1$ and $X_{i0} = 1$, $G$ be an $n \times q$ genotype matrix with elements $G_{ij}$ where $1 \le i \le n$ and $1 \le j \le q$, and $E = \mathrm{diag}\{E_i\}$ be an $n \times n$ matrix. We define the working vector under the null hypothesis as $y = X\beta + GW_1\gamma_1 + \Delta(Y - \mu)$, where $\Delta = \mathrm{diag}\{g'(\mu_i)\}$ [17]. Letting $\varepsilon = \Delta(Y - \mu)$, we can write the model as

$$y = X\beta + GW_1\gamma_1 + \varepsilon,$$

where $\mathrm{Var}(\varepsilon) = V = \mathrm{diag}\{\phi\nu(\mu_i)[g'(\mu_i)]^2\}$. Let $\beta_F$, $\gamma_{1F}$, $\phi_F$ be the maximum likelihood (or restricted maximum likelihood) estimates under the null hypothesis $g(\mu_i) = X_i\beta + G_iW_1\gamma_1$; then we can calculate the estimates $\mu_{iF} = g^{-1}(X_i\beta_F + G_iW_1\gamma_{1F})$, $\Delta_F = \mathrm{diag}\{g'(\mu_{iF})\}$, $V_F = \mathrm{diag}\{\phi_F\nu(\mu_{iF})[g'(\mu_{iF})]^2\}$, and the test statistic for $H_0: \tau_2 = 0$ is

$$Q_F = (Y - \mu_F)^T \Delta_F V_F^{-1} E G W_2 W_2 G^T E V_F^{-1} \Delta_F (Y - \mu_F).$$

Let $Z_i = (X_i, G_iW_1)$ be a row vector of length $(p + q)$, and $Z$ be an $n \times (p + q)$ matrix with elements $Z_{ij}$ where $1 \le i \le n$ and $1 \le j \le p + q$. Under the null hypothesis, $Q_F \sim \sum_{j=1}^{q} \lambda_j \chi^2_{1,j}$, where the $\lambda_j$'s are the eigenvalues of the matrix [12, 18]

$$\Psi_F = W_2 G^T E \left(V_F^{-1} - V_F^{-1} Z (Z^T V_F^{-1} Z)^{-1} Z^T V_F^{-1}\right) E G W_2.$$
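To make the statistic concrete, the following sketch specializes it to two common models. It is an illustration derived from the definitions above rather than part of the original derivation, and it assumes the identity link with ν(µ) = 1 for a Gaussian trait and the logit link with ϕ = 1 for a binary trait.

```latex
% Gaussian trait, identity link: g'(\mu) = 1, \nu(\mu) = 1
\Delta_F = I_n, \qquad V_F = \phi_F I_n, \qquad
Q_F = \phi_F^{-2}\,(Y-\mu_F)^T E G W_2 W_2 G^T E\,(Y-\mu_F).

% Binary trait, logit link: g'(\mu) = 1/[\mu(1-\mu)], \nu(\mu) = \mu(1-\mu), \phi = 1
\Delta_F = V_F = \mathrm{diag}\{1/[\mu_{iF}(1-\mu_{iF})]\}, \qquad
\Delta_F V_F^{-1} = I_n, \qquad
Q_F = (Y-\mu_F)^T E G W_2 W_2 G^T E\,(Y-\mu_F).
```

In both cases the statistic reduces to a weighted quadratic form in the null-model residuals, so only the null model needs to be fitted.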

This interaction test is straightforward. However, when the number of genetic variants q is large, we usually cannot obtain stable estimates of γ1, which makes the test impractical. Below we propose an alternative interaction test which treats the genetic main effects γ1 as random effects.

Interaction test: random genetic main effects

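We use the same notation as in the previous subsection, but now assume that $\gamma_1$ is a vector of random effects with mean 0 and covariance matrix $\tau_1 I_q$. Again we define the working vector under the null hypothesis as $y = X\beta + GW_1\gamma_1 + \Delta(Y - \mu)$, with $\mathrm{Var}(y) = \Sigma = \tau_1 GW_1W_1G^T + V$. We can obtain the maximum likelihood (or restricted maximum likelihood) estimates $\beta_R$, $\tau_{1R}$, $\phi_R$ from the null model, and compute the estimates $\mu_{iR} = g^{-1}(X_i\beta_R + G_iW_1\gamma_{1R})$, $\Delta_R = \mathrm{diag}\{g'(\mu_{iR})\}$, $y_R = X\beta_R + GW_1\gamma_{1R} + \Delta_R(Y - \mu_R)$, $V_R = \mathrm{diag}\{\phi_R\nu(\mu_{iR})[g'(\mu_{iR})]^2\}$, and $\Sigma_R = \tau_{1R}GW_1W_1G^T + V_R$, where $\gamma_{1R} = \tau_{1R}W_1G^T\Sigma_R^{-1}(y_R - X\beta_R)$. The test statistic for $H_0: \tau_2 = 0$ is

$$Q_R = (Y - \mu_R)^T \Delta_R V_R^{-1} E G W_2 W_2 G^T E V_R^{-1} \Delta_R (Y - \mu_R).$$

Under the null hypothesis, $Q_R \sim \sum_{j=1}^{q} \lambda_j \chi^2_{1,j}$, where the $\lambda_j$'s are the eigenvalues of the matrix

$$\Psi_R = W_2 G^T E \left(\Sigma_R^{-1} - \Sigma_R^{-1} X (X^T \Sigma_R^{-1} X)^{-1} X^T \Sigma_R^{-1}\right) E G W_2.$$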

Joint test

Kraft et al. [2] suggested simultaneously testing the genetic main effect and the interaction effect in a joint test. In the context of the generalized linear mixed model from the previous subsections, a joint test of main and interaction effects is equivalent to testing the following hypotheses: $H_0: \tau_1 = \tau_2 = 0$ versus $H_1: \tau_1 > 0$ or $\tau_2 > 0$. Under the null hypothesis, we define the working vector $y = X\beta + \Delta(Y - \mu)$ with $\mathrm{Var}(y) = V$. Let $\beta_J$, $\phi_J$ be the maximum likelihood (or restricted maximum likelihood) estimates from the null model $g(\mu_i) = X_i\beta$; then we can calculate the estimates $\mu_{iJ} = g^{-1}(X_i\beta_J)$, $\Delta_J = \mathrm{diag}\{g'(\mu_{iJ})\}$, $y_J = X\beta_J + \Delta_J(Y - \mu_J)$, $V_J = \mathrm{diag}\{\phi_J\nu(\mu_{iJ})[g'(\mu_{iJ})]^2\}$. If we reparametrize $\tau_1 = \rho\tau$, $\tau_2 = (1 - \rho)\tau$ ($0 \le \rho \le 1$), with $\tau = \tau_1 + \tau_2$ and $\rho = \tau_1/(\tau_1 + \tau_2)$, the hypotheses for the joint test may be rewritten as $H_0: \tau = 0$ versus $H_1: \tau > 0$, where $\rho$ is a nuisance parameter present only under the alternative hypothesis. For any fixed $\rho$, the test statistic for $H_0: \tau = 0$ is

$$Q_J(\rho) = (Y - \mu_J)^T \Delta_J V_J^{-1} \left[\rho\, G W_1 W_1 G^T + (1 - \rho)\, E G W_2 W_2 G^T E\right] V_J^{-1} \Delta_J (Y - \mu_J).$$

Let $P = V_J^{-1} - V_J^{-1} X (X^T V_J^{-1} X)^{-1} X^T V_J^{-1}$. Under the null hypothesis, $Q_J(\rho) \sim \sum_{j=1}^{n} \lambda_j \chi^2_{1,j}$, where the $\lambda_j$'s are the eigenvalues of the matrix $\Psi_J(\rho) = P^{1/2}\left[\rho\, G W_1 W_1 G^T + (1 - \rho)\, E G W_2 W_2 G^T E\right] P^{1/2}$. Note that if $n \ge q$, there are at most $q$ non-zero eigenvalues. When $\rho = 1$, this is the regular SKAT statistic for genetic main effects.
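The role of ρ can be read off by splitting the statistic into its two quadratic forms. The decomposition below is a sketch that follows directly from the definition of the joint statistic; the symbols r, Q_G and Q_GE are shorthand introduced here for illustration and do not appear in the original notation.

```latex
% Shorthand (introduced here): scaled null-model residual
r = V_J^{-1} \Delta_J (Y - \mu_J)

% Main-effect and interaction quadratic forms under the covariate-only null model
Q_{G}  = r^T G W_1 W_1 G^T r, \qquad
Q_{GE} = r^T E G W_2 W_2 G^T E\, r

% The joint statistic is a convex combination of the two
Q_J(\rho) = \rho\, Q_{G} + (1 - \rho)\, Q_{GE}, \qquad
Q_J(1) = Q_{G}, \qquad Q_J(0) = Q_{GE}
```

Note that the interaction form here is computed from a null model without genetic main effects, so it differs from the interaction-only statistics of the two previous subsections, which adjust for genetic main effects.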

Following Lee et al. [19], we use the minimum p-value as the test statistic:

$$Q_J = \inf_{0 \le \rho \le 1} p(\rho),$$

where $p(\rho)$ is the p-value of $Q_J(\rho)$. For a grid of $S$ values $0 \le \rho_1 < \rho_2 < \cdots < \rho_S \le 1$, the test statistic is $Q_J = \min\{p(\rho_1), p(\rho_2), \ldots, p(\rho_S)\}$. To compute the p-value of the observed $Q_J = q_J$, following Voorman et al. [20], we can write the p-value as an integral, which needs to be solved numerically. We use a Monte Carlo method in our implementation (see Appendix).

There are several ways of computing the p-value of a sum of chi-squares [21–24]. In this paper, we generally use Kuonen’s saddlepoint method [23] (in the interaction tests and in evaluating $F(x)$ in the joint test), except that in the joint test we use the modified moment matching method (matching the mean, variance and kurtosis) [19, 24] to compute $p(\rho_s)$ and the quantiles $T(\rho_s)$, because it takes much longer to find quantiles using Kuonen’s saddlepoint method.
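As a simple cross-check on any of these approximations, the tail probability of a weighted sum of chi-squares can also be estimated by direct simulation. The sketch below is not the saddlepoint or moment matching method used in this paper; it is a plain Monte Carlo substitute, and the function name and defaults are hypothetical.

```python
import random

def mixture_chisq_tail(q_obs, lambdas, n_draws=100_000, seed=1):
    """Estimate P(sum_j lambda_j * chi2_1 > q_obs) by direct simulation."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        # each chi-square(1) draw is the square of a standard normal
        total = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lambdas)
        if total > q_obs:
            exceed += 1
    return exceed / n_draws

# With a single unit weight the mixture is a chi-square with 1 df,
# whose upper 5% point is about 3.841.
p_single = mixture_chisq_tail(3.841, [1.0])
```

Simulation of this kind cannot resolve p-values near the α levels of 0.0001 or smaller considered here without a very large number of draws, which is one reason analytic approximations are preferred in practice.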

Simulation Studies

We performed simulation studies to evaluate the empirical type I error and power of main effect SKAT [12], the interaction test treating genetic main effects as fixed (INT-FIX), the interaction test treating genetic main effects as random (INT-RAN), the joint test of genetic main effects and gene-environment interactions (JOINT), SIMreg tests by Tzeng et al. [13] (http://www4.stat.ncsu.edu/~jytzeng/software_simreg.php), and GESAT-W by Lin et al. [15]. In a preliminary simulation study for continuous traits, it took up to 26 minutes to get the p-value from one SIMreg interaction test on 2000 individuals and 20 genetic variants. As it would not be feasible to run 1 million simulations to evaluate the type I error, we excluded SIMreg interaction test from the comparison. We also excluded GESAT-W for binary traits as the current package is only available for continuous traits. We used Wu weights [12], which are the beta density function with parameters 1 and 25, evaluated at the MAF of the genetic variant, for SKAT, INT-FIX, INT-RAN, GESAT-W and JOINT. In the joint test, we used the Monte Carlo method with B =10,000 to calculate the p-value. In SIMreg main effect test (SR) and joint test (SR-JOINT), we used inverse allele frequency as the weight to calculate the genotype similarity matrix, as recommended by Tzeng et al. [13].
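The Wu weights described above can be computed directly from the beta density. The following is a minimal sketch; the function names are ours, and the closed form 25(1 − MAF)^24 is simply the Beta(1, 25) density written out.

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def wu_weight(maf, a=1.0, b=25.0):
    """Weight for one variant: the Beta(1, 25) density evaluated at its MAF."""
    return beta_pdf(maf, a, b)

# Weights over the MAF range used in the simulations (0.005 to 0.05)
weights = {maf: wu_weight(maf) for maf in (0.005, 0.01, 0.05)}
```

Because the density decreases in the MAF, rarer variants receive larger weights.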

Type I Error

We simulated both continuous and binary phenotypes. For continuous traits, we simulated 5000 genotype datasets with a sample size of 2000. In each genotype dataset, we simulated 20 biallelic genetic variants with MAF randomly sampled from a uniform distribution on (0.005, 0.05), and we fixed the linkage disequilibrium (LD) correlation between adjacent markers at r = 0.5. For each genotype dataset, we simulated 200 replicates of covariates: sex ~ Bernoulli(0.5), age ~ N(50, 5²), bmi ~ N(25, 4²). We then generated the continuous phenotypes Y from

$$Y = 0.5\,\mathrm{sex} + 0.05\,\mathrm{age} + 0.1\,\mathrm{bmi} + \varepsilon,$$

where ε ~ N(0,1) is the random error. The genotypes are not associated with the phenotype. We evaluated the empirical type I errors for SKAT, SR, INT-FIX, INT-RAN, GESAT-W, JOINT and SR-JOINT. We tested gene by BMI interaction in INT-FIX, INT-RAN, GESAT-W, JOINT and SR-JOINT.

For binary traits, we simulated case-control studies with 1000 cases and 1000 controls for each replicate. We first simulated 5000 genotype datasets with sample size 20,000. For each genotype dataset, we simulated 200 replicates of covariates. The genetic variants and covariates were simulated following the same procedure as for continuous traits. For each individual, we calculated the probability of disease P(Y =1) from

$$\log\frac{P(Y=1)}{1 - P(Y=1)} = \beta_0 + 0.5\,\mathrm{sex} + 0.05\,(\mathrm{age} - 50) + 0.1\,(\mathrm{bmi} - 25),$$

where β0 was determined such that the baseline (individuals with sex = 0, age = 50, bmi = 25) prevalence of disease is 10%. Once we simulated the disease status for all 20,000 individuals in each replicate, we randomly sampled 1000 cases and 1000 controls from the cohort. Then we evaluated the empirical type I errors for SKAT, SR, INT-FIX, INT-RAN, JOINT and SR-JOINT. We tested gene by BMI interaction in INT-FIX, INT-RAN, JOINT and SR-JOINT.
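Because the covariates enter the model centered at the baseline values, the intercept is the logit of the baseline prevalence. The sketch below illustrates this and the case-control sampling step under the null; genotypes are omitted because they do not affect the phenotype here, and all function names and the seed are hypothetical.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Baseline individual: sex = 0, age = 50, bmi = 25, prevalence 10%
BETA0 = logit(0.10)

def simulate_case_control(n_cohort=20_000, n_cases=1_000, n_controls=1_000, seed=1):
    """Simulate a cohort under the null model, then sample cases and controls."""
    rng = random.Random(seed)
    cases, controls = [], []
    for _ in range(n_cohort):
        sex = 1 if rng.random() < 0.5 else 0
        age = rng.gauss(50.0, 5.0)   # standard deviation 5
        bmi = rng.gauss(25.0, 4.0)   # standard deviation 4
        eta = BETA0 + 0.5 * sex + 0.05 * (age - 50.0) + 0.1 * (bmi - 25.0)
        person = (sex, age, bmi)
        (cases if rng.random() < expit(eta) else controls).append(person)
    # random.sample raises ValueError if the cohort has too few cases or controls
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls)

cases, controls = simulate_case_control()
```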

Power

For both continuous and binary phenotypes, we simulated 5 scenarios: 1. There are genetic main effects but no gene-BMI interaction effects; 2. There are genetic main effects and weak gene-BMI interaction effects; 3. There are both genetic main effects and gene-BMI interaction effects of the same magnitude; 4. There are gene-BMI interaction effects and weak genetic main effects; 5. There are gene-BMI interaction effects but no genetic main effects. For continuous traits, we simulated 5 genotype datasets with sample size 2000 in each scenario, and 200 replicates of covariates for each genotype dataset. The genetic variants and covariates were simulated following the same procedure as in the type I error simulations. For each replicate, we randomly selected 10 causal variants. The continuous phenotypes Y were generated from

$$Y = 0.5\,\mathrm{sex} + 0.05\,\mathrm{age} + 0.1\,\mathrm{bmi} + \sum_{j=1}^{q}\gamma_{1,j}\,g_j + \sum_{j=1}^{q}\gamma_{2,j}\,g_j\,(\mathrm{bmi} - 25) + \varepsilon,$$

where $g_j$ is the $j$th genetic variant, with MAF $f_j$. The genetic main effects $\gamma_{1,j}$ were simulated from $\gamma_{1,j} \sim N(0, \sigma_1^2)$, where the constant $\sigma_1$ was fixed at 0.2 in Scenarios 1 to 3, 0.02 in Scenario 4 and 0 in Scenario 5. Gene-BMI interaction effects $\gamma_{2,j}$ were simulated from $\gamma_{2,j} \sim N(0, \sigma_2^2)$, where the constant $\sigma_2$ was fixed at 0 in Scenario 1, 0.005 in Scenario 2, and 0.05 in Scenarios 3 to 5.
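The five scenarios for continuous traits can be written down as a small lookup table. The sketch below draws effect sizes for one replicate and evaluates the phenotype model for one individual; it assumes that non-causal variants have zero effect, and the function names and seed are hypothetical.

```python
import random

# (sigma1, sigma2) per scenario for continuous traits, as listed in the text
SCENARIOS = {
    1: (0.2, 0.0),
    2: (0.2, 0.005),
    3: (0.2, 0.05),
    4: (0.02, 0.05),
    5: (0.0, 0.05),
}

def draw_effects(scenario, q=20, n_causal=10, seed=1):
    """Draw main and interaction effects; non-causal variants get 0 (assumption)."""
    rng = random.Random(seed)
    sigma1, sigma2 = SCENARIOS[scenario]
    causal = set(rng.sample(range(q), n_causal))
    gamma1 = [rng.gauss(0.0, sigma1) if j in causal else 0.0 for j in range(q)]
    gamma2 = [rng.gauss(0.0, sigma2) if j in causal else 0.0 for j in range(q)]
    return gamma1, gamma2

def phenotype(sex, age, bmi, genotype, gamma1, gamma2, rng):
    """Continuous phenotype for one individual under the power model."""
    main = sum(g1 * g for g1, g in zip(gamma1, genotype))
    inter = sum(g2 * g for g2, g in zip(gamma2, genotype)) * (bmi - 25.0)
    return 0.5 * sex + 0.05 * age + 0.1 * bmi + main + inter + rng.gauss(0.0, 1.0)

gamma1_s5, gamma2_s5 = draw_effects(5)
```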

For binary traits, we first simulated 5 genotype datasets with sample size 20,000, and 200 replicates of covariates for each genotype dataset. The genetic variants and covariates were simulated following the same procedure as in the type I error simulations. For each replicate, we randomly selected 10 causal variants, and we calculated the probability of disease P(Y =1) from

$$\log\frac{P(Y=1)}{1 - P(Y=1)} = \beta_0 + 0.5\,\mathrm{sex} + 0.05\,(\mathrm{age} - 50) + 0.1\,(\mathrm{bmi} - 25) + \sum_{j=1}^{q}\gamma_{1,j}\,g_j + \sum_{j=1}^{q}\gamma_{2,j}\,g_j\,(\mathrm{bmi} - 25),$$

where $g_j$ is the $j$th genetic variant with MAF $f_j$. The parameter $\beta_0$ was determined such that the baseline (individuals with sex = 0, age = 50, bmi = 25, all $g_j$ = 0) prevalence of disease is 10%. After obtaining the disease status for all 20,000 individuals in each replicate, we randomly sampled 1000 cases and 1000 controls from the cohort. Genetic main effects $\gamma_{1,j}$ and gene-BMI interaction effects $\gamma_{2,j}$ were calculated in the same way as for continuous traits, except that the constant $\sigma_1$ was fixed at 0.4 in Scenarios 1 to 3, 0.04 in Scenario 4 and 0 in Scenario 5; and the constant $\sigma_2$ was fixed at 0 in Scenario 1, 0.01 in Scenario 2, and 0.1 in Scenarios 3 to 5.

Results

Type I Error Simulations

Table 1 includes the empirical type I errors of the 7 tests: 1) genetic main effect test (SKAT); 2) SIMreg main effect test (SR); 3) gene-BMI interaction test adjusting for fixed genetic main effects (INT-FIX); 4) gene-BMI interaction test adjusting for random genetic main effects (INT-RAN); 5) GESAT-W gene-BMI interaction test (GESAT-W); 6) the joint test of genetic main effects and gene-BMI interaction (JOINT); and 7) SIMreg joint test (SR-JOINT) at α levels of 0.05, 0.01, 0.001 and 0.0001 from 1 million simulation replicates. For binary traits, GESAT-W is not available yet.

Table 1.

Empirical type I errors from simulation studies. Each entry represents the proportion of p-values less than the corresponding α level from 1 million simulation replicates.

| Level α | SKAT | SR | INT-FIX | INT-RAN | GESAT-W | JOINT | SR-JOINT |
|---|---|---|---|---|---|---|---|
| Continuous traits | | | | | | | |
| 0.05 | 0.05093 | 0.04800 | 0.05366 | 0.05033 | 0.04971 | 0.04984 | 0.04839 |
| 0.01 | 0.01008 | 0.00978 | 0.01084 | 0.00992 | 0.00986 | 0.00982 | 0.00993 |
| 0.001 | 0.00102 | 0.00116 | 0.00104 | 0.00096 | 0.00094 | 0.00097 | 0.00108 |
| 0.0001 | 0.00009 | 0.00013 | 0.00010 | 0.00008 | 0.00008 | 0.00010 | 0.00012 |
| Binary traits | | | | | | | |
| 0.05 | 0.05021 | 0.04707 | 0.05799 | 0.04834 | NA | 0.04748 | 0.03647 |
| 0.01 | 0.00972 | 0.00941 | 0.01142 | 0.00915 | NA | 0.00905 | 0.00661 |
| 0.001 | 0.00089 | 0.00102 | 0.00114 | 0.00092 | NA | 0.00086 | 0.00062 |
| 0.0001 | 0.00007 | 0.00011 | 0.00010 | 0.00008 | NA | 0.00009 | 0.00006 |

INT-FIX has slightly inflated type I errors at high α levels, and it is more pronounced for binary traits. SR-JOINT is conservative for binary traits. All other tests have empirical type I errors close to the corresponding α levels at all 4 levels we studied. For binary traits, SKAT, INT-RAN and JOINT are slightly conservative at low α levels, but not as conservative as SR-JOINT. The results suggest that INT-RAN and JOINT that we propose in this paper are valid tests.
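One way to judge whether a deviation in Table 1 exceeds simulation noise is to compare it with the binomial standard error of an empirical rejection rate. The sketch below does this for three entries at α = 0.05; it assumes the 1 million replicates are independent, which is an approximation since replicates share genotype datasets, and the function name is ours.

```python
import math

def z_score(observed, alpha, n_rep=1_000_000):
    """Standardized deviation of an empirical type I error from its nominal level."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_rep)  # binomial standard error
    return (observed - alpha) / se

z_int_fix = z_score(0.05366, 0.05)          # INT-FIX, continuous traits
z_int_ran = z_score(0.05033, 0.05)          # INT-RAN, continuous traits
z_sr_joint_binary = z_score(0.03647, 0.05)  # SR-JOINT, binary traits
```

Under this approximation the INT-FIX and SR-JOINT entries lie many standard errors from 0.05, whereas the INT-RAN entry is within about two.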

Power Simulations

For continuous traits, empirical power results at the α level of 0.0001 from 1,000 simulation replicates are shown in Figure 1. In Scenario 1, because there are genetic main effects but no gene-BMI interactions, the rejection rates of the interaction tests (INT-FIX, INT-RAN and GESAT-W) are very close to 0. Main effect tests (SKAT and SR) are more powerful than joint tests (JOINT and SR-JOINT) in this case. A similar pattern is observed in Scenario 2, where the association is dominated by genetic main effects and interaction effects are weak. In Scenario 5, where there are gene-BMI interactions but no genetic main effects, the interaction tests (INT-FIX, INT-RAN and GESAT-W) are the most powerful, the power of the joint tests (JOINT and SR-JOINT) is slightly lower, and the rejection rates of the main effect tests (SKAT and SR) are very close to 0. A similar pattern is observed in Scenario 4. In Scenario 3, where genetic main effects and gene-BMI interactions are of the same magnitude, SR-JOINT is the most powerful test, followed by JOINT.

Figure 1.

Power simulation results for continuous traits at the α level of 0.0001. Scenario 1 includes genetic main effects but no gene-BMI interaction effects; Scenario 2 includes genetic main effects and weak gene-BMI interaction effects; Scenario 3 includes both genetic main effects and gene-BMI interaction effects; Scenario 4 includes gene-BMI interaction effects and weak genetic main effects; Scenario 5 includes gene-BMI interaction effects but no genetic main effects.

For binary traits, empirical power results at the α level of 0.0001 from 1,000 simulation replicates are shown in Figure 2. The results show patterns similar to those for continuous traits. Generally, main effect tests are most powerful when the association is dominated by genetic main effects, and interaction tests are most powerful when main effects are absent or weak. Joint tests are most powerful when main effects and interaction effects are of the same magnitude, and they are only slightly less powerful than the most powerful tests in the other scenarios.

Figure 2.

Power simulation results for binary traits at the α level of 0.0001. Scenario 1 includes genetic main effects but no gene-BMI interaction effects; Scenario 2 includes genetic main effects and weak gene-BMI interaction effects; Scenario 3 includes both genetic main effects and gene-BMI interaction effects; Scenario 4 includes gene-BMI interaction effects and weak genetic main effects; Scenario 5 includes gene-BMI interaction effects but no genetic main effects.

Computational Time

Although our simulation studies show that SR-JOINT has comparable power to our JOINT test proposed in this paper, a great advantage of our approach is its computational efficiency. We calculated average computational time for JOINT and SR-JOINT in our type I error simulation settings with sample size from 1,000 to 10,000. Table 2 shows that the computational time for SR-JOINT increases almost cubically with the sample size. In contrast, our JOINT test is much faster than SR-JOINT when the sample size is greater than or equal to 2,000. The comparison suggests that in real applications, SR-JOINT may only be applicable to small studies.

Table 2.

Computational time of joint tests. We assumed each gene had 20 genetic variants, and calculated the mean computational time for each run and the standard error of the mean (SEM) from 1,000 replicates. Entries are mean time in seconds, with the SEM in parentheses.

| Sample size | JOINT | SR-JOINT | INT-FIX | INT-RAN | GESAT-W |
|---|---|---|---|---|---|
| 1,000 | 11.53 (0.18) | 16.27 (0.31) | 0.047 (0.001) | 0.779 (0.008) | 0.090 (0.001) |
| 2,000 | 12.74 (0.21) | 146.41 (2.49) | 0.075 (0.002) | 1.179 (0.012) | 0.157 (0.001) |
| 5,000 | 17.13 (0.27) | 2222.42 (35.74) | 0.158 (0.003) | 3.619 (0.032) | 0.406 (0.003) |
| 10,000 | 33.87 (0.45) | 15779.14 (151.29) | 0.292 (0.006) | 9.591 (0.095) | 0.820 (0.006) |

We also compared computational time for three interaction tests: INT-FIX, INT-RAN and GESAT-W. Table 2 shows that INT-RAN is slower than GESAT-W, while INT-FIX is faster. All three interaction tests are much faster than SIMreg interaction test, which takes up to 26 minutes for one gene with sample size 2,000. The interaction tests are also faster than the joint tests.
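The scaling of the two joint tests can be checked from the mean times in Table 2 by computing a log-log slope between the smallest and largest sample sizes. The sketch below uses only the reported means; the function and variable names are ours.

```python
import math

sizes = [1_000, 2_000, 5_000, 10_000]
joint = [11.53, 12.74, 17.13, 33.87]             # mean seconds, JOINT
sr_joint = [16.27, 146.41, 2222.42, 15779.14]    # mean seconds, SR-JOINT

def loglog_slope(n, t):
    """Exponent k in t ~ n^k, estimated from the first and last points."""
    return math.log(t[-1] / t[0]) / math.log(n[-1] / n[0])

slope_sr_joint = loglog_slope(sizes, sr_joint)
slope_joint = loglog_slope(sizes, joint)

# Ratio of SR-JOINT to JOINT mean time at each sample size
ratios = [s / j for s, j in zip(sr_joint, joint)]
```

The SR-JOINT slope comes out close to 3, consistent with near-cubic growth, while the JOINT slope is below 1 over this range.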

Application to the Framingham Heart Study

In this real data example, we performed a genome-wide sliding window analysis for gene-BMI interaction on fasting glucose and fasting insulin levels. The Framingham Heart Study (FHS) was initiated in 1948, and in 1971, 5124 individuals who were the offspring of the Original Cohort and their spouses were recruited in the Offspring Cohort. By 2005, 4095 individuals who had at least one parent in the Offspring Cohort had been recruited in the Third Generation Cohort (Gen III). We selected 1895 unrelated individuals from the FHS Offspring and Gen III Cohorts and we analyzed rank-normalized fasting glucose and fasting insulin levels from the most recent physical examination. We adjusted for age, sex, cohort, the first 10 PCs and the BMI main effect, and tested genetic main effects using SKAT and SR, gene-BMI interaction using INT-FIX, INT-RAN and GESAT-W, and the joint tests JOINT and SR-JOINT. We used genotype data from the Framingham Single Nucleotide Polymorphism (SNP) Health Association Resource (SHARe) and restricted our analysis to SNPs with MAF less than 5%. We defined the gene regions in each test using a sliding window method: each window has a width of 500 kb and overlaps the previous and the subsequent window by 250 kb each.
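The window layout can be sketched as follows. This is only an illustration of fixed-width windows with a half-width step; the choice of starting coordinate and the handling of the last window (which may extend past the end) are assumptions, and the function name is hypothetical.

```python
def sliding_windows(start, end, width=500_000, step=250_000):
    """Return (left, right) bounds of fixed-width windows that cover [start, end]."""
    windows = []
    left = start
    while True:
        windows.append((left, left + width))
        if left + width >= end:
            break  # the current window already reaches the end of the region
        left += step
    return windows

# Example on a hypothetical 2 Mb region
windows = sliding_windows(0, 2_000_000)
```

Each window shares 250 kb with its predecessor and 250 kb with its successor, so every interior position is covered by two windows.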

We analyzed 11,090 sliding windows from all 22 autosomes. After removing windows with 0 or only 1 rare SNP, we had results from 10,516 windows, with the number of rare SNPs ranging from 2 to 67 with a median of 17. In Figure 3 we present Quantile-Quantile plots from both analyses, using SKAT, SR, INT-FIX, INT-RAN, GESAT-W, JOINT and SR-JOINT. There was no systematic inflation of type I errors: most p-values were very close to corresponding expected values.

Figure 3.

Quantile-Quantile plots from the sliding window analysis on fasting glucose and fasting insulin levels in an unrelated sample of 1895 individuals from FHS. P-values from SKAT, SR, JOINT, SR-JOINT, INT-FIX, INT-RAN and GESAT-W are plotted against expected p-values under corresponding null hypotheses.

We used 4.75×10⁻⁶ as the genome-wide significance level, based on a Bonferroni procedure to control the family-wise significance level at 0.05. At this significance level we found only 1 window associated with fasting glucose, allowing for gene-BMI interaction. The region is on chromosome 9, at positions 114,068,775 to 114,568,775 (NCBI Build 36), and contains 9 rare SNPs, which are located in genes ROD1, HSDL2, KIAA1958, INIP, SNX30 and the intergenic regions between them. The JOINT test p-value is 2.06×10⁻⁶, but interestingly, none of the SKAT, SR, INT-FIX, INT-RAN, GESAT-W or SR-JOINT p-values (3.90×10⁻³, 2.98×10⁻³, 3.23×10⁻⁵, 1.90×10⁻⁵, 3.72×10⁻⁵, 8.64×10⁻⁵) reaches the genome-wide significance level. We did not find any windows associated with fasting insulin level, using SKAT, INT-FIX, INT-RAN and JOINT, at the genome-wide significance level of 4.75×10⁻⁶.
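The threshold follows from dividing the family-wise level by the number of windows with results. The short check below reproduces it and compares the reported p-values for the chromosome 9 window against it; the dictionary simply restates the values given in the text.

```python
n_windows = 10_516           # windows with at least 2 rare SNPs
family_wise_alpha = 0.05
threshold = family_wise_alpha / n_windows   # Bonferroni-corrected level

# Reported p-values for the chromosome 9 window (fasting glucose)
p_values = {
    "SKAT": 3.90e-3,
    "SR": 2.98e-3,
    "INT-FIX": 3.23e-5,
    "INT-RAN": 1.90e-5,
    "GESAT-W": 3.72e-5,
    "SR-JOINT": 8.64e-5,
    "JOINT": 2.06e-6,
}
significant = [name for name, p in p_values.items() if p < threshold]
```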

Discussion

We propose two interaction tests and one joint test for testing gene-environment interaction of rare genetic variants. These tests are flexible and applicable to both binary and continuous outcomes. In our simulation studies, we show that INT-FIX has slightly inflated type I errors at high α levels with moderate number of rare variants (q = 20), while INT-RAN and JOINT have well controlled type I errors. On the other hand, INT-FIX and INT-RAN have almost the same power in all simulation scenarios, thus we would recommend using INT-RAN instead of INT-FIX for the interaction test in general, and note that INT-FIX results should be interpreted with caution unless there are only a few variants in the test. An R package rareGE is available from the authors upon request.

Joint tests are slightly less powerful than main effect tests if there is no interaction, and slightly less powerful than interaction tests if there is no main effect, which is consistent with our knowledge of the single marker interaction and joint tests [2]. However, joint tests are most powerful when both genetic main effect and gene-environment interaction are present, and the power loss compared to main effect tests or interaction tests is acceptable when either interaction or main effect is absent or weak, suggesting that joint tests are an attractive approach for testing genetic association allowing for gene-environment interaction, when we do not have a priori knowledge about the presence of gene-environment interaction.

Tzeng et al. showed that their gene-trait similarity regression approach is equivalent to the haplotype random effects model [13, 25], thus if we replace genotypes by haplotypes in INT-RAN, it is equivalent to the interaction test in gene-trait similarity regression when analyzing continuous traits. Moreover, Tzeng’s joint test would be a special case of our joint test at fixed ρ = 0.5 if we had analyzed haplotypes. It is not surprising that the gene-trait similarity regression joint test has higher power than our joint test when genetic main effects and interaction effects are of the same magnitude. However, when the sample size is moderate to large, Tzeng’s joint test takes on average 11 to 465 times as much computational time as our joint test in a single-gene run, which limits its application.

The interaction test proposed by Lin et al. [15] introduces a tuning parameter in the estimation of genetic main effects. They used generalized cross validation [26] to select the tuning parameter. We note that INT-RAN is equivalent to their test if the tuning parameter is not constrained by their pre-specified upper bound and is equal to $\phi_R/\tau_{1R}$, estimated from our null generalized linear mixed model, treating genetic main effects as random. Moreover, our simulation study shows that the performance of their test is very similar to that of INT-RAN.
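The connection between the random-effects estimate and a penalized estimate can be sketched with a standard matrix identity. The derivation below is an illustration for a Gaussian trait with V = ϕI, under the assumption that the tuning parameter in [15] acts as a ridge penalty on the genetic main effects; the shorthand A = GW1 is introduced here.

```latex
% Push-through identity with A = G W_1 and \Sigma = \tau_1 A A^T + V:
\tau_1 A^T \Sigma^{-1}
  = \left(A^T V^{-1} A + \tau_1^{-1} I_q\right)^{-1} A^T V^{-1}

% Gaussian trait with V = \phi I_n:
\gamma_{1R} = \tau_{1R} W_1 G^T \Sigma_R^{-1} (y_R - X\beta_R)
  = \left(W_1 G^T G W_1 + \frac{\phi_R}{\tau_{1R}} I_q\right)^{-1}
    W_1 G^T (y_R - X\beta_R)
```

The second line has the form of a ridge estimator whose penalty is the ratio of the dispersion estimate to the variance component estimate.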

Although there are several genes (ROD1, HSDL2, KIAA1958, INIP, SNX30) overlapping with the region on chromosome 9 found to be associated with fasting glucose in the joint test, none of them has previously been reported to influence the fasting glucose levels. Little is known about the role of these genes in metabolism regulation, except that HSDL2 may be involved in the cholesterol-responsive atherosclerosis pathway [27]. However, the genotype data we used in the analysis were from SNP arrays originally designed for GWAS and most SNPs were common and excluded. For rare variants analysis, it would be ideal if we had sequence data which include more densely distributed rare SNPs. We hope to revisit this example in the future.

Gene-environment interaction has been of great interest in recent years, and gene-environment interaction studies on the whole genome have identified novel association loci for many traits [5, 28–30]. However, most of these studies so far were performed as single marker tests and are not powerful for detecting gene-environment interaction of rare variants. The interaction and joint tests for gene-environment interaction of rare variants proposed in this paper were developed based on the SKAT framework [12], a variance component test for multiple genetic variants, which is powerful regardless of the directions of effects of those variants. Also, the proposed approaches can be viewed as a generalization of existing methods under specific conditions. We believe our general approach, especially the joint test, can be readily used in genetic epidemiological studies to test genetic associations with rare variants, allowing for effect modification by an environmental variable, and can also facilitate future discoveries.

Acknowledgements

The authors would like to thank Dr. Xinyi Lin for sharing the R package iSKAT to perform GESAT-W. This research was partially supported by NIH awards R01 DK078616, U01 DK85526 and K24 DK080140. A portion of this research was conducted using the Linux Clusters for Genetic Analysis (LinGA) computing resources at Boston University Medicine Campus. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This work was partially supported by a contract with Affymetrix, Inc for genotyping services (Contract No. N02-HL-6-4278).

Appendix

Calculating P-value for the Joint Test

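To compute the p-value of the observed $Q_J = q_J$, let $T(\rho_s)$ be the value such that, for each $\rho_s$, $1 \le s \le S$, $P(Q_J(\rho_s) > T(\rho_s)) = q_J$. Let $u_1 = W_1 G^T V_J^{-1} \Delta_J (Y - \mu_J)$ and $u_2 = W_2 G^T E V_J^{-1} \Delta_J (Y - \mu_J)$; then $u_1$, $u_2$ are random vectors of length $q$, and $Q_J(\rho) = \rho u_1^T u_1 + (1-\rho) u_2^T u_2$. Since $u_1$ and $u_2$ are asymptotically normal, their joint distribution is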

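$$\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} \Phi_{11} & \Phi_{12} \\ \Phi_{21} & \Phi_{22} \end{pmatrix} = \begin{pmatrix} W_1 G^T P G W_1 & W_1 G^T P E G W_2 \\ W_2 G^T E P G W_1 & W_2 G^T E P E G W_2 \end{pmatrix} \right).$$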

The conditional distribution of $u_2$ given $u_1$ is

$$u_2 \mid u_1 \sim N\left(\Phi_{21}\Phi_{11}^{-1}u_1,\ \Phi_{22} - \Phi_{21}\Phi_{11}^{-1}\Phi_{12}\right).$$

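Let $\lambda_j$ be the eigenvalues of $\Phi_{22} - \Phi_{21}\Phi_{11}^{-1}\Phi_{12}$, and let $U \Lambda U^T = \Phi_{22} - \Phi_{21}\Phi_{11}^{-1}\Phi_{12}$ be the eigendecomposition, where $U$ is an orthogonal matrix and $\Lambda = \mathrm{diag}\{\lambda_j\}$ is a $q \times q$ diagonal matrix. Then the conditional distribution $u_2^T u_2 \mid u_1 \sim \sum_{j=1}^{q} \lambda_j \chi^2_{1,j}(\delta_j^2)$ is a weighted sum of non-central chi-squares with 1 df, where $\delta_j$ is the $j$th element of the vector $\Lambda^{-1/2} U^T \Phi_{21}\Phi_{11}^{-1} u_1$. The p-value of the test $p_J = P(Q_J < q_J)$ satisfies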
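The weights and non-centrality parameters described above can be obtained with a few lines of linear algebra. The following is a minimal sketch, assuming NumPy and assuming that the conditional covariance matrix is positive definite (so that $\Lambda^{-1/2}$ exists); the function name `conditional_mixture` and its arguments are hypothetical and only illustrate the calculation.

```python
import numpy as np

def conditional_mixture(phi11, phi12, phi22, u1):
    """Return (lambda, delta^2) for u2'u2 | u1 ~ sum_j lambda_j * chi2_1(delta_j^2).

    Assumes the conditional covariance is positive definite (all lambda_j > 0).
    """
    phi21 = phi12.T
    # Conditional mean Phi21 Phi11^{-1} u1, via a linear solve instead of an inverse.
    cond_mean = phi21 @ np.linalg.solve(phi11, u1)
    # Conditional covariance Phi22 - Phi21 Phi11^{-1} Phi12.
    cond_cov = phi22 - phi21 @ np.linalg.solve(phi11, phi12)
    # Symmetric eigendecomposition: cond_cov = U diag(lam) U^T.
    lam, U = np.linalg.eigh(cond_cov)
    # delta = Lambda^{-1/2} U^T Phi21 Phi11^{-1} u1.
    delta = (U.T @ cond_mean) / np.sqrt(lam)
    return lam, delta ** 2
```

One quick sanity check on such a sketch is that the mean of the mixture, $\sum_j \lambda_j (1 + \delta_j^2)$, equals the trace of the conditional covariance plus the squared norm of the conditional mean.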

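$$\begin{aligned}
1 - p_J &= P\left(\min\{p(\rho_1), p(\rho_2), \ldots, p(\rho_S)\} > q_J\right) \\
&= P\left(\rho_s u_1^T u_1 + (1-\rho_s) u_2^T u_2 < T(\rho_s),\ s = 1, 2, \ldots, S\right) \\
&= E\left\{P\left(\rho_s u_1^T u_1 + (1-\rho_s) u_2^T u_2 < T(\rho_s),\ s = 1, 2, \ldots, S \mid u_1\right)\right\} \\
&= E\left\{P\left(u_2^T u_2 < \min_{s=1,2,\ldots,S} \frac{T(\rho_s) - \rho_s u_1^T u_1}{1-\rho_s} \,\Big|\, u_1\right)\right\}.
\end{aligned}$$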
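The last equality can be checked by solving each inequality for $u_2^T u_2$. As a short sketch, assuming $0 \le \rho_s < 1$ so that dividing by $1-\rho_s$ preserves the direction of the inequality:

$$\rho_s u_1^T u_1 + (1-\rho_s)\, u_2^T u_2 < T(\rho_s) \iff u_2^T u_2 < \frac{T(\rho_s) - \rho_s u_1^T u_1}{1-\rho_s}.$$

Requiring this for every $s = 1, 2, \ldots, S$ is the same as requiring $u_2^T u_2$ to lie below the smallest of the right-hand sides. If the grid includes $\rho_s = 1$, the corresponding inequality reduces to $u_1^T u_1 < T(1)$, which involves $u_1$ alone and would need to be treated separately from the minimum.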

Let $F(x) = P\left(u_2^T u_2 > \min_{s=1,2,\ldots,S} \frac{T(\rho_s) - \rho_s u_1^T u_1}{1-\rho_s} \,\Big|\, u_1 = x\right)$ be a function on $\mathbb{R}^q$, which can be easily evaluated using the conditional distribution $u_2^T u_2 \mid u_1$, and let $f(x)$ be the probability density function of the marginal distribution $u_1 \sim N(0, \Phi_{11})$; then the p-value is

$$p_J = \int F(x) f(x)\, dx.$$

The integral for $p_J$ is taken over $\mathbb{R}^q$. When $q$ is small, we can use numerical methods to calculate the integral, such as adaptive Gauss-Hermite quadrature. When $q$ is large, however, quadrature rules become impracticable due to the curse of dimensionality, so we recommend using Monte Carlo methods to approximate the integral. We randomly sample $B$ values of $u_1$ from $N(0, \Phi_{11})$, denoted $u_{1,1}, u_{1,2}, \ldots, u_{1,B}$; the p-value is then approximated by

$$p_J \approx \frac{1}{B} \sum_{b=1}^{B} F(u_{1,b}).$$
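The Monte Carlo average above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes NumPy, takes the thresholds $T(\rho_s)$ as given inputs, and evaluates $F$ by a second, inner simulation of $u_2 \mid u_1$ (a dedicated method for mixtures of non-central chi-squares could replace that inner step). The function name `joint_p_value_mc`, its arguments, and the handling of $\rho_s = 1$ as a condition on $u_1$ alone are all hypothetical choices made for this sketch.

```python
import numpy as np

def joint_p_value_mc(phi11, phi12, phi22, rhos, thresholds,
                     n_outer=500, n_inner=2000, seed=0):
    """Monte Carlo estimate of p_J = E[F(u1)], with u1 ~ N(0, Phi11)."""
    rng = np.random.default_rng(seed)
    phi11, phi12, phi22 = (np.asarray(m, dtype=float)
                           for m in (phi11, phi12, phi22))
    rhos = np.asarray(rhos, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    q = phi11.shape[0]

    # Conditional law of u2 given u1 = x is N(A x, cond_cov).
    A = np.linalg.solve(phi11, phi12).T           # Phi21 Phi11^{-1}
    cond_cov = phi22 - A @ phi12
    lam, U = np.linalg.eigh(cond_cov)
    root = U * np.sqrt(np.clip(lam, 0.0, None))   # root @ root.T == cond_cov

    interior = rhos < 1.0                         # grid points with rho_s < 1
    total = 0.0
    for x in rng.multivariate_normal(np.zeros(q), phi11, size=n_outer):
        s1 = x @ x
        # Assumption of this sketch: rho_s = 1 constrains u1 alone, so
        # F(x) = 1 whenever u1'u1 already reaches T(1).
        if np.any(s1 >= thresholds[~interior]):
            total += 1.0
            continue
        if not interior.any():
            continue                              # F(x) = 0 in this case
        # min over s of (T(rho_s) - rho_s * u1'u1) / (1 - rho_s)
        bound = np.min((thresholds[interior] - rhos[interior] * s1)
                       / (1.0 - rhos[interior]))
        # Inner simulation of u2 | u1 = x to estimate F(x).
        u2 = A @ x + rng.standard_normal((n_inner, q)) @ root.T
        total += np.mean(np.einsum("ij,ij->i", u2, u2) > bound)
    return total / n_outer
```

In the special case of a single grid point $\rho = 0$ with $\Phi_{12} = 0$ and $\Phi_{22} = I$, the estimate should approach the upper tail probability of a central chi-square with $q$ df at the chosen threshold, which gives a simple way to check the sketch.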

References

  • 1. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11:446–450. doi: 10.1038/nrg2809.
  • 2. Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Hum Hered. 2007;63:111–119. doi: 10.1159/000099183.
  • 3. Manning AK, LaValley M, Liu CT, Rice K, An P, Liu Y, Miljkovic I, Rasmussen-Torvik L, Harris TB, Province MA, Borecki IB, Florez JC, Meigs JB, Cupples LA, Dupuis J. Meta-analysis of gene-environment interaction: joint estimation of SNP and SNP × environment regression coefficients. Genet Epidemiol. 2011;35:11–18. doi: 10.1002/gepi.20546.
  • 4. Aschard H, Hancock DB, London SJ, Kraft P. Genome-wide meta-analysis of joint tests for genetic and gene-environment interaction effects. Hum Hered. 2010;70:292–300. doi: 10.1159/000323318.
  • 5. Manning AK, Hivert MF, Scott RA, Grimsby JL, Bouatia-Naji N, Chen H, Rybin D, Liu CT, Bielak LF, Prokopenko I, Amin N, Barnes D, Cadby G, Hottenga JJ, Ingelsson E, Jackson AU, Johnson T, Kanoni S, Ladenvall C, Lagou V, Lahti J, Lecoeur C, Liu Y, Martinez-Larrad MT, Montasser ME, Navarro P, Perry JR, Rasmussen-Torvik LJ, Salo P, Sattar N, Shungin D, Strawbridge RJ, Tanaka T, van Duijn CM, An P, de Andrade M, Andrews JS, Aspelund T, Atalay M, Aulchenko Y, Balkau B, Bandinelli S, Beckmann JS, Beilby JP, Bellis C, Bergman RN, Blangero J, Boban M, Boehnke M, Boerwinkle E, Bonnycastle LL, Boomsma DI, Borecki IB, Bottcher Y, Bouchard C, Brunner E, Budimir D, Campbell H, Carlson O, Chines PS, Clarke R, Collins FS, Corbaton-Anchuelo A, Couper D, de Faire U, Dedoussis GV, Deloukas P, Dimitriou M, Egan JM, Eiriksdottir G, Erdos MR, Eriksson JG, Eury E, Ferrucci L, Ford I, Forouhi NG, Fox CS, Franzosi MG, Franks PW, Frayling TM, Froguel P, Galan P, de Geus E, Gigante B, Glazer NL, Goel A, Groop L, Gudnason V, Hallmans G, Hamsten A, Hansson O, Harris TB, Hayward C, Heath S, Hercberg S, Hicks AA, Hingorani A, Hofman A, Hui J, Hung J, Jarvelin MR, Jhun MA, Johnson PC, Jukema JW, Jula A, Kao WH, Kaprio J, Kardia SL, Keinanen-Kiukaanniemi S, Kivimaki M, Kolcic I, Kovacs P, Kumari M, Kuusisto J, Kyvik KO, Laakso M, Lakka T, Lannfelt L, Lathrop GM, Launer LJ, Leander K, Li G, Lind L, Lindstrom J, Lobbens S, Loos RJ, Luan J, Lyssenko V, Magi R, Magnusson PK, Marmot M, Meneton P, Mohlke KL, Mooser V, Morken MA, Miljkovic I, Narisu N, O'Connell J, Ong KK, Oostra BA, Palmer LJ, Palotie A, Pankow JS, Peden JF, Pedersen NL, Pehlic M, Peltonen L, Penninx B, Pericic M, Perola M, Perusse L, Peyser PA, Polasek O, Pramstaller PP, Province MA, Raikkonen K, Rauramaa R, Rehnberg E, Rice K, Rotter JI, Rudan I, Ruokonen A, Saaristo T, Sabater-Lleal M, Salomaa V, Savage DB, Saxena R, Schwarz P, Seedorf U, Sennblad B, Serrano-Rios M, Shuldiner AR, Sijbrands EJ, Siscovick DS, Smit JH, Small KS, Smith NL, Smith AV, Stancakova A, Stirrups K, Stumvoll M, Sun YV, Swift AJ, Tonjes A, Tuomilehto J, Trompet S, Uitterlinden AG, Uusitupa M, Vikstrom M, Vitart V, Vohl MC, Voight BF, Vollenweider P, Waeber G, Waterworth DM, Watkins H, Wheeler E, Widen E, Wild SH, Willems SM, Willemsen G, Wilson JF, Witteman JC, Wright AF, Yaghootkar H, Zelenika D, Zemunik T, Zgaga L, DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium, Multiple Tissue Human Expression Resource (MUTHER) Consortium, Wareham NJ, McCarthy MI, Barroso I, Watanabe RM, Florez JC, Dupuis J, Meigs JB, Langenberg C. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nat Genet. 2012;44:659–669. doi: 10.1038/ng.2274.
  • 6. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
  • 7. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024.
  • 8. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384.
  • 9. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34:188–193. doi: 10.1002/gepi.20450.
  • 10. Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010;70:42–54. doi: 10.1159/000288704.
  • 11. Hoffmann TJ, Marini NJ, Witte JS. Comprehensive approach to analyzing rare genetic variants. PLoS One. 2010;5:e13584. doi: 10.1371/journal.pone.0013584.
  • 12. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
  • 13. Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet. 2011;89:277–288. doi: 10.1016/j.ajhg.2011.07.007.
  • 14. Kazma R, Cardin NJ, Witte JS. Does accounting for gene-environment interactions help uncover association between rare variants and complex diseases? Hum Hered. 2012;74:205–214. doi: 10.1159/000346825.
  • 15. Lin X, Lee S, Christiani DC, Lin X. Test for interactions between a genetic marker set and environment in generalized linear models. Biostatistics. 2013;14:667–681. doi: 10.1093/biostatistics/kxt006.
  • 16. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. doi: 10.1038/ng1847.
  • 17. Breslow NE, Clayton DG. Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association. 1993;88:9–25.
  • 18. Zhang D, Lin X. Hypothesis testing in semiparametric additive mixed models. Biostatistics. 2003;4:57–74. doi: 10.1093/biostatistics/4.1.57.
  • 19. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13:762–775. doi: 10.1093/biostatistics/kxs014.
  • 20. Voorman A, Brody J, Chen H, Lumley T. seqMeta: An R package for meta-analyzing region-based tests of rare DNA variants. 2014. http://cran.r-project.org/web/packages/seqMeta/index.html
  • 21. Imhof JP. Computing the Distribution of Quadratic Forms in Normal Variables. Biometrika. 1961;48:419–426.
  • 22. Davies RB. Algorithm AS 155: The Distribution of a Linear Combination of χ2 Random Variables. Journal of the Royal Statistical Society. Series C (Applied Statistics). 1980;29:323–333.
  • 23. Kuonen D. Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika. 1999;86:929–935.
  • 24. Liu H, Tang Y, Zhang HH. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput Stat Data Anal. 2009;53:853–856.
  • 25. Tzeng JY, Zhang D, Chang SM, Thomas DC, Davidian M. Gene-trait similarity regression for multimarker-based association analysis. Biometrics. 2009;65:822–832. doi: 10.1111/j.1541-0420.2008.01176.x.
  • 26. O'Sullivan F, Yandell BS, Raynor WJ Jr. Automatic Smoothing of Regression Functions in Generalized Linear Models. Journal of the American Statistical Association. 1986;81:96–103.
  • 27. Skogsberg J, Lundstrom J, Kovacs A, Nilsson R, Noori P, Maleki S, Kohler M, Hamsten A, Tegner J, Bjorkegren J. Transcriptional profiling uncovers a network of cholesterol-responsive atherosclerosis target genes. PLoS Genet. 2008;4:e1000036. doi: 10.1371/journal.pgen.1000036.
  • 28. Beaty TH, Ruczinski I, Murray JC, Marazita ML, Munger RG, Hetmanski JB, Murray T, Redett RJ, Fallin MD, Liang KY, Wu T, Patel PJ, Jin SC, Zhang TX, Schwender H, Wu-Chou YH, Chen PK, Chong SS, Cheah F, Yeow V, Ye X, Wang H, Huang S, Jabs EW, Shi B, Wilcox AJ, Lie RT, Jee SH, Christensen K, Doheny KF, Pugh EW, Ling H, Scott AF. Evidence for gene-environment interaction in a genome wide study of nonsyndromic cleft palate. Genet Epidemiol. 2011;35:469–478. doi: 10.1002/gepi.20595.
  • 29. Hamza TH, Chen H, Hill-Burns EM, Rhodes SL, Montimurro J, Kay DM, Tenesa A, Kusel VI, Sheehan P, Eaaswarkhanth M, Yearout D, Samii A, Roberts JW, Agarwal P, Bordelon Y, Park Y, Wang L, Gao J, Vance JM, Kendler KS, Bacanu SA, Scott WK, Ritz B, Nutt J, Factor SA, Zabetian CP, Payami H. Genome-wide gene-environment study identifies glutamate receptor gene GRIN2A as a Parkinson's disease modifier gene via interaction with coffee. PLoS Genet. 2011;7:e1002237. doi: 10.1371/journal.pgen.1002237.
  • 30. Hancock DB, Artigas MS, Gharib SA, Henry A, Manichaikul A, Ramasamy A, Loth DW, Imboden M, Koch B, McArdle WL, Smith AV, Smolonska J, Sood A, Tang W, Wilk JB, Zhai G, Zhao JH, Aschard H, Burkart KM, Curjuric I, Eijgelsheim M, Elliott P, Gu X, Harris TB, Janson C, Homuth G, Hysi PG, Liu JZ, Loehr LR, Lohman K, Loos RJ, Manning AK, Marciante KD, Obeidat M, Postma DS, Aldrich MC, Brusselle GG, Chen TH, Eiriksdottir G, Franceschini N, Heinrich J, Rotter JI, Wijmenga C, Williams OD, Bentley AR, Hofman A, Laurie CC, Lumley T, Morrison AC, Joubert BR, Rivadeneira F, Couper DJ, Kritchevsky SB, Liu Y, Wjst M, Wain LV, Vonk JM, Uitterlinden AG, Rochat T, Rich SS, Psaty BM, O'Connor GT, North KE, Mirel DB, Meibohm B, Launer LJ, Khaw KT, Hartikainen AL, Hammond CJ, Glaser S, Marchini J, Kraft P, Wareham NJ, Volzke H, Stricker BH, Spector TD, Probst-Hensch NM, Jarvis D, Jarvelin MR, Heckbert SR, Gudnason V, Boezen HM, Barr RG, Cassano PA, Strachan DP, Fornage M, Hall IP, Dupuis J, Tobin MD, London SJ. Genome-wide joint meta-analysis of SNP and SNP-by-smoking interaction identifies novel loci for pulmonary function. PLoS Genet. 2012;8:e1003098. doi: 10.1371/journal.pgen.1003098.
