1. Summary
Current methods for detecting genetic associations lack full consideration of the background effects of environmental exposures. Recently proposed methods to account for environmental exposures have focused on logistic regressions with gene-environment interactions. In this report, we developed a test for genetic association, encompassing a broad range of risk models, including linear, logistic and probit, for specifying joint effects of genetic and environmental exposures. We obtained the test statistics by maximizing over a class of score tests, each of which involves modified standard tests of genetic association through a weight function. This weight function reflects the potential heterogeneity of the genetic effects by levels of environmental exposures under a particular model. Simulation studies demonstrate the robust power of these methods for detecting genetic associations under a wide range of scenarios. Applications of these methods are further illustrated using data from genome-wide association studies of type 2 diabetes with body mass index and of lung cancer risk with smoking.
Keywords: gene-environment interaction, GWAS, score test, SNPs, environmental exposures
2. Introduction
Recent genome-wide association studies (GWAS) have been successful in discovering genetic variants associated with a variety of human complex traits, including common chronic diseases such as cancers, type 2 diabetes and heart disease (The NHGRI GWAS Catalog, www.genome.gov/gwastudies). However, a large proportion of the heritability of these complex traits remains unexplained, and this has created the challenge of dissecting the full genetic architecture of these traits by going beyond classical GWAS approaches (Eichler et al., 2010).
One major barrier to taking full advantage of the data collected in GWAS is the lack of suitable methods that are capable of incorporating the effects of environmental exposures to improve power for testing genetic association. A standard approach for incorporating environmental exposures in genetic association analysis is to include them as covariates in a regression model. It is well known that for linear regression analysis of quantitative traits, simple adjustment for predictive covariates can enhance the power of the genetic association test through the reduction of the residual variance. For conventional logistic regression analysis of binary disease outcomes, however, incorporating environmental exposures simply as covariates implicitly assumes that the disease-gene association odds ratio is constant across levels of environmental exposure and can result in loss of power (Pirinen, Donnelly and Spencer, 2012).
To account for heterogeneity in genetic effects by levels of environmental exposures, logistic regression models can be extended to incorporate gene-environment interaction terms. Some researchers have proposed using “omnibus tests” (Kraft et al., 2007) for genetic main effects and gene-environment interaction terms in logistic models as a way of improving power for the detection of susceptibility genes with heterogeneous effects by levels of environmental exposures. Limitations of omnibus tests include loss of power due to increased degrees of freedom and their inability to impose natural directional constraints, under which genetic effects may be modified, but not reversed, by environmental exposures.
Logistic regression remains the most popular model for the analysis of binary disease traits in genetic epidemiologic studies due to the wide availability of software and its easy amenability to case-control retrospective study designs. However, alternative models for specifying joint effects of genetic susceptibility and environmental risk factors are needed (Clayton, 2012). If a disease could be thought of as a manifestation of an underlying continuous liability variable exceeding some clinical threshold, then it may be natural to assume that the genetic and environmental effects act additively on the risk of a disease in the probit scale (Falconer, 1965; Zaitlen et al., 2012). If a genetic and an environmental effect act on the risk of a disease through independent biological mechanisms, then one can expect their effects to be additive on the linear scale (Rothman, Greenland and Walker, 1980).
In this report, we propose a novel test for genetic association that is optimal under a large class of underlying risk models incorporating gene-environment interactions. We use a class of weighted association tests with the weights accounting for degrees of heterogeneity of genetic effects expected under various models. We obtain the final test statistic by maximizing the individual test statistics over a class of weight functions and describe an analytic method of efficiently approximating a p-value. Since the form of the weight for which the test-statistic is maximized suggests a specific underlying model, the method is also useful for guiding strategies for replicating the genetic association in independent studies. Other practical advantages of this method include its applicability to both prospective and retrospective studies, similar to logistic regression, and its ability to enhance power by incorporating gene-environment independence (Chatterjee and Carroll, 2005) and directional constraints on genetic effects. We assess the performance of the proposed method and compare it to alternative strategies through both simulation studies and real data examples.
3. Models and Methods
3.1 Comparing Models: Genetic Odds ratios Across Exposure Levels
To demonstrate the features of this new approach, we first compare three alternative models, each of which assumes additive effects of genetic and environment factors on the risk of the disease on a suitable scale. Suppose that for subject i, Gi is a genotype of a SNP, coded as 0, 1, or 2 by the number of copies of a minor allele. Ei is an environmental risk factor that is a continuous variable (or a categorical variable coded as a set of dummy variables). Finally, Di is the disease status, coded as 0 (unaffected) or 1 (affected).
Multiplicative Model (Logistic Model)
Additive effects for Gi and Ei on the logistic scale imply:
$$\operatorname{logit}\{\Pr(D_i = 1 \mid G_i, E_i)\} = \beta_0 + \beta_G G_i + \beta_E E_i \qquad (1)$$
According to this model, the genetic effect measured as OR (= exp(βG)) does not vary across different levels of Ei. For rare diseases, i.e. when Pr(Di = 1) is small, the odds-ratio approximates the relative risk (because p1(1 − p2)/{p2(1 − p1)} ≈ p1/p2 when p1 ≈ 0 and p2 ≈ 0) and thus logistic models imply that genetic relative risks are approximately constant with respect to levels of environmental exposures.
Additive Model
Alternatively, an additive model assumes Gi and Ei act additively on the risk of the disease itself, i.e.
$$\Pr(D_i = 1 \mid G_i, E_i) = b_0 + b_G G_i + b_E E_i \qquad (2)$$
This model implies that the risk difference, and not the odds-ratio, associated with Gi is constant across levels of Ei.
Probit Model (Liability Threshold Model)
Finally, if the effects of Gi and Ei are assumed to be additive on the probit scale, then one has:
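$$\Phi^{-1}\{\Pr(D_i = 1 \mid G_i, E_i)\} = \gamma_0 - C + \gamma_G G_i + \gamma_E E_i \qquad (3)$$
where Φ(·) is the cumulative distribution function of a standard normal distribution. This model is equivalent to having a latent trait Yi modeled as Yi = γ0 + γG Gi + γE Ei + εi, where εi ~ N(0, 1), with Di = 1 if Yi > C; that is, one becomes affected by disease if the value of a latent trait is above a certain clinical threshold C. This is also known as a liability threshold (LT) model.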
Model Comparisons
We compare how the three models described above alternatively specify the genetic odds-ratios by levels of environmental exposures. As an illustration, we assume Ei to be a continuous exposure variable with values from 0 to 10. Further, to make the models comparable, we assigned each model a set of parameters satisfying the following conditions: (i) disease prevalence (probability of the disease in the population) of 1%; (ii) population attributable risk due to exposure of 50%; (iii) population attributable risk due to gene of 10% (see Supplemental Table S1). Figure 1A illustrates how the genetic odds-ratio varies as a function of exposure level under the three models. As expected, under the logistic model, the genetic odds-ratio remains constant across levels of the environmental risk factor. Under the probit model, the genetic odds-ratio decreases modestly as Ei increases. Under the additive model, the genetic odds-ratio decreases quite dramatically as Ei increases.
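The comparison above can be sketched numerically. The following is a minimal illustration, not the authors' calculation: the parameter values are hypothetical and are not those of Supplemental Table S1, and the probit intercept is assumed to absorb the liability threshold. It computes the per-allele odds ratio (G = 1 versus G = 0) over a grid of exposure levels under each of the three models.

```python
import math

def normal_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def odds(p):
    return p / (1.0 - p)

# Hypothetical parameter values chosen only to give rare-disease risks;
# they are NOT the values from Supplemental Table S1.
def risk_logistic(g, e, b0=-5.0, bg=0.2, be=0.15):
    return 1.0 / (1.0 + math.exp(-(b0 + bg * g + be * e)))

def risk_additive(g, e, b0=0.004, bg=0.001, be=0.001):
    return b0 + bg * g + be * e

def risk_probit(g, e, c0=-2.6, cg=0.08, ce=0.06):
    # c0 plays the role of the intercept minus the liability threshold.
    return normal_cdf(c0 + cg * g + ce * e)

def genetic_odds_ratio(risk, e):
    """Per-allele odds ratio (G = 1 versus G = 0) at exposure level e."""
    return odds(risk(1, e)) / odds(risk(0, e))

exposures = [0.0, 2.5, 5.0, 7.5, 10.0]
models = {"logistic": risk_logistic,
          "additive": risk_additive,
          "probit": risk_probit}
or_curves = {name: [genetic_odds_ratio(f, e) for e in exposures]
             for name, f in models.items()}

for name, curve in or_curves.items():
    print(name, [round(v, 3) for v in curve])
```

With these illustrative values the logistic curve is flat, while the additive curve falls more steeply than the probit curve, which is the qualitative pattern described for Figure 1A.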
Figure 1.

Comparing three models: additive, multiplicative and probit models. For each model, a set of parameters was assigned to satisfy the conditions: (i) disease prevalence 1%; (ii) population attributable risk due to exposure 50%; (iii) population attributable risk due to gene 10%. For each model: A. Per-allele odds ratio for a SNP, calculated over different levels of exposure. B. Weight on genotype, calculated over different levels of exposure.
3.2 Exposure Weighted Score test for Genetic Association (EWSGA)
In the previous section, we described how different models for gene-environment additive effects correspond to different functional forms of genetic odds-ratios as a function of environmental exposures. This implies that power can be gained if one up- or down-weights individuals in popular odds-ratio-based genetic association tests according to their levels of environmental exposure. To examine what might be the optimal form of this weight, we first derive a score test for genetic association under the additive model, which reveals a strong inverse relationship between the genetic odds-ratio and environmental exposures.
In Web-Appendix A, we show that under a rare disease approximation, the additive model can be re-written as a logistic model in the form:
| (4) |
In (4), the form of the logistic model is non-standard in that the interaction coefficients are not free parameters, but instead are non-linear functions of the main effect terms. The model, however, can be used to test for the null hypothesis of no genetic effects, i.e. H0 : bG = 0.
Now, under model (4), the score function associated with the log-likelihood for N subjects in a study is given by
$$S = \sum_{i=1}^{N} G_i \exp(\beta_E E_i)^{-1}\{D_i - \mu(w_i; \eta)\} \qquad (5)$$
where parameters associated with the null model can be obtained by ordinary logistic regression analysis of disease status on environmental exposures (Ei) and possible confounders (zi), ignoring genetic data (Gi). Here μ(wi; η) denotes the disease probability for subject i under this null model, with wi collecting the covariates (Ei, zi) and η the null-model parameters. In (5), the term exp(βE Ei)⁻¹, the inverse of the odds ratio for exposure Ei, acts as a weight. For exposures which are positively associated with the disease (βE > 0), subjects are down-weighted with increasing levels of exposure by a degree that depends on the strength of association (βE) between disease and exposure. We further note that, irrespective of whether the rare disease assumption holds, the score test defined by (5) is unbiased under the null hypothesis of no genetic association as long as the null logistic model is correctly specified. The optimality of the exp(βE Ei)⁻¹ weight function under the additive model (2), however, depends on the accuracy of the rare disease approximation.
Noting that the score of the standard logistic regression model has the same form as (5), except that each subject is equally weighted, we suggest a general class of score functions of the form:
$$S(\theta) = \sum_{i=1}^{N} \omega_i^{\theta}\{G_i D_i - G_i\,\mu(w_i; \eta)\} \qquad (6)$$
where ωi ≡ exp(βE Ei). When the parameter θ is allowed to vary over a range of values including, but not limited to, θ = −1 and θ = 0, the class encompasses a wide range of alternative models: θ = −1 gives the weight in (5) derived under the additive model, and θ = 0 gives the equally weighted score of the standard logistic model. For the probit model, for example, where the genetic odds-ratio decreases with increasing levels of an environmental risk factor but much more modestly than under the additive model (see Figure 1A), the optimal θ is expected to lie between −1 and 0. Values of θ > 0 correspond to models where the genetic odds-ratio may get larger with increasing levels of environmental exposure. In principle, θ could range from −∞ to +∞, where the two limits correspond to models where the genetic effects become negligible in the presence or absence of an environmental risk factor. In practice, it is sufficient to consider a range of θ bounded by some large but finite value. We note that this framework allows Ei to be a vector of several exposure variables.
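To make the role of θ concrete, here is a minimal sketch of the weight ωi^θ and of a weighted score of the form Σ ωi^θ Gi {Di − μi}, which is the form the surrounding text suggests for (6). The function names, the toy data and the fitted probabilities `mu` are hypothetical; in practice `mu` and `beta_e` would come from the null logistic regression of disease on exposure.

```python
import math

def exposure_weights(exposure, beta_e, theta):
    """Return omega_i ** theta, where omega_i = exp(beta_e * E_i)."""
    return [math.exp(theta * beta_e * e) for e in exposure]

def weighted_score(disease, genotype, exposure, mu, beta_e, theta):
    """Weighted score: sum_i omega_i**theta * G_i * (D_i - mu_i)."""
    weights = exposure_weights(exposure, beta_e, theta)
    return sum(w * g * (d - m)
               for w, g, d, m in zip(weights, genotype, disease, mu))

# Toy inputs (hypothetical): six subjects, exposure coded 0/1/2.
disease = [1, 0, 1, 0, 1, 0]
genotype = [2, 1, 1, 0, 0, 1]
exposure = [0, 0, 1, 1, 2, 2]
mu = [0.3, 0.3, 0.5, 0.5, 0.7, 0.7]  # assumed null-model fitted probabilities
beta_e = math.log(2.0)               # assumed exposure log odds ratio

for theta in (-1.0, -0.2, 0.0, 1.0):
    score = weighted_score(disease, genotype, exposure, mu, beta_e, theta)
    print(theta, round(score, 4))
```

Setting `theta = 0` weights every subject equally, while `theta = -1` halves the contribution of a subject for each unit increase in exposure under the assumed `beta_e = log 2`.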
3.3 Constructing a score test
For any given θ, one can construct a score test of the form T(θ) = S(θ)/√V(θ), where S(θ) is the score function in (6) and V(θ) is its variance under the null hypothesis. The details of the particular forms of the variance V(θ) are derived in Web-Appendix B-C. As a final step, we propose to use the maximum score statistic over a range of θ for robust detection of genetic association under a range of alternative models for gene-environment interaction. Thus, the final test statistic takes the form
| (7) |
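The construction in this section can be sketched end to end as follows. This is an illustration under stated assumptions rather than the authors' implementation: the null model contains only an intercept and a single exposure, the variance treats the weights as fixed and uses the usual nuisance-parameter adjustment for a logistic score test (the source's V(θ) is derived in its Web-Appendix and may differ), and the maximum is taken over |T(θ)| on a grid. All function names and the simulated data are hypothetical.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_null_logistic(disease, exposure, n_iter=25):
    """Newton-Raphson fit of logit Pr(D=1|E) = b0 + be*E (genotype ignored)."""
    b0, be = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for d, e in zip(disease, exposure):
            mu = expit(b0 + be * e)
            v = mu * (1.0 - mu)
            g0 += d - mu
            g1 += (d - mu) * e
            h00 += v
            h01 += v * e
            h11 += v * e * e
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        be += (h00 * g1 - h01 * g0) / det
    return b0, be

def standardized_score(disease, genotype, exposure, b0, be, theta):
    """T(theta) = S(theta) / sqrt(V(theta)), weights treated as fixed."""
    s = a2v = c0 = c1 = i00 = i01 = i11 = 0.0
    for d, g, e in zip(disease, genotype, exposure):
        mu = expit(b0 + be * e)
        v = mu * (1.0 - mu)
        a = g * math.exp(theta * be * e)   # G_i * omega_i ** theta
        s += a * (d - mu)
        a2v += a * a * v
        c0 += a * v
        c1 += a * v * e
        i00 += v
        i01 += v * e
        i11 += v * e * e
    det = i00 * i11 - i01 * i01
    # Adjustment for estimating the null-model intercept and slope.
    adj = (c0 * (i11 * c0 - i01 * c1) + c1 * (i00 * c1 - i01 * c0)) / det
    return s / math.sqrt(a2v - adj)

def max_score_test(disease, genotype, exposure, thetas):
    """Return (max |T(theta)|, maximizing theta) over the grid."""
    b0, be = fit_null_logistic(disease, exposure)
    return max((abs(standardized_score(disease, genotype, exposure,
                                       b0, be, t)), t) for t in thetas)

# Simulated toy data with hypothetical effect sizes.
rng = random.Random(2024)
n = 2000
exposure = [rng.choices([0, 1, 2], weights=[0.5, 0.3, 0.2])[0] for _ in range(n)]
genotype = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n)]
disease = [int(rng.random() < expit(-1.0 + 0.3 * g + 0.4 * e))
           for g, e in zip(genotype, exposure)]

thetas = [-1.0 + 0.1 * k for k in range(41)]   # grid over [-1, 3]
t_max, theta_hat = max_score_test(disease, genotype, exposure, thetas)
print("max |T| =", round(t_max, 3), "at theta =", round(theta_hat, 2))
```

Because the maximum is taken over many correlated statistics, `t_max` should not be referred directly to a standard normal distribution; the next section addresses that adjustment.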
3.4 Calculating a p-value of the maximum test
To compute a p-value for the proposed test, we apply extreme value theory for estimating the tail probability of the maximum of a stochastic process (Davies, 1977; Feingold, Brown and Siegmund, 1993; Leadbetter, Lindgren and Rootzen, 1983). Specifically, we apply the method proposed by Davies (1977), developed for the case of non-stationary Gaussian processes, which is given as the following:
| (8) |
where c is the observed value of the maximum test statistic, Φ(·) is the cumulative distribution function of a standard normal distribution and ρ11(θ) = ∂²ρ(ϕ, θ)/∂ϕ² evaluated at ϕ = θ, with ρ(ϕ, θ) = corr(T(ϕ), T(θ)). The derivations of ρ11(θ) and corr(T(ϕ), T(θ)) are in Web-Appendix D. The first two terms in (8) are related to typical terms for assessing significance from a single test that follows N(0, 1), and the last term is a penalty term imposed due to maximizing over various risk models.
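For orientation, the one-sided upper bound from Davies (1977), as it is commonly stated, is Φ(−c) + exp(−c²/2)/(2π) · ∫ {−ρ11(θ)}^(1/2) dθ over the search interval. The sketch below evaluates that bound by trapezoidal integration. It is an assumption that this matches the source's equation (8), which may differ, for example in how two-sided alternatives are handled; the function names and the constant `rho11` used in the example are hypothetical.

```python
import math

def normal_sf(c):
    """Upper tail probability of a standard normal, 1 - Phi(c)."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))

def davies_bound(c, rho11, lower, upper, n_steps=2000):
    """One-sided Davies (1977) style bound on Pr(sup T(theta) > c).

    rho11 is a function returning the second derivative of the
    correlation function at each theta (non-positive by construction).
    """
    h = (upper - lower) / n_steps
    vals = [math.sqrt(max(0.0, -rho11(lower + k * h)))
            for k in range(n_steps + 1)]
    # Trapezoidal rule for the integral of sqrt(-rho11(theta)).
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return normal_sf(c) + math.exp(-0.5 * c * c) * integral / (2.0 * math.pi)

# Hypothetical example: constant curvature over theta in [-1, 3].
p_single = normal_sf(5.0)
p_adjusted = davies_bound(5.0, lambda theta: -0.25, -1.0, 3.0)
print(p_single, p_adjusted)
```

When `rho11` is identically zero, the statistics at all θ are perfectly correlated and the bound reduces to the single-test tail probability; larger curvature means a larger penalty for searching over θ.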
3.5 Extension to utilize the gene-environment independence assumption
For case-control studies, an assumption of gene-environment independence in the underlying population can be utilized to enhance the power of gene-environment tests. In particular, it has been shown that under both the assumption of independence and the rare disease assumption, the gene-environment interaction parameters in a logistic model can be estimated from case data alone. Such a case-only estimator can be more precise than a traditional case-control based estimator for an interaction obtained from a standard logistic regression analysis (Piegorsch, Weinberg and Taylor, 1994). Subsequently, it has been shown that the gene-environment independence assumption can be exploited using a retrospective likelihood for case-control data to obtain efficient parameter estimates of all of the parameters, including but not limited to the interaction coefficient of a general logistic model (Chatterjee and Carroll, 2005). More recently, a study showed how the retrospective likelihood framework can be used to exploit the gene-environment independence assumption to enhance the power of the test for additive interaction (Han et al., 2012b).
In order to exploit the gene-environment independence assumption in the current setting, we propose a simple modification of the score statistic based on the retrospective likelihood. In particular, Chatterjee and Carroll (2005) showed that the retrospective likelihood for case-control data can be equivalently expressed as Pr(Di, Gi | Ei, zi, Ri = 1), with Ri being a selection mechanism for the case-control design. We propose to modify the score function in (6) by replacing the centering term Giμ(wi; η) by E(GiDi | Gi, Ei, zi, Ri = 1), where the expectation is calculated under the retrospective likelihood. Under the gene-environment independence assumption and the null hypothesis of no genetic association, the modified score function can be written as
Further, the variance of the modified score function under retrospective likelihood can be written as:
where and p is the minor allele frequency of a given SNP, so that Pr(Gi = 0) = (1 − p)², Pr(Gi = 1) = 2p(1 − p) and Pr(Gi = 2) = p² (hence E(Gi | Ei, zi, Ri = 1) = E(Gi) = 2p), under the assumption of Hardy-Weinberg equilibrium. The forms of IβGψT and IψψT and detailed derivations of V_RETRO(θ) are presented in Web-Appendix E. The modified score-statistic is given by , and the final test-statistic under the retrospective likelihood takes the form
| (9) |
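As a small worked check of the Hardy-Weinberg quantities used in this section, the genotype mean follows directly from the three genotype probabilities. The variance line is not stated in the source and is added here only as a standard consequence of the same probabilities.

```latex
\begin{align*}
E(G_i) &= 0\cdot(1-p)^2 + 1\cdot 2p(1-p) + 2\cdot p^2 = 2p,\\
E(G_i^2) &= 1\cdot 2p(1-p) + 4\cdot p^2 = 2p + 2p^2,\\
\operatorname{Var}(G_i) &= E(G_i^2) - \{E(G_i)\}^2 = 2p(1-p).
\end{align*}
```

For example, with a minor allele frequency of p = 0.5, these give E(Gi) = 1 and Var(Gi) = 0.5.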
4. Simulation under various underlying risk models
Simulation settings
We conducted a simulation study to evaluate the performance of the proposed method, comparing it to the existing standard methods under the truth of various risk models: additive, multiplicative, probit and supra-multiplicative models, i.e. βGE > 0 in (A3). It is assumed that a SNP Gi ∈ {0, 1, 2} has a minor allele frequency of 0.5, and Ei is a categorical variable with three levels 0, 1, and 2 with Pr(Ei = 0) = 0.5, Pr(Ei = 1) = 0.3 and Pr(Ei = 2) = 0.2. The disease prevalence Pr(Di = 1) in the underlying population is 10%.
For simulating data sets under additive, multiplicative and probit models, we varied the population attributable risk due to exposure, i.e. PAR(E), as 10%, 30% or 50% within each risk model. PAR(G) was fixed at 10%, and hence three sets of parameters satisfying the given conditions were calculated and used for each risk model (see Web-Appendix F). The parameter values used for simulations are shown in Supplemental Tables S2–S4. For additive models, a sample size of 6000 cases and 6000 controls was used; 7000 cases and 7000 controls were used for multiplicative and probit models to make the power levels comparable across different risk models. Supra-multiplicative models were simulated using the parameters of a multiplicative model with PAR(E) = 30% and by including two interaction parameters βGE1 and βGE2 in the model; the values for βGE2 were varied as log(1.15), log(1.2), log(1.25), or log(1.3), fixing βGE1 = log(1.05) (Supplemental Table S5).
We simulated 1000 data sets from each model and applied (i) our proposed general score test in (7) with θ ∈ [−1, 3], (ii) a standard association test, i.e. testing for H0 : βG = 0 from the model logit(Pr(Di = 1|Gi, Ei)) = β0 + βG Gi + βE1 E1i + βE2 E2i, and (iii) a joint test for association and interactions, i.e. testing for H0 : βG = βGE1 = βGE2 = 0 (3 d.f. test) from the model logit(Pr(Di = 1|Gi, Ei)) = β0 + βG Gi + βE1 E1i + βE2 E2i + βGE1 GiE1i + βGE2 GiE2i. The power was assessed using a significance level α = 1×10⁻⁷. Simulations for type I error were performed under the null model logit(Pr(Di = 1|Gi, Ei)) = β0 + βE1 E1i + βE2 E2i, with parameter values the same as the ones used for a multiplicative model with PAR(E) = 30% and with βG = 0 imposed on them. A total of 10⁷ datasets were simulated to assess type I errors under ranges of significance levels.
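One way to generate case-control data of the kind described here is to draw subjects from an underlying population and keep them until the case and control quotas are filled. The sketch below does this for the multiplicative (logistic) model with a three-level exposure. The regression coefficients and sample sizes are hypothetical placeholders, not the values in the Supplemental Tables, and genotypes are drawn under Hardy-Weinberg equilibrium independently of exposure.

```python
import math
import random

def draw_genotype(rng, maf):
    """Number of minor alleles (0, 1 or 2) under Hardy-Weinberg equilibrium."""
    return (rng.random() < maf) + (rng.random() < maf)

def draw_exposure(rng, probs=(0.5, 0.3, 0.2)):
    """Three-level exposure with Pr(E=0), Pr(E=1), Pr(E=2) given by probs."""
    u = rng.random()
    if u < probs[0]:
        return 0
    if u < probs[0] + probs[1]:
        return 1
    return 2

def simulate_case_control(n_cases, n_controls, b0, bg, be1, be2,
                          maf=0.5, seed=0):
    """Sample (genotype, exposure) pairs for cases and controls."""
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g = draw_genotype(rng, maf)
        e = draw_exposure(rng)
        eta = b0 + bg * g + be1 * (e == 1) + be2 * (e == 2)
        diseased = rng.random() < 1.0 / (1.0 + math.exp(-eta))
        target, quota = (cases, n_cases) if diseased else (controls, n_controls)
        if len(target) < quota:
            target.append((g, e))
    return cases, controls

# Hypothetical coefficients giving a prevalence of roughly 10%.
cases, controls = simulate_case_control(
    n_cases=200, n_controls=200,
    b0=-2.6, bg=math.log(1.2), be1=math.log(1.5), be2=math.log(2.0))
print(len(cases), len(controls))
```

Repeating this with different seeds and applying each test to every replicate gives an empirical power estimate as the fraction of replicates with a p-value below the chosen α.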
Lastly, we also compared the performance of the general score test in (7) to the general score test that uses the gene-environment independence assumption in (9). Data sets were simulated from supra-multiplicative models with the four sets of parameter values shown in Supplemental Table S6, with varying magnitude of interactions. For each model, we applied the two general score tests, with and without the gene-environment independence assumption, and obtained the relative efficiency by taking the ratio of the two maximum score statistics, i.e. Relative efficiency = T_RETRO/T.
We consider an additional simulation setting motivated by recent findings on the association between type 2 diabetes risk and the SNP rs7903146; the results show that the genetic odds ratio for SNP rs7903146 among individuals with high body mass index (BMI) is smaller than the one observed among subjects with low BMI (see Supplemental Table S7) (Perry et al., 2012). We simulated case-control data on genotype and BMI status using the observed allele frequencies, prevalence of obesity (BMI > 30), ORs for the SNP given BMI categories, and the odds-ratio for BMI itself reported in the study. We varied the sample size as N = 4000, 5000, and 6000 with equal numbers of cases and controls. Power was then calculated at a significance level α = 1×10⁻⁸ over 1000 iterations. We applied three tests: the standard association test based on the multiplicative model, the joint test and the general score test in (7).
Simulation results
Supplemental Table S8 shows the type 1 error rates of the proposed general score test. We see that the Type 1 error is well controlled using the method through Davies’ formula. Figure 2A shows the results of power simulations in which data sets were simulated under the truth of additive models. According to the results, the power of a standard association test is lowest due to the fact it does not incorporate heterogeneous genetic effects by exposure levels. A joint test here has increased power compared to the standard test, but is still less powerful than the general score test due to increased degrees of freedom (3 d.f.). Hence the power of the general score test was the highest across different PAR(E) levels. A score test constructed under the true model (θ = −1) is shown in orange color in Figure 2A, which has higher power than the general score test as expected. However, this test is not generally applicable since the true model is unknown.
Figure 2.
Power simulation results under minor allele frequency (MAF) = 0.5 and significance level α = 10⁻⁷. Under the truth of: A. additive model, B. multiplicative model, C. probit model, and D. supra-multiplicative model. For panels A-C, population attributable risk due to exposure, i.e. PAR(E), varied as 10%, 30%, and 50%. “Test under true model” is a score test constructed under the additive model (θ = −1) in Panel A and a score test under the probit model (θ = −0.2) in Panel C. For D, the interaction parameter value changed from log(1.15) to log(1.3). E. A general score test exploiting the gene-environment independence assumption is compared to a general score test without the assumption. Y-axis is the ratio of maximum score test statistics of the two tests and X-axis is the magnitude of interactions in supra-multiplicative models. F. Power comparisons of three tests for simulated data for the interaction between body mass index (BMI) and rs7903146 for type 2 diabetes shown in Supplemental Table S7.
When data sets were generated from multiplicative models (Figure 2B) with corresponding θ = 0, a standard test constructed exactly from this multiplicative model was the most powerful, as expected. However, the general score test, which incorporates this model within its class of models, also had comparable power. The joint test, again due to increased degrees of freedom, had the lowest power. When the true model is a probit model (Figure 2C), in which θ is close to 0, the results were similar to those for θ = 0. However, when data sets were simulated from supra-multiplicative models (Figure 2D), the standard association test that did not incorporate gene-environment interactions had the smallest power, while the joint test showed higher power. Overall, the general score test showed the highest power across different magnitudes of interactions. Estimated θ values for each model are shown in Supplemental Table S9, which shows that the estimates agree with the simulated models. Power simulation results under a sub-multiplicative model with opposite beta coefficient values show similar patterns (Supplemental Figure S1). Simulation results under a smaller MAF of 0.1 are shown in Supplemental Figure S2.
Figure 2E shows the relative efficiency of the general score test utilizing the gene-environment independence assumption over the regular general score test, which varies from 1.02 to 1.18 depending on the magnitude of interactions. This implies that the method using the independence assumption obtains up to an 18% increase in the maximum statistic value compared to the one that does not use the assumption. Finally, Figure 2F shows the power of the different tests under a model that mimicked the association of the type 2 diabetes susceptibility SNP rs7903146 by BMI categories. The general score test shows the largest power across different sample sizes, and the standard association test based on the multiplicative model shows the smallest power. For a sample size of N = 5000, for example, the power (α = 1×10⁻⁸) is 0.68 for the general score test, 0.63 for the joint test and 0.6 for the standard association test.
5. Illustrative data analysis 1: NCI Lung cancer GWAS data
Data description
We applied the proposed method to a dataset from the National Cancer Institute (NCI) GWAS of lung cancer (LC) conducted with a Caucasian sample (Landi et al., 2009). Smoking is known to be a strong risk factor for LC. GWAS in Caucasian subjects of European descent (CEU), where the majority of cases (>90%) tend to be smokers, have identified several susceptibility SNPs for LC, some of which have effects only on specific histologic subtypes of the disease (Amos et al., 2008; Hung et al., 2008; McKay et al., 2008; Wang et al., 2008). GWAS conducted in non-smoking Asian populations (Lan et al., 2012), on the other hand, have identified a different set of susceptibility SNPs, indicating the presence of gene-environment interactions for this disease. In this report, we analyze 26 SNPs that were identified in previous GWAS (P < 5×10⁻⁸), conducted in either Caucasian (Amos et al., 2008; Hung et al., 2008; McKay et al., 2008; Wang et al., 2008) or Asian populations (Hu et al., 2011; Jin et al., 2012; Lan et al., 2012), and that are listed in the NHGRI GWAS catalog (http://www.genome.gov/gwastudies/) as susceptibility SNPs for lung cancer, overall or for specific subtypes. The aim of this analysis is to investigate the association of all of these SNPs within the NCI study, which includes only Caucasian subjects, while explicitly incorporating smoking history.
The data include 5739 cases and 5848 controls from four studies, including one population-based case-control study and three cohort studies (Landi et al., 2009): specifically, the Environment & Genetics in Lung Cancer Etiology (EAGLE) study (Landi et al., 2008); the Alpha-Tocopherol, Beta-Carotene Cancer Prevention (ATBC) Study (Group, 1994); the Prostate, Lung, Colorectal, and Ovarian Cancer (PLCO) Screening Trial (Hayes et al., 2005); and the Cancer Prevention Study II Nutrition Cohort (CPS-II) (Calle et al., 2002). A total of 26 susceptibility SNPs were used for this analysis (see Supplemental Table S10 for a SNP list) (Saccone et al., 2010). The main environmental variable we incorporated was smoking status, categorized as “never”, “former” or “current” smoker. Analyses are presented using data for all LC cases and also separately using cases for the two major histologic subtypes. As some of the SNPs are also known to be associated with smoking behavior, we only applied the prospective score test, which does not require any assumption of gene-environment independence. For all analyses, the effect of SNP genotype was modeled in an additive fashion and the disease model was adjusted for age, gender and study.
Analysis results
The quantile-quantile plots (Q-Q plots) are shown in Supplemental Figure S3 for comparing the p-values of the general score test, the standard association test and the joint test for analyzing all 26 SNPs. Figure 3 shows the Q-Q plot after excluding the top three SNPs (rs1051730, rs8034191 and rs8042374) in the chromosome 15 region that have extremely low p-values. These figures generally show increased evidence of associations of the SNPs using the general score test compared to either the standard association test or the joint test.
Figure 3.
Lung cancer GWAS SNPs analysis: Quantile - Quantile plots of –log10 scaled p-values comparing general score test (Y-axis) to standard test based on multiplicative model (A and C for overall lung cancer and adenocarcinoma only); to the joint test (B and D for overall lung cancer and adenocarcinoma only). Additive genetic model was used. All SNPs except for the top three SNPs rs1051730, rs8034191, rs8042374 were included in Q-Q plots.
More detailed results are shown in Supplemental Table S12 for the top SNPs, i.e. those for which at least one of the p-values from the standard association test based on the multiplicative model, the joint test or the general score test, for overall lung cancer or for adenocarcinoma (the most common subtype of lung cancer), is less than 0.05. It is noteworthy that SNPs previously identified in the CEU sample tend to show supra-multiplicative effects (θ > 0), i.e. stronger genetic effects for smokers, whereas those identified in Asian populations consistently show sub-multiplicative effects (θ < 0), i.e. stronger effects for non-smokers. It is also noteworthy that the SNPs initially identified exclusively through non-smoking Asian studies (12th to 15th SNPs, especially in the adenocarcinoma-only analysis in Supplemental Table S12) consistently achieved higher significance in the Caucasian study once the SNP-smoking interaction was taken into account using the general score test. These results indicate that the same set of SNPs may affect the risk of LC in different populations but with heterogeneous effects by smoking status. It is only the difference in smoking distribution across studies that has led to differential power for detection of these SNPs in one population or the other.
6. Illustrative data analysis 2: Type 2 Diabetes and BMI analysis
Data description
We applied our method to another GWAS data set, for type 2 diabetes (T2D). Our aim is to examine whether the proposed method, which allows heterogeneous genetic effects by BMI under various joint effect models, can enhance power for detecting susceptibility loci. The data were downloaded from the dbGaP database (http://www.ncbi.nlm.nih.gov/gap). The data include 2680 cases and 3148 controls from nested case-control studies embedded within the Nurses’ Health Study (NHS) and the Health Professionals’ Follow-up Study (HPFS) (Colditz, Manson and Hankinson, 1997; Rimm et al., 1991). A total of 52 top SNPs were analyzed, all of which have been identified as associated with T2D risk in large GWAS (P < 5×10⁻⁸) according to the NHGRI GWAS catalog (http://www.genome.gov/gwastudies/). For some GWAS SNPs that were not genotyped in the given data, surrogate SNPs with the highest correlation coefficient were identified and used in the analyses. We apply our methods using both the prospective approach (7) and the retrospective approach (9), with θ ∈ [−1, 3].
The models were adjusted for age and gender and restricted to Caucasian subjects (N=3295). BMI was categorized into three levels (normal, ≤ 25; overweight, 25–30; obese, > 30), and the analysis using BMI as a binary variable (obese (> 30) vs. non-obese (≤ 30)) is included in Supplemental Table S11.
Analysis results
Table 1 shows the results for the top ten SNPs identified by the general score test using the retrospective approach (Pretro). The results for all SNPs are shown in Figure 4, using Q-Q plots, excluding the top SNP rs4506565, for which all methods show highly significant results. As shown in Table 1, stratified associations by BMI show heterogeneous genetic odds ratios across different BMI levels for a number of the top SNPs. In particular, the first four SNPs show increased odds ratios and significance for low BMI. The maximized θ values for these SNPs range from −1 to −0.15, which implies that the joint effects of the SNPs and BMI follow sub-multiplicative models. Compared to the standard test of association under the multiplicative model, the general score tests (Pretro) show increased significance for several SNPs and comparable significance for the remainder (Figure 4A). The proposed score tests also showed increased levels of significance uniformly across all SNPs compared to the multi-degree-of-freedom joint test of association (Figure 4B). This pattern is consistent with the results from our simulation studies, where we also observed that the proposed score test tends to be uniformly more powerful than the joint test.
Table 1.
Type 2 Diabetes GWAS SNPs analysis. Top 10 SNPs sorted by general score test using retrospective approach
| SNP | BMI | OR | CI | Pstrat | Pmult | Pjoint | Ppro | Pretro | θ |
|---|---|---|---|---|---|---|---|---|---|
| rs4506565 | ≤ 25 | 1.36 | (1.19,1.56) | 8.70E-06 | 1.75E-13 | 2.44E-12 | 3.08E-13 | 8.59E-13 | −0.15 |
| | 25–30 | 1.47 | (1.29,1.67) | 2.73E-09 | | | | | |
| | > 30 | 1.18 | (0.98,1.43) | 9.93E-02 | | | | | |
| rs1317548 | ≤ 25 | 1.28 | (1.09,1.49) | 3.10E-03 | 5.98E-04 | 3.82E-03 | 4.45E-04 | 1.16E-04 | −0.55 |
| | 25–30 | 1.15 | (0.99,1.33) | 5.95E-02 | | | | | |
| | > 30 | 1.1 | (0.88,1.37) | 4.67E-01 | | | | | |
| rs10811661 | ≤ 25 | 0.72 | (0.59,0.87) | 7.50E-04 | 1.45E-02 | 9.20E-04 | 4.69E-04 | 2.91E-04 | −1 |
| | 25–30 | 0.88 | (0.75,1.03) | 9.22E-02 | | | | | |
| | > 30 | 1.19 | (0.94,1.50) | 1.12E-01 | | | | | |
| rs2972144 | ≤ 25 | 0.86 | (0.75,0.98) | 2.79E-02 | 2.88E-03 | 9.37E-03 | 1.50E-03 | 1.54E-03 | −0.5 |
| | 25–30 | 0.81 | (0.72,0.92) | 7.89E-04 | | | | | |
| | > 30 | 1.13 | (0.94,1.36) | 1.30E-01 | | | | | |
| rs7480855 | ≤ 25 | 0.99 | (0.77,1.28) | 9.20E-01 | 2.51E-02 | 3.54E-02 | 1.91E-02 | 2.66E-03 | 0.2 |
| | 25–30 | 0.8 | (0.63,1.02) | 6.46E-02 | | | | | |
| | > 30 | 0.67 | (0.48,0.94) | 1.98E-02 | | | | | |
| rs2943634 | ≤ 25 | 0.87 | (0.76,1.00) | 4.88E-02 | 6.18E-03 | 3.45E-04 | 3.16E-03 | 2.78E-03 | −0.5 |
| | 25–30 | 0.81 | (0.71,0.91) | 5.23E-04 | | | | | |
| | > 30 | 1.17 | (0.97,1.41) | 6.98E-02 | | | | | |
| rs7754840 | ≤ 25 | 1.1 | (0.95,1.26) | 2.00E-01 | 4.75E-03 | 3.42E-02 | 6.02E-03 | 5.12E-03 | 0.2 |
| | 25–30 | 1.12 | (0.99,1.27) | 6.74E-02 | | | | | |
| | > 30 | 1.21 | (1.00,1.47) | 6.58E-02 | | | | | |
| rs1801282 | ≤ 25 | 0.85 | (0.69,1.06) | 1.59E-01 | 5.90E-03 | 4.14E-02 | 9.36E-03 | 7.39E-03 | −0.1 |
| | 25–30 | 0.8 | (0.66,0.96) | 1.36E-02 | | | | | |
| | > 30 | 0.91 | (0.70,1.19) | 4.60E-01 | | | | | |
| rs7756992 | ≤ 25 | 1.11 | (0.97,1.28) | 1.34E-01 | 2.81E-03 | 1.90E-02 | 3.48E-03 | 1.05E-02 | 0.2 |
| | 25–30 | 1.12 | (0.98,1.27) | 7.64E-02 | | | | | |
| | > 30 | 1.26 | (1.03,1.54) | 3.66E-02 | | | | | |
| rs10954284 | ≤ 25 | 0.96 | (0.84,1.09) | 5.30E-01 | 6.00E-03 | 2.79E-02 | 7.16E-03 | 2.36E-02 | 0.2 |
| | 25–30 | 0.86 | (0.76,0.96) | 7.48E-03 | | | | | |
| | > 30 | 0.87 | (0.73,1.05) | 1.21E-01 | | | | | |
Note: Pstrat, Pmult, Pjoint, Ppro and Pretro denote p-values from the association test stratified by BMI, the standard association test, the joint test, and the general score test based on the prospective and retrospective approaches, respectively.
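As a reading aid for the stratified columns of Table 1, the reported odds ratio and confidence interval of a stratum can be converted back into an approximate test statistic. The sketch below assumes that the intervals are 95% Wald intervals on the log odds ratio scale, which the table does not state; the function name `wald_from_ci` is hypothetical. Because the tabulated values are rounded, the result should only roughly agree with the reported Pstrat.

```python
import math


def wald_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate Wald z and two-sided p-value from an OR and its 95% CI.

    Assumes a symmetric interval on the log scale:
    log(OR) +/- z_crit * SE.
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = log_or / se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # rs4506565, lowest BMI stratum of Table 1: OR 1.36, CI (1.19, 1.56).
    z, p = wald_from_ci(1.36, 1.19, 1.56)
    print(f"z = {z:.2f}, p = {p:.1e}")
```

For this row the approximation gives a p-value of the same order as the tabulated 8.70E-06, which is consistent with the stratified tests being ordinary one-degree-of-freedom tests within each BMI level.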
Figure 4.
Type 2 diabetes GWAS SNPs analysis: Quantile-quantile plots of −log10 scaled p-values comparing the general score test with the retrospective approach to the standard test based on the multiplicative model (A) and to the joint test (B). The most significant SNP, rs4506565, is excluded.
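The coordinates of such a two-method Q-Q plot can be obtained by sorting the −log10 p-values of each method separately and pairing them by rank. The following sketch shows one way to do this; the function name `qq_pairs` and the toy p-values are made up for illustration and are not taken from the study.

```python
import math


def qq_pairs(p_reference, p_proposed):
    """Pair the sorted -log10 p-values of two tests applied to the same SNPs.

    Points above the diagonal indicate that, rank for rank, the proposed
    test yields smaller p-values than the reference test.
    """
    if len(p_reference) != len(p_proposed):
        raise ValueError("both tests must be evaluated on the same SNP set")
    x = sorted(-math.log10(p) for p in p_reference)
    y = sorted(-math.log10(p) for p in p_proposed)
    return list(zip(x, y))


if __name__ == "__main__":
    # Toy p-values for three SNPs (illustrative only).
    reference = [0.20, 0.03, 0.004]
    proposed = [0.15, 0.01, 0.001]
    for x, y in qq_pairs(reference, proposed):
        print(f"{x:.2f}  {y:.2f}  {'above' if y > x else 'on/below'} diagonal")
```

Note that the pairing is by rank, not by SNP, so a point on the plot does not in general correspond to a single variant.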
7. Discussion
We propose a novel association test that can incorporate gene-environment interactions under a wide class of disease risk models. These methods are intuitive and straightforward to implement, as they simply involve standard score tests for genetic association by level of environmental exposure of individual subjects. Simulation studies and analyses of a number of datasets indicate that the method can increase power for detecting genetic association compared to standard association tests that may rely on a misspecified model assumption for gene-environment effects. The method also has more robust power than some alternative tests for genetic association that can incorporate environmental exposure only at the price of increased degrees of freedom.
Some additional features of the method merit attention. In addition to the p-value for the overall association test, the method produces an estimate for θ, the range of which indicates whether the optimal model corresponds to a sub- (θ<1) or supra-multiplicative (θ>1) model. The value of θ can provide an additional screening tool for evaluating the consistency of interaction patterns in replication studies; this is in the same spirit as the use of the sign of association parameters for testing the consistency of association findings based only on main effects of SNPs. Another notable feature of the method is that the weighted sum of score statistics, with a weight function constrained to be positive, can automatically down-weight SNPs that have estimated effects in opposite directions depending on the level of environmental exposure. In a recent article, we showed that restricting the parameter space of models to exclude extreme “qualitative” interactions can improve power for the detection of interaction, where the effect of one risk factor may be modified, or even nullified, but not reversed by another risk factor (Han, Rosenberg and Chatterjee, 2012a). The proposed method, however, is more flexible and can be powered for detecting qualitative as well as quantitative interactions by considering a larger class of weight functions that allow weights to switch signs according to the exposure level of the subjects. Lastly, our method is implemented in a freely available R package, CGEN (http://www.bioconductor.org/packages/release/bioc/html/CGEN.html).
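The idea of maximizing a positively weighted sum of stratum-specific score statistics over θ can be illustrated with a small Python sketch. It is not the CGEN implementation: the weight function exp(θ·e), the function names and the toy inputs are all assumptions made for illustration, and the strata are treated as independent. The sketch returns only the maximized statistic and the maximizing θ; a valid p-value must additionally account for the maximization over θ, which is not attempted here.

```python
import math


def weighted_score_stat(scores, variances, exposures, theta):
    """Squared weighted sum of stratum scores divided by its variance.

    scores[k] and variances[k] are the score statistic and its variance in
    exposure stratum k; exposures[k] is a numeric exposure level.
    The weight exp(theta * e) is a hypothetical choice that is always
    positive and equals 1 for every stratum when theta = 0.
    """
    weights = [math.exp(theta * e) for e in exposures]
    numerator = sum(w * u for w, u in zip(weights, scores))
    denominator = sum(w * w * v for w, v in zip(weights, variances))
    return numerator ** 2 / denominator


def max_weighted_score_stat(scores, variances, exposures,
                            theta_min=-1.0, theta_max=3.0, n_grid=81):
    """Grid search for the theta in [theta_min, theta_max] maximizing the statistic."""
    step = (theta_max - theta_min) / (n_grid - 1)
    grid = [theta_min + i * step for i in range(n_grid)]
    stats = [weighted_score_stat(scores, variances, exposures, t) for t in grid]
    best = max(range(n_grid), key=lambda i: stats[i])
    return stats[best], grid[best]


if __name__ == "__main__":
    # Toy strata (illustrative only): the effect is strongest at the lowest
    # exposure level and reverses sign at the highest level.
    scores = [6.0, 3.0, -1.0]
    variances = [4.0, 4.0, 4.0]
    exposures = [0, 1, 2]
    t_unweighted = weighted_score_stat(scores, variances, exposures, 0.0)
    t_max, theta_hat = max_weighted_score_stat(scores, variances, exposures)
    print(f"unweighted: {t_unweighted:.2f}, maximized: {t_max:.2f}, theta: {theta_hat:.2f}")
```

In this toy example the maximizing θ is negative, so strata with low exposure receive more weight and the stratum with the reversed effect is down-weighted, which mirrors the behavior described above for positive weight functions.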
9. Acknowledgments
This research was supported by the Intramural Research Program of the National Institutes of Health, National Cancer Institute (Division of Cancer Epidemiology and Genetics). We are grateful to Dr. David Siegmund for helpful advice and pointers to the literature on extreme value theory related to the calculation of p-values for the maximum test.
8. Supplementary Materials
Web Appendices A–F referenced in Sections 2 and 3 are available with this paper at the Biometrics website on Wiley Online Library.
10. References
- Amos CI, Wu X, Broderick P, Gorlov IP, Gu J, Eisen T, et al. Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nature Genetics. 2008;40:616–622. doi: 10.1038/ng.109.
- Calle EE, Rodriguez C, Jacobs EJ, Almon ML, Chao A, McCullough, et al. The American Cancer Society Cancer Prevention Study II Nutrition Cohort. Cancer. 2002;94:2490–2501. doi: 10.1002/cncr.101970.
- Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005;92:399–418.
- Clayton D. Link functions in multi-locus genetic models: implications for testing, prediction, and interpretation. Genetic Epidemiology. 2012;36:409–418. doi: 10.1002/gepi.21635.
- Colditz GA, Manson JE, Hankinson SE. The Nurses’ Health Study: 20-year contribution to the understanding of health among women. Journal of Women’s Health. 1997;6:49–62. doi: 10.1089/jwh.1997.6.49.
- Davies RB. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika. 1977;64:247–254. doi: 10.1111/j.0006-341X.2005.030531.x.
- Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH. Missing heritability and strategies for finding the underlying causes of complex disease. Nature Reviews Genetics. 2010;11:446–450. doi: 10.1038/nrg2809.
- Falconer DS. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of Human Genetics. 1965;29:51–76.
- Feingold E, Brown PO, Siegmund D. Gaussian models for genetic linkage analysis using complete high-resolution maps of identity by descent. American Journal of Human Genetics. 1993;53:234–251.
- Group ACPS. The alpha-tocopherol, beta-carotene lung cancer prevention study: design, methods, participant characteristics, and compliance. Annals of Epidemiology. 1994;4:1–10. doi: 10.1016/1047-2797(94)90036-1.
- Han SS, Rosenberg PS, Chatterjee N. Testing for Gene–Environment and Gene–Gene Interactions Under Monotonicity Constraints. Journal of the American Statistical Association. 2012a;107:1441–1452.
- Han SS, Rosenberg PS, Garcia-Closas M, Figueroa JD, Silverman D, Chanock SJ, et al. Likelihood ratio test for detecting gene (G)-environment (E) interactions under an additive risk model exploiting GE independence for case-control data. American Journal of Epidemiology. 2012b;176:1060–1067. doi: 10.1093/aje/kws166.
- Hayes RB, Sigurdson A, Moore L, Peters U, Huang WY, Pinsky P, et al. Methods for etiologic and early marker investigations in the PLCO trial. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis. 2005;592:147–154. doi: 10.1016/j.mrfmmm.2005.06.013.
- Hu Z, Wu C, Shi Y, Guo H, Zhao X, Yin Z, et al. A genome-wide association study identifies two new lung cancer susceptibility loci at 13q12.12 and 22q12.2 in Han Chinese. Nature Genetics. 2011;43:792–796. doi: 10.1038/ng.875.
- Hung RJ, McKay JD, Gaborieau V, Boffetta P, Hashibe M, Zaridze D, et al. A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature. 2008;452:633–637. doi: 10.1038/nature06885.
- Jin G, Ma H, Wu C, Dai J, Zhang R, Shi Y, et al. Genetic variants at 6p21.1 and 7p15.3 are associated with risk of multiple cancers in Han Chinese. The American Journal of Human Genetics. 2012;91:928–934. doi: 10.1016/j.ajhg.2012.09.009.
- Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Human Heredity. 2007;63:111–119. doi: 10.1159/000099183.
- Lan Q, Hsiung CA, Matsuo K, Hong Y-C, Seow A, Wang Z, et al. Genome-wide association analysis identifies new lung cancer susceptibility loci in never-smoking women in Asia. Nature Genetics. 2012;44:1330–1335. doi: 10.1038/ng.2456.
- Landi MT, Chatterjee N, Yu K, Goldin LR, Goldstein AM, Rotunno M, et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. The American Journal of Human Genetics. 2009;85:679–691. doi: 10.1016/j.ajhg.2009.09.012.
- Landi MT, Consonni D, Rotunno M, Bergen AW, Goldstein AM, Lubin JH, et al. Environment And Genetics in Lung cancer Etiology (EAGLE) study: An integrative population-based case-control study of lung cancer. BMC Public Health. 2008;8:203. doi: 10.1186/1471-2458-8-203.
- Leadbetter MR, Lindgren G, Rootzen H. Extremes and related properties of random sequences and processes. Springer, NY: 1983.
- McKay JD, Hung RJ, Gaborieau V, Boffetta P, Chabrier A, Byrnes G, et al. Lung cancer susceptibility locus at 5p15.33. Nature Genetics. 2008;40:1404–1406. doi: 10.1038/ng.254.
- Perry JR, Voight BF, Yengo L, Amin N, Dupuis J, Ganser M, et al. Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases. PLoS Genetics. 2012;8:e1002741. doi: 10.1371/journal.pgen.1002741.
- Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine. 1994;13:153–162. doi: 10.1002/sim.4780130206.
- Pirinen M, Donnelly P, Spencer CC. Including known covariates can reduce power to detect genetic effects in case-control studies. Nature Genetics. 2012;44:848–851. doi: 10.1038/ng.2346.
- Rimm EB, Giovannucci EL, Willett WC, Colditz GA, Ascherio A, Rosner B, et al. Prospective study of alcohol consumption and risk of coronary disease in men. The Lancet. 1991;338:464–468. doi: 10.1016/0140-6736(91)90542-w.
- Rothman KJ, Greenland S, Walker AM. Concepts of interaction. American Journal of Epidemiology. 1980;112:467–470. doi: 10.1093/oxfordjournals.aje.a113015.
- Saccone NL, Culverhouse RC, Schwantes-An T-H, Cannon DS, Chen X, Cichon S, et al. Multiple independent loci at chromosome 15q25.1 affect smoking quantity: a meta-analysis and comparison with lung cancer and COPD. PLoS Genetics. 2010;6:e1001053. doi: 10.1371/journal.pgen.1001053.
- Wang Y, Broderick P, Webb E, Wu X, Vijayakrishnan J, Matakidou A, et al. Common 5p15.33 and 6p21.33 variants influence lung cancer risk. Nature Genetics. 2008;40:1407–1409. doi: 10.1038/ng.273.
- Zaitlen N, Lindström S, Pasaniuc B, Cornelis M, Genovese G, Pollack S, et al. Informed conditioning on clinical covariates increases power in case-control association studies. PLoS Genetics. 2012;8:e1003032. doi: 10.1371/journal.pgen.1003032.