Abstract
Identification of gene-environment interaction (GxE) is important in understanding the etiology of complex diseases. However, partially due to the lack of power, there have been very few replicated GxE findings compared to the success in marginal association studies. The existing GxE testing methods mainly focus on improving the power for individual markers. In this paper, we took a different strategy and proposed a Set Based gene EnviRonment InterAction test (SBERIA), which can improve the power by reducing the multiple testing burdens and aggregating signals within a set. The major challenge of the signal aggregation within a set is how to tell signals from noise and how to determine the direction of the signals. SBERIA takes advantage of the established correlation screening for GxE to guide the aggregation of genotypes within a marker set. The correlation screening has been shown to be an efficient way of selecting potential GxE candidate SNPs in case-control studies for complex diseases. Importantly, the correlation screening in case-control combined samples is independent of the interaction test. With this desirable feature, SBERIA maintains the correct type I error level and can be easily implemented in a regular logistic regression setting. We showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants. We also applied SBERIA to real GWAS data of 10,729 colorectal cancer cases and 13,328 controls and found evidence of interaction between the set of known colorectal cancer susceptibility loci and smoking.
Keywords: gene-environment interaction, set based, correlation screening, GWAS, rare variants
Introduction
Both genetic (G) and environmental (E) factors impact common complex diseases, such as cancer, diabetes or cardiovascular diseases. For most of these diseases, several environmental factors and a rapidly increasing number of genetic factors have been identified [Hindorff et al. 2009]. However, little is understood about the interplay between G and E. Some exceptions include observed interactions between smoking and the GSTM1 deletion and a tag SNP in NAT2 in bladder cancer [García-Closas et al. 2005; Rothman et al. 2010], ADH7 variants and alcohol consumption in upper aerodigestive cancers [Hashibe et al. 2008], and GRIN2A variants and coffee consumption in Parkinson’s disease [Hamza et al. 2011].
While measurement error and data harmonization issues across studies for the environmental factors may have contributed to the limited numbers of confirmed gene-environment interactions (GxE), probably more importantly, the statistical power to detect an interaction is much smaller compared to detecting a main effect. In fact, it has been shown that the detection of an interaction needs at least approximately four times as many subjects as are needed to detect a main genetic effect of comparable effect size [Smith and Day 1984]. A number of methods have been proposed to enhance the power of detecting GxE, including the case-only test [Piegorsch, Weinberg and Taylor 1994; Chatterjee and Carroll 2005], the empirical Bayes method [Mukherjee and Chatterjee 2008], and the Bayesian Model Averaging method [Li and Conti 2009]. Two types of screening methods have also been proposed to reduce the multiple testing burden in genome-wide GxE search: the correlation based screening [Murcray, Lewinger and Gauderman 2009] and the marginal association based screening [Kooperberg and Leblanc 2008]. Toward this end, several recent methods were developed to combine and take advantage of different screening and testing techniques, such as the hybrid method by Murcray et al. (2011) [Murcray et al. 2011] and the cocktail method by Hsu et al. (2012) [Hsu et al. 2012].
The abovementioned efforts focus on improving the power of detecting GxE for individual markers. On the other hand, the set based association testing has attracted increasing interest. A set based method can not only enhance the power by aggregating multiple signals in the same set, but also greatly reduce the number of tests to be performed and thus reduce the multiple testing burden. Most of the existing set based methods are for detecting genetic main effects, which means testing the association between a set of SNPs and a phenotype. Tzeng et al. (2011) [Tzeng et al. 2011] provided a nice summary of those methods, which include burden tests that compute a weighted sum of genotypes across markers [Wang and Elston 2007; Gauderman et al. 2007; Wang and Abbott 2008; Li et al. 2009], methods that exploit the pair-wise genetic similarity among samples [Tzeng et al. 2003, 2009; Beckmann et al. 2005; Schaid et al. 2005; Wessel and Schork 2006; Dempfle et al. 2007; Wei et al. 2008; Mukhopadhyay et al. 2010], variance component methods [Goeman et al. 2004; Tzeng and Zhang 2007; Kwee et al. 2008; Wu et al. 2010; Schaid 2010; Neale et al. 2011], a method that combines p-values within a gene [Liu et al. 2010], group additive regression [Luan and Li 2008], Tukey’s model [Chatterjee et al. 2006], and an entropy-based method [Zhao, Boerwinkle and Xiong 2005]. Set based methods have drawn more attention in sequencing studies because of the rarity of the variants; for example, several variations of the burden tests [Morgenthaler and Thilly 2007; Li and Leal 2008, 2009; Madsen and Browning 2009; Han and Pan 2010; Morris and Zeggini 2010; Price et al. 2010] and variance component tests [Neale et al. 2011; Wu et al. 2011] have been proposed for sequencing data. In contrast, few methods have been proposed for set based GxE tests. Tzeng et al. (2011) developed a method to test for interaction between a set of markers and an environment variable by extending the set based genetic similarity method to the GxE setting [Tzeng et al. 2011]. As there is no competing method, they compared the new method with the benchmark minimum p-value method and their method showed favorable performance. However, their method was designed for a continuous outcome and cannot be applied to a case-control study for complex diseases.
A natural approach to developing a set based GxE test is directly extending the set based main effect test by treating the interaction term (usually the product of G and E) as a new genetic variable. For example, the existing burden test computes the (un)weighted sum of the genotypes (minor allele counts) across SNPs in the set and tests whether the sum is associated with the phenotype. A simple extension of burden tests to the GxE setting would be to sum the interaction terms (products) of G and E instead of summing over the G’s alone. However, this kind of approach has several disadvantages. First, assumptions that are reasonable for main effects may not be reasonable for GxE. For example, the power of burden tests for rare variants depends on the assumption that most rare missense variants are deleterious, but it is not reasonable to assume that all GxE effects have the same direction. In addition, this simple extension fails to exploit some unique characteristics of GxE. For instance, one major difficulty in the set based main effect test is the lack of prior information on which SNPs are null and on the directions of the effects. In contrast, this valuable information can be partially obtained for interaction effects from established screening statistics for GxE tests.
To overcome the aforementioned drawbacks, we proposed a novel Set Based gene EnviRonment InterAction (SBERIA) test for case-control studies. The proposed method uses the correlation between the environmental variable and the SNPs in a set as a guide to aggregate the genotypes. The aggregated genotype is then used to test for interaction in a regular logistic regression model. SBERIA is easy to implement and efficient in computation. It can be applied to both common and rare variants. We demonstrate through simulation that our proposed method is more powerful compared to the benchmark methods under a wide range of scenarios, including both GWAS and rare variant settings. We also applied SBERIA to real GWAS data and found evidence of interaction between the set of previously identified colorectal cancer susceptibility loci and smoking.
Material and Methods
Notations and Models
Suppose there are N subjects and the disease status is denoted by Di (=0 or 1) for subject i, i=1, …, N. Assume Ei is the environmental variable, Xi = (Xi1, …, Xiq) is a vector of q potential confounder covariates, and Gi = (Gi1, …, Gip) is a vector of p genetic markers. The interaction model between the set of p markers and the environmental variable is:
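$$\operatorname{logit}\{P(D_i = 1 \mid E_i, G_i, X_i)\} = \alpha_0 + \alpha_1 E_i + G_i \alpha_2 + X_i \alpha_3 + E_i G_i \beta \qquad (1)$$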
where logit() is the logit link function; α0 is the intercept; α1 is the coefficient for the main effect of Ei; α2 is the p×1 vector of coefficients for Gi; α3 is the q×1 vector of coefficients for Xi; EiGi = (EiGi1, …, EiGip); β = (β1, …, βp)T is the p×1 vector of interaction coefficients. The null hypothesis for interaction effects is H0 : β = 0.
Two benchmark methods
A typical method of testing H0 : β = 0 is the likelihood ratio test, which compares the likelihood of model (1) with and without the interaction terms and then tests the hypothesis with a chi-square test with p degrees of freedom (DF). We will refer to this test as the LR test in the rest of the paper. A problem of the LR test in this case is that a relatively large number of markers or high LD among markers could result in numerical instability, leading to inflated type I error, which we will show in the simulation.
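For reference, the LR statistic described above can be written as follows. The symbols ℓ_full and ℓ_null are names introduced here for illustration; they denote the maximized log-likelihoods of model (1) with and without the interaction terms.

```latex
\Lambda = 2\left\{ \ell_{\mathrm{full}}(\hat{\alpha}, \hat{\beta}) - \ell_{\mathrm{null}}(\tilde{\alpha}) \right\}
\;\xrightarrow{\,d\,}\; \chi^2_p
\quad \text{under } H_0 : \beta = 0 .
```

The chi-square approximation is asymptotic, which is consistent with the numerical instability noted above when p is large relative to the information in the sample.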
Another commonly used method is the so called minimum p-value (min-p) method. The min-p method tests interaction for each marker j in the set individually with the following model:
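$$\operatorname{logit}\{P(D_i = 1 \mid E_i, G_{ij}, X_i)\} = \alpha_0 + \alpha_1 E_i + \alpha_{2j} G_{ij} + X_i \alpha_3 + \beta_j E_i G_{ij} \qquad (2)$$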
and the hypothesis to be tested is H0 : βj = 0, for j=1 to p. From the p interaction p-values for the p SNPs, the min-p method selects the smallest p-value and corrects it for multiple comparisons using permutation or by estimating the effective number of DF [Gao, Starmer and Martin 2008; Moskvina and Schmidt 2008]. In our simulation, we use 10,000 permutations to determine the corrected p-value for the min-p method. The min-p method avoids the problem of a potentially large number of predictors in the LR test by modeling each marker individually instead of jointly. However, the min-p method is not efficient in situations where causal SNPs are in LD with multiple SNPs or when multiple independent signals exist in the set, as it only considers the minimum p-value.
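The permutation correction for the min-p method can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name, the toy p-values, and the add-one correction are assumptions, and the per-marker p-values from each permuted data set are taken as given because the text does not specify how the permutations are generated.

```python
def minp_permutation_pvalue(observed_pvalues, permuted_pvalues):
    """Permutation-corrected p-value for the minimum p-value statistic.

    observed_pvalues: per-marker interaction p-values from the observed data.
    permuted_pvalues: one list of per-marker p-values per permuted data set.
    """
    observed_min = min(observed_pvalues)
    # Null distribution of the smallest p-value across the marker set.
    null_mins = [min(row) for row in permuted_pvalues]
    n_as_extreme = sum(1 for m in null_mins if m <= observed_min)
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (n_as_extreme + 1) / (len(null_mins) + 1)


# Toy numbers for illustration only.
observed = [0.20, 0.004, 0.51]
permuted = [[0.30, 0.80, 0.45], [0.002, 0.60, 0.90], [0.15, 0.07, 0.33]]
corrected = minp_permutation_pvalue(observed, permuted)
```

With 10,000 permutations, `permuted` would hold 10,000 rows, each requiring p model fits, which is where the computational cost of the min-p method comes from.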
The SBERIA method
The main motivation for performing a set based analysis is that aggregating signals of markers can potentially boost the power. However, as described in the Introduction, one difficulty in the signal aggregation is how to tell signals from noise and how to determine the direction of the signals. In the set based main effect tests, there have been several attempts to solve this issue. Han and Pan (2010) used the signs of the marginal effect to determine the direction of the main effect [Han and Pan 2010]. Lin and Tang (2011) used the corresponding regression coefficient plus a constant as the weight for each marker [Lin and Tang 2011]. Cai et al. (2012) proposed to weight each marker based on the z-score of its effect [Cai, Lin and Carroll 2012]. One common characteristic of these methods is that the statistics used to weight the markers are not independent of the main effect test. Hence, permutation is needed to estimate the null distribution and maintain the correct type I error, which is computationally intensive. Fortunately for GxE, there are screening statistics that are informative for weighting the markers but still independent of the interaction test. Therefore it would be appealing to take advantage of this desirable feature of the GxE test.
Correlation screening has been established as an efficient screening tool for the GxE test [Murcray et al. 2009]. Let’s consider the following simple example to see the rationale of the correlation screening. Suppose there is a rare disease D, an environmental variable E (=0 or 1), and a genetic variable G (=0 or 1). G and E are assumed to be independent in controls (and because of the rarity of the disease, also approximately independent in the general population). Assume there is a positive interaction between E and G such that the disease risk would only increase when both E=1 and G=1. Then we expect to see more E=1&G=1 combinations in the cases, which means G and E will be positively correlated in the cases. As G and E are independent in controls, they will also be positively correlated in the combined case-control samples. On the other hand, if E and G impact D independently without interaction, it can be shown (Supporting information) that E and G are approximately independent when the disease is rare. From this simple example, we can see that the correlation between G and E in combined case-control samples can be useful as a screening statistic for interaction between G and E. In addition, the direction of the correlation can inform the direction of the interaction. More importantly, as the correlation screening is conducted on the combined case-control samples and does not use the phenotype information, it has been shown both by Murcray et al. [Murcray et al. 2009] and Dai et al. [Dai et al. 2012] that the correlation screening in combined case-control samples is asymptotically independent of the GxE test, no matter whether G and E are independent or not. This motivates us to propose the following method.
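The rationale in the simple example above can be checked numerically. The sketch below is illustrative only: it assumes a logistic risk model with hypothetical coefficients (intercept −5, main effects log(1.5)) rather than the exact example in the text, and it assumes binary G and E that are independent in the population. Under that independence the marginal frequencies cancel, so the G-E odds ratio among cases depends only on the four disease risks.

```python
import math


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def ge_odds_ratio_in_cases(intercept, b_g, b_e, b_ge):
    """G-E odds ratio among cases when G and E are independent in the population.

    The marginal frequencies of G and E cancel out of the odds ratio, so only
    the four disease risks P(D=1 | G=g, E=e) are needed.
    """
    risk = {
        (g, e): expit(intercept + b_g * g + b_e * e + b_ge * g * e)
        for g in (0, 1)
        for e in (0, 1)
    }
    return (risk[1, 1] * risk[0, 0]) / (risk[1, 0] * risk[0, 1])


# Hypothetical coefficients: rare disease, main effects only vs. added interaction.
or_no_interaction = ge_odds_ratio_in_cases(-5.0, math.log(1.5), math.log(1.5), 0.0)
or_interaction = ge_odds_ratio_in_cases(-5.0, math.log(1.5), math.log(1.5), math.log(1.5))
```

Without interaction the odds ratio in cases stays very close to 1, while an interaction odds ratio of 1.5 induces a G-E odds ratio in cases close to 1.5, in the same direction as the interaction.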
We first compute the correlation between Ei and Gij (j=1 to p) in (1) by fitting either a logistic regression (when Ei is binary) or a linear regression (when Ei is continuous) with Ei as the response and Gij as the predictor. For each SNP j (j=1 to p), this gives a Z-score Zj for the correlation between Ei and Gij. We then fit the following logistic regression:
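$$\operatorname{logit}\{P(D_i = 1 \mid E_i, G_i, X_i)\} = \alpha_0 + \alpha_1 E_i + G_i \alpha_2 + X_i \alpha_3 + \rho\, E_i G_i \hat{w} \qquad (3)$$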
where ŵ = (ŵ1, …, ŵp)T is the weight vector and ŵj = I(|Zj| > θN) sign(Zj) + ε. I(x) is an indicator function which equals 0 when x is false and 1 when x is true. sign(x) = 1 when x>0, −1 when x<0 and 0 when x=0. θN = o(N^(1/2)) and ε are pre-specified positive constants. The hypothesis of interest is H0 : ρ = 0.
As we can see, EiGiŵ is the weighted sum of the interaction terms and the weight, which can be 1, −1 or 0 (if we ignore ε), is determined by the correlation Z-score Zj. |Zj| measures the strength of the correlation signal, so I(|Zj| > θN) only selects markers showing correlation signals that are greater than a threshold. We require θN = o(N^(1/2)) because we expect I(|Zj| > θN) to converge to 0 as N → ∞ when there is no correlation between G and E in the combined sample and to converge to 1 when there is correlation. For the selected markers (markers with I(|Zj| > θN) = 1), the direction of the interaction term is determined by the direction of the correlation (sign(Zj)). This is inspired by the observation that the directions of interaction and correlation tend to agree in the simple example above. The addition of a constant ε ensures that a weight will be assigned if no marker is selected.
θN and ε need to be specified for SBERIA. In practice, we found through simulation that the power of SBERIA did not change substantially as θN changes for a given N between 2,000 and 20,000 (results not shown). Hence in this paper, we set θN to a constant such that Prob(|Zj| > θN) ≈ 0.1 under the null. ε is set to a very small value (0.0001) so that it does not affect the weight if I(|Zj| > θN)sign(Zj) is not 0.
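As a small worked example, the constant satisfying Prob(|Zj| > θN) ≈ 0.1 can be computed under the assumption that Zj is approximately standard normal under the null (an assumption of this sketch; the function name is also illustrative).

```python
from statistics import NormalDist


def screening_threshold(selection_prob=0.1):
    """Two-sided cutoff theta with P(|Z| > theta) = selection_prob for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - selection_prob / 2.0)


theta = screening_threshold(0.1)
```

Under this assumption the threshold is roughly 1.645, so about one in ten null markers would pass the screening step.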
In summary, SBERIA first selects markers whose correlation signal strength is greater than a threshold. For the selected markers, we compute a weighted sum of their interaction terms, where the weight=1 if the corresponding correlation is positive and −1 otherwise. As the correlation statistic is independent of the interaction test, regular logistic regression can be used to test the hypothesis without requiring permutation. The validity of our method is proved in the Supporting information. We also conduct extensive simulation to evaluate the type I error rate and power of SBERIA.
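The weighting and aggregation steps summarized above can be sketched as follows. This is a minimal illustration with hypothetical function names and toy inputs; it covers only the construction of the aggregated interaction covariate, which would then enter a standard logistic regression as in (3).

```python
def sberia_weights(z_scores, theta, eps=0.0001):
    """Weights w_j = I(|Z_j| > theta) * sign(Z_j) + eps."""
    weights = []
    for z in z_scores:
        selected = abs(z) > theta
        sign = (z > 0) - (z < 0)  # 1, -1 or 0
        weights.append((sign if selected else 0) + eps)
    return weights


def aggregated_interaction(e_i, genotypes_i, weights):
    """Single covariate E_i * sum_j(w_j * G_ij) for one subject."""
    return e_i * sum(w * g for w, g in zip(weights, genotypes_i))


# Toy Z-scores for three SNPs; only the first two pass the threshold.
w = sberia_weights([2.1, -1.9, 0.3], theta=1.645)
term = aggregated_interaction(1, [2, 1, 0], w)
```

Because the weights depend only on the G-E correlation in the combined sample, they are computed once per marker set and no permutation is needed afterwards.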
Simulation
To evaluate the performance of SBERIA, we conducted extensive simulation under various settings.
Set based GxE in GWAS settings
1. A Gene-based marker set
We mimicked the real GWAS data by generating a set of markers based on the realistic LD structure within the SMAD7 gene. SMAD7, short for SMAD family member 7, is a gene located at 18q21.1. It is known to interact with the TGF-beta receptor and several SNPs in this region have been found to associate with colorectal cancer risk [Broderick et al. 2007; Tenesa et al. 2008; Tomlinson et al. 2008; Peters et al. 2011]. SMAD7 spans from 44,700k bp to 44,731k bp and has 48 SNPs from Hapmap II release 24 [The International HapMap Project. 2003], which is close to the median number (=43) of SNPs per gene [Huang et al. 2011]. Out of the 48 SNPs, 21 were genotyped in Illumina Human1M. We extracted the haplotypes of the 21 SNPs from the phased Hapmap data and randomly paired haplotypes such that the simulated marker set maintains the same LD structure as the 21 SNPs in the Hapmap. The LD structure of the 21 SNPs is shown in Supplemental Figure 1. We chose two SNPs, rs4939827 and rs7351039, from the 21 SNPs and made them the hidden causal SNPs in the simulation. The two SNPs were chosen such that one is common (rs4939827, MAF=0.49) and one is less common (rs7351039, MAF=0.08). The two chosen SNPs are not in LD with each other and both SNPs were tagged by some other SNPs. The other 19 SNPs were considered as the marker set in the simulation.
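The random pairing of phased haplotypes described above can be sketched as below. The haplotype pool is a toy example invented for illustration (it is not HapMap data), and sampling with replacement is an assumption of this sketch.

```python
import random


def pair_haplotypes(haplotypes, n_subjects, rng):
    """Draw two haplotypes per subject (with replacement) and add them.

    haplotypes: list of 0/1 allele vectors, all of the same length.
    Returns one genotype vector (values 0, 1 or 2) per subject.
    """
    genotypes = []
    for _ in range(n_subjects):
        h1 = rng.choice(haplotypes)
        h2 = rng.choice(haplotypes)
        genotypes.append([a + b for a, b in zip(h1, h2)])
    return genotypes


# Toy haplotype pool (hypothetical): the first two SNPs always co-occur.
toy_pool = [[0, 0, 1], [1, 1, 0], [1, 1, 1], [0, 0, 0]]
sim = pair_haplotypes(toy_pool, n_subjects=5, rng=random.Random(1))
```

Because whole haplotypes are sampled, alleles that travel together in the pool stay together in the simulated genotypes, which is how the LD structure is preserved.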
The disease status was generated based on the following model:
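$$\operatorname{logit}\{P(D_i = 1)\} = \alpha_0 + \alpha_1 G_{i1} + \alpha_2 G_{i2} + \gamma E_i + \beta_1 E_i G_{i1} + \beta_2 E_i G_{i2} \qquad (4)$$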
where α0 = exp(−5), representing a relatively rare disease. Gi1 and Gi2 are the simulated genotypes (=0, 1, or 2) for rs4939827 and rs7351039, respectively. Ei is the environmental variable. We tried two ways of generating Ei: 1) Ei is continuous: Ei ~ N(0,1); 2) Ei is binary: Ei ~ Bernoulli(p = 0.3).
Type I error
To evaluate the type I error rate, we set β1 = β2 = 0 in (4). We let α1 = α2 = 0 or log(1.5). As described above, we used the two ways of generating Ei. For each simulation scenario, we randomly generated 1,000 cases and 1,000 controls. Then we performed the set based GxE tests using the LR test, the min-p method and SBERIA. The procedure was repeated 2,000 times to estimate the type I error rate with significance level 0.05.
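With 2,000 replicates, the precision of an estimated type I error rate can be gauged with a normal-approximation interval. The sketch below is an illustration under assumptions: the text does not state which interval was used, and the rejection count of 87 is a hypothetical number.

```python
import math


def wald_interval(n_rejections, n_replicates, z=1.96):
    """Estimated rejection rate with a normal-approximation 95% interval."""
    p_hat = n_rejections / n_replicates
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_replicates)
    return p_hat, p_hat - half_width, p_hat + half_width


# Hypothetical count: 87 rejections at level 0.05 out of 2,000 replicates.
rate, lower, upper = wald_interval(n_rejections=87, n_replicates=2000)
```

At a true rate of 0.05, the half-width with 2,000 replicates is a little under 0.01, so estimates between about 0.04 and 0.06 are compatible with a correctly sized test.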
Power
To evaluate the power, we set β1 = log(1.05), log(1.10), log(1.15), log(1.20), log(1.25), or log(1.3) when Ei is continuous and β1 = log(1.1), log(1.2), log(1.3), log(1.4), log(1.5) or log(1.6) when Ei is binary. The values of β1 were chosen such that the power was in a reasonable range. For each value of β1, β2 can take three values, β1, −β1, or 0, which represent situations where the two signals are in the same direction, in opposite directions, or when there is only one signal, respectively. The main effects α1 and α2 were set to 0. We also tried other values for the main effects and the results were quantitatively similar. As above, we randomly generated 1,000 cases and 1,000 controls. We evaluated the power of SBERIA, the min-p method and the LR test. Each parameter setting for the simulation was repeated 2,000 times and we used significance level 0.05.
2. A set of independent markers
In the simulation above, the 21 SNPs in the set were not independent of each other as they were generated based on the LD structure in the SMAD7 gene. In addition to grouping SNPs by genes, there are other ways of forming a marker set in practice. For example, it is common practice to pull together previously identified susceptibility loci for a given trait and study them as a set. To mimic this situation, we generated 20 independent SNPs. For each SNP, its MAF was generated from a uniform distribution U(0.1, 0.5) and its genotypes were generated under Hardy-Weinberg equilibrium. We randomly chose two SNPs as potential causal SNPs for GxE. The disease status was generated based on the following model:
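$$\operatorname{logit}\{P(D_i = 1)\} = \alpha_0 + \sum_{j=1}^{20} \alpha_j G_{ij} + \gamma E_i + \beta_1 E_i G_{i1} + \beta_2 E_i G_{i2} \qquad (5)$$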
where Gi1 and Gi2 are the genotypes for the two chosen causal GxE SNPs. The main effects αj’s were generated from U(log(1.05), log(1.5)).
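Generating the independent SNPs described in this scenario can be sketched as follows. The function name and seed are illustrative, and the sketch draws population genotypes only; sampling cases and controls from the disease model is a separate step not shown here.

```python
import random


def simulate_independent_snps(n_subjects, n_snps, rng, maf_low=0.1, maf_high=0.5):
    """Genotypes for independent SNPs under Hardy-Weinberg equilibrium."""
    mafs = [rng.uniform(maf_low, maf_high) for _ in range(n_snps)]
    genotypes = [
        # Two independent allele draws per SNP give a Binomial(2, MAF) count.
        [(rng.random() < maf) + (rng.random() < maf) for maf in mafs]
        for _ in range(n_subjects)
    ]
    return mafs, genotypes


mafs, geno = simulate_independent_snps(n_subjects=200, n_snps=20, rng=random.Random(7))
```

Each row of `geno` holds one subject's minor allele counts for the 20 SNPs, and each column uses its own MAF drawn from U(0.1, 0.5).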
A widely adopted way of summarizing information from previously identified susceptibility loci is to calculate the genetic risk score (GRS), which is the sum of risk alleles from all SNPs. Hence in this simulation scenario, we also performed the set based GxE test by computing the GRS and testing the interaction between the GRS and E using a regular logistic regression. The same parameters and procedures as in the first simulation scenario were used to evaluate type I error and power for SBERIA, the min-p method, the LR test, and the GRS method.
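The GRS and its interaction covariate can be sketched in a few lines. The function names are illustrative, and the genotypes are assumed to be already coded as counts of the risk allele.

```python
def genetic_risk_score(risk_allele_counts):
    """Unweighted GRS: total number of risk alleles across the SNP set."""
    return sum(risk_allele_counts)


def grs_interaction_term(risk_allele_counts, e_i):
    """Product term GRS * E whose coefficient is tested in the logistic model."""
    return genetic_risk_score(risk_allele_counts) * e_i


grs = genetic_risk_score([0, 1, 2, 1])
term = grs_interaction_term([0, 1, 2, 1], e_i=1)
```

In contrast to the SBERIA weights, every SNP enters the GRS with weight 1, so null SNPs and interactions in opposite directions are not distinguished.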
3. Correlated G and E
G and E were assumed to be independent in the simulation so far, which is a reasonable assumption in real applications [Cornelis et al. 2012]. However, in rare situations, G and E can be correlated in the general population. As shown in Murcray et al. (2011) [Murcray et al. 2011] and Hsu et al. (2012) [Hsu et al. 2012], the correlation screening is not efficient when G and E are negatively correlated. Hence, they proposed to use some combinations of correlation screening and marginal screening (which uses the marginal association test of each SNP as a screening for the interaction test). In the current simulation scenario, we also tried a simple modification to SBERIA that combines correlation and marginal screenings in a way similar to Gauderman et al. (2012) [Gauderman, Zhang and Lewinger 2012]. Specifically, instead of using ŵj = I(|Zj| > θN) sign(Zj) + ε in (3), we define
| (6) |
where and M j is the wald statistic of the marginal association for marker j (j=1 to p); C j = Z j if else C j = M j. τ N is also defined such that Prob(S j >τ N)=0.1 under the null.
The same settings were used as the first simulation scenario when β 1 = β 2, except that Ei was generated to be correlated with G. We considered two scenarios:
1) E is correlated with the two causal SNPs. In this setting, Ei is either positively correlated with Gi1 and Gi2: logit(Ei) = logit(0.3) + b1Gi1 + b2Gi2, where b1 = b2 = log(1.2), or Ei is negatively correlated with Gi1 and positively correlated with Gi2 (b1 = −b2 = −log(1.2)).
2) E is correlated with two randomly selected null SNPs. As above, Ei can be positively or negatively correlated with the two null SNPs.
The same procedures as before were used to evaluate the type I error and power of SBERIA, the min-p method, the LR test, and the modification to SBERIA.
Set based GxE in rare variant setting
We also conducted simulations to evaluate the performance of SBERIA if the variants in the marker set are less common, as in sequencing data. In the simulation experiment, we followed the simulation set-up proposed in Lin and Tang (2011) to generate the genotypes for rare variants [Lin and Tang 2011]. Specifically, we generated 10 variants Gij (j=1 to 10) with MAF=0.005*j under Hardy-Weinberg equilibrium. As it is less likely for rare variants to correlate with the environmental variable, we generated Ei either as a continuous variable from N(0,1) or a binary environmental variable Ei from Bernoulli(0.3).
The disease status was generated from the following model
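$$\operatorname{logit}\{P(D_i = 1)\} = \alpha_0 + \sum_{j=1}^{10} \alpha_j G_{ij} + \gamma E_i + \sum_{j=1}^{10} \beta_j E_i G_{ij} \qquad (7)$$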
where α0 is set to exp(−5) and γ is set to log(1.2), as in the GWAS simulation. As there is no competing set based GxE method in the rare variant setting, in addition to the min-p and LR tests, we compared SBERIA with the simple extension of the burden test described in the Introduction. We will denote this method as burden GxE. Specifically, burden GxE creates a new variable, the sum of Gij over j = 1 to 10, which is the total number of minor alleles across the 10 rare variants.
It then tests the interaction by fitting the following model and testing H0 : λ = 0:
| (8) |
Type I error
To evaluate the type I error rate, the coefficients βj’s (j=1 to 10) were set to 0. We randomly generated the αj’s from a uniform distribution U(log(1.2), log(3)). As before, we randomly sampled 1,000 cases and 1,000 controls. The procedures were replicated 2,000 times to estimate the type I error rate for SBERIA, min-p, the LR test and burden GxE with significance level 0.05.
Power
To evaluate the power, we first randomly selected m (=8, 5, or 2) markers as the causal variants from the 10 variants. Then we randomly generated the effect size of the selected variants from U (log(1.2)c, log(3)c). As the sample size of our simulation is only 2,000, c was chosen to be 1.5 such that the power was in a reasonable range. In practice, the sample size should be much larger to study rare variants. Generated in this way, all effects are positive, which may not be realistic in the GxE setting. Hence, we randomly set the direction of the interaction effect for a subset of causal SNPs to negative (proportion = 0.2, 0.4 or 0.5). 1,000 cases and 1,000 controls were generated and the power was estimated from 2,000 replications with significance level 0.05. We only present the results from the binary Ei’s because the results from continuous Ei’s were similar.
A real data application
To evaluate the performance of SBERIA in a real application, we applied SBERIA to the GWAS data of the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO). Specifically, GECCO included the following nested case-control studies in prospective US cohorts: Health Professionals Follow-up Study (HPFS); Multiethnic Cohort Study (MEC); Nurses’ Health Study (NHS); Physicians’ Health Study (PHS); Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO); VITamins And Lifestyle (VITAL); Women’s Health Initiative (WHI); and the following case-control studies from the US, Canada and Europe: Colorectal Cancer Studies 2&3 (Colo2&3); Darmkrebs: Chancen der Verhuetung durch Screening (DACHS); Diet, Activity and Lifestyle Survey (DALS); Ontario Familial Colorectal Cancer Registry (OFCCR); and Postmenopausal Hormone study-Colon Cancer Family Registry (PMH-CCFR). Numbers of cases and controls, age, and sex distributions are listed in Supplemental Table 1. Study-specific descriptions, including eligibility and matching criteria, are available in Peters et al. (2012) [Peters et al. 2012]. Colorectal cancer (CRC) cases were defined as colorectal adenocarcinoma and confirmed by medical records, pathology reports, or death certificates. Colorectal adenoma cases were confirmed by medical records, histopathology, or pathologic reports. Controls for adenoma cases had a negative colonoscopy (except for NHS and HPFS controls matched to cases with distal adenoma, which had either a negative sigmoidoscopy or a negative colonoscopy exam). All participants gave written informed consent and studies were approved by their respective Institutional Review Boards. Genotyping was done on various platforms and imputed to Hapmap II. Please see a detailed description of genotyping, quality control and imputation in GECCO in the supplemental material.
A number of loci have been identified to associate with CRC risk [Zanke et al. 2007; Tomlinson et al. 2007, 2008, 2011; Broderick et al. 2007; Jaeger et al. 2008; Tenesa et al. 2008; Houlston et al. 2008, 2010; Peters et al. 2011; Dunlop et al. 2012]. These CRC susceptibility loci are useful for genetic risk profiling and allow the stratification of population subgroups at different genetic risks [Lubbe et al. 2012]. To get a more comprehensive understanding of CRC risk, it is also of interest to explore possible interactions between the genetic risk factors and environmental variables. In this paper, we included the genotypes of 25 known CRC loci (Table 5) in GECCO and treated them as a marker set. We then tested for interaction between this marker set and smoking status (ever/never). Smoking status is a dichotomous variable harmonized across all studies. Please see the supplemental material for details of the harmonization procedure.
Table 5.
Previously identified CRC susceptibility loci.
| SNP | Ref^a | Chromosome | Count Allele | CAF^b | OR^c (95% CI^d) |
|---|---|---|---|---|---|
| rs6691170 | 1 | 1q41 | G | 0.63 | 0.94 (0.92–0.97) |
| rs6687758 | 1 | 1q41 | A | 0.80 | 0.92 (0.89–0.94) |
| rs10936599 | 1 | 3q26.2 | C | 0.77 | 1.08 (1.04–1.10) |
| rs1321311 | 14 | 6p21 | A | 0.25 | 1.10 (1.07–1.13) |
| rs16892766 | 6 | 8q23.3 | A | 0.92 | 0.80 (0.76–0.84) |
| rs10505477 | 3 | 8q24 | A | 0.50 | 1.17 (1.12–1.23) |
| rs6983267 | 2–5 | 8q24 | G | 0.50 | 1.21 (1.18–1.24) |
| rs7014346 | 7 | 8q24 | A | 0.36 | 1.19 (1.15–1.23) |
| rs719725 | 3,13 | 9p24 | A | 0.62 | 1.07 (1.03–1.12) |
| rs10795668 | 6 | 10p14 | A | 0.31 | 0.89 (0.86–0.91) |
| rs3824999 | 14 | 11q13.4 | G | 0.51 | 1.08 (1.05–1.10) |
| rs3802842 | 7 | 11q23 | A | 0.71 | 0.90 (0.87–0.93) |
| rs7136702 | 1 | 12q13.13 | C | 0.68 | 0.94 (0.93–0.96) |
| rs11169552 | 1 | 12q13.13 | C | 0.73 | 1.09 (1.05–1.11) |
| rs4444235 | 8,9 | 14q22.2 | C | 0.46 | 1.09 (1.06–1.12) |
| rs1957636 | 8 | 14q22.2 | C | 0.59 | 0.92 (0.90–0.95) |
| rs16969681 | 8 | 15q13 | C | 0.91 | 0.84 (0.80–0.90) |
| rs4779584 | 8,10 | 15q13 | C | 0.82 | 0.87 (0.84–0.91) |
| rs11632715 | 8 | 15q13 | A | 0.48 | 1.12 (1.08–1.16) |
| rs9929218 | 9 | 16q22.1 | A | 0.30 | 0.91 (0.89–0.94) |
| rs4939827 | 7,11 | 18q21 | C | 0.48 | 0.83 (0.81–0.86) |
| rs10411210 | 9 | 19q13.1 | C | 0.90 | 1.15 (1.10–1.20) |
| rs961253 | 8,9 | 20p12.3 | A | 0.36 | 1.12 (1.09–1.15) |
| rs4813802 | 8,12 | 20p12.3 | G | 0.34 | 1.09 (1.06–1.12) |
| rs4925386 | 1,12 | 20q13.33 | C | 0.69 | 1.08 (1.05–1.10) |
^a Ref = references for identifying the allele and for the ORs presented;
^b CAF = count allele frequency in populations of European descent;
^c OR = odds ratio;
^d CI = confidence interval;
Only the first reference’s OR of the SNPs with more than one reference is shown in the table. The same situation applies for the Studies in Previous Publications column.
References: 1. Houlston et al. Nature Genetics 2010; 2. Tomlinson et al. Nature Genetics 2007; 3. Zanke et al. Nature Genetics 2007; 4. Haiman et al. Nature Genetics 2007; 5. Hutter et al. BMC Cancer 2010; 6. Tomlinson et al. Nature Genetics 2008; 7. Tenesa et al. Nature Genetics 2008; 8. Tomlinson et al. Nature Genetics 2011; 9. COGENT Nature Genetics 2008; 10. Jaeger et al. Nature Genetics 2008; 11. Broderick et al. Nature Genetics 2007; 12. Peters et al. Human Genetics 2011; 13. Kocarnik et al. CEBP 2010. 14. Dunlop et al. Nature Genetics 2012;
Specifically, we created a pooled dataset of 10,729 cases and 13,328 controls for the 25 known CRC loci by combining the studies in GECCO. Each directly genotyped SNP was coded as 0, 1 or 2 copies of the variant allele. For imputed SNPs, we used the expected number of copies of the variant allele (the “dosage”). Both genotyped and imputed SNPs were treated as continuous variables (i.e. log-additive effects). We then applied SBERIA in (3) to the pooled dataset. The covariates X we adjusted for include age, sex, the first three principal components, study indicators and the interaction between principal components and study indicators. As a comparison, we also tried two possible benchmark methods. The first is the min-p method, which computes the interaction p-value for each of the 25 SNPs separately and selects the minimum p-value while correcting for multiple comparisons using the Bonferroni method. The second is to compute the GRS and test the interaction between the GRS and smoking status using a regular logistic regression.
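The Bonferroni-corrected min-p benchmark used for the real data amounts to the following calculation. The p-values in the example are hypothetical and are not GECCO results.

```python
def bonferroni_minp(pvalues):
    """Bonferroni-corrected minimum p-value for a marker set."""
    return min(1.0, len(pvalues) * min(pvalues))


# Hypothetical p-values for a 25-SNP set (not the GECCO results).
example_pvalues = [0.5] * 24 + [0.001]
corrected = bonferroni_minp(example_pvalues)
```

With 25 SNPs, a single-marker interaction p-value must fall below 0.002 for the set-level min-p test to reach the 0.05 level.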
Results
Set based GxE in GWAS settings
1. A Gene-based marker set
The estimated type I error rates for SBERIA, min-p and the LR test are summarized in Table 1. Both SBERIA and the min-p method maintain the correct type I error (0.05) in all settings. However, the LR test generally gives inflated type I error, which is a result of its numerical instability due to the relatively large number of variables. Figure 1 shows the power comparison results for this simulation scenario. When β1 = β2, SBERIA has better power than both min-p and the LR test. The average power gain of SBERIA over min-p is 13.9% with a range of −6% to 24.3% (excluding data points where the power of min-p is less than 0.1 to prevent numeric instability). Also as expected, SBERIA is still more powerful than the min-p method (average percent of power gain is 12.5% with a range from 5.2% to 19.5%) when the two causal SNPs have interaction effects in opposite directions (β1 = −β2), which demonstrates that the correlation screening is able to predict the direction of the interaction effect fairly well. Even with inflated type I error, the LR test only gives power that is close to or less than that of SBERIA. For the scenario where there is only one causal SNP (β2 = 0), SBERIA still performs better than the other two methods (average percent of power gain over min-p is 11.5% with a range from 0.4% to 18.2%). This could be attributed to the fact that SBERIA aggregates information from several SNPs in LD with the causal variant and thereby increases the power. The advantage of SBERIA is more apparent if one considers that the min-p method often requires time-consuming permutation to get the corrected p-value (otherwise the simple Bonferroni correction using the number of markers in the set would be too conservative).
Table 1.
Type I error rate (95% CI) for SBERIA, min-p, and LR test in simulation scenario 1 (A Gene-based marker set) in GWAS settings.
| | SBERIA | Min-p | LR test |
|---|---|---|---|
| Ei is continuous and independent of Gi1 and Gi2 | | | |
| α1 = α2 = 0 | 0.044 (0.035, 0.052) | 0.045 (0.036, 0.054) | 0.061 (0.051, 0.071) |
| α1 = α2 = log(1.5) | 0.045 (0.036, 0.054) | 0.042 (0.033, 0.05) | 0.062 (0.052, 0.073) |
| Ei is binary and independent of Gi1 and Gi2 | | | |
| α1 = α2 = 0 | 0.049 (0.04, 0.058) | 0.046 (0.037, 0.055) | 0.070 (0.059, 0.081) |
| α1 = α2 = log(1.5) | 0.042 (0.033, 0.051) | 0.046 (0.036, 0.055) | 0.063 (0.052, 0.074) |
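As a reading aid for the type I error tables, the sketch below computes a normal-approximation (Wald) 95% confidence interval for an empirical rejection rate and checks whether it covers the nominal 0.05 level. The interval method and the replicate count of 2,000 are assumptions made for illustration; neither is stated in this section.

```python
import math


def wald_ci(p_hat, n_replicates, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    if n_replicates <= 0:
        raise ValueError("n_replicates must be positive")
    se = math.sqrt(p_hat * (1.0 - p_hat) / n_replicates)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


def covers_nominal(p_hat, n_replicates, alpha=0.05):
    """True if the interval for the empirical rate contains the nominal level."""
    lower, upper = wald_ci(p_hat, n_replicates)
    return lower <= alpha <= upper


# Hypothetical replicate count; the two rates are taken from the first
# row of Table 1 (SBERIA and the LR test).
n_sim = 2000
sberia_ok = covers_nominal(0.044, n_sim)
lr_ok = covers_nominal(0.061, n_sim)
```

Under these assumptions the half-width of the interval near 0.05 is a little under 0.01, which is similar in size to the intervals reported in the tables.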
Figure 1.
Power comparison between SBERIA, the min-p method, and the LR test in simulation scenario 1 (A Gene-based marker set) of GWAS settings. The three plots on the left are results when Ei was generated as a continuous variable and the plots on the right are for binary Ei. The top plots are for simulation scenarios where β1 = β2; the plots in the middle are for scenarios where β1 = −β2; the bottom plots are for scenarios where β2 = 0.
2. A set of independent markers
The type I error rates for this simulation scenario are summarized in Table 2. All methods except the LR test maintain the correct type I error. Figure 2 shows that SBERIA almost always gives the best power. When β1 = β2, the average percent of power gain of SBERIA over min-p is 24.8% with a range from 8.2% to 49.0%; when β1 = −β2, the average percent of power gain of SBERIA over min-p is 15.9% with a range from 3.9% to 27.6%; when β2 = 0, the average percent of power gain of SBERIA over min-p is 10.7% with a range from −8.6% to 25.2%. The GRS method always gives the lowest power.
Table 2.
Type I error rate (95% CI) for SBERIA, min-p, LR test and GRS method in simulation scenario 2 (A set of independent markers) in GWAS settings.
| | SBERIA | Min-p | LR test | GRS test |
|---|---|---|---|---|
| Ei is continuous and independent of Gi1 and Gi2 | 0.044 (0.035, 0.054) | 0.042 (0.034, 0.051) | 0.059 (0.049, 0.069) | 0.044 (0.035, 0.053) |
| Ei is binary and independent of Gi1 and Gi2 | 0.050 (0.040, 0.059) | 0.054 (0.044, 0.063) | 0.060 (0.049, 0.070) | 0.050 (0.040, 0.059) |
Figure 2.
Power comparison between SBERIA, the min-p method, the LR test and the GRS test in simulation scenario 2 (A set of independent markers) of GWAS settings. The three plots on the left are results when Ei was generated as a continuous variable and the plots on the right are for binary Ei. The top plots are for simulation scenarios where β1 = β2; the plots in the middle are for scenarios where β1 = −β2; the bottom plots are for scenarios where β2 = 0.
3. Correlated G and E
Table 3 shows that only the LR test gives an inflated type I error. Because SBERIA uses the correlation between G and E in the combined case-control sample as its screening tool, its power is expected to be affected if gene-environment correlation exists in the general population. The top left plot of Figure 3 shows that the power of SBERIA is further boosted if the gene-environment correlations in the general population are positive for both causal variants. As shown in the top right plot of Figure 3, the power of SBERIA drops if the correlation in the general population is in a different direction from the interaction, which is in line with expectation. The simple modification of SBERIA (SBERIA-M) performs well in this case: compared with the unmodified version, it has almost the same power gain when the correlations are positive and loses little power when the correlation is in a different direction from the interaction. If there is correlation between null SNPs and E, the two bottom plots of Figure 3 show that the power advantage of SBERIA and SBERIA-M is reduced. This is expected, since correlation between null SNPs and E makes the null SNPs more likely to be selected and therefore dilutes the interaction signal. It is worth noting, however, that gene-environment correlation in the population is relatively rare in real applications [Cornelis et al. 2012].
Table 3.
Type I error rate (95% CI) for SBERIA, min-p, LR test and the modification of SBERIA in simulation scenario 3 (Correlated G and E) in GWAS settings.
| | SBERIA | Min-p | LR test | SBERIA-modified |
|---|---|---|---|---|
| Ei is positively correlated with Gi1 and Gi2 | | | | |
| α1 = α2 = 0 | 0.052 (0.042, 0.062) | 0.042 (0.033, 0.05) | 0.061 (0.051, 0.071) | 0.054 (0.044, 0.064) |
| α1 = α2 = log(1.5) | 0.050 (0.040, 0.059) | 0.050 (0.040, 0.059) | 0.063 (0.052, 0.074) | 0.048 (0.038, 0.057) |
| Ei is negatively correlated with Gi1 and positively correlated with Gi2 | | | | |
| α1 = α2 = 0 | 0.058 (0.048, 0.069) | 0.046 (0.036, 0.055) | 0.062 (0.052, 0.073) | 0.057 (0.047, 0.067) |
| α1 = α2 = log(1.5) | 0.044 (0.035, 0.054) | 0.046 (0.037, 0.055) | 0.065 (0.054, 0.076) | 0.052 (0.042, 0.062) |
Figure 3.
Power comparison between SBERIA, the min-p method, the LR test and SBERIA-M, the modification to SBERIA (as defined in equation (6)), in simulation scenario 3 (E correlated with G) of GWAS settings. The two plots on the top are results when Ei was correlated with the two causal SNPs Gi1 and Gi2 and the two plots on the bottom are results when Ei was correlated with two randomly selected null SNPs. The plots on the left are for scenarios where Ei is positively correlated with both SNPs and the plots on the right are for scenarios where Ei is positively correlated with one SNP and negatively correlated with the other.
Set based GxE in rare variant setting
From Table 4, it can be seen that both SBERIA and the burden GxE test maintain the correct type I error. The min-p method, however, appears to be conservative, which could be due to the rarity of the SNPs, whereas the type I error of the LR test is highly inflated. Figure 4 shows the power comparison between the methods. The LR test always has the best power; however, given its highly inflated type I error, it is not applicable in practice. SBERIA is always more powerful than the min-p and burden GxE methods in the simulation. The advantage of SBERIA is most obvious when around half of the causal loci have a negative interaction with E and the others have a positive interaction. Again, this shows that the correlation screening is informative about the direction of the interaction effects.
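The benefit of signed aggregation when interaction effects point in opposite directions can be illustrated with a toy calculation. The sketch below is a heuristic, not part of the paper's derivation: it assumes independent SNPs with equal variance, so that the single coefficient fitted to an aggregated score is roughly the least-squares projection of the per-SNP interaction effects onto the weight vector. The effect sizes and the function name are hypothetical.

```python
def projected_effect(betas, weights):
    """Approximate coefficient of the aggregated score sum_k w_k * G_k * E.

    Heuristic assumption: SNPs are independent with equal variance, so the
    projection reduces to sum(w * beta) / sum(w * w).
    """
    denom = sum(w * w for w in weights)
    if denom == 0:
        return 0.0
    return sum(w * b for w, b in zip(weights, betas)) / denom


# Hypothetical set of four causal variants, half with negative interaction.
betas = [0.5, 0.5, -0.5, -0.5]

# Unsigned burden: every variant gets weight 1, so opposite effects cancel.
burden_signal = projected_effect(betas, [1, 1, 1, 1])

# Signed aggregation with the correct directions keeps the full signal.
signed_signal = projected_effect(betas, [1, 1, -1, -1])
```

In this toy setting the burden weights recover no net signal while correctly signed weights recover the common effect size, which mirrors the pattern described above for sets with mixed interaction directions.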
Table 4.
Type I error rate (95% CI) for SBERIA, burden GxE, min-p and the LR test in rare variant settings.
| | SBERIA | Burden GxE | Min-p | LR test |
|---|---|---|---|---|
| Ei ~ N(0,1) | 0.045 (0.036, 0.054) | 0.050 (0.040, 0.060) | 0.031 (0.023, 0.039) | 0.085 (0.072, 0.097) |
| Ei ~ Bernoulli(0.3) | 0.046 (0.037, 0.055) | 0.051 (0.041, 0.061) | 0.031 (0.023, 0.039) | 0.095 (0.082, 0.108) |
Figure 4.
Power comparisons between SBERIA (S), min-p (M), LR test (L), and the burden GxE method (B) for different simulation scenarios in rare variant settings. The results are categorized by combinations of the number of causal variants and the proportion of causal variants that have a negative interaction effect.
A real data application
The results of testing for interaction between the known CRC marker set and smoking status using GECCO GWAS data are summarized in Table 6. SBERIA reaches significance at the 0.05 level, and the GRS method also gives a p-value close to that level. Hence, there is evidence that the genetic risk of CRC interacts with smoking status. SBERIA also gives a smaller p-value than the min-p and GRS methods, which demonstrates its potential advantage. In addition, when exploring which SNPs contribute to the interaction signal in the marker set, we found that rs10936599 shows the strongest evidence: it was selected by the correlation screening of SBERIA and it also has the smallest interaction p-value in min-p.
Table 6.
The results for testing interaction between the known CRC loci marker set and smoking status using different methods.
| | SBERIA | Min-p | GRS |
|---|---|---|---|
| p-value | 5.92×10⁻³ | 0.28 | 5.41×10⁻² |
Discussion
In this paper, we proposed a novel method to test for interaction between a set of markers and an environmental variable in case-control studies. SBERIA takes advantage of the unique features of the GxE test by using the correlation screening to inform the aggregation of interaction effects within the marker set. Since the correlation screening in combined case-control samples is independent of the interaction test, SBERIA maintains the correct type I error without requiring permutation. SBERIA uses the regular logistic regression model, so it is computationally efficient and easy to implement. We showed that SBERIA has appealing power compared with the benchmark methods in both GWAS and rare variant settings.
While applying SBERIA to real data, we found evidence of interaction between genetic risk and smoking status for colorectal cancer. rs10936599, the SNP showing the strongest signal, is located at 3q26.2 in the MYNN gene. MYNN encodes a zinc finger domain-containing protein family which is involved in the control of gene expression. Given that the function of MYNN is largely unknown so far, further functional characterization is needed in order to evaluate and interpret this potential interaction. In the real data application, we included the advanced colorectal adenomas because they are well known precursor lesions of colorectal cancer. As a result, this improves our statistical power to identify GxE that act early in the adenoma-cancer sequence, where adenomas and cancer have a shared etiology. We recognize that the adenoma cases will not show signals for GxE’s that act later in the carcinogenic process (i.e. on progression from adenoma to cancer) or GxE’s that act through adenoma independent pathways.
There are several possible improvements that can be made to SBERIA. First, we chose θN such that it corresponds to a p-value cut-off of 0.1. We also tried other p-value cut-offs such as 0.05 and 0.2 in the simulation, and the power of SBERIA did not change substantially (results not shown). However, the minor allele frequency (MAF) affects the power of the correlation screening, and SNPs with larger MAF are more likely to pass the screening than less common SNPs. Hence, it is of interest to let the threshold vary with MAF, and more work should be done to find an optimal θN. In addition, the current weight in SBERIA is either 1, −1 or 0. Further work should explore whether more advanced weights, such as the effect size from the correlation screening or the main effect, would increase power. In SBERIA, the main effect is modeled separately for each SNP in the set. It would be interesting to model main effects in a set-based manner as well, which could potentially increase power. Furthermore, more sophisticated methods can be built upon the framework of our method. For example, SBERIA drops the markers that are not selected by the screening. However, as the screening is not perfect, those SNPs can still contain useful information. Hence, it could potentially increase power to apply a traditional method (e.g. a variance component based method) to the unselected SNPs and combine the results from the selected and unselected SNPs.
SBERIA uses the correlation screening to combine SNPs in case-control studies. The strength of the correlation screening is mainly driven by the correlation between G and E in cases when there is GxE interaction. Hence, it is expected that if there are many more controls than cases, the correlation signal will be weakened and the power of the correlation screening will be reduced.
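The screening and weighting step discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact statistic of equation (3): the G-E correlation is measured with Pearson's r in the combined sample, significance is assessed with a two-sided test based on Fisher's z-transform, and the weight is the sign of the correlation for SNPs that pass the cut-off and 0 otherwise. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screening_weight(g, e, p_cutoff=0.1):
    """Weight in {1, -1, 0} from the G-E correlation in the combined sample."""
    n = len(g)
    if n <= 3:
        raise ValueError("Fisher's z-transform needs more than 3 subjects")
    # Clamp r away from +/-1 so that atanh stays finite.
    r = max(min(pearson_r(g, e), 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided critical value for the chosen p-value cut-off (assumption).
    z_crit = NormalDist().inv_cdf(1.0 - p_cutoff / 2.0)
    if abs(z) < z_crit:
        return 0
    return 1 if z > 0 else -1


def aggregate(genotypes_by_snp, e, p_cutoff=0.1):
    """Return per-SNP weights and the weighted genotype sum for each subject."""
    weights = [screening_weight(g, e, p_cutoff) for g in genotypes_by_snp]
    scores = [
        sum(w * g[i] for w, g in zip(weights, genotypes_by_snp))
        for i in range(len(e))
    ]
    return weights, scores
```

In a full analysis the aggregated score would then enter a logistic regression as a product with E, alongside the main effects and covariates. Letting `p_cutoff` depend on MAF, or replacing the sign with a continuous weight, would correspond to the extensions suggested above.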
In summary, SBERIA shows promising performance in both simulation and real data application. With its easy implementation and fast computation time, SBERIA provides an attractive approach to detecting set based gene-environment interactions.
Supplementary Material
Acknowledgments
National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (U01 CA137088; R01 CA059045).
ASTERISK: a Hospital Clinical Research Program (PHRC) and supported by the Regional Council of Pays de la Loire, the Groupement des Entreprises Françaises dans la Lutte contre le Cancer (GEFLUC), the Association Anne de Bretagne Génétique and the Ligue Régionale Contre le Cancer (LRCC).
COLO2&3: National Institutes of Health (R01 CA60987).
DACHS: German Research Council (Deutsche Forschungsgemeinschaft, BR 1704/6-1, BR 1704/6-3, BR 1704/6-4 and CH 117/1-1), and the German Federal Ministry of Education and Research (01KH0404 and 01ER0814).
DALS: National Institutes of Health (R01 CA48998 to M.L.S).
Guangzhou-1: National Key Scientific and Technological Project – 2011ZX09307-001-04 and the National Basic Research Program – 2011CB504303, People’s Republic of China.
HPFS is supported by the National Institutes of Health (P01 CA 055075, UM1 CA167552, R01 137178, and P50 CA 127003), NHS by the National Institutes of Health (R01 137178, P01 CA 087969 and P50 CA 127003) and PHS by the National Institutes of Health (CA42182).
MEC: National Institutes of Health (R37 CA54281, P01 CA033619, and R01 CA63464).
OFCCR: National Institutes of Health, through funding allocated to the Ontario Registry for Studies of Familial Colorectal Cancer (U01 CA074783); see CCFR section below. OFCCR is supported by a GL2 grant from the Ontario Research Fund, the Canadian Institutes of Health Research, and the Cancer Risk Evaluation (CaRE) Program grant from the Canadian Cancer Society Research Institute. Thomas J. Hudson and Brent W. Zanke are recipients of Senior Investigator Awards from the Ontario Institute for Cancer Research, through generous support from the Ontario Ministry of Economic Development and Innovation.
PLCO: Intramural Research Program of the Division of Cancer Epidemiology and Genetics and supported by contracts from the Division of Cancer Prevention, National Cancer Institute, NIH, DHHS. Control samples were genotyped as part of the Cancer Genetic Markers of Susceptibility (CGEMS) prostate cancer scan, supported by the Intramural Research Program of the National Cancer Institute. The datasets used in this analysis were accessed with appropriate approval through the dbGaP online resource (http://www.cgems.cancer.gov/data_acess.html) through dbGaP accession number 000207v.1p1.c1 (National Cancer Institute (2009) Cancer Genetic Markers of Susceptibility (CGEMS) data website. http://cgems.cancer.gov/data_access.html; Yeager et al. 2007). Control samples were also genotyped as part of the GWAS of Lung Cancer and Smoking [Landi et al. 2009]. Funding for this work was provided through the National Institutes of Health, Genes, Environment and Health Initiative [NIH GEI] (Z01 CP 010200). The human subjects participating in the GWAS are derived from the Prostate, Lung, Colon and Ovarian Screening Trial and the study is supported by intramural resources of the National Cancer Institute. Assistance with genotype cleaning, as well as with general study coordination, was provided by the Gene Environment Association Studies, GENEVA Coordinating Center (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01 HG 004438). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000093.
PMH: National Institutes of Health (R01 CA076366 to P.A.N.).
VITAL: National Institutes of Health (K05 CA154337).
WHI: The WHI program is funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through contracts HHSN268201100046C, HHSN268201100001C, HHSN268201100002C, HHSN268201100003C, HHSN268201100004C, and HHSN271201100004C.
ASTERISK: We are very grateful to Dr. Bruno Buecher without whom this project would not have existed. We also thank all those who agreed to participate in this study, including the patients and the healthy control persons, as well as all the physicians, technicians and students.
DACHS: We thank all participants and cooperating clinicians, and Ute Handte-Daub, Renate Hettler-Jensen, Utz Benscheid, Muhabbet Celik and Ursula Eilber for excellent technical assistance.
GECCO: The authors would like to thank all those at the GECCO Coordinating Center for helping bring together the data and people that made this project possible.
HPFS, NHS and PHS: We would like to acknowledge Patrice Soule and Hardeep Ranu of the Dana Farber Harvard Cancer Center High-Throughput Polymorphism Core who assisted in the genotyping for NHS, HPFS, and PHS under the supervision of Dr. Immaculata Devivo and Dr. David Hunter, Qin (Carolyn) Guo and Lixue Zhu who assisted in programming for NHS and HPFS, and Haiyan Zhang who assisted in programming for the PHS. We would like to thank the participants and staff of the Nurses’ Health Study and the Health Professionals Follow-Up Study, for their valuable contributions as well as the following state cancer registries for their help: AL, AZ, AR, CA, CO, CT, DE, FL, GA, ID, IL, IN, IA, KY, LA, ME, MD, MA, MI, NE, NH, NJ, NY, NC, ND, OH, OK, OR, PA, RI, SC, TN, TX, VA, WA, WY.
PLCO: The authors thank Drs. Christine Berg and Philip Prorok, Division of Cancer Prevention, National Cancer Institute, the Screening Center investigators and staff of the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial, Mr. Tom Riley and staff, Information Management Services, Inc., Ms. Barbara O’Brien and staff, Westat, Inc., and Drs. Bill Kopp, Wen Shao, and staff, SAIC-Frederick. Most importantly, we acknowledge the study participants for their contributions to making this study possible.
PMH: The authors would like to thank the study participants and staff of the Hormones and Colon Cancer study.
WHI: The authors thank the WHI investigators and staff for their dedication, and the study participants for making the program possible. A full listing of WHI investigators can be found at: https://cleo.whi.org/researchers/Documents%20%20Write%20a%20Paper/WHI%20Investigator%20Short%20List.pdf
Footnotes
We declare that no conflict of interest exists.
References
- Beckmann L, Thomas DC, Fischer C, Chang-Claude J. Haplotype sharing analysis using mantel statistics. Human heredity. 2005;59:67–78. doi: 10.1159/000085221.
- Broderick P, Carvajal-Carmona L, Pittman AM, Webb E, Howarth K, Rowan A, et al. A genome-wide association study shows that common alleles of SMAD7 influence colorectal cancer risk. Nature genetics. 2007;39:1315–7. doi: 10.1038/ng.2007.18.
- Cai T, Lin X, Carroll RJ. Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test. Biostatistics (Oxford, England). 2012;13:776–90. doi: 10.1093/biostatistics/kxs015.
- Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005;92:399–418.
- Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S. Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. American journal of human genetics. 2006;79:1002–16. doi: 10.1086/509704.
- Cornelis MC, Tchetgen EJT, Liang L, Qi L, Chatterjee N, Hu FB, et al. Gene-environment interactions in genome-wide association studies: a comparative study of tests applied to empirical studies of type 2 diabetes. American journal of epidemiology. 2012;175:191–202. doi: 10.1093/aje/kwr368.
- Dai JY, Kooperberg C, Leblanc M, Prentice RL. Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika. 2012;99:929–944. doi: 10.1093/biomet/ass044.
- Dempfle A, Hein R, Beckmann L, Scherag A, Nguyen TT, Schäfer H, et al. Comparison of the power of haplotype-based versus single- and multilocus association methods for gene x environment (gene x sex) interactions and application to gene x smoking and gene x sex interactions in rheumatoid arthritis. BMC proceedings. 2007;1(Suppl 1):S73. doi: 10.1186/1753-6561-1-s1-s73.
- Dunlop MG, Dobbins SE, Farrington SM, Jones AM, Palles C, Whiffin N, et al. Common variation near CDKN1A, POLD3 and SHROOM2 influences colorectal cancer risk. Nature genetics. 2012;44:770–6. doi: 10.1038/ng.2293.
- Gao X, Starmer J, Martin ER. A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms. Genetic epidemiology. 2008;32:361–9. doi: 10.1002/gepi.20310.
- García-Closas M, Malats N, Silverman D, Dosemeci M, Kogevinas M, Hein DW, et al. NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet. 2005;366:649–59. doi: 10.1016/S0140-6736(05)67137-1.
- Gauderman JW, Zhang P, Lewinger PJ. Finding GWAS Signals in the Lower Manhattan by Testing GxE Interactions. International Genetic Epidemiology Society Annual Conference; Stevenson, WA. 2012.
- Gauderman WJ, Murcray C, Gilliland F, Conti DV. Testing association between disease and multiple SNPs in a candidate gene. Genetic epidemiology. 2007;31:383–95. doi: 10.1002/gepi.20219.
- Goeman JJ, Van de Geer SA, De Kort F, Van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics (Oxford, England). 2004;20:93–9. doi: 10.1093/bioinformatics/btg382.
- Hamza TH, Chen H, Hill-Burns EM, Rhodes SL, Montimurro J, Kay DM, et al. Genome-wide gene-environment study identifies glutamate receptor gene GRIN2A as a Parkinson’s disease modifier gene via interaction with coffee. PLoS genetics. 2011;7:e1002237. doi: 10.1371/journal.pgen.1002237.
- Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Human heredity. 2010;70:42–54. doi: 10.1159/000288704.
- Hashibe M, McKay JD, Curado MP, Oliveira JC, Koifman S, Koifman R, et al. Multiple ADH genes are associated with upper aerodigestive cancers. Nature genetics. 2008;40:707–9. doi: 10.1038/ng.151.
- Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America. 2009;106:9362–7. doi: 10.1073/pnas.0903103106.
- Houlston RS, Cheadle J, Dobbins SE, Tenesa A, Jones AM, Howarth K, et al. Meta-analysis of three genome-wide association studies identifies susceptibility loci for colorectal cancer at 1q41, 3q26.2, 12q13.13 and 20q13.33. Nature genetics. 2010;42:973–7. doi: 10.1038/ng.670.
- Houlston RS, Webb E, Broderick P, Pittman AM, Di Bernardo MC, Lubbe S, et al. Meta-analysis of genome-wide association data identifies four new susceptibility loci for colorectal cancer. Nature genetics. 2008;40:1426–35. doi: 10.1038/ng.262.
- Hsu L, Jiao S, Dai JY, Hutter C, Peters U, Kooperberg C. Powerful Cocktail Methods for Detecting Genome-Wide Gene-Environment Interaction. Genetic epidemiology. 2012;36:183–94. doi: 10.1002/gepi.21610.
- Huang H, Chanda P, Alonso A, Bader JS, Arking DE. Gene-based tests of association. PLoS genetics. 2011;7:e1002177. doi: 10.1371/journal.pgen.1002177.
- Jaeger E, Webb E, Howarth K, Carvajal-Carmona L, Rowan A, Broderick P, et al. Common genetic variants at the CRAC1 (HMPS) locus on chromosome 15q13.3 influence colorectal cancer risk. Nature genetics. 2008;40:26–8. doi: 10.1038/ng.2007.41.
- Kooperberg C, Leblanc M. Increasing the power of identifying gene x gene interactions in genome-wide association studies. Genetic epidemiology. 2008;32:255–63. doi: 10.1002/gepi.20300.
- Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. A powerful and flexible multilocus association test for quantitative traits. American journal of human genetics. 2008;82:386–97. doi: 10.1016/j.ajhg.2007.10.010.
- Landi MT, Chatterjee N, Yu K, Goldin LR, Goldstein AM, Rotunno M, et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. American journal of human genetics. 2009;85:679–91. doi: 10.1016/j.ajhg.2009.09.012.
- Li B, Leal SM. Discovery of rare variants via sequencing: implications for the design of complex trait association studies. PLoS genetics. 2009;5:e1000481. doi: 10.1371/journal.pgen.1000481.
- Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. American journal of human genetics. 2008;83:311–21. doi: 10.1016/j.ajhg.2008.06.024.
- Li D, Conti DV. Detecting gene-environment interactions using a combined case-only and case-control approach. American journal of epidemiology. 2009;169:497–504. doi: 10.1093/aje/kwn339.
- Li M, Wang K, Grant SFA, Hakonarson H, Li C. ATOM: a powerful gene-based association test by combining optimally weighted markers. Bioinformatics (Oxford, England). 2009;25:497–503. doi: 10.1093/bioinformatics/btn641.
- Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. American journal of human genetics. 2011;89:354–67. doi: 10.1016/j.ajhg.2011.07.015.
- Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A versatile gene-based test for genome-wide association studies. American journal of human genetics. 2010;87:139–45. doi: 10.1016/j.ajhg.2010.06.009.
- Luan Y, Li H. Group additive regression models for genomic data analysis. Biostatistics (Oxford, England). 2008;9:100–13. doi: 10.1093/biostatistics/kxm015.
- Lubbe SJ, Di Bernardo MC, Broderick P, Chandler I, Houlston RS. Comprehensive evaluation of the impact of 14 genetic variants on colorectal cancer phenotype and risk. American journal of epidemiology. 2012;175:1–10. doi: 10.1093/aje/kwr285.
- Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS genetics. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384.
- Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation research. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
- Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic epidemiology. 2010;34:188–93. doi: 10.1002/gepi.20450.
- Moskvina V, Schmidt KM. On multiple-testing correction in genome-wide association studies. Genetic epidemiology. 2008;32:567–73. doi: 10.1002/gepi.20331.
- Mukherjee B, Chatterjee N. Exploiting gene-environment independence for analysis of case-control studies: an empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics. 2008;64:685–94. doi: 10.1111/j.1541-0420.2007.00953.x.
- Mukhopadhyay I, Feingold E, Weeks DE, Thalamuthu A. Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genetic epidemiology. 2010;34:213–21. doi: 10.1002/gepi.20451.
- Murcray CE, Lewinger JP, Conti DV, Thomas DC, Gauderman WJ. Sample size requirements to detect gene-environment interactions in genome-wide association studies. Genetic epidemiology. 2011;35:201–10. doi: 10.1002/gepi.20569. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Murcray CE, Lewinger JP, Gauderman WJ. Gene-environment interaction in genome-wide association studies. American journal of epidemiology. 2009;169:219–26. doi: 10.1093/aje/kwn353. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, et al. Testing for an unusual distribution of rare variants. PLoS genetics. 2011;7:e1001322. doi: 10.1371/journal.pgen.1001322. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peters U, Hutter CM, Hsu L, Schumacher FR, Conti DV, Carlson CS, et al. Meta-analysis of new genome-wide association studies of colorectal cancer risk. Human genetics. 2011;131:217–34. doi: 10.1007/s00439-011-1055-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peters U, Jiao S, Schumacher FR, Hutter CM, Aragaki AK, Baron JA, et al. Identification of Genetic Susceptibility Loci for Colorectal Tumors in a Genome-Wide Meta-Analysis. Gastroenterology. 2012 doi: 10.1053/j.gastro.2012.12.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in medicine. 1994;13:153–62. doi: 10.1002/sim.4780130206. [DOI] [PubMed] [Google Scholar]
- Price AL, Kryukov GV, De Bakker PIW, Purcell SM, Staples J, Wei LJ, et al. Pooled association tests for rare variants in exon-resequencing studies. American journal of human genetics. 2010;86:832–8. doi: 10.1016/j.ajhg.2010.04.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rothman N, Garcia-Closas M, Chatterjee N, Malats N, Wu X, Figueroa JD, et al. A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nature genetics. 2010;42:978–84. doi: 10.1038/ng.687. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schaid DJ. Genomic Similarity and Kernel Methods I: Advancements by Building on Mathematical and Statistical Foundations. Human heredity. 2010;70:109–131. doi: 10.1159/000312641. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN. Nonparametric tests of association of multiple genes with human disease. American journal of human genetics. 2005;76:780–93. doi: 10.1086/429838. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Smith PG, Day NE. The design of case-control studies: the influence of confounding and interaction effects. International journal of epidemiology. 1984;13:356–65. doi: 10.1093/ije/13.3.356.
- Tenesa A, Farrington SM, Prendergast JGD, Porteous ME, Walker M, Haq N, et al. Genome-wide association scan identifies a colorectal cancer susceptibility locus on 11q23 and replicates risk loci at 8q24 and 18q21. Nature genetics. 2008;40:631–7. doi: 10.1038/ng.133.
- The International HapMap Project. Nature. 2003;426:789–96. doi: 10.1038/nature02168.
- Tomlinson I, Webb E, Carvajal-Carmona L, Broderick P, Kemp Z, Spain S, et al. A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nature genetics. 2007;39:984–8. doi: 10.1038/ng2085.
- Tomlinson IPM, Carvajal-Carmona LG, Dobbins SE, Tenesa A, Jones AM, Howarth K, et al. Multiple Common Susceptibility Variants near BMP Pathway Loci GREM1, BMP4, and BMP2 Explain Part of the Missing Heritability of Colorectal Cancer. PLoS genetics. 2011;7:e1002105. doi: 10.1371/journal.pgen.1002105.
- Tomlinson IPM, Webb E, Carvajal-Carmona L, Broderick P, Howarth K, Pittman AM, et al. A genome-wide association study identifies colorectal cancer susceptibility loci on chromosomes 10p14 and 8q23.3. Nature genetics. 2008;40:623–30. doi: 10.1038/ng.111.
- Tzeng JY, Devlin B, Wasserman L, Roeder K. On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. American journal of human genetics. 2003;72:891–902. doi: 10.1086/373881.
- Tzeng JY, Zhang D. Haplotype-based association analysis via variance-components score test. American journal of human genetics. 2007;81:927–38. doi: 10.1086/521558.
- Tzeng JY, Zhang D, Chang SM, Thomas DC, Davidian M. Gene-trait similarity regression for multimarker-based association analysis. Biometrics. 2009;65:822–32. doi: 10.1111/j.1541-0420.2008.01176.x.
- Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, et al. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. American journal of human genetics. 2011;89:277–88. doi: 10.1016/j.ajhg.2011.07.007.
- Wang K, Abbott D. A principal components regression approach to multilocus genetic association studies. Genetic epidemiology. 2008;32:108–18. doi: 10.1002/gepi.20266.
- Wang T, Elston RC. Improved power by use of a weighted score test for linkage disequilibrium mapping. American journal of human genetics. 2007;80:353–60. doi: 10.1086/511312.
- Wei Z, Li M, Rebbeck T, Li H. U-statistics-based tests for multiple genes in genetic association studies. Annals of human genetics. 2008;72:821–33. doi: 10.1111/j.1469-1809.2008.00473.x.
- Wessel J, Schork NJ. Generalized genomic distance-based regression methodology for multilocus association analysis. American journal of human genetics. 2006;79:792–806. doi: 10.1086/508346.
- Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, et al. Powerful SNP-set analysis for case-control genome-wide association studies. American journal of human genetics. 2010;86:929–42. doi: 10.1016/j.ajhg.2010.05.002.
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. American journal of human genetics. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
- Zanke BW, Greenwood CMT, Rangrej J, Kustra R, Tenesa A, Farrington SM, et al. Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24. Nature genetics. 2007;39:989–94. doi: 10.1038/ng2089.
- Zhao J, Boerwinkle E, Xiong M. An entropy-based statistic for genomewide association studies. American journal of human genetics. 2005;77:27–40. doi: 10.1086/431243.