Abstract
The etiology of most complex diseases involves genetic variants, environmental factors, and gene-environment interaction (G × E) effects. Compared with marginal genetic association studies, G × E analysis requires more samples and detailed measurements of environmental exposures, and this limits the possible discoveries. Large-scale population-based biobanks with detailed phenotypic and environmental information, such as UK Biobank, can be ideal resources for identifying G × E effects. However, due to the large computation cost and the presence of case-control imbalance, existing methods often fail. Here we propose a scalable and accurate method, SPAGE (SaddlePoint Approximation implementation of G × E analysis), that is applicable to genome-wide scale phenome-wide G × E studies. SPAGE fits a genotype-independent logistic model only once across the genome-wide analysis to reduce computation cost, and it uses a saddlepoint approximation (SPA) to calibrate the test statistics for phenotypes with unbalanced case-control ratios. Simulation studies show that SPAGE is 33–79 times faster than the Wald test and 72–439 times faster than Firth’s test, and that SPAGE controls type I error rates at the genome-wide significance level even when case-control ratios are extremely unbalanced. Through the analysis of UK Biobank data of 344,341 white British European-ancestry samples, we show that SPAGE can efficiently analyze large samples while controlling for unbalanced case-control ratios.
Keywords: gene-environment interaction, saddlepoint approximation, unbalanced case-control ratios, UK Biobank, biobank data analysis
Introduction
Most complex diseases, such as type 2 diabetes and cancers, have an etiology influenced by genetic variants, lifestyles, and environmental factors. Besides their marginal effects, the gene-environment interaction (G × E) also plays an important role for complex diseases and is worthy of comprehensive investigation. Identifying G × E effects is particularly important for personalized and stratified prevention and treatment. However, compared to identifying genetic marginal effects, more samples and detailed environmental exposure information are required in order to identify G × E effects, and this limits the possible discoveries.1, 2, 3, 4, 5, 6, 7, 8, 9
The advances in genotyping technologies and electronic health records (EHRs) make it possible to genotype hundreds of thousands of samples and identify a large number of traits.10, 11, 12, 13, 14, 15 For example, UK Biobank includes 500,000 genotyped samples and more than 1,000 phenotypes and exposures from ICD billing codes, web surveys, and lab measurements.16 Through genome-wide × phenome-wide association analysis, these massive datasets have resulted in a considerable number of new genetic associations across different phenotypes, and the associations could provide evidence for pleiotropy or shared pathways for disease pathogenesis.17, 18, 19 All these motivate the development of genome-wide scale phenome-wide G × E study (PheWIS).
Currently, however, no scalable statistical methods exist for a genome-wide scale G × E study of thousands of phenotypes in large biobanks. For the analysis of genetic marginal effects, the score test has provided fast computation. In contrast to Wald and likelihood ratio tests, the score test does not require us to fit the model under the alternative hypothesis. Score tests use the parameter estimates under the null hypothesis to calculate test statistics and p values. Because the null model for marginal genetic effects does not include genetic variants, for a single phenotype, score tests require fitting one null model only and use it for the genome-wide tests.20,21 However, for the analysis of G × E, because the null model includes genetic variants to adjust for genetic marginal effects, the same trick cannot be used. Score tests for G × E need to fit a separate model for each variant, which results in tens of millions of model fittings, like Wald and likelihood ratio tests.22 For example, suppose that fitting a single model takes ∼1.7 s (as in Supplementary Methods in the Supplemental Data: a standard logistic regression with 400,000 samples and five covariates); in that case, fitting 20 million models would take more than 1 year. Although more optimized tools such as CGEN23 and GxEScan24 have been developed, because these tools mainly implement the Wald test, the computation burden is still very high.
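The back-of-the-envelope estimate above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the two constants are the figures quoted in the text, and a 365-day year is assumed.

```python
# Rough cost of fitting one logistic model per variant, using the figures quoted in the text.
SECONDS_PER_FIT = 1.7      # one logistic fit: 400,000 samples, five covariates
N_VARIANTS = 20_000_000    # one model fit per variant

total_seconds = SECONDS_PER_FIT * N_VARIANTS
total_days = total_seconds / (24 * 3600)
total_years = total_days / 365.0   # assumption: 365-day year

print(f"{total_days:.0f} days, about {total_years:.2f} years of single-core time")
```

The estimate covers a single phenotype on a single core, which is why a per-variant model fit does not scale to thousands of phenotypes.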
Alternatively, a two-step procedure which screens out variants based on marginal genetic associations can be used instead.25, 26, 27, 28, 29, 30 However, because this approach excludes a majority of variants in the screening step, it can miss potential G × E and cannot generate genome-wide summary statistics of G × E; these summary statistics can be useful resources for phenome-wide analysis, for meta-analysis, or as a validation dataset.31, 32, 33 Another possible two-step procedure is a case-only (CO) analysis for screening followed by a case-control test to validate the association.34, 35, 36 This is a computationally efficient approach, especially when case-control ratio is low. However, as shown in our simulation studies, CO analysis can be less powerful than case-control analysis for a cohort study design. In addition, when the gene-environment independence assumption is violated, CO analysis can be biased.5
Given that the data are collected in large cohorts, unbalanced case-control ratios are commonly observed. For example, most binary phenotypes in UK Biobank (1,431 out of 1,688; 84.8%) have case-control ratios lower than 1:100.15 These unbalanced case-control ratios will result in incorrect type I error rates. For the genetic marginal effect test, saddlepoint approximation (SPA) has been used to control the type I error rates in such situations.12, 13, 14, 15,37 However, the effect of case-control imbalance for the G × E analysis has not been well studied.
In this paper, we propose a SaddlePoint Approximation implementation of G × E analysis (SPAGE), a fast and accurate method that is scalable for a genome-wide scale single-variant G × E analysis and is well calibrated for controlling type I error rates even under unbalanced case-control ratios. The proposed method fits a genotype-independent logistic model only once for the genome-wide analysis and then uses a conditional expectation to exclude the marginal genetic effect. The SPA, instead of normal approximation, is used to calibrate p values so that type I error rates can be controlled under unbalanced case-control ratios. The method is valid for analyzing both hard-called genotypes and imputed dosage values. Through simulation studies and applications to UK Biobank data of 344,341 unrelated samples from white British participants, we demonstrate that SPAGE is computationally feasible, can control type I error rates, and is sufficiently powerful to identify several G × E signals for a number of diseases including chronic airway obstruction (CAO), cardiac dysrhythmias (CDR), and hyperlipidemia (HLD).
Material and Methods
Logistic Regression Model and Score Statistics
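For a single-variant test, we consider the following logistic model $M$:

$$\mathrm{logit}(\mu_i) = X_i^{T}\beta_X + G_i\beta_G + E_i\beta_E + G_iE_i\beta_{G\times E}, \quad i = 1, \ldots, N,$$

where $\mu_i$ is the probability of a binary phenotype $Y_i$ (e.g., disease status) for subject $i$, conditional on the covariates, genotypes, and an environmental factor. We let $X_i$ denote a vector of covariates including the intercept; $G_i$ the hard-called genotypes (0, 1, or 2), genotypes following a dominant or recessive model,38 or dosage values of the genetic variant to be tested; $E_i$ the environmental factor of interest; $\beta_X$ a coefficient vector corresponding to covariates; $\beta_G$ the marginal genetic effect; $\beta_E$ the marginal environmental effect; and $\beta_{G\times E}$ the G × E effect. $N$ is the total number of all samples. Suppose that covariates matrix $X$, genotype vector $G$, and G × E interaction vector $G_E$ are

$$X = \begin{pmatrix} X_1^{T} & E_1 \\ \vdots & \vdots \\ X_N^{T} & E_N \end{pmatrix}, \quad G = (G_1, \ldots, G_N)^{T}, \quad G_E = (G_1E_1, \ldots, G_NE_N)^{T},$$

then the matrix form of the model is $\mathrm{logit}(\mu) = X\beta + G\beta_G + G_E\beta_{G\times E}$, where $\mu = (\mu_1, \ldots, \mu_N)^{T}$ and $\beta = (\beta_X^{T}, \beta_E)^{T}$.

Under the model $M$, we are interested in testing $H_0\colon \beta_{G\times E} = 0$ for the marginal G × E effect. The classic score test first fits the null model $M_0\colon \mathrm{logit}(\mu) = X\beta + G\beta_G$ to estimate $(\hat\beta, \hat\beta_G)$, and then calculates the fitted probabilities $\hat\mu$ and uses $S_{G\times E} = G_E^{T}(Y - \hat\mu)$, with $Y = (Y_1, \ldots, Y_N)^{T}$, as a test statistic with a mean of zero. Because the null models vary for different variants, this strategy requires a separate model fitting for each variant, which is computationally expensive for a genome-wide analysis.

Instead of fitting the null model $M_0$, we fit a genotype-independent model, $\mathrm{logit}(\mu) = X\beta$, to estimate $\hat\beta$, and then we calculate $\hat\mu_0$ as the estimated probability of being a case under this genotype-independent model. Suppose that $W$ is an $N \times N$ diagonal matrix with $\hat\mu_{0i}(1 - \hat\mu_{0i})$ as the $i$th diagonal element, and

$$\tilde{G} = G - X(X^{T}WX)^{-1}X^{T}WG, \quad \tilde{G}_E = G_E - X(X^{T}WX)^{-1}X^{T}WG_E$$

are covariate-adjusted vectors in which covariate effects are projected out from genotype and G × E interaction vectors, respectively. We propose a test statistic $S = (\tilde{G}_E - \lambda\tilde{G})^{T}(Y - \hat\mu_0)$, where $\lambda = \tilde{G}_E^{T}W\tilde{G} / (\tilde{G}^{T}W\tilde{G})$. In Appendix A, we use Taylor expansion to show that $S$ approximates $S_{G\times E}$, and the variance of $S$, $\mathrm{Var}(S)$, is approximated by $(\tilde{G}_E - \lambda\tilde{G})^{T}W(\tilde{G}_E - \lambda\tilde{G})$.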
Using the proposed statistic $S$ to approximate the classic score statistic $S_{G\times E}$ greatly reduces the computation time because $S$ requires fitting the genotype-independent model only once for a genome-wide analysis. However, because the approximation is based on a Taylor expansion around a marginal genetic effect of zero ($\beta_G = 0$), it can provide inaccurate results when the marginal genetic effect is large. To avoid possible inflated type I error rates caused by an inaccurate approximation, we use a hybrid strategy to determine the appropriate test statistic. We first conduct a standard score test for the marginal genotype effect by using the score test statistic $S_G$. If the score test p value is greater than a pre-selected cutoff $\varepsilon$, we use $S$ as the test statistic. Otherwise, we estimate $\hat\mu$ under the null model that includes the genotype and calculate $S_{G\times E}$ as the test statistic. This hybrid strategy is a pragmatic compromise between efficiency and accuracy. The same cutoff $\varepsilon$ was used in simulation studies and real data analyses.
For the subset of variants whose marginal genetic effect p value is below the cutoff $\varepsilon$, we use a method developed by Dey et al. to calculate $S_{G\times E}$. Instead of fitting the null model that includes the genotype, this method estimates the genetic effect based on score statistics while adjusting for covariates.39 Compared to Firth’s method, this method reduces the computational complexity from $O(Nk^{2} + k^{3})$ to $O(N)$, where $k$ is the number of non-genetic covariates. In this paper, we first use this method to estimate the marginal genetic effect $\hat\beta_G$. Then we update $\hat\mu$ and estimate $S_{G\times E}$ and its variance.
The hybrid strategy above is different from the conventional two-step methods. The two-step methods only calculate G × E p values of the variants whose marginal genetic effect p values are below the threshold, but the proposed approach obtains G × E p values across the whole genome regardless of their marginal genetic effects. Compared to the constrained marginal G × E test under a constrained model , the proposed test statistic adjusts for the marginal genetic main effect (see Appendix A).
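The genotype-independent strategy described in this section can be sketched in plain Python. This is a minimal illustration, not the SPAGE package code: it assumes the statistic has the projected form $S = (\tilde{G}_E - \lambda\tilde{G})^{T}(Y - \hat\mu_0)$ with $\lambda = \tilde{G}_E^{T}W\tilde{G}/(\tilde{G}^{T}W\tilde{G})$, and all function names (`solve`, `fit_probabilities`, `project_out`, `spage_like_statistic`) and the toy data are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve the small linear system A x = b (Gaussian elimination, partial pivoting)."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= f * M[col][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def xtwv(X, w, v):
    """Return X^T W v for a diagonal weight vector w."""
    return [sum(row[a] * wi * vi for row, wi, vi in zip(X, w, v)) for a in range(len(X[0]))]

def xtwx(X, w):
    """Return X^T W X for a diagonal weight vector w."""
    k = len(X[0])
    return [[sum(row[a] * row[b] * wi for row, wi in zip(X, w)) for b in range(k)] for a in range(k)]

def predict(X, beta):
    return [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row)))) for row in X]

def fit_probabilities(X, y, n_iter=25):
    """Newton-Raphson fit of logit(mu) = X beta (no genotype); returns fitted probabilities."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        mu = predict(X, beta)
        w = [m * (1.0 - m) for m in mu]
        resid = [yi - mi for yi, mi in zip(y, mu)]
        step = solve(xtwx(X, w), xtwv(X, [1.0] * len(y), resid))
        beta = [b + s for b, s in zip(beta, step)]
    return predict(X, beta)

def project_out(X, w, v):
    """Return v - X (X^T W X)^{-1} X^T W v, i.e. v with covariate effects projected out."""
    coef = solve(xtwx(X, w), xtwv(X, w, v))
    return [vi - sum(c * x for c, x in zip(coef, row)) for vi, row in zip(v, X)]

def spage_like_statistic(X, y, mu0, g, e):
    """Statistic, variance and weights for one variant, reusing the single fit mu0."""
    w = [m * (1.0 - m) for m in mu0]
    g_t = project_out(X, w, g)
    ge_t = project_out(X, w, [gi * ei for gi, ei in zip(g, e)])
    lam = sum(a * wi * b for a, wi, b in zip(ge_t, w, g_t)) / sum(b * wi * b for wi, b in zip(w, g_t))
    c = [a - lam * b for a, b in zip(ge_t, g_t)]           # weights with the G main effect removed
    s = sum(ci * (yi - mi) for ci, yi, mi in zip(c, y, mu0))
    var = sum(ci * ci * wi for ci, wi in zip(c, w))
    return s, var, c, w, g_t

# Toy null data (assumed values): intercept, one covariate, and the environmental factor E.
rng = random.Random(0)
n = 400
e = [rng.gauss(0.0, 1.0) for _ in range(n)]
X = [[1.0, rng.gauss(0.0, 1.0), ei] for ei in e]
g = [float((rng.random() < 0.3) + (rng.random() < 0.3)) for _ in range(n)]
y = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(n)]

mu0 = fit_probabilities(X, y)                # fitted once, reusable for every variant
s, var, c, w, g_t = spage_like_statistic(X, y, mu0, g, e)
p_normal = math.erfc(abs(s) / math.sqrt(2.0 * var))   # two-sided normal-approximation p value
print(s, var, p_normal)
```

In this sketch only `spage_like_statistic` depends on the genotype, so the model fit is shared across variants; the weights `c` are by construction $W$-orthogonal to the adjusted genotype, which is how the marginal genetic effect is removed.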
p Value Calculation with Saddlepoint Approximation
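The classical likelihood-based test approximates the null distribution of the statistic $S$ through the use of a normal distribution with a mean of 0 and a variance $\mathrm{Var}(S)$. The normal approximation works well when the statistic is near the mean of the distribution, but it performs poorly at the tails, especially when the underlying distribution is highly skewed, such as in an unbalanced case-control setting. In this situation, SPA performs well because higher moments can be incorporated. Because our test statistic can be written as a weighted sum of the mean-adjusted phenotypes $Y_i$ given the fitted probabilities $\hat\mu_i$,

$$S = \sum_{i=1}^{N} c_i (Y_i - \hat\mu_i),$$

and $Y_i \sim \mathrm{Bernoulli}(\hat\mu_i)$, the entire cumulant-generating function (CGF) of $S$ is

$$K(t) = \sum_{i=1}^{N} \log\left(1 - \hat\mu_i + \hat\mu_i e^{c_i t}\right) - t\sum_{i=1}^{N} c_i\hat\mu_i,$$

where $c_i$ denotes the weight of subject $i$ in the statistic. The distribution of $S$ at the observed statistic $s$ can be approximated by $\Pr(S < s) \approx \Phi\{w + \frac{1}{w}\log(v/w)\}$, where $w = \mathrm{sgn}(\hat\zeta)\sqrt{2(\hat\zeta s - K(\hat\zeta))}$, $v = \hat\zeta\sqrt{K''(\hat\zeta)}$, $\hat\zeta$ is the solution to the equation $K'(\hat\zeta) = s$, and $\Phi$ is the distribution function of a standard normal distribution. When the testing is based on the classic score statistic $S_{G\times E}$, i.e., when the marginal genetic effect p value is below the pre-selected cutoff, we can simply follow the SPA proposed by Dey et al.12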
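To make the saddlepoint calculation concrete, the following sketch evaluates the tail probability of a weighted sum of centered Bernoulli variables. It assumes the standard CGF of such a sum and the usual Barndorff-Nielsen-type tail formula; the root is found by simple bisection rather than the faster root-finding a production implementation would use, and the function names and toy inputs are hypothetical.

```python
import math

def norm_sf(z):
    """Upper-tail probability of a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def cgf(t, c, mu):
    """Return K(t), K'(t), K''(t) for S = sum_i c_i (Y_i - mu_i), Y_i ~ Bernoulli(mu_i)."""
    k0 = k1 = k2 = 0.0
    for ci, mi in zip(c, mu):
        ex = math.exp(max(min(ci * t, 700.0), -700.0))   # clipped to avoid overflow
        denom = 1.0 - mi + mi * ex
        p = mi * ex / denom                               # exponentially tilted probability
        k0 += math.log(denom) - ci * mi * t
        k1 += ci * (p - mi)
        k2 += ci * ci * p * (1.0 - p)
    return k0, k1, k2

def solve_saddlepoint(s, c, mu, max_doublings=60):
    """Find t with K'(t) = s by bracketing and bisection; None if s is unattainable."""
    lo, hi = -1.0, 1.0
    for _ in range(max_doublings):
        if cgf(lo, c, mu)[1] < s:
            break
        lo *= 2.0
    else:
        return None
    for _ in range(max_doublings):
        if cgf(hi, c, mu)[1] > s:
            break
        hi *= 2.0
    else:
        return None
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cgf(mid, c, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def spa_upper_tail(s, c, mu):
    """Saddlepoint approximation to Pr(S >= s)."""
    t = solve_saddlepoint(s, c, mu)
    if t is None:                       # s lies outside the support of S
        return 0.0 if s > 0 else 1.0
    if abs(t) < 1e-4:                   # near the mean the normal approximation is adequate
        return norm_sf(s / math.sqrt(cgf(0.0, c, mu)[2]))
    k0, _, k2 = cgf(t, c, mu)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0)), t)
    v = t * math.sqrt(k2)
    return norm_sf(w + math.log(v / w) / w)

def spa_two_sided(s, c, mu):
    """Two-sided p value: Pr(S >= |s|) + Pr(S <= -|s|)."""
    q = abs(s)
    return spa_upper_tail(q, c, mu) + spa_upper_tail(q, [-ci for ci in c], mu)

# Toy unbalanced example (assumed values): 1,000 subjects, 1% case probability, unit weights.
c = [1.0] * 1000
mu = [0.01] * 1000
s_obs = 15.0
variance = cgf(0.0, c, mu)[2]
spa_upper = spa_upper_tail(s_obs, c, mu)
normal_upper = norm_sf(s_obs / math.sqrt(variance))
print(spa_upper, normal_upper, spa_two_sided(s_obs, c, mu))
```

In this skewed toy setting the normal approximation gives a much smaller upper-tail probability than the saddlepoint approximation, which illustrates why normal-based p values can be anti-conservative when cases are rare.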
Implementation Details and Approaches to Reducing Computation Time
Because the normal approximation behaves well near the mean of the distribution, we can use it to obtain the p values when the observed score statistic lies close to the mean of 0.12 We apply the normal approximation to obtain a p value if the absolute value of the observed score statistic satisfies $|s| < r\sigma$, where $\sigma = \sqrt{\mathrm{Var}(S)}$ and $r$ is a pre-specified value. For example, we use $r = 2$ in our simulation studies and real-data analyses. When $|s| \ge r\sigma$, we use the SPA to obtain calibrated p values in tail areas. Because using the normal approximation takes less time than using the SPA, this approach also reduces the computation time.12
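The switch between the two approximations can be expressed as a small dispatcher. The sketch below is illustrative only: `hybrid_p_value` is a hypothetical name, the threshold default of two standard deviations follows the recommendation cited in the Discussion, and the saddlepoint routine is passed in as a callable (here replaced by a stand-in stub).

```python
import math

def hybrid_p_value(s, var_s, spa_two_sided, r=2.0):
    """Use the normal approximation within r standard deviations of 0, the SPA otherwise."""
    sigma = math.sqrt(var_s)
    if abs(s) < r * sigma:
        p = math.erfc(abs(s) / (sigma * math.sqrt(2.0)))   # two-sided normal p value
        return p, "normal"
    return spa_two_sided(s), "spa"

# Stand-in for a real saddlepoint routine (hypothetical; returns a fixed value).
def stub_spa(s):
    return 1e-9

print(hybrid_p_value(1.0, 1.0, stub_spa))   # falls inside the threshold
print(hybrid_p_value(3.0, 1.0, stub_spa))   # falls in the tail, so the SPA routine is called
```

Because most variants have statistics near zero, the cheaper branch handles the bulk of the tests and the saddlepoint equation is solved only in the tails.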
Similar to the fastSPA method12 designed for genetic marginal effects, the SPA method requires only computations and can be further decreased to computations, in which is the number of non-zero elements in . Since matrix can be pre-calculated, and matrix is diagonal, the calculations of and require multiplications. Given and , the calculations of the score statistics and the corresponding variances take multiplications. Hence, the total computation complexity is still .
Numeric Simulations
We carried out extensive simulation studies to evaluate computation time, type I error rates, and powers of SPAGE. Three case-control ratios were considered: balanced (case:control = 1:1), moderately unbalanced (case:control = 1:9), and unbalanced (case:control = 1:99). For each choice of case-control ratios, a binary phenotype for individual was simulated from the following logistic model:
(1) |
where a binary covariate was simulated following a Bernoulli(0.5) distribution, a continuous covariate was simulated following a standard normal distribution, an environmental factor was simulated following a standard normal distribution, and a genotype was simulated following a binomial distribution where is the minor allele frequency (MAF). Parameters and are log odds ratios of the marginal genetic effect and the G × E effect, respectively. Intercept was chosen to correspond to the given case-control ratio.
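A simulation of this kind can be sketched as follows. The covariate, environmental, and genotype distributions follow the description above, but the coefficient value 0.5 on the covariates and the environmental factor is a placeholder assumption, and choosing the intercept by bisection so that the mean case probability matches the target prevalence is one possible reading of "chosen to correspond to the given case-control ratio". The function name `simulate_cohort` is hypothetical.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_cohort(n, maf, prevalence, beta_g=0.0, beta_ge=0.0, seed=1):
    """Simulate a cohort with a binary phenotype from a logistic G x E model."""
    rng = random.Random(seed)
    x1 = [float(rng.random() < 0.5) for _ in range(n)]        # Bernoulli(0.5) covariate
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]              # standard normal covariate
    e = [rng.gauss(0.0, 1.0) for _ in range(n)]               # environmental factor
    g = [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n)]  # Binomial(2, MAF)
    # Linear predictor without intercept; the 0.5 coefficients are placeholder assumptions.
    eta = [0.5 * a + 0.5 * b + 0.5 * ei + beta_g * gi + beta_ge * gi * ei
           for a, b, ei, gi in zip(x1, x2, e, g)]
    # Bisection on the intercept so that the mean case probability equals the prevalence.
    lo, hi = -30.0, 30.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if sum(expit(mid + h) for h in eta) / n < prevalence:
            lo = mid
        else:
            hi = mid
    beta0 = 0.5 * (lo + hi)
    mu = [expit(beta0 + h) for h in eta]
    y = [int(rng.random() < m) for m in mu]
    return {"y": y, "g": g, "e": e, "mu": mu, "beta0": beta0}

# Unbalanced setting: case:control = 1:99 corresponds to a prevalence of 0.01.
cohort = simulate_cohort(n=5000, maf=0.1, prevalence=0.01, beta_g=math.log(1.4))
mean_mu = sum(cohort["mu"]) / len(cohort["mu"])
print(cohort["beta0"], mean_mu, sum(cohort["y"]))
```

Setting `beta_ge=0.0` generates data under the null of no G × E effect, while a nonzero `beta_g` mimics the variants with a marginal genetic effect.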
To evaluate computation time in realistic scenarios, we randomly sampled MAFs from the MAF distribution in the UK Biobank dataset and then simulated 10,000 variants. Two scenarios were considered in order to compare computation time for different methods. First, we fixed the sample size at 400,000 and increased the total number of covariates from 5 to 30. Then, we fixed the number of covariates at 15 and increased the sample size from 10,000 to 400,000. Besides the two covariates in Equation (1), the other covariates were simulated following a standard normal distribution. We compared the computation time of six different tests: the Wald test for the logistic regression that fits a complete model for each variant (Wald), Firth’s penalized likelihood ratio test (Firth’s test), the normal-approximation-based test (SPAGE-NoSPA), the fast SPA-based test with a standard deviation threshold (SPAGE), the constrained maximum likelihood method (CML) implemented in the R package CGEN (version: 3.18.0), and the CO approach implemented in GxEScan (version: 1.0). Like CO, CML assumes gene-environment independence. The CO approach in GxEScan uses a polytomous logistic regression to adjust for covariates. We modified the core code (C in CGEN and C++ in GxEScanR) to suppress the unnecessary parts so that we could accurately record the computation time of CML and CO.
To evaluate type I error rates under the null model , we fixed the sample size at 50,000 and simulated variants of which 99.9% had no marginal genetic effect and the other 0.1% had marginal genetic effects with an odds ratio of 1.4. This corresponds to having 1,000 causal variants in an analysis with one million variants. We compared empirical type I error rates of Wald, Firth’s test, SPAGE-NoSPA, and SPAGE at significance levels and . In addition, we also evaluated SPAGE when the marginal genetic odds ratio ranged from 1.1 to 1.5. Due to the heavy computational burden, it is practically impossible to perform Wald and Firth’s test times. Following Dey et al.,12 we performed a hybrid approach in which we used Wald and the Firth’s test to calculate p values only when the SPAGE p values were smaller than .
To evaluate powers under the alternative model, we fixed the sample size at 50,000, considered a wide range of , and simulated variants for each choice of . We compared the empirical powers of SPAGE, Wald, and Firth’s tests at significance levels and . We also evaluated CML and empirical Bayes (EB) implemented in CGEN. Note that all datasets were simulated following a cohort study design.
Application to UK Biobank Data
To illustrate the performance in a real-data application, we applied the proposed methods to UK Biobank. Environmental factors and phenotypes were defined based on UK Biobank field ID (FID) and PheWAS codes (PheCodes), respectively.12,13,15,16,18,40 We selected 79 pairs of environmental factors and phenotypes (see Table S3) including five environmental factors: smoking status (FID: 20116), vigorous physical activity (FID: 904), moderate physical activity (FID: 804), gender (FID: 31), and alcohol intake frequency (FID: 1558). More details about these environmental factors can be seen in Supplementary Methods in the Supplemental Data.
We randomly selected 344,341 unrelated samples from white British participants and restricted our analysis to markers directly genotyped or imputed by the Haplotype Reference Consortium (HRC)41 panel due to quality control issues of non-HRC markers reported by UK Biobank. Approximately 28 million markers with minor allele counts (MAC) of at least 20 and imputation info scores > 0.3 were used in the analysis. For each binary phenotype, we further removed markers with fewer than five minor alleles among the cases.
We incorporated the first four principal components plus birth year, gender, and the environmental factor of interest as covariates to fit null models. Smoking status was encoded to numeric variables of 0, 1, and 2 to represent never, former, and current smoker, respectively. Vigorous and moderate physical activities were encoded to categorical variables ranging from 0 to 7 based on the number of days per week the individual exercised for 10+ minutes. Alcohol intake frequency was encoded to categorical variables ranging from 1 (daily or almost daily) to 6 (never). When fitting a null model, we considered the physical activities and alcohol intake frequency as categorical variables in order to avoid inaccurate type I error rates because most complex diseases were not additively affected by these variables.42 On the other hand, when calculating G × E interaction vector , we considered the physical activities and alcohol intake frequency as numeric variables in order to avoid testing with multiple degrees of freedom. We applied the SPAGE-NoSPA and SPAGE methods to the genome-wide analyses for all 79 pairs of environmental factors and phenotypes, and we used Wald and Firth’s test for only one pair of alcohol and colorectal cancer. In addition, we also used Wald and Firth’s test for all variants identified by SPAGE method at a significance level of .
Results
Comparison of Computation Time
The projected computation time for testing 1,500 phenotypes across 20 million variants via different methods is presented in Figure 1 and Table S1, which shows that SPAGE performed 72–439 times faster than Firth’s test and 33–79 times faster than Wald test. CO took similar time as SPAGE did when the case-control ratio was 1:99, and both were much faster than CML. This is because CO uses case samples only, but CML uses both case and control samples. For example, in an unbalanced case-control setup of 4,000 cases and 396,000 controls, when analyzing 20 million variants across 1,500 phenotypes while adjusting for 15 covariate variables, Firth’s test, Wald test, and CML would require 13,032, 3,010, and 4,517 CPU years, respectively, whereas SPAGE and CO would require only 48.7 and 40.9 CPU years. Hence, SPAGE and CO required 18 days and 15 days (without data reading) on a cluster with 1,000 CPU cores, but Firth’s test, Wald, and CML needed 12.96, 3, and 4.5 years, respectively. Interestingly, when case-control ratio was 1:9, SPAGE was 3.5 times faster than CO (47.57 versus 166.34 CPU years), although CO used only 10% of samples. This may be due to the fact that the model fitting of a polytomous regression is generally slow. Both Wald and Firth’s test took more time when case-control ratio was more unbalanced. This is because the regression took more iteration steps to get a converged parameter estimation (see Supplementary Methods in Supplemental Data).
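The conversion from CPU years to wall-clock time on a 1,000-core cluster is simple division. The sketch below uses the CPU-year figures quoted above and assumes perfect parallel scaling and a 365-day year, so its output can differ slightly from the rounded values in the text.

```python
CORES = 1000            # cluster size used in the text
DAYS_PER_YEAR = 365     # assumption

# Projected CPU years for 20 million variants x 1,500 phenotypes (figures from the text).
cpu_years = {"Firth": 13032.0, "Wald": 3010.0, "CML": 4517.0, "SPAGE": 48.7, "CO": 40.9}

wall_years = {m: cy / CORES for m, cy in cpu_years.items()}
wall_days = {m: wy * DAYS_PER_YEAR for m, wy in wall_years.items()}

for method in cpu_years:
    print(f"{method:6s} {wall_years[method]:7.2f} years  ({wall_days[method]:8.1f} days)")
```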
Type I Error Simulation Results
The empirical type I error rates based on the simulated variants are presented in Figure 2 and Table S2. Note that 0.1% of the variants were simulated with a nonzero marginal genetic effect (odds ratio of 1.4). Under balanced and moderately unbalanced case-control ratios, SPAGE and Firth’s test controlled type I error rates for common, low-frequency, and rare variants alike. Meanwhile, Wald had deflated type I error rates and SPAGE-NoSPA had inflated type I error rates, especially when testing rare variants with a MAF of 0.001. Under an unbalanced case-control ratio, the deflation for Wald and the inflation for SPAGE-NoSPA were more pronounced, while SPAGE and Firth’s test still controlled type I error rates reasonably well.
The empirical type I error rates for simulated variants with varying marginal genetic effects are presented in Figure S1. If the hybrid strategy was not used to adjust the statistics (denoted as RAW SPAGE and RAW SPAGE-NoSPA), the type I error rates increased slowly but steadily as the marginal genetic effect increased. In these situations, the hybrid strategy provided better control of type I error rates.
Power Simulation Results
Next, we compared the empirical powers of SPAGE, Wald, Firth’s test, CML, CO, and EB. Because SPAGE, Wald, and Firth’s test are all based on a prospective likelihood, the empirical powers of these three methods were similar (see Figure 3 and Figure S2). Only when we tested low-frequency or rare variants under an unbalanced case-control ratio was Firth’s test more powerful than SPAGE; both of these were more powerful than Wald. Although more powerful, Firth’s test and SPAGE still required a very large effect size to detect a low-frequency or rare variant in G × E analysis; such effect sizes are not commonly observed in practical applications.
The results of power comparisons of SPAGE, CML, and EB are presented in Figure 4 and Figure S3. Interestingly, the power of SPAGE was generally larger than that of CML and EB. The differences among SPAGE, CML, and EB depended on the case-control ratio, i.e., the prevalence of disease in the cohort. When the case-control ratio was 1:1 or 1:9, SPAGE was more powerful than CML regardless of minor allele frequencies and effect sizes. When the case-control ratio was 1:99, the powers of CML, EB, and SPAGE were similar. The advantage of SPAGE over CML and EB was mainly due to the fact that the data were simulated under a cohort study design instead of a case-control study design. Figure S4 compares empirical powers of the SPAGE, CML, and EB methods under different study designs. Under a case-control study design, CML and EB were more powerful than SPAGE, and under a cohort study design, SPAGE was more powerful than EB and CML. We do not show CO because its power was nearly identical to that of CML. When testing low-frequency or rare variants with moderate or high G × E effects, CO was generally unstable, especially when the case-control ratio was 1:99 (see Figure S5). In terms of power (assuming G-E independence), we still recommend the CML and EB methods under a case-control study design, and we recommend the SPAGE method under a cohort-based study design.
Application to UK Biobank Data
We applied the proposed SPAGE to UK Biobank to analyze 79 combinations of environmental factors and phenotypes. At the genome-wide significance level, 34 significant G × E signals were identified (see Table S4 for a complete list). Under a Bonferroni-corrected threshold, only one signal remained (rs1906609, p = $1.42 \times 10^{-12}$; the environmental exposure is gender and the phenotype is CDR). Because some phenotypes are strongly correlated, the Bonferroni correction is likely to be overly conservative. Three combinations are highlighted: smoking status and CAO (8,701 cases and 314,750 controls), vigorous physical activity and HLD (27,622 cases and 299,859 controls), and gender and CDR (20,754 cases and 320,152 controls). The complete genome-wide summary statistics for all 79 combinations can be found on our website (see Web Resources).
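The effect of the multiple-testing correction can be illustrated with the three top SNPs reported in Table 1. The sketch below assumes the conventional genome-wide threshold of 5e-8 and a Bonferroni correction that divides it by the 79 analyzed pairs; both are assumptions for illustration, since the exact thresholds are not restated here.

```python
GENOME_WIDE_ALPHA = 5e-8   # assumption: conventional genome-wide significance level
N_PAIRS = 79               # environment-phenotype combinations analyzed

bonferroni = GENOME_WIDE_ALPHA / N_PAIRS

# SPAGE p values of the top SNPs as reported in Table 1.
top_hits = {"rs55781567": 2.87e-08, "rs1906609": 1.42e-12, "rs10950866": 3.64e-09}

surviving = [rsid for rsid, p in top_hits.items() if p < bonferroni]
print(f"Bonferroni threshold: {bonferroni:.2e}; surviving SNPs: {surviving}")
```

Under these assumed thresholds only the gender × CDR signal passes the corrected level, in line with the single remaining signal described above.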
The Manhattan plots (Figure 5) and the QQ-plots (Figure S6) showed that SPAGE-NoSPA produced a large number of potentially spurious associations for G × E association analyses, especially when testing low-frequency and rare variants, whereas the p values of SPAGE closely followed a uniform distribution. At the significance threshold used, we identified several G × E signals. The top SNPs and a complete list of SNPs whose p values fell below the reporting threshold are presented in Table 1 and Table S5, respectively. For each of the top SNPs, the overall and stratified associations of phenotype × genotype and phenotype × environmental factors are presented in Figure 6 and Figure S7, respectively. In addition, we reported the p values of Wald and Firth’s test for the top SNPs. For common and low-frequency variants, the p values of Wald and Firth’s test were similar to the p value of SPAGE, and for rare variants, the p values of Wald were larger than the p values of SPAGE and Firth’s test. For the pair of Alcohol × Colorectal Cancer, QQ-plots and Manhattan plots of Wald and Firth’s test can be seen in Figure S8, which shows that the Wald test was conservative when testing low-frequency and rare variants, whereas the p values of Firth’s test closely followed a uniform distribution. These results are consistent with the simulation results.
Table 1.
Environ. Factor | Phenotype | RSID | CHR | Imputation Info | MAF | p Value (G effect)∗ | p Value (SPAGE) | p Value (Firth) | p Value (Wald) | Func.refGene | Gene.refGene |
---|---|---|---|---|---|---|---|---|---|---|---|
Smoking status | chronic airway obstruction | rs55781567 | chr15 | 1 | 0.3343 | 7.76E-12 | 2.87E-08 | 2.55E-08 | 2.64E-08 | UTR5 | CHRNA5 |
Gender | cardiac dysrhythmias | rs1906609 | chr4 | 0.99 | 0.1612 | 3.06E-68 | 1.42E-12 | 9.12E-13 | 1.11E-12 | intergenic | PITX2; C4orf32 |
Vigorous physical activity | hyperlipidemia | rs10950866 | chr7 | 0.99 | 0.4230 | 0.2568 | 3.64E-09 | 3.82E-09 | 3.75E-09 | intronic | DNAH11 |
∗ p value of the marginal genetic effect.
In the analysis of CAO, we identified a significant G × E effect of smoking status and a variant rs55781567 in CHRNA5 (MIM: 118505). The allele G of the variant rs55781567 is a risk allele in the whole population, and its risk effect is significantly larger for smokers. Smoking is an important risk factor for CAO, and CHRNA5 is well known to be associated with smoking behavior and some smoking-related diseases such as chronic obstructive pulmonary disease.43, 44, 45 In the analysis of CDR, we identified a significant G × E effect of gender and a variant rs1906609 near PITX2 (MIM: 601542). The allele G of the variant rs1906609 is a protective allele in the whole population (p < $1 \times 10^{-100}$), and its effect in males (p < $1 \times 10^{-100}$) is significantly larger than that in females (p = $6.7 \times 10^{-8}$). The gene PITX2 plays an important role in cardiac development and diseases, and the incidence of cardiac arrhythmias is known to be different for males and females.46,47 In the analysis of HLD, we identified a significant G × E effect of vigorous physical activity and a variant rs10950866 in DNAH11 (MIM: 603339). This variant is not significantly associated with HLD in the whole population (p = 0.28), but its G allele is a protective allele for people who take vigorous exercise more than two days per week (p = $4.1 \times 10^{-6}$). The gene DNAH11 has been reported to be associated with serum lipid levels.48,49
Discussion
In this paper, we have proposed SPAGE, an accurate and scalable method to perform a genome-wide scale phenome-wide G × E analysis for binary phenotypes in large cohorts. SPAGE can adjust for covariates and accurately calibrate p values regardless of minor allele frequencies even in extremely unbalanced case-control settings. Through extensive numerical studies, we have demonstrated that SPAGE can perform 33–79 times faster than the Wald test and 72–439 times faster than the Firth’s test while retaining similar powers and well-controlled type I error rates. Because SPAGE is based on a prospective likelihood method, the genotype-environment independence assumption is not required. The UK Biobank data analysis illustrates that SPAGE can identify G × E signals while controlling for type I error rates, even for binary phenotypes with a small number of cases and a large number of controls.
The current G × E approaches need to fit a null model or a complete model, both of which require adjusting for genotypes separately for each variant. Our method fits a genotype-independent logistic model only once across a genome-wide analysis and then uses a hybrid strategy to exclude marginal genetic effects from the G × E effect. This strategy greatly reduces the computation time so that it is computationally feasible for SPAGE to analyze a large cohort. To calibrate p values, we utilize the SPA when the test statistics deviate from the mean value by more than a pre-specified standard-deviation threshold. Here, we follow the recommendation of Dey et al. to use a threshold of two.12 Both simulations and application to UK Biobank data showed that the SPA (i.e., SPAGE) performs better than the normal approximation (i.e., SPAGE-NoSPA), so we recommend using SPAGE.
The three commonly used methods for G × E analysis include case-control, CO (or CML), and EB approaches.5 Of these, the powers of the case-control approaches increase as the control group size increases. Following a cohort study design, large biobanks collect far more controls than cases for most diseases. In this situation, as a case-control approach, SPAGE can be more powerful than the other methods while remaining computationally efficient. A case-only approach can be a scalable method if the number of cases is moderate or small. However, because the case-only approach requires the gene-environment independence assumption, it cannot be as robust as SPAGE.
Several two-step approaches have been proposed to improve the efficiency of G × E analysis. However, if the screening step tests the marginal genetic effect or gene-environment independence, it could miss some potential G × E effects and cannot generate genome-wide summary statistics. As an accurate and scalable solution, the proposed SPAGE can calculate the genome-wide summary statistics of G × E, which can be of great value for the G × E community. First, phenome-wide G × E analysis can utilize the G × E statistics across multiple phenotypes to provide evidence for pleiotropy. Second, meta-analysis can use the G × E statistics across different studies to improve the power. Third, the genome-wide summary statistics can also facilitate a two-stage discovery-validation study.
Family relatedness is commonly observed in large biobank datasets. To adjust for the sample relatedness, a generalized linear mixed model (GLMM) is widely used.50, 51, 52 BOLT-LMM and SAIGE methods used several optimization strategies so that the GLMM could be computationally feasible for large cohorts.15,52 In the future, we plan to extend the current method to a genome-wide scale G × E analysis with related samples. Another future research area of interest is to design an accurate and fast algorithm to identify rare variants with G × E effect based on gene- or region-based multiple-variant tests.
In summary, we have proposed an accurate and scalable method for genome-wide scale phenome-wide G × E analysis. Large-scale biobanks can be great resources for identifying G × E effects across the genome-wide scale. Our SPAGE method provides a scalable solution for this large-scale problem and contributes to finding novel G × E effects of complex disease. All of our tests are implemented in an R package SPAGE.
Declaration of Interests
The authors declare no competing interests.
Acknowledgments
This research has been conducted using the UK Biobank Resource under application number 45227. S.L. and W.B. were supported by National Institutes of Health grant number R01 HG008773.
Published: November 14, 2019
Footnotes
Supplemental Data can be found online at https://doi.org/10.1016/j.ajhg.2019.10.008.
Appendix A. The Approximation of
A naive approach is to use to approximate and then to test the G × E marginal effect . However, this strategy ignores the main genetic effect on phenotype and is only valid under a constrained model . To better approximate , we adjust for the main genetic effect by deducting . From the first-order Taylor expansion,
The above equation also implies that . We assume that the weight matrix changes slowly with respect to the conditional mean (following Breslow and Clayton53), then our estimate of the variance of S is .
Web Resources
CGEN R package, https://bioconductor.org/packages/release/bioc/html/CGEN.html
Genome-wide summary statistics, https://www.leelabsg.org/resources
GxEScan R Package, https://github.com/USCbiostats/GxEScanR
OMIM, https://www.omim.org
SPAGE R package, https://github.com/WenjianBI/SPAGE
UK Biobank, https://www.ukbiobank.ac.uk/
References
- 1. Hunter D.J. Gene-environment interactions in human diseases. Nat. Rev. Genet. 2005;6:287–298. doi: 10.1038/nrg1578.
- 2. Thomas D. Gene-environment-wide association studies: emerging approaches. Nat. Rev. Genet. 2010;11:259–272. doi: 10.1038/nrg2764.
- 3. Thompson W.D. Effect modification and the limits of biological inference from epidemiologic data. J. Clin. Epidemiol. 1991;44:221–232. doi: 10.1016/0895-4356(91)90033-6.
- 4. Le Marchand L., Wilkens L.R. Design considerations for genomic association studies: importance of gene-environment interactions. Cancer Epidemiol. Biomarkers Prev. 2008;17:263–267. doi: 10.1158/1055-9965.EPI-07-0402.
- 5. Gauderman W.J., Mukherjee B., Aschard H., Hsu L., Lewinger J.P., Patel C.J., Witte J.S., Amos C., Tai C.G., Conti D. Update on the state of the science for analytical methods for gene-environment interactions. Am. J. Epidemiol. 2017;186:762–770. doi: 10.1093/aje/kwx228.
- 6. McAllister K., Mechanic L.E., Amos C., Aschard H., Blair I.A., Chatterjee N., Conti D., Gauderman W.J., Hsu L., Hutter C.M. Current challenges and new opportunities for gene-environment interaction studies of complex diseases. Am. J. Epidemiol. 2017;186:753–761. doi: 10.1093/aje/kwx227.
- 7. Simonds N.I., Ghazarian A.A., Pimentel C.B., Schully S.D., Ellison G.L., Gillanders E.M., Mechanic L.E. Review of the gene-environment interaction literature in cancer: What do we know? Genet. Epidemiol. 2016;40:356–365. doi: 10.1002/gepi.21967.
- 8. Thomas D. Methods for investigating gene-environment interactions in candidate pathway and genome-wide association studies. Annu. Rev. Public Health. 2010;31:21–36. doi: 10.1146/annurev.publhealth.012809.103619.
- 9. Ritz B.R., Chatterjee N., Garcia-Closas M., Gauderman W.J., Pierce B.L., Kraft P., Tanner C.M., Mechanic L.E., McAllister K. Lessons learned from past gene-environment interaction successes. Am. J. Epidemiol. 2017;186:778–786. doi: 10.1093/aje/kwx230.
- 10. Bush W.S., Oetjens M.T., Crawford D.C. Unravelling the human genome-phenome relationship using phenome-wide association studies. Nat. Rev. Genet. 2016;17:129–145. doi: 10.1038/nrg.2015.36.
- 11. Denny J.C., Ritchie M.D., Basford M.A., Pulley J.M., Bastarache L., Brown-Gentry K., Wang D., Masys D.R., Roden D.M., Crawford D.C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010;26:1205–1210. doi: 10.1093/bioinformatics/btq126.
- 12. Dey R., Schmidt E.M., Abecasis G.R., Lee S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am. J. Hum. Genet. 2017;101:37–49. doi: 10.1016/j.ajhg.2017.05.014.
- 13. Dey R., Nielsen J.B., Fritsche L.G., Zhou W., Zhu H., Willer C.J., Lee S. Robust meta-analysis of biobank-based genome-wide association studies with unbalanced binary phenotypes. Genet. Epidemiol. 2019;43:462–476. doi: 10.1002/gepi.22197.
- 14. Nielsen J.B., Thorolfsdottir R.B., Fritsche L.G., Zhou W., Skov M.W., Graham S.E., Herron T.J., McCarthy S., Schmidt E.M., Sveinbjornsson G. Biobank-driven genomic discovery yields new insight into atrial fibrillation biology. Nat. Genet. 2018;50:1234–1239. doi: 10.1038/s41588-018-0171-3.
- 15. Zhou W., Nielsen J.B., Fritsche L.G., Dey R., Gabrielsen M.E., Wolford B.N., LeFaive J., VandeHaar P., Gagliano S.A., Gifford A. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 2018;50:1335–1341. doi: 10.1038/s41588-018-0184-y.
- 16. Sudlow C., Gallacher J., Allen N., Beral V., Burton P., Danesh J., Downey P., Elliott P., Green J., Landray M. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779.
- 17. Denny J.C., Bastarache L., Roden D.M. Phenome-wide association studies as a tool to advance precision medicine. Annu. Rev. Genomics Hum. Genet. 2016;17:353–373. doi: 10.1146/annurev-genom-090314-024956.
- 18. Denny J.C., Bastarache L., Ritchie M.D., Carroll R.J., Zink R., Mosley J.D., Field J.R., Pulley J.M., Ramirez A.H., Bowton E. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 2013;31:1102–1110. doi: 10.1038/nbt.2749.
- 19. Wolford B.N., Willer C.J., Surakka I. Electronic health records: the next wave of complex disease genetics. Hum. Mol. Genet. 2018;27(R1):R14–R21. doi: 10.1093/hmg/ddy081.
- 20. Han S.S., Rosenberg P.S., Ghosh A., Landi M.T., Caporaso N.E., Chatterjee N. An exposure-weighted score test for genetic associations integrating environmental risk factors. Biometrics. 2015;71:596–605. doi: 10.1111/biom.12328.
- 21. Song M., Wheeler W., Caporaso N.E., Landi M.T., Chatterjee N. Using imputed genotype data in the joint score tests for genetic association and gene-environment interactions in case-control studies. Genet. Epidemiol. 2018;42:146–155. doi: 10.1002/gepi.22093.
- 22. Han S.S., Chatterjee N. Review of statistical methods for gene-environment interaction analysis. Curr. Epidemiol. Rep. 2018;5:39–45.
- 23. Bhattacharjee S., Chatterjee N., Wheeler W. 2010. CGEN: An R package for analysis of case-control studies in genetic epidemiology.
- 24. Morrison J., Kim A.E., Gauderman J. University of Southern California, Los Angeles. 2018. GxEScanR: An R package to detect GxE interactions in a genomewide association study.
- 25. Gauderman W.J., Thomas D.C., Murcray C.E., Conti D., Li D., Lewinger J.P. Efficient genome-wide association testing of gene-environment interaction in case-parent trios. Am. J. Epidemiol. 2010;172:116–122. doi: 10.1093/aje/kwq097.
- 26. Murcray C.E., Lewinger J.P., Gauderman W.J. Gene-environment interaction in genome-wide association studies. Am. J. Epidemiol. 2009;169:219–226. doi: 10.1093/aje/kwn353.
- 27. Murcray C.E., Lewinger J.P., Conti D.V., Thomas D.C., Gauderman W.J. Sample size requirements to detect gene-environment interactions in genome-wide association studies. Genet. Epidemiol. 2011;35:201–210. doi: 10.1002/gepi.20569.
- 28. Kooperberg C., Leblanc M. Increasing the power of identifying gene x gene interactions in genome-wide association studies. Genet. Epidemiol. 2008;32:255–263. doi: 10.1002/gepi.20300.
- 29. Dai J.Y., Kooperberg C., Leblanc M., Prentice R.L. Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika. 2012;99:929–944. doi: 10.1093/biomet/ass044.
- 30. Hsu L., Jiao S., Dai J.Y., Hutter C., Peters U., Kooperberg C. Powerful cocktail methods for detecting genome-wide gene-environment interaction. Genet. Epidemiol. 2012;36:183–194. doi: 10.1002/gepi.21610.
- 31. Winham S.J., Biernacka J.M. Gene-environment interactions in genome-wide association studies: current approaches and new directions. J. Child Psychol. Psychiatry. 2013;54:1120–1134. doi: 10.1111/jcpp.12114.
- 32. Van der Auwera S., Peyrot W.J., Milaneschi Y., Hertel J., Baune B., Breen G., Byrne E., Dunn E.C., Fisher H., Homuth G., Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium. Genome-wide gene-environment interaction in depression: A systematic evaluation of candidate genes: The childhood trauma working-group of PGC-MDD. Am. J. Med. Genet. B. Neuropsychiatr. Genet. 2018;177:40–49. doi: 10.1002/ajmg.b.32593.
- 33. Rask-Andersen M., Karlsson T., Ek W.E., Johansson Å. Gene-environment interaction study for BMI reveals interactions between genetic factors and physical activity, alcohol consumption and socioeconomic status. PLoS Genet. 2017;13:e1006977. doi: 10.1371/journal.pgen.1006977.
- 34. Chatterjee N., Carroll R.J. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005;92:399–418.
- 35. Bhattacharjee S., Wang Z., Ciampa J., Kraft P., Chanock S., Yu K., Chatterjee N. Using principal components of genetic variation for robust and powerful detection of gene-gene interactions in case-control and case-only studies. Am. J. Hum. Genet. 2010;86:331–342. doi: 10.1016/j.ajhg.2010.01.026.
- 36. Piegorsch W.W., Weinberg C.R., Taylor J.A. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat. Med. 1994;13:153–162. doi: 10.1002/sim.4780130206.
- 37. Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika. 1999;86:929–935.
- 38. Bi W., Kang G., Pounds S.B. Statistical selection of biological models for genome-wide association analyses. Methods. 2018;145:67–75. doi: 10.1016/j.ymeth.2018.05.019.
- 39. Dey R., Lee S. Technical Note: Efficient and accurate estimation of genotype odds ratios in biobank-based unbalanced case-control studies. bioRxiv. 2019.
- 40. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 41. McCarthy S., Das S., Kretzschmar W., Delaneau O., Wood A.R., Teumer A., Kang H.M., Fuchsberger C., Danecek P., Sharp K., Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643.
- 42. He Z., Zhang M., Lee S., Smith J.A., Kardia S.L.R., Diez Roux A.V., Mukherjee B. Set-based tests for the gene–environment interaction in longitudinal studies. J. Am. Stat. Assoc. 2017;112:966–978. doi: 10.1080/01621459.2016.1252266.
- 43. Jensen K.P., DeVito E.E., Herman A.I., Valentine G.W., Gelernter J., Sofuoglu M. A CHRNA5 smoking risk variant decreases the aversive effects of nicotine in humans. Neuropsychopharmacology. 2015;40:2813–2821. doi: 10.1038/npp.2015.131.
- 44. Lassi G., Taylor A.E., Timpson N.J., Kenny P.J., Mather R.J., Eisen T., Munafò M.R. The CHRNA5-A3-B4 gene cluster and smoking: From discovery to therapeutics. Trends Neurosci. 2016;39:851–861. doi: 10.1016/j.tins.2016.10.005.
- 45. Wang J., Spitz M.R., Amos C.I., Wilkinson A.V., Wu X., Shete S. Mediating effects of smoking and chronic obstructive pulmonary disease on the relation between the CHRNA5-A3 genetic locus and lung cancer risk. Cancer. 2010;116:3458–3462. doi: 10.1002/cncr.25085.
- 46. Villareal R.P., Woodruff A.L., Massumi A. Gender and cardiac arrhythmias. Tex. Heart Inst. J. 2001;28:265–275.
- 47. Wolbrette D., Naccarelli G., Curtis A., Lehmann M., Kadish A. Gender differences in arrhythmias. Clin. Cardiol. 2002;25:49–56. doi: 10.1002/clc.4950250203.
- 48. Shen S.-W., Yin R.-X., Huang F., Wu J.-Z., Cao X.-L., Chen W.-X. DNAH11 rs12670798 variant and G × E interactions on serum lipid levels, coronary heart disease, ischemic stroke and the lipid-lowering efficacy of atorvastatin. Int. J. Clin. Exp. Pathol. 2017;10:11147–11158.
- 49. Zhou Y.-G., Yin R.-X., Wu J., Zhang Q.-H., Chen W.-X., Cao X.-L. The association between the DNAH11 rs10248618 SNP and serum lipid traits, the risk of coronary artery disease, and ischemic stroke. Int. J. Clin. Exp. Pathol. 2018;11:4585–4594.
- 50. Chen H., Wang C., Conomos M.P., Stilp A.M., Li Z., Sofer T., Szpiro A.A., Chen W., Brehm J.M., Celedón J.C. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Hum. Genet. 2016;98:653–666. doi: 10.1016/j.ajhg.2016.02.012.
- 51. Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012;44:821–824. doi: 10.1038/ng.2310.
- 52. Loh P.R., Tucker G., Bulik-Sullivan B.K., Vilhjálmsson B.J., Finucane H.K., Salem R.M., Chasman D.I., Ridker P.M., Neale B.M., Berger B. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190.
- 53. Breslow N.E., Clayton D.G. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 1993;88:9–25.