Bioinformatics. 2018 May 3;34(19):3412–3414. doi: 10.1093/bioinformatics/bty379

VarExp: estimating variance explained by genome-wide GxE summary statistics

Vincent Laville 1, Amy R Bentley 2, Florian Privé 1,3, Xiaofeng Zhu 4, Jim Gauderman 5, Thomas W Winkler 6, Mike Province 7, D C Rao 8, Hugues Aschard 1
Editor: Oliver Stegle
PMCID: PMC6157079  PMID: 29726908

Abstract

Summary

Many genome-wide association studies and genome-wide screenings for gene–environment (GxE) interactions have been performed to elucidate the underlying mechanisms of human traits and diseases. When the analyzed outcome is quantitative, the overall contribution of identified genetic variants to the outcome is often expressed as the percentage of phenotypic variance explained. This is commonly done using individual-level genotype data, but it is challenging when results are derived through meta-analyses. Here, we present an R package, ‘VarExp’, that allows for the estimation of the percentage of phenotypic variance explained using summary statistics only. It allows for a range of models to be evaluated, including marginal genetic effects, GxE interaction effects and both effects jointly. Its implementation integrates all recent methodological developments and does not need external data to be uploaded by users.

Availability and implementation

The R package is available at https://gitlab.pasteur.fr/statistical-genetics/VarExp.git.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Many genome-wide association studies (GWAS) and genome-wide screenings incorporating gene–environment (GxE) interactions (Aschard et al., 2012) have been performed to better understand the mechanisms underlying human traits and diseases. When the analyzed outcome is continuous, a commonly used measure of the overall impact of the significant associations is the percentage of phenotypic variance explained. A standard way of estimating this percentage is to compare the coefficients of determination of models that include or exclude the significantly associated variants and/or interactions. This requires individual-level genotype and phenotype data, which can be difficult to obtain in meta-analyses performed in large consortia, as pooling data from multiple cohorts raises practical and ethical issues. An alternative is to use only GWAS or genome-wide GxE summary statistics. Recently, several methods (Pare et al., 2016; Shi et al., 2016) have been developed to estimate the variance explained by marginal genetic effects while taking into account linkage disequilibrium between variants and addressing statistical issues related to finite sample size and single nucleotide polymorphism (SNP) correlation matrices. Yet these works focused only on marginal genetic effects, while genome-wide GxE and joint-effect GWAS are now commonly performed and face the same need. In this work, we address this gap by extending the methodology to GxE screening and implementing the R package VarExp to rapidly and easily estimate the percentage of phenotypic variance explained by variants and/or interactions of interest using only meta-analysis summary statistics from GWAS.

2 Materials and methods

2.1 Percentage of variance explained

Consider a set of K SNPs $(G_k)_{k=1}^{K}$, coded additively as {0, 1, 2}, and a quantitative outcome Y. The marginal genetic effect $\alpha_{G.k}$ of SNP $G_k$ is estimated in the marginal model:

$$Y = \alpha_0 + \alpha_{G.k} G_k + \varepsilon.$$

Shi et al. (2016) proposed a first naïve estimator of the variance explained by genetic effects, $f_G$, using summary statistics:

$$f_G = \alpha_G^T \Sigma^{*} \alpha_G / \mathrm{var}(Y)$$

where $\alpha_G = (\alpha_{G.1}\sigma_1, \ldots, \alpha_{G.k}\sigma_k, \ldots, \alpha_{G.K}\sigma_K)^T$, $\sigma_k$ denotes the standard deviation of SNP $G_k$ and $\Sigma^{*}$ is the Moore–Penrose generalized inverse of the genotype correlation matrix. However, finite sample size implies statistical noise in the estimates of both the effect sizes and the correlation matrix, which can bias the estimation of $f_G$. Shi et al. (2016) derived a general formula that addresses this issue:

$$f_G^{*} = \frac{N \times \alpha_G^T \Sigma^{*} \alpha_G - q}{(N - q) \times \mathrm{var}(Y)}$$

where N and q denote, respectively, the sample size and the rank of the correlation matrix.
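As a minimal sketch of the two estimators above, the following Python code computes the naïve and corrected fractions from summary statistics. It is not the VarExp implementation: the function names (`solve`, `variance_explained`) are hypothetical, and the sketch assumes that the correlation matrix has full rank (for instance after pruning), in which case the Moore–Penrose inverse coincides with the ordinary inverse and q equals the number of SNPs.

```python
from typing import List, Tuple


def solve(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular; prune SNPs first")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def variance_explained(
    effects: List[float],
    snp_sds: List[float],
    corr: List[List[float]],
    var_y: float,
    n_samples: int,
) -> Tuple[float, float]:
    """Return (naive f_G, corrected f_G*) as written in the text."""
    # Standardised effects: alpha_G.k * sigma_k.
    alpha = [a * s for a, s in zip(effects, snp_sds)]
    # Quadratic form alpha^T Sigma^{-1} alpha, obtained by solving Sigma x = alpha.
    x = solve(corr, alpha)
    quad = sum(a * xi for a, xi in zip(alpha, x))
    q = len(alpha)  # assumption: full-rank matrix, so rank == number of SNPs
    naive = quad / var_y
    corrected = (n_samples * quad - q) / ((n_samples - q) * var_y)
    return naive, corrected


# Illustrative (made-up) inputs: two SNPs with correlation 0.5.
naive, corrected = variance_explained(
    effects=[0.2, 0.1],
    snp_sds=[0.5, 0.6],
    corr=[[1.0, 0.5], [0.5, 1.0]],
    var_y=1.0,
    n_samples=10000,
)
print(naive, corrected)
```

With these inputs the corrected value is slightly smaller than the naïve one, since the correction removes the contribution expected from estimation noise alone.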

Now consider an exposure E (either binary or quantitative). The main effect $\alpha_{G.k}$ of the SNP $G_k$ and the interaction effect $\alpha_{INT.k}$ can be estimated using a single-SNP model with an interaction term:

$$Y = \alpha_0 + \alpha_{G.k} G_k + \alpha_E E + \alpha_{INT.k} G_k \times E + \varepsilon.$$

We show in Supplementary Material that, when re-parameterizing the effect estimates of the above model to obtain parameters from a fully standardized model, the percentage of variance explained by interaction effects, $f_I$, or jointly by genetic and interaction effects, $f_{G+I}$, can also be derived using summary statistics only:

$$f_I = \alpha_{INT}^T \Sigma^{*} \alpha_{INT} / \mathrm{var}(Y)$$
$$f_{G+I} = f_G + f_I$$

where $\alpha_{INT} = (\alpha_{INT.1}\sigma_1\sigma_E, \ldots, \alpha_{INT.k}\sigma_k\sigma_E, \ldots, \alpha_{INT.K}\sigma_K\sigma_E)^T$ and $\sigma_E$ is the standard deviation of E. In this model, $f_G$ is computed using effect sizes from the interaction model. However, for the reasons discussed above, we define our final estimators, $f_I^{*}$ and $f_{G+I}^{*}$, by applying the same corrections as proposed for the $f_G$ estimator by Shi et al. (see the $f_G^{*}$ equation).
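The uncorrected interaction estimators can be illustrated with a small worked example for two SNPs, where the inverse of the 2×2 correlation matrix has a closed form. This is only a sketch under stated assumptions: the function names are hypothetical, the numbers are made up, and the main effects are assumed to have already been re-parameterized to the standardized model described in the Supplementary Material (that step is not reproduced here). The finite-sample correction is not applied.

```python
from typing import List, Tuple


def quad_form_two_snps(a1: float, a2: float, r: float) -> float:
    """alpha^T Sigma^{-1} alpha for Sigma = [[1, r], [r, 1]]."""
    if abs(r) >= 1.0:
        raise ValueError("SNPs are perfectly correlated; prune one of them")
    return (a1 * a1 - 2.0 * r * a1 * a2 + a2 * a2) / (1.0 - r * r)


def interaction_fractions(
    main: List[float],
    inter: List[float],
    snp_sds: List[float],
    sd_e: float,
    r: float,
    var_y: float,
) -> Tuple[float, float, float]:
    """Return the naive (f_G, f_I, f_G+I) for exactly two SNPs."""
    # Main effects are scaled by sigma_k only.
    alpha_g = [a * s for a, s in zip(main, snp_sds)]
    # Interaction effects are scaled by sigma_k * sigma_E.
    alpha_int = [a * s * sd_e for a, s in zip(inter, snp_sds)]
    f_g = quad_form_two_snps(alpha_g[0], alpha_g[1], r) / var_y
    f_i = quad_form_two_snps(alpha_int[0], alpha_int[1], r) / var_y
    return f_g, f_i, f_g + f_i


# Illustrative (made-up) summary statistics from an interaction model.
f_g, f_i, f_joint = interaction_fractions(
    main=[0.2, 0.1],
    inter=[0.1, -0.05],
    snp_sds=[0.5, 0.6],
    sd_e=2.0,
    r=0.5,
    var_y=1.0,
)
print(f_g, f_i, f_joint)
```

Note that the interaction effects of opposite sign at positively correlated SNPs yield a larger quadratic form than the same magnitudes with equal signs would, which is why the correlation matrix cannot be ignored.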

2.2 Estimating the genotype correlation matrix

When the genotype correlation matrix is not available from the data, it can be estimated using genotype data from a reference panel such as the 1000 Genomes (Abecasis et al., 2012). We implemented a transparent function that derives this correlation matrix from 1000 Genomes Phase 3 data, either through web access for a small number of SNPs or from local data files for a larger number of SNPs, as computational time is dramatically reduced when querying local files (see Supplementary Material and Supplementary Fig. S3). To avoid matrix inversion issues, we also implemented an option to prune SNPs that have a perfect correlation of 1 with another SNP in the matrix.
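The idea of estimating the correlation matrix from reference genotypes and then pruning perfectly correlated SNPs can be sketched as follows. This is an illustration, not the package's function: the names `pearson`, `correlation_matrix` and `prune_perfect` are hypothetical, the tiny genotype panel is made up, every SNP is assumed to be polymorphic in the panel, and pruning on the absolute correlation (so that a correlation of -1 is also removed) is an assumption of this sketch.

```python
from math import sqrt
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Assumes polymorphic SNPs, otherwise sxx or syy would be zero.
    return sxy / sqrt(sxx * syy)


def correlation_matrix(genotypes: List[List[float]]) -> List[List[float]]:
    """genotypes[k] holds the 0/1/2 codes of SNP k across panel individuals."""
    k = len(genotypes)
    return [
        [1.0 if i == j else pearson(genotypes[i], genotypes[j]) for j in range(k)]
        for i in range(k)
    ]


def prune_perfect(
    corr: List[List[float]], tol: float = 1e-8
) -> Tuple[List[int], List[List[float]]]:
    """Greedily keep SNPs that are not perfectly correlated with a kept SNP."""
    keep: List[int] = []
    for i in range(len(corr)):
        if all(abs(corr[i][j]) < 1.0 - tol for j in keep):
            keep.append(i)
    pruned = [[corr[i][j] for j in keep] for i in keep]
    return keep, pruned


# Made-up panel of six individuals; SNP 1 duplicates SNP 0.
panel = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 1, 0, 2],
]
kept, pruned = prune_perfect(correlation_matrix(panel))
print(kept, pruned)
```

The greedy pass keeps the first SNP of each perfectly correlated group, so the result depends on SNP order; any member of the group would carry the same information.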

3 Application example

In practice, application is performed in three main steps (see Supplementary Material and Supplementary Fig. S4): (i) estimating the SNP correlation matrix, (ii) computing the mean and variance of both the outcome and the exposure in the pooled sample and (iii) estimating the percentage of phenotypic variance explained by main genetic effects and/or interaction effects. To illustrate the performance of our package, we performed a simulation study (see Supplementary Material) comparing the adjusted coefficients of determination from regressions with the estimates obtained using VarExp across 1000 replicates. Figure 1 and Supplementary Figures S1 and S2 demonstrate the high accuracy of our estimator, with an intraclass correlation coefficient between the coefficients of determination and their estimates equal to 0.99, 0.98 and 0.99 for the marginal genetic effects, interaction effects and joint effects, respectively.
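Step (ii) can be carried out without individual-level data when each cohort reports its sample size, mean and variance. The sketch below uses the standard identity that splits the pooled sum of squares into within-cohort and between-cohort parts. It is not taken from the package: the function name `pooled_mean_variance` is hypothetical, the numbers are made up, and per-cohort variances are assumed to be sample variances (denominator n - 1).

```python
from typing import List, Tuple


def pooled_mean_variance(
    sizes: List[int], means: List[float], variances: List[float]
) -> Tuple[float, float]:
    """Mean and sample variance of the pooled sample from per-cohort summaries."""
    total = sum(sizes)
    mean = sum(n * m for n, m in zip(sizes, means)) / total
    # Sum of squares within each cohort around its own mean.
    within = sum((n - 1) * v for n, v in zip(sizes, variances))
    # Sum of squares of the cohort means around the pooled mean.
    between = sum(n * (m - mean) ** 2 for n, m in zip(sizes, means))
    return mean, (within + between) / (total - 1)


# Two made-up cohorts with raw values [1, 2, 3] and [4, 6].
pooled_mean, pooled_var = pooled_mean_variance(
    sizes=[3, 2], means=[2.0, 5.0], variances=[1.0, 2.0]
)
print(pooled_mean, pooled_var)
```

The same function applies to the outcome and to the exposure; the between-cohort term matters whenever cohort means differ, so averaging the per-cohort variances alone would understate the pooled variance.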

Fig. 1. Percentage of phenotypic variance explained using summary statistics (estimated) and individual-level data (observed) for (a) main genetic, (b) interaction and (c) joint effects. The line corresponds to y = x and ICC is the intraclass correlation coefficient.

4 Concluding remarks

In this work, we provide the R package VarExp to easily estimate the percentage of phenotypic variance explained by genetic effects, GxE interaction effects or their joint contribution using summary statistics only, which makes this estimation straightforward in large-scale consortia. Importantly, several limitations of GxE screenings have previously been discussed (Aschard, 2016; Robinson et al., 2017; see also Supplementary Material) and have to be taken into account by users before applying our approach.

Acknowledgements

We gratefully acknowledge all contributors to the CHARGE Gene-Lifestyle Interactions Working Group.

Funding

This work was supported by the R01HL118305 grant from the NHLBI. H.A. was also supported by R21HG007687 from NHGRI. A.R.B. was supported by the Intramural Research Program of the National Human Genome Research Institute in the Center for Research in Genomics and Global Health (CRGGH, Z01HG200362).

Conflict of Interest: none declared.

References

  1. Abecasis G.R. et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65.
  2. Aschard H. (2016) A perspective on interaction effects in genetic association studies. Genet. Epidemiol., 40, 678–688.
  3. Aschard H. et al. (2012) Challenges and opportunities in genome-wide environmental interaction (GWEI) studies. Hum. Genet., 131, 1591–1613.
  4. Pare G. et al. (2016) A method to estimate the contribution of regional genetic associations to complex traits from summary association statistics. Sci. Rep., 6, 27644.
  5. Robinson M.R. et al. (2017) Genotype-covariate interaction effects and the heritability of adult body mass index. Nat. Genet., 49, 1174–1181.
  6. Shi H. et al. (2016) Contrasting the genetic architecture of 30 complex traits from summary association data. Am. J. Hum. Genet., 99, 139–153.

Articles from Bioinformatics are provided here courtesy of Oxford University Press