Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics

Alvaro N Barbeira; Scott P Dickinson; Rodrigo Bonazzola; Jiamao Zheng; Heather E Wheeler; Jason M Torres; Eric S Torstenson; Kaanan P Shah; Tzintzuni Garcia; Todd L Edwards; Eli A Stahl; Laura M Huckins; GTEx Consortium; Dan L Nicolae; Nancy J Cox; Hae Kyung Im

doi:10.1038/s41467-018-03621-1

. 2018 May 8;9:1825. doi: 10.1038/s41467-018-03621-1

Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics

Alvaro N Barbeira ¹, Scott P Dickinson ¹, Rodrigo Bonazzola ¹, Jiamao Zheng ¹, Heather E Wheeler ^2,³, Jason M Torres ⁴, Eric S Torstenson ⁵, Kaanan P Shah ¹, Tzintzuni Garcia ⁶, Todd L Edwards ⁷, Eli A Stahl ^8,⁹, Laura M Huckins ^8,⁹; GTEx Consortium, Dan L Nicolae ¹, Nancy J Cox ⁵, Hae Kyung Im ^1,^✉

¹Section of Genetic Medicine, The University of Chicago, Chicago, IL 60637 USA

²Department of Biology, Loyola University Chicago, Chicago, IL 60660 USA

³Department of Computer Science, Loyola University Chicago, Chicago, IL 60660 USA

⁴Committee on Molecular Metabolism and Nutrition, The University of Chicago, Chicago, IL 60637 USA

⁵Vanderbilt Genetic Institute, Vanderbilt University Medical Center, Nashville, TN 37232 USA

⁶Center for Research Informatics, The University of Chicago, Chicago, IL 60615 USA

⁷Division of Epidemiology, Department of Medicine, Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232 USA

⁸Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, NYC, NY 10029 USA

⁹Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai, NYC, NY 10029 USA

¹⁰The Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA 02142 USA

¹¹Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114 USA

¹²Massachusetts General Hospital Cancer Center, Department of Pathology, Massachusetts General Hospital, Boston, MA 02114 USA

¹³Department of Genetics, Harvard Medical School, Boston, MA 02114 USA

¹⁴Department of Genetics, Stanford University, Stanford, CA 94305 USA

¹⁵Department of Pathology, Stanford University, Stanford, CA 94305 USA

¹⁶Department of Clinical Biochemistry and Pharmacology, Faculty of Health Sciences, BenGurion University of the Negev, Beer-Sheva, 84105 Israel

¹⁷Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA

¹⁸Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, 08003 Barcelona, Spain

¹⁹Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain

²⁰Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland

²¹Institute for Genetics and Genomics in Geneva (iG3), University of Geneva, 1211 Geneva, Switzerland

²²Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland

²³Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104 USA

²⁴New York Genome Center, New York, NY 10013 USA

²⁵Department of Systems Biology, Columbia University Medical Center, New York, NY 10032 USA

²⁶Department of Public Health Sciences, The University of Chicago, Chicago, IL 60637 USA

²⁷McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63108 USA

²⁸Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108 USA

²⁹Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO 63108 USA

³⁰Department of Computer Science, Center for Statistics and Machine Learning, Princeton University, Princeton, NJ 08540 USA

³¹Department of Computer Science, University of California, Los Angeles, CA 90095 USA

³²Department of Human Genetics, University of California, Los Angeles, CA 90095 USA

³³Instituto de Investigação e Inovação em Saúde (i3S), Universidade do Porto, 4200-135 Porto, Portugal

³⁴Institute of Molecular Pathology and Immunology (IPATIMUP), University of Porto, 4200-625 Porto, Portugal

³⁵Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee 37232 USA

³⁶Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands

³⁷Department of Psychiatry, Academic Medical Center, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands

³⁸Lewis Sigler Institute, Princeton University, Princeton, NJ 08540 USA

³⁹Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08540 USA

⁴⁰Biomedical Informatics Program, Stanford University, Stanford, CA 94305 USA

⁴¹Institut Hospital del Mar d’Investigacions Mèdiques (IMIM), 08003 Barcelona, Spain

⁴²Department of Medicine, Washington University School of Medicine, St. Louis, MO 63108 USA

⁴³Department of Convergence Medicine, University of Ulsan College of Medicine, Asan Medical Center, Seoul, 138-736 South Korea

⁴⁴Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA

⁴⁵Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032 USA

⁴⁶Department of Biology, Stanford University, Stanford, CA 94305 USA

⁴⁷Wellcome Trust Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, OX3 7BN UK

⁴⁸Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Oxford, OX3 7LE UK

⁴⁹Oxford NIHR Biomedical Research Centre, Churchill Hospital, Oxford, OX3 7LJ UK

⁵⁰Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, NC 27708 USA

⁵¹Human Genetics Department, McGill University, Montreal, QC H3A 0G1 Canada

⁵²Departament d’Estadística i Investigació Operativa, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain

⁵³Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599 USA

⁵⁴Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599 USA

⁵⁵Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL 60637 USA

⁵⁶Institute for Genomics and Systems Biology, The University of Chicago, Chicago, IL 60637 USA

⁵⁷Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030 USA

⁵⁸Computational Sciences, Pfizer Inc, Cambridge, MA 02139 USA

⁵⁹Universitat de Barcelona, 08028 Barcelona, Spain

⁶⁰Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA

⁶¹Department of Statistics, Stanford University, Stanford, CA 94305 USA

⁶²Institute of Biophysics Carlos Chagas Filho (IBCCF), Federal University of Rio de Janeiro (UFRJ), 21941902 Rio de Janeiro, Brazil

⁶³Department of Psychiatry, University of Utah, Salt Lake City, UT 84108 USA

⁶⁴Department of Statistics, The University of Chicago, Chicago, IL 60637 USA

⁶⁵Department of Human Genetics, The University of Chicago, Chicago, IL 60637 USA

⁶⁶Center for Data Intensive Science, The University of Chicago, Chicago, IL 60637 USA

⁶⁷Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, CA 90095 USA

⁶⁸Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109 USA

⁶⁹Bioinformatics Research Center and Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC 27695 USA

⁷⁰National Institute for Biotechnology in the Negev, Beer-Sheva, 84105 Israel

⁷¹European Molecular Biology Laboratory, 69117 Heidelberg, Germany

⁷²Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08540 USA

⁷³Altius Institute for Biomedical Sciences, Seattle, Washington, 98121 USA

⁷⁴Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA 02215 USA

⁷⁵University of Hohenheim, 70599 Stuttgart, Germany

⁷⁶Huntsman Cancer Institute, Department of Population Health Sciences, University of Utah, Salt Lake City, UT 84112 USA

⁷⁷Center for Epigenetics, Johns Hopkins University School of Medicine, Baltimore, MD 21205 USA

⁷⁸Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205 USA

⁷⁹Department of Mental Health, Johns Hopkins University School of Public Health, Baltimore, MD 21205 USA

⁸⁰McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD 21205 USA

⁸¹Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205 USA

⁸²Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

⁸³Department of Medicine, University of Washington, Seattle, Washington, 98195 USA

⁸⁴Division of Cardiology, University of Washington, Seattle, Washington, 98195 USA

⁸⁵Institute for Systems Genetics, New York University Langone Medical Center, New York, NY 10016 USA

⁸⁶Department of Genome Sciences, University of Washington, Seattle, WA 98195 USA

⁸⁷Office of Strategic Coordination, Division of Program Coordination, Planning and Strategic Initiatives, Office of the Director, NIH, Rockville, MD 20852 USA

⁸⁸Biorepositories and Biospecimen Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD 20892 USA

⁸⁹National Institute of Dental and Craniofacial Research, Bethesda, MD 20892 USA

⁹⁰Division of Genomic Medicine, National Human Genome Research Institute, Rockville, MD 20852 USA

⁹¹Division of Neuroscience and Basic Behavioral Science, National Institute of Mental Health, NIH, Bethesda, MD 20892 USA

⁹²Division of Neuroscience and Behavior, National Institute on Drug Abuse, NIH, Bethesda, MD 20892 USA

⁹³Washington Regional Transplant Community, Falls Church, VA 22003 USA

⁹⁴Gift of Life Donor Program, Philadelphia, PA 19103 USA

⁹⁵LifeGift, Houston, TX 77055 USA

⁹⁶Center for Organ Recovery and Education, Pittsburgh, PA 15238 USA

⁹⁷LifeNet Health, Virginia Beach, VA 23453 USA

⁹⁸National Disease Research Interchange, Philadelphia, PA 19103 USA

⁹⁹Unyts, Buffalo, NY 14203 USA

¹⁰⁰Pharmacology and Therapeutics, Roswell Park Cancer Institute, Buffalo, NY 14263 USA

¹⁰¹Van Andel Research Institute, Grand Rapids, MI 49503 USA

¹⁰²Brain Endowment Bank, Miller School of Medicine, University of Miami, Miami, FL 33136 USA

¹⁰³National Institute of Allergy and Infectious Diseases, NIH, Rockville, MD 20852 USA

¹⁰⁴Biospecimen Research Group, Clinical Research Directorate, Leidos Biomedical Research, Inc., Rockville, MD 20852 USA

¹⁰⁵Leidos Biomedical Research, Inc, Frederick, MD 21701 USA

¹⁰⁶Temple University, Philadelphia, PA 19122 USA

¹⁰⁷Department of Health Behavior and Policy, School of Medicine, Virginia Commonwealth University, Richmond, VA 23298 USA

¹⁰⁸European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, CB10 1SD UK

¹⁰⁹UCSC Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064 USA

^✉

Corresponding author.

PMCID: PMC5940825 PMID: 29739930

Abstract

Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations are tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.

Phenotypic variation and diseases are influenced by factors such as genetic variants and gene expression. Here, Barbeira et al. develop S-PrediXcan to compute PrediXcan results using summary data, and investigate the effects of gene expression variation on human phenotypes in 44 GTEx tissues and >100 phenotypes.

Introduction

Over the last decade, GWAS have been successful in robustly associating genetic loci to human complex traits. However, the mechanistic understanding of these discoveries is still limited, hampering the translation of the associations into actionable targets. Studies of enrichment of expression quantitative trait loci (eQTLs) among trait-associated variants^1–3 show the importance of gene expression regulation. Functional class quantification showed that 80% of the common variant contribution to phenotype variability in 12 diseases can be attributed to DNAase I hypersensitivity sites, further highlighting the importance of transcript regulation in determining phenotypes⁴.

Many transcriptome studies have been conducted where genotypes and expression levels are assayed for a large number of individuals^5–8. The most comprehensive transcriptome dataset, in terms of examined tissues, is the Genotype-Tissue Expression Project (GTEx): a large-scale effort where DNA and RNA were collected from multiple tissue samples from nearly 1000 individuals and sequenced to high coverage^9,10. This remarkable resource provides a comprehensive cross-tissue survey of the functional consequences of genetic variation at the transcript level.

To integrate knowledge generated from these large-scale transcriptome studies and shed light on disease biology, we developed PrediXcan¹¹, a gene-level association approach that tests the mediating effects of gene expression levels on phenotypes. PrediXcan is implemented on GWAS or sequencing studies (i.e., studies with genome-wide interrogation of DNA variation and phenotypes). It imputes transcriptome levels with models trained in measured transcriptome datasets (e.g., GTEx). These predicted expression levels are then correlated with the phenotype in a gene association test that addresses some of the key limitations of GWAS¹¹.

Meta-analysis efforts that aggregate results from multiple GWAS have been able to identify an increasing number of associations that were not detected with smaller sample sizes^12–14. We will refer to these results as Genome-wide association meta-analysis (GWAMA) results. In order to harness the power of these increased sample sizes while keeping the computational burden manageable, methods that use summary level data rather than individual level data are needed.

Methods similar to PrediXcan that estimate the association between intermediate gene expression levels and phenotypes, but use summary statistics have been reported: TWAS (summary version)¹⁵ and Summary Mendelian Randomization (SMR)¹⁶. Another class of methods that integrate eQTL information with GWAS results are based on colocalization of eQTL and GWAS signals. Colocalized signals provide evidence of possible causal relationship between the target gene of an eQTL and the complex trait. These include RTC¹, Sherlock¹⁷, COLOC¹⁸, and more recently eCAVIAR¹⁹ and ENLOC²⁰.

Here we derive a mathematical expression that allows us to compute the results of PrediXcan without the need to use individual-level data, greatly expanding its applicability. We compare with existing methods and outline a best practices framework to perform integrative gene mapping studies, which we term MetaXcan.

We apply the MetaXcan framework by first training over one million elastic net prediction models of gene expression traits, covering protein coding genes across 44 human tissues from GTEx, and then performing gene-level association tests over 100 phenotypes from 40 large meta-analysis consortia and dbGaP.

Results

Computing PrediXcan results using summary statistics

We have derived an analytic expression to compute the outcome of PrediXcan using only summary statistics from genetic association studies. Details of the derivation are shown in the Methods section. In Fig. 1a we illustrate the mechanics of Summary-PrediXcan (S-PrediXcan) in relation to traditional GWAS and the individual-level PrediXcan method¹¹.

Fig. 1 — Comparison between GWAS, PrediXcan, and S-PrediXcan. a Compares GWAS, PrediXcan, and Summary-PrediXcan. Both GWAS and PrediXcan take genotype and phenotype data as input. GWAS computes the regression coefficients of Y on X_l using the model $Y = a + X_{l} b + ϵ$ , where Y is the phenotype and X_l the individual SNP dosage. The output is a table of SNP-level results. PrediXcan, in contrast, starts first by predicting/imputing the transcriptome. Then it calculates the regression coefficients of the phenotype Y on each gene’s predicted expression $T_{g}$ . The output is a table of gene-level results. Summary-PrediXcan directly computes the gene-level association results using the output from GWAS. b Shows the components of the formula to calculate PrediXcan gene-level association results using summary statistics. The different sets involved as input data are shown. The regression coefficient between the phenotype and the genotype is obtained from the study set. The training set is the reference transcriptome dataset where the prediction models of gene expression levels are trained. The reference set (1000G, or training set having some advantages) is used to compute the variances and covariances (LD structure) of the markers used in the predicted expression levels. Both the reference set and training set values are precomputed and provided to the user so that only the study set results need to be provided to the software. The crossed out term was set to 1 as an approximation. We found this approximation to have negligible impact on the results

We find high concordance between PrediXcan and S-PrediXcan results indicating that in most cases, we can use the summary version without loss of power to detect associations. Figure 2 shows the comparison of PrediXcan and S-PrediXcan Z-scores for a simulated phenotype (under the null hypothesis), a cellular growth phenotype and two disease phenotypes: type 1 diabetes and bipolar disorder from the WTCCC Consortium²¹; see Supplementary Notes 1, 2 and 3 for details. For the simulated phenotype, the study sets (in which GWAS is performed) and the reference set (in which LD between SNPs is computed) were African, East Asian, and European subsets from 1000 Genomes. The training set (in which prediction models are trained) was European (DGN Cohort⁵) in all cases. The high correlation between PrediXcan and S-PrediXcan demonstrates the robustness of our method to mismatches between reference and study sets. Despite the generally good concordance between the summary and individual level methods, there were a handful of false positive results with S-PrediXcan much more significant than PrediXcan. This underscores the need to use closely matched LD information whenever possible. Supplementary Fig. 11 shows S-PrediXcan’s performance on a phenotype simulated under the alternative hypothesis.

Fig. 2 — Comparison of PrediXcan and S-PrediXcan results in real and simulated traits. This figure shows a comparison of PrediXcan vs. S-PrediXcan for a a simulated phenotype under null hypothesis of no genetic component; b a cellular phenotype (=intrinsic growth); and c bipolar disorder and type 1 diabetes studies from Wellcome Trust Case Control Consortium (WTCCC). Gene expression prediction models were based on the DGN cohort presented in ref. ¹¹. For the simulated phenotype, study sets (GWAS set) and reference sets (LD calculation set) consisted of African (661), East Asian (504), and European (503) individuals from the 1000 Genomes Project. When the same study set is used as reference set, we obtained a high correlation (coefficient of determination): r² > 0.99999. For the intrinsic growth phenotype, study sets were a subset of 140 individuals from each of the African, Asian, and European groups from 1000 Genomes Project. The reference set was the same as for the simulated phenotype. For the disease phenotypes, the study set consisted of British individuals, and the LD calculation set was the European population subset of the 1000 Genomes Project

Notice that we are not testing here whether PrediXcan itself is robust to population differences between training and study sets. Robustness of the prediction across populations has been previously reported²². We further corroborated this in Supplementary Fig. 10.

Next we compare with other summary result-based methods such as S-TWAS, SMR, and COLOC.

Colocalization estimates complement PrediXcan results

One class of methods seeks to determine whether eQTL and GWAS signals are colocalized or are distinct although linked by LD. This class includes COLOC¹⁸, Sherlock¹⁷, and RTC¹, and more recently eCAVIAR¹⁹, and ENLOC²⁰. Thorough comparison between these methods can be found in refs. ^18,19. HEIDI, the post filtering step in SMR that estimates heterogeneity of GWAS and eQTL signals, can be included in this class. We focus here on COLOC, whose quantification of the probability of five configurations complements well with S-PrediXcan results.

COLOC provides the probability of five hypotheses: H0 corresponds to no eQTL and no GWAS association, H1 and H2 correspond to association with eQTL but no GWAS association or vice-versa, H3 corresponds to eQTL and GWAS association but independent signals, and finally H4 corresponds to shared eQTL and GWAS association. P0, P1, P2, P3, and P4 are the corresponding probabilities for each configuration. The sum of the five probabilities is 1. The authors¹⁸ recommend to interpret H0, H1, and H2 as limited power; for convenience we will aggregate these three hypotheses into one event with probability 1-P3-P4.

Figure 3 shows ternary plots²³ with P3, P4, and 1-P3-P4 as vertices. The blue region, top subtriangle, corresponds to high probability of colocalized eQTL and GWAS signals (P4). The orange region at bottom left corresponds to high probability of distinct eQTL and GWAS signals (P3). The gray region at center and bottom right corresponds to low probability of both colocalization and independent signals.

Fig. 3 — Colocalization status of S-PrediXcan results. a Shows a ternary plot that represents the probabilities of various configurations from COLOC. This plot conveniently constrains the values such that the sum of the probabilities is 1. All points in a horizontal line have the same probability of “colocalized” GWAS and eQTL signals (P4), points on a line parallel to the right side of the triangle (NW to SE) have the same probability of “Independent signals” (P3), and lines parallel to the left side of the triangle (NE to SW) correspond to constant P0+P1+P2. Top sub-triangle in blue corresponds to high probability of colocalization (P4 > 0.5), lower left sub-triangle in orange corresponds to probability of independent signals (P3 > 0.5), and lower right parallelogram corresponds to genes without enough power to determine or reject colocalization. The following panels present ternary plots of COLOC probabilities with a density overlay for S-PrediXcan results of the Height phenotype. b Shows the colocalization probabilities for all gene-tissue pairs. Most results fall into the “undetermined” region. c Shows that if we keep only Bonferroni-significant S-PrediXcan results, associations tend to cluster into three distinct regions: “independent signals,” “colocalized,” and “undertermined.” d Shows that HEIDI significant genes (to be interpreted as high heterogeneity between GWAS and eQTL signals, i.e., distinct signals) tightly cluster in the “independent signal” region, in concordance with COLOC. A few genes fall in the “colocalized” region, in disagreement with COLOC classification. Unlike COLOC results, HEIDI does not partition the genes into distinct clusters and an arbitrary cutoff p-value has to be chosen. e Shows genes with large HEIDI p-value (no evidence of heterogeneity) which fall in large part in the “colocalized” region. However a substantial number fall in “independent signal” region, disagreeing with COLOC’s classification

Figure 3b shows association results for all gene-tissue pairs with the height phenotype. We find that most associations fall in the gray, “undetermined,” region. When we restrict the plot to S-PrediXcan Bonferroni-significant genes (Fig. 3c), three distinct peaks emerge in the high P4 region (P4 > 0.5, “colocalized signals”), high P3 region (P3 > 0.5, “independent signals” or “non-colocalized signals”), and “undetermined” region. Moreover, when genes with low prediction performance are excluded (Supplementary Fig. 6d) the “undetermined” peak significantly diminishes.

These clusters provide a natural way to classify significant genes and complement S-PrediXcan results. Depending on false positive/false negative trade-off choices, genes in the “independent signals” or both “independent signals” and “undetermined” can be filtered out. The proportion of colocalized associations (P4 > 0.5) ranged from 5 to 100% depending on phenotype with a median of 27.6%. The proportion of “non-colocalized” associations ranged from 0 to 77% with a median of 27.0%. Supplementary Table 1 summarizes the percentages of significant associations that fall into the different colocalization regions.

This post-filtering idea was first implemented in the SMR approach using HEIDI. Comparison of COLOC results with HEIDI is shown in Supplementary Fig. 6e to 6h.

Comparison of S-PrediXcan to S-TWAS

Gusev et al. have proposed Transcriptome-Wide Association Study based on summary statistics (STWAS), which imputes the SNP level Z-scores into gene level Z-scores. This is not the same as computing the results of individual level TWAS. We show (in Methods section) that the difference between the individual level and summary level TWAS is given by the factor $\sqrt{\frac{1 - R_{l}^{2}}{1 - R_{g}^{2}}}$ , where R_l is the proportion of variance in the phenotype explained by a SNP’s allelic dosage, and R_g is the proportion explained by gene expression (see Methods section). For most practical purposes we have found that this factor is very close to 1 so that if the same prediction models were used, no substantial difference between S-TWAS and S-PrediXcan should be expected.

Figure 4a shows a diagram of S-PrediXcan and S-TWAS. Both use SNP to phenotype associations results (Z_X,Y) and prediction weights (w_X,Tg) to infer the association between the gene expression level (T_g) and phenotype (Y).

Fig. 4 — Comparison between S-PrediXcan and S-TWAS. a Depicts how summary-TWAS and PrediXcan test the mediating role of gene expression level T_g. Multiple SNPs are linked to the expression level of a gene via weights w_X,Tg. b Shows the significance of Summary-TWAS (BSLMM) vs. summary-PrediXcan (elastic net), for the height phenotype across 44 GTEx tissues. There is a small bias caused by using S-TWAS results available from²⁴, which only lists significant hits. S-PrediXcan tends to yield a larger number of significant associations (see Supplementary Fig. 12). P-values were thresholded at 10−50 for visualization purposes. c Shows the proportion of non-colocalized associations (distinct eQTL and GWAS signals) from S-TWAS significant vs. S-PrediXcan significant results. For all phenotypes, S-TWAS has a higher proportion of LD-contaminated signals compared to S-PrediXcan, as estimated via COLOC. d Shows the proportion of colocalized associations (shared eQTL and GWAS signals) from S-TWAS significant vs. S-PrediXcan significant results. For most phenotypes, TWAS has lower proportion of colocalized signals compared to S-PrediXcan, as estimated via COLOC. Phenotype abbreviations are as follows: FNBD Femoral Neck Bone Density, LSBD Lumbar Spine Bone Density, BMI Body Mass Index, HEIGHT Height, LDL Low-Density Lipoprotein Cholesterol, HDL High-Density Lipoprotein Cholesterol, TRYG Tryglicerides, CROHN Crohn’s Disease, INFBOWEL Inflammatory Bowel’s Disease, ULCERC Ulcerative Colitis, HBA1C Hemogoblin Levels, HOMA-IR HOMA Insulin Response, SCZ Schizophrenia, RA Rheumatoid Arthritis, COLLEGE College Completion, EDUCYEARS Education Years

Figure 4b compares S-TWAS significance (as reported in ref. ²⁴) to S-PrediXcan significance. The difference between the two approaches is mostly driven by the different prediction models: TWAS uses BSLMM²⁵ whereas PrediXcan uses elastic net²⁶. BSLMM allows two components: one sparse (small set of large effect predictors) and one polygenic (all variants contribute some marginal effect to the prediction). For PrediXcan we have chosen to use a sparse model (elastic net) based on the finding that the genetic component of gene expression levels is mostly sparse²⁷.

Figure 4c shows that the COLOC-estimated proportion of non-colocalized (independent) GWAS and eQTL signals is larger among TWAS significant genes than among S-PrediXcan significant ones. We believe this is due to the polygenic component of BSLMM models, a wider set of SNPs increasing the chances of COLOC yielding a non-colocalized result. Figure 4d shows that, for most traits, the COLOC-estimated proportion of colocalized signals is larger among S-PrediXcan significant genes than S-TWAS significant genes.

Comparison of S-PrediXcan to SMR

Zhu et al. have proposed Summary Mendelian Randomization (SMR)¹⁶, a summary data based Mendelian randomization that integrates eQTL results to determine target genes of complex trait-associated GWAS loci. They derive an approximate $χ_{1}^{2}$ -statistic (Eq 5 in ref. ¹⁶) for the mediating effect of the target gene expression on the phenotype. Figure 5a depicts this mechanism.

Fig. 5 — Comparison between summary-PrediXcan and SMR. a Depicts how SMR tests the mediating role of gene expression level $T_{g}$ . The top eQTL is linked to the phenotype as an instrumental variable in a Mendelian Randomization approach. b Shows the significance of SMR vs. the significance of Summary-PrediXcan. As expected, SMR associations tend to be smaller than S-PrediXcan ones. c and d show that the SMR statistics significance is bounded by GWAS and eQTL p-values. The p-values (−log10) of the SMR statistics are plotted against the GWAS p-value of the top eQTL SNP (c), and the gene’s top eQTL p-value (d). e Shows a QQ plot for simulated values of $T_{SMR}$ . Under the null hypothesis of significant eQTL signal and no GWAS association, we generated random values for $Z_{GWAS}^{2}$ and $Z_{eQTL}^{2}$ following the simulations from ref. ¹⁶. $T_{SMR}$ statistic was calculated from these values, and compared to a $χ_{1}^{2}$ distribution to illustrate this statistics’ deflation. f shows the sample mean of $T_{SMR}$ from 1000 simulations, centered close to 0.93, instead of the expected value of 1 for a $χ_{1}^{2}$ -distributed variable. g shows the proportion of non-colocalized significant associations to total significant associations in PrediXcan and SMR. h Shows the proportion of colocalized significant associations (shared eQTL and GWAS signals). As expected, SMR shows a higher proportion of colocalized and non-colocalized associations than PrediXcan. This is caused by SMR’s high eQTL significance threshold, that rules out most of the genes with low colocalization power (P0 + P1 + P2 > 0.5). For some of the associations, GWAS and eQTL p-values were more significant than shown since they were thresholded at 10⁻⁵⁰ to improve visualization. Phenotype abbreviations are as follows: FNBD Femoral Neck Bone Density, LSBD Lumbar Spine Bone Density, BMI Body Mass Index, HEIGHT Height, LDL Low-Density Lipoprotein Cholesterol, HDL High-Density Lipoprotein Cholesterol, TRYG Tryglicerides, CROHN Crohn’s Disease, INFBOWEL Inflammatory Bowel’s Disease, ULCERC Ulcerative Colitis, HBA1C Hemogoblin Levels, HOMA-IR HOMA Insulin Response, SCZ Schizophrenia, RA Rheumatoid Arthritis, COLLEGE College Completion, EDUCYEARS Education Years

Unfortunately, the derived statistic is not well calibrated. A QQ plot comparing the SMR statistic (under the null hypothesis of genome-wide significant eQTL signal and no GWAS association) shows deflation. The sample mean of the statistic is ≈0.93 instead of 1, the expected value for the mean of a $χ_{1}^{2}$ random variable (see Fig. 5e, f and Methods section for details). The $χ^{2}$ approximation is only valid in two extreme cases: when the eQTL association is much stronger than the GWAS association or vice versa, when the GWAS association is much stronger than the eQTL association (see Methods section for details).

One limitation is that the significance of the SMR statistic is the lower of the top eQTL association (genotype to expression) or the GWAS association (genotype to phenotype) as shown in Fig. 5e, f. Given the much larger sample sizes of GWAS studies, for most genes, the combined significance will be determined by the eQTL association. The combined statistic forces us to apply multiple testing correction for all genes, even those that are distant to GWAS associated loci, which is unnecessarily conservative. Keep in mind that currently both SMR and PrediXcan only use cis associations. An example may clarify this further. Let us suppose that for a given phenotype there is only one causal SNP and that the GWAS yielded a highly significant p-value, say 10⁻⁵⁰. Let us also suppose that there is only one gene (gene A) in the vicinity (we are only using cis predictors) associated with the causal SNP with p = 10⁻⁵. SMR would compute the p-values of all genes and yield a p-value ≈ 10⁻⁵ for gene A (the less significant p-value). However, after multiple correction this gene would not be significantly associated with the phenotype. Here it is clear that we should not be adjusting for testing of all genes when we know a priori that only one is likely to produce a gene level association. In contrast, the PrediXcan p-value would be ≈10⁻⁵⁰ for gene A and would be distributed uniformly from 0 to 1 for the remaining genes. Most likely only gene A (or perhaps a handful of genes, just by chance) would be significant after Bonferroni correction. If we further correct for prediction uncertainty (here = eQTL association), a p-value of ≈10⁻⁵ would remain significant since we only need to correct for the (at most) handful of genes that were Bonferroni significant for the PrediXcan p-value.

Another potential disadvantage of this method is that only top-eQTLs are used for testing the gene level association. This does not allow to aggregate the effect on the gene across multiple variants.

Figure 5b compares S-PrediXcan (elastic net) and SMR association results. As expected, SMR p-values tend to be less significant than S-PrediXcan’s in large part due to the additional adjustment for the uncertainty in the eQTL association. Figure 5c, d show that the SMR significance is bounded by the eQTL and GWAS association strengths of the top eQTL. Figure 5g shows a comparison between SMR’s and S-PrediXcan’s proportion of non-colocalization, while Fig. 5h compares proportion of colocalization, as estimated by COLOC. SMR shows a higher proportion of colocalized and independent signals. This is expected since SMR uses a more stringent eQTL association criterion so that there are few significant genes in the undetermined region.

SMR introduces a post filtering step via an approach called HEIDI, which is compared to COLOC in Fig. 3e and Supplementary Fig. 6.

MetaXcan framework

Building on S-PrediXcan and existing approaches, we define a general framework (MetaXcan) to integrate eQTL information with GWAS results and map disease-associated genes. This evolving framework can incorporate models and methods to increase the power to detect causal genes and filter out false positives. Existing methods fit within this general framework as instances or components (Fig. 6a).

Fig. 6 — MetaXcan framework and application. a Shows a general framework (MetaXcan) which encompasses methods such as PrediXcan, TWAS, SMR, COLOC among others. b Summarizes the application of the MetaXcan framework with S-PrediXcan using 44 GTEx tissue transcriptomes and over 100 GWAS and meta analysis results. We trained prediction models using elastic-net²⁶ and deposited the weights and SNP covariances in the publicly available resource (http://predictdb.org/). The weights, covariances, and over 100 GWAS summary results were processed with S-PrediXcan. Colocalization status was computed and the full set of results was deposited in gene2pheno.org

The framework starts with the training of prediction models for gene expression traits followed by a selection of high-performing models. Next, a mathematical operation is performed to compute the association between each gene and the downstream complex trait. Additional adjustment for the uncertainty in the prediction model can be added. To avoid capturing LD-contaminated associations, which can occur when expression predictor SNPs and phenotype causal SNPs are different but in LD, we use colocalization methods that estimate the probability of shared or independent signals.

PrediXcan implementations use elastic net models motivated by our observation that gene expression variation is mostly driven by sparse components²⁷. TWAS implementations have used Bayesian Sparse Linear Mixed Models²⁵ (BSLMM). SMR fits into this scheme with prediction models consisting solely of the top eQTL for each gene (weights are not necessary here since only one SNP is used at a time).

For the last step, we chose COLOC to estimate the probability of colocalization of GWAS and eQTL signals. COLOC probabilities cluster more distinctly into different classes and thus, unlike other methods, suggests a natural cut off threshold at P = 0.5. Another advantage of COLOC is that for genes with low probability of colocalization, it further distinguishes distinct GWAS and eQTL signals from low power. This is a useful feature that future development of colocalization methods should also offer. SMR, on the other hand, uses its own estimate of “heterogeneity” of signals calculated by HEIDI.

Suggested association analysis pipeline

Perform PrediXcan or S-PrediXcan using all tissues. Use Bonferroni correction for all gene-tissue pairs: keep p < 0.05/number of gene-tissue pairs tested.
Keep associations with significant prediction performance adjusting for number of PrediXcan significant gene-tissue pairs: keep prediction performance p-values < 0.05/(number of significant associations from previous step).
Filter out LD-contaminated associations, i.e., gene-tissue pairs in the “independent signal” (=”non-colocalized”) region of the ternary plot (See Fig. 3a): keep COLOC P3 < 0.5 (Blue and gray regions in Fig. 3a).
If further reduction of number of genes to be taken to replication or validation is desired, keep only hits with explicit evidence of colocalization: P4 > 0.5 (Blue region in Fig. 3a).

Any choice of thresholds has some level of arbitrariness. Depending on the false positive and negative trade off, these numbers may be changed.

Gene expression variation is associated to diverse traits

We downloaded summary statistics from meta analyses of over 100 phenotypes from 40 consortia. The full list of consortia and phenotypes is shown in Supplementary Data 2. We tested association between these phenotypes and the predicted expression levels using elastic net models in 44 human tissues from GTEx as described in the Methods section, and a whole blood model from the DGN cohort presented in ref. ¹¹. We illustrate this application in Fig. 6b.

S-PrediXcan’s results tend to be more significant as the genetic component of gene expression increases (larger cross-validated prediction performance $R^{2}$ ). Similarly, S-PrediXcan associations tend to be more significant when prediction is more reliable (p-values of association between predicted and observed expression levels are more significant, i.e., when prediction performance p-value is smaller). The trend is seen both when results are averaged across all tissues for a given phenotype or across all phenotypes for a given tissue, as displayed in Supplementary Figs. 1-4. This trend was also robust across different monotonic functions of the Z-scores.

We used a Bonferroni threshold accounting for all the gene-tissue pairs that were tested (0.05/total number of gene-tissue pairs ≈2.5e-7). This approach is conservative because the correlation between tissues would make the total number of independent tests smaller than the total number of gene-tissue pairs. Height had the largest number of significantly associated unique genes at 1686 (based on a GWAMA of 250 K individuals). Other polygenic diseases with a large number of associations include schizophrenia with 305 unique significant genes (n = 150 K individuals), low-density lipoprotein cholesterol (LDL-C) levels with 296 unique significant genes (n = 188 K), other lipid levels, glycemic traits, and immune/inflammatory disorders such as rheumatoid arthritis and inflammatory bowel disease. For other psychiatric phenotypes, a much smaller number of significant associations was found, with eight significant genes for bipolar disorder (n = 16,731) and one for major depressive disorder (n = 18,759), probably due to smaller sample sizes, but also smaller effect sizes.

When step 2 from the suggested pipeline is applied, keeping only reliably predicted genes, we are left with 739 genes for height, 150 for schizophrenia, 117 for LDL-C levels.

After step 3, which keeps genes that are without strong evidence of LD-contamination, these numbers dropped to 264 for height, 58 for schizophrenia, and 60 for LDL-C levels. After step 4, which keeps only genes with strong evidence of colocalization, we find 215 genes for height, 49 for schizophrenia, and 35 for LDL-C. The counts for the full set of phenotypes can be found in Supplementary Data 2.

Mostly, genome-wide significant genes tend to cluster around known SNP-level genome-wide significant loci or sub-genome-wide significant loci. Regions with sub-genome-wide significant SNPs can yield genome-wide significant results in S-PrediXcan, because of the reduction in multiple testing and the increase in power arising from the combined effects of multiple variants. Supplementary Table 2 lists a few examples where this occurs.

The proportion of colocalized associations (P4 > 0.5) ranged from 5 to 100% depending on phenotype with a median of 27.6%. The proportion of “non colocalized” associations ranged from 0 to 77% with a median of 27.0%.

See full set of results in our online catalog (gene2pheno.org). Significant gene-tissue pairs are included in Supplementary Data 3. To facilitate comparison, the catalog contains all SMR results we generated and the S-TWAS results reported by ref. ²⁴ for 30 GWAS traits and GTEx BSLMM models. Note that SMR application to 28 phenotypes was reported by ref. ²⁸ using whole blood eQTL results from ref. ²⁹.

Small gene expression changes associated to mild phenotypes

We reasoned that if complete knock out of monogenic disease genes cause severe forms of the disease, more moderate alterations of gene expression levels (as affected by regulatory variation in the population) could cause more moderate forms of the disease. Thus moderate alterations in expression levels of monogenic disease genes (such as those driven by eQTLs) may have an effect on related complex traits, and this effect could be captured by S-PrediXcan association statistics. To test this hypothesis, we obtained genes listed in the ClinVar database³⁰ for obesity, rheumatoid arthritis, diabetes, Alzheimer’s, Crohn’s disease, ulcerative colitis, age-related macular degeneration, schizophrenia, and autism. Figure 7 displays the QQ plot for all associations and compares to those in ClinVar database. As postulated, we found enrichment of significant S-PrediXcan associations for ClinVar genes for all tested phenotypes except for autism and schizophrenia. The lack of significance for autism is probably due to insufficient power: the distribution of p-values is close to the null distribution. In contrast, for schizophrenia, many genes were found to be significant in the S-PrediXcan analysis. There are several reasons that may explain this lack of enrichment: genes identified with GWAS and subsequently with S-PrediXcan have rather small effect sizes, so that it would not be surprising that they were missed until very large sample sizes were aggregated; ClinVar genes may originate from rare mutations that are not well covered by our prediction models, which are based on common variation (due to limited sample sizes of eQTL studies and the minor allele frequency –MAF filter used in GWAS studies); or the mechanism of action of the schizophrenia linked ClinVar genes may be different than the alteration of expression levels. Also, the pathogenicity of some of the ClinVar entries has been questioned³¹. The list of diseases in ClinVar used to generate the enrichment figures can be found in Supplementary Data 1, along with the corresponding association results.

Fig. 7 — ClinVar genes show significant S-PrediXcan associations. Genes implicated in ClinVar tended to be more significant in S-PrediXcan for most diseases tested, except for schizophrenia and autism. This suggests that more moderate alteration of monogenic disease genes may contribute in a continuum of more moderate but related phenotypes. Alternatively, a more complex interplay between common and rare variation could be taking place such as higher tolerance to loss of function mutations in lower expressing haplotypes which could induce association with predicted expression. Blue circles correspond to the QQ plot of genes in ClinVar that were annotated with the phenotype and black circles correspond to all genes

Agnostic scanning across GTEx tissues improves discovery

Most genes were found to be significantly associated in a handful of tissues as illustrated in Fig. 8b. For example, for LDL-C levels, liver was the most enriched tissue in significant associations as expected given known biology of this trait (See Supplementary Fig. 5). This prominent role of liver was apparent despite the smaller sample size available for building liver models (n = 97), which was less than a third of the numbers available for muscle (n = 361) or lung (n = 278).

However, in general, tissues expected to stand out as more enriched for diseases given currently known biology did not consistently do so when we looked at the average across all (significant) genes, using various measures of enrichment. For example, the enrichment in liver was less apparent for high-density lipoprotein cholesterol (HDL-C) or triglyceride levels. We find for many significant associations that the evidence is present across multiple tissues. This may be caused by a combination of context specificity and sharing of regulatory mechanism across tissues.

Next, we illustrate the challenges of identifying disease relevant tissues based on eQTL information using three genes with well established biology: C4A for schizophrenia³² and SORT1³³ and PCSK9 both for LDL-C and cardiovascular disease. S-PrediXcan results for these genes and traits, and regulatory activity by tissue (as measured by the proportion of expression explained by the genetic component), are shown in Fig. 8a. Representative results are shown in Supplementary Tables 3, 4 and 5. Supplementary Data 4 contains the full set of MetaXcan results (i.e., association, colocalization, and HEIDI) for these genes.

SORT1 is a gene with strong evidence for a causal role in LDL-C levels, and as a consequence, is likely to affect risk for cardiovascular disease³³. This gene is most actively regulated in liver (close to 50% of the expression level of this gene is determined by the genetic component) with the most significant S-PrediXcan association in liver (p-value ≈ 0, Z = −28.8), consistent with our prior knowledge of lipid metabolism. In this example, tissue specific results suggest a causal role of SORT1 in liver.

However, in the following example, association results across multiple tissues do not allow us to discriminate the tissue of action. C4A is a gene with strong evidence of causal effect on schizophrenia risk via excessive synaptic pruning in the brain during development³². Our results show that C4A is associated with schizophrenia risk in all tissues (p < 2.5 × 10⁻⁷ in 36 tissue models and p < 0.05 for the remaining four tissue models).

PCSK9 is a target of several LDL-C lowering drugs currently under trial to reduce cardiovascular events³⁴. The STARNET study³⁵ profiled gene expression levels in cardiometabolic disease patients and showed tag SNP rs12740374 to be a strong eQTL for PCSK9 in visceral fat but not in liver. Consistent with this, our S-PrediXcan results also show a highly significant association between PCSK9 and LDL-C (p ≈ 10⁻¹³) in visceral fat and not in liver (our training algorithm did not yield a prediction model for PCSK9, i.e., there was no evidence of regulatory activity). In our results, however, the statistical evidence is much stronger in tibial nerve (p ≈ 10⁻²⁷). Accordingly, in our training set (GTEx), there is much stronger evidence of regulation of this gene in tibial nerve compared to visceral fat.

Most associations highlighted here have high colocalization probabilities. See Supplementary Tables 3, 4, and 5. However, visceral fat association shows evidence of non colocalization (probability of independent signals P3 = 0.69 in LDL-C). It is possible that the relevant regulatory activity in visceral adipose tissue was not detected in the GTEx samples for various reasons but it was detected in tibial nerve. Thus by looking into all tissues’ results we increase the window of opportunities where we can detect the association.

PCSK9 yields colocalized signals for LDL-C levels in Tibial Nerve, Lung, and Whole blood. SORT1 shows colocalization with LDL-C in liver (P4 ≈ 1) and pancreas (P4 = 0.90). C4A is colocalized with schizophrenia risk for the majority of the tissues (29/40) with a median colocalization probability of 0.82.

These examples demonstrate the power of studying regulation in a broad set of tissues and contexts and emphasize the challenges of determining causal tissues of complex traits based on in-silico analysis alone. Based on these results, we recommend to scan all tissues’ models to increase the chances to detect the relevant regulatory mechanism that mediates the phenotypic association. False positives can be controlled by Bonferroni correcting for the additional tests.

Replication in an independent cohort

We used data from the Resource for Genetic Epidemiology Research on Adult Health and Aging study (GERA, phs000674.v1.p1)^36,37. This is a study led by the Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH) and the UCSF Institute for Human Genetics with over 100,000 participants. We downloaded the data from dbGaP and performed GWAS followed by S-PrediXcan analysis of 22 conditions available in the European subset of the cohort.

For replication, we chose Coronary Artery Disease (CAD), LDL cholesterol levels, Triglyceride levels, and schizophrenia, which had closely related phenotypes in the GERA study and had a sufficiently large number of Bonferroni significant associations in the discovery set. Analysis and replication of the type 2 diabetes phenotype can be found in ref. ³⁸. Coronary artery disease hits were compared with “Any cardiac event,” LDL cholesterol and triglyceride level signals were compared with “Dyslipidemia,” and schizophrenia was compared to “Any psychiatric event” in GERA.

High concordance between discovery and replication is shown in Fig. 9 where dyslipidemia association Z-scores are compared to LDL cholesterol Z-scores. The majority of gene-tissue pairs (92%, among the ones with Z-score magnitude greater than 2 in both sets) have concordant direction of effects in the discovery and replication sets. The high level of concordance is supportive of an omnigenic trait architecture³⁹.

Following standard practice in meta-analysis, we consider a gene to be replicated when the following three conditions are met: the p-value in the replication set is <0.05, the direction of discovery and replication effects are the same, and the meta analyzed p-value is Bonferroni significant with the discovery threshold.

We display summary statistics for this replication analysis in Table 1. Among the 56 genes significantly associated with CAD in the discovery set, 6 (11%) were significantly associated with “Any cardiac event” in GERA. Using “Dyslipidemia” as the closest matching phenotype, 78.5% and 43.5% of LDL and triglyceride genes replicated, respectively. Among the 285 genes associated with schizophrenia in the discovery set, 51 (21%) replicated. The low replication rate for CAD and Schizophrenia is likely due to the broad phenotype definitions in the replication.

Table 1.

Replication of results in GERA

Discovery phenotype	Replication phenotype	# Signif genes in disc set	# Replicated genes	π₁(all) in repl	π₁(sig) in repl	% Replicated genes	# Replicated coloc or undeterm
Coronary artery disease	Any cardiac event	56	6	0.4%	49.1%	10.7%	6
LDL cholesterol	Dyslipidemia	282	219	5.8%	90.8%	78.5%	184
Triglycerides	Dyslipidemia	233	100	5.8%	73.1%	43.5%	69
Schizophrenia	Any psychiatric event	285	60	1.2%	47.6%	21.1%	51

Open in a new tab

Significant genes/tissue pairs were replicated using a closely matched phenotype in an independent dataset from the GERA cohort³⁶. The criteria consisted in significance threshold for replication at p < 0.05, concordant directions of effect, and meta analysis p-value less than the Bonferroni threshold in the discovery set. π₁ is an estimate of proportion of true positives in the replication set. π₁(all) uses all gene–tissue pairs whereas π₁(sig) is computed using only gene-tissue pairs that were significant in the discovery set. The column ‘# replicated genes coloc or undeterm’ is the number of replicated genes excluding the ones for which there was strong evidence of independent GWAS and eQTL signals

We found no consistent replication pattern difference between colocalized and non-colocalized genes.

This is not unexpected if the LD pattern is similar between discovery and replication sets.

The full list of significant genes can be queried in gene2pheno.org.

Discussion

Here we derive a mathematical expression to compute PrediXcan results without using individual level data, which greatly expands its applicability and is robust to study and reference set mismatches. This has not been done before. TWAS, which for the individual level approach only differs from PrediXcan on the prediction model used in the implementation, has been extended to use summary level data. When Gaussian imputation is used, the relationship between individual level and summary versions of TWAS is clear. This is not the case when extended to general weights (such as BSLMM). Our mathematical derivation shows the analytic difference between them explicitly.

We also add a post filtering step, to mitigate issues with LD-contamination. Based on consistency with PrediXcan and interpretability of results, we have chosen to use COLOC for filtering. COLOC has the limitation of assuming a single causal variant, and has reduced power in the presence of multiple causal variants. However, colocalization estimation is an active area of research and improved versions or methods will be adopted in the future. We find that BSLMM-based TWAS results have a larger proportion of non-colocalized genes as estimated by COLOC. This could be due to the single variant assumption in COLOC but we believe this is rather a consequence of the polygenic component of BSLMM predictors. Given the predominantly sparse architecture of gene expression traits²⁷, we believe that adding a polygenic component unnecessarily increases the exposure to LD-contamination.

Despite the generally good concordance between the summary and individual level methods, there were a handful of false positive results with S-PrediXcan much more significant than PrediXcan. This underscores the need to use closely matched LD information whenever possible.

We applied our framework to over 100 phenotypes using transcriptome prediction models trained in 44 tissues from the GTEx Consortium and generated a catalog of downstream phenotypic association results of gene expression variation, a growing resource for the community.

The enrichment of monogenic disease genes among related phenotype associations suggests that moderate alteration of expression levels as affected by common genetic variation may cause a continuum of phenotypic changes. Alternatively, a more complex interplay between common and rare variation could be taking place such as higher tolerance to loss of function mutations in lower expressing haplotypes which could induce association with predicted expression⁴⁰.

We are finding that most trait associations are tissue specific; i.e., they are detected in a handful of tissues. However, we also find that expected tissues given known biology do not necessarily rank among the top enriched tissues. This suggests context specificity of the pathogenic mechanism; specific developmental stage or environmental conditions may be necessary to detect the regulatory event. On the other hand, we are detecting associations in unexpected tissues which suggests a sharing of regulation across multiple tissues/contexts or perhaps novel biology that takes place in these tissues. In either case, agnostic scanning of a broad set of tissues is necessary to discover these mechanisms.

Methods

Summary-PrediXcan formula

Figure 1b shows the main analytic expression used by Summary-PrediXcan for the Z-score (Wald statistic) of the association between predicted gene expression and a phenotype. The input variables are the weights used to predict the expression of a given gene, the variance and covariances of the markers included in the prediction, and the GWAS coefficient for each marker. The last factor in the formula can be computed exactly in principle, but we would need additional information that is unavailable in typical GWAS summary statistics output such as phenotype variance and sample size. Dropping this factor from the formula does not affect the accuracy of the results as demonstrated in the close to perfect concordance between PrediXcan and Summary-PrediXcan results on the diagonal of Fig. 2a.

The approximate formula we use is:

Z_{g} \approx \sum_{l \in {Model}_{g}} w_{l g} \frac{\hat{σ_{l}}}{\hat{σ_{g}}} \frac{\hat{β_{l}}}{se (\hat{β} l)},

where w_lg is the weight of SNP l in the prediction of the expression of gene g; ${\hat{β}}_{l}$ is the GWAS regression coefficients for SNP l; se( ${\hat{β}}_{l}$ ) is standard error of $\hat{β}$ , ${\hat{σ}}_{l}$ is the estimated variance of SNP l, and ${\hat{σ}}_{g}$ is the estimated variance of the predicted expression of gene g; and dosage and alternate allele are assumed to be the same.

The inputs are based, in general, on data from three different sources: study set (e.g., GWAS study set), expression training set (e.g., GTEx, DGN), population reference set (e.g., the training set or 1000 Genomes).

The study set is the main dataset of interest from which the genotype and phenotypes of interest are gathered. The regression coefficients and standard errors are computed based on individual-level data from the study set or a SNP-level meta-analysis of multiple GWAS. Training sets are the reference transcriptome datasets used for the training of the prediction models (GTEx, DGN, Framingham, etc.) thus the weights w_lg are computed from this set. Training sets can also be used to generate variance and covariances of genetic markers, which will usually be different from the study sets. When individual level data are not available from the training set we use population reference sets such as 1000 Genomes data. In the most common use scenario, users will need to provide only GWAS results using their study set. The remaining parameters are pre-computed, and published in PredictDB.

Association enrichment

We display the enrichment for selected phenotypes in Supplementary Fig. 5, measured as mean(Z²). For visualization purposes, we selected 25 phenotypes from different categories such as anthropometric traits, cardiometabolic traits, autoimmune diseases, and psychiatric conditions (please see figure caption for the list of selected phenotypes). The simple mean of Z² for all gene-tissue pairs in a phenotype was taken.

Derivation of summary-PrediXcan formula

The goal of summary-PrediXcan is to infer the results of PrediXcan using only GWAS summary statistics. Individual level data are not needed for this algorithm. We will introduce some notations for the derivation of the analytic expressions of S-PrediXcan.

Notation and preliminaries

Y is the n-dimensional vector of phenotype for individuals i = 1, n. X_l is the allelic dosage for SNP l. T_g is the predicted expression (or estimated GREx, genetically regulated expression). w_lg are weights to predict expression $T_{g} = \sum_{l \in Model g} w_{l g} X_{l}$ , derived from an independent training set.

We model the phenotype as linear functions of X_l and T_g

Y = α_{1} + X_{l} β_{l} + η

Y = α_{2} + T_{g} γ_{g} + ϵ

where α₁ and α₂ are intercepts, η and $ϵ$ error terms independent of X_l and T_g, respectively. Let ${\hat{γ}}_{g}$ and ${\hat{β}}_{l}$ be the estimated regression coefficients of Y regressed on T_g and X_l, respectively. ${\hat{γ}}_{g}$ is the result (effect size for gene g) we get from PrediXcan whereas ${\hat{β}}_{l}$ is the result from a GWAS for SNP l.

We will denote as $\hat{Var}$ and $\hat{Cov}$ the operators that compute the sample variance and covariance, i.e.,: $\hat{Var} (Y) = {\hat{σ}}_{Y}^{2} = \sum_{i = 1, n} {(Y_{i} - Ȳ)}^{2} ∕ (n - 1)$ with $Ȳ = \sum_{i = 1, n} Y_{i} ∕ n$ . Let ${\hat{σ}}_{l}^{2} = \hat{Var} (X_{l})$ , ${\hat{σ}}_{g}^{2} = \hat{Var} (T_{g})$ and $Γ_{g} = (X - \bar{X})' (X - \bar{X}) ∕ n$ , where X′ is the p × n matrix of SNP data and $\bar{X}$ is a n × p matrix where column l has the column mean of X_l (p being the number of SNPs in the model for gene g, typically p ≪ n).

With this notation, our goal is to infer PrediXcan results ( ${\hat{γ}}_{g}$ and its standard error) using only GWAS results ( ${\hat{β}}_{l}$ and their standard error), estimated variances of SNPs ( ${\hat{σ}}_{l}^{2}$ ), estimated covariances between SNPs in each gene model (Γ_g), and prediction model weights w_lg.

Input: ${\hat{β}}_{l}$ , se( ${\hat{β}}_{l}$ ), ${\hat{σ}}_{l}^{2}$ , $Γ_{g}$ , w_lg. Output: ${\hat{γ}}_{g}$ /se( ${\hat{γ}}_{g}$ ).

Next we list the properties and definitions used in the derivation

{\hat{γ}}_{g} = \frac{\hat{Cov} (T_{g}, Y)}{\hat{Var} (T_{g})} = \frac{\hat{Cov} (T_{g}, Y)}{{\hat{σ}}_{g}^{2}}

and

\hat{β_{l}} = \frac{\hat{Cov} (X_{l}, Y)}{\hat{Var} (X_{l})} = \frac{\hat{Cov} (X_{l}, Y)}{{\hat{σ}}_{l}^{2}}

The proportion of variance explained by the covariate (T_g or X_l) can be expressed as

R_{g}^{2} = {\hat{γ}}_{g}^{2} \frac{{\hat{σ}}_{g}^{2}}{{\hat{σ}}_{Y}^{2}}

R_{l}^{2} = {\hat{γ}}_{l}^{2} \frac{{\hat{σ}}_{l}^{2}}{{\hat{σ}}_{Y}^{2}}

By definition

T_{g} = \sum_{l \in {Model}_{g}} w_{l g} X_{l}

Thus $\hat{Var} (T_{g}) = {\hat{σ}}_{g}^{2}$ can be computed as

\begin{matrix} {\hat{σ}}_{g}^{2} & = & \hat{Var} (\sum_{l \in {Model}_{g}} w_{l g} X_{l}) \\ = & \hat{Var} (W_{g} X_{g}) \\ = & W_{g}^{'} \hat{Var} (X_{g}) W_{g} \end{matrix}

where W_g is the vector of $w_{l g}$ for SNPs in the model of g. By definition, $Γ_{g}$ is $\hat{Var} (X_{g})$ , the sample covariance of X_g, so that we arrive to

{\hat{σ}}_{g}^{2} = W_{g}^{'} {ΓW}_{g}

Calculation of regression coefficient ${\hat{γ}}_{g}$

${\hat{γ}}_{g}$ can be expressed as

\begin{matrix} {\hat{γ}}_{g} & = & \frac{\hat{Cov} (T_{g}, Y)}{{\hat{σ}}_{g}^{2}} \\ = & \frac{\hat{Cov} (\sum_{l \in {Model}_{g}} w_{l g} X_{l}, Y)}{{\hat{σ}}_{g}^{2}} \\ = & \sum_{l \in {Model}_{g}} \frac{w_{l g} \hat{Cov} (X_{l}, Y)}{{\hat{σ}}_{g}^{2}} \end{matrix}

where we used the linearity of $\hat{Cov}$ in the last step. Using Eq. (3), we arrive to

{\hat{γ}}_{g} = \sum_{l \in {Model}_{g}} \frac{w_{l g} {\hat{β}}_{l} {\hat{σ}}_{l}^{2}}{{\hat{σ}}_{g}^{2}}

Calculation of standard error of ${\hat{γ}}_{g}$

Also from the properties of linear regression we know that

{se}^{2} ({\hat{γ}}_{g}) = Var ({\hat{γ}}_{g}) = \frac{{\hat{σ}}_{ϵ}^{2}}{n {\hat{σ}}_{g}^{2}} = \frac{{\hat{σ}}_{Y}^{2} (1 - R_{g}^{2})}{n {\hat{σ}}_{g}^{2}}

In this equation, ${\hat{σ}}_{Y}^{2} ∕ n$ is not necessarily known but can be estimated using the equation analogous to (6) for β_l

{se}^{2} ({\hat{β}}_{l}) = \frac{{\hat{σ}}_{Y}^{2} (1 - R_{l}^{2})}{n {\hat{σ}}_{l}^{2}}

Thus:

\frac{{\hat{σ}}_{Y}^{2}}{n} = \frac{{se}^{2} ({\hat{β}}_{l}) {\hat{σ}}_{l}^{2}}{(1 - R_{l}^{2})}

Notice that the right hand side of (8) is dependent on the SNP l while the left hand side is not. This equality will hold only approximately in our implementation since we will be using approximate values for ${\hat{σ}}_{l}^{2}$ , i.e., from reference population, not the actual study population.

Calculation of Z-score

To assess the significance of the association, we need to compute the ratio of the estimated effect size ${\hat{γ}}_{g}$ and standard error se( ${\hat{γ}}_{g}$ ), or Z-score,

Z_{g} = \frac{{\hat{γ}}_{g}}{se ({\hat{γ}}_{g})}

with which we can compute the p-value as p = 2Φ(−|Z_g|) where Φ(.) is the normal CDF function. Thus

\begin{matrix} Z_{g} = \frac{{\hat{γ}}_{g}}{se ({\hat{γ}}_{g})} \\ = \sum_{l \in {Model}_{g}} \frac{w_{l g} {\hat{β}}_{l} {\hat{σ}}_{l}^{2}}{{\hat{σ}}_{g}^{2}} \sqrt{\frac{n}{{\hat{σ}}_{Y}^{2}} \frac{{\hat{σ}}_{g}^{2}}{(1 - R_{g}^{2})}} \\ = \sum_{l \in {Model}_{g}} \frac{w_{l g} {\hat{β}}_{l} {\hat{σ}}_{l}^{2}}{{\hat{σ}}_{g}^{2}} \sqrt{\frac{(1 - R_{l}^{2})}{{se}^{2} ({\hat{β}}_{l}) {\hat{σ}}_{l}^{2}} \frac{{\hat{σ}}_{g}^{2}}{(1 - R_{g}^{2})}} \end{matrix}

where we used Eqs. (5) and (6) in the second line and Eq. (8) in the last step. So

Z_{g} = \sum_{l \in {Model}_{g}} w_{l g} \frac{{\hat{σ}}_{l}}{{\hat{σ}}_{g}} \frac{{\hat{β}}_{l}}{se ({\hat{β}}_{l})} \sqrt{\frac{(1 - R_{l}^{2})}{(1 - R_{g}^{2})}}

\approx \sum_{l \in {Model}_{g}} w_{l g} \frac{{\hat{σ}}_{l}}{{\hat{σ}}_{g}} \frac{{\hat{β}}_{l}}{se ({\hat{β}}_{l})}

Based on results with actual and simulated data for realistic effect size ranges, we have found that the last approximation does not affect our ability to identify the association. The approximation becomes inaccurate only when the effect sizes are very large. But in these cases, the small decrease in statistical efficiency induced by the approximation is compensated by the large power to detect the larger effect sizes.

Calculation of σ_g in reference set

The variance of predicted expression is computed using Eq. (4) which takes weights for each SNP in the prediction model and the correlation (LD) between the SNPs. The correlation is computed in a reference set such as 1000G or in the training set.

Expression model training

To train our prediction models, we obtained genotype data and normalized gene expression data collected by the GTEx Project. We used 44 different tissues sampled by GTEx and thus generated 44 different tissue-wide models (dbGaP Accession phs000424.v6.p1). Sample sizes for different tissues range from 70 (Uterus) to 361 (Muscle—Skeletal). The models referenced in this paper make use of the GTEx Project’s V6p data, a patch to the version 6 data and makes use of improved gene-level annotation. We removed ambiguously stranded SNPs from genotype data, i.e., ref/alt pairs A/T, C/G, T/A, G/C. Genotype data was filtered to include only SNPs with MAF > 0.01. For each tissue, normalized gene expression data was adjusted for covariates such as gender, sequencing platform, the top three principal components from genotype data and top PEER Factors. The number of PEER Factors used was determined by sample size: 15 for n < 150, 30 for n between 150 and 250, and 35 for n > 250. Covariate data was provided by GTEx. For our analysis, we used protein-coding genes only.

For each gene-tissue pair for which we had adjusted expression data, we fit an Elastic-Net model based on the genotypes of the samples for the SNPs located within 1 Mb upstream of the gene’s transcription start site and 1 Mb downstream of the transcription end site. We used the R package glmnet with mixing parameter alpha equal to 0.5, and the penalty parameter lambda was chosen through 10-fold

Cross-validation

Once we fit all models, we retained only those with q-value less than 0.05⁴¹ For each tissue examined, we created a sqlite database to store the weights of the prediction models, as well as other statistics regarding model training. Supplementary Table 6 contains summary statistics on the models for each GTEx tissue. These databases have been made available for download at PredictDB.org.

Online Catalog and SMR, COLOC, TWAS

Supplementary Data 2 shows the list of GWA/GWAMA studies we considered in this analysis. We applied S-PrediXcan to these studies using the transcriptome models trained on GTEx studies for patched version 6. For simplicity, S-PrediXcan only considers those SNPs that have a matching set of alleles in the prediction model, and adjusts the dosages (2 dosage) if the alleles are swapped.

To make the results of this study broadly accessible, we built a Postgre SQL relational database to store S-PrediXcan results, and serve them via a web application http://gene2pheno.org.

We also applied SMR¹⁶ to the same set of GWAMA studies, using the GTEx eQTL associations. We downloaded version 0.66 of the software from the SMR website, and ran it using the default parameters. We converted the GWAMA and GTEx eQTL studies to SMR input formats. In order to have SMR compute the colocalization test, for those few GWAMA studies where allele frequency was not reported, we filled in with frequencies from the 1000 Genomes Project⁴² as an approximation. We also used the 1000 Genomes genotype data as reference panel for SMR.

Next we ran COLOC¹⁸ (as downloaded from the Comprehensive R Archive Network) over the same set of GWAMA and eQTL studies. We used the Approximate Bayes Factor colocalization analysis, with effect sizes, their standard errors, allele frequencies and sample sizes as arguments. When the frequency information was missing from the GWAS, we filled in with data from the 1000 Genomes Project.

For comparison purposes, we have also included the results of the application of Summary-TWAS to 30 traits publicly shared by the authors²⁴.

Comparison with TWAS

Formal similarity with TWAS can be made more explicit by rewriting S-PrediXcan formula in matrix form. With the following notation and definitions

\begin{matrix} {\tilde{W}}_{g} & = & {(σ_{1} w_{1 g}, \dots, σ_{p} w_{p g})}^{'} \\ Z_{SNPs} & = & {(Z_{1}, \dots, Z_{p})}^{'} \\ = & (\frac{{\hat{β}}_{1}}{s e ({\hat{β}}_{1})}, \dots, \frac{{\hat{β}}_{p}}{s e ({\hat{β}}_{p})})' \end{matrix}

and correlation matrix of SNPs in the model for gene g

$Σ_{g} = diag (\frac{1}{{\hat{σ}}_{1}}, \dots, \frac{1}{{\hat{σ}}_{p}}) \cdot Γ_{g} \cdot diag (\frac{1}{{\hat{σ}}_{1}}, \dots, \frac{1}{{\hat{σ}}_{p}})$

it is quite straightforward to write the numerator in (1) and (11) as

${\tilde{W}}_{g} \cdot Z_{SNPs}$ and in the denominator, the variance of the predicted expression level of gene g, as

{\tilde{W}}_{g}^{'} \cdot Σ_{g} \cdot {\tilde{W}}_{g}

Thus

Z_{g} = \frac{{\tilde{W}}_{g} \cdot Z_{SNPs}}{\sqrt{{\tilde{W}}_{g}^{'} \cdot Σ_{g} \cdot {\tilde{W}}_{g}}}

This equation has the same form as the TWAS expression if we use the scaled weight vector ${\tilde{W}}_{g}$ instead of W_g. Summary-TWAS imputes the Z-score for the gene-level result assuming that under the null hypothesis, the Z-scores are normally distributed with the same correlation structure as the SNPs; whereas in S-PrediXcan we compute the results of PrediXcan using summary statistics. Thus, S-TWAS and S-PrediXcan yield equivalent mathematical expressions (after setting the factor $\sqrt{\frac{(1 - R_{l}^{2})}{(1 - R_{g}^{2})}} \approx 1$ ).

Summary-PrediXcan with only top eQTL as predictor

The S-PrediXcan formula when only the top eQTL is used to predict the expression level of a gene can be expressed as

\begin{matrix} Z_{S-PrediXcan} & = \sum_{l \in {Model}_{g}} w_{l g} \frac{{\hat{σ}}_{l}}{{\hat{σ}}_{g}} \frac{{\hat{β}}_{l}}{se (β_{l})} \\ = w_{1 g} \frac{{\hat{σ}}_{1}}{\sqrt{w_{1 g}^{2} {\hat{σ}}_{1}^{2}}} Z_{1} \\ = Z_{1} \end{matrix}

where Z₁ is the GWAS Z-score of the top eQTL in the model for gene. Thus

Z_{top eQTL S - PrediXcan}^{2} = Z_{GWAS}^{2}

Comparison with SMR

SMR quantifies the strength of the association between expression levels of a gene and complex traits with T_SMR using the following function of the eQTL and GWAS Z-score statistics

T_{SMR} = \frac{Z_{eQTL}^{2} Z_{GWAS}^{2}}{Z_{eQTL}^{2} + Z_{GWAS}^{2}}

Here $Z_{eQTL}$ is the Z-score (=effect size/standard error) of the association between SNP and gene expression, and $Z_{GWAS}$ is the Z-score of the association between SNP and trait.

This SMR statistic (T_SMR) is not a $χ_{1}^{2}$ random variable as assumed in ref. ¹⁶. To prove this, we performed simulations following those described in ref. ¹⁶. We generated 10⁵ pairs of values for $Z_{GWAS}^{2}$ and $Z_{eQTL}^{2}$ . $Z_{GWAS}^{2}$ was sampled from a $χ_{1}^{2}$ distribution. $Z_{eQTL}^{2}$ was sampled from a non-central $χ_{1}^{2}$ distribution with parameter λ = 29 (a value chosen to mimic results from²⁹, see ref. ¹⁶). Only pairs with eQTLs satisfying genome-wide significance (p < 5 × 10⁻⁸) were kept. We performed a QQ plot and observed deflation when comparing to random values sampled from a $χ_{1}^{2}$ distribution (Fig. 5e). This simulation was repeated 1000 times, and we observed a mean of T_SMR close to 0.93 (Fig. 5f).

Only in two extreme cases, the chi-square approximation holds, when $Z_{eQTL} ≫ Z_{GWAS}$ or $Z_{eQTL} ≪ Z_{GWAS}$ . In these extremes, we can apply Taylor expansions to find an interpretable form of the SMR statistic.

If $Z_{eQTL} ≫ Z_{GWAS}$ , i.e., the eQTL association is much more significant than the GWAS association,

T_{SMR} = \frac{Z_{GWAS}^{2}}{1 + \frac{Z_{GWAS}^{2}}{Z_{eQTL}^{2}}} \approx Z_{GWAS}^{2} (1 - \frac{Z_{GWAS}^{2}}{Z_{eQTL}^{2}})

so that for large enough $Z_{eQTL}^{2}$ relative to $Z_{GWAS}^{2}$ ,

T_{SMR} \approx Z_{GWAS}^{2} = Z_{top eQTL S - PrediXcan}^{2}

using Eq. 12. Thus, in this case, the SMR statistic is slightly smaller than the (top eQTL based) S-PrediXcan $χ_{1}$ -square. This reduced significance is accounting for the uncertainty in the eQTL association. As the evidence for eQTL association grows, the denominator $Z_{eQTL}^{2}$ increases and the difference tends to 0.

On the other extreme when the GWAS association is much stronger than the eQTLs, $Z_{eQTL} ≪ Z_{GWAS}$ ,

T_{SMR} = \frac{Z_{eQTL}^{2}}{1 + \frac{Z_{eQTL}^{2}}{Z_{GWAS}^{2}}} \approx Z_{eQTL}^{2} (1 - \frac{Z_{eQTL}^{2}}{Z_{GWAS}^{2}})

so that analogously

T_{SMR} \approx Z_{eQTL}^{2}

In both extremes, the SMR statistic significance is approximately equal to the less significant of the two statistics GWAS or eQTL, albeit strictly smaller.

In between the two extremes, the right distribution must be computed using numerical methods. When we look at the empirical distribution of the SMR statistic’s p-value against the GWAS and eQTL (top eQTL for the gene) p-values, we find the ceiling of the SMR statistic is maintained as shown in Fig. 5e, f.

GERA imputation

Genotype files were obtained from dbGaP, and updated to release 35 of the probe annotations published by Affymetrix via PLINK⁴³. Probes were filtered out that had a minor allele frequency of <0.01, were missing in >10% of subjects, or did not fit Hardy-Weinberg equilibrium. Subjects were dropped that had an unexpected level of heterozygosity (F>0.05). Finally the HRC-1000G-check-bim.pl script (http://www.well.ox.ac.uk/~wrayner/tools/) was used to perform some final filtering and split data by chromosome. Phasing (via eagle v2.3⁴⁴) and imputation against the HRC r1.1 2016 panel⁴⁵ (via minimac3) were carried out by the Michigan Imputation Server⁴⁶.

GERA GWAS and MetaXcan Application

European samples had been split into ten groups during imputation to ease the computational burden on the Michigan server, so after obtaining the imputed .vcf files, we used the software PLINK⁴³ to convert the genotype files into the PLINK binary file format and merge the ten groups of samples together, while dropping any variants not found in all sample groups. For the association analysis, we performed a logistic regression using PLINK, and following QC practices from ref. ¹⁴ we filtered out individuals with genotype missingness >0.03 and filtered out variants with minor allele frequency <0.01, missingness >0.05, out of Hardy-Weinberg equilibrium significant at 1e-6, or had imputation quality <0.8. We used gender and the first ten genetic principal components as obtained from dbGaP as covariates. Following all filtering, our analysis included 61,444 European samples with 7,120,064 variants. MetaXcan was then applied to these GWAS results, using the 45 prediction models (GTEx and DGN).

Code Availability

We make our software publicly available on a GitHub repository: https://github.com/hakyimlab/MetaXcan. A short working example can be found on the GitHub page; more extensive documentation can be found on the project’s wiki page.

Data availability

The underlying GWAS results used in this analysis were downloaded from publicly available resources listed in Supplementary Data 2. The relevant GTEx gene expression data was obtained from dbGAP using accession phs000424.v6.p1. The GERA study was downloaded from dbGAP using accession number phs000674.v2.p2. WTCCC data was downloaded from WTCCC EGA european genome-phenome archive.

The list of ClinVar genes was downloaded from https://www.ncbi.nlm.nih.gov/clinvar/. TWAS results published in ref. ²⁴ were used. Prediction model weights and covariances for different tissues are available from the predictdb.org resource. The results of MetaXcan applied to the 44 human tissues and a broad set of phenotypes can be queried in gene2pheno.org, and we make the full data set of results available via the public GitHub repository https://github.com/hakyimlab/MetaXcan.

Electronic supplementary material

Supplementary Information^{(2.6MB, pdf)}

Peer Review File^{(398.6KB, pdf)}

41467_2018_3621_MOESM3_ESM.pdf^{(303.3KB, pdf)}

Description of Additional Supplementary Files

Supplementary Data 1^{(123KB, csv)}

Supplementary Data 2^{(11.5KB, csv)}

Supplementary Data 3^{(3.8MB, csv)}

Supplementary Data 4^{(24.5KB, csv)}

Acknowledgements

We acknowledge the following US National Institutes of Health grants: R01MH107666 (H.K.I.), T32 MH020065 (K.P.S.), R01 MH101820 (GTEx), P30 DK20595 (Diabetes Research and Training Center), F31 DK101202 (J.M.T.), P50 DA037844 (Rat Genomics), and P50 MH094267 (Conte). H.E.W. was supported in part by start-up funds from the Loyola University Chicago.

The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health. Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI SAIC-Frederick, Inc. (SAIC-F) subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to the Broad Institute, Inc. Biorepository operations were funded through an SAIC-F subcontract to the Van Andel Institute (10ST1035). Additional data repository and project management were provided by SAIC-F (HHSN261200800001E). The Brain Bank was supported by a supplements to the University of Miami grants DA006227 and DA033684 and to contract N01MH000028. Statistical Methods development grants were made to the University of Geneva (MH090941 and MH101814), the University of Chicago (MH090951, MH090937, MH101820, and MH101825), the University of North Carolina—Chapel Hill (MH090936 and MH101819), the Harvard University (MH090948), the Stanford University (MH101782), the Washington University St Louis (MH101810), and the University of Pennsylvania (MH101822). The data used for the analyses described in this manuscript were obtained from dbGaP accession number phs000424.v6.p1 on 06/17/2016.

This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113 and 085475.

This work was completed in part with resources provided by the University of Chicago Research Computing Center, Bionimbus⁴⁷, and the Center for Research Informatics. The Center for Research Informatics is funded by the Biological Sciences Division at the University of Chicago with additional funding provided by the Institute for Translational Medicine, CTSA grant number UL1 TR000430 from the National Institutes of Health.

Author contributions

A.N.B. contributed to S-PrediXcan software developmentl; executed S-PredixCan runs on the GWAS traits; developed framework for comparing PrediXcan and S-PrediXcan in simulated, cellular and WTCCC phenotypes; ran COLOC and SMR; developed gene2pheno.org database and web dashboard; contributed to the main text, supplement, figures, and analyses. S.P.D. performed the GTEx model training; ran the GERA GWAS; contributed to the main text. J.M.T. contributed to the main text. J.Z. contributed to the figures and predictdb.org resource. E.S.T. contributed to S-PrediXcan software. H.E.W. contributed to the main text and analysis. K.P.S. ran PrediXcan on WTCCC data and contributed to the analysis. R.B. contributed to the main text and figures. T.G. ran imputation of GERA genotypes. T.L.E. contributed to the analysis. L.M.H. contributed to the main text. E.A.S. contributed to the main text. D.L.N. contributed to the main text and analysis. N.J.C. contributed to the analysis. H.K.I. conceived the method, supervised the project, performed analysis, contributed to the main text, supplement, and figures. The GTEx consortium authors contributed in the collection, gathering, and processing of GTEx study data used for training transcriptome prediction models and running COLOC and SMR.

Competing interests

The authors declare no competing interests.

Footnotes

A full list of consortium members appears at the end of the paper.

Electronic supplementary material

Supplementary Information accompanies this paper at 10.1038/s41467-018-03621-1.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Hae Kyung Im, Email: haky@uchicago.edu.

GTEx Consortium:

François Aguet, Kristin G. Ardlie, Beryl B. Cummings, Ellen T. Gelfand, Gad Getz, Kane Hadley, Robert E. Handsaker, Katherine H. Huang, Seva Kashin, Konrad J. Karczewski, Monkol Lek, Xiao Li, Daniel G. MacArthur, Jared L. Nedzel, Duyen T. Nguyen, Michael S. Noble, Ayellet V. Segrè, Casandra A. Trowbridge, Taru Tukiainen, Nathan S. Abell, Brunilda Balliu, Ruth Barshir, Omer Basha, Alexis Battle, Gireesh K. Bogu, Andrew Brown, Christopher D. Brown, Stephane E. Castel, Lin S. Chen, Colby Chiang, Donald F. Conrad, Farhan N. Damani, Joe R. Davis, Olivier Delaneau, Emmanouil T. Dermitzakis, Barbara E. Engelhardt, Eleazar Eskin, Pedro G. Ferreira, Laure Frésard, Eric R. Gamazon, Diego Garrido-Martín, Ariel D. H. Gewirtz, Genna Gliner, Michael J. Gloudemans, Roderic Guigo, Ira M. Hall, Buhm Han, Yuan He, Farhad Hormozdiari, Cedric Howald, Brian Jo, Eun Yong Kang, Yungil Kim, Sarah Kim-Hellmuth, Tuuli Lappalainen, Gen Li, Xin Li, Boxiang Liu, Serghei Mangul, Mark I. McCarthy, Ian C. McDowell, Pejman Mohammadi, Jean Monlong, Stephen B. Montgomery, Manuel Muñoz-Aguirre, Anne W. Ndungu, Andrew B. Nobel, Meritxell Oliva, Halit Ongen, John J. Palowitch, Nikolaos Panousis, Panagiotis Papasaikas, YoSon Park, Princy Parsana, Anthony J. Payne, Christine B. Peterson, Jie Quan, Ferran Reverter, Chiara Sabatti, Ashis Saha, Michael Sammeth, Alexandra J. Scott, Andrey A. Shabalin, Reza Sodaei, Matthew Stephens, Barbara E. Stranger, Benjamin J. Strober, Jae Hoon Sul, Emily K. Tsang, Sarah Urbut, Martijn van de Bunt, Gao Wang, Xiaoquan Wen, Fred A. Wright, Hualin S. Xi, Esti Yeger-Lotem, Zachary Zappala, Judith B. Zaugg, Yi-Hui Zhou, Joshua M. Akey, Daniel Bates, Joanne Chan, Lin S. Chen, Melina Claussnitzer, Kathryn Demanelis, Morgan Diegel, Jennifer A. Doherty, Andrew P. Feinberg, Marian S. Fernando, Jessica Halow, Kasper D. Hansen, Eric Haugen, Peter F. Hickey, Lei Hou, Farzana Jasmine, Ruiqi Jian, Lihua Jiang, Audra Johnson, Rajinder Kaul, Manolis Kellis, Muhammad G. Kibriya, Kristen Lee, Jin Billy Li, Qin Li, Xiao Li, Jessica Lin, Shin Lin, Sandra Linder, Caroline Linke, Yaping Liu, Matthew T. Maurano, Benoit Molinie, Stephen B. Montgomery, Jemma Nelson, Fidencio J. Neri, Meritxell Oliva, Yongjin Park, Brandon L. Pierce, Nicola J. Rinaldi, Lindsay F. Rizzardi, Richard Sandstrom, Andrew Skol, Kevin S. Smith, Michael P. Snyder, John Stamatoyannopoulos, Barbara E. Stranger, Hua Tang, Emily K. Tsang, Li Wang, Meng Wang, Nicholas Van Wittenberghe, Fan Wu, Rui Zhang, Concepcion R. Nierras, Philip A. Branton, Latarsha J. Carithers, Ping Guan, Helen M. Moore, Abhi Rao, Jimmie B. Vaught, Sarah E. Gould, Nicole C. Lockart, Casey Martin, Jeffery P. Struewing, Simona Volpi, Anjene M. Addington, Susan E. Koester, A. Roger Little, Lori E. Brigham, Richard Hasz, Marcus Hunter, Christopher Johns, Mark Johnson, Gene Kopen, William F. Leinweber, John T. Lonsdale, Alisa McDonald, Bernadette Mestichelli, Kevin Myer, Brian Roe, Michael Salvatore, Saboor Shad, Jeffrey A. Thomas, Gary Walters, Michael Washington, Joseph Wheeler, Jason Bridge, Barbara A. Foster, Bryan M. Gillard, Ellen Karasik, Rachna Kumar, Mark Miklos, Michael T. Moser, Scott D. Jewell, Robert G. Montroy, Daniel C. Rohrer, Dana R. Valley, David A. Davis, Deborah C. Mash, Anita H. Undale, Anna M. Smith, David E. Tabor, Nancy V. Roche, Jeffrey A. McLean, Negin Vatanian, Karna L. Robinson, Leslie Sobin, Mary E. Barcus, Kimberly M. Valentino, Liqun Qi, Steven Hunter, Pushpa Hariharan, Shilpi Singh, Ki Sung Um, Takunda Matose, Maria M. Tomaszewski, Laura K. Barker, Maghboeba Mosavel, Laura A. Siminoff, Heather M. Traino, Paul Flicek, Thomas Juettemann, Magali Ruffier, Dan Sheppard, Kieron Taylor, Stephen J. Trevanion, Daniel R. Zerbino, Brian Craft, Mary Goldman, Maximilian Haeussler, W. James Kent, Christopher M. Lee, Benedict Paten, Kate R. Rosenbloom, John Vivian, and Jingchun Zhu

References

1.Nica AC, et al. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLOS Genet. 2010;6:1000895. doi: 10.1371/journal.pgen.1000895. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Nicolae DL, et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLOS Genet. 2010;6:e1000888. doi: 10.1371/journal.pgen.1000888. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Li YI, et al. RNA splicing is a primary link between genetic variation and disease. Science. 2016;352:600–604. doi: 10.1126/science.aad9417. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Gusev A, et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 2014;95:535–552. doi: 10.1016/j.ajhg.2014.10.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Battle A, et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 2014;24:14–24. doi: 10.1101/gr.155192.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Lappalainen T, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–511. doi: 10.1038/nature12531. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Zhang X, et al. Identification of common genetic variants controlling transcript isoform variation in human whole blood. Nat. Genet. 2015;47:345–352. doi: 10.1038/ng.3220. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Stranger BE, et al. Patterns of Cis regulatory variation in diverse human populations. PLOS Genet. 2012;8:e1002639. doi: 10.1371/journal.pgen.1002639. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.The GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat. Genet. 2013;45:580–5. doi: 10.1038/ng.2653. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Aguet F., et al. Local genetic effects on gene expression across 44 human tissues. Preprint at bioRxiv:http://biorxiv.org/content/early/2016/09/09/074450 (2016).
11.Gamazon ER, et al. A genebased association method for mapping traits using reference transcriptome data. Nat. Genet. 2015;47:1091–1098. doi: 10.1038/ng.3367. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Smoller JW, et al. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet. 2013;381:1371–9. doi: 10.1016/S0140-6736(12)62129-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Deloukas P, et al. Large scale association analysis identifies new risk loci for coronary artery disease. Nat. Genet. 2013;45:25–33. doi: 10.1038/ng.2480. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Morris AP, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 2012;44:981–990. doi: 10.1038/ng.2383. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Gusev A, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 2016;48:245–252. doi: 10.1038/ng.3506. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Zhu Z, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 2016;48:481–7. doi: 10.1038/ng.3538. [DOI] [PubMed] [Google Scholar]
17.He X, et al. Sherlock: detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am. J. Hum. Genet. 2013;92:667–680. doi: 10.1016/j.ajhg.2013.03.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Giambartolomei C, et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLOS Genet. 2014;10:e1004383. doi: 10.1371/journal.pgen.1004383. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Hormozdiari F, et al. Colocalization of GWAS and eQTL signals detects target genes. Am. J. Hum. Genet. 2016;99:1245–1260. doi: 10.1016/j.ajhg.2016.10.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Wen X, Pique-Regi R, Luca F. Integrating molecular QTL data into genome-wide genetic association analysis: Probabilistic assessment of enrichment and colocalization. PLOS Genet. 2017;13:e1006646. doi: 10.1371/journal.pgen.1006646. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–78. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Manor O, Segal E. Robust prediction of expression differences among human individuals using only genotype information. PLOS Genet. 2013;9:e1003396. doi: 10.1371/journal.pgen.1003396. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Hamilton N. ggtern: an extension to’ggplot2’, for the creation of ternary diagramshttps://CRAN.R-project.org/package=ggtern (R package version 2.2.0, 2016).
24.Mancuso N, et al. Integrating gene expression with summary association statistics to identify genes associated with 30 complex traits. Am. J. Hum. Genet. 2017;100:473–487. doi: 10.1016/j.ajhg.2017.01.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Zhou X, Carbonetto P, Stephens M. Polygenic modeling with bayesian sparse linear mixed models. PLOS Genet. 2013;9:e1003264. doi: 10.1371/journal.pgen.1003264. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Zou H, Hastie T. Regularization and variable selection via the elastic-net. J. R. Stat. Soc. 2005;67:301–320. doi: 10.1111/j.1467-9868.2005.00503.x. [DOI] [Google Scholar]
27.Wheeler HE, et al. Survey of the heritability and sparse architecture of gene expression traits across human tissues. PLOS Genet. 2016;12:e1006423. doi: 10.1371/journal.pgen.1006423. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Pavlides JMW, et al. Predicting gene targets from integrative analyses of summary data from GWAS and eQTL studies for 28 human complex traits. Genome Med. 2016;8:1–6. doi: 10.1186/s13073-016-0338-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Westra HJ, et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Genet. 2013;45:1238–1243. doi: 10.1038/ng.2756. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Landrum MJ, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic acids Res. 2015;44:D862–8. doi: 10.1093/nar/gkv1222. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Shah N., et al. Identification of misclassified ClinVar variants using disease population prevalence. Preprint at biorxiv. http://biorxiv.org/lookup/doi/10.1101/075416 (2016). [DOI] [PMC free article] [PubMed]
32.Sekar A, et al. Schizophrenia risk from complex variation of complement component 4. Nature. 2016;530:177–83. doi: 10.1038/nature16549. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Musunuru K, et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature. 2010;466:714–9. doi: 10.1038/nature09266. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Dadu RT, Ballantyne CM. Lipid lowering with PCSK9 inhibitors. Nat. Publ. Group. 2014;11:563–575. doi: 10.1038/nrcardio.2014.84. [DOI] [PubMed] [Google Scholar]
35.Franzén O, et al. Cardiometabolic risk loci share downstream cis- and trans-gene regulation across tissues and diseases. Science. 2016;353:827–830. doi: 10.1126/science.aad6970. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Hoffmann TJ, et al. Genome-wide association analyses using electronic health records identify new loci influencing blood pressure variation. Nat. Genet. 2016;49:54–64. doi: 10.1038/ng.3715. [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Cook JP, Morris AP. Multi-ethnic genome-wide association study identifies novel locus for type 2 diabetes susceptibility. Eur. J. Hum. Genet. 2016;24:1175–1180. doi: 10.1038/ejhg.2016.17. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Torres J. M., et al. Integrative cross tissue analysis of gene expression identifies novel type 2 diabetes genes. Preprint at bioRxivhttp://biorxiv.org/content/early/2017/02/27/108134 (2017).
39.Boyle EA, Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017;169:1177–1186. doi: 10.1016/j.cell.2017.05.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Castel S. E., et al. Modified penetrance of coding variants by cis-regulatory variation shapes human traits. Preprint at bioRxiv.https://www.biorxiv.org/content/early/2017/09/18/190397 (2017).
41.Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Auton A, et al. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Chang CC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Loh PR, et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 2016;48:1443–1448. doi: 10.1038/ng.3679. [DOI] [PMC free article] [PubMed] [Google Scholar]
45.McCarthy S, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643. [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Das S, et al. Next-generation genotype imputation service and methods. Nat. Genet. 2016;48:1284–1287. doi: 10.1038/ng.3656. [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Heath AP, et al. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets. J. Am. Med. Inform. Assoc. 2014;21:969–975. doi: 10.1136/amiajnl-2013-002155. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information^{(2.6MB, pdf)}

Peer Review File^{(398.6KB, pdf)}

41467_2018_3621_MOESM3_ESM.pdf^{(303.3KB, pdf)}

Description of Additional Supplementary Files

Supplementary Data 1^{(123KB, csv)}

Supplementary Data 2^{(11.5KB, csv)}

Supplementary Data 3^{(3.8MB, csv)}

Supplementary Data 4^{(24.5KB, csv)}

Data Availability Statement

[CR1] 1.Nica AC, et al. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLOS Genet. 2010;6:1000895. doi: 10.1371/journal.pgen.1000895. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Nicolae DL, et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLOS Genet. 2010;6:e1000888. doi: 10.1371/journal.pgen.1000888. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Li YI, et al. RNA splicing is a primary link between genetic variation and disease. Science. 2016;352:600–604. doi: 10.1126/science.aad9417. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Gusev A, et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 2014;95:535–552. doi: 10.1016/j.ajhg.2014.10.004. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Battle A, et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 2014;24:14–24. doi: 10.1101/gr.155192.113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Lappalainen T, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–511. doi: 10.1038/nature12531. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Zhang X, et al. Identification of common genetic variants controlling transcript isoform variation in human whole blood. Nat. Genet. 2015;47:345–352. doi: 10.1038/ng.3220. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Stranger BE, et al. Patterns of Cis regulatory variation in diverse human populations. PLOS Genet. 2012;8:e1002639. doi: 10.1371/journal.pgen.1002639. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.The GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat. Genet. 2013;45:580–5. doi: 10.1038/ng.2653. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Aguet F., et al. Local genetic effects on gene expression across 44 human tissues. Preprint at bioRxiv:http://biorxiv.org/content/early/2016/09/09/074450 (2016).

[CR11] 11.Gamazon ER, et al. A genebased association method for mapping traits using reference transcriptome data. Nat. Genet. 2015;47:1091–1098. doi: 10.1038/ng.3367. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Smoller JW, et al. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet. 2013;381:1371–9. doi: 10.1016/S0140-6736(12)62129-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Deloukas P, et al. Large scale association analysis identifies new risk loci for coronary artery disease. Nat. Genet. 2013;45:25–33. doi: 10.1038/ng.2480. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Morris AP, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 2012;44:981–990. doi: 10.1038/ng.2383. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Gusev A, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 2016;48:245–252. doi: 10.1038/ng.3506. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR16] 16.Zhu Z, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 2016;48:481–7. doi: 10.1038/ng.3538. [DOI] [PubMed] [Google Scholar]

[CR17] 17.He X, et al. Sherlock: detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am. J. Hum. Genet. 2013;92:667–680. doi: 10.1016/j.ajhg.2013.03.022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Giambartolomei C, et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLOS Genet. 2014;10:e1004383. doi: 10.1371/journal.pgen.1004383. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR19] 19.Hormozdiari F, et al. Colocalization of GWAS and eQTL signals detects target genes. Am. J. Hum. Genet. 2016;99:1245–1260. doi: 10.1016/j.ajhg.2016.10.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Wen X, Pique-Regi R, Luca F. Integrating molecular QTL data into genome-wide genetic association analysis: Probabilistic assessment of enrichment and colocalization. PLOS Genet. 2017;13:e1006646. doi: 10.1371/journal.pgen.1006646. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] 21.WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–78. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.Manor O, Segal E. Robust prediction of expression differences among human individuals using only genotype information. PLOS Genet. 2013;9:e1003396. doi: 10.1371/journal.pgen.1003396. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Hamilton N. ggtern: an extension to’ggplot2’, for the creation of ternary diagramshttps://CRAN.R-project.org/package=ggtern (R package version 2.2.0, 2016).

[CR24] 24.Mancuso N, et al. Integrating gene expression with summary association statistics to identify genes associated with 30 complex traits. Am. J. Hum. Genet. 2017;100:473–487. doi: 10.1016/j.ajhg.2017.01.031. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] 25.Zhou X, Carbonetto P, Stephens M. Polygenic modeling with bayesian sparse linear mixed models. PLOS Genet. 2013;9:e1003264. doi: 10.1371/journal.pgen.1003264. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR26] 26.Zou H, Hastie T. Regularization and variable selection via the elastic-net. J. R. Stat. Soc. 2005;67:301–320. doi: 10.1111/j.1467-9868.2005.00503.x. [DOI] [Google Scholar]

[CR27] 27.Wheeler HE, et al. Survey of the heritability and sparse architecture of gene expression traits across human tissues. PLOS Genet. 2016;12:e1006423. doi: 10.1371/journal.pgen.1006423. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] 28.Pavlides JMW, et al. Predicting gene targets from integrative analyses of summary data from GWAS and eQTL studies for 28 human complex traits. Genome Med. 2016;8:1–6. doi: 10.1186/s13073-016-0338-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR29] 29.Westra HJ, et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Genet. 2013;45:1238–1243. doi: 10.1038/ng.2756. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR30] 30.Landrum MJ, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic acids Res. 2015;44:D862–8. doi: 10.1093/nar/gkv1222. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Shah N., et al. Identification of misclassified ClinVar variants using disease population prevalence. Preprint at biorxiv. http://biorxiv.org/lookup/doi/10.1101/075416 (2016). [DOI] [PMC free article] [PubMed]

[CR32] 32.Sekar A, et al. Schizophrenia risk from complex variation of complement component 4. Nature. 2016;530:177–83. doi: 10.1038/nature16549. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR33] 33.Musunuru K, et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature. 2010;466:714–9. doi: 10.1038/nature09266. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR34] 34.Dadu RT, Ballantyne CM. Lipid lowering with PCSK9 inhibitors. Nat. Publ. Group. 2014;11:563–575. doi: 10.1038/nrcardio.2014.84. [DOI] [PubMed] [Google Scholar]

[CR35] 35.Franzén O, et al. Cardiometabolic risk loci share downstream cis- and trans-gene regulation across tissues and diseases. Science. 2016;353:827–830. doi: 10.1126/science.aad6970. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] 36.Hoffmann TJ, et al. Genome-wide association analyses using electronic health records identify new loci influencing blood pressure variation. Nat. Genet. 2016;49:54–64. doi: 10.1038/ng.3715. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR37] 37.Cook JP, Morris AP. Multi-ethnic genome-wide association study identifies novel locus for type 2 diabetes susceptibility. Eur. J. Hum. Genet. 2016;24:1175–1180. doi: 10.1038/ejhg.2016.17. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Torres J. M., et al. Integrative cross tissue analysis of gene expression identifies novel type 2 diabetes genes. Preprint at bioRxivhttp://biorxiv.org/content/early/2017/02/27/108134 (2017).

[CR39] 39.Boyle EA, Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017;169:1177–1186. doi: 10.1016/j.cell.2017.05.038. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.Castel S. E., et al. Modified penetrance of coding variants by cis-regulatory variation shapes human traits. Preprint at bioRxiv.https://www.biorxiv.org/content/early/2017/09/18/190397 (2017).

[CR41] 41.Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR42] 42.Auton A, et al. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR43] 43.Chang CC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR44] 44.Loh PR, et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 2016;48:1443–1448. doi: 10.1038/ng.3679. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR45] 45.McCarthy S, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR46] 46.Das S, et al. Next-generation genotype imputation service and methods. Nat. Genet. 2016;48:1284–1287. doi: 10.1038/ng.3656. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR47] 47.Heath AP, et al. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets. J. Am. Med. Inform. Assoc. 2014;21:969–975. doi: 10.1136/amiajnl-2013-002155. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics

Alvaro N Barbeira

Scott P Dickinson

Rodrigo Bonazzola

Jiamao Zheng

Heather E Wheeler

Jason M Torres

Eric S Torstenson

Kaanan P Shah

Tzintzuni Garcia

Todd L Edwards

Eli A Stahl

Laura M Huckins

Dan L Nicolae

Nancy J Cox

Hae Kyung Im

Abstract

Introduction

Results

Computing PrediXcan results using summary statistics

Fig. 1.

Fig. 2.

Colocalization estimates complement PrediXcan results

Fig. 3.

Comparison of S-PrediXcan to S-TWAS

Fig. 4.

Comparison of S-PrediXcan to SMR

Fig. 5.

MetaXcan framework

Fig. 6.

Suggested association analysis pipeline

Gene expression variation is associated to diverse traits

Small gene expression changes associated to mild phenotypes

Fig. 7.

Agnostic scanning across GTEx tissues improves discovery

Fig. 8.

Replication in an independent cohort

Fig. 9.

Table 1.

Discussion

Methods

Summary-PrediXcan formula

Association enrichment

Derivation of summary-PrediXcan formula

Notation and preliminaries

Calculation of Z-score

Calculation of σg in reference set

Expression model training

Cross-validation

Online Catalog and SMR, COLOC, TWAS

Comparison with TWAS

Summary-PrediXcan with only top eQTL as predictor

Comparison with SMR

GERA imputation

GERA GWAS and MetaXcan Application

Code Availability

Data availability

Electronic supplementary material

Acknowledgements

Author contributions

Competing interests

Footnotes

Contributor Information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Calculation of σ_g in reference set