Fine-mapping and QTL tissue-sharing information improve the reliability of causal gene identification

Alvaro N Barbeira; Owen J Melia; Yanyu Liang; Rodrigo Bonazzola; Gao Wang; Heather E Wheeler; François Aguet; GTEx Consortium; Kristin G Ardlie; Xiaoquan Wen; Hae K Im

doi:10.1002/gepi.22346

Author manuscript; available in PMC: 2020 Dec 8.

Published before final editing as: Genet Epidemiol. 2020 Sep 10:10.1002/gepi.22346. doi: 10.1002/gepi.22346

PMCID: PMC7693040 NIHMSID: NIHMS1640792 PMID: 32964524

Abstract

The integration of transcriptomic studies and GWAS (genome-wide association studies) via imputed expression has seen extensive application in recent years, enabling the functional characterization and causal gene prioritization of GWAS loci. However, the techniques for imputing transcriptomic traits from DNA variation remain underdeveloped. Furthermore, associations found when linking eQTL studies to complex traits through methods like PrediXcan can lead to false positives due to linkage disequilibrium between distinct causal variants. Therefore, the best prediction performance models may not necessarily lead to more reliable causal gene discovery. With the goal of improving discoveries without increasing false positives, we develop and compare multiple transcriptomic imputation approaches using the most recent GTEx release of expression and splicing data on 17,382 RNA-sequencing samples from 948 post-mortem donors in 54 tissues. We find that informing prediction models with posterior causal probability from fine-mapping (dap-g) and borrowing information across tissues (mashr) can lead to better performance in terms of number and proportion of significant associations that are colocalized and the proportion of silver standard genes identified as indicated by precision-recall and ROC (Receiver Operating Characteristic) curves. All prediction models are made publicly available at predictdb.org.

Keywords: GWAS, PrediXcan, QTL integration

Introduction

Transcriptome studies with whole genome interrogation characterize genetic effects on gene expression traits. These mechanisms help elucidate the function of loci identified in genome-wide association studies (GWAS) by identifying potential causal genes that link genetic variation with complex traits (Albert & Kruglyak, 2015; Aguet et al., 2019; Huckins et al., 2019; Mancuso et al., 2018; Gusev et al., 2018).

In particular, the Genotype-Tissue Expression (GTEx) project (Aguet et al., 2019) has sequenced whole genomes from 948 organ donors and generated RNA-seq data across 52 tissues and 2 cell lines. Results and tools derived from this comprehensive catalog of transcriptome variation have enabled a myriad of applications such as drug repurposing (So et al., 2017) and clinical discoveries in cancer susceptibility genes (Wu et al., 2018), to name a few.

The general consensus that many noncoding variants associated with complex traits exercise their action via gene expression regulation has motivated the development of imputed transcriptome association approaches such as PrediXcan (Gamazon et al., 2015; Barbeira et al., 2018), TWAS/FUSION (Gusev et al., 2016) and UTMOST (Hu et al., 2019). In essence, these methods predict gene expression traits based on individuals’ genotypes and test how these predictions correlate with complex traits.

Reliable prediction models for gene expression traits are key components of imputed transcriptome association studies. Given the predominantly sparse genetic architecture of gene expression traits (Wheeler et al., 2016) and overall robustness and performance (Huckins et al., 2019; Fryett, Inshaw, Morris, & Cordell, 2018), Elastic Net (Friedman, Hastie, & Tibshirani, 2010) has become the algorithm of choice for predicting transcriptome variation.

Despite Elastic Net’s many advantages such as robustness and sparsity, we hypothesized that transcriptome imputation can be improved by leveraging biologically informed methods. Recent efforts (Hu et al., 2019) have exploited the high degree of eQTL sharing across tissues (Aguet et al., 2017) by leveraging cross-tissue patterns in the broad GTEx panel to improve prediction performance, most notably in tissues with small sample sizes. Also, important methodological progress in fine-mapping (Wen, Pique-Regi, & Luca, 2017; Wang, Sarkar, Carbonetto, & Stephens, 2018) and an adaptive shrinkage method that improves effect size estimates across multiple experiments (Urbut, Wang, Carbonetto, & Stephens, 2019) provide opportunities to further improve the quality of downstream associations.

In this article, we analyze different transcriptome prediction strategies and compare their strengths both in prediction performance and downstream phenotypic associations.

Proximity and linkage disequilibrium (LD) between distinct causal variants can lead to noncausal associations between predicted expression and complex traits (Barbeira et al., 2018; Wainberg et al., 2019). Since the ultimate goal of imputed transcriptome studies is to identify causal genes, our main focus here is to improve discoveries, with less emphasis on expression prediction performance. We also applied the same model building techniques to alternative splicing traits quantified with LeafCutter (Y. I. Li et al., 2018). We make all results, prediction models and software available to the research community.

Results

To identify optimal techniques for transcriptomic imputation, we have built models to predict genetically regulated expression (GREx) using four different approaches on GTEx expression and splicing data (release version 8). To reduce LD misspecification problems, most apparent when applying summary statistics-based versions of PrediXcan on GWAS of European populations, we used only European samples.

We restricted the analysis to genes that are annotated as protein coding, lncRNA, and pseudogenes in GENCODE version 26 (Frankish et al., 2019). We included 49 different tissues with sample sizes ranging from 65 (Kidney Cortex) to 602 (Muscle Skeletal).

The first strategy used the Elastic Net (Friedman et al., 2010) algorithm to compute predictions as described previously (Gamazon et al., 2015; Wheeler et al., 2016). For every gene available in each tissue, this strategy used variants from the HapMap CEU track in a window ranging from 1 Mb upstream of the transcription start site to 1 Mb downstream of the transcription end site as explanatory variables. Only those models achieving thresholds of cross-validated correlation ρ > 0.1 and prediction performance p-value < 0.05 were kept. We will refer to this family as the EN-M models.

The second strategy used CTIMP (Cross Tissue gene expression IMPutation) (Hu et al., 2019). CTIMP uses a regularized, generalized linear regression algorithm to fit expression from different tissues simultaneously. CTIMP optimizes a cost function including a within-tissue Lasso penalty and a cross-tissue group Lasso penalty, thus inheriting Lasso-like behaviour that is less sparse than Elastic Net. We used the same variants from the EN-M strategy (HapMap CEU track, same windows around each gene), and identical correlation threshold (ρ > 0.1) and cross-validated prediction performance threshold (p < 0.05) to accept models. We will refer to this family as the CTIMP-M family. We verified that this method’s performance is not significantly improved by using all available GTEx variants, as explained in the supplementary material.

The third strategy used the posterior inclusion probability (PIP) of a variant being causal for gene expression as estimated by the Bayesian fine mapping method dap-g (Deterministic Approximation of Posteriors) (Wen, Lee, Luca, & Pique-Regi, 2016). First, for every gene, we restricted to variants with posterior inclusion probabilities PIP > 0.01. Since dap-g clusters variants by their LD, we kept the variant with highest PIP from each cluster to avoid redundant explanatory variables. Then, the selected variants were fed into the Elastic Net algorithm, scaling each variant’s effect size penalty by a factor of 1 − PIP (i.e. more likely variants are less penalized). Only those models achieving good enough cross-validated prediction performance (p-value < 0.05) and correlation (ρ > 0.1) were kept. We will refer to this family as DAPGW-M (dap-g weighted). As discussed later, the cross-validated prediction performance of this approach can’t be fairly compared to EN-M and CTIMP-M because the pre-selection of fine-mapped variants is based on the same underlying data.
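One way to write down the PIP-scaled penalty described above is sketched here. It assumes the glmnet-style parameterization of Elastic Net, with mixing parameter α and overall penalty λ; the symbols N, y_i, x_ij, β_0 and β_j are notation introduced for this illustration and do not appear in the text, and software may additionally rescale the per-variant weights internally.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2N}\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j} x_{ij}\,\beta_j\Bigr)^{2}
  \;+\; \lambda \sum_{j} \bigl(1 - \mathrm{PIP}_j\bigr)
  \Bigl[\alpha\,\lvert\beta_j\rvert + \tfrac{1-\alpha}{2}\,\beta_j^{2}\Bigr]
```

Under this form, a variant with PIP close to 1 is almost unpenalized, while a variant with PIP close to the 0.01 inclusion threshold receives nearly the full Elastic Net penalty.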

The fourth strategy used mashr (Multivariate Adaptive Shrinkage in R) (Urbut et al., 2019) effect sizes from variants selected by dap-g as in the DAPGW-M approach. More specifically, fine-mapped variants were selected as in the DAPGW-M approach but the weights were obtained by applying mashr to the marginal effect sizes and standard errors from the GTEx eQTL analysis (Aguet et al., 2019). Unlike the previous methods, this approach does not fit into a cross-validation strategy and therefore lacks a natural prediction performance measure. Only eGenes with at least one cluster of variants achieving dap-g PIP > 0.1 were kept. We will refer to this family as MASHR-M.

We did not consider the BSLMM family of methods for transcriptome prediction. These models contain both a sparse and a polygenic component. The latter is likely to induce LD contamination (Barbeira et al., 2018) and doesn’t reflect the sparse architecture of expression traits (Wheeler et al., 2016).

We also applied the EN-M and MASHR-M methods to alternative splicing quantification from LeafCutter (Y. I. Li et al., 2018) and made them readily available to the research community. These models were extensively used in Aguet et al. (2019) and Barbeira, Bonazzola, et al. (2019).

Summary of models

Given the differences in computational approach, not all prediction strategies generated models for every available gene-tissue pair. As can be seen in Fig. 1–A, EN-M yielded the smallest number of valid models, for 281,848 gene-tissue pairs. CTIMP-M produced 340,104 valid models, 21% more than EN-M, as expected from its integration of multiple tissues’ information.

Fine-mapping-based methods generated even more models: 518,537 from DAPGW-M (84% more than EN-M) and 686,241 from MASHR-M (143% more than EN-M). Please note that given the different criteria used to accept a model as valid, simple counts of available models should not be considered a measure of performance.
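The relative increases quoted above follow directly from the model counts. The short check below reproduces them; the counts are taken from the text, while the function and variable names are chosen only for this illustration.

```python
# Valid gene-tissue models per family, as reported in the text.
model_counts = {
    "EN-M": 281_848,
    "CTIMP-M": 340_104,
    "DAPGW-M": 518_537,
    "MASHR-M": 686_241,
}


def percent_more(count, baseline):
    """Relative increase of `count` over `baseline`, in percent."""
    return 100.0 * (count - baseline) / baseline


baseline = model_counts["EN-M"]
increases = {
    family: round(percent_more(count, baseline))
    for family, count in model_counts.items()
    if family != "EN-M"
}
print(increases)
```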

We show the distribution of cross-validated prediction performances in Fig. 1–B. We include 5 representative tissues ordered by increasing sample size (kidney, brain - hippocampus, brain - cerebellum, breast, skeletal muscle). In order to perform a uniform comparison, we used only gene-tissue pairs available to all model families. CTIMP-M showed better prediction performance than EN-M on tissues with smaller sample size, but performed similarly on tissues with larger sample sizes. We attribute this to CTIMP’s design, which leveraged all existing samples’ genotypes in the tissues of smaller expression sample size. MASHR-M models had no natural prediction performance measure and thus are excluded from these panels. DAPGW-M is presented for completeness but its comparison to EN-M and CTIMP-M is unfair. We show in Sup. Fig 1 the cross-validated prediction performances for all genes in each family.

Finemapping-based models perform well in independent expression dataset

Next, we sought to validate the models’ predictions in an independent RNA-seq dataset. We analyzed data from the GEUVADIS project (Lappalainen et al., 2013), which includes 341 samples of European ancestry with genotype and LCL (lymphoblastoid cell lines) expression data. We predicted expression using GTEx LCL models from the 4 strategies, and compared with measured expression levels. Fig. 2–A shows the number of genes that each family was able to predict. DAPGW-M and MASHR-M had the largest number of predictable genes, followed by CTIMP-M and EN-M.

Fig 2. — Panel A shows the number of genes predicted in GEUVADIS cohort using the LCL models from each of the four strategies. MASHR-M had the most models available, followed in decreasing order by DAPGW-M, CTIMP-M and EN-M.

Panel B shows the distribution of prediction performances (Spearman ρ) for genes available to all four families. DAPGW-M and MASHR-M performed slightly but consistently better than EN-M and CTIMP-M. We attributed the small differences to the GTEx LCL tissue having a small sample size (n = 115 individuals), much lower than the 341 available in GEUVADIS. Also, the intersection of genes available to all 4 strategies is dominated by those present in Elastic Net, the smallest set; and genes that can be modelled with Elastic Net tend to be the ones with less complicated patterns of variation.

To compare prediction performances, we used Spearman’s rank correlation coefficient ρ as a robust measure that handles the scale and complexity differences between real GEUVADIS expression data and predicted expression levels. Fig. 2–B shows the distribution of prediction performance (Spearman’s ρ) for genes present in all four methods on the LCL tissue. We observed that all four families achieved similar levels of performance, with MASHR-M, DAPGW-M, and CTIMP-M faring slightly but consistently better than EN-M. Mean correlations were 0.028 (se=0.006), 0.027 (se=0.006), and 0.018 (se=0.006) points larger for MASHR-M, DAPGW-M, and CTIMP-M respectively compared to EN-M.
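To illustrate why a rank-based measure is insensitive to the scale differences mentioned above, the following minimal sketch computes Spearman’s ρ as the Pearson correlation of average ranks. The expression values are invented for the example, and the helper names are not taken from the authors’ software.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values: predictions on an arbitrary scale, measurements on another.
predicted = [0.1, 0.4, 0.2, 0.9, 0.5]
measured = [10.0, 300.0, 25.0, 5000.0, 120.0]
rho = spearman(predicted, measured)
print(rho)
```

Because only ranks enter the calculation, any monotone rescaling of the measured values (for example a log transform) leaves ρ unchanged.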

We attribute the small performance differences to low power, since the GTEx LCL tissue has a sample size of n = 115 individuals, much lower than the 341 available in GEUVADIS.

Fine-mapping improves number and colocalization of associations

Next, we assessed whether any of these models perform better at identifying causal genes. We considered the number and proportion of colocalized genes among the significant ones as measures of association quality.

We used the four families of models to correlate predicted expression with 87 phenotypes through 49 tissues using the summary version of PrediXcan. Results of applying the EN-M models to GWAS summary statistics, harmonized and imputed to GRCh38 (Schneider et al., 2017), were presented in Aguet et al. (2019). In this section, we say that a gene-tissue pair is significant if it achieves a p-value below the Bonferroni-corrected threshold (0.05/number of gene-tissue pairs) within each trait.

We used enloc (Wen et al., 2017) results published in Aguet et al. (2019) to assess the colocalization status of GWAS and transcriptomic traits as evidence for a shared underlying mechanism. Briefly, enloc computes the “regional colocalization probability” (rcp) that a trait shares causal variants with a gene’s expression (or an intron’s splicing quantification), within a GWAS region and the overlapping gene’s cis-window. We say that a gene-tissue pair is “colocalized” with a trait if it achieves an enloc regional colocalization probability rcp > 0.5. Note that rcp ≤ 0.5 should not be interpreted as a false association; rather, it only means that there is not enough evidence of colocalization. See the discussion on the conservative nature of colocalization approaches in Barbeira, Bonazzola, et al. (2019).

We say that a gene-tissue pair that is both significant and colocalized is a “prioritized” detection or candidate. To simplify interpretation of results across multiple tissues, we count the number of unique genes among the prioritized gene-tissue pairs for each trait.
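The definitions above (Bonferroni significance within a trait, enloc rcp > 0.5, and counting unique genes among prioritized gene-tissue pairs) can be sketched as follows. This is an illustration only: the record layout, gene names and numeric values are hypothetical and not drawn from the published results.

```python
from collections import namedtuple

# One association result for a single trait; all values below are made up.
Result = namedtuple("Result", "gene tissue pvalue rcp")


def prioritized_genes(results, alpha=0.05, rcp_threshold=0.5):
    """Unique genes with a gene-tissue pair that is significant and colocalized."""
    threshold = alpha / len(results)  # 0.05 / number of gene-tissue pairs
    return {
        r.gene
        for r in results
        if r.pvalue < threshold and r.rcp > rcp_threshold
    }


toy_results = [
    Result("GENE_A", "Liver", 1e-9, 0.80),        # significant and colocalized
    Result("GENE_A", "Whole_Blood", 1e-8, 0.10),  # significant only
    Result("GENE_B", "Liver", 1e-9, 0.20),        # significant only
    Result("GENE_C", "Liver", 0.03, 0.90),        # colocalized only
    Result("GENE_D", "Whole_Blood", 1e-6, 0.60),  # significant and colocalized
]
candidates = sorted(prioritized_genes(toy_results))
print(candidates)
```

With five gene-tissue pairs the Bonferroni threshold is 0.05/5 = 0.01, so GENE_C is not significant despite its high rcp, and GENE_A is counted once even though it is tested in two tissues.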

We found that MASHR-M typically yields more candidate genes. On average 28.3 (sd=44.4), 25.5 (sd=40.3), 36.7 (sd=57.8), 36.6 (sd=57.0) genes were identified with EN-M, CTIMP-M, DAPGW-M, and MASHR-M, respectively. We display the numbers of detections for each trait in Fig. 3, through Q-Q plots comparing MASHR-M to the other model families. We observe in Fig. 3–A that the fine-mapping informed families of models, DAPGW-M and MASHR-M, yielded a similar number of candidates per trait, consistently larger than EN-M and CTIMP-M. When comparing the fraction of colocalized genes among significant genes (Fig. 3–B), MASHR-M yielded a larger proportion of colocalized genes compared to the other 3 families. On average 8.18% (se=0.013), 8.88% (se=0.013), 9.01% (se=0.013), 11.7% (se=0.013) of identified genes with EN-M, CTIMP-M, DAPGW-M, and MASHR-M, respectively, were colocalized. In general, we observed that associations obtained through both DAPGW-M and MASHR-M models tend to agree (see Supplementary Figure 2 as an example).

Fig 3. — **Panel A** shows a Q-Q plot for the number of colocalized, significant genes per trait. Fine-mapping-informed models (DAPGW-M and MASHR-M) achieved similar numbers of colocalized detections, both slightly higher than EN-M and CTIMP-M.

**Panel B** shows a Q-Q plot for the fraction of colocalized genes among significant genes per trait. MASHR-M’s distribution is shifted towards higher proportions than the other families. We say a gene is significant if it achieves a Bonferroni-adjusted threshold of 0.05/number of available gene-tissue pairs, in at least one tissue. Likewise, we say a gene is colocalized if it achieves enloc rcp > 0.5 in any tissue. We say a gene is a candidate or “prioritized” detection if it is both significant and colocalized in any tissue.

We were thus led to favor MASHR-M, which produced the largest number of models, with larger number of colocalized, significant associations as well as higher proportions of colocalized associations among significant genes.

Enloc relies on the dap-g algorithm itself as a component, so the fraction of colocalized genes could have been biased towards dap-g informed methods. To make sure that the use of dap-g is not driving the improved colocalization rate of MASHR-M over the other strategies, we verified the performance using another colocalization method, coloc (Giambartolomei et al., 2014).

We observed that MASHR-M still had a better rate of colocalization among significant associations, albeit with smaller differences, as can be seen in Supplementary Figure 3. This is probably in part due to coloc’s reduced power and limiting assumption of a single causal variant (see Barbeira, Bonazzola, et al., 2019, for details).

Finemapping improves identification of silver standard genes

As an independent way to assess each prediction strategy’s ability to identify causal genes, we framed the problem as one of causal gene prediction and used standard prediction performance measures such as Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. This avoids relying on ad hoc significance or colocalization thresholds.

As proxies for causal genes, we leveraged two different “silver standards” as described in Barbeira, Bonazzola, et al. (2019). The first one, based on the OMIM (Online Mendelian Inheritance in Man) database (Amberger, Bocchini, Scott, & Hamosh, 2019), features 1592 known gene-trait associations. The second one is based on rare variant association studies (Marouli et al., 2017; Liu et al., 2017; Locke et al., 2019) and contains 101 gene-trait associations.

We restricted our analysis to gene-trait pairs in the vicinity of the corresponding traits’ GWAS loci since we did not expect any of the methods to detect reliable signals elsewhere. We used approximately-independent LD regions (Berisa & Pickrell, 2016) to define vicinity.

Using absolute values of z-scores as association score for each strategy, we assessed their ability to ‘predict’ the silver standard gene-trait associations. We show in Fig. 4 the ROC and Precision-Recall curves on OMIM- and rare variant-based silver standards.
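As a rough illustration of scoring gene-trait pairs by |z| against a silver standard, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a silver-standard pair receives a higher score than a non-silver-standard pair. The z-scores and labels are invented, and this is not the pROC-based procedure used for the reported results.

```python
def roc_auc(scores, labels):
    """AUC as P(positive score > negative score); ties count one half."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy gene-trait pairs within GWAS loci: association z-score and whether
# the pair belongs to the (hypothetical) silver standard.
zscores = [5.2, -4.1, 0.3, -1.0, 2.5, -0.2]
in_silver_standard = [True, True, False, False, False, True]

association_scores = [abs(z) for z in zscores]
auc = roc_auc(association_scores, in_silver_standard)
print(auc)
```

Taking absolute values means that strong negative and strong positive associations are ranked alike, which matches the use of |z| as the association score.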

Using the OMIM-based silver standard (Fig. 4–A and -C), we observed that MASHR-M strategy outperforms the other strategies, with DAPGW-M a close second.

Using the rare-variant-based silver standard (Fig. 4–B and -D), we observed that all four strategies are able to detect known causal genes. However, the limited size of this standard did not allow us to distinguish between the four families.

When considering the area under the ROC curve (AUC) for the combined OMIM and rare-variant-based silver standards, we computed the point estimate and estimated the standard errors using a bootstrap approach (implemented in pROC; Robin et al., 2011). We observed that the MASHR-M models had the highest AUC of all of the model families. Differences in AUC between MASHR-M and CTIMP-M and EN-M models were 0.0636 (se=0.0307) and 0.0682 (se=0.0287) respectively, providing evidence that MASHR-M models are better equipped for detecting known genes and reinforcing our choice of MASHR-M as the best option.

Importance of imputation of missing summary statistics in practice

The prediction models’ usefulness depends on the availability of their variants in the GWAS of interest. Publicly available GWAS use different sequencing and genotyping techniques, based on different genotype imputation panels and human genome release versions, so that the lists of available variants vary wildly across traits. Thus, a GWAS might lack particular variants from a prediction model, so that the model can’t properly infer variation patterns as shown in Barbeira, Pividori, et al. (2019). Since many fine-mapped variants in the GRCh38-based GTEx study can be absent in a typical GWAS, we sought to assess the impact of variant compatibility in real applications.

We compared S-PrediXcan results from MASHR-M models on 69 publicly available GWAS with two preprocessing schemes:

Harmonization only (No imputation): simple harmonization of variants by lifting over genomic coordinates from the GWAS to match the GRCh38-based GTEx prediction models, and then filtering for matching alleles (“Harmonization” for short)
Harmonization and imputation of missing summary statistics (“Imputation” for short) on harmonized GWAS.

The 69 traits included in this analysis are those among the 87 traits not belonging to the Rapid GWAS project, to prevent the highly homogeneous Rapid GWAS datasets from dominating comparisons.

We show in Fig. 5 the effect of these preprocessing schemes on various performance metrics, segregated by human genome release version (hg17, hg18, hg19).

Fig. 5–A summarizes the increase in number of gene associations computed for every trait-tissue pair. For hg17- and hg18-based GWAS, the gain through summary-statistics imputation is almost threefold. Some hg19-based GWAS traits without imputation yield a good enough number of computable genes.

Fig. 5–B shows the distribution of median fraction of model SNPs also present in the GWAS, within each tissue-trait combination. Roughly 60% of models’ variants are present in hg17- and hg18-based GWAS without imputation; this percentage is substantially higher for hg19-based GWAS without imputation. Imputing summary statistics increases this median percentage to 100% on all tissue-trait combinations across the analyzed human genome release versions.
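The quantity summarized in this panel, the median fraction of model variants that are also present in a GWAS, can be sketched as below. The variant identifiers, gene names and data layout are hypothetical and chosen only to make the calculation concrete.

```python
from statistics import median


def fraction_present(model_variants, gwas_variants):
    """Fraction of a model's variants that are available in the GWAS."""
    model_set = set(model_variants)
    if not model_set:
        raise ValueError("model has no variants")
    return len(model_set & set(gwas_variants)) / len(model_set)


# Toy prediction models for one tissue (gene -> variants used as predictors).
tissue_models = {
    "GENE_A": ["v1", "v2", "v3", "v4"],
    "GENE_B": ["v2", "v5"],
    "GENE_C": ["v6"],
}
# Variants available in a toy harmonized GWAS, before imputation.
gwas_variants = {"v1", "v2", "v3", "v6", "v9"}

fractions = {
    gene: fraction_present(variants, gwas_variants)
    for gene, variants in tissue_models.items()
}
median_fraction = median(fractions.values())
print(fractions, median_fraction)
```

After summary-statistics imputation every model variant would be present in the GWAS set, so each fraction, and hence the median, would equal 1.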

Fig. 5–C shows the increase in number of genes detected per trait. As in the previous panels, the increase is more noticeable for hg17- and hg18-based GWAS, while smaller for hg19-based studies.

Therefore, we recommend always performing variant harmonization, given its low complexity and time requirements, followed by summary-statistics imputation if possible. For newer GWAS with modern sequencing and genotyping, summary-statistics imputation may not be as critical, depending on their intersection with model variants.

Discussion

Through extensive analysis of different model training schemes, we conclude that using fine-mapping information (from dap-g) and cross-tissue patterns (from mashr) improves the reliability of causal gene detection. These models (MASHR-M) yield more detections when integrating GWAS and eQTL studies and show improved performance when validating results in a silver standard of known gene-to-trait associations (OMIM database). We make all prediction models and results publicly available.

Special consideration must be paid to how well each model’s variants intersect GWAS’ variants. Fine-mapping-informed models are sparse and parsimonious. This could be a hurdle when the fine-mapped variants of import are missing or have low imputation quality in a GWAS, as is often the case with older studies. In this scenario, our recommendation is to impute any missing variants. If that is not possible, the association with the incomplete prediction may still detect the underlying association albeit with reduced power. The MASHR-M and DAPGW-M models have predictors that belong to different LD clusters and the effect sizes are based on marginal regression and smoothing across tissues such that missing one of the “causal clusters” is unlikely to add false positives. The alternative is falling back to models such as CTIMP-M, defined on a robust set of variants available to most GWAS, at the cost of decreased performance (detection and prediction). EN-M additionally features some “built-in” redundancy: for a set of variants in LD among each other, they all tend to be included in a model with the effect spread between them.

While our recommended MASHR-M method offers several benefits compared to existing approaches, there is still room for improvement. Potential developments could rely on fine-mapping methods that jointly incorporate cross-tissue patterns, or consensus between different fine-mapping approaches. Also, epigenetic information has been shown to improve transcriptome prediction (Zhang et al., 2019){Zhang2019EpiXcan} as well. Future improvements should incorporate this epigenetic information and other biologically-informed annotations jointly.

Our validation in silver standards, especially our difficulty interpreting the results from the rare-variant-based silver standard, also illustrates the need for well-curated, large databases of known gene-to-phenotype associations to assess performance of either new or improved methods.

In conclusion, we present here a method for predicting the genetically regulated component of transcriptomic traits with superior performance both in terms of prediction performance and gene-trait association detection.

Methods

We executed all methods using open source software running in a high performance cluster. We release all of our code and the data analyzed in this paper to ease reproducibility and accessibility.

GTEx data processing

We downloaded GTEx data for the version 8 release from dbGaP (accession number phs000424.v8.p1). These data arise from 17,382 RNA-seq samples from 54 tissues of 948 post-mortem subjects, aligned to the GRCh38 assembly. Primary and extended results generated by consortium members are available on the Google Cloud Platform storage accessible via the GTEx Portal (see URLs).

899 whole-genome sequencing (WGS) samples were analyzed, 68 of them at an average coverage of 30x on HiSeq2000, and the rest on HiSeqX. 866 GTEx donors’ samples were included in the downstream variant call files (VCF), after excluding one each from 30 duplicate samples and 3 donors. Among these, 838 subjects with RNA-seq data were included for QTL mapping and analysis.

Whole transcriptome RNA-Seq data were aligned using STAR (v2.5.3a; Dobin et al., 2013). For the STAR index, GENCODE v26 was used with sjdbOverhang 75 for the 76-bp paired-end sequencing protocol. Default parameters were used for RSEM (see URLs; B. Li & Dewey, 2011) index generation. GTEx utilized Picard (see URLs) to mark and remove potential PCR duplicates and RNA-SeQC (DeLuca et al., 2012) to process post-alignment quality control. RSEM was then used for per-sample transcript quantification. Subsequently, read counts were normalized between samples using TMM (Robinson & Oshlack, 2010). For eQTL analyses, latent factor covariates were calculated using PEER (Stegle, Parts, Durbin, & Winn, 2010) as follows: 15 factors for N < 150 per tissue; 30 factors for 150 ≤ N < 250; 45 factors for 250 ≤ N < 350; and 60 factors for N ≥ 350. Expression phenotypes were adjusted for unwanted variation using covariates such as gender, sequencing platform and PCR protocol, the top 5 principal components from genotype data, and said PEER factors. Finally, fastQTL (Ongen, Buil, Brown, Dermitzakis, & Delaneau, 2016) was used for cis-eQTL mapping in each tissue. Only protein-coding, lincRNA, and antisense biotypes as defined by GENCODE v26 were considered for further analyses. To study alternative splicing, GTEx applied LeafCutter (version 0.2.8; Y. I. Li et al., 2018) using default parameters to quantify splicing QTLs in cis with intron excision ratios (Aguet et al., 2019).
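The sample-size rule for the number of PEER factors stated above can be written as a small lookup. The function name is chosen for this illustration and is not part of the GTEx pipeline.

```python
def peer_factor_count(n_samples):
    """Number of PEER factors for a tissue with `n_samples` samples,
    following the thresholds stated in the text."""
    if n_samples < 150:
        return 15
    if n_samples < 250:
        return 30
    if n_samples < 350:
        return 45
    return 60


for n in (100, 150, 249, 250, 349, 350, 600):
    print(n, peer_factor_count(n))
```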

We used the dap-g (Wen et al., 2016), enloc (Wen et al., 2017) and coloc (Giambartolomei et al., 2014) results published in Aguet et al. (2019).

GTEx expression and splicing modelling

We used the same genotypes, phenotypes, covariates, gene annotations and variant annotations from the main GTEx analysis.

When building prediction models, we imposed an additional restriction: we used only samples of European ancestry for the sake of leveraging a well defined population LD structure. Only variants with MAF > 0.01 in these samples were included. We used 49 tissues with sample sizes ranging from 65 (Kidney Cortex) to 602 (Muscle Skeletal).

This ancestry restriction mitigated problems due to LD mismatch when integrating with most publicly available GWAS summary statistics, which are conducted on predominantly European populations. Prediction models in other ancestries are important, and we are currently dedicating substantial effort to creating and analyzing such models. However, non-European models are beyond the scope of this paper.

We only generated models for genes annotated in GENCODE v26 as protein coding, lncRNA or pseudogenes.

Elastic Net models

We fitted an Elastic Net model for each gene-tissue pair with available adjusted expression data. We restricted the set of variants to those present in the HapMap 3 CEU track (International HapMap 3 Consortium et al., 2010) with MAF > 0.01. The motivation behind this choice was to restrict the analysis to a robust set of SNPs that has significant intersection with most publicly available GWAS summary statistics. For every gene, variants within 1 Mb upstream of the gene’s transcription start site and 1 Mb downstream of the transcription end site were used as explanatory variables for gene expression.

We used the R package glmnet (Friedman et al., 2010), with mixing parameter α = 0.5 and penalty parameter chosen through 10-fold cross validation.

Prediction performance was estimated using a nested cross-validation approach. Expression was predicted out-of-sample for each fold, with Elastic Net parameters estimated only within training data, and the correlations to observed values at each fold were combined via Fisher’s transformation and Stouffer’s method. Only those models with mean Pearson correlation across 10 folds ρ > 0.1 and nested cross-validated correlation test p < 0.05 were kept.
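A minimal sketch of combining per-fold correlations through Fisher’s transformation and Stouffer’s method follows. It makes several assumptions that the text does not specify: each fold’s z-score is taken as atanh(r)·sqrt(n − 3), folds are weighted equally, and the p-value is one-sided. The fold correlations and fold sizes are invented, and only five folds are shown for brevity.

```python
import math


def fisher_z_score(r, n):
    """z-score for testing rho = 0 from a correlation r on n samples."""
    return math.atanh(r) * math.sqrt(n - 3)


def stouffer_combine(z_scores):
    """Equal-weight Stouffer combination of independent z-scores."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def one_sided_pvalue(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Toy out-of-sample correlations and held-out sample sizes per fold.
fold_correlations = [0.30, 0.10, 0.25, 0.20, 0.15]
fold_sizes = [40, 40, 40, 40, 40]

fold_z = [fisher_z_score(r, n) for r, n in zip(fold_correlations, fold_sizes)]
combined_z = stouffer_combine(fold_z)
p_value = one_sided_pvalue(combined_z)
mean_rho = sum(fold_correlations) / len(fold_correlations)

# Acceptance rule from the text: mean correlation > 0.1 and p < 0.05.
keep_model = mean_rho > 0.1 and p_value < 0.05
print(mean_rho, combined_z, p_value, keep_model)
```

Combining fold-level statistics in this way yields a single test for the whole cross-validation run while keeping each fold’s prediction strictly out-of-sample.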

We refer to these models as EN-M.

CTIMP models

We applied the CTIMP (Hu et al., 2019) framework to the same data used for the EN-M models in the previous section. This method fits expression for a gene in multiple tissues simultaneously through a regularized linear model, using a Lasso penalty within each tissue and a group-Lasso penalty for cross-tissue patterns. As it internally uses genotypes from all samples available across all tissues, we expect improvements over EN-M to be larger for tissues of smaller sample size, where EN-M deals with a less informative LD structure among variants.

We performed five-fold cross-validation for model tuning and evaluation following the authors’ description. We computed cross-validated correlation measures across folds as in the previous method, and kept those models meeting the thresholds of cross-validated correlation ρ > 0.1 and p-value p < 0.05. As in EN-M, we restricted model training to variants in the HapMap 3 CEU track with MAF > 0.01; this was necessary because using all variants proved too computationally expensive, since CTIMP consumes large amounts of memory and processing time. We show briefly in the Supplement (Supplementary Figures 4, 5, 6) that this additional restriction has a negligible effect on model training performance and prediction.

We refer to these models as CTIMP-M.

Elastic Net informed by dap-g results

We also trained models via the Elastic Net algorithm using fine-mapping information to refine the list of variants used as explanatory variables and to give more weight to variants with a higher probability of affecting expression phenotypes. To this end, we used dap-g’s posterior inclusion probability (PIP) of a variant affecting gene expression to select explanatory variables, without restricting to variants in the HapMap 3 CEU track. For every gene, we used all variants in the gene’s cis-window with MAF > 0.01 and PIP > 0.01. Since dap-g groups variants into clusters according to LD, we kept the top variant (by PIP) per cluster to avoid redundant variables. Reasoning that variants with higher PIP should have more influence on the model, we multiplied each variant’s penalty term in the Elastic Net regularization by a factor of 1 − PIP, so that high-PIP variants are penalized less. We used the same thresholds as in the previous subsections (ρ > 0.1 and p-value p < 0.05) to select models with acceptable prediction performance.
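The selection and weighting steps above can be illustrated with a short Python sketch. The record layout (dictionaries with `variant_id`, `cluster_id`, `pip` and `maf` keys) and the function names are hypothetical, and every variant is assumed to belong to exactly one dap-g cluster. In glmnet, per-variable penalty multipliers of this kind can be supplied through the `penalty.factor` argument; whether the authors used exactly that route is an assumption.

```python
def select_dapg_variants(records, min_maf=0.01, min_pip=0.01):
    """Keep, per dap-g cluster, the variant with the highest PIP.

    Only variants with MAF > min_maf and PIP > min_pip are considered.
    """
    best_per_cluster = {}
    for rec in records:
        if rec["maf"] <= min_maf or rec["pip"] <= min_pip:
            continue
        current = best_per_cluster.get(rec["cluster_id"])
        if current is None or rec["pip"] > current["pip"]:
            best_per_cluster[rec["cluster_id"]] = rec
    # Sort for a deterministic order of explanatory variables.
    return sorted(best_per_cluster.values(), key=lambda r: r["variant_id"])


def penalty_factors(selected):
    """Per-variant multiplier of the Elastic Net penalty: 1 - PIP."""
    return {rec["variant_id"]: 1.0 - rec["pip"] for rec in selected}
```

A variant with PIP close to 1 is therefore almost unpenalized, while a variant just above the PIP threshold is penalized nearly as in a standard Elastic Net.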

We refer to these models as DAPGW-M.

mashr-based models

Finally, we explored an entirely different algorithm to determine the prediction models. We ran multivariate adaptive shrinkage in R (mashr) (Urbut et al., 2019) to estimate the models’ effect sizes by leveraging cross-tissue variation while allowing for sparse and possibly correlated effects in a Bayesian framework. We used mashr on the same set of variants as the DAPGW-M models. We kept models only for eGenes, and effect sizes only for variants with PIP > 0.01 (from dap-g) at each gene-tissue pair. There is no natural prediction performance measure in this scenario, as cross-validation was not performed.

We refer to these models as MASHR-M.

GEUVADIS data processing

We used the GEUVADIS LCL expression study for an independent validation of prediction performance. We obtained GEUVADIS expression data and sample information from the European Bioinformatics Institute web portal at https://www.ebi.ac.uk/. We obtained genotype data aligned to the GRCh38 assembly from the International Genome Sample Resource web portal at http://www.internationalgenome.org. We restricted the data to individuals of European ancestry, yielding 341 samples.

For each of the four model training schemes described above (EN-M, CTIMP-M, DAPGW-M, MASHR-M), we predicted expression through PrediXcan (Gamazon et al., 2015) on GEUVADIS genotypes using the GTEx LCL models, and correlated the predictions with the observed expression.
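The validation step amounts to a weighted sum of allele dosages followed by a correlation with measured expression. A minimal Python sketch of that idea is shown below; it is not the PrediXcan software itself, the function names are hypothetical, and variants missing from the genotype data are simply skipped, which is an assumption.

```python
import math


def predict_expression(weights, dosages):
    """Predicted expression per individual: sum over variants of weight * dosage.

    weights: {variant_id: model weight}
    dosages: {variant_id: [dosage for each individual]}
    Variants without genotype data are skipped (assumption).
    """
    n_individuals = len(next(iter(dosages.values())))
    predicted = [0.0] * n_individuals
    for variant_id, weight in weights.items():
        if variant_id not in dosages:
            continue
        for i, dosage in enumerate(dosages[variant_id]):
            predicted[i] += weight * dosage
    return predicted


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

For a given gene, `pearson(predict_expression(weights, dosages), observed)` gives the prediction performance in the independent cohort.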

GWAS processing and integration

We examined 87 GWAS from a heterogeneous set of traits first presented in the GTEx v8 study (Barbeira, Bonazzola, et al., 2019; Aguet et al., 2019). These traits were selected to support a phenome-wide study of the impact of gene regulation. Given the heterogeneous landscape of the GWAS, with intricate differences in data processing protocols and underlying human genome reference versions, it was necessary to make the GWAS variants homogeneous and compatible with those from the GTEx study.

First, each GWAS’ variants were harmonized to the GTEx study’s variants by mapping genomic coordinates via liftover (Haeussler et al., 2018) (https://pypi.org/project/pyliftover) and keeping only variants with matching alleles. Then, GTEx variants with missing summary statistics in any GWAS were imputed with the BLUP method, a standard in the field (Lee, Bigdeli, Riley, Fanous, & Bacanu, 2013).
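The harmonization step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `lift` is a hypothetical callable standing in for the liftover mapping (returning a new `(chromosome, position)` pair or `None`), the GTEx alternate allele is taken as the effect allele, and a GWAS variant whose alleles are swapped relative to GTEx is treated as matching with its z-score sign flipped.

```python
def harmonize(gwas_rows, gtex_alleles, lift):
    """Map GWAS variants onto GTEx variants and keep those with matching alleles.

    gwas_rows: list of dicts with chromosome, position, effect_allele,
               non_effect_allele and zscore keys (hypothetical layout).
    gtex_alleles: {(chromosome, position): (ref, alt)} on the GTEx genome build.
    lift: callable (chromosome, position) -> (chromosome, position) or None.
    """
    harmonized = []
    for row in gwas_rows:
        locus = lift(row["chromosome"], row["position"])
        if locus is None or locus not in gtex_alleles:
            continue  # coordinate could not be mapped, or variant absent from GTEx
        ref, alt = gtex_alleles[locus]
        effect, other = row["effect_allele"], row["non_effect_allele"]
        if (other, effect) == (ref, alt):
            zscore = row["zscore"]
        elif (effect, other) == (ref, alt):
            zscore = -row["zscore"]  # alleles swapped: flip the effect direction
        else:
            continue  # alleles do not match the GTEx variant
        harmonized.append({
            "chromosome": locus[0], "position": locus[1],
            "ref": ref, "alt": alt, "zscore": zscore,
        })
    return harmonized
```

GTEx variants that remain without a summary statistic after this step are the ones that would be passed on to imputation.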

We executed S-PrediXcan for each of the 4 families of models (EN-M, CTIMP-M, DAPGW-M and MASHR-M) using 49 tissues, for a total of 17,052 (trait, model family, tissue) tuples. We integrated these with the enloc and coloc results published in Aguet et al. (2019).

When analyzing the versatility of the models and GWAS preprocessing schemes, we used only GWAS not belonging to the rapid GWAS project. We made this choice because the rapid GWAS project has a common, homogeneous variant set that could dominate comparisons.

AUC estimation for silver standard gene identification

Fig 4A and 4B show the receiver operating characteristic (ROC) curves for silver standard gene identification in OMIM and the rare-variant-based silver standard, respectively. To quantify the difference in performance among the model families, we first computed the ROC curve of the two standards combined for each family. We then computed the area under the ROC curve (AUC) using the standard trapezoidal approach. The standard errors of the AUC estimates were obtained by a bootstrap approach using 2000 replicates, as implemented in the pROC package (Robin et al., 2011).
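The AUC computation can be illustrated with a self-contained Python sketch. It is not the pROC implementation used in the paper: function names are hypothetical, labels are 1 for silver standard genes and 0 otherwise, both classes are assumed to be present, and resampling positives and negatives separately (stratified bootstrap) is a choice made here.

```python
import random
import statistics


def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Tied scores are processed together so they yield a single ROC point.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def bootstrap_auc_se(scores, labels, replicates=2000, seed=0):
    """Bootstrap standard error of the AUC (stratified resampling, an assumption)."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label]
    neg = [i for i, label in enumerate(labels) if not label]
    aucs = []
    for _ in range(replicates):
        sample = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        aucs.append(trapezoid_auc(roc_points([scores[i] for i in sample],
                                             [labels[i] for i in sample])))
    return statistics.stdev(aucs)
```

Here `scores` would be a per-gene association statistic from one model family, and comparing AUCs across families together with their bootstrap standard errors indicates whether the differences exceed sampling noise.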

Supplementary Material

sup FigS2

NIHMS1640792-supplement-sup_FigS2.pdf (54KB, pdf)

sup FigS3

NIHMS1640792-supplement-sup_FigS3.pdf (276.9KB, pdf)

sup FigS1

NIHMS1640792-supplement-sup_FigS1.pdf (310.5KB, pdf)

sup FigS6

NIHMS1640792-supplement-sup_FigS6.pdf (67.3KB, pdf)

sup FigS4

NIHMS1640792-supplement-sup_FigS4.pdf (393.7KB, pdf)

sup FigS5

NIHMS1640792-supplement-sup_FigS5.pdf (240.7KB, pdf)

NIHMS1640792-supplement-1.pdf (152.9KB, pdf)

Acknowledgements

We thank the donors and their families for their generous gifts of organ donation for transplantation, and tissue donations for the GTEx research project.

The consortium was funded by GTEx program grants: HHSN268201000029C (F.A., K.G.A., A.V.S., X.Li., E.T., S.G., A.G., S.A., K.H.H., D.Y.N., K.H., S.R.M., J.L.N.), 5U41HG009494 (F.A., K.G.A.), 10XS170 (Subcontract to Leidos Biomedical) (W.F.L., J.A.T., G.K., A.M., S.S., R.H., G.Wa., M.J., M.Wa., L.E.B., C.J., J.W., B.R., M.Hu., K.M., L.A.S., H.M.G., M.Mo., L.K.B.), 10XS171 (Subcontract to Leidos Biomedical) (B.A.F., M.T.M., E.K., B.M.G., K.D.R., J.B.), 10ST1035 (Subcontract to Leidos Biomedical) (S.D.J., D.C.R., D.R.V.), R01DA006227-17 (D.C.M., D.A.D.), Supplement to University of Miami grant DA006227 (D.C.M., D.A.D.), HHSN261200800001E (A.M.S., D.E.T., N.V.R., J.A.M., L.S., M.E.B., L.Q., T.K., D.B., K.R., A.U.), R01MH101814 (M.M-A., V.W., S.B.M., R.G., E.T.D., D.G-M., A.V.), U01HG007593 (S.B.M.), R01MH101822 (C.D.B.), U01HG007598 (M.O., B.E.S.), R01MH107666 (H.K.I.), P30DK020595 (H.K.I.). E.R.G. is supported by the National Human Genome Research Institute (NHGRI) under Award Number 1R35HG010718 and by the National Heart, Lung, and Blood Institute (NHLBI) under Award Number 1R01HL133559. E.R.G. has also significantly benefitted from a Fellowship at Clare Hall, University of Cambridge (UK) and is grateful to the President and Fellows of the college for a stimulating intellectual home. S.K.-H. is supported by the Marie-Sklodowska Curie fellowship H2020 Grant 706636. D.M.J.: T32HL00782. Y.Pa. is supported by the NHGRI award R01HG10067. A.R.H. was supported by the Massachusetts Lions Eye Research Fund Grant. H.E.W. is supported by NHGRI R15HG009569. Computation was performed at the high performance cluster of the Center for Research Informatics at the University of Chicago, funded by the Biological Sciences Division and CTSA UL1TR000430. Additional computation was performed with resources provided by the University of Chicago Research Computing Center.

We thank the International Genomics of Alzheimer’s Project (IGAP) for providing summary results data for these analyses. The investigators within IGAP contributed to the design and implementation of IGAP and/or provided data but did not participate in analysis or writing of this report. IGAP was made possible by the generous participation of the control subjects, the patients, and their families. The i-Select chips were funded by the French National Foundation on Alzheimer’s disease and related disorders. EADI was supported by the LABEX (laboratory of excellence program investment for the future) DISTALZ grant, Inserm, Institut Pasteur de Lille, Université de Lille 2 and the Lille University Hospital. GERAD was supported by the Medical Research Council (Grant n° 503480), Alzheimer’s Research UK (Grant n° 503176), the Wellcome Trust (Grant n° 082604/2/07/Z) and German Federal Ministry of Education and Research (BMBF): Competence Network Dementia (CND) grant n° 01GI0102, 01GI0711, 01GI0420. CHARGE was partly supported by the NIH/NIA grant R01 AG033193 and the NIA AG081220 and AGES contract N01–AG–12100, the NHLBI grant R01 HL105756, the Icelandic Heart Association, and the Erasmus Medical Center and Erasmus University. ADGC was supported by the NIH/NIA grants: U01 AG032984, U24 AG021886, U01 AG016976, and the Alzheimer’s Association grant ADGC–10–196728.

Grants

HHSN268201000029C

5U41HG009494

10XS170

10XS171

10ST1035

R01DA006227–17

HHSN261200800001E

R01MH101814

U01HG007593

R01MH101822

U01HG007598

R01MH107666

P30DK020595

1R35HG010718

1R01HL133559

Marie-Sklodowska Curie fellowship H2020 Grant 706636

R35GM124836

R01HG10067

R15HG009569

Footnotes

Disclosure

F.A. is an inventor on a patent application related to TensorQTL; S.E.C. is a co-founder, chief technology officer and stock owner at Variant Bio; E.T.D. is chairman and member of the board of Hybridstat LTD.; B.E.E. is on the scientific advisory boards of Celsius Therapeutics and Freenome; G.G. receives research funds from IBM and Pharmacyclics, and is an inventor on patent applications related to MuTect, ABSOLUTE, MutSig, POLYSOLVER and TensorQTL; S.B.M. is on the scientific advisory board of Prime Genomics Inc.; D.G.M. is a co-founder with equity in Goldfinch Bio, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer, and Sanofi-Genzyme; H.K.I. has received speaker honoraria from GSK and AbbVie; T.L. is a scientific advisory board member of Variant Bio with equity and Goldfinch Bio. P.F. is a member of the scientific advisory boards of Fabric Genomics, Inc., and Eagle Genomes, Ltd. P.G.F. is a partner of Bioinf2Bio. E.R.G. receives an honorarium from Circulation Research, the official journal of the American Heart Association, as a member of the Editorial Board, and has performed consulting for the City of Hope / Beckman Research Institute. R.D. has received research support from AstraZeneca and Goldfinch Bio, not related to this work.

Code and data availability

The Genotype-Tissue Expression (GTEx) project’s raw whole transcriptome and genome sequencing data are available via dbGaP accession number phs000424.v8.p1. All processed GTEx data are available via the GTEx portal. Imputed summary results, enloc, coloc, PrediXcan, MultiXcan and dap-g results, prediction models, and reproducible analyses are available at https://github.com/hakyimlab/gtex-gwas-analysis and links therein.

URLs

flashr, https://gaow.github.io/mnm-gtex-v8/analysis/mashr_flashr_workflow.html#flashr-prior-covariances;

mashr, https://github.com/stephenslab/mashr;

Gencode, https://www.gencodegenes.org/releases/26.html;

GTEx GWAS subgroup repository, https://github.com/broadinstitute/gtex-v8;

GTEx portal, http://gtexportal.org;

Hail, https://github.com/hail-is/hail;

MetaXcan, https://github.com/hakyimlab/MetaXcan;

pyliftover, https://pypi.org/project/pyliftover/;

Summary GWAS imputation, https://github.com/hakyimlab/summary-gwas-imputation;

TORUS, https://github.com/xqwen/torus;

DAP, https://github.com/xqwen/dap;

UK Biobank GWAS, http://www.nealelab.is/uk-biobank/;

References

Aguet F, Barbeira AN, Bonazzola R, Brown A, Castel SE, Jo B, … Lappalainen T (2019). The GTEx Consortium atlas of genetic regulatory effects across human tissues. bioRxiv. doi: 10.1101/787903
Aguet F, Brown AA, Castel SE, Davis JR, He Y, Jo B, … Zhu J (2017). Genetic effects on gene expression across human tissues. Nature. doi: 10.1038/nature24277
Albert FW, & Kruglyak L (2015, April). The role of regulatory variation in complex traits and disease. Nature Reviews Genetics, 16(4), 197–212. doi: 10.1038/nrg3891
Amberger JS, Bocchini CA, Scott AF, & Hamosh A (2019). OMIM.org: Leveraging knowledge across phenotype-gene relationships. Nucleic Acids Research. doi: 10.1093/nar/gky1151
Barbeira AN, Bonazzola R, Gamazon ER, Liang Y, Park Y, Subgroup GGW, … Im HK (2019). Widespread dose-dependent effects of RNA expression and splicing on complex diseases and traits. bioRxiv. doi: 10.1101/814350
Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, … Im HK (2018). Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nature Communications. doi: 10.1038/s41467-018-03621-1
Barbeira AN, Pividori M, Zheng J, Wheeler HE, Nicolae DL, & Im HK (2019, January). Integrating predicted transcriptome from multiple tissues improves association detection. PLOS Genetics, 15(1), 1–20. doi: 10.1371/journal.pgen.1007889
Berisa T, & Pickrell JK (2016, January). Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2), 283–285. doi: 10.1093/bioinformatics/btv546
DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire M-D, Williams C, … Getz G (2012, June). RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics, 28(11), 1530–1532.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, … Gingeras TR (2013, January). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21. doi: 10.1093/bioinformatics/bts635
Frankish A, Diekhans M, Ferreira AM, Johnson R, Jungreis I, Loveland J, … Flicek P (2019). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research. doi: 10.1093/nar/gky955
Friedman J, Hastie T, & Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. Retrieved from http://www.jstatsoft.org/v33/i01/
Fryett JJ, Inshaw J, Morris AP, & Cordell HJ (2018, October). Comparison of methods for transcriptome imputation through application to two common complex diseases. European Journal of Human Genetics, 1–10. doi: 10.1038/s41431-018-0176-5
Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, … Im HK (2015). A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics, 47(9), 1091–1098. doi: 10.1038/ng.3367
Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, & Plagnol V (2014, May). Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genetics, 10(5), e1004383. doi: 10.1371/journal.pgen.1004383
Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, … Pasaniuc B (2016, March). Integrative approaches for large-scale transcriptome-wide association studies. Nature Genetics, 48(3), 245–252. doi: 10.1038/ng.3506
Gusev A, Mancuso N, Won H, Kousi M, Finucane HK, Reshef Y, … Price AL (2018, April). Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nature Genetics, 50(4), 538–548. doi: 10.1038/s41588-018-0092-1
Haeussler M, Zweig AS, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, … Kent W (2018, November). The UCSC Genome Browser database: 2019 update. Nucleic Acids Research, 47(D1), D853–D858. doi: 10.1093/nar/gky1095
Hu Y, Li M, Lu Q, Weng H, Wang J, Zekavat SM, … Zhao H (2019, March 1). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature Genetics, 51, 568–576. doi: 10.1038/s41588-019-0345-7
Huckins LM, Dobbyn A, Ruderfer DM, Hoffman G, Wang W, Pardiñas AF, … Sklar P (2019, March). Gene expression imputation across multiple brain regions provides insights into schizophrenia risk. Nature Genetics, 51(4), 1–20. doi: 10.1038/s41588-019-0364-4
International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, … McEwen JE (2010). Integrating common and rare genetic variation in diverse human populations. Nature. doi: 10.1038/nature09298
Lappalainen T, Sammeth M, Friedländer MR, ’t Hoen PAC, Monlong J, Rivas MA, … Dermitzakis ET (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501(7468), 506–511. doi: 10.1038/nature12531
Lee D, Bigdeli TB, Riley BP, Fanous AH, & Bacanu SA (2013). DIST: Direct imputation of summary statistics for unmeasured SNPs. Bioinformatics, 29(22), 2925–2927. doi: 10.1093/bioinformatics/btt500
Li B, & Dewey CN (2011, August). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12(1), 323. doi: 10.1186/1471-2105-12-323
Li YI, Knowles DA, Humphrey J, Barbeira AN, Dickinson SP, Im HK, & Pritchard JK (2018, January). Annotation-free quantification of RNA splicing using LeafCutter. Nature Genetics, 50(1), 151–158. doi: 10.1038/s41588-017-0004-9
Liu DJ, Peloso GM, Yu H, Butterworth AS, Wang X, Mahajan A, … others (2017). Exome-wide association study of plasma lipids in >300,000 individuals. Nature Genetics, 49(12), 1758.
Locke AE, Steinberg KM, Chiang CW, Service SK, Havulinna AS, Stell L, … others (2019). Exome sequencing of Finnish isolates enhances rare-variant association power. Nature, 1.
Mancuso N, Gayther S, Gusev A, Zheng W, Penney KL, Kote-Jarai Z, … Kraft P (2018, December). Large-scale transcriptome-wide association study identifies new prostate cancer risk regions. Nature Communications, 1–11. doi: 10.1038/s41467-018-06302-1
Marouli E, Graff M, Medina-Gomez C, Lo KS, Wood AR, Kjaer TR, … others (2017). Rare and low-frequency coding variants alter human adult height. Nature, 542(7640), 186.
Ongen H, Buil A, Brown AA, Dermitzakis ET, & Delaneau O (2016). Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics. doi: 10.1093/bioinformatics/btv722
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, & Müller M (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.
Robinson MD, & Oshlack A (2010, March). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25. doi: 10.1186/gb-2010-11-3-r25
Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, … Church DM (2017). Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Research. doi: 10.1101/gr.213611.116
So HC, Chau CKL, Chiu WT, Ho KS, Lo CP, Yim SHY, & Sham PC (2017). Analysis of genome-wide association data highlights candidates for drug repositioning in psychiatry. Nature Neuroscience. doi: 10.1038/nn.4618
Stegle O, Parts L, Durbin R, & Winn J (2010). A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Computational Biology. doi: 10.1371/journal.pcbi.1000770
Urbut SM, Wang G, Carbonetto P, & Stephens M (2019, January 1). Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nature Genetics, 51, 187–195. doi: 10.1038/s41588-018-0268-8
Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, … Kundaje A (2019, March). Opportunities and challenges for transcriptome-wide association studies. Nature Genetics, 51(4), 1–10. doi: 10.1038/s41588-019-0385-z
Wang G, Sarkar AK, Carbonetto P, & Stephens M (2018). A simple new approach to variable selection in regression, with application to genetic fine-mapping. bioRxiv. doi: 10.1101/501114
Wen X, Lee Y, Luca F, & Pique-Regi R (2016, June 2). Efficient integrative multi-SNP association analysis via deterministic approximation of posteriors. American Journal of Human Genetics, 98. doi: 10.1016/j.ajhg.2016.03.029
Wen X, Pique-Regi R, & Luca F (2017, March). Integrating molecular QTL data into genome-wide genetic association analysis: Probabilistic assessment of enrichment and colocalization. PLoS Genetics, 13(3), e1006646. doi: 10.1371/journal.pgen.1006646
Wheeler HE, Shah KP, Brenner J, Garcia T, Aquino-Michaels K, Cox NJ, … Im HK (2016). Survey of the Heritability and Sparse Architecture of Gene Expression Traits across Human Tissues. PLoS Genetics, 12(11). doi: 10.1371/journal.pgen.1006423
Wu L, Shi W, Long J, Guo X, Michailidou K, Beesley J, … Zheng W (2018). A transcriptome-wide association study of 229,000 women identifies new candidate susceptibility genes for breast cancer. Nature Genetics. doi: 10.1038/s41588-018-0132-x
Zhang W, Voloudakis G, Rajagopal VM, Readhead B, Dudley JT, Schadt EE, … Roussos P (2019). Integrative transcriptome imputation reveals tissue-specific and shared biological mechanisms mediating susceptibility to complex traits. Nature Communications. doi: 10.1038/s41467-019-11874-7

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

sup FigS2

NIHMS1640792-supplement-sup_FigS2.pdf^{(54KB, pdf)}

sup FigS3

NIHMS1640792-supplement-sup_FigS3.pdf^{(276.9KB, pdf)}

sup FigS1

NIHMS1640792-supplement-sup_FigS1.pdf^{(310.5KB, pdf)}

sup FigS6

NIHMS1640792-supplement-sup_FigS6.pdf^{(67.3KB, pdf)}

sup FigS4

NIHMS1640792-supplement-sup_FigS4.pdf^{(393.7KB, pdf)}

sup FigS5

NIHMS1640792-supplement-sup_FigS5.pdf^{(240.7KB, pdf)}

NIHMS1640792-supplement-1.pdf^{(152.9KB, pdf)}

[R1] Aguet F, Barbeira AN, Bonazzola R, Brown A, Castel SE, Jo B, … Lappalainen T (2019). The GTEx Consortium atlas of genetic regulatory effects across human tissues. bioRxiv. doi: 10.1101/787903 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Aguet F, Brown AA, Castel SE, Davis JR, He Y, Jo B, … Zhu J (2017). Genetic effects on gene expression across human tissues. Nature. doi: 10.1038/nature24277 [DOI] [Google Scholar]

[R3] Albert FW, & Kruglyak L (2015, April). The role of regulatory variation in complex traits and disease. Nature Reviews Genetics, 16(4), 197–212. Retrieved from 10.1038/nrg3891 doi: 10.1038/nrg3891 [DOI] [PubMed] [Google Scholar]

[R4] Amberger JS, Bocchini CA, Scott AF, & Hamosh A (2019). OMIM.org: Leveraging knowledge across phenotype-gene relationships. Nucleic Acids Research. doi: 10.1093/nar/gky1151 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Barbeira AN, Bonazzola R, Gamazon ER, Liang Y, Park Y, Subgroup GGW, … Im HK (2019). Widespread dose-dependent effects of RNA expression and splicing on complex diseases and traits. bioRxiv. doi: 10.1101/814350 [DOI] [Google Scholar]

[R6] Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, … Im HK (2018). Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nature Communications. doi: 10.1038/s41467-018-03621-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Barbeira AN, Pividori M, Zheng J, Wheeler HE, Nicolae DL, & Im HK (2019, 01). Integrating predicted transcriptome from multiple tissues improves association detection. PLOS Genetics, 15(1), 1–20. Retrieved from 10.1371/journal.pgen.1007889 doi: 10.1371/journal.pgen.1007889 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] Berisa T, & Pickrell JK (2016, January). Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2), 283–285. doi: 10.1093/bioinformatics/btv546 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire M-D, Williams C, … Getz G (2012, June). RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics, 28(11), 1530–1532. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, … Gingeras TR (2013, January). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21. Retrieved from 10.1093/bioinformatics/bts635 doi: 10.1093/bioinformatics/bts635 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Frankish A, Diekhans M, Ferreira AM, Johnson R, Jungreis I, Loveland J, … Flicek P (2019). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research. doi: 10.1093/nar/gky955 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Friedman J, Hastie T, & Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. Retrieved from http://www.jstatsoft.org/v33/i01/ [PMC free article] [PubMed] [Google Scholar]

[R13] Fryett JJ, Inshaw J, Morris AP, & Cordell HJ (2018, October). Comparison of methods for transcriptome imputation through application to two common complex diseases. European Journal of Human Genetics, 1–10. Retrieved from 10.1038/s41431-018-0176-5 doi: 10.1038/s41431-018-0176-5 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, … Im HK (2015). A gene-based association method for mapping traits using reference transcriptome data. Nature genetics, 47(9), 1091–1098. Retrieved from 10.1038/ng.3367 doi: 10.1038/ng.3367 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, & Plagnol V (2014, May). Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genetics, 10(5), e1004383. Retrieved from http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=24830394&retmode=ref&cmd=prlinks doi: 10.1371/journal.pgen.1004383 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, … Pasaniuc B (2016, March). Integrative approaches for large-scale transcriptome-wide association studies. Nature Genetics, 48(3), 245–252. Retrieved from http://www.nature.com/doifinder/10.1038/ng.3506 doi: 10.1038/ng.3506 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Gusev A, Mancuso N, Won H, Kousi M, Finucane HK, Reshef Y, … Price AL (2018, April). Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nature Genetics, 50(4), 538–548. Retrieved from http://www.nature.com/articles/s41588-018-0092-1 doi: 10.1038/s41588-018-0092-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Haeussler M, Zweig AS, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, … Kent W (2018, 11). The UCSC Genome Browser database: 2019 update. Nucleic Acids Research, 47(D1), D853–D858. Retrieved from 10.1093/nar/gky1095 doi: 10.1093/nar/gky1095 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Hu Y, Li M, Lu Q, Weng H, Wang J, Zekavat SM, … Zhao H (2019, 03 01). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature Genetics, 51, 568–576. Retrieved from 10.1038/s41588-019-0345-7 doi: 10.1038/s41588-019-0345-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Huckins LM, Dobbyn A, Ruderfer DM, Hoffman G, Wang W, Pardiñas AF, … Sklar P (2019, March). Gene expression imputation across multiple brain regions provides insights into schizophrenia risk. Nature Genetics, 51(4), 1–20. Retrieved from 10.1038/s41588-019-0364-4 doi: 10.1038/s41588-019-0364-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, … McEwen JE (2010). Integrating common and rare genetic variation in diverse human populations. Nature. doi: 10.1038/nature09298 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Lappalainen T, Sammeth M, Friedländer MR, ‘t Hoen P. a. C., Monlong J, Rivas M. a., … Dermitzakis ET (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501(7468), 506–11. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3918453{&}tool=pmcentrez{&}rendertype=abstract doi: 10.1038/nature12531 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Lee D, Bigdeli TB, Riley BP, Fanous AH, & Bacanu SA (2013). DIST: Direct imputation of summary statistics for unmeasured SNPs. Bioinformatics, 29(22), 2925–2927. doi: 10.1093/bioinformatics/btt500

[R24] Li B, & Dewey CN (2011, August). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12(1), 323. doi: 10.1186/1471-2105-12-323

[R25] Li YI, Knowles DA, Humphrey J, Barbeira AN, Dickinson SP, Im HK, & Pritchard JK (2018, January). Annotation-free quantification of RNA splicing using LeafCutter. Nature Genetics, 50(1), 151–158. Retrieved from http://www.nature.com/articles/s41588-017-0004-9 doi: 10.1038/s41588-017-0004-9

[R26] Liu DJ, Peloso GM, Yu H, Butterworth AS, Wang X, Mahajan A, … others (2017). Exome-wide association study of plasma lipids in >300,000 individuals. Nature Genetics, 49(12), 1758.

[R27] Locke AE, Steinberg KM, Chiang CW, Service SK, Havulinna AS, Stell L, … others (2019). Exome sequencing of Finnish isolates enhances rare-variant association power. Nature, 1.

[R28] Mancuso N, Gayther S, Gusev A, Zheng W, Penney KL, Kote-Jarai Z, … Kraft P (2018, December). Large-scale transcriptome-wide association study identifies new prostate cancer risk regions. Nature Communications, 1–11. doi: 10.1038/s41467-018-06302-1

[R29] Marouli E, Graff M, Medina-Gomez C, Lo KS, Wood AR, Kjaer TR, … others (2017). Rare and low-frequency coding variants alter human adult height. Nature, 542(7640), 186.

[R30] Ongen H, Buil A, Brown AA, Dermitzakis ET, & Delaneau O (2016). Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics. doi: 10.1093/bioinformatics/btv722

[R31] Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, & Müller M (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.

[R32] Robinson MD, & Oshlack A (2010, March). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25. doi: 10.1186/gb-2010-11-3-r25

[R33] Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, … Church DM (2017). Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Research. doi: 10.1101/gr.213611.116

[R34] So HC, Chau CKL, Chiu WT, Ho KS, Lo CP, Yim SHY, & Sham PC (2017). Analysis of genome-wide association data highlights candidates for drug repositioning in psychiatry. Nature Neuroscience. doi: 10.1038/nn.4618

[R35] Stegle O, Parts L, Durbin R, & Winn J (2010). A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Computational Biology. doi: 10.1371/journal.pcbi.1000770

[R36] Urbut SM, Wang G, Carbonetto P, & Stephens M (2019, 01 01). Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nature Genetics, 51, 187–195. doi: 10.1038/s41588-018-0268-8

[R37] Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, … Kundaje A (2019, March). Opportunities and challenges for transcriptome-wide association studies. Nature Genetics, 51(4), 1–10. doi: 10.1038/s41588-019-0385-z

[R38] Wang G, Sarkar AK, Carbonetto P, & Stephens M (2018). A simple new approach to variable selection in regression, with application to genetic fine-mapping. bioRxiv. doi: 10.1101/501114

[R39] Wen X, Lee Y, Luca F, & Pique-Regi R (2016, June 2). Efficient integrative multi-SNP association analysis via deterministic approximation of posteriors. American Journal of Human Genetics, 98. doi: 10.1016/j.ajhg.2016.03.029

[R40] Wen X, Pique-Regi R, & Luca F (2017, March). Integrating molecular QTL data into genome-wide genetic association analysis: Probabilistic assessment of enrichment and colocalization. PLoS Genetics, 13(3), e1006646. Retrieved from http://dx.plos.org/10.1371/journal.pgen.1006646 doi: 10.1371/journal.pgen.1006646

[R41] Wheeler HE, Shah KP, Brenner J, Garcia T, Aquino-Michaels K, Cox NJ, … Im HK (2016). Survey of the Heritability and Sparse Architecture of Gene Expression Traits across Human Tissues. PLoS Genetics, 12(11). doi: 10.1371/journal.pgen.1006423

[R42] Wu L, Shi W, Long J, Guo X, Michailidou K, Beesley J, … Zheng W (2018). A transcriptome-wide association study of 229,000 women identifies new candidate susceptibility genes for breast cancer. Nature Genetics. doi: 10.1038/s41588-018-0132-x

[R43] Zhang W, Voloudakis G, Rajagopal VM, Readhead B, Dudley JT, Schadt EE, … Roussos P (2019). Integrative transcriptome imputation reveals tissue-specific and shared biological mechanisms mediating susceptibility to complex traits. Nature Communications. doi: 10.1038/s41467-019-11874-7
