Partitioning gene-based variance of complex traits by gene score regression

Wenmin Zhang; Si Yi Li; Tianyi Liu; Yue Li

doi:10.1371/journal.pone.0237657

PLoS One. 2020 Aug 20;15(8):e0237657. doi: 10.1371/journal.pone.0237657

Editor: F Alex Feltus

PMCID: PMC7446906 PMID: 32817676

Abstract

The majority of genome-wide association study (GWAS) loci are not annotated to known genes in the human genome, which makes biological interpretation difficult. Transcriptome-wide association studies (TWAS) associate complex traits with genotype-based predictions of gene expression derived from expression quantitative trait loci (eQTL) studies, thus improving the interpretability of GWAS findings. However, these results can suffer from a high false positive rate, because the predicted expression of different genes may be highly correlated due to linkage disequilibrium between eQTL. We propose a novel statistical method, Gene Score Regression (GSR), to detect causal gene sets for complex traits while accounting for gene-to-gene correlations. We reason that non-causal genes that are highly correlated with the causal genes will also exhibit a high marginal association with the complex trait. Consequently, by regressing the marginal associations of complex traits on the sum of the gene-to-gene correlations in each gene set, we can assess the amount of variance of the complex traits explained by the predicted expression of the genes in each gene set and identify plausible causal gene sets. GSR can operate on either GWAS summary statistics or observed gene expression. Therefore, it may be widely applied to annotate GWAS results and identify the underlying biological pathways. We demonstrate the high accuracy and computational efficiency of GSR compared to state-of-the-art methods through simulations and real data applications. GSR is openly available at https://github.com/li-lab-mcgill/GSR.

Introduction

Genome-wide association studies (GWAS) have been broadly successful in associating genetic variants with complex traits and estimating trait heritabilities in large populations [1–4]. Over the past decade, GWAS have quantified the effects of individual genetic variants on hundreds of polygenic phenotypes [5, 6]. GWAS summary statistics have enabled various downstream analyses, including partitioning heritability [7], inferring causal single nucleotide polymorphisms (SNPs) using epigenomic annotations [8], and gene set enrichment analysis for complex traits [9]. However, it remains challenging to link these genetic associations with known biological mechanisms. One main reason is that the majority of GWAS loci are not located in known genic regions of the human genome.

Transcriptome-wide association studies (TWAS) [10–12] offer a systematic way to integrate GWAS with reference genotype-gene expression datasets, such as the Genotype-Tissue Expression project (GTEx) [13], via expression quantitative trait loci (eQTL). In TWAS, we first quantify the impact of each genetic variant on expression variability in a population and obtain predicted gene expression levels based on new genotypes; we then correlate the predicted gene expression with the phenotype of interest in order to identify pivotal genes [10]. Moreover, when individual-level genotypes and gene expression levels are not available, we can still quantify gene-to-phenotype associations (i.e. TWAS statistics) using only the marginal effect sizes of SNPs on the phenotype and on gene expression, respectively [11]. These concepts and implementations have greatly facilitated the interpretation of genetic association findings at the gene or pathway level.

However, as depicted in Fig 1, TWAS are often confounded by the gene-to-gene correlation of the genetically predicted gene expression, which arises from SNP-to-SNP correlation, i.e., linkage disequilibrium (LD) [12]. Consequently, relying on the TWAS statistics may lead to false positive discoveries of causal genes and pathways. One approach to address this problem is to fine-map causal genes by inferring the posterior probabilities of configurations of each gene being causal in a defined GWAS locus and then to test gene set enrichment using the credible gene sets of prioritized genes [14]. However, this approach is computationally expensive, restricted to GWAS loci, and sensitive to the arbitrary thresholds used for determining the credible gene set and the maximum number of causal genes per locus.

Another method called PASCAL [9] projects SNP signals onto genes while correcting for LD, and then performs pathway enrichment using the aggregated transformed gene scores, which asymptotically follow a chi-square distribution. However, PASCAL does not leverage the eQTL information for each SNP, thereby assuming a priori that all SNPs have the same effect on the gene. Stratified LD score regression (LDSC) offers a principled way to partition the SNP heritability into functional categories, defined based on tissue- or cell-type-specific epigenomic regions [7] or eQTL regions of the genes exhibiting strong tissue specificity [15]. Although LDSC is able to obtain biologically meaningful tissue-specific enrichments, it operates at the SNP level, making it difficult to assess enrichment of gene sets. Moreover, neither PASCAL nor LDSC is able to integrate the observed gene expression data measured in a disease cohort (rather than the reference cohort), which are broadly available across diverse studies of diseases including cancers, such as The Cancer Genome Atlas (TCGA) [16].

Although expression-based methods, such as gene set enrichment analysis (GSEA), are often adopted in combination with the observed gene expression and phenotypes [17], they generally do not account for the gene-to-gene correlation. While this type of correlation is usually caused by shared transcriptional regulatory mechanisms across genes, GSEA still likely produces false positives in identifying causal pathways.

In this study, we present a novel and powerful gene-based heritability partitioning method that jointly accounts for gene-to-gene correlation and integrates information captured at either the SNP-to-phenotype or the SNP-to-gene level. We utilize this method to identify plausible causal gene sets or pathways for complex traits. We showcase its high accuracy and computational efficiency in various simulated and real scenarios.

Methods

Partitioning gene-based variance of complex traits

We assume gene expression has linearly additive effects on a continuous polygenic trait y:

$$ y_{i} = \sum_{j} A_{i j} α_{j} + ϵ_{i} \tag{1} $$

where A_ij denotes the expression of the j-th gene in the i-th individual for i ∈ {1, …, N} individuals and j ∈ {1, …, G} genes; α_j denotes the true effect size of the j-th gene on the trait and $α_{j} \sim N (0, σ_{j}^{2})$ ; ϵ_i denotes the residual for the i-th individual in this linear model and $ϵ_{i} \sim N (0, σ_{ϵ}^{2})$ .

Here we further assume that both y and A are standardized such that $\frac{1}{N} \sum_{i} y_{i} = 0$ , $\frac{1}{N} y^{⊤} y = 1$ , $\frac{1}{N} \sum_{i} A_{i j} = 0$ and $\frac{1}{N} A_{j}^{⊤} A_{j} = 1$ , for j ∈ {1, …, G}.

We define the estimated marginal effect size of the j-th gene on the trait as ${\hat{α}}_{j}$ :

\begin{align}
{\hat{α}}_{j} &= \frac{1}{N} A_{j}^{⊤} y \tag{2} \\
&= \frac{1}{N} A_{j}^{⊤} \left(\sum_{k} A_{k} α_{k} + ϵ\right) \tag{3} \\
&= \sum_{k} \frac{1}{N} A_{j}^{⊤} A_{k} α_{k} + \frac{1}{N} A_{j}^{⊤} ϵ \tag{4} \\
&= \sum_{k} {\hat{r}}_{j k} α_{k} + ϵ^{'} \tag{5}
\end{align}

where $ϵ^{'} = \frac{1}{N} A_{j}^{⊤} ϵ$ with

$$ \mathrm{Var}(ϵ^{'}) = \frac{1}{N^{2}} A_{j}^{⊤} \mathrm{Var}(ϵ) A_{j} = \frac{1}{N} σ_{ϵ}^{2} $$

and ${\hat{r}}_{j k} = \frac{1}{N} A_{j}^{⊤} A_{k}$ is the estimated Pearson correlation in gene expression between the j-th gene and the k-th gene.
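
The decomposition in Eqs 2–5 can be checked numerically. The following is a minimal sketch on synthetic data; the sizes, the random seed and the helper name `standardize` are illustrative assumptions rather than part of the GSR software.

```python
import numpy as np

def standardize(M):
    """Column-wise zero mean and unit population variance, so (1/N) * col @ col == 1."""
    M = np.asarray(M, dtype=float)
    return (M - M.mean(axis=0)) / M.std(axis=0)

rng = np.random.default_rng(0)
N, G = 2000, 5                                   # hypothetical numbers of individuals and genes

# Mixing independent columns gives correlated "gene expression".
A = standardize(rng.normal(size=(N, G)) @ rng.normal(size=(G, G)))
alpha = rng.normal(scale=0.1, size=G)            # true gene effects
eps = rng.normal(size=N)                         # residuals
y = A @ alpha + eps                              # Eq 1 (y is left unscaled so the identity is exact)

alpha_hat = A.T @ y / N                          # Eq 2: marginal effect sizes
R_hat = A.T @ A / N                              # estimated Pearson correlations r_hat_jk
eps_prime = A.T @ eps / N                        # projected residual term

# Eq 5: each marginal effect mixes the true effects of all correlated genes.
assert np.allclose(alpha_hat, R_hat @ alpha + eps_prime)
```

The check makes the confounding explicit: a gene with zero true effect still receives a non-zero marginal estimate whenever it is correlated with a causal gene.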

We define $χ_{j}^{2} = N {\hat{α}}_{j}^{2}$ . Then, if we further assume α, r and ϵ′ are independent, we have

\begin{align}
E [χ_{j}^{2}] &= E [N {\hat{α}}_{j}^{2}] \tag{6} \\
&= N E \left[{\left(\sum_{k} {\hat{r}}_{j k} α_{k} + ϵ^{'}\right)}^{2}\right] \tag{7} \\
&= N \sum_{k} E [{\hat{r}}_{j k}^{2}] E [α_{k}^{2}] + σ_{ϵ}^{2} \tag{8}
\end{align}

Now, consider C gene sets C_c, where c ∈ {1, …, C}, and denote the proportion of total trait variance explained by the c-th gene set as τ_c with $τ_{c} = \frac{\sum_{j \in C_{c}} \mathrm{Var}(α_{j})}{| C_{c} |}$ . Here, |C_c| denotes the number of genes in the c-th gene set.

Consequently,

$$ E [α_{k}^{2}] = \mathrm{Var}(α_{k}) = \sum_{c : k \in C_{c}} τ_{c} $$

By approximating $E [{\hat{r}}_{j k}^{2}]$ with ${\hat{r}}_{j k}^{2} + \frac{1}{N}$ , we have that

\begin{align}
E [χ_{j}^{2}] &= N \sum_{k} E [{\hat{r}}_{j k}^{2}] E [α_{k}^{2}] + σ_{ϵ}^{2} \tag{9} \\
&= N \sum_{c} τ_{c} \sum_{k \in C_{c}} {\hat{r}}_{j k}^{2} + \sum_{c} τ_{c} + σ_{ϵ}^{2} \tag{10} \\
&= N \sum_{c} τ_{c} l (j, c) + 1 \tag{11}
\end{align}

where we define the gene score as $l (j, c) = \sum_{k \in C_{c}} {\hat{r}}_{j k}^{2}$ and $\mathrm{Var}(y) = \sum_{c} τ_{c} + σ_{ϵ}^{2} = 1$ since the continuous trait is normalized.

Therefore, given estimates of $χ_{j}^{2}$ and the C gene scores l(j, c) for j ∈ {1, …, G} and c ∈ {1, …, C}, we can perform linear regression of the estimated $χ_{j}^{2}$ on the gene scores l(j, c). By Eq 11, the regression coefficient of the c-th gene score estimates Nτ_c, from which we obtain an estimate of each τ_c (c ∈ {1, …, C}).
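
To illustrate this regression, here is a minimal sketch using ordinary least squares on noise-free χ² values constructed directly from Eq 11. The function names `gene_scores` and `fit_gsr`, the toy correlation structure and the two gene sets are hypothetical; this is not the interface of the GSR software.

```python
import numpy as np

def gene_scores(R, gene_sets):
    """Return a G x C matrix with l(j, c) = sum of r_jk^2 over genes k in set c."""
    R2 = np.asarray(R, dtype=float) ** 2
    return np.column_stack([R2[:, np.asarray(idx)].sum(axis=1) for idx in gene_sets])

def fit_gsr(chi2, scores, n):
    """Least-squares fit of chi2 on the gene scores with a free intercept.

    By Eq 11 each slope estimates N * tau_c, so slopes are divided by n.
    """
    design = np.column_stack([np.ones(len(chi2)), scores])
    coef, *_ = np.linalg.lstsq(design, np.asarray(chi2, dtype=float), rcond=None)
    return coef[0], coef[1:] / n

rng = np.random.default_rng(1)
n, G = 1000, 200
# A shared latent factor with gene-specific loadings induces gene-to-gene correlation.
loadings = rng.uniform(0.0, 1.0, size=G)
expr = rng.normal(size=(300, G)) + rng.normal(size=(300, 1)) * loadings
R = np.corrcoef(expr, rowvar=False)

gene_sets = [np.arange(0, 50), np.arange(50, 200)]   # two hypothetical disjoint gene sets
tau_true = np.array([2e-3, 1e-4])

scores = gene_scores(R, gene_sets)
chi2 = 1.0 + n * scores @ tau_true                   # expected chi-square values under Eq 11
intercept, tau_hat = fit_gsr(chi2, scores, n)        # recovers tau_true and an intercept of 1
```

With real data the χ² values are noisy, so the fitted coefficients would scatter around these values rather than match them exactly.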

These are available from GWAS summary statistics of SNP-to-trait effect sizes, eQTL summary statistics of SNP-to-gene expression effect sizes, and a reference LD panel. Specifically,

Suppose we have estimated effect sizes $β_{p × 1}$ of p SNPs based on a GWAS including N_gwas samples, i.e.
$$ β = \frac{1}{N_{gwas}} X^{⊤} y $$
where $X_{N_{gwas} × p}$ is the standardized genotype matrix. Meanwhile, we have the eQTL summary statistics W estimated using
$$ A_{eQTL} = X_{eQTL} W $$

Therefore, the predicted gene expression in the GWAS cohort is given by
$$ A = X W $$

Since
\begin{align}
χ_{j}^{2} &= N {\hat{α}}_{j}^{2} \tag{12} \\
&= N {\left(\frac{1}{N} A_{j}^{⊤} y\right)}^{2} \tag{13} \\
&= N {\left(\frac{1}{N} W_{j}^{⊤} X^{⊤} y\right)}^{2} \tag{14} \\
&= N {(W_{j}^{⊤} β)}^{2} \tag{15}
\end{align}
the required $χ_{j}^{2}$ can be estimated without accessing any individual-level data.
Furthermore, a reference LD panel $Σ_{p × p}$ summarizing SNP-to-SNP correlation in a population matched to the GWAS cohort can provide estimates for r_jk as
\begin{align}
R &= [r_{j k}] \tag{16} \\
&= \frac{1}{N} A^{⊤} A \tag{17} \\
&= \frac{1}{N} W^{⊤} X^{⊤} X W \tag{18} \\
&= W^{⊤} Σ W \tag{19}
\end{align}
where the last step uses $Σ ≈ \frac{1}{N} X^{⊤} X$ for standardized genotypes.

It is noteworthy that with individual-level gene expression data, we can also easily obtain the required $χ_{j}^{2}$ and R = [r_jk] by definition.
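
As a sanity check on the summary-statistics route, the sketch below compares it with the individual-level computation on synthetic data. It assumes that Σ is the SNP correlation matrix, i.e. (1/N) XᵀX for standardized genotypes, and uses the in-sample LD as a stand-in for a reference panel; all sizes and the dense random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, G = 500, 30, 4                        # hypothetical individuals, SNPs and genes

X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardized genotypes
W = rng.normal(size=(p, G))                 # hypothetical eQTL weights, one column per gene
y = rng.normal(size=n)
y = (y - y.mean()) / y.std()                # standardized trait

# Summary-level inputs only.
beta = X.T @ y / n                          # GWAS marginal effect sizes
Sigma = X.T @ X / n                         # LD matrix (assumed to be SNP correlations)
chi2_summary = n * (W.T @ beta) ** 2
R_summary = W.T @ Sigma @ W

# Individual-level route for comparison.
A = X @ W                                   # predicted expression
chi2_individual = n * (A.T @ y / n) ** 2
R_individual = A.T @ A / n

assert np.allclose(chi2_summary, chi2_individual)
assert np.allclose(R_summary, R_individual)

# Predicted expression is not unit-variance here, so rescale to a correlation matrix.
d = np.sqrt(np.diag(R_summary))
R_corr = R_summary / np.outer(d, d)
```

In this sketch the two routes agree exactly because the LD matrix is computed in-sample; with an external reference panel the agreement would only be approximate.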

In practice, many gene sets are not disjoint and share genes with each other. Therefore, we regress one gene set at a time along with a “dummy” gene set that includes the union of all of the other genes. The dummy gene set is used to account for unbalanced gene sets and to stabilize estimates of τ_c. We also include an intercept in the regression model to alleviate non-gene-set biases: for example, a positive correlation between gene scores and true gene effect sizes could lead to an intercept greater than 1, whereas a negative correlation could lead to an intercept smaller than 1.
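
The one-set-at-a-time design can be sketched as follows. This is an illustrative least-squares version in which the dummy set is taken to be every gene outside the target set; the function name `gsr_one_set_at_a_time` and the toy data are assumptions, and significance testing of the coefficients is not covered here.

```python
import numpy as np

def gsr_one_set_at_a_time(chi2, R, gene_sets, n):
    """For each gene set, regress chi2 on [intercept, target score, dummy score].

    Returns one (intercept, tau_target, tau_dummy) tuple per gene set.
    """
    R2 = np.asarray(R, dtype=float) ** 2
    chi2 = np.asarray(chi2, dtype=float)
    G = R2.shape[0]
    results = []
    for idx in gene_sets:
        in_target = np.zeros(G, dtype=bool)
        in_target[np.asarray(idx)] = True
        l_target = R2[:, in_target].sum(axis=1)      # gene score for the tested set
        l_dummy = R2[:, ~in_target].sum(axis=1)      # gene score for all remaining genes
        design = np.column_stack([np.ones(G), l_target, l_dummy])
        coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
        results.append((coef[0], coef[1] / n, coef[2] / n))
    return results

rng = np.random.default_rng(3)
G, n = 120, 1000
loadings = rng.uniform(0.0, 1.0, size=G)
expr = rng.normal(size=(250, G)) + rng.normal(size=(250, 1)) * loadings
R = np.corrcoef(expr, rowvar=False)
R2 = R ** 2

gene_sets = [np.arange(0, 30), np.arange(20, 60)]    # hypothetical overlapping gene sets
# Noise-free chi-square values in which only the first set carries a large per-gene variance.
chi2 = 1.0 + n * (3e-3 * R2[:, :30].sum(axis=1) + 1e-4 * R2[:, 30:].sum(axis=1))
fits = gsr_one_set_at_a_time(chi2, R, gene_sets, n)
```

Because each fit has only three parameters regardless of how many gene sets exist, overlapping sets never enter the same design matrix.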

Simulation design

To assess the accuracy of our GSR approach, we simulated causal SNPs for gene expression as well as causal gene sets for a continuous trait based on real genotypes and known gene sets from existing databases. Our simulation included two stages. At stage 1, we first simulated gene expression based on a reference genotype panel. We then estimated SNP-gene effects ${\hat{W}}_{g}$ for each gene g based on the simulated gene expression and genotype, which were then used to predict gene expression. At stage 2, separately, we simulated a continuous trait using simulated gene expression based on genotype, and estimated the marginal SNP-phenotype effects.

Simulation step 1: simulating gene expression:

To simulate individual genotypes, we first partitioned genotype data for 489 individuals of European ancestry obtained from the 1000 Genomes Project [18] into 1,703 independent LD blocks as defined by LDetect [19];
We then randomly sampled 100 LD blocks and used only those 100 LD blocks for the subsequent simulation. We used 100 LD blocks rather than the whole genome to reduce the computational burden of multiple simulation runs;
For LD block j (j ∈ {1, …, 100}) of an individual i (i ∈ {1, …, 500}), we randomly sampled one of the 489 available samples for block j, and concatenated the sampled LD blocks 1, …, 100 for this individual. We repeated this procedure to simulate genotypes X_ref for N_ref = 500 individuals as a reference population;
We standardized the simulated genotypes X_ref;
We randomly sampled k in-cis causal SNPs per gene within ± 500 kb around the gene, where k = 1 (default). We also experimented with different numbers of causal SNPs k ∈ {2, 3, all in-cis SNPs};
We sampled SNP-gene weights $W_{g} \sim N (0, h_{g}^{2} / k)$ where the gene expression heritability $h_{g}^{2} = 0.1$ (default) is the variance of gene expression explained by genotype. We also experimented with different gene heritabilities $h_{g}^{2} \in \{0.2, 0.3, 0.4, 0.5\}$ ;
We then simulated gene expression $A_{g, ref} = X_{ref} W_{g} + ϵ$ , where $ϵ \sim N (0, σ_{ϵ}^{2})$ and $σ_{ϵ}^{2} = \frac{1}{N_{ref}} ∥ X_{ref} W_{g} ∥^{2} (\frac{1}{h_{g}^{2}} - 1) I_{N_{ref}}$ to match the desired heritability: $\frac{1 - h_{g}^{2}}{h_{g}^{2}} = \frac{σ_{ϵ}^{2}}{∥ X_{ref} W_{g} ∥^{2} / N_{ref}}$ ;
Finally, we applied LASSO regression $A_{g, ref} \sim \bar{X} W_{g}$ to get ${\hat{W}}_{g}$ for each gene.
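
The genotype mosaic and the heritability-matched noise in the steps above can be sketched as follows. This is a toy illustration under stated assumptions: the LD blocks are random binomial draws rather than 1000 Genomes data, causal SNPs are drawn from all columns instead of a cis window, and the final LASSO fit is omitted. The function names are hypothetical.

```python
import numpy as np

def mosaic_genotypes(blocks, n_out, rng):
    """For every LD block, copy the block of a randomly drawn reference sample per individual."""
    n_ref = blocks[0].shape[0]
    return np.hstack([B[rng.integers(0, n_ref, size=n_out)] for B in blocks])

def standardize(M):
    """Column-wise zero mean and unit population variance."""
    return (M - M.mean(axis=0)) / M.std(axis=0)

def simulate_expression(X, k, h2, rng):
    """Simulate one gene with k causal SNPs and expression heritability h2."""
    n, p = X.shape
    w = np.zeros(p)
    causal = rng.choice(p, size=k, replace=False)
    w[causal] = rng.normal(scale=np.sqrt(h2 / k), size=k)     # W_g ~ N(0, h2 / k)
    genetic = X @ w
    # Noise variance chosen so that genetic / (genetic + noise) variance equals h2.
    noise_var = (genetic @ genetic / n) * (1.0 / h2 - 1.0)
    expr = genetic + rng.normal(scale=np.sqrt(noise_var), size=n)
    return expr, w, noise_var

rng = np.random.default_rng(4)
# Toy stand-in for real LD blocks: 3 blocks, 489 reference samples, 10 SNPs each.
blocks = [rng.binomial(2, 0.3, size=(489, 10)).astype(float) for _ in range(3)]
X_ref = standardize(mosaic_genotypes(blocks, n_out=500, rng=rng))
expr, w, noise_var = simulate_expression(X_ref, k=1, h2=0.1, rng=rng)
```

Sampling whole blocks from a single donor preserves the LD pattern within each block while breaking correlation between blocks, which is what treating the blocks as independent relies on.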

Simulation step 2: simulating phenotype:

We simulated another N_gwas = 50,000 GWAS individuals from the 100 predefined LD blocks among the 489 Europeans in the 1000 Genomes data, following the same procedures as described above;
We then standardized the simulated genotypes X_gwas;
We then sampled a causal pathway $C_{c}$ from MSigDB such that all of the $G_{c} \equiv | C_{c} |$ genes in $C_{c}$ were causal genes for the phenotype;
For each non-causal pathway, we removed genes that were also present in the causal pathway. We then removed non-causal pathways containing fewer than five genes (default). Alternatively, in more realistic scenarios, we allowed non-causal pathways to share genes with the causal pathway;
We sampled gene-phenotype effects $α \sim N (0, σ_{α}^{2} / G_{c} I_{G_{c}})$ , where the phenotypic variance explained by gene expression $σ_{α}^{2} = 0.1$ (default). We also experimented with different $σ_{α}^{2} \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$ ;
We simulated gene expression A_c as in step 1 for the N_gwas individuals, and standardized it to obtain ${\bar{A}}_{c}$ ;
We simulated a continuous trait using causal gene expression: $y = {\bar{A}}_{c} α + ϵ_{y}$ where $ϵ_{y} \sim N (0, σ_{ϵ_{y}}^{2})$ . Here, $σ_{ϵ_{y}}^{2} = \frac{1}{N_{gwas}} ∥ {\bar{A}}_{c} α ∥^{2} (\frac{1}{σ_{α}^{2}} - 1) I_{N_{gwas}}$ to match the predefined proportion of variance explained: $\frac{1 - σ_{α}^{2}}{σ_{α}^{2}} = \frac{σ_{ϵ_{y}}^{2}}{∥ {\bar{A}}_{c} α ∥^{2} / N_{gwas}}$ ;
Lastly, we computed the GWAS summary SNP-to-trait effect sizes: $β = \frac{1}{N_{gwas}} X_{gwas}^{⊤} y$
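
The trait simulation and the marginal GWAS effect sizes in the steps above can be sketched in the same way. The toy genotypes and expression below are hypothetical, and standardizing y before computing β is an assumption taken from the standardization stated in the Methods.

```python
import numpy as np

def simulate_trait(A_causal, var_explained, rng):
    """Simulate a trait from causal gene expression with a target proportion of variance explained."""
    n, g_c = A_causal.shape
    A_bar = (A_causal - A_causal.mean(axis=0)) / A_causal.std(axis=0)
    alpha = rng.normal(scale=np.sqrt(var_explained / g_c), size=g_c)
    signal = A_bar @ alpha
    # Noise variance chosen so that signal / (signal + noise) variance equals var_explained.
    noise_var = (signal @ signal / n) * (1.0 / var_explained - 1.0)
    y = signal + rng.normal(scale=np.sqrt(noise_var), size=n)
    return y, alpha, noise_var

def gwas_marginal_effects(X, y):
    """beta = X^T y / N for standardized genotypes and a standardized trait."""
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    return Xs.T @ ys / n

rng = np.random.default_rng(5)
X = rng.binomial(2, 0.3, size=(1000, 20)).astype(float)      # toy genotypes
# Toy expression for 3 causal genes driven by the first 6 SNPs plus noise.
A_causal = X[:, :6] @ rng.normal(size=(6, 3)) + rng.normal(size=(1000, 3))
y, alpha, noise_var = simulate_trait(A_causal, var_explained=0.1, rng=rng)
beta = gwas_marginal_effects(X, y)
```

With both inputs standardized, each entry of `beta` equals the Pearson correlation between one SNP and the trait.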

We repeated these simulation procedures 100 times. Unless otherwise stated, when varying one setting we kept the other settings at their default values: k = 1 causal SNP per gene; gene expression variance explained per causal SNP $h_{g}^{2} = 0.1 / k$ ; phenotypic variance explained per gene $σ_{α}^{2} = 0.1$ ; one causal pathway. Using the resulting summary statistics, we ran GSR, PASCAL, LDSC and FOCUS in each simulated scenario.

Applying existing methods

PASCAL: PASCAL was downloaded from https://www2.unil.ch/cbg/index.php?title=Pascal [9]. We executed the software using default settings.

LDSC: Stratified LD score regression software was downloaded from https://github.com/bulik/ldsc [15]. Because LDSC operates at the SNP level, we considered SNPs located within ± 500 kb around genes in each pathway to be involved in the corresponding pathway. Then, for each pathway, we computed the LD scores over all chromosomes. We ran LDSC with and without the 53 baseline annotations using our simulated data, and found that LDSC without the 53 baseline annotations worked better in our case. One possible reason is that the baseline annotations cover genome-wide SNPs whereas there are far fewer SNPs in the simulated pathways.

FOCUS: We downloaded FOCUS [14] from https://github.com/bogdanlab/focus. We used FOCUS to infer the posterior probability of each gene being causal for the phenotype across all of the LD blocks. We then took the 90% credible gene set as follows. We first summed the posteriors over all of the genes. We then sorted the genes in decreasing order of their FOCUS posteriors. We kept adding the top-ranked genes to the 90% credible gene set until the sum of their posteriors was greater than or equal to 90% of the total sum of posteriors. We used the 90% credible gene set in a hypergeometric test for each pathway to compute the p-values. We also tried other thresholds for credible sets ranging from 75% (including the fewest genes) to 99% (including the most genes).

GSEA: GSEA software was obtained from http://software.broadinstitute.org/gsea [17]. We used the command-line version of GSEA to test for gene set enrichments using the observed gene expression and phenotype data.
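
The credible-set construction and the subsequent hypergeometric test described for FOCUS can be sketched with the standard library alone. The gene names and posterior values below are made up for illustration, and the one-sided (over-representation) tail is an assumption, since the text does not state which tail was used.

```python
from math import comb

def credible_set(posteriors, level=0.90):
    """Add genes in decreasing posterior order until their sum reaches level * total."""
    total = sum(posteriors.values())
    chosen, running = [], 0.0
    for gene, post in sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(gene)
        running += post
        if running >= level * total:
            break
    return chosen

def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without replacement."""
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_selected)

# Hypothetical posteriors for four genes in a universe of ten genes.
posteriors = {"g1": 0.60, "g2": 0.25, "g3": 0.10, "g4": 0.05}
selected = credible_set(posteriors, level=0.90)          # g1, g2 and g3 are needed to reach 90%
pathway = {"g1", "g2", "g3", "g9"}                      # hypothetical pathway of four genes
overlap = len(pathway.intersection(selected))
p_value = hypergeom_pvalue(n_universe=10, n_pathway=4, n_selected=len(selected), n_overlap=overlap)
```

Lowering `level` shrinks the credible set and raising it enlarges the set, which is why the enrichment result depends on this user-chosen threshold.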

Real data application

We applied our approach to investigate pathway enrichment for 27 complex traits (Fig 2b) using publicly available summary statistics and genotype-expression weights based on 1,264 GTEx whole blood samples. The GWAS summary statistics were downloaded from the public database https://data.broadinstitute.org/alkesgroup/sumstats_formatted/ [7]. We downloaded expression weights and reference LD structure estimated in 1000 Genomes using 489 European individuals from the TWAS/FUSION website (http://gusevlab.org/projects/fusion/) [11, 18]. Franke lab cell-type-specific gene expression datasets were obtained from https://data.broadinstitute.org/mpg/depict/depict_download/tissue_expression.

In addition, we applied GSR to test for gene set enrichment in three well-powered types of cancer: breast invasive carcinoma (BRCA, 982 cases and 199 controls), thyroid carcinoma (THCA, 441 cases and 371 controls) and prostate adenocarcinoma (PRAD, 426 cases and 154 controls), using gene expression datasets from The Cancer Genome Atlas (TCGA). Uniformly processed (normalized and batch-effect corrected) gene expression datasets from TCGA and GTEx were obtained from https://figshare.com/articles/Data_record_3/5330593 [20]. Gene expression and phenotype were standardized before being supplied to the GSR software. Standard GSEA was also performed for comparison.

Gene sets were downloaded from the MSigDB website http://software.broadinstitute.org/gsea/msigdb/index.jsp. We combined BIOCARTA, KEGG and REACTOME to create a total of 1,050 gene sets. We also downloaded the 4,436 GO biological process terms as additional gene sets as well as the 189 gene sets pertaining to oncogenic signatures for the TCGA data analysis.

Results

Gene scores were correlated with TWAS statistics in polygenic complex traits

Our method GSR is built on the hypothesis that the marginal gene effect sizes on the phenotype should be positively correlated with the sum of correlations with other genes, which include causal genes. To validate this hypothesis, we defined the gene score for each gene as the sum of its squared Pearson correlations with all of the other genes, derived from gene expression levels. We calculated TWAS marginal statistics as the product of GWAS summary statistics (β) and eQTL weights (W) derived from the GTEx whole blood samples (Eq 15). To assess the impact of gene-to-gene correlation on TWAS statistics, we correlated the gene scores with the TWAS marginal statistics for 27 complex traits. Overall, most traits had a Pearson correlation between the gene score and the marginal TWAS statistic above 0.4. For instance, the correlation in schizophrenia was 0.76 (inter-quartile range: 0.66–0.81 based on 1,000 permutations; Fig 2). This implies a pervasive confounding impact on downstream analyses that use the TWAS summary statistics while assuming independence of genes, including gene set or pathway enrichment analysis and causal gene identification (Fig 1).
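
The genome-wide gene score used for this check can be written compactly. The sketch below assumes an individuals-by-genes expression matrix and uses random toy data; the function names and the placeholder TWAS statistics are hypothetical.

```python
import numpy as np

def total_gene_score(expr):
    """Sum of squared Pearson correlations of each gene with all other genes."""
    R = np.corrcoef(expr, rowvar=False)
    return (R ** 2).sum(axis=1) - 1.0        # subtract the self-correlation on the diagonal

def score_stat_correlation(scores, stats):
    """Pearson correlation between gene scores and per-gene association statistics."""
    return np.corrcoef(scores, stats)[0, 1]

rng = np.random.default_rng(6)
expr = rng.normal(size=(50, 4))              # toy expression: 50 individuals, 4 genes
scores = total_gene_score(expr)
twas_stats = rng.chisquare(df=1, size=4)     # placeholder statistics, not real TWAS output
r = score_stat_correlation(scores, twas_stats)
```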

GSR improved pathway enrichment power

In simulated scenarios with default settings (Methods), compared to PASCAL and LDSC, GSR demonstrated substantially improved computational efficiency (Table 1), superior sensitivity in detecting causal pathways with improved statistical power, as well as competitive specificity in controlling for false positives (Fig 3). Specifically, in 100 simulations, GSR achieved an overall area under the precision-recall curve (AUPRC) of 0.925, and identified the true causal pathway as the most significant one 93 times, compared to 56 times by PASCAL, which only achieved an overall AUPRC of 0.260. Notably, the FOCUS-predicted 75%, 90%, and 99% credible gene sets were also significantly enriched for causal pathways (Fig 3).
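
For readers unfamiliar with the metric, one common estimator of the area under the precision-recall curve is average precision, sketched below on made-up p-values. Whether the reported AUPRC values were computed with this estimator is not stated in the text, so this is only an illustration of the idea.

```python
import numpy as np

def average_precision(pvalues, is_causal):
    """Rank pathways by increasing p-value and average the precision at each causal pathway."""
    p = np.asarray(pvalues, dtype=float)
    labels = np.asarray(is_causal, dtype=bool)
    order = np.argsort(p, kind="stable")             # most significant first
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision[hits].sum() / hits.sum()

# Hypothetical p-values for four pathways, two of which are causal.
pvalues = [1e-6, 0.2, 0.01, 0.5]
is_causal = [True, False, False, True]
ap = average_precision(pvalues, is_causal)           # (1/1 + 2/4) / 2 = 0.75
```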

Table 1. Comparison of existing methods with GSR.

Method	GWAS	TWAS	Measured expression	Running time
PASCAL [9]	sum. stat.*			10 min
LDSC [15]	sum. stat.			>24 h†
FOCUS [14]	sum. stat.	sum. stat.		>24 h
GSEA [17]			individual expression	10 min
GSR	sum. stat.		individual expression	3 min

* Summary statistics.

† For custom gene sets, the main computation time for LDSC is spent calculating the LD scores for all of the 1000 Genomes SNPs.

Fig 3. (a) Precision-recall curves for GSR and PASCAL summarizing results from 100 simulations. (b) Summary of p-values obtained by running GSR along with PASCAL, LDSC and FOCUS 10 times. For each method, the enrichment significance for causal pathways and non-causal pathways is displayed. We ran FOCUS with 75%, 90%, and 99% credible sets for the pathway enrichments. For ease of comparison, we plotted the y-axis on a square-root negative logarithmic scale. The red line denotes a p-value threshold of 0.001; the blue line denotes a p-value threshold of 0.1.

We then varied four different settings: (1) the number of causal SNPs per gene; (2) SNP-gene heritabilities; (3) gene-phenotype variance explained; (4) overlapping causal pathways. We focused our comparison on PASCAL because it directly tests for pathway enrichment and has been demonstrated to outperform other relevant enrichment methods [9]. In all simulation settings, GSR demonstrated improved power in detecting the causal pathways (S2 Fig in S1 File), as it was able to detect causal pathways when multiple SNPs influenced gene expression, when the proportion of variance explained by the gene expression was low, or when the causal and non-causal pathways were allowed to overlap. In contrast, many causal pathways were not deemed significant by PASCAL based on a p-value threshold of 0.001, which was equivalent to a Bonferroni-corrected p-value threshold of 0.1 after correcting for multiple testing on approximately 100 pathways tested per simulation.
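
The correspondence between the two thresholds follows from the Bonferroni rule. As a worked check, taking m ≈ 100 pathways per simulation as stated above:

$$ α_{\text{per-test}} = \frac{α_{\text{family}}}{m} = \frac{0.1}{100} = 0.001 $$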

Improved power in pathway enrichment leveraging observed gene expression

One unique feature of GSR is the ability to run not only on summary statistics but also on observed gene expression, where the gene-gene expression correlation is directly estimated from the in-sample gene expression. To evaluate the accuracy of this application, we simulated gene expression and phenotype for 1,000 individuals, which were provided as input to GSR for pathway enrichment analysis. As a comparison, we applied GSR to the summary statistics generated from the same dataset.

As in the simulation above, the SNP-expression weights were estimated from a separate set of 500 reference individuals whereas the SNP-phenotype associations were estimated from only 1,000 individuals. Notably, the sample size for the GWAS cohort is much smaller than in the previous application, to mimic real data where usually fewer than 1,000 individuals have both RNA-seq and phenotype data available (e.g., TCGA). Additionally, we applied standard GSEA [17] to the same dataset with the observed gene expression. We observed improved power of GSR when using the observed gene expression over GSR using the summary statistics (Fig 4), whereas GSEA had a performance comparable to the latter. Specifically, all causal pathways in the simulated replicates had a p-value below 0.001, with the largest p-value being 7.5 × 10⁻⁶, as determined by GSR using observed gene expression, while no causal pathway reached this level of significance (with the smallest p-value being 1.4 × 10⁻²) as determined by GSEA. We also compared the performance of GSR using observed gene expression to that of GSEA in various simulation settings and obtained consistent conclusions (S3 Fig in S1 File).

Fig 4. Nominal (NOM) p-values yielded by GSEA are summarized. The red line denotes a p-value threshold of 0.001; the blue line denotes a p-value threshold of 0.1.

Gene set enrichments in complex traits

Applying GSR to 27 complex traits, we identified various pathways where the enriched gene sets were biologically meaningful. For example, the enriched gene sets for high density lipoprotein (HDL) predominantly involve lipid metabolism; in contrast, for lupus, gene sets were enriched in interferon signalling pathways, a known immunological hallmark. We list the top 10 enrichments over gene sets from MSigDB and Gene Ontology terms for HDL and the autoimmune trait lupus in S1 Table in S1 File.

Additionally, we applied GSR to test cell-type-specific enrichments using 205 cell types, 48 of which were derived from GTEx and 157 from Franke lab datasets [15]. We observed biologically meaningful cell-type-specific enrichment for the 27 complex traits (Fig 5). In particular, central nervous system cell-specific gene sets were enriched for schizophrenia, immune cell-specific gene sets for lupus, immune cell-specific and digestive tract cell-specific gene sets for Crohn’s disease, and cardiac cell-specific gene sets for coronary artery disease. Lastly, we correlated traits based on their gene set enrichments and observed meaningful phenotypic clusters, suggesting biological mechanisms shared by the related phenotypes (S4 Fig in S1 File). For example, Crohn’s disease and ulcerative colitis, two subtypes of inflammatory bowel disease, formed a cluster; the neurological diseases schizophrenia and bipolar disorder formed a cluster; and lipid traits including LDL, HDL, and triglycerides formed their own cluster.

Fig 5. GSR was applied to each complex trait in order to identify significantly enriched gene sets among 205 pre-defined cell-type-specific gene sets, represented by nine different colors. Gene sets are indicated by dots and are aligned in the same order on the x-axis. Red lines indicate the Bonferroni-corrected p-value threshold (0.05).

Application on observed gene expression

Lastly, using expression profiles of BRCA, THCA and PRAD from TCGA and GTEx [20], we tested the enrichments of 186 oncogenic gene sets as well as 1,050 gene sets from BIOCARTA, KEGG, and REACTOME in each type of tumor. Overall, we observed significantly stronger enrichment (i.e., smaller p-values) for the oncogenic signatures compared to the more general gene sets across all three tumor types (t-test p-value = 6.4 × 10⁻²⁵, 9.0 × 10⁻²⁹ and 1.1 × 10⁻²³ for BRCA, PRAD and THCA respectively; S5 Fig in S1 File). As a comparison, we also ran standard GSEA and observed qualitatively similar enrichments (S5 Fig in S1 File).

Discussion

In this work, we describe GSR, an efficient method to test for gene set or pathway enrichments using either GWAS summary statistics or observed gene expression and phenotype information. We demonstrate robust and powerful detection of causal pathways in extensive simulations using our proposed method compared to several state-of-the-art methods. When applying GSR to real data, we also obtained biologically meaningful enrichments of relevant gene sets and pathways. These features make GSR a widely applicable method in various study settings that aim to interpret association test results and capture the underlying biological mechanisms.

Our approach has superior computational efficiency. In particular, GSR took only 3-5 minutes running on the full summary statistics and less than 5 minutes on the full gene expression data with one million SNPs and 20,000 genes to test for enrichments of over 4,000 gene sets. In our simulations, it is not surprising that FOCUS can accurately fine-map causal genes as the simulation designs followed assumptions similar to those adopted by FOCUS [14]. However, FOCUS is at least 20 times slower than GSR. For the simulated data, FOCUS took 30 minutes to fine-map all of the genes in GWAS loci whereas GSR took under three minutes to test for pathway enrichments on the same machine. Additionally, the computational cost of FOCUS is exponential in the number of causal genes considered within each locus whereas GSR is not affected by the number of causal genes. Also, because GSR operates at the genome-wide level, no threshold is needed to decide which genes to include, whereas FOCUS needs a user-defined threshold for constructing the credible gene set for the subsequent hypergeometric enrichment test. Given these advantages, we envision that GSR will be a valuable tool for the bioinformatics and statistical genetics communities as a fast way to investigate the functional implications of complex polygenic traits.

In different simulation settings, GSR exhibits improved pathway enrichment power over PASCAL and LDSC, two popular methods for partitioning heritability and identifying causal gene sets. Since GSR leverages SNP-to-gene associations summarized by eQTL weights, whereas both PASCAL and LDSC operate at the SNP level without considering this intermediate association, such improvements are expected and beneficial. Given that existing eQTL studies have yielded reliable estimates of SNP-to-gene effects and are easily accessible, we consider GSR more promising in bridging the gap between large GWAS and multi-faceted functional annotations of the genome.

One unique feature of our approach is that it can leverage observed individual-level gene expression data, which are broadly available, to calculate more accurate in-sample gene-gene correlations. Indeed, we observed more accurate detection of causal pathways for a modest sample size (1,000 individuals) where the phenotype and gene expression are available compared to GSR operating only on summary statistics. In real data analysis, we demonstrate that GSR can achieve biologically meaningful enrichments similar to those of GSEA when applied to the observed gene expression. On the other hand, GSR has the advantage of working with summary statistics when the individual gene expression and phenotype are not available, in which case GSEA could hardly be performed.

It is noteworthy that p values generated by different methods in this study are not directly comparable due to different model assumptions, statistical tests being used, sampling methods, etc. However, we posit that the p-values themselves are informative in reality. When gene set enrichment analysis is performed in related studies, p-values are usually directly adopted to identify specific signals as a common practice. Therefore, GSR may be promising to refine interpretation and reveal under-identified biological mechanisms in existing studies, as it is able to yield smaller p-values for the true underlying pathways.

Our method has important limitations. First, it relies on pre-computed eQTL weights, which might absorb measurement uncertainty, confounding effects and stochastic errors. In addition, it is usually unknown how these weights vary across populations, i.e. whether the effect of each SNP on the corresponding gene expression is conserved, particularly when the investigation is carried out on a diseased population while using a non-diseased reference population. Furthermore, our method is built on the important assumption that the effect sizes of genes on the trait and the derived gene scores are independent. In practice, if this assumption is violated, our method might suffer from the resulting bias. To our knowledge, no method exists to examine the validity of these assumptions; however, since we obtained consistent results in our real data analyses, we posit that our method should be robust in identifying causal pathways. We propose that our method could be widely used in various studies, where further calibration of the effect size estimates should continue to improve its performance.

Supporting information

S1 File

(PDF)


Acknowledgments

We thank Mathieu Blanchette for the helpful comments on the manuscript.

Data Availability

All of the data used in this paper are described under subsection "Real data application" in the manuscript and pasted below as reference. We applied our approach to investigate pathway enrichment for 27 complex traits (Fig 2b) using publicly available summary statistics and genotype-expression weights based on 1,264 GTEx whole blood samples (https://www.gtexportal.org/home/datasets2). The GWAS summary statistics were downloaded from the public database https://data.broadinstitute.org/alkesgroup/sumstats_formatted/. We downloaded expression weights and the reference LD structure, estimated in 1000 Genomes using 489 European individuals, from the TWAS/FUSION website (http://gusevlab.org/projects/fusion/). The Franke lab cell-type-specific gene expression datasets were obtained from https://data.broadinstitute.org/mpg/depict/depict_download/tissue_expression. In addition, we applied GSR to test for gene set enrichment in three well-powered types of cancer: breast invasive carcinoma (BRCA, 982 cases and 199 controls), thyroid carcinoma (THCA, 441 cases and 371 controls) and prostate adenocarcinoma (PRAD, 426 cases and 154 controls), using gene expression datasets from The Cancer Genome Atlas (TCGA). Uniformly processed (normalized and batch-effect corrected) gene expression datasets from TCGA and GTEx were obtained from https://figshare.com/articles/Data_record_3/5330593. Gene expression and phenotype were standardized before being supplied to the GSR software. Standard GSEA was also performed for comparison. Gene sets were downloaded from the MSigDB website http://software.broadinstitute.org/gsea/msigdb/index.jsp. Here we combined BIOCARTA, KEGG and REACTOME to create a total of 1,050 gene sets. We also downloaded the 4,436 GO biological process terms as additional gene sets, as well as the 189 gene sets pertaining to oncogenic signatures for the TCGA data analysis.

Funding Statement

The research is supported by Canada First Research Excellence Fund (CFREF) Healthy Brains, Healthy Life (HBHL) New Investigator fund (249591) at McGill University and Montreal Neurologic Institute (MNI) and NSERC Discovery Grant (RGPIN-2019-0621). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No author received a salary from any of the funders.

References

1. Burton PR, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678. doi:10.1038/nature05911
2. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. AJHG. 2017;101(1):5–22. doi:10.1016/j.ajhg.2017.06.005
3. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences. 2009;106(23):9362–9367. doi:10.1073/pnas.0903103106
4. Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk of complex disease. Current Opinion in Genetics & Development. 2008;18(3):257–263. doi:10.1016/j.gde.2008.07.006
5. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research. 2017;45(D1):D896–D901. doi:10.1093/nar/gkw1133
6. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nature Publishing Group. 2016;18(2):117–127.
7. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh PR, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics. 2015;47(11):1228–1235. doi:10.1038/ng.3404
8. Li Y, Kellis M. Joint Bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases. Nucleic Acids Research. 2016. doi:10.1093/nar/gkw627
9. Lamparter D, Marbach D, Rueedi R, Kutalik Z, Bergmann S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Computational Biology. 2016;12(1):e1004714–20. doi:10.1371/journal.pcbi.1004714
10. GTEx Consortium, Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, et al. A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics. 2015;47(9):1091–1098. doi:10.1038/ng.3367
11. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nature Genetics. 2016;48(3):245–252. doi:10.1038/ng.3506
12. Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, et al. Opportunities and challenges for transcriptome-wide association studies. Nature Genetics. 2019;51(4):1–10. doi:10.1038/s41588-019-0385-z
13. Battle A, Brown CD, Engelhardt BE, Montgomery SB. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204–213. doi:10.1038/nature24277
14. Mancuso N, Freund MK, Johnson R, Shi H, Kichaev G, Gusev A, et al. Probabilistic fine-mapping of transcriptome-wide association studies. Nature Genetics. 2019;51(4):1–12. doi:10.1038/s41588-019-0367-1
15. Finucane HK, Reshef YA, Anttila V, Slowikowski K, Gusev A, Byrnes A, et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics. 2018;50(4):1–14. doi:10.1038/s41588-018-0081-4
16. Sanchez-Vega F, et al. Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell. 2018;173(2):321–337.e10. doi:10.1016/j.cell.2018.03.035
17. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15545–15550. doi:10.1073/pnas.0506580102
18. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi:10.1038/nature15393
19. Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics (Oxford, England). 2015.
20. Wang Q, Armenia J, Zhang C, Penson AV, Reznik E, Zhang L, et al. Data Descriptor: Unifying cancer and normal RNA sequencing data from different sources. Scientific Data. 2018;5:1–8. doi:10.1038/sdata.2018.61

PLoS One. doi: 10.1371/journal.pone.0237657.r001

Decision Letter 0

F Alex Feltus

29 Apr 2020

PONE-D-20-07689

Partitioning gene-based variance of complex traits by gene score regression

PLOS ONE

Dear Dr. Li,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The expert reviewers provide valuable advice. Please carefully address each and every concern. It will substantially improve your report.

We would appreciate receiving your revised manuscript by Jun 13 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

F. Alex Feltus, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

'The research is supported by Canada First Research Excellence Fund (CFREF) Healthy Brains, Healthy Life (HBHL) New Investigator fund (249591) at McGill University and Montreal Neurologic Institute (MNI) and NSERC Discovery Grant (RGPIN-2019-0621).'

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

'The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.'

4. Please include a copy of Table 1 which you refer to in your text on page 14.

Additional Editor Comments (if provided):


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript describes an issue when gene set enrichments are performed using genes identified from transcriptome wide association study (TWAS). In TWAS while using expression quantitative trait loci (eQTL) to genetically predict gene expression in a GWAS cohort and performing associations between gene expression and phenotype, genes that are not relevant to the phenotype but are regulated by SNPs in high LD with the causal SNP can obtain high test statistics. This can lead to false discoveries in gene set enrichment analyses. This is a reasonable issue and a valid aim for the study. To address this issue, the authors’ strategy is to regress out the sum of gene-gene correlation from the genes’ marginal statistic and estimate the amount of phenotypic variance explained by the predicted expression of the genes. I found the main text to be sparsely written and confusing in various sections and many aspects of this study are unclear to me. Some analysis approaches are also puzzling to me. Either there are issues with the methodology and/or the procedures could be described in a considerably better way to engage a wide readership. I have the following comments:

1. Section 4.1 line “We calculated TWAS marginal statistic as the product of GWAS summary statistic and eQTL weights derived from the GTEx whole blood samples” is unclear in what was the summary statistic used - effect size? P value? Why was TWAS statistic defined this way when the authors could have used some existing TWAS studies and calculate correlation??

2. Why did the genes have to be binned and the correlation calculated on the average scores? The Fig 2 titles “Correlation between marginal statistic and gene scores 27 traits” are then misleading because it’s actually averaging within bins. To make this analysis more robust, permutations should be performed to assess the significance of correlation. Also, how come the x axis that shows the bins, goes only from 12 to 20 when it should start from 1? The straightforward way would be to calculate the correlation between the gene scores with the TWAS effect size. The authors should comment on why their specific approach was taken.

3. Fig 3: The methods compared all have intrinsically different algorithms, assumptions, statistical tests, number of samplings etc. Are the p values from all these really comparable? Fig 3 legend says “the enrichment score for causal pathways and non-causal pathways” which suggests that a metric such as effect size/fold change would be presented. A fair comparison would include some sort of precision/recall metrics. The authors also ran only 10 simulations which seems quite low, and might explain why some of the interquartile ranges in fig 3 are so large.

4. The method FOCUS seems to perform better in the simulations but the reasoning against using that is that it took 30 mins vs GSR took 3 mins. This is an insufficient argument in favor of GSR, users would definitely prefer accuracy over little extra computational resources.

5. Fig 5 labels are not legible and quite distracting. It is unclear what this figure is really trying to highlight, the relevant pathways come up from the compared methods as well. Only the scale of p values change. Are there relevant pathways that other methods miss but are identified by GSR?

6. Fig 6: This figure is also very briefly explained in the text. What is the x axis and what do multiple points for the same color represent? The other methods are not compared at this point?

7. The figure legends in the manuscript in general are very short and non-informative.

Other comments:

Fig 1C - The number labels in this panel are confusing as this is a hypothetical example. Maybe just label gene 1, gene 2 etc.

The simulation analysis in Fig 3 is barely explained in the main text and just references the methods. This analysis could be set up in a more informative way in the main text to benefit the readers.

Discussion could be elaborated a little.

Reviewer #2: In this manuscript, the authors proposed a method, Gene Score Regression (GSR), to estimate the phenotypic variance explained by the gene expression and can be used to test for gene set or pathway enrichments based on GWAS summary statistics or the observed gene expression. They performed simulation experiments and also applied GSR to real data. The results supported that GSR is powerful and robust. My major concern is power and false positive rate. In simulation study, the authors only performed 10 times for each setting, if computation time is not a problem, it would be good to perform at least 1000 times to estimate power and false positive rate for some settings, particularly low heritability. Minor concern is the format. It seems that the authors use another format in the beginning but didn’t completely match PLOS ONE’s format. It really confused me when I reviewed this manuscript, so I think the authors should reorganize it. Figure quality and errors should also be careful. I listed my questions as following by section: (please see attachment)

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Partitioning gene-based variance of complex traits by gene score regression.docx


PLoS One. 2020 Aug 20;15(8):e0237657. doi: 10.1371/journal.pone.0237657.r002

Author response to Decision Letter 0

18 Jun 2020

Reviewer #1

Thank you for your comments. We have made extensive modifications and clarifications throughout the manuscript. Please also see our response below.

We used the effect sizes from both the GWAS summary statistics and the GTEx eQTL summary statistics. We sought a unified way to integrate GWAS results and eQTL information, and we found that the product of the SNP-to-trait effect sizes from GWAS and the SNP-to-gene-expression effect sizes can serve as a proxy for the marginal effect size of the gene on the trait. We have now clarified this in Eq. 12-15 and explained it in more detail in the Methods at line 216.
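The integration described in this response can be sketched as a weighted sum: for each gene, the SNP-to-trait effect sizes are multiplied by that gene's SNP-to-expression weights and summed over SNPs. The function name, array layout and toy numbers below are hypothetical, and the sketch deliberately omits any scaling or LD adjustment that Eq. 12-15 of the manuscript may apply.

```python
import numpy as np

def gene_marginal_effects(gwas_beta, eqtl_weights):
    """Proxy for the marginal effect of each gene on the trait.

    gwas_beta:    (n_snps,) SNP-to-trait effect sizes from GWAS.
    eqtl_weights: (n_snps, n_genes) SNP-to-expression weights.
    Returns an (n_genes,) array: the weight-by-effect product summed over SNPs.
    """
    gwas_beta = np.asarray(gwas_beta, dtype=float)
    eqtl_weights = np.asarray(eqtl_weights, dtype=float)
    if eqtl_weights.shape[0] != gwas_beta.shape[0]:
        raise ValueError("eqtl_weights must have one row per SNP in gwas_beta")
    return eqtl_weights.T @ gwas_beta

# Toy example: 3 SNPs, 2 genes (numbers are made up).
beta = np.array([0.10, -0.05, 0.00])
weights = np.array([[0.5, 0.0],
                    [0.5, 0.2],
                    [0.0, 0.8]])
effects = gene_marginal_effects(beta, weights)  # array([ 0.025, -0.01 ])
```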

The genes were binned in Figure 2 to reduce noise, because gene scores calculated from eQTL summary statistics may contain inflated noise. This approach was also adopted by the LD-score regression study (Bulik-Sullivan et al. Nature Genetics). However, for comparison, we have added a Supplementary Figure S1 in which gene scores are not binned. There, we observed that the slopes fitted in the two cases were very similar, although the Pearson correlation dropped when the genes were not binned, which indicates noise inflation. We have provided this information, corrected the figure legend and annotated the x-axis as the average gene score of each bin instead of the bin indicator. In addition, we have performed 1000 permutations to derive a confidence interval (line 221) for the Pearson correlation estimate, by randomly sampling genes and recreating bins. This CI overlapped with and was centered around the original correlation estimate, thus verifying the robustness of the approach.
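The binning and resampling procedure outlined in this response might look like the following Python sketch. It sorts genes by gene score, splits them into equal-sized bins, correlates the bin averages, and then repeats this on randomly resampled genes to obtain a percentile interval. The bin count, sampling with replacement and the 2.5/97.5 percentiles are assumptions; the response does not specify the exact resampling scheme.

```python
import numpy as np

def binned_correlation(scores, stats, n_bins=20):
    """Pearson correlation between bin-averaged gene scores and
    bin-averaged marginal statistics (genes sorted by score)."""
    scores = np.asarray(scores, dtype=float)
    stats = np.asarray(stats, dtype=float)
    order = np.argsort(scores)
    bins = np.array_split(order, n_bins)      # near-equal-sized bins
    mean_score = np.array([scores[b].mean() for b in bins])
    mean_stat = np.array([stats[b].mean() for b in bins])
    return np.corrcoef(mean_score, mean_stat)[0, 1]

def resampled_interval(scores, stats, n_bins=20, n_rep=1000, seed=0):
    """Percentile interval for the binned correlation, obtained by
    resampling genes (with replacement, an assumption) and re-binning."""
    scores = np.asarray(scores, dtype=float)
    stats = np.asarray(stats, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(scores)
    reps = np.empty(n_rep)
    for i in range(n_rep):
        idx = rng.integers(0, n, size=n)
        reps[i] = binned_correlation(scores[idx], stats[idx], n_bins)
    low, high = np.percentile(reps, [2.5, 97.5])
    return low, high
```

Comparing `binned_correlation` with `np.corrcoef` on the unbinned genes gives a quick view of how much averaging within bins reduces noise.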

We completely agree that the p-values obtained from these different algorithms are not directly comparable, as you pointed out. However, we posit that the p-values themselves are very important in practice, because the ultimate goal of gene set / pathway enrichment analysis is usually to identify enriched signals. In that setting, researchers rely on the p-values to select these targets. Thus, if an algorithm is able to yield smaller p-values for the true underlying pathways regardless of the difference in the underlying null distribution, we consider it more useful. We added this consideration to the Discussion in lines 342-348. Nevertheless, we have also provided precision-recall curves (PRCs) and area under the PRC measures for GSR and PASCAL, which is the most relevant method to ours, based on 100 simulations in Figure 3a, and demonstrated the superiority of our proposed method.
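A precision-recall comparison of this kind ranks gene sets by p-value and needs only the ranking, which is why it sidesteps the comparability issue. The Python sketch below computes precision and recall at each rank and a step-wise average precision as a simple summary of the area under the curve. The function names are hypothetical, ties in p-values are not treated specially, and the exact area estimator used for Figure 3a is not stated in this response.

```python
import numpy as np

def precision_recall(p_values, is_causal):
    """Precision and recall after each gene set, ranked by ascending p-value."""
    p = np.asarray(p_values, dtype=float)
    truth = np.asarray(is_causal, dtype=bool)
    order = np.argsort(p, kind="stable")
    hits = np.cumsum(truth[order])            # true positives among top k
    k = np.arange(1, len(p) + 1)
    return hits / k, hits / truth.sum()

def average_precision(p_values, is_causal):
    """Mean precision at the ranks of the causal gene sets
    (a step-wise approximation of the area under the PR curve)."""
    p = np.asarray(p_values, dtype=float)
    truth = np.asarray(is_causal, dtype=bool)
    order = np.argsort(p, kind="stable")
    precision, _ = precision_recall(p, truth)
    return precision[truth[order]].mean()

# Toy example: causal sets ranked 1st and 3rd out of 4.
ap = average_precision([0.001, 0.01, 0.2, 0.5], [True, False, True, False])
# ap == (1/1 + 2/3) / 2
```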

We also opted to keep the original figure, in which we showed 10 simulations for all the other methods (LDSC and FOCUS with different credible gene sets), as (1) it is computationally expensive to apply all methods and (2) 10 simulations were already sufficient to illustrate their performance, with quite narrow interquartile ranges for FOCUS and LDSC.

In our simulations, it is not surprising that FOCUS can accurately fine-map causal genes, as the simulation designs followed assumptions similar to those adopted by FOCUS (Mancuso et al., Nature Genetics). If these assumptions do not hold (which is unknown in real settings), it remains debatable whether FOCUS could still accurately capture all the true signals. It is noteworthy that the computational cost of FOCUS is exponential in the number of causal genes considered within each locus, whereas GSR is not affected by the number of causal genes. Also, because GSR operates at the genome-wide level, no threshold is needed to decide which GWAS/TWAS loci or which genes to include, whereas FOCUS needs user-defined thresholds for choosing those GWAS/TWAS loci and for constructing the credible gene set for the subsequent hypergeometric enrichment test. Taken together, we posit that GSR remains a valuable tool for relevant TWAS studies, given its flexibility to use different data sources and its increased computational efficiency. We have added these points to the Discussion in lines 309-312.

We agree. We have now removed this figure and re-iterated this part of results in lines 272-274.

6. Fig 6: This figure is also very briefly explained in the text. What is the x axis and what do multiple points for the same color represent? The other methods are not compared at this point?

We have now expanded our explanation in the legend of (currently) Figure 5. Gene sets are indicated by dots and are aligned in the same order on the x-axis, and multiple points of the same color represent the 9 tissue groups. Because this section was mainly intended to demonstrate the biological interpretation one could obtain from running GSR, we opted not to compare with the other methods, which have already been widely used.

7. The figure legends in the manuscript in general are very short and non-informative.

Thank you. We have revised all the legends and hopefully they are now more informative.

Other comments:

Fig 1C - The number labels in this panel are confusing as this is a hypothetical example. Maybe just label gene 1, gene 2 etc.

We have made this adjustment.

The simulation analysis in Fig 3 is barely explained in the main text and just references the methods. This analysis could be set up in a more informative way in the main text to benefit the readers.

We have added more details to the Methods in section “applying existing methods”.

Discussion could be elaborated a little.

We have added more discussion of the utility and limitations of our method.

Reviewer #2

In this manuscript, the authors proposed a method, Gene Score Regression (GSR), to estimate the phenotypic variance explained by the gene expression and can be used to test for gene set or pathway enrichments based on GWAS summary statistics or the observed gene expression. They performed simulation experiments and also applied GSR to real data. The results supported that GSR is powerful and robust. My major concern is power and false positive rate. In simulation study, the authors only performed 10 times for each setting, if computation time is not a problem, it would be good to perform at least 1000 times to estimate power and false positive rate for some settings, particularly low heritability.

Thank you for your comments. While we agree that the small number of replications may introduce some uncertainty in ascertaining the power of different methods, we have to admit that the excessively high computational cost did circumscribe our efforts to perform more experiments, taking into account the more time-consuming process in generating complete SNP-gene-phenotype datasets. To this end, we have run GSR and PASCAL, our major competitor, in 100 simulations respectively and updated our results in Figure 3. However, since in previous analyses we found FOCUS consistently gave accurate results (but with exceedingly long time) while LDSC was not specifically built for this type of task, we refrained from applying these two algorithms and added discussions in lines 324-329.

Minor concern is the format. It seems that the authors use another format in the beginning but didn’t completely match PLOS ONE’s format. It really confused me when I reviewed this manuscript, so I think the authors should reorganize it. Figure quality and errors should also be careful.

We have reformatted the manuscript and we hope it is more clear now.

I listed my questions as following by section:

1 Introduction

“In TWAS, we can regress on the expression changes using the genotype information from the reference cohort…” In this description, the dependent variable is the expression changes and the independent variable is the genotype information. According to the cited reference 10, Figure 2 and Equation 1, the dependent variable is the expression, not the expression changes. Please clarify.

We have corrected this expression in lines 15-16. Indeed, the dependent variable is the gene expression.

In Figure 1 (a), please specify which SNP is 1, 2 and 3, respectively. In (b), the blue SNPs are causal for a non-causal gene. Are the two blue SNPs the most significant SNPs among two non-causal genes, respectively? In (c), what does the number represent?

We have revised Figure 1a to specify the SNPs. In Figure 1b, the two blue SNPs are not necessarily the most significant ones (in detecting SNP-phenotype associations), as this is merely a hypothetical example. In reality, the causal SNPs for non-causal genes may have larger or smaller p-values, depending on the exact magnitude of linkage to the true causal SNPs for the phenotype. We have removed the misleading numbers in Figure 1c.

2. Related Methods

“…it does not account for the gene-gene correlation, which is distinct from TWAS-induced correlation but is rather due to the sharing of transcriptional regulatory network among genes.” Please explain again that what leads to TWAS-induced correlation? (LD or other factors)

TWAS-induced correlation mostly comes from LD. We have re-written this section into the Introduction and specified in lines 24-26.

3. Methods

Phenotypic variance explained by gene expression

For Eq (2), does it still hold for binary outcome (y), i.e., logistic regression?

We have clarified in the Methods in lines 60-61. This approach can be generalized to binary traits on a liability scale, as has been done in the LD-score regression study (Bulik-Sullivan et al. Nature Genetics).

Please verify Eq (4) and (5) for the terms $A_g^{\mathrm{gwas}} y$ and $A_g y$. For the OLS solution, they should be $(A_g^{\mathrm{gwas}})^\top y$ and $A_g^\top y$, respectively.

Thank you very much. We have corrected them.
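For reference, the transpose follows from the usual least-squares formula. Assuming, purely for illustration, that the predicted expression vector $A_g$ of gene $g$ is standardized across the $N$ individuals so that $A_g^\top A_g = N$, and writing $\hat{\alpha}_g$ (a symbol introduced here, not taken from the manuscript) for the marginal estimate:

```latex
\hat{\alpha}_g
  = \left(A_g^\top A_g\right)^{-1} A_g^\top y
  = \frac{1}{N} A_g^\top y
```

The same argument applies with $A_g^{\mathrm{gwas}}$ in place of $A_g$. Without the transpose the product of two length-$N$ column vectors is not defined.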

The gene expression of GWAS ($A_g^{\mathrm{gwas}}$) was estimated based on a reference panel. This estimation is reliable for “controls” in GWAS that represent the general population, as do those from the reference panel. Is it still reliable for “cases” in GWAS when using estimated values ($\hat{W}_g$) from the reference panel?

We have added this point to the Discussion in lines 353-355. Our assumption is that genes, rather than genotypes, directly affect the disease status. In this sense, we think that as long as the genetic structure of the two populations matches and the bias introduced by population stratification is well controlled, which is normally the case for GWAS, it is valid to use the same set of weights.

Please define the notation for i and j.

We have defined them in line 63.

The Eq (6) was not used in the following text. What is the purpose to show β ^ in the Eq(6)?

We intended to relate it to the GWAS SNP-to-trait effect size. We have now clarified this in lines 83-84.

“From (8) to (9), we assume that all of the random variables are independent. The assumption holds if gene causal effects are independent and are also independent from the gene-gene correlation.” If the assumption doesn’t hold, what kind of bias will be introduced (e.g., underestimation or overestimation)?

We have added discussion on the bias in lines 97-100 and lines 356-358.

Please explain how to obtain χ_g^2. Is it the test statistic from the regression of phenotype y on gene expression (or predicted gene expression)?

We have now clarified this in Eq. 12-15.

3.2 Partitioning variance component by gene sets

“The full derivation is similar to that for Eq (12) and detailed in Supplementary Methods.” Please indicate which section in Supplementary Methods.

We have now unified the derivation for both summary statistics and individual level data. All materials have been incorporated into the Methods.

“Therefore, we regress one gene set at a time along with a dummy gene set that include the union of all of the genes in the gene sets. The dummy gene set is used to account for unbalanced gene sets.” In a regression model for one gene set, does it include two independent variables, i.e., one is the gene score for a given gene set and the other is the gene score for a dummy gene set? In the dummy gene set, does it include genes belonging to the gene set of interest? Why a dummy gene set can be used to account for unbalanced gene sets? Please explain.

Yes. We have two independent variables in the example you provided. We have further clarified this in lines 93-95. The dummy gene set is included because all genes need to be contained in our model.

According to the main equation, an intercept in a regression model would be close to 1. In what condition, the intercept will be away from 1? In the text, “We also include an intercept in the regression model to properly control non-gene-set biases.” How to control non-gene-set biases?

- If the intercept deviates from 1, one can examine several plausible causes, such as various forms of interactions, measurement error, and correlation between gene scores and effect sizes in Equation 9. It should be noted that our method relies on an important assumption: the effect sizes of genes on the trait and the derived gene scores are independent. In practice, if this assumption is violated, our method might suffer from the resulting bias. For example, a positive correlation between gene scores and true gene effect sizes could lead to an intercept greater than 1, and a negative correlation could lead to an intercept smaller than 1. To be exact, we think there is no easy solution to control for these biases, but an intercept might alleviate such effects if they are additive to the original offset. We have revised this argument in lines 97-100 and added discussion to the Discussion in lines 356-358.

3.3 Gene score regression on total gene expression

On page 5, α̂_g = (1/N_gwas) Â_g^T y, but here α̂_g = A_g^T y. Is it correct without the 1/N_real term?

Yes. We have clarified it now.

“If one gene is a causal gene and the other is not, we will see inflated summary statistic for the non-causal gene, thereby confounding the detection for causal pathways.”. Does it indicate that the non-causal gene will be detected and the false positive rate then increases in this case?

- We have rephrased this entire section, which might have been misleading.

3.4 Simulation

Simulation step 1: simulate gene expression:

In the 1000 Genomes Project, there are 503 individuals of European ancestry. What were the exclusion criteria used to remove individuals? And how many independent blocks were generated?

There are only 489 individuals of European ancestry from the 1000 Genomes Project that were documented in the TWAS/FUSION project (Gusev, et al., Nature Genetics), which was our data source. We have specified this in lines 190-191. We sampled 100 LD blocks from a total of 1,703 LD blocks determined by LDetect to reduce computational burden. This is now clarified in lines 112-113.

Was the genotype standardized in the reference panel (i.e., 489 individuals) or after simulation (in 500 individuals)? Please clarify.

We standardized the genotype after simulation. This is now clarified in line 121.

Was a bootstrap technique used to simulate 500 individuals from the 489 Europeans in the 1000 Genomes data?

We have rephrased the description of the simulation process in lines 117-120 to make it clearer. We sampled real LD blocks from these 489 individuals and concatenated them. Therefore, each simulated genotype consists of LD blocks from different individuals.

“We randomly sampled k in-cis causal SNPs per gene within ±500 kb around the gene, where k = 1 (default).” If k > 1, are the randomly sampled k causal SNPs per gene independent (LD r2 < 0.2)?

- These k SNPs did not have to be independent; no LD threshold was imposed because we posit this would better preserve the LD structure.

3.5 Data sets and 3.6 Running existing methods

Different methods require different data types (such as summary statistics or individual genotype/expression data at the SNP or gene level) to perform analysis. Please indicate what data type each of the five methods in Table 1 requires and what datasets they used for analysis.

We have added this to Table 1.

Please provide the sample sizes of the datasets that were used in this manuscript, including TWAS, GWAS, TCGA and GTEx.

- We have now provided the sample sizes accordingly in lines 186, 195-197.

4. Results

4.1 Gene scores correlate with TWAS statistics in polygenic complex traits

Please reorganize the 4.1 section. Some parts of description should be presented in the method section.

We have reorganized this section.

“We calculated TWAS marginal statistic as the product of GWAS summary statistic and eQTL weights derived from the GTEx whole blood samples.” This sentence indicates how to calculate TWAS marginal statistic for each SNP. However, GSR was proposed for gene level analysis, so please explain how to calculate TWAS marginal statistic for a given gene, e.g. summation of the products of GWAS summary statistic and eQTL weights within a given gene.

This refers to the current Equation 15, where W would be the eQTL weights and beta would be GWAS summary statistics. This has been clarified in lines 215-216.

“This implies a pervasive confounding impacts on the downstream analysis using the TWAS statistic (Figure 1e) when using existing approaches that mostly assume independence of genes.” Please provide some examples of downstream analyses. Also, Figure 1e should be Figure 2e.

Thank you. We have expanded this sentence in lines 222-224 and corrected the figure reference.

For Figure 2 (a), please explain why gene bins were used to show the correlation. Why not directly use the gene score and chi-squared statistic? For (b), the correlation of gene score and TWAS marginal statistic is negative for T2D. How should the negative correlation be interpreted, and what confounders could lead to it? The figure panel names do not match the description in the figure legend.

- The genes were binned in Figure 2 to reduce noise, because gene scores calculated from TWAS summary statistics may contain inflated noise. This approach was also adopted by the LD-score regression study (Bulik-Sullivan et al. Nature Genetics). However, for comparison, we have added Supplementary Figure S1, which does not bin gene scores. There, we observed that the slopes fitted in the two cases were very similar, although the Pearson correlation dropped when the genes were not binned, which indicates noise inflation. In Figure 2b, the negative correlation for T2D indicates that this trait, possibly due to complicated genetic architecture and confounding gene-to-environment interaction and drug effects, is not suitable for our approach. We therefore later illustrated the utility of our method using traits with higher correlation, such as schizophrenia. We have also corrected the panel labels.
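
The effect of binning described above can be illustrated with a minimal sketch on synthetic data. The function name `bin_by_score`, the use of equal-count bins, and the default of 50 bins are assumptions made for illustration and are not taken from the manuscript.

```python
import numpy as np

def bin_by_score(scores, chi2, n_bins=50):
    """Sort genes by gene score, split them into equal-count bins,
    and return the mean score and mean chi-square of each bin.
    (Equal-count binning is an illustrative assumption.)"""
    order = np.argsort(scores)
    score_bins = np.array_split(scores[order], n_bins)
    chi2_bins = np.array_split(chi2[order], n_bins)
    mean_scores = np.array([b.mean() for b in score_bins])
    mean_chi2 = np.array([b.mean() for b in chi2_bins])
    return mean_scores, mean_chi2

# Synthetic example: a weak linear trend buried in per-gene noise.
rng = np.random.default_rng(0)
scores = rng.uniform(1.0, 10.0, size=5000)
chi2 = 1.0 + 0.3 * scores + rng.normal(0.0, 5.0, size=5000)

binned_scores, binned_chi2 = bin_by_score(scores, chi2)

# Fitted slope and Pearson correlation, with and without binning.
slope_raw = np.polyfit(scores, chi2, 1)[0]
slope_binned = np.polyfit(binned_scores, binned_chi2, 1)[0]
r_raw = np.corrcoef(scores, chi2)[0, 1]
r_binned = np.corrcoef(binned_scores, binned_chi2)[0, 1]
```

In this toy setting the two slopes stay close while the correlation is much higher for the binned data, which mirrors the qualitative pattern the response describes; the numbers themselves carry no meaning for the real traits.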

4.2 GSR improves pathway enrichment power

In Table 1, what does “Exprs” refer to?

That referred to observed gene expression. We have spelled it out and added information to the legend.

In Figure 3, please indicate what blue and red dotted lines represent. For y axis, the label doesn’t match the description in the figure legend. Here, is the “enrichment score” (in the description) shown as the p-value?

Thank you. We have added the missing information to the figure legend and corrected the label.

“Notably,… enrichment test.” This whole paragraph should be moved to discussion section, since there is a section called “Discussion and Conclusion”.

- We have rephrased and re-organized accordingly.

4.3 Improved power in pathway enrichment when using the observed gene expression

“To evaluate the accuracy of this application, we simulated gene expression and phenotype for 1000 individuals, which were provided as input to GSR for pathway enrichment analysis.” Is it another simulation study other than section 4.2? If yes, please describe the procedures in the simulation section.

- Simulation of gene expression is now described in the Methods under “Simulation step 2”. This simulation does not involve using reference TWAS summary statistics.

4.4 Gene set enrichments in complex traits

In this section, the description related to Materials and Methods should be reorganized.

We have re-organized this section.

In Figure 5, placing pathway names directly on the main figure area makes the figure unclear; e.g., the band at the bottom is misleading about the number of gene sets tested. On the right tail of Figure 5, i.e., significant for FOCUS but not for GSR, some gene set/pathway enrichments were detected by FOCUS but not by GSR. Does this imply that GSR is not powerful in some cases? If yes, please try to evaluate this limitation.

We have removed this Figure which could be misleading and re-organized this section.

For cell-type-specific enrichment analyses, was Ŵ_g estimated from the gene expression of specific cell types?

- No, the cell-type-specific enrichment analyses only utilized cell-type-specific gene sets identified in the GTEx and Franke lab datasets; the weights were not specifically estimated for each cell type, but were based on TWAS using GTEx whole blood samples. This is specified in lines 216-217. Thus, they do not directly represent cell-type-specific gene expression. We have also discussed this limitation in the Discussion in lines 350-355.

4.5 Application on observed gene expression

Please specify sample size for each cancer, including case and control, in the observed gene expression analyses.

We have now specified in the Methods in lines 195-197.

Supplementary information

There are duplications in Supplementary information and Methods. Please reorganize.

Thank you. We have removed the duplicated sections.

Attachment

Submitted filename: plos_one_response_letter_rev.docx (37.8KB, docx)

PLoS One. doi: 10.1371/journal.pone.0237657.r003

Decision Letter 1

F Alex Feltus

22 Jul 2020

PONE-D-20-07689R1

Partitioning gene-based variance of complex traits by gene score regression

PLOS ONE

Dear Dr. Li,

Please address the minor comments from Reviewer #2 and take the opportunity to deep read once more before acceptance.

Please submit your revised manuscript by Sep 05 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

F. Alex Feltus, Ph.D.

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: The manuscript text edits and the clarifications provided by the authors have addressed my comments.

Reviewer #2: The authors clarified my questions and reorganized well. I just have four minor questions as follows. Please see attachment.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Partitioning gene-based variance of complex traits by gene score regression 2nd.docx (13.2KB, docx)

PLoS One. 2020 Aug 20;15(8):e0237657. doi: 10.1371/journal.pone.0237657.r004

Author response to Decision Letter 1

28 Jul 2020

Reviewer #2

The authors clarified my questions and reorganized well. I just have four minor questions as follows.

- Thank you for your feedback.

Methods

Partitioning gene-based variance of complex traits

1. “…,we will be able to perform linear regression and derive regression coefficient that is an estimate for each τc, respectively.” Please indicate what dependent and independent variables are for the linear regression.

- We have now indicated this in line 79. We regressed the chi-square statistics (dependent variable) on the gene scores (independent variables).
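
The regression described above can be sketched as follows. This is a minimal illustration, not the GSR implementation: the function names `gene_scores` and `fit_gsr` are hypothetical, and defining a gene's score for a gene set as the sum of its squared correlations with the genes in that set is an assumption made here by analogy with LD-score regression.

```python
import numpy as np

def gene_scores(R, set_masks):
    """Build a (genes x gene sets) matrix of gene scores.

    R is a gene-by-gene correlation matrix; set_masks is a list of
    boolean arrays marking the genes in each gene set. The score of
    gene g for set c is taken here to be the sum of squared
    correlations between g and the genes in c (an assumption)."""
    R2 = np.asarray(R) ** 2
    return np.column_stack([R2[:, mask].sum(axis=1) for mask in set_masks])

def fit_gsr(chi2, scores):
    """Ordinary least squares of chi-square statistics (dependent
    variable) on gene scores (independent variables) plus an
    intercept. Returns (intercept, per-set coefficients)."""
    X = np.column_stack([np.ones(len(chi2)), scores])
    coef, *_ = np.linalg.lstsq(X, chi2, rcond=None)
    return coef[0], coef[1:]
```

A typical call would pass one gene set of interest together with a dummy set containing all genes, e.g. `fit_gsr(chi2, gene_scores(R, [in_set, all_genes]))`, so that each fitted coefficient plays the role of an estimate of the corresponding τc.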

Simulation design

2. In simulation step 1.3, for 500 individuals, did you sample each block for each individual with replacement? For example, for individual i (i = 1, …, 500) and block j (j = 1, …, 100), you randomly sampled one block j from the 489 available copies of block j, and you concatenated the sampled blocks 1, …, 100 for individual i. Please provide a clear description.

- For LD block j (j in {1,...,100}) of an individual i, we randomly sampled from the 489 available samples for block j, and concatenated these sampled LD blocks 1,...,100 for this individual. We repeated this procedure to simulate genotype X_ref for N_ref = 500 individuals as a reference population.

- We have further clarified this in lines 118-121.
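
The block-wise sampling procedure described in this response can be sketched as below. This is an illustrative sketch only: the function name `simulate_genotypes` and the input layout (one array per LD block, with rows for panel individuals and columns for SNPs) are assumptions, and the donor for each block is drawn independently and uniformly, which amounts to sampling with replacement across simulated individuals. Standardization is applied after simulation, as stated in the earlier response.

```python
import numpy as np

def simulate_genotypes(blocks, n_individuals, rng):
    """Simulate genotypes by stitching together LD blocks.

    blocks: list of arrays, one per LD block, each of shape
            (n_panel_individuals, n_snps_in_block).
    For every simulated individual and every block, one panel
    individual is drawn at random and that individual's copy of the
    block is used; the sampled blocks are concatenated along the SNP
    axis, so each simulated genotype mixes blocks from different
    panel individuals."""
    sampled = []
    for block in blocks:
        donors = rng.integers(0, block.shape[0], size=n_individuals)
        sampled.append(block[donors])
    X = np.concatenate(sampled, axis=1).astype(float)

    # Standardize each SNP after simulation (zero mean, unit variance);
    # monomorphic SNPs are left centered to avoid division by zero.
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0
    return (X - X.mean(axis=0)) / sd
```

For example, with 100 blocks drawn from a 489-individual panel, `simulate_genotypes(blocks, 500, np.random.default_rng(1))` returns a matrix with 500 rows and one column per SNP across all blocks.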

Results

3. For Figure 2, I think it would be good to add the explanation of “how to interpret negative correlation for T2D” in the figure description.

- The negative correlation for T2D indicates that this trait, possibly due to complicated genetic architecture and confounding gene-to-environment interaction and drug effects, is not suitable for using our approach.

- We have added the rationale in the figure legend.

4. In Table 1, “*Summary statistics” and “†For custom gene sets,…” are footnotes and should not be put in the table title.

- Thank you. We have moved those to the footnotes.

Attachment

Submitted filename: plos_one_response_letter_rev2.docx (21.6KB, docx)

PLoS One. doi: 10.1371/journal.pone.0237657.r005

Decision Letter 2

F Alex Feltus

31 Jul 2020

Partitioning gene-based variance of complex traits by gene score regression

PONE-D-20-07689R2

Dear Dr. Li,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

F. Alex Feltus, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0237657.r006

Acceptance letter

F Alex Feltus

6 Aug 2020

PONE-D-20-07689R2

Partitioning gene-based variance of complex traits by gene score regression

Dear Dr. Li:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. F. Alex Feltus

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 File (PDF, 483.7KB)

Data Availability Statement

1. Burton PR, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678. doi:10.1038/nature05911

2. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. AJHG. 2017;101(1):5–22. doi:10.1016/j.ajhg.2017.06.005

3. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences. 2009;106(23):9362–9367. doi:10.1073/pnas.0903103106

4. Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk of complex disease. Current Opinion in Genetics & Development. 2008;18(3):257–263. doi:10.1016/j.gde.2008.07.006

5. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research. 2017;45(D1):D896–D901. doi:10.1093/nar/gkw1133

6. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nature Publishing Group. 2016;18(2):117–127.

7. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh PR, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics. 2015;47(11):1228–1235. doi:10.1038/ng.3404

8. Li Y, Kellis M. Joint Bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases. Nucleic Acids Research. 2016. doi:10.1093/nar/gkw627

9. Lamparter D, Marbach D, Rueedi R, Kutalik Z, Bergmann S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Computational Biology. 2016;12(1):e1004714–20. doi:10.1371/journal.pcbi.1004714

10. GTEx Consortium, Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, et al. A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics. 2015;47(9):1091–1098. doi:10.1038/ng.3367

11. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nature Genetics. 2016;48(3):245–252. doi:10.1038/ng.3506

12. Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, et al. Opportunities and challenges for transcriptome-wide association studies. Nature Genetics. 2019;51(4):1–10. doi:10.1038/s41588-019-0385-z

13. Battle A, Brown CD, Engelhardt BE, Montgomery SB. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204–213. doi:10.1038/nature24277

14. Mancuso N, Freund MK, Johnson R, Shi H, Kichaev G, Gusev A, et al. Probabilistic fine-mapping of transcriptome-wide association studies. Nature Genetics. 2019;51(4):1–12. doi:10.1038/s41588-019-0367-1

15. Finucane HK, Reshef YA, Anttila V, Slowikowski K, Gusev A, Byrnes A, et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics. 2018;50(4):1–14. doi:10.1038/s41588-018-0081-4

16. Sanchez-Vega F, et al. Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell. 2018;173(2):321–337.e10. doi:10.1016/j.cell.2018.03.035

17. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15545–15550. doi:10.1073/pnas.0506580102

18. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi:10.1038/nature15393

19. Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics (Oxford, England). 2015.

20. Wang Q, Armenia J, Zhang C, Penson AV, Reznik E, Zhang L, et al. Data Descriptor: Unifying cancer and normal RNA sequencing data from different sources. Scientific Data. 2018;5:1–8. doi:10.1038/sdata.2018.61
