PLOS Computational Biology. 2020 Nov 2;16(11):e1008315. doi: 10.1371/journal.pcbi.1008315

Leveraging functional annotation to identify genes associated with complex diseases

Wei Liu 1, Mo Li 2, Wenfeng Zhang 2, Geyu Zhou 1, Xing Wu 3, Jiawei Wang 1, Qiongshi Lu 4,5,6, Hongyu Zhao 1,2,7,*
Editor: Seyoung Kim 8
PMCID: PMC7660930  PMID: 33137096

Abstract

To increase statistical power to identify genes associated with complex traits, a number of transcriptome-wide association study (TWAS) methods have been proposed using gene expression as a mediating trait linking genetic variations and diseases. These methods first predict expression levels based on inferred expression quantitative trait loci (eQTLs) and then identify expression-mediated genetic effects on diseases by associating phenotypes with predicted expression levels. The success of these methods critically depends on the identification of eQTLs, which may not be functional in the corresponding tissue, due to linkage disequilibrium (LD) and the correlation of gene expression between tissues. Here, we introduce a new method called T-GEN (Transcriptome-mediated identification of disease-associated Genes with Epigenetic aNnotation) to identify disease-associated genes leveraging epigenetic information. Through prioritizing SNPs with tissue-specific epigenetic annotation, T-GEN can better identify SNPs that are both statistically predictive and biologically functional. We found that a significantly higher percentage (an increase of 18.7% to 47.2%) of eQTLs identified by T-GEN are inferred to be functional by ChromHMM and more are deleterious based on their Combined Annotation Dependent Depletion (CADD) scores. Applying T-GEN to 207 complex traits, we were able to identify more trait-associated genes (ranging from 7.7% to 102%) than those from existing methods. Among the identified genes associated with these traits, T-GEN can better identify genes with high (>0.99) pLI scores compared to other methods. When T-GEN was applied to late-onset Alzheimer’s disease, we identified 96 genes located at 15 loci, including two novel loci not implicated in previous GWAS. We further replicated 50 genes in an independent GWAS, including one of the two novel loci.

Author summary

TWAS-like methods have been widely applied to understand disease etiology using eQTL data and GWAS results. However, it is still challenging to discriminate the true disease-associated genes from those in strong LD with true genes, which is largely due to the misidentification of eQTLs. Here we introduce a novel statistical method named T-GEN to identify disease-associated genes considering epigenetic information. Compared to current TWAS methods, T-GEN not only identified eQTLs with higher CADD scores and functional potential in gene-expression imputation models, but also identified more disease-associated genes across 207 traits and more genes with high (>0.99) pLI scores. Applying T-GEN to late-onset Alzheimer’s disease identified 96 genes at 15 loci, including two novel loci. Among the 96 identified genes, 50 were further replicated in an independent GWAS.


This is a PLOS Computational Biology Methods paper.

Introduction

Genome-Wide Association Studies (GWAS) have been very successful in identifying single nucleotide polymorphisms (SNPs) associated with human diseases [1]. However, most identified SNPs are located in non-coding regions, making it challenging to understand the roles of these SNPs in disease etiology. Several approaches have been developed recently to link genes with identified SNPs and provide insights for downstream analysis [2–6]. PrediXcan [7] and similar methods [8–12] have been developed for utilizing transcriptomic data, such as those from GTEx [13], to interpret identified GWAS non-coding signals and to identify additional disease-associated genes. These methods first impute (i.e. predict) gene expression levels from SNP genotypes and then identify disease-associated genes by associating phenotypes with predicted expression levels. At the SNP level, SNPs used in the gene expression imputation models are selected through statistical correlation between these SNPs’ genotypes and gene expression levels. Since SNPs in the same LD block are correlated, it is hard to differentiate regulatory SNPs from others statistically, which may lead to incorrect identifications of genes with regulatory SNPs that are in strong LD with true trait/disease genes.

To more accurately identify regulatory eQTLs that play functional roles, we assume that SNPs with active epigenetic annotations are more likely to regulate tissue-specific gene expression [14–18]. As for available tissue-specific epigenetic data, we consider epigenetic marks that are known hallmarks for DNA regions with important functions, such as H3K4me1 signals that are often associated with enhancers [19]. We note that these epigenetic marks have been used to infer regulatory regions and prioritize eQTLs in some published studies [20–23]. Reported GWAS hits are enriched in regions with active epigenetic signals: the Encyclopedia of DNA Elements (ENCODE) project found that 34% of disease-associated SNPs overlap DNase I hypersensitive sites (encompassing 3.9% of the whole genome sequence) [24], and these epigenetic signals can help fine-map true GWAS hits with functional impacts [25]. Based on these previous findings, we use epigenetic signals to select SNPs among all candidate cis-SNPs when modeling the relationship between gene expression levels and SNP genotypes. We have developed a new method, called T-GEN (Transcriptome-mediated identification of disease-associated Genes with Epigenetic aNnotation), that leverages tissue-specific epigenetic information to identify disease-associated genes. By prioritizing the SNPs likely to have regulatory roles (through the use of epigenetic marks) when building gene expression imputation models, T-GEN identified SNPs with higher deleterious effects and higher functional potential. Further application of T-GEN identified more trait-associated genes in 207 traits, compared to other gene expression imputation models. More specifically, in late-onset Alzheimer’s disease (AD), T-GEN identified the largest number of genes with novel loci, indicating the importance of cholesterol transportation, neuron activity and mitochondrial dysfunction in late-onset AD.

Results

Method overview

Similar to previous methods [8,26,27], we use two linear models to study gene-level genetic effects on traits mediated by gene expression regulation. Firstly, individual-level genotype and gene expression data are used to build gene expression imputation models for each tissue. One novel feature of our method is the integration of epigenetic data from the Roadmap Epigenomics Project [28] to prioritize regulatory SNPs in gene expression imputation. As a result, SNPs located in regions with active epigenetic marks are more likely to be selected, consistent with the enrichment of epigenetic marks (e.g. H3K4me3 and DNase-I hypersensitivity) in regulatory DNA regions for gene expression [29]. After obtaining tissue-specific gene expression imputation models, we combine them with GWAS summary statistics to identify gene-level associations with disease phenotypes. A schematic workflow of T-GEN is shown in Fig 1.

Fig 1. The general scheme of our method.

Gene expression imputation models were built from the expression vector Y of a specific gene in a tissue, the genotype matrix X, and the epigenetic annotation matrix A of the gene’s cis-SNPs. The annotation matrix was used to select SNPs with regulatory effects on gene expression, since we assumed that only a subset of cis-SNPs affect expression of the nearby gene. After obtaining the SNP coefficient vector β for the gene in each tissue, we combined the β’s with GWAS summary statistics to obtain gene-level association statistics for each disease in each tissue.

By utilizing a spike-and-slab prior, SNPs regulating gene expression levels were selected considering their epigenetic signals through the following model:

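$$
\begin{aligned}
Y &= X\beta + \epsilon,\\
\beta_k &\sim \pi_k\,\mathcal{N}(0, \sigma_\beta^2\sigma^2 I) + (1-\pi_k)\,\delta_0,\\
\operatorname{logit}(\pi_k) &= A_k\omega,
\end{aligned}
$$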

where Y is the vector of gene expression values in a given tissue for a gene, X is the genotype matrix for the candidate cis-SNPs for the gene, β is the tissue-specific effect vector of genotypes on expression level, and ϵ denotes the random noise. For a cis-SNP k, its effect $\beta_k$ follows a mixture prior of a normal distribution and a point mass at 0. The probability $\pi_k$ of the SNP being an eQTL of its nearby gene is linked to its epigenetic annotation vector $A_k$ (the row of the annotation matrix A for SNP k) via a logit model, with ω being the annotation coefficient vector. We use a variational Bayesian method [30] to estimate the coefficient vector β. More details are shown in the Methods section.
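To make the prior concrete, the following is a minimal simulation sketch of the generative model above; it is not the T-GEN implementation, which fits the model with variational Bayes. The function name `simulate_expression`, the toy dimensions, and all parameter values are assumptions chosen purely for illustration.

```python
import numpy as np

def simulate_expression(X, A, omega, sigma2=1.0, sigma_beta2=0.5, seed=0):
    """Draw (Y, beta, gamma, pi) from the spike-and-slab model with a logit link.

    Hypothetical helper for illustration only; parameter values are arbitrary.
    """
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    # Prior inclusion probability of each SNP: logit(pi_k) = A_k @ omega
    pi = 1.0 / (1.0 + np.exp(-(A @ omega)))
    # gamma_k = True means SNP k is drawn from the slab (it is an eQTL)
    gamma = rng.random(p) < pi
    # Slab: N(0, sigma_beta2 * sigma2); spike: point mass at zero
    slab = rng.normal(0.0, np.sqrt(sigma_beta2 * sigma2), p)
    beta = np.where(gamma, slab, 0.0)
    # Expression = genetic component + Gaussian noise
    Y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), X.shape[0])
    return Y, beta, gamma, pi

# Toy data: 100 individuals, 50 cis-SNPs, 3 binary epigenetic annotations
rng = np.random.default_rng(1)
n, p, m = 100, 50, 3
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X -= X.mean(axis=0)                      # centered genotypes
A = rng.binomial(1, 0.2, size=(p, m)).astype(float)
omega = np.array([1.0, 1.0, 1.0])        # positive weights favor annotated SNPs
Y, beta, gamma, pi = simulate_expression(X, A, omega)
```

With positive annotation weights, annotated SNPs receive a larger prior inclusion probability, which is the mechanism by which the model favors SNPs in epigenetically active regions.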

For comparison, we also consider four other gene expression imputation methods: elastic net (elnt) [7,31], used in both PrediXcan and FUSION; linear models with spike-and-slab priors solved by variational Bayes (vb) [32,33]; and both of these approaches applied only to SNPs having active epigenetic signals (elnt.annot and vb.annot). The annotation configuration and incompleteness may also affect the results, which we discuss in S1 Text.

In general, we describe the relationship between the imputed gene expression level and selected SNPs in the form of $\hat{Y} = X\hat{\beta}$, where $\hat{\beta}$ denotes the SNP effect estimates in the imputation models. Different imputation methods lead to different sets of SNPs and effect size estimates. We then use a univariate regression model to test the association between traits and imputed gene expression levels, which was also used for gene-trait testing in many TWAS methods [8,26,27]:

$$T = \mu + \hat{Y}\kappa_Y + \tau.$$

The z-score of the gene coefficient $\kappa_Y$ is denoted as $z = \hat{\kappa}_Y / \mathrm{se}(\hat{\kappa}_Y) = \hat{\beta}^T \Lambda \hat{Z}_X$, where Λ is a diagonal matrix with diagonal elements being the ratio of the standard deviation of SNP genotypes over the standard deviation of the imputed gene expression levels, while $\hat{Z}_X$ is the z-score vector of SNP effects on traits in GWAS. We test the association between each tissue-gene pair and trait. Significant disease-associated tissue-gene pairs are then identified after adjusting for multiple testing. The implemented method and pre-trained models are available on https://github.com/vivid-/T-GEN.
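The gene-level statistic above can be sketched in a few lines. This is an illustrative sketch rather than the released T-GEN code: the function name `gene_zscore` is hypothetical, and estimating the standard deviation of imputed expression from a reference LD (correlation) matrix is an assumption, since the text does not say how that quantity is obtained.

```python
import numpy as np

def gene_zscore(beta_hat, gwas_z, snp_sd, ld):
    """Compute z = beta_hat^T Lambda Z_X with Lambda = diag(sd(SNP_k) / sd(imputed expression)).

    Assumption: sd of imputed expression is derived from a reference LD matrix `ld`.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    gwas_z = np.asarray(gwas_z, dtype=float)
    snp_sd = np.asarray(snp_sd, dtype=float)
    ld = np.asarray(ld, dtype=float)
    # Genotype covariance rebuilt from correlations and per-SNP standard deviations
    cov = ld * np.outer(snp_sd, snp_sd)
    # Standard deviation of the imputed expression X @ beta_hat
    expr_sd = np.sqrt(beta_hat @ cov @ beta_hat)
    lam = snp_sd / expr_sd               # diagonal of Lambda
    return float(np.sum(beta_hat * lam * gwas_z))

# Single-SNP model: the gene z-score equals the SNP z-score up to the sign of the weight
z_single = gene_zscore([0.5], [3.0], [1.0], [[1.0]])
# Two uncorrelated SNPs with equal weights and equal GWAS z-scores
z_pair = gene_zscore([1.0, 1.0], [2.0, 2.0], [1.0, 1.0], np.eye(2))
```

Only summary-level inputs are needed here (imputation weights, GWAS z-scores, and SNP-level variance information), which is why the test can be run without individual-level GWAS data.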

T-GEN prioritizes biologically functional SNPs in gene expression imputation

To study whether our method can better prioritize functional SNPs in gene expression imputation models, we evaluated the functional states of the identified eQTLs using the ChromHMM-annotated SNP status [34]. T-GEN identified a higher percentage of eQTLs (47%) in one of the 11 active states (out of 15 states) (S1 Table), an 87% relative increase over elnt models (25%) and a 77% relative increase over vb models (27%) (Fig 2A). Overall, these results demonstrate that our method better identified SNPs with functional potential to regulate gene expression while not increasing the total number of SNPs selected (S1 Fig).

Fig 2. More SNPs with functional potential were identified and more imputation models were built by T-GEN.

A) compares the percentages of SNPs in gene expression imputation models having active ChromHMM15 annotated states across three methods (elnt, vb and vb.logit) using gene expression and genotype data from GTEx in 26 tissues. The “elnt” model was built via elastic net, the “vb” model was built via a variational Bayesian method, and the “vb.logit” model is T-GEN, built using our method (a variational Bayesian method with a logit link). Across all 26 tissues, imputation models built by our method have a higher percentage of SNPs with active ChromHMM annotated states (indicated by blue bars). The x axis denotes the mean percentage of SNPs in imputation models having ChromHMM15 annotated states in each tissue for each model. The dotted lines are the mean values of R2 across 26 tissues for each method. B) shows the ratio of the increase in mean CADD score in T-GEN compared to elnt and vb models in 26 tissues. C) shows the ratio of the number of gene models (FDR < 0.05) in each method to that in elastic net models. D) shows the difference in the number of gene models between each method and elastic net. In C and D, the color of each tissue indicates its sample size, from upper to lower: [401, 501), [301, 401), [201, 301), [101, 201).

We note that T-GEN utilizes the epigenetic information in SNP selection, and the same information is also used in ChromHMM models. Therefore, we expect to select more SNPs annotated as functionally active in ChromHMM models. To further evaluate the functional potential of the SNPs selected by T-GEN, we considered the CADD scores of the identified SNPs across all five methods (S2 Table). T-GEN-identified eQTLs have higher CADD scores (3.36 on average, representing a 0.9% increase compared to elnt and vb, p<2e-16, Wilcoxon rank sum test) and a higher percentage (0.34%, Fig 2B) of functionally deleterious SNPs (CADD score larger than 20, representing a 2.3% and 2.9% increase over elnt and vb, respectively). The net CADD score improvement is not substantial, which may be partially explained by the purifying selection acting on cis-eQTLs [35,36] and their consequent low deleteriousness. These results indicate that eQTLs identified by T-GEN have statistically significantly higher functional potential than those identified by other models, except for the models with direct SNP filtering (elnt.annot and vb.annot), which is expected.

More genes are effectively imputed by T-GEN

The number of genes that are assessed in transcriptome-wide association analysis is affected by the quality of expression imputation. For all five imputation methods, gene expression imputation models were filtered based on the significance of imputation models with an FDR cutoff of 0.05 (Methods). Therefore, significant trait-gene associations can only be detected for genes with significant imputation models.
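The filtering step can be illustrated with a generic false discovery rate procedure. The sketch below applies the Benjamini-Hochberg step-up rule to per-gene model p-values; treating BH as the FDR procedure and the function name `bh_keep` are assumptions, since this section does not specify how the FDR was computed.

```python
import numpy as np

def bh_keep(pvals, fdr=0.05):
    """Boolean mask of models retained by the Benjamini-Hochberg step-up rule (assumed procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest rank i such that p_(i) <= (i / m) * fdr
    passed = ranked <= (np.arange(1, m + 1) / m) * fdr
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])
        # Every model ranked at or below the cutoff is retained
        keep[order[: cutoff + 1]] = True
    return keep

# Toy p-values for four hypothetical gene imputation models
mask = bh_keep([0.001, 0.5, 0.02, 0.03])
```

Genes whose models fail this filter are never tested against the trait, so the filter directly bounds how many gene-trait associations a method can report.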

With the same FDR cutoff, T-GEN increased the number of gene expression imputation models by 2.8% to 55.3% (3,807 in whole blood) compared with the elastic net method in 25 of the 26 tissues (Fig 2C and 2D). There was a smaller increase or even a slight decrease in some tissues with smaller sample sizes, such as brain cortex (2.8%, 269, n = 136) and brain frontal cortex BA9 (-0.55%, -50, n = 118). It is worth noting that, by direct SNP filtering using epigenetic annotations, both “elnt.annot” and “vb.annot” had many fewer genes with high-quality imputation models, implying that stringent SNP categorization might lead to a loss of power in detecting genes with considerable expression components explained by cis-eQTLs. Compared to elastic net methods (elnt and elnt.annot), variational Bayes methods (both vb and T-GEN) better imputed genes, which is partly attributable to their improved variable selection performance [37].

More genes and higher percentage of functionally conserved genes identified in 207 traits

To evaluate the performance in identifying trait-gene associations, which is the ultimate goal of gene expression imputation, we applied T-GEN to GWAS summary statistics of 207 traits (S3 Table) from LD Hub. After Bonferroni adjustment, significant trait-associated genes were identified in each tissue. T-GEN showed a 25% (9.6 genes, compared with elnt.annot) to 175% (30.4 genes, compared with vb.annot) increase in the average number of significant trait-associated genes across 207 traits (Fig 3). After aggregating identified genes into pre-defined cytogenetic bands, we observed a similar pattern in the numbers of identified trait-associated loci across 207 traits. T-GEN identified the largest number of associated loci (14.3 loci on average) across all five methods (S2 Fig), showing an 11% (1 locus, compared with elnt) to 86% increase (5 loci, compared with vb.annot).

Fig 3. More genes were identified as trait-associated by T-GEN across 207 traits from the LD Hub.

Applied to 207 traits from LD Hub, significant trait-associated genes were identified in 26 tissues (p-value threshold: 0.05 divided by the number of gene-tissue pairs). Each boxplot represents the distribution, across traits, of the difference between the number of genes identified by our tissue-specific analysis and the number identified by each of the four other methods.

To investigate the functional potential of trait-associated genes identified by each method, we further compared the enrichment pattern of associated genes having pLI scores larger than 0.99 for each method. Among all identified genes, some are identified as trait-associated in multiple traits. Grouping identified genes into three categories based on the number of associated traits, we found that T-GEN identified the largest percentage of trait-associated genes with high pLI scores (>0.99) in the gene group with fewer than 5 associated traits (Fig 4A). This may indicate that T-GEN is more likely to identify conserved genes specific to 1–2 diseases compared to other methods (p = 0.009 compared to elnt, p = 0.002 compared to vb). Across all categories, T-GEN also showed the strongest enrichment signal (fold change: 1.15, binomial test p value: 0.038) compared to the other four methods, which did not show a significant enrichment pattern (S4 Table). Although the pLI score is an indirect measure of gene importance in human traits and is more associated with fitness, it does provide an important way to prioritize genes whose heterozygous mutations are phenotypically harmful [38,39].

Fig 4. Functional constraint and tissue-specificity of identified trait-associated genes.

A) A higher percentage of significant genes identified by T-GEN have pLI scores larger than 0.99 among genes identified in fewer than 5 traits. Considering the number of traits that each identified gene is associated with, all significant trait-associated genes were grouped into three categories. The bar plot shows the percentage of genes identified by each method having high pLI scores (>0.99) in each category. Error bars indicate the standard error calculated using bootstrapping (120 traits each time, for 20 times). B) More genes were identified by T-GEN as trait-associated in tissues most enriched for genetic signals. In tissues with the highest heritability enrichment and also in other tissues, the numbers of identified trait-associated genes were compared across all five methods. Each bar shows the mean number of identified trait-associated genes across 207 traits in the LD Hub.

To assess the ability to identify associated genes in the tissue most relevant to a trait, defined as the tissue with the highest heritability enrichment estimated by LDSC [40], we compared the numbers of identified genes in the most relevant tissues (S5 Table). Using LDSC and annotation from GenoSkyline-Plus [41], we identified the tissue with the highest heritability enrichment for each trait. We compared the number of significantly associated genes identified in heritability-enriched tissues across five methods (Fig 4B). In the most-enriched tissues, T-GEN identified the largest numbers of significantly associated genes. When comparing the ratios of the number of genes identified in the most-enriched tissue and those in the other tissues across 151 lipid-associated traits (S3 Fig), T-GEN showed a significant increase compared to the vb (Wilcoxon rank sum test, p = 0.048) and vb.annot models (p = 3.5e-3). These results suggest that T-GEN can better identify disease genes in trait-relevant tissues.

Overall, T-GEN showed improvement not only in the total number of trait-gene associations, but also in identifying genes with functional importance potential and tissue-specific associated genes in tissues most relevant to a trait.

T-GEN identifies novel genes for late-onset AD

To further investigate the performance of our method in identifying trait-associated genes in detail, we analyzed the biological functions of genes associated with AD (N = 74,046) that were only identified by our method. Considering the total number of tissue-gene pairs (258,039), 96 significantly associated genes at 15 loci (S6 Table, Fig 5) were identified by T-GEN, with five loci identified in the brain tissues (caudate basal ganglia, anterior cingulate cortex BA24, hippocampus, cortex, and frontal cortex BA9) and four loci identified in whole blood. Thirteen out of the 15 loci have been implicated in AD GWAS [42]. In comparison (S4 Fig), 79 genes at 10 loci were identified by elastic net models, 81 genes at eight loci by elnt.annot models, 61 genes at three loci by vb.annot models, and 81 genes at 11 loci by the vb method. Not only is the number of associated genes identified by T-GEN the largest, but the heritability enrichment in signal-contributing eQTLs for genes identified by T-GEN is also the highest (S7 Table).
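As a worked check of the multiple-testing correction, the short sketch below computes the Bonferroni threshold implied by 258,039 tissue-gene pairs, assuming the same rule used for the LD Hub traits (0.05 divided by the number of gene-tissue pairs). It then compares the threshold with the gene-level p-values quoted elsewhere in this section. The variable names are illustrative.

```python
# Bonferroni threshold implied by the number of tissue-gene pairs tested
n_pairs = 258_039
alpha = 0.05
threshold = alpha / n_pairs   # roughly 1.94e-7

# Gene-level p-values quoted in this section for genes reported as significant
reported = {
    "LGMN": 9.45e-8,
    "COG4": 1.35e-7,
    "CD2AP": 5.70e-8,
    "RP11-385F7.1": 1.27e-7,
    "ADRA1A": 1.85e-8,
    "OSBP": 1.56e-7,
    "TMEM135": 1.80e-8,
}
significant = {gene: p < threshold for gene, p in reported.items()}
```

All of the quoted p-values fall below this threshold, which is consistent with the genes being reported as significant after adjustment.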

Fig 5. Gene-level Manhattan plot of T-GEN results in IGAP data.

The plot shows the gene-level association with LOAD attained from T-GEN. Several significant genes are indicated in the figure.

Compared with AD-associated genes identified by the other four methods, three loci were only identified by our method. One locus is located on 14q32.12 and includes LGMN (p = 9.45e-8). LGMN is located about 200 kb from the previously identified GWAS-significant SNP rs1049863 [42]. SLC24A4 and RIN3 are two potential genes contributing to the GWAS signal at this locus in previous GWAS, whose functions are not yet understood. LGMN encodes the protein AEP, which cleaves inhibitor 2 of PP2A and may trigger tau pathology in the AD brain [43], making LGMN a potential signal gene at this locus. Another locus is 16q22.3, where COG4 was identified (p = 1.35e-7). Two of the identified eQTLs (S8 Table) of COG4 are potential GWAS hits (rs7196032: p = 1.1e-4 and rs7192890: p = 3.5e-3) (S5 Fig). COG4 encodes a protein involved in the structure and function of the Golgi apparatus. Recent research has shown that defects in the Golgi complex are associated with AD and Parkinson’s disease by affecting the functions of Rab-GTPase and SNAREs [44]. The third locus is 6p12.3, where CD2AP (p = 5.70e-8) and RP11-385F7.1 (p = 1.27e-7) were identified. This locus was previously identified as a susceptibility locus in AD [42], with CD2AP reported as the locus signal gene affecting amyloid precursor protein (APP) metabolism and production of amyloid-beta [45]. The additional gene identified by our method, RP11-385F7.1, is a long non-coding RNA (lncRNA) located near the promoter region of CD2AP. The level of RP11-385F7.1 lncRNA has a strong Pearson correlation with the gene expression level of COQ4 located on 16q22.3 [46], which was also a gene only identified by our method and encodes a protein that may serve as an antioxidant strategy target for AD [47].

Although other loci identified by our method have been implicated in GWAS, including the CLU locus (8p21.2), BIN1 locus (2q14.3), CR1 locus (1q32.2), MS4A6A locus (11q12.1), PTK2B locus (8p21.1, 8p21.2), and CELF1 locus (11p11.2), we identified additional, potentially functionally impactful genes at these loci. At the PTK2B locus, we identified an additional gene, ADRA1A (p = 1.85e-8), which is involved in neuroactive ligand-receptor interaction and calcium signaling and has been implicated as a potential gene in late-onset AD via gene-gene interaction analysis [48]. At the MS4A6A locus, we identified an additional gene, OSBP (p = 1.56e-7), which transports sterols to the nucleus, where they down-regulate genes for the LDL receptor, which is important in AD etiology [49].

Two out of 15 loci identified by our method are novel compared with published GWAS loci [42,50–52]. One locus is located on 16q22.3, where COG4 was identified (p = 1.35e-7); this locus was also identified only by our method and is discussed above. The associated gene identified at the other novel locus, TMEM135 (p = 1.80e-8), is located 1 Mb (Fig 6) downstream of the identified GWAS locus (the PICALM locus), and two of the eQTLs of TMEM135 identified by our method are located at the GWAS locus (rs536841: p = 5.27e-14 and rs541458: p = 5.67e-12). TMEM135 is a target gene for liver X receptors (LXR), which are involved in removing excessive cholesterol from peripheral tissues [53]. Increased cholesterol levels in brain tissue are associated with accumulated AβPP [51], suggesting the potential role of LXR and TMEM135 in the etiology of AD.

Fig 6. Regional Manhattan plot around TMEM135.

The listed SNP (rs541458) is one of the eQTLs identified by T-GEN in the imputation model of TMEM135. Among all eQTLs of TMEM135 identified by T-GEN, this SNP also has the strongest GWAS signal in the published AD GWAS.

To validate the T-GEN results, we tried to replicate our findings in an independent GWAS for AD using inferred phenotypes based on family history (GWAX, N = 114,564) [54]. Among the 96 genes identified by our method using the GWAS of clinical AD, 50 (52%) were successfully replicated in this independent GWAS (p < 5.2e-4), including TMEM135 at one of the novel loci. For the other four methods, only 9.9% (8/81, vb) to 41% (33/81, elnt.annot) of genes were replicated in the GWAX data. Pathway enrichment analysis via Enrichr [55] identified an apoptosis-related network (p = 2e-3), the statin pathway (p = 9e-3), and ApoE-related inflammation and atherosclerosis (p = 0.04). The statin pathway was again implicated (p = 0.067) in the GWAX data, although without statistical significance. The high replication rate of the results in an external study further confirmed the power of T-GEN.
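The replication figures can be reproduced with simple arithmetic. The sketch below recomputes the quoted rates from the counts in the text. It also shows that the replication cutoff is numerically close to 0.05 divided by the 96 tested genes; reading the cutoff as a Bonferroni correction over those genes is an inference, not something the text states.

```python
# (replicated, identified) counts quoted in the text
counts = {"T-GEN": (50, 96), "vb": (8, 81), "elnt.annot": (33, 81)}
rates = {method: k / n for method, (k, n) in counts.items()}

# Assumed interpretation: Bonferroni correction over the 96 T-GEN genes
replication_threshold = 0.05 / 96   # about 5.2e-4, matching the quoted cutoff

for method, rate in rates.items():
    print(f"{method}: {100 * rate:.1f}% replicated")
```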

To further evaluate the impact of sample size on training gene expression imputation models, we built imputation models in AD heritability-enriched tissues estimated by LDSC (liver and whole blood) using the GTEx v8 data and further identified AD-associated genes in these two tissues (S9 Table). Compared with using the GTEx v6 data, elastic net models by PrediXcan (16 in v8 vs. 15 in v6) and T-GEN models (22 in v8 vs. 28 in v6) did not show a substantial increase in the number of genes identified. However, with the increased sample size in GTEx v8, we observed higher replication rates for the identified disease-associated genes using the GWAX data. For elastic net models, the replication rate increased from 17.6% to 66.7%, while for T-GEN models, the rate increased from 20.7% to 75% in these two tissues.

We also compared the performance of T-GEN models trained on GTEx v8 with a recent method named mash [56]. Compared with the most recent models built on GTEx v8 for PrediXcan (trained using elastic net) and mash, T-GEN identified the largest number of genes (22 vs. 18 by mash and 16 by PrediXcan) and maintained a high replication rate in the GWAX data (75% vs. 76.2% for mash and 66.7% for PrediXcan).

Discussion

In this paper, we have introduced a new method called T-GEN, which leverages epigenetic signals to improve gene expression imputation and identify trait-associated genes. Unlike previous methods, T-GEN uses data from GTEx and the Roadmap Epigenomics Project to prioritize SNPs with active epigenetic annotations for gene expression imputation. We found that T-GEN models were more likely to include SNPs with functional potential and that more genes were effectively imputed compared to other methods. Applied to more than 200 traits, T-GEN identified more genes/loci associated with these traits. In particular, T-GEN is more likely to identify genes of potential functional importance (high pLI scores).

When applied to AD GWAS, T-GEN identified the largest number of associated genes (96 genes at 15 loci) compared with four other methods. We found novel association signals at previously identified loci, including LGMN at the SLC24A4/RIN3 locus, ADRA1A at the PTK2B locus and OSBP at the MS4A6A locus. In addition, two novel loci were identified on chromosomes 11 and 16. Genes identified by our method suggest putative roles of several biological processes in Alzheimer’s disease, including mitochondrial dysfunction (indicated by UQCR11, MTCH2 and TMEM135), cholesterol transportation (indicated by TMEM135 and OSBP), and neuron activity (indicated by ADRA1A). Fifty out of 96 identified associated genes were replicated in an independent GWAS dataset. Most importantly, the genes identified by our method for Alzheimer’s disease showed higher replication rates than those identified by other methods.

Overall, T-GEN showed improved performance in prioritizing functional SNPs and identifying disease-associated genes. Limited by the lack of individual-level epigenetic data and small sample sizes, T-GEN still has room for improvement in several respects. For example, its low accuracy in predicting gene expression levels might be improved by combining a non-parametric Bayesian model [57] with epigenetic annotation. Nevertheless, the relationship between imputation accuracy and the power of TWAS-like methods, and how GWAS signals and eQTL signals contribute to final results, are rather complex and worth further investigation (S1 Text). The eQTLs used for expression imputation in these methods also need further experimental validation. Utilizing epigenetic annotations in the reference panel from Roadmap also limits the potential of leveraging epigenetic information in the gene expression imputation process, since the epigenetic signals may vary across individuals just like gene expression levels. Another limitation of our method is that we assumed SNPs with known active epigenetic signals are more likely to regulate gene expression, while the regulatory effects of epigenetics are rather complex. For example, H3K9ac might be present in both actively transcribed regions and bivalent regions [58]. It has also been shown that cytosine-phosphate diester-guanine (CpG) islands associated with gene expression may have intermediate instead of low DNA methylation levels [59]. Although trait heritability and GWAS risk SNPs are both enriched in genomic regions with active epigenetic signals in related cell types or tissues [60–64], most of these studies are based on in-silico results. Using more data on epigenetic regulation collected from biological experiments instead of in-silico predictions may further improve the power of our approach.
In addition, only cis-SNPs are used here for gene expression imputation, while a larger proportion of gene expression may be explained by trans-eQTLs [65–67]. Integrating trans-eQTLs into gene expression imputation would help to identify more trait-associated genes and also co-regulatory peripheral genes and core genes [68]. Methods like T-GEN, which use gene expression as a mediating trait to study genetic effects and to identify disease genes, can only provide evidence for associations between genes and traits/diseases. The causal relations between identified genes and traits should be further validated by functional analysis in future research. Also, results from TWAS studies are hard to validate in silico, and a benchmarking dataset would help in the comparison of different TWAS-like methods. Considering multiple TWAS methods when identifying disease genes may aid in controlling false discoveries.

Using individual-level data with larger sample sizes would further improve the power and replication rates of identifying disease-associated genes under our modeling framework, even though the improvement in imputing gene expression is not significant (S1 Text). Apart from functional analysis, gene-level fine mapping or Mendelian randomization (MR) would also help in discriminating causal relations from associations between genes and traits.

Methods

Bayesian variable selection model

To model the genetic effects of SNPs on the expression level of a gene in a single tissue, we used the following bi-level variable selection model to select SNPs:

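$$
\begin{aligned}
Y &= X\beta + \epsilon,\\
\epsilon &\sim N(0, \sigma^2 I),\\
\beta_k \mid \gamma_k = 1 &\sim N(0, \sigma_\beta^2 \sigma^2 I),\\
\beta_k \mid \gamma_k = 0 &\sim \delta_0,\\
\gamma_k &\sim \mathrm{Bernoulli}(\pi_k),
\end{aligned}
$$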

where $Y$ is the $n \times 1$ vector of gene expression levels from the GTEx [69] database (v6p), $X$ denotes the $n \times p$ centered genotype matrix of cis-SNPs for this gene, $n$ is the sample size in this tissue, and $p$ is the number of SNPs. In the analysis, $Y$ denotes the gene expression in a specific tissue, while $X$ denotes the genotype matrix for SNPs located within 1 Mb upstream or downstream of the gene. We use the $p \times 1$ vector $\beta$ to represent the effects of these cis-SNPs on the gene expression level, while $\epsilon$ is the $n \times 1$ random error vector, which is assumed to follow a normal distribution. In a typical data set with both genotype data and RNA-seq data for the same group of individuals, the sample size $n$ is usually much smaller than the number of cis-SNPs for a gene ($p$), so some constraints are needed to allow accurate SNP selection and effect size estimation. Therefore, to select effective SNPs from thousands of candidate cis-SNPs, we assume that the effect size vector comes from a mixture of a normal distribution and a point mass at zero. $\gamma_k$ is the indicator variable denoting whether the $k$th SNP has an effect on the gene expression level. When SNP $k$ has an effect on the gene expression level (i.e. $\gamma_k = 1$), we assume that its effect size follows a normal distribution with mean 0 and variance $\sigma_\beta^2 \sigma^2$.

To more accurately identify eQTLs with potential biological functions, we integrate epigenetic annotation to prioritize SNPs with active epigenetic signals, including H3K4me1, H3K4me3, and H3K9ac. We only consider epigenetic signals for SNPs that have significant epigenetic markers (p < 1e-2); for all other SNPs, we set the corresponding annotation to 0. We then use a logit link to associate the epigenetic annotation with the probability of a SNP being an eQTL:

$$
\begin{aligned}
\operatorname{logit}(\pi_k) &= A_k \omega,\\
\omega &\sim N(0, \eta^{-1} I),
\end{aligned}
$$

where $A_k$ is the $1 \times m$ epigenetic signal vector for SNP $k$ and $\omega$ is the $m \times 1$ vector of epigenetic signal effects, with $m$ the number of epigenetic signals considered, such as DNA methylation and histone marker status. We assume that the variance of the coefficients is $\eta^{-1}$.
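To make the two layers of the model concrete, the following is a minimal simulation sketch rather than the T-GEN implementation. It draws data from the spike-and-slab prior with the logit annotation link described above. The function name `simulate_expression`, the toy dimensions, and the intercept column added to the annotation matrix (so that unannotated SNPs get a small baseline probability) are assumptions of this sketch and are not specified in the text.

```python
import numpy as np

def simulate_expression(X, A, omega, sigma2, sigma_beta2, rng):
    """Draw (Y, beta, gamma, pi) from the annotation-informed spike-and-slab model."""
    # logit(pi_k) = A_k omega  ->  pi_k = 1 / (1 + exp(-A_k omega))
    pi = 1.0 / (1.0 + np.exp(-(A @ omega)))
    # gamma_k ~ Bernoulli(pi_k): which SNPs have a non-zero effect
    gamma = rng.random(X.shape[1]) < pi
    # beta_k | gamma_k = 1 ~ N(0, sigma_beta^2 * sigma^2); point mass at 0 otherwise
    slab_sd = np.sqrt(sigma_beta2 * sigma2)
    beta = np.where(gamma, rng.normal(0.0, slab_sd, X.shape[1]), 0.0)
    # Y = X beta + eps, eps ~ N(0, sigma^2 I)
    Y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), X.shape[0])
    return Y, beta, gamma, pi

# Toy example (all sizes and values are hypothetical).
rng = np.random.default_rng(0)
n, p, m = 100, 500, 3
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X -= X.mean(axis=0)                                   # centered genotypes
A = np.column_stack([np.ones(p), rng.random((p, m))])  # intercept column is an assumption
omega = np.array([-4.0, 1.0, 1.0, 1.0])
Y, beta, gamma, pi = simulate_expression(X, A, omega, sigma2=1.0,
                                         sigma_beta2=0.5, rng=rng)
```

With these toy values, SNPs with larger annotation scores have a higher prior probability of being selected, which is the prioritization the logit link is meant to encode.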

Prior assumptions on hyper parameters

We make the following assumptions on the hyper parameters in the model above:

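$$
\begin{aligned}
\sigma^2 &\sim \mathrm{IG}(a, b),\\
\sigma_\beta^2 &\sim \mathrm{IG}(c, d),\\
\eta &\sim \mathrm{Gamma}(a_0, b_0),
\end{aligned}
$$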

i.e. we assume that these three parameters follow either inverse gamma or gamma distributions.
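As an illustration only, these priors can be sampled with NumPy. The sketch assumes a shape/scale parameterization for the inverse gamma and a shape/rate parameterization for the gamma; the text does not state which parameterization is used, and the numeric values below are hypothetical.

```python
import numpy as np

def sample_hyperparameters(a, b, c, d, a0, b0, rng):
    """Draw (sigma^2, sigma_beta^2, eta) from the assumed priors."""
    # If G ~ Gamma(shape=a, rate=b), then 1/G ~ IG(shape=a, scale=b).
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b)
    sigma_beta2 = 1.0 / rng.gamma(shape=c, scale=1.0 / d)
    # eta ~ Gamma(shape=a0, rate=b0); NumPy takes scale = 1 / rate.
    eta = rng.gamma(shape=a0, scale=1.0 / b0)
    return sigma2, sigma_beta2, eta

rng = np.random.default_rng(0)
sigma2, sigma_beta2, eta = sample_hyperparameters(2.0, 1.0, 2.0, 1.0, 1.0, 1.0, rng)
```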

Variational Bayes inference

For convenience in defining selected SNPs, we first introduce PPS(k) to denote the posterior probability of SNP k being selected:

$$
\mathrm{PPS}(k) = \sum_{\theta} P(\gamma_k = 1 \mid X, Y, \theta)\, P(\theta \mid X, Y)
$$

which is the weighted average of the posterior probability that this SNP is effective, with weights given by the model likelihood, similar to the approach used in varbvs.
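The weighted average can be sketched as follows, assuming the hyperparameters are evaluated on a finite grid of settings. The function name `posterior_prob_selected` and the toy numbers are hypothetical; weights are passed on the log scale and normalized inside the function.

```python
import numpy as np

def posterior_prob_selected(alpha_by_theta, log_weights):
    """PPS(k) as a weighted average over a grid of hyperparameter settings.

    alpha_by_theta: (G, p) array, P(gamma_k = 1 | X, Y, theta_g) for each setting g.
    log_weights:    (G,) array, unnormalized log weight of each setting.
    """
    w = np.exp(log_weights - np.max(log_weights))  # subtract max for numerical stability
    w /= w.sum()                                   # weights now sum to one
    return w @ alpha_by_theta

# Two settings, two SNPs: the first setting carries 75% of the weight.
alpha_by_theta = np.array([[1.0, 0.0],
                           [0.0, 1.0]])
pps = posterior_prob_selected(alpha_by_theta, np.log([0.75, 0.25]))
```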

To address the computational challenge of fitting our model to high-dimensional genotype data, we applied the variational Bayes method, an alternative to Markov chain Monte Carlo (MCMC) for statistical inference that requires less computation time and a lower computational burden. Variational Bayes provides an analytical approximation of the posterior distribution by selecting, from a family of distributions, the one with the minimum Kullback-Leibler divergence to the exact posterior.
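For readers less familiar with the method, the standard identity behind this approximation is shown below. Here $z$ collects all unobserved quantities and $Q$ is the approximating distribution; this notation is introduced only for this illustration and is not taken from the text.

```latex
\log P(Y \mid X)
  = \underbrace{\mathbb{E}_{Q}\!\left[\log \frac{P(Y, z \mid X)}{Q(z)}\right]}_{\text{evidence lower bound}}
  + \mathrm{KL}\!\left(Q(z)\,\middle\|\,P(z \mid X, Y)\right)
```

Because the left-hand side does not depend on $Q$, minimizing the Kullback-Leibler divergence is equivalent to maximizing the lower bound, which is the quantity that coordinate-wise updates increase.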

Under each set of chosen prior parameters $\theta = \{\sigma^2, \sigma_\beta^2, \eta\}$, variational Bayes is applied to update the other parameters and obtain the optimal model. In addition, for a set of chosen priors, we define the posterior probability of a SNP being selected as:

$$
\alpha_k = P(\gamma_k = 1 \mid X, Y, \theta).
$$

When updating one parameter, we need to get the approximation (Q*(.)) of its posterior distribution via taking the expectations of all the other parameters in the full probability equation:

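$$
P(Y \mid X, A, \theta) = \iiint P(Y \mid \beta, \epsilon, X)\, P(\sigma_\beta^2, \sigma^2, \gamma)\, P(\gamma \mid \pi)\, P(\pi \mid A, \omega)\, P(\omega \mid \eta)\, d\eta\, d\sigma^2\, d\sigma_\beta^2.
$$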

More details about fitting the model are given in S1 Text.
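The flavor of these updates can be illustrated with a simplified coordinate-ascent sketch for spike-and-slab regression with fixed hyperparameters and fixed per-SNP prior probabilities, in the style of varbvs [30]. It is not the T-GEN algorithm: the updates for the annotation effects and the hyperparameters described in S1 Text are omitted, and the function name `varbvs_sketch` and the toy data are hypothetical.

```python
import numpy as np

def varbvs_sketch(X, y, pi, sigma2, sigma_beta2, n_iter=100, tol=1e-6):
    """Coordinate-ascent variational updates for spike-and-slab regression.

    Returns (alpha, mu): approximate posterior inclusion probabilities and
    posterior means of the effects given inclusion.
    """
    n, p = X.shape
    xtx = np.einsum("ij,ij->j", X, X)           # x_k' x_k for each SNP
    xty = X.T @ y
    logit_pi = np.log(pi) - np.log1p(-pi)
    alpha = np.array(pi, dtype=float)
    mu = np.zeros(p)
    fitted = X @ (alpha * mu)                   # current posterior-mean prediction
    s2 = sigma2 / (xtx + 1.0 / sigma_beta2)     # posterior variance given inclusion
    for _ in range(n_iter):
        alpha_old = alpha.copy()
        for k in range(p):
            fitted -= X[:, k] * (alpha[k] * mu[k])    # remove SNP k's contribution
            mu[k] = s2[k] / sigma2 * (xty[k] - X[:, k] @ fitted)
            log_odds = (logit_pi[k]
                        + 0.5 * np.log(s2[k] / (sigma_beta2 * sigma2))
                        + mu[k] ** 2 / (2.0 * s2[k]))
            alpha[k] = 1.0 / (1.0 + np.exp(-log_odds))
            fitted += X[:, k] * (alpha[k] * mu[k])    # add the updated contribution back
        if np.max(np.abs(alpha - alpha_old)) < tol:
            break
    return alpha, mu

# Toy data: 20 SNPs, only SNP 3 has an effect (values are hypothetical).
rng = np.random.default_rng(1)
n, p = 300, 20
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
beta_true = np.zeros(p)
beta_true[3] = 1.5
y = X @ beta_true + rng.normal(size=n)
alpha, mu = varbvs_sketch(X, y, pi=np.full(p, 0.05), sigma2=1.0, sigma_beta2=1.0)
```

In a model with an annotation layer, the fixed vector `pi` would instead be recomputed from the annotation effects between sweeps.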

Model training and evaluation

Gene expression level imputation models for 26 tissues were trained using the RNA-seq and genotype data from the GTEx (v6p) project and epigenetic data from the Roadmap Epigenomics Project. Only tissues with both epigenomics data from the Roadmap Epigenomics Project and GTEx data were considered. For genotype data, we first removed less common SNPs (minor allele frequency < 0.01) and SNPs with ambiguous alleles. For each gene, we considered SNPs located from 1 Mb upstream of its transcription start site to 1 Mb downstream of its transcription end site. For the RNA-seq data from GTEx, we first normalized the data and then adjusted for possible confounding factors, including sequencing platform, the top three principal components, sex, and probabilistic estimation of expression residuals (PEER) factors. More specific details of preprocessing expression data from GTEx are described in our previous publication [27].
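The genotype filtering step can be sketched as follows. This sketch assumes that "ambiguous alleles" means strand-ambiguous allele pairs (A/T and C/G), which the text does not spell out, and it does not handle missing genotypes. The function name `filter_snps` and the toy matrix are hypothetical.

```python
import numpy as np

# Strand-ambiguous allele pairs (an assumed reading of "ambiguous alleles").
AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

def filter_snps(genotypes, ref, alt, maf_min=0.01):
    """Return a boolean mask of SNPs to keep.

    genotypes: (n, p) array of 0/1/2 alternate-allele counts.
    ref, alt:  length-p sequences of allele codes.
    """
    freq = genotypes.mean(axis=0) / 2.0          # alternate-allele frequency
    maf = np.minimum(freq, 1.0 - freq)           # minor allele frequency
    not_ambiguous = np.array([(r, a) not in AMBIGUOUS for r, a in zip(ref, alt)])
    return (maf >= maf_min) & not_ambiguous

# Toy example: SNP 0 is monomorphic, SNP 1 is A/T, SNP 2 passes both filters.
G = np.array([[0, 1, 1],
              [0, 0, 0],
              [0, 1, 1],
              [0, 0, 0]])
keep = filter_snps(G, ref=["A", "A", "A"], alt=["G", "T", "G"])
```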

We further used five-fold cross-validation to evaluate our gene expression imputation models. More specifically, for each tissue, samples were randomly divided into five groups of about the same size. We compared our method with four other methods by training models on four groups of data and then testing them on the fifth group. To compare performance among methods, we used the squared correlation between the observed and imputed expression levels ($R^2$). Because the number of epigenetic categories available in Roadmap varies across tissues (S10 Table), all available epigenetic annotation categories were used for each tissue. The continuous annotation values (fold enrichments compared to expected background counts for ChIP-seq or DNase signal and fractions of methylation reads for DNA methylation) were used for training models. For elastic net models (elnt), following previous studies [7], the parameter α was set to 0.5 as in PrediXcan, and the optimal λ was selected via the function cv.glmnet provided in the ‘glmnet’ package [70]. For elastic net models with direct filtering on epigenetic signals (elnt.annot), we first removed cis-SNPs without positive values for H3K4me1 [71], H3K4me3 [72] or H3K9ac [73] signals, which have been reported to be associated with gene expression regulation. After filtering, the remaining cis-SNPs were used to build imputation models. For the method directly using variational Bayes (vb), the model is similar to ours but without the logit link for epigenetic annotations. We also applied the direct variational Bayes method to SNPs with some epigenetic signals (the same filtering process as in elnt.annot). A paired Wilcoxon signed-rank test was used to compare the performance of the models across genes in each tissue. For further validation, the different models were used to predict gene expression levels in an external brain expression data set from the CommonMind Consortium (www.synapse.org/CMC).
For gene-trait association identification, only gene expression imputation models with a significant squared correlation between the observed and imputed expression levels ($R^2$) (FDR < 0.05) were considered.
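The five-fold evaluation described above can be sketched for a single gene as follows. A small ridge regression stands in for the actual imputation models purely as a placeholder, and pooling the out-of-fold predictions before computing $R^2$ is an assumption of this sketch; the text does not say whether $R^2$ was pooled or averaged over folds. All function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Placeholder model only; not one of the five methods compared in the paper.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def ridge_predict(w, X):
    return X @ w

def cross_validated_r2(X, y, fit, predict, n_folds=5, seed=0):
    """Squared correlation between observed and out-of-fold imputed expression."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)  # near-equal groups
    imputed = np.empty(len(y))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(X[train_idx], y[train_idx])       # train on four groups
        imputed[test_idx] = predict(model, X[test_idx])  # test on the fifth
    return np.corrcoef(y, imputed)[0, 1] ** 2

# Toy data with one strong cis effect (values are hypothetical).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + rng.normal(size=200)
r2 = cross_validated_r2(X, y, ridge_fit, ridge_predict)
```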

Association analysis

After estimating SNP effects on gene expression levels, we used published GWAS summary-level results to identify the gene-level genetic effects on different traits mediated by gene expression. As in the imputation model introduced above, for a gene in a single tissue, gene expression is modeled via the genotypes of cis-SNPs as $Y = X\beta + \epsilon$, while the expression-trait relation is modeled by $T = \mu + Y\kappa_Y + \tau$, where $\mu$ is the intercept, $\kappa_Y$ is the effect of gene expression on the trait, and $\tau$ is the error term. The basic idea of this test is similar to that used in UTMOST and PrediXcan.

The test statistic for the effect of gene expression Y on trait T is

$$
Z = \frac{\hat{\kappa}_Y}{\operatorname{se}(\hat{\kappa}_Y)}
$$

where $\hat{\kappa}_Y$ is the estimated effect of expression on the trait and $\operatorname{se}(\hat{\kappa}_Y)$ is the standard error of the estimated effect. For a linear model, $\hat{\kappa}_Y$ has the following form

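$$
\hat{\kappa}_Y = \frac{\operatorname{cov}(Y, T)}{\operatorname{var}(Y)} = \frac{\hat{\beta}^{T} \operatorname{cov}(X, T)}{\operatorname{var}(Y)} = \frac{\hat{\beta}^{T} \operatorname{var}(X)\, \hat{B}}{\operatorname{var}(Y)}
$$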

where $\hat{\beta}$ is the vector of estimated SNP effects on gene expression levels and $\hat{B}$ denotes the vector of GWAS SNP-level effect sizes for the effective SNPs identified in the gene expression imputation model. In addition, $\operatorname{var}(X)$ is a diagonal matrix whose elements are the genotype variances of each SNP, and $\operatorname{var}(Y)$ is the variance of the imputed expression levels for this gene in the tissue.

To calculate the test statistic (the Z score), we also need to derive the standard error of the estimated $\kappa_Y$:

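$$
\operatorname{se}(\hat{\kappa}_Y) = \sqrt{\operatorname{var}(\hat{\kappa}_Y)} = \sqrt{\frac{\operatorname{var}(\tau)}{\operatorname{var}(Y) \times n_{\mathrm{GWAS}}}}
$$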

In addition, based on the three linear models relating any two of the three variables (genotypes X, imputed expression Y, and trait phenotypes T), we can write the proportions of outcome variance explained by the predictors as:

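$$
\begin{aligned}
R^2_{T,Y} &= \frac{\hat{\kappa}_Y^2\, \hat{\sigma}_Y^2}{\hat{\sigma}_T^2},\\
R^2_{T,X} &= \frac{\hat{\kappa}_X^2\, \hat{\sigma}_X^2}{\hat{\sigma}_T^2},
\end{aligned}
$$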

where $\hat{\kappa}_X$ is the estimated SNP-level effect size reported in the GWAS results and $\hat{\sigma}_X^2$ is the estimated diagonal matrix of SNP variances. Therefore, the standard error of $\hat{\kappa}_Y$ is

$$
\operatorname{se}(\hat{\kappa}_X) \sqrt{\frac{\hat{\sigma}_X^2\, (1 - R^2_{T,Y})}{(1 - R^2_{T,X})\, \hat{\sigma}_Y^2}}.
$$
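The intermediate steps behind this expression can be written out as follows. This is a sketch for a single SNP, assuming that the trait-on-expression and trait-on-SNP regressions are both based on the same GWAS sample of size $n_{\mathrm{GWAS}}$.

```latex
\begin{aligned}
\operatorname{var}(\tau) &= \hat{\sigma}_T^2\,(1 - R^2_{T,Y})
  \;\Longrightarrow\;
  \operatorname{se}(\hat{\kappa}_Y)^2
  = \frac{\hat{\sigma}_T^2\,(1 - R^2_{T,Y})}{n_{\mathrm{GWAS}}\,\hat{\sigma}_Y^2},\\[4pt]
\operatorname{se}(\hat{\kappa}_X)^2
  &= \frac{\hat{\sigma}_T^2\,(1 - R^2_{T,X})}{n_{\mathrm{GWAS}}\,\hat{\sigma}_X^2},\\[4pt]
\frac{\operatorname{se}(\hat{\kappa}_Y)}{\operatorname{se}(\hat{\kappa}_X)}
  &= \sqrt{\frac{\hat{\sigma}_X^2\,(1 - R^2_{T,Y})}{(1 - R^2_{T,X})\,\hat{\sigma}_Y^2}} .
\end{aligned}
```

The trait variance and the GWAS sample size cancel in the ratio, which is why only summary statistics and a genotype reference panel are needed.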

Then, the Z score is $Z \approx \hat{\beta}^{T} \Lambda \hat{Z}_X$, where $\hat{\beta}$ is the SNP coefficient vector in the gene expression imputation model and $\Lambda$ is a diagonal matrix whose diagonal elements are the ratios of the standard errors of the SNPs to the standard error of the imputed gene expression. $\hat{Z}_X$ is the vector of Z scores provided in the GWAS results. Intuitively, the statistic is a weighted sum of the GWAS Z scores of the SNPs selected in the gene expression imputation model, with weights proportional to the proportion of variance in imputed expression levels explained by each SNP.
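A minimal sketch of this weighted sum is given below. It takes the diagonal of Λ to be the standard deviation of each SNP divided by the standard deviation of the imputed expression, both estimated from a reference genotype panel; that reading, the function name `gene_level_z`, and the toy inputs are assumptions of this sketch.

```python
import numpy as np

def gene_level_z(weights, gwas_z, ref_genotypes):
    """Weighted sum of GWAS z-scores for one gene in one tissue.

    weights:       (p,) SNP effects on expression from the imputation model.
    gwas_z:        (p,) GWAS z-scores for the same SNPs, same allele coding.
    ref_genotypes: (n, p) reference-panel genotypes used to estimate scales.
    """
    Xc = ref_genotypes - ref_genotypes.mean(axis=0)
    snp_sd = Xc.std(axis=0, ddof=1)            # per-SNP scale
    expr_sd = (Xc @ weights).std(ddof=1)       # scale of the imputed expression
    return float(np.sum(weights * snp_sd / expr_sd * gwas_z))

# Sanity check with a single SNP: the gene-level statistic equals the SNP z-score.
ref = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
z_gene = gene_level_z(np.array([2.0]), np.array([3.0]), ref)
```

With one selected SNP the weight cancels against the scale of the imputed expression, so the gene-level statistic reduces to that SNP's GWAS z-score. The function would divide by zero if the imputed expression had no variance in the reference panel.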

GWAS data analysis

The summary statistics files of 207 traits (not based on UK Biobank data) were downloaded from the LD Hub website (http://ldsc.broadinstitute.org/gwashare/) by March 2018. Imputation models built by all five methods across 26 tissues were applied to identify trait-associated genes. For these five methods, individual-level genotypes from the GTEx database were used to calculate the standard deviations of SNPs and the standard errors of imputed gene expression.

Supporting information

S1 Fig. Number of identified SNPs by each method.

The figure shows the number of identified eQTLs by all five methods across 26 tissues.

(PDF)

S2 Fig. More loci were identified as trait-associated by T-GEN across 207 traits from the LD Hub.

Applied to 207 traits from the LD Hub, significant trait-associated genes were identified in 26 tissues (p-value threshold: 0.05 divided by the number of gene-tissue pairs). The identified associated genes were further grouped into pre-defined cytobands. Each boxplot represents the distribution of the differences in number between those identified from our tissue-specific analysis and those identified from the four other methods. The Y axis is truncated at 3 times the third quartile of each boxplot for visualization.

(PDF)

S3 Fig. The ratio of number of genes identified in trait-relevant tissues over that identified in non-associated tissues.

Each boxplot shows the distribution of the ratios across 207 traits in the LD Hub.

(PDF)

S4 Fig. The Venn diagram of genes identified as associated with LOAD in all five methods.

Twenty-nine genes were identified by all five methods. Sixty-two genes were shared between T-GEN and elastic-net methods, which may indicate the consistency of gene findings in T-GEN and other TWAS methods based on elastic-net models.

(PDF)

S5 Fig. Regional Manhattan plot for SNPs near the identified associated gene COG4.

The listed SNP (rs7196032) is one of the eQTLs identified by T-GEN in the imputation model of COG4.

(PDF)

S6 Fig. Evaluating the imputation accuracy of the five methods.

a) compares $R^2$ between observed and imputed gene expression levels in the 5-fold cross-validation analysis. Using 5-fold cross-validation in the GTEx data, $R^2$ between imputed and observed expression levels was calculated in elastic net (elnt) and vb.logit models. Dotted lines indicate the mean values of $R^2$ for each method across all tissues. The mean values of the vb.annot and T-GEN models are very close to each other, which leads to the overlaid green and blue dotted lines. b) shows $R^2$ between observed gene expression in the CommonMind dataset and predicted gene expression based on the five different methods. T-GEN showed the lowest values of $R^2$ among all five methods. The Y axis is truncated at 2 times the third quartile of each boxplot for visualization. c) Using GTEx v8 data, T-GEN, elnt and mashr models were trained in the brain cortex BA9 tissue. $R^2$ between observed gene expression in the CommonMind dataset and predicted gene expression was compared across these three methods.

(PDF)

S7 Fig. The percentages of identified trait-associated genes in 207 traits having pLI>0.99 and their relationship with their imputation accuracy.

For each method, all trait-associated gene-tissue models are classified into 20 bins based on their imputation accuracy ($R^2$). For each bin, the percentage of genes having pLI>0.99 is calculated. The linear equation indicates the linear relationship between the mean $R^2$ and the percentage, and the p value indicates the significance level of the association.

(PDF)

S8 Fig. The percentage of genes identified as trait-associated in any of 207 traits and its relationship with the corresponding imputation accuracy.

For each method, all gene-tissue models (not just trait-associated ones) are classified into 20 bins based on their imputation accuracy (R2). For each bin, the percentage of gene-tissue models identified as trait-associated (bars) and the mean level of imputation accuracy for trait-associated ones (squares with crosses) are calculated. The linear equation indicates the linear relationship between the mean imputation accuracy and the percentage of trait-associated gene-tissue models. The p value indicates the significance level.

(PDF)

S9 Fig. The imputation accuracy in muscle skeletal tissue using GTEx v8 data.

a) shows the comparison among imputation models using binary-coded annotation, models trained using the probit link function in the annotation layer, and models trained using the original T-GEN method. b) shows the results of models trained using annotation information with different missing rates. The rate 0 indicates the results of the original T-GEN method. Red diamonds indicate the mean level of each group.

(PDF)

S1 Table. Active states in ChromHMM-15 model.

This table shows 11/21 active states in the ChromHMM-15 model.

(XLSX)

S2 Table. Ratio of identified SNPs with CADD score larger than 20 by each method.

(XLSX)

S3 Table. 207 traits from LD Hub.

207 traits from the LD Hub considered in our paper and also their corresponding GWAS studies.

(XLSX)

S4 Table. Enrichment pattern of genes with pLI > 0.99 in identified trait-associated genes.

This table shows the enrichment analysis results for all five methods; the p values were obtained using a binomial test.

(XLSX)

S5 Table. Numbers of significant genes in the most relevant tissue across traits in LD Hub.

This table shows the number of significant genes in the tissues with most-enriched heritability across different traits.

(XLSX)

S6 Table. T-GEN results in LOAD case study.

This table shows T-GEN association results in LOAD; genes with a green background in the table are those replicated in GWAX data.

(XLSX)

S7 Table. Heritability enrichment results in LOAD.

eQTLs contributing to identified LOAD-associated genes were identified as an annotation used in LD score regression. Heritability enrichment analysis was further conducted. The table shows the results for all five methods.

(XLSX)

S8 Table. eQTL for novel genes identified in AD by T-GEN.

This table shows eQTLs for novel AD-associated genes identified by T-GEN.

(XLSX)

S9 Table. Genes identified associated with LOAD using GTEx v8 models.

This table shows LOAD-associated genes identified by mash, PrediXcan and T-GEN in heritability-enriched tissues of LOAD (whole blood and liver). Whether each gene is replicated in the additional GWAX data is also indicated in the table.

(XLSX)

S10 Table. Numbers of Roadmap annotation categories for each cell type and corresponding tissue.

This table shows the numbers of roadmap annotation categories (like DNA methylation, H3K4me3 and H3K9ac) in Roadmap cell types and corresponding 26 Roadmap tissues.

(XLSX)

S11 Table. CPU hours needed for building gene expression imputation models in T-GEN.

This table shows the running time needed for building imputation models in T-GEN across 26 tissues using the GTEx v6 dataset.

(XLSX)

S12 Table. AD enrichment.

Heritability enrichment of AD in GenoSkyline annotations using LDSC.

(XLSX)

S13 Table. Model number.

T-GEN models in each tissue.

(XLSX)

S1 Text. Details of the T-GEN method, discussion on the effects of imputation accuracy and effects of the annotation layer.

In the supplementary methods, we give the details of the variational method used in the T-GEN model, including the updating procedures for parameters. In the supplementary discussion, we discuss the influence of gene expression imputation accuracy on the gene-trait association test, as well as the effects of different link functions in the annotation layer of the T-GEN model, ways of configuring annotation information, and incompleteness of annotation.

(DOCX)

Data Availability

Roadmap epigenomics project data are available at: https://egg2.wustl.edu/roadmap/web_portal/. GTEx gene expression data are available at: https://gtexportal.org/home/datasets; GTEx genotype data: v6 dbGaP accession phs000424.v6.p1; v8 dbGaP accession phs000424.v8. GWAS summary stats from LD hub are available at: http://ldsc.broadinstitute.org. All trained models and association test results of T-GEN can be found in https://github.com/vivid-/T-GEN.

Funding Statement

Supported in part by NIH grants R01 GM122078, R01 GM134005, and P30 AG021342, and NSF grant DMS 1902903. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Claussnitzer M, Cho JH, Collins R, Cox NJ, Dermitzakis ET, Hurles ME, et al. A brief history of human disease genetics. Nature. 2020;577:179–189. 10.1038/s41586-019-1879-7
  • 2. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. Am J Hum Genet. 2011;89:82–93. 10.1016/j.ajhg.2011.05.029
  • 3. Chen LS, Hutter CM, Potter JD, Liu Y, Prentice RL, Peters U, et al. Insights into Colon Cancer Etiology via a Regularized Approach to Gene Set Analysis of GWAS Data. Am J Hum Genet. 2010;86:860–871. 10.1016/j.ajhg.2010.04.014
  • 4. Hormozdiari F, van de Bunt M, Segrè AV, Li X, Joo JWJ, Bilow M, et al. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet. 2016;99:1245–1260. 10.1016/j.ajhg.2016.10.003
  • 5. Joehanes R, Zhang X, Huan T, Yao C, Ying S, Nguyen QT, et al. Integrated genome-wide analysis of expression quantitative trait loci aids interpretation of genomic association studies. Genome Biol. 2017;18:16 10.1186/s13059-016-1142-6
  • 6. Dobbyn A, Huckins LM, Boocock J, Sloofman LG, Glicksberg BS, Giambartolomei C, et al. Landscape of Conditional eQTL in Dorsolateral Prefrontal Cortex and Co-localization with Schizophrenia GWAS. Am J Hum Genet. 2018;102:1169–1184. 10.1016/j.ajhg.2018.04.011
  • 7. Gamazon ER, Wheeler HE, Shah KP, Mozaffari S V, Aquino-Michaels K, Carroll RJ, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 2015;47:1091–1098. 10.1038/ng.3367
  • 8. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016/02/08. 2016;48:245–252. 10.1038/ng.3506
  • 9. Wen X, Pique-Regi R, Luca F. Integrating molecular QTL data into genome-wide genetic association analysis: Probabilistic assessment of enrichment and colocalization. PLOS Genet. 2017;13:e1006646 Available: 10.1371/journal.pgen.1006646
  • 10. Bhutani K, Sarkar A, Park Y, Kellis M, Schork NJ. Modeling prediction error improves power of transcriptome-wide association studies. bioRxiv. 2017;108316 10.1101/108316
  • 11. Xu Z, Wu C, Wei P, Pan W. A Powerful Framework for Integrating eQTL and GWAS Summary Data. Genetics. 2017/09/11. 2017;207:893–902. 10.1534/genetics.117.300270
  • 12. Yang Y, Shi X, Jiao Y, Huang J, Chen M, Zhou X, et al. CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies. Bioinformatics. 2019. 10.1093/bioinformatics/btz880
  • 13. Carithers LJ, Ardlie K, Barcus M, Branton PA, Britton A, Buia SA, et al. A Novel Approach to High-Quality Postmortem Tissue Procurement: The GTEx Project. Biopreserv Biobank. 2015;13:311–319. 10.1089/bio.2015.0032
  • 14. Geyer PK, Green MM, Corces VG. Tissue-specific transcriptional enhancers may act in trans on the gene located in the homologous chromosome: the molecular basis of transvection in Drosophila. EMBO J. 1990;9:2247–2256. Available: https://pubmed.ncbi.nlm.nih.gov/2162766
  • 15. Ong C-T, Corces VG. Enhancer function: new insights into the regulation of tissue-specific gene expression. Nat Rev Genet. 2011;12:283–293. 10.1038/nrg2957
  • 16. Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, et al. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol. 2011;12: R10 10.1186/gb-2011-12-1-r10
  • 17. Swift-Scanlan T, Smith CT, Bardowell SA, Boettiger CA. Comprehensive interrogation of CpG island methylation in the gene encoding COMT, a key estrogen and catecholamine regulator. BMC Med Genomics. 2014;7:5 10.1186/1755-8794-7-5
  • 18. Kumar D, Puan KJ, Andiappan AK, Lee B, Westerlaken GHA, Haase D, et al. A functional SNP associated with atopic dermatitis controls cell type-specific methylation of the VSTM1 gene locus. Genome Med. 2017;9:18 10.1186/s13073-017-0404-6
  • 19. Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature. 2009;459:108–112. 10.1038/nature07829
  • 20. Berger SL. Histone modifications in transcriptional regulation. Curr Opin Genet Dev. 2002;12:142–148. 10.1016/s0959-437x(02)00279-4
  • 21. Cheng C, Yan K-K, Yip KY, Rozowsky J, Alexander R, Shou C, et al. A statistical framework for modeling gene expression using chromatin features and application to modENCODE datasets. Genome Biol. 2011;12:R15 10.1186/gb-2011-12-2-r15
  • 22. Cheng C, Gerstein M. Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells. Nucleic Acids Res. 2011;40:553–568. 10.1093/nar/gkr752
  • 23. Dong X, Greven MC, Kundaje A, Djebali S, Brown JB, Cheng C, et al. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012;13:R53 10.1186/gb-2012-13-9-r53
  • 24. Consortium TEP, Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57 Available: 10.1038/nature11247
  • 25. Spisák S, Lawrenson K, Fu Y, Csabai I, Cottman RT, Seo J-H, et al. CAUSEL: an epigenome- and genome-editing pipeline for establishing function of noncoding GWAS variants. Nat Med. 2015;21:1357 Available: 10.1038/nm.3975
  • 26. Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun. 2018;9:1825 10.1038/s41467-018-03621-1
  • 27. Hu Y, Li M, Lu Q, Weng H, Wang J, Zekavat SM, et al. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet. 2019;51:568–576. 10.1038/s41588-019-0345-7
  • 28. Romanoski CE, Glass CK, Stunnenberg HG, Wilson L, Almouzni G. Roadmap for regulation. Nature. 2015;518:314–316. 10.1038/518314a
  • 29. Li B, Carey M, Workman JL. The Role of Chromatin during Transcription. Cell. 2007;128:707–719. 10.1016/j.cell.2007.01.015
  • 30. Carbonetto P, Stephens M. Scalable Variational Inference for Bayesian Variable Selection in Regression, and Its Accuracy in Genetic Association Studies. Bayesian Anal. 2012;7: 73–108. 10.1214/12-BA703
  • 31. Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. J R Stat Soc Ser B (Statistical Methodol. 2005;67:301–320. Available: http://www.jstor.org/stable/3647580
  • 32. Carbonetto P, Zhou X, Stephens M. varbvs: Fast Variable Selection for Large-scale Regression. arXiv Prepr arXiv170906597. 2017.
  • 33. Zhou X, Carbonetto P, Stephens M. Polygenic Modeling with Bayesian Sparse Linear Mixed Models. PLOS Genet. 2013;9:e1003264 Available: 10.1371/journal.pgen.1003264
  • 34. Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods. 2012;9:215–216. 10.1038/nmeth.1906
  • 35. Battle A, Mostafavi S, Zhu X, Potash JB, Weissman MM, McCormick C, et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 2013/10/03. 2014;24:14–24. 10.1101/gr.155192.113
  • 36. Josephs EB, Lee YW, Stinchcombe JR, Wright SI. Association mapping reveals the role of purifying selection in the maintenance of genomic variation in gene expression. Proc Natl Acad Sci. 2015;112:15390 LP– 15395. 10.1073/pnas.1503027112
  • 37. Ray K, Szabo B. Variational Bayes for high-dimensional linear regression with sparse priors. 2019;1–40. Available: http://arxiv.org/abs/1904.07150
  • 38. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB. Genic Intolerance to Functional Variation and the Interpretation of Personal Genomes. PLOS Genet. 2013;9: e1003709 Available: 10.1371/journal.pgen.1003709
  • 39. Cassa CA, Weghorn D, Balick DJ, Jordan DM, Nusinow D, Samocha KE, et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat Genet. 2017;49:806 Available: 10.1038/ng.3831
  • 40. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P-R, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet. 2015;47:1228 Available: 10.1038/ng.3404
  • 41. Lu Q, Powles RL, Abdallah S, Ou D, Wang Q, Hu Y, et al. Systematic tissue-specific functional annotation of the human genome highlights immune-related DNA elements for late-onset Alzheimer’s disease. PLOS Genet. 2017;13:e1006933 Available: 10.1371/journal.pgen.1006933
  • 42. Lambert J-C, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, Bellenguez C, et al. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat Genet. 2013;45:1452–1458. 10.1038/ng.2802
  • 43. Basurto-Islas G, Grundke-Iqbal I, Tung YC, Liu F, Iqbal K. Activation of Asparaginyl Endopeptidase Leads to Tau Hyperphosphorylation in Alzheimer Disease. J Biol Chem. 2013;288:17495–17507. 10.1074/jbc.M112.446070
  • 44. Climer LK, Dobretsov M, Lupashin V. Defects in the COG complex and COG-related trafficking regulators affect neuronal Golgi function. Frontiers in Neuroscience. 2015. p. 405 Available: 10.3389/fnins.2015.00405
  • 45. Qing-Qing Tao Zhi-Ying Wu Y-CC. The role of CD2AP in the Pathogenesis of Alzheimer's Disease. Aging and disease. pp. 901–907. Available: http://www.aginganddisease.org
  • 46. Amlie-Wolf A, Tang M, Mlynarski EE, Kuksa PP, Valladares O, Katanic Z, et al. INFERNO: inferring the molecular mechanisms of noncoding genetic variants. Nucleic Acids Res. 2018;46:8740–8753. 10.1093/nar/gky686
  • 47. Wadsworth TL, Bishop JA, Pappu AS, Woltjer RL, Quinn JF. Evaluation of coenzyme Q as an antioxidant strategy for Alzheimer’s disease. J Alzheimer’s Dis. 2008;14:225–234. 10.3233/jad-2008-14210
  • 48. Meda SA, Koran MEI, Pryweller JR, Vega JN, Thornton-Wells TA. Genetic interactions associated with 12-month atrophy in hippocampus and entorhinal cortex in Alzheimer’s Disease Neuroimaging Initiative. Neurobiol Aging. 2013;34:1518.e9–1518.e18. 10.1016/j.neurobiolaging.2012.09.020
  • 49. Jaeger CUP and S. Functional Role of Lipoprotein Receptors in Alzheimers Disease. Current Alzheimer Research. 2008. pp. 15–25. 10.2174/156720508783884675
  • 50. Need AC, Attix DK, McEvoy JM, Cirulli ET, Linney KL, Hunt P, et al. A genome-wide study of common SNPs and CNVs in cognitive performance in the CANTAB. Hum Mol Genet. 2009;18:4650–4661. 10.1093/hmg/ddp413
  • 51. Hong C, Tontonoz P. Liver X receptors in lipid metabolism: opportunities for drug discovery. Nat Rev Drug Discov. 2014;13: 433–444. 10.1038/nrd4280
  • 52. Fishilevich S, Nudel R, Rappaport N, Hadar R, Plaschkes I, Iny Stein T, et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database. 2017;2017 10.1093/database/bax028
  • 53. Renquist BJ, Madanayake TW, Hennebold JD, Ghimire S, Geisler CE, Xu Y, et al. TMEM135 is an LXR-inducible regulator of peroxisomal metabolism. bioRxiv. 2019;334979 10.1101/334979
  • 54. Liu JZ, Erlich Y, Pickrell JK. Case–control association mapping by proxy using family history of disease. Nat Genet. 2017;49:325–331. 10.1038/ng.3766
  • 55. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14:128 10.1186/1471-2105-14-128
  • 56. Urbut SM, Wang G, Carbonetto P, Stephens M. Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nat Genet. 2019;51:187–195. 10.1038/s41588-018-0268-8
  • 57. Nagpal S, Meng X, Epstein MP, Tsoi LC, Patrick M, Gibson G, et al. TIGAR: An Improved Bayesian Tool for Transcriptomic Data Imputation Enhances Gene Mapping of Complex Traits. Am J Hum Genet. 2019;105:258–266. 10.1016/j.ajhg.2019.05.018
  • 58. Karmodiya K, Krebs AR, Oulad-Abdelghani M, Kimura H, Tora L. H3K9 and H3K14 acetylation co-occur at many gene regulatory elements, while H3K14ac marks a subset of inactive inducible promoters in mouse embryonic stem cells. BMC Genomics. 2012;13:424 10.1186/1471-2164-13-424
  • 59. Kennedy EM, Goehring GN, Nichols MH, Robins C, Mehta D, Klengel T, et al. An integrated -omics analysis of the epigenetic landscape of gene expression in human blood cells. BMC Genomics. 2018;19:476 10.1186/s12864-018-4842-3
  • 60. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012/09/05. 2012;337:1190–1195. 10.1126/science.1222794
  • 61. Trynka G, Sandor C, Han B, Xu H, Stranger BE, Liu XS, et al. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nat Genet. 2012/12/23. 2013;45:124–130. 10.1038/ng.2504
  • 62. Gusev A, Lee SH, Trynka G, Finucane H, Vilhjálmsson BJ, Xu H, et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am J Hum Genet. 2014/11/06. 2014;95:535–552. 10.1016/j.ajhg.2014.10.004
  • 63. Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh P-R, et al. An atlas of genetic correlations across human diseases and traits. Nat Genet. 2015/09/28. 2015;47:1236–1241. 10.1038/ng.3406
  • 64.Reshef YA, Finucane HK, Kelley DR, Gusev A, Kotliar D, Ulirsch JC, et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nat Genet. 2018;50:1483–1493. 10.1038/s41588-018-0196-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Price AL, Patterson N, Hancks DC, Myers S, Reich D, Cheung VG, et al. Effects of cis and trans Genetic Ancestry on Gene Expression in African Americans. PLOS Genet. 2008;4:e1000294 Available: 10.1371/journal.pgen.1000294 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Grundberg E, Small KS, Hedman ÅK, Nica AC, Buil A, Keildson S, et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet. 2012;44:1084–1089. 10.1038/ng.2394 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Liu X, Finucane HK, Gusev A, Bhatia G, Gazal S, O’Connor L, et al. Functional Architectures of Local and Distal Regulation of Gene Expression in Multiple Human Tissues. Am J Hum Genet. 2017;100:605–616. 10.1016/j.ajhg.2017.03.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Liu X, Li YI, Pritchard JK. Trans Effects on Gene Expression Can Drive Omnigenic Inheritance. Cell. 2019;177:1022–1034.e6. 10.1016/j.cell.2019.04.014 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science (80-). 2015;348:648 LP– 660. 10.1126/science.1262110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010;33: 1–22. Available: https://www.ncbi.nlm.nih.gov/pubmed/20808728 [PMC free article] [PubMed] [Google Scholar]
  • 71.Cheng J, Blum R, Bowman C, Hu D, Shilatifard A, Shen S, et al. A Role for H3K4 Monomethylation in Gene Repression and Partitioning of Chromatin Readers. Mol Cell. 2014;53:979–992. 10.1016/j.molcel.2014.02.032 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Liang G, Lin JCY, Wei V, Yoo C, Cheng JC, Nguyen CT, et al. Distinct localization of histone H3 acetylation and H3-K4 methylation to the transcription start sites in the human genome. Proc Natl Acad Sci U S A. 2004/05/03. 2004;101:7357–7362. 10.1073/pnas.0401866101 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Zhou J, Wang X, He K, Charron J-BF, Elling AA, Deng XW. Genome-wide profiling of histone H3 lysine 9 acetylation and dimethylation in Arabidopsis reveals correlation between multiple histone marks and gene expression. Plant Mol Biol. 2010;72:585–595. 10.1007/s11103-009-9594-7 [DOI] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008315.r001

Decision Letter 0

Seyoung Kim, Jian Ma

3 May 2020

Dear Prof. Zhao,

Thank you very much for submitting your manuscript "Leveraging functional annotation to identify genes associated with complex diseases" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. While the paper proposes a methodology for solving an important problem, the reviewers raised concerns regarding application of the method to analyze data. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. 

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Seyoung Kim

Guest Editor

PLOS Computational Biology

Jian Ma

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Liu et al introduced a new method called T-GEN (Transcriptome-mediated identification of disease-associated Gens with Epigenetic aNnotation) to identify disease-associated genes leveraging epigenetic information. They incorporate epigenomic annotation in their formula as a Bayes prior (variational Bayes method) in imputing gene expression levels and then perform TWAS (transcriptome-wide association study) of disease traits.

This method leverages functional annotation to identify genes associated with complex diseases. They applied T-GEN to 207 complex traits and identified more trait-associated genes (ranging from 7.7% to 102%) than those from existing methods. Among the identified genes associated with these traits, T-GEN can better identify genes with high (>0.99) pLI scores (higher scores indicate more functional importance of genes) compared to other methods. When T-GEN was applied to late-onset Alzheimer’s disease, they identified 96 genes located at 15 loci, including two novel loci not implicated in previous GWAS. I believe that the method proposed in this paper is an important extension of existing methods for identifying disease-associated genes. The paper is well written. I have the following comments.

1. The authors assumed that SNPs with active epigenetic annotations are more likely to regulate tissue-specific gene expression. I think this is only partially true given our knowledge about the human genome. More GWAS SNPs are located in non-coding and unknown regions. In addition, our knowledge about epigenetic annotations is limited and mainly based on predictions from statistical models. Therefore, this assumption is clearly a limitation of the method, but there may not be a good solution at this point. I wonder if the authors can comment on this and provide some insights in the discussion.

2. ‘Reported GWAS hits are enriched in regions with active epigenetic signals and they can help fine-map true GWAS hits with functional impacts’. I wonder what proportion of GWAS hits have active epigenetic signals.

3. What does ‘positive value’ mean in ‘Removing cis-SNPs without positive values for H3K4me1, H3K4me3, or H3K9ac signals’? What are the effects of the SNPs that show positive values for these three histone marks?

4. How did the authors form a continuous epigenetic annotation score? Did they compare different ways to formulate the annotation score? How did the authors choose the current method to form the continuous annotation? Is there an advantage to using a logit link to associate the epigenetic annotation with the probability of a SNP being an eQTL?

5. The GWAS SNPs with active epigenetic annotation are prioritized. I wonder how they treated the GWAS SNPs without clear annotation.

6. Bayesian methods are known to be computationally intensive, especially for high-dimensional genotype data. The authors used the variational Bayes method, which is less computationally intensive than MCMC. I wonder if the authors can provide the number of hours needed for imputation of gene expression in each tissue type. I assume that it takes longer in some tissue types and shorter in others.

Reviewer #2: Liu et al. proposed an interesting idea to incorporate epigenetic annotation into transcriptome-wide association studies (TWAS). They adopted a Bayesian variable selection model to integrate Roadmap epigenetic annotations, GTEx data, and LD hub GWAS summary statistics. Through intensive analyses of 207 traits on LD hub, they demonstrated advantages of their proposed method T-GEN over other methods. Then they thoroughly discussed a detailed application to Alzheimer’s disease. Below I list some comments that may help improve the manuscript.

1. It is worrisome that using GTEx brain cortex BA9 data to impute CommonMind Consortium (CMC) BA9 brain data can only achieve R2 of 0.007. Given GTEx data are only healthy controls, how about only considering CMC controls data with the same age range as GTEx subjects?

PrediXcan published its prediction model online (http://predictdb.org/), including for a newer version of GTEx data (V8). What is the R2 directly using their prediction weights?

It is also counterintuitive that improving gene expression imputation accuracy does not help TWAS. How many genes are left after the FDR filtering?

2. The authors claimed T-GEN discovered more genes. Is there any empirical check of the distribution of the p-values to ensure that there is no inflated FDR?

3. The GTEx data have been updated to V8. Given the largely increased sample size, it may help boost gene expression imputation accuracy and association discovery if the authors can update the analyses to GTEx V8 data. If not for all the traits, updating the main application (Alzheimer’s disease) would show the impact of sample size.

4. Fig 2 shows the percentage of functional SNPs. Can the authors show the number of SNPs selected by each method? Why are two methods (elnt.annot and vb.annot) not compared here?

The authors acknowledged that “We note that T-GEN utilizes the epigenetic information in SNP selections, and the same information is also used in ChromHMM models. Therefore, we expect to select more SNPs annotated as functionally active in ChromHMM models. To further evaluate the functional potential of the SNPs selected by T-GEN, we considered the CADD scores of the identified SNPs across all five methods.” So what is presented in Fig 2 is an unfair comparison, and it seems to make more sense to show the CADD score results in Fig 2.

5. Fig 3 is not cited in the paper. There are 8 main figures and 8 supplementary figures. The authors may consider combining similar figures into a bigger figure or compile all supplementary materials in a single file. Now it takes time to download and review supplementary figures one by one.

6. As a computational biology paper, would the authors provide analysis code or software?

7. The authors considered 26 tissue types with Roadmap and GTEx data. For instance, in the application to Alzheimer’s disease, they identified loci in tissue types like ovary and lung. I am not sure if this makes sense biologically. The authors may investigate trait-relevant tissue types further, e.g., 1) provide heritability enrichment for each identified tissue type; 2) weight tissue types by trait relevance in the testing; 3) only consider relevant tissue types by a cutoff of heritability enrichment. This would also help relieve the burden of multiple testing.

Typos:

1. “gens” appeared four times, including when mentioning the full name of the proposed method T-GEN. Should it be “genes”?

2. “relavant” in the caption of Fig S2.

3. S1 Text mentioned that there are 208 traits rather than the 207 stated in the main text.

Reviewer #3: The main idea of this study is to add epigenetic information to gene expression imputation from eQTL SNPs, which helps to identify more statistically predictive and biologically functional SNPs and in turn leads to the identification of more trait-associated genes. In the process of imputing expression from SNP information, the epigenetic annotation was used to set priorities among candidate SNPs. Experimental results show that the proposed method can select more potent factors than the previous methods.

Major comment:

The study is based on the assumption that SNPs with active epigenetic annotation are more likely to modulate tissue-specific gene expression. A reference supporting this, or experimental validation, would be necessary (paragraph 2 in Introduction).

As the authors noted, it is naturally expected that adding epigenetic information in the imputation process selects more SNPs annotated as functional. Therefore, further validation of the selected SNPs is done by showing that the eQTLs identified by T-GEN have higher CADD scores, although the score improvement is marginal. The final validation is done through the identified trait-associated genes, which are more numerous and include a higher percentage of functionally conserved genes. Still, this does not seem to fully validate the potential advantage of the proposed method.

Since the main idea from a methodological perspective is to add epigenetic annotation information, it would be helpful to do some experiments on robustness against bias or incompleteness in the annotation information. A simulation study on this and other aspects that can show the performance behavior of the proposed method would be useful.

This method was compared to Elnt and vb, which are relatively classic methods. I’m wondering whether there are more recent methods to use for comparison.

In the section “More genes can be effectively imputed by T-GEN”, more detailed explanation of the measurement method is needed. In addition, there should be comparison of prediction errors as to whether the proposed method predicts expression matrix Y well from X.

Figure 3 is not explained in the main text. Also, there is an elnt category in Figure 3, and it seems that the values are all zero. If it is just a control (all 0), it may be deleted.

Typo:

In line 297, “gens” should be “genes”.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008315.r003

Decision Letter 1

Seyoung Kim, Jian Ma

14 Aug 2020

Dear Prof. Zhao,

Thank you very much for submitting your manuscript "Leveraging functional annotation to identify genes associated with complex diseases" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations, specifically on making the software available and proofreading the text.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Seyoung Kim

Guest Editor

PLOS Computational Biology

Jian Ma

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I believe the authors have appropriately addressed my comments. I feel that the authors used 'can' in many sentences. Some instances of 'can' should be changed to 'may'; others should be removed, leaving just the verb. For example, in "H3K9ac [can] be present in both actively transcribed and bivalent regions", 'can' should be 'might'. As another example, "More specifically, the running time for model training in each tissue [can] be found in the S11 Table" should read "... in each tissue was displayed in the S11 Table". In "the annotation configuration and incompleteness [can] also affect the results", 'can' should be 'may'. In "More genes [can] be effectively imputed by T-GEN", 'can be' should be changed to 'are'.

Reviewer #2: My comments have been addressed.

Reviewer #3: This paper proposes a new method of using functional annotation to identify disease-related genes and it seems a significant contribution to the field. Most of the concerns I have raised about the previous version have been resolved and I recommend to accept this manuscript after minor edits, e.g.

- line 336: substantial number increase in the number of genes --> substantial increase in the number of genes

And the authors would need to include in the manuscript a link that provides the used data and software/code.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008315.r005

Decision Letter 2

Seyoung Kim, Jian Ma

5 Sep 2020

Dear Prof. Zhao,

We are pleased to inform you that your manuscript 'Leveraging functional annotation to identify genes associated with complex diseases' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Seyoung Kim

Guest Editor

PLOS Computational Biology

Jian Ma

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008315.r006

Acceptance letter

Seyoung Kim, Jian Ma

20 Oct 2020

PCOMPBIOL-D-20-00202R2

Leveraging functional annotation to identify genes associated with complex diseases

Dear Dr Zhao,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Number of identified SNPs by each method.

    The figure shows the number of identified eQTLs by all five methods across 26 tissues.

    (PDF)

    S2 Fig. More loci were identified as trait-associated by T-GEN across 207 traits from the LD Hub.

    Applied to 207 traits from the LD Hub, significant trait-associated genes were identified in 26 tissues (p-value threshold: 0.05 divided by the number of gene-tissue pairs). The identified trait-associated genes were further grouped into pre-defined cytobands. Each boxplot represents the distribution of the differences between the number of loci identified by our tissue-specific analysis and the number identified by each of the four other methods. The Y axis is truncated at 3 times the third quartile of each boxplot for visualization.

    (PDF)
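The significance threshold described in the S2 Fig caption (0.05 divided by the number of gene-tissue pairs) is a Bonferroni correction. A minimal sketch of that rule follows; the function names, the dictionary layout, and the example p-values are illustrative assumptions and are not taken from the T-GEN software.

```python
def bonferroni_threshold(n_gene_tissue_pairs, alpha=0.05):
    """Return alpha divided by the number of tested gene-tissue pairs."""
    if n_gene_tissue_pairs <= 0:
        raise ValueError("need at least one gene-tissue pair")
    return alpha / n_gene_tissue_pairs


def significant_pairs(pvalues, alpha=0.05):
    """Select gene-tissue pairs whose p-value falls below the threshold.

    pvalues: dict mapping (gene, tissue) -> association p-value
             (hypothetical layout chosen for this sketch).
    """
    threshold = bonferroni_threshold(len(pvalues), alpha)
    return {pair for pair, p in pvalues.items() if p < threshold}


# Made-up p-values for four placeholder gene-tissue pairs.
example = {
    ("GENE_A", "tissue_1"): 0.001,
    ("GENE_B", "tissue_1"): 0.02,
    ("GENE_C", "tissue_2"): 1e-8,
    ("GENE_D", "tissue_2"): 0.5,
}
hits = significant_pairs(example)
```

With four pairs the threshold is 0.05 / 4 = 0.0125, so only the pairs for GENE_A and GENE_C pass in this toy input.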

    S3 Fig. The ratio of number of genes identified in trait-relevant tissues over that identified in non-associated tissues.

    Each boxplot shows the distribution of the ratios across 207 traits in the LD Hub.

    (PDF)

    S4 Fig. The Venn diagram of genes identified as associated with LOAD in all five methods.

    Twenty-nine genes were identified by all five methods. Sixty-two genes were shared between T-GEN and elastic-net methods, which may indicate the consistency of gene findings in T-GEN and other TWAS methods based on elastic-net models.

    (PDF)

    S5 Fig. Regional Manhattan plot for SNPs near the identified associated gene COG4.

    The listed SNP (rs7196032) is one of the eQTLs identified by T-GEN in the imputation model of COG4.

    (PDF)

    S6 Fig. Evaluating the imputation accuracy of the five methods.

    a) Comparison of R2 between observed and imputed gene expression levels in 5-fold cross-validation analysis. Using 5-fold cross-validation in the GTEx data, R2 between imputed and observed expression levels was calculated in elastic net (elnt) and vb.logit models. Dotted lines indicate the mean values of R2 for each method across all tissues. The mean values of the vb.annot and T-GEN models are very close to each other, which leads to the overlaid green and blue dotted lines. b) R2 between observed gene expression in the CommonMind dataset and gene expression predicted by the five methods. T-GEN showed the lowest values of R2 among all five methods. The Y axis is truncated at 2 times the third quartile of each boxplot for visualization. c) Using GTEx v8 data, T-GEN, elnt and mashr models were trained in the brain cortex BA9 tissue. R2 between observed gene expression in the CommonMind dataset and predicted gene expression was compared across these three methods.

    (PDF)
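The S6 Fig caption reports R2 between observed and imputed expression levels. The caption does not state how R2 was computed; the sketch below assumes the squared Pearson correlation, a common choice in expression-imputation work, and the function name is hypothetical.

```python
def squared_pearson(observed, imputed):
    """Squared Pearson correlation between two equal-length sequences.

    Assumption for this sketch: R2 is defined as the squared
    correlation between observed and imputed expression levels.
    """
    if len(observed) != len(imputed) or len(observed) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_ii = sum((i - mean_i) ** 2 for i in imputed)
    s_oi = sum((o - mean_o) * (i - mean_i)
               for o, i in zip(observed, imputed))
    if s_oo == 0 or s_ii == 0:
        raise ValueError("R2 is undefined for a constant sequence")
    return (s_oi ** 2) / (s_oo * s_ii)


# Toy values, not real expression data.
r2 = squared_pearson([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

In a cross-validation setting, the same function would be applied to the held-out samples of each fold, with the imputed values coming from a model trained on the remaining folds.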

    S7 Fig. The percentages of identified trait-associated genes in 207 traits having pLI>0.99 and their relationship with their imputation accuracy.

    For each method, all trait-associated gene-tissue models are classified into 20 bins based on their imputation accuracy (R2). For each bin, the percentage of genes having pLI>0.99 is calculated. The linear equation indicates the linear relationship between the mean R2 and the percentage, and the p value indicates the significance level of the association.

    (PDF)
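The binning described in the S7 Fig caption can be sketched as follows. The caption does not say whether the 20 bins are equal-width or equal-count; this sketch assumes equal-count bins over models sorted by R2, and it fits the line by ordinary least squares. All names and the toy input are hypothetical.

```python
def bin_by_accuracy(models, n_bins=20, pli_cutoff=0.99):
    """Group models into equal-count bins by R2 (an assumption).

    models: list of (r2, pli) tuples, one per gene-tissue model.
    Returns a list of (mean_r2, percent_with_high_pli), one per
    non-empty bin.
    """
    ordered = sorted(models, key=lambda m: m[0])
    n = len(ordered)
    summary = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_r2 = sum(r2 for r2, _ in chunk) / len(chunk)
        n_high = sum(1 for _, pli in chunk if pli > pli_cutoff)
        summary.append((mean_r2, 100.0 * n_high / len(chunk)))
    return summary


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Toy input with two bins instead of 20, for readability.
toy_models = [(0.1, 0.5), (0.2, 1.0), (0.3, 0.995), (0.4, 1.0)]
bins = bin_by_accuracy(toy_models, n_bins=2)
slope, intercept = fit_line(bins)
```

The p value mentioned in the caption would come from a significance test on the fitted slope, which this sketch leaves out.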

    S8 Fig. The percentage of genes identified as trait-associated in any of 207 traits and its relationship with the corresponding imputation accuracy.

    For each method, all gene-tissue models (not just trait-associated ones) are classified into 20 bins based on their imputation accuracy (R2). For each bin, the percentage of gene-tissue models identified as trait-associated (bars) and the mean level of imputation accuracy for trait-associated ones (squares with crosses) are calculated. The linear equation indicates the linear relationship between the mean imputation accuracy and the percentage of trait-associated gene-tissue models. The p value indicates the significance level.

    (PDF)
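The binning procedure described in the S7 Fig and S8 Fig captions can be illustrated with a minimal sketch. It is not the authors' code. The function names `bin_by_accuracy` and `linear_fit` are hypothetical, the bins are assumed to be equal-width (the captions do not say whether equal-width or equal-count bins were used), and the p value of the fit is not computed here.

```python
from statistics import mean


def bin_by_accuracy(r2_values, flags, n_bins=20):
    """Group models into equal-width R2 bins (an assumption) and return,
    for each non-empty bin, (mean R2, percentage of flagged models)."""
    lo, hi = min(r2_values), max(r2_values)
    # Fall back to a width of 1.0 when all R2 values are identical.
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for r2, flag in zip(r2_values, flags):
        # The maximum R2 value is placed in the last bin.
        idx = min(int((r2 - lo) / width), n_bins - 1)
        bins[idx].append((r2, flag))
    summary = []
    for members in bins:
        if members:
            mean_r2 = mean(r for r, _ in members)
            pct = 100.0 * sum(f for _, f in members) / len(members)
            summary.append((mean_r2, pct))
    return summary


def linear_fit(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs.
    Requires at least two distinct x values."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return slope, my - slope * mx
```

Here `flags` would hold 1 for a model that meets the criterion of interest (for example pLI>0.99 in S7 Fig, or trait-associated in S8 Fig) and 0 otherwise; passing the output of `bin_by_accuracy` to `linear_fit` gives the slope and intercept of the kind of linear equation the captions refer to.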

    S9 Fig. The imputation accuracy in muscle skeletal tissue using GTEx v8 data.

    a) shows the comparison among imputation models using binary-coded annotation, models trained using the probit link function in the annotation layer and models trained using the original T-GEN method. b) shows the results of models trained using annotation information with different missing rates. The rate 0 indicates the results of the original T-GEN method. Red diamonds indicate the mean level of each group.

    (PDF)

    S1 Table. Active states in ChromHMM-15 model.

    This table shows 11/21 active states in the ChromHMM-15 model.

    (XLSX)

    S2 Table. Ratio of identified SNPs with CADD score larger than 20 for each method.

    (XLSX)

    S3 Table. 207 traits from LD Hub.

    207 traits from the LD Hub considered in our paper and also their corresponding GWAS studies.

    (XLSX)

    S4 Table. Enrichment pattern of genes with pLI > 0.99 in identified trait-associated genes.

    This table shows the enrichment analysis results for all five methods; the p values were obtained using a binomial test.

    (XLSX)
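As a rough illustration of the kind of binomial test mentioned for S4 Table, the sketch below computes an exact one-sided p value. The function name `binomial_enrichment_pvalue`, the one-sided alternative, and the use of a background proportion `p0` as the null are all assumptions; the caption does not specify these details.

```python
from math import comb


def binomial_enrichment_pvalue(k, n, p0):
    """Exact one-sided p value P(X >= k) for X ~ Binomial(n, p0).

    k  : number of identified genes with pLI > 0.99
    n  : total number of identified genes
    p0 : assumed background proportion of genes with pLI > 0.99
    """
    return sum(
        comb(n, i) * p0**i * (1.0 - p0) ** (n - i)
        for i in range(k, n + 1)
    )
```

A small p value would suggest that genes with pLI > 0.99 are over-represented among the identified genes relative to the assumed background proportion.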

    S5 Table. Numbers of significant genes in the most relevant tissue across traits in LD Hub.

    This table shows the number of significant genes in the tissues with most-enriched heritability across different traits.

    (XLSX)

    S6 Table. T-GEN results in LOAD case study.

    This table shows T-GEN association results in LOAD; genes with a green background in the table are those replicated in the GWAX data.

    (XLSX)

    S7 Table. Heritability enrichment results in LOAD.

    eQTLs contributing to identified LOAD-associated genes were identified as an annotation used in LD score regression. Heritability enrichment analysis was further conducted. The table shows the results for all five methods.

    (XLSX)

    S8 Table. eQTL for novel genes identified in AD by T-GEN.

    This table shows eQTLs for novel AD-associated genes identified by T-GEN.

    (XLSX)

    S9 Table. Genes identified as associated with LOAD using GTEx v8 models.

    This table shows LOAD-associated genes identified by mash, prediXcan and T-GEN in heritability-enriched tissues of LOAD (whole blood and liver). Whether each gene is replicated in the additional GWAX data is also indicated in the table.

    (XLSX)

    S10 Table. Numbers of Roadmap annotation categories for each cell type and corresponding tissues.

    This table shows the numbers of Roadmap annotation categories (such as DNA methylation, H3K4me3 and H3K9ac) in Roadmap cell types and the corresponding 26 Roadmap tissues.

    (XLSX)

    S11 Table. CPU hours needed for building gene expression imputation models in T-GEN.

    This table shows the running time needed for building imputation models in T-GEN across 26 tissues using GTEx v6 dataset.

    (XLSX)

    S12 Table. AD enrichment.

    Heritability enrichment of AD in GenoSkyline annotations using LDSC.

    (XLSX)

    S13 Table. Model number.

    T-GEN models in each tissue.

    (XLSX)

    S1 Text. Details of the T-GEN method, discussion on the effects of imputation accuracy and effects of the annotation layer.

    In the supplementary methods, we show the details of the variational method used in our T-GEN model, including the updating procedures of parameters. In the supplementary discussion, the influence of gene expression imputation accuracy on the gene-trait association test is discussed. The effects of different link functions in the annotation layer of the T-GEN model, ways of configuring annotation information and incompleteness of annotation are also discussed.

    (DOCX)

    Attachment

    Submitted filename: T-GEN response_3.docx

    Attachment

    Submitted filename: reviewer_comments_final.docx

    Data Availability Statement

    Roadmap Epigenomics Project data are available at: https://egg2.wustl.edu/roadmap/web_portal/. GTEx gene expression data are available at: https://gtexportal.org/home/datasets; GTEx genotype data: v6 dbGaP accession phs000424.v6.p1; v8 dbGaP accession phs000424.v8. GWAS summary statistics from LD Hub are available at: http://ldsc.broadinstitute.org. All trained models and association test results of T-GEN can be found at https://github.com/vivid-/T-GEN.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
