Nature Communications. 2025 Apr 3;16:3193. doi: 10.1038/s41467-025-58023-x

Genetically regulated eRNA expression predicts chromatin contact frequency and reveals genetic mechanisms at GWAS loci

Michael J Betti 1, Phillip Lin 1, Melinda C Aldrich 1, Eric R Gamazon 1,2
PMCID: PMC11968980  PMID: 40180945

Abstract

The biological functions of extragenic enhancer RNAs and their impact on disease risk remain relatively underexplored. In this work, we develop in silico models of genetically regulated expression of enhancer RNAs across 49 cell and tissue types, characterizing their degree of genetic control. Leveraging the estimated genetically regulated expression for enhancer RNAs and canonical genes in a large-scale DNA biobank (N > 70,000) and high-resolution Hi-C contact data, we train a deep learning-based model of pairwise three-dimensional chromatin contact frequency for enhancer-enhancer and enhancer-gene pairs in cerebellum and whole blood. Notably, the use of genetically regulated expression of enhancer RNAs provides substantial tissue-specific predictive power, supporting a role for these transcripts in modulating spatial chromatin organization. We identify schizophrenia-associated enhancer RNAs independent of GWAS loci using enhancer RNA-based TWAS and determine the causal effects of these enhancer RNAs using Mendelian randomization. Using enhancer RNA-based TWAS, we generate a comprehensive resource of tissue-specific enhancer associations with complex traits in the UK Biobank. Finally, we show that a substantially greater proportion (63%) of GWAS associations colocalize with causal regulatory variation when enhancer RNAs are included.

Subject terms: Functional genomics, Computational models, Chromatin structure


Here, the authors present trained models of genetically regulated enhancer RNA expression, finding that genetically regulated expression predicts chromatin contact frequency and that enhancer RNAs play a key role in complex trait heritability.

Introduction

Enhancers are essential mediators of gene expression, regulating spatial and temporal expression patterns through recruitment of DNA-binding proteins and establishment of chromatin conformation1. While the question of how enhancers mediate chromatin activity and gene expression has been well-studied, the biological functions of enhancer RNAs (eRNAs), the RNAs transcribed from these regulatory elements, remain relatively under-explored2. Research over the past decade has suggested that eRNA transcription plays a key role in mediating gene transcription3, facilitating chromatin modifications and enhancer loop formation4,5, and driving cell fate determination2. While canonical genes – mRNAs and long non-coding RNAs (lncRNAs) – are usually spliced and polyadenylated and transcribed from promoters, eRNAs exhibit a higher degree of structural and functional diversity and can be sub-divided into two distinct categories6. The first sub-group, 1D eRNAs, is composed of long, spliced, polyadenylated, unidirectionally transcribed transcripts that can function in trans4. The second sub-class, 2D eRNAs, is composed of short, unspliced, non-polyadenylated, bidirectionally transcribed transcripts that function in cis7, and most eRNAs fall into this latter category. As canonical genes are influenced by expression quantitative trait loci (eQTLs), we expect that eRNA expression should likewise be under at least some degree of genetic control. Additionally, because eRNAs play putative regulatory roles in a range of biological processes, it is reasonable to hypothesize that eRNA expression influences human complex traits, including disease risk.

Recent work investigating the role of eRNAs in neuropsychiatric traits strongly supports this hypothesis8. In one such study, the authors mapped enhancer eQTLs, which they termed EeQTLs, in regions of the brain. They concluded that eRNA expression explains a substantial proportion of neuropsychiatric trait heritability (6.8%). Interestingly, these authors reported that the proportion of heritability explained by eRNAs is largely complementary to, rather than overlapping with, the proportion that can be explained by canonical gene expression alone.

In this work, we present predictive models of eRNA expression trained using whole genome sequencing (WGS) and eRNA expression profiled across 49 cell and tissue types9 (Fig. 1a), quantifying its level of genetic control. Using a deep learning-based framework, we show that pairwise genetically regulated expression (GReX) predicts the three-dimensional chromatin contact frequency of an enhancer-gene or enhancer-enhancer pair (Fig. 1b), lending support to a modulating role in 3D spatial chromatin organization. We then perform an eRNA-based TWAS of schizophrenia (SCZ) risk (Fig. 1c), identifying eRNA associations that are independent of canonical genes, whose causality we investigate using Mendelian randomization (Fig. 1d). Finally, we apply this eRNA-based TWAS methodology across the UK Biobank10,11 to generate a comprehensive reference resource of tissue-specific enhancer associations with the human phenome (Fig. 1e).

Fig. 1. Study workflow.

a In silico genetic models of genetically regulated expression (GReX) of eRNAs and canonical genes were trained using whole genome sequencing data and corresponding expression profiling generated by the GTEx Consortium. b Using the trained models, GReX of eRNAs and canonical genes was imputed across 72,828 BioVU samples. Mean predicted expression values were used to train deep learning models of three-dimensional contact frequencies observed in Hi-C contact matrices. c eRNA-based and canonical-gene-based TWAS of schizophrenia (SCZ) were run. d Genome-wide significant TWAS associations were tested for causality using Mendelian randomization. This allowed us to identify loci with a causal SCZ-associated eRNA, gene, or both and explore the potential underlying mechanisms by which they could influence disease risk. e The eRNA-based TWAS models were applied to the UK Biobank, generating a comprehensive resource of tissue-specific eRNA associations on a phenome-wide scale. f eQTL mapping and colocalization analysis were performed across nearly 1 million independent GWAS associations in the UK Biobank to identify the proportion of signals explained by an eRNA eQTL versus a canonical gene eQTL. All panels were created in BioRender117.

Results

Genetic models of eRNA expression

We trained models12–14 of genetically regulated eRNA expression, encompassing a total of 14,471 transcribed enhancers profiled across 49 human cell and tissue types. Because eRNAs were quantified from RNA-seq data, which underwent selection for polyadenylated transcripts prior to sequencing9, 2D eRNAs, which are not polyadenylated, might be underrepresented in the training dataset. Previous characterizations of eRNAs have described 2D transcripts as having a length below 2 kb, while 1D eRNAs are longer than 2 kb (refs. 15–17). Using transcript length, we characterized the proportion of eRNAs in our training dataset that fit the profile of a 2D eRNA versus a 1D eRNA. Of the 14,471 unique eRNA transcripts included in our trained models, 13,580 (93.84%) were less than 2 kb in length, while only 891 (6.16%) had a length greater than or equal to 2 kb (Supplementary Data 1). The median eRNA transcript length was 550 bp (Supplementary Fig. 1). These results suggest that although the training data underwent selection for polyadenylated transcripts, the vast majority of eRNAs still exhibit the length profile typical of 2D eRNAs.

We compared the resulting eRNA-based models with an analogous set trained on canonical gene expression. We evaluated the proportion of imputable transcripts, mean prediction R2, mean minor allele frequency (MAF) of the SNP features, and mean SNPs/transcript ratio. For each tissue, we observed a higher proportion of imputable canonical genes than imputable eRNAs (Fig. 2a). Prediction R2, however, showed greater cross-tissue variability (Fig. 2b). Across the 49 tissues, 21 had a higher mean prediction R2 for the eRNAs than for the canonical genes. While the mean MAF of model SNP features was essentially the same across all eRNA and canonical-gene-based models (Fig. 2c), we found that across all cell and tissue types, eRNA-based models had a significantly higher SNP per transcript ratio (mean 38.16 SNPs/eRNA versus 19.94 SNPs/canonical gene across all tissue models, p < 2.2 × 10−16) (Fig. 2d).

Fig. 2. Comparison of eRNA-based models with canonical-gene-based models in 49 GTEx tissues.

a Across all tissues, the eRNA-based models included a substantially lower proportion of imputable transcripts than the canonical gene-based models. b Mean prediction R2 varied by tissue type, with some tissue models having a higher predictive performance for eRNAs and others having a higher performance for canonical genes. c The mean MAF of variants included in the models did not vary between eRNAs and canonical genes. d The eRNA-based models, across all tissues, yielded a higher mean number of SNPs per gene transcript than the canonical-gene-based models. Source data are provided as a Source Data file.

Regression modeling of GReX as a predictor of contact frequency

Since enhancer-promoter interactions are among the most critical mechanisms of gene regulation, prediction of the functional consequences of genetic variation on chromatin contact could illuminate disease mechanisms. We sought to assess the extent to which genetically regulated eRNA and canonical gene expression predict three-dimensional chromatin contact frequency.

We applied the eRNA and canonical gene models to BioVU18 (N = 72,828), Vanderbilt University Medical Center’s DNA biobank linked to electronic health records, to impute sample-level GReX. We then leveraged two high-resolution Hi-C datasets (4D Nucleome19 Data Portal; Methods) that had been generated from the K562 leukemic cell line and primary astrocytes of the cerebellum. Within the K562 dataset, we identified 85,630 contact pairs that overlapped with enhancer-gene pairs with imputable expression (for each member transcript of the pair). Within the cerebellum dataset, we identified 95,701 contact pairs for enhancer-gene pairs with corresponding imputable expression. A training set was compiled for each tissue, consisting of the mean GReX values (estimated in BioVU) for each transcript pair, along with the normalized Hi-C contact frequency for the corresponding contact pair (Supplementary Figs. 2 and 3).

As a baseline, we fit a linear regression model to predict 3D contact frequency for enhancer-enhancer or enhancer-gene pairs using the mean predicted GReX of the corresponding transcripts in BioVU. An 80/20 train-test split was used to train the models. In whole blood, the training data consisted of 32,133 enhancer-gene pairs (where the eRNA was upstream and the gene was downstream), 34,682 gene-enhancer pairs (with the gene being upstream and the enhancer downstream), and 1689 enhancer-enhancer pairs. In the cerebellum training data, these numbers were 37,710; 36,117; and 2733, respectively. The resulting linear model for each tissue exhibited poor performance (R2 ≈ 0).
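The reported training-set composition can be checked against the total number of contact pairs per tissue. The short sketch below only reproduces that arithmetic from the counts quoted in the text; the dictionary and function names are illustrative and not part of the authors' pipeline.

```python
# Total contact pairs with imputable expression, as reported in the text.
total_pairs = {"whole_blood": 85_630, "cerebellum": 95_701}

# Reported training-set composition per tissue:
# (enhancer-gene, gene-enhancer, enhancer-enhancer)
train_counts = {
    "whole_blood": (32_133, 34_682, 1_689),
    "cerebellum": (37_710, 36_117, 2_733),
}

def split_sizes(tissue):
    """Return (n_train, n_test, train_fraction) implied by the reported counts."""
    n_train = sum(train_counts[tissue])
    n_test = total_pairs[tissue] - n_train
    return n_train, n_test, n_train / total_pairs[tissue]

for tissue in total_pairs:
    n_train, n_test, frac = split_sizes(tissue)
    print(f"{tissue}: train={n_train}, test={n_test}, train fraction={frac:.3f}")
```

Under these counts, the implied held-out sets contain 17,126 pairs (whole blood) and 19,141 pairs (cerebellum), which agrees with the test-set sizes quoted later in the Results.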

Due to the low performance of the initial linear model, we next trained four non-linear models (polynomial regression, random forest regression, support vector regression, and gradient boosting regression) to predict contact frequency using GReX data (Supplementary Figs. 4 and 5). Of the non-linear models, gradient boosting regression showed the highest performance in both whole blood (R2 = 0.08 using 290 boosting stages) and cerebellum (R2 = 0.13 using 290 boosting stages). The still-low performance justified the use of a more complex neural network-based approach.

Training a baseline contact model using directly assayed expression

Prior to training a neural network using GReX data, we trained a baseline contact prediction model using eRNA and gene transcript counts quantified via a nuclear run-on assay in the K562 cell line (Supplementary Fig. 6). As nuclear run-on assays are considered a gold standard for detecting nascent eRNA transcription, a model trained on these expression data should be an ideal benchmark against which a GReX-based model can be compared.

An initial neural network model was trained using eRNA and canonical gene transcription to predict chromatin contacts in the K562 cell line. Hyperparameter tuning was conducted across a pre-defined search space, and five-fold cross validation was used to quantify model performance (see Methods). The optimal model architecture (Supplementary Fig. 7a) consisted of two neurons in the input layer (for the normalized expression levels of the upstream and downstream transcripts), two hidden layers with 150 neurons each (using a ReLU activation function), and a single output neuron. Hidden weights were initialized using a normal distribution, while those in the output neuron were initialized with zeros. The model was trained over 90 epochs using a batch size of 160. The Adagrad20 optimizer was utilized with a learning rate of 0.2. This optimal model architecture achieved a mean R2 of 0.23 across the validation folds (Supplementary Fig. 7b) and 0.27 in the independent test set.
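To make the described architecture concrete, the following is a minimal, dependency-free sketch of a forward pass through a 2-150-150-1 fully connected network with ReLU hidden activations. It is not the authors' implementation: the standard deviation of the normal initializer, the zero biases, and all function names are assumptions introduced here for illustration, and no training loop (Adagrad, batching, epochs) is included.

```python
import random

def init_layer(n_in, n_out, rng=None, zeros=False, std=0.05):
    """Create (weights, biases) for one dense layer.

    Weights are drawn from N(0, std) unless zeros=True.
    Assumptions: std=0.05 and zero biases are not stated in the text.
    """
    if rng is None:
        rng = random.Random(0)
    weights = [[0.0 if zeros else rng.gauss(0.0, std) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, relu):
    """Apply one dense layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def build_model(seed=0):
    """Two inputs, two hidden layers of 150 units, one zero-initialized output."""
    rng = random.Random(seed)
    return [
        init_layer(2, 150, rng=rng),
        init_layer(150, 150, rng=rng),
        init_layer(150, 1, zeros=True),
    ]

def predict(model, upstream, downstream):
    """Predict contact frequency from the two normalized expression values."""
    h = [upstream, downstream]
    for i, layer in enumerate(model):
        h = dense(h, layer, relu=(i < len(model) - 1))
    return h[0]

def n_parameters(model):
    """Count trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in model)

model = build_model()
print(n_parameters(model))        # number of trainable parameters
print(predict(model, 0.3, -1.2))  # untrained prediction
```

One consequence of zero-initializing the output neuron, under the zero-bias assumption made here, is that every prediction is exactly zero before training, so learning starts from a neutral output regardless of the hidden-layer initialization.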

GReX of enhancer-gene pairs predicts chromatin contact frequency

We trained fully-connected neural networks to model the non-linear relationship between GReX and chromatin contact frequency in whole blood and cerebellum. The optimal model architecture for the whole blood-based model (Supplementary Fig. 8a) had two neurons in the input layer (for the GReX values of the upstream and downstream transcripts), two hidden layers with 120 neurons each (utilizing a hard sigmoid activation function), and a single output neuron. Hidden weights were initialized using a Kaiming uniform distribution, while those in the output neuron were initialized using a Kaiming normal distribution. The model was trained over 90 epochs with a batch size of 110, using the NAdam21 optimizer with a learning rate of 0.01. This optimal model achieved a mean R2 of 0.22 in both the validation folds (Supplementary Fig. 8b) and independent test set (Supplementary Fig. 8c). Notably, we found that the GReX-based model trained in whole blood maintained its predictive performance in the K562-derived nuclear run-on dataset (R2 = 0.15).

Among the 17,126 transcript pairs in the test set, we observed a median relative error E (see Methods) of 2.80 (Supplementary Data 2). The best-predicted contact pair in these data was between the enhancer ENSR00000041089 and gene PELI3 (true value = 6.0, prediction = 5.95, E = 7.0 × 10−3), a protein-coding gene associated with erythrocyte and lymphocyte counts22,23.
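The relative error E is defined in the Methods, which are not reproduced in this section. As a rough sketch, the code below assumes the common definition E = |true − predicted| / |true| and takes the median across pairs; both the definition and the function names are assumptions. Because the true and predicted values quoted in the text are rounded, they will not reproduce the reported E exactly.

```python
from statistics import median

def relative_error(true_value, prediction):
    """Assumed definition: absolute error scaled by the magnitude of the true value."""
    if true_value == 0:
        raise ValueError("relative error is undefined for a true value of zero")
    return abs(true_value - prediction) / abs(true_value)

def median_relative_error(true_values, predictions):
    """Median of the per-pair relative errors."""
    return median(relative_error(t, p) for t, p in zip(true_values, predictions))

# Hypothetical toy values, not taken from the paper's data.
observed = [1.0, 2.0, 4.0]
predicted = [0.5, 2.0, 5.0]
print(median_relative_error(observed, predicted))
```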

The optimal cerebellum-based model architecture (Fig. 3a) consisted of two hidden layers with 90 neurons each (using the Softsign activation function). Weights in these hidden layers were initialized following a Xavier normal distribution, while those in the output layer were initialized following a uniform distribution. The model was trained for 50 epochs with a batch size of 80, using Adagrad20 as the optimizer and a learning rate of 0.3. The optimal model achieved a mean R2 of 0.37 across the validation folds and 0.38 in the independent test set (Fig. 3b), capturing non-linear patterns of GReX associated with contact frequency (Fig. 3c). Interestingly, the two-dimensional GReX profile for contact pairs was found to be constrained to a subregion of the total space (Fig. 3c).

Fig. 3. Deep learning model with tissue-specific eRNA and canonical gene GReX as features predicts three-dimensional contact frequency.

a Grid search across 13 hyperparameters was used to find the optimal model architecture (see Methods). b The neural network was trained in cerebellum GReX and contact frequency data for 50 epochs, achieving a mean prediction R2 of 0.37 across the validation folds and 0.38 in the independent test set. Within a second tissue type not previously seen by the model, whole blood, we observed a prediction R2 of 0.18, which shows some cross-tissue portability. c Contact frequency prediction in the cerebellum test set (denoted by color) as a function of the GReX of the upstream and downstream transcript levels (x and y axes, respectively). The two-dimensional GReX space for contact pairs is constrained to lie in the colored region. d SHAP values representing the relative mean contributions of the upstream and downstream transcripts to contact frequency predictions, which we found to be 46.94% and 53.06%, respectively. Source data are provided as a Source Data file.

In addition to the enhanced prediction R2 in cerebellum, we observed a decreased median relative error in the cerebellum test set compared with whole blood (E = 0.85). The best-predicted contact pair among the 19,141 in these data was between enhancer ENSR00000186261 and WDR41 (true value = 1.0, prediction = 0.99, E = 1.80 × 10−4), a protein-coding gene implicated in frontotemporal dementia and amyotrophic lateral sclerosis24,25 (Supplementary Data 3).

Next, using the same respective test sets for whole blood and cerebellum, we utilized SHapley Additive exPlanations (SHAP)26 to determine the relative contributions of the upstream and downstream GReX features to model prediction. In both tissues, we found the mean relative contribution of the downstream transcript to be greater than that of the upstream transcript (60.67% versus 39.33% in whole blood and 53.06% versus 46.94% in cerebellum, Fig. 3d).
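For a model with only two input features, Shapley values can be written out exactly, which makes the idea behind the relative contributions easy to see. The sketch below is a simplification and not the SHAP library's procedure: it uses a single baseline point instead of a background dataset, and `toy_model`, the sample points, and the baseline are hypothetical stand-ins for the trained network and the test set.

```python
def shapley_two_features(f, x, baseline):
    """Exact Shapley values for a two-feature model relative to one baseline point.

    Each feature's value is its marginal effect averaged over the two possible
    orderings in which features can be switched from baseline to observed.
    """
    (x1, x2), (b1, b2) = x, baseline
    phi1 = 0.5 * ((f(x1, b2) - f(b1, b2)) + (f(x1, x2) - f(b1, x2)))
    phi2 = 0.5 * ((f(b1, x2) - f(b1, b2)) + (f(x1, x2) - f(x1, b2)))
    return phi1, phi2

def relative_contributions(f, samples, baseline):
    """Percentage share of summed |Shapley value| per feature across samples.

    Assumes at least one sample has a non-zero attribution.
    """
    total1 = total2 = 0.0
    for x in samples:
        phi1, phi2 = shapley_two_features(f, x, baseline)
        total1 += abs(phi1)
        total2 += abs(phi2)
    total = total1 + total2
    return 100.0 * total1 / total, 100.0 * total2 / total

def toy_model(upstream, downstream):
    """Hypothetical stand-in for a trained contact-frequency model."""
    return 0.5 * upstream + 1.5 * downstream + upstream * downstream

samples = [(1.0, 2.0), (-0.5, 0.3), (0.8, -1.1)]
up_pct, down_pct = relative_contributions(toy_model, samples, baseline=(0.0, 0.0))
print(f"upstream: {up_pct:.2f}%, downstream: {down_pct:.2f}%")
```

The two Shapley values for any sample sum to the difference between the model output at that sample and at the baseline, which is what makes the percentage split interpretable as a decomposition of the prediction.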

The 3D contact frequency model in each tissue converged with different optimal hyperparameters. We thus evaluated the cross-tissue portability. We postulated that a predictive model that can achieve good performance in a tissue not available during training may learn high-level principles underlying the global relationship between expression and chromatin contact, rather than more stochastic patterns specific to the training tissue.

We utilized the whole blood-based model to predict contact frequency across the previously unseen cerebellum data. Across the 95,701 unique enhancer-enhancer and enhancer-gene pairs in these cerebellum data, the model achieved a prediction R2 of only 0.01, a marked decrease from the R2 of 0.22 in the tissue-matched test set.

We also tested the performance of the cerebellum-trained model on the combined 85,630 transcript pairs in the whole blood data. The cerebellum-based neural network achieved markedly higher cross-tissue performance, with a prediction R2 of 0.18 in whole blood, nearly matching the performance of the whole blood-based model itself in this tissue.
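The cross-tissue comparison rests on the prediction R2. The text does not state which definition was used; the sketch below assumes the coefficient of determination, 1 − SS_res/SS_tot, which is one common choice (the squared Pearson correlation is another). The toy values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]
print(r_squared(observed, observed))     # perfect predictions
print(r_squared(observed, [2.5] * 4))    # always predicting the mean
```

Under this definition, a model that does no better than predicting the mean contact frequency scores 0, so a cross-tissue R2 of 0.01 indicates that the whole blood model carries almost no usable signal for cerebellum contacts.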

Contact frequency shows low negative correlation with genomic distance

Previous work has demonstrated that linear genomic distance between two loci is generally a predictor of their contact frequency27. However, unlike (tissue type independent) genomic distance, 3D chromatin interaction exhibits highly tissue-specific patterns, especially involving a non-ubiquitously expressed gene28.

To explore the degree to which genomic distance might be influencing our predictions, we calculated the Pearson correlation between the genomic distance separating each transcript pair in the deep learning models’ test sets and both the predicted and the observed contact frequency for that pair (Supplementary Fig. 9). Across the 17,126 transcript pairs in whole blood and the 19,141 in cerebellum, we found the correlation with both predicted (R = −0.05 and R = −0.08, respectively) and observed contact frequency (R = −0.10 in whole blood and R = −0.11 in cerebellum) to be low, suggesting that the deep learning-based contact frequency predictions were largely independent of genomic distance.
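The distance check amounts to a Pearson correlation between two vectors of equal length. A minimal sketch follows; the distances and contact frequencies are made-up toy values, and the function name is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical genomic distances (bp) and contact frequencies for five pairs.
distances = [10_000, 50_000, 120_000, 400_000, 900_000]
contacts = [5.1, 6.0, 2.2, 4.8, 3.9]
print(round(pearson_r(distances, contacts), 3))
```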

TWAS identifies eRNA and canonical gene associations with SCZ risk

A previous study8 trained FUSION-based29 models of enhancer expression in two brain regions, the dorsolateral prefrontal cortex (DLPFC) and anterior cingulate cortex (ACC). Utilizing these models, which included 8,702 unique enhancers, the authors performed a TWAS of SCZ30, yielding 98 enhancer associations outside of the major histocompatibility complex (MHC) region. Using our eRNA models encompassing a total of 14,471 unique enhancers expressed across 49 cell and tissue types, including a more diverse sampling of brain regions, we performed TWAS using the same SCZ GWAS results. Despite a higher multiple testing threshold for genome-wide significance (p < 1.23 × 10−6), we identified 392 significant enhancer-tissue associations outside of the MHC region (Fig. 4a, b), including 114 in the brain (Fig. 4c, d), representing 133 unique enhancers. Top associations in the brain include ENSR00000320019 (p = 1.31 × 10−28 in amygdala), ENSR00000320042 (p = 5.94 × 10−26 in substantia nigra), ENSR00000195227 (p = 2.48 × 10−21 in cortex), ENSR00000032823 (p = 3.67 × 10−16 in cerebellum), and ENSR00000195227 (p = 3.24 × 10−15 in hippocampus).

Fig. 4. eRNA-based TWAS of schizophrenia with corresponding GWAS and canonical gene TWAS associations.

TWAS was performed using the summary statistics from a logistic regression-based GWAS of schizophrenia30 (plotted on the bottom half of a and c in grayscale). The red line in these plots indicates the Bonferroni-corrected genome-wide significance threshold (p = 5 × 10−8), and the blue line indicates suggestive significance (p = 1 × 10−6). Because the eRNA-based TWAS (plotted on the upper half of each plot) included 40,749 eRNA-tissue pairs (14,471 unique transcripts), the Bonferroni-corrected p-value threshold used was 1.23 × 10−6 (plotted in red) for genome-wide significance and 3.46 × 10−6 (plotted in blue) for suggestive significance. Because the canonical-gene-based TWAS (plotted on the lower half of b and d) included 344,814 gene-tissue pairs (26,138 unique transcripts), the Bonferroni-corrected p-value threshold used was 1.45 × 10−7 (plotted in red) for genome-wide significance and 1.91 × 10−6 (plotted in blue) for suggestive significance. a The full set of eRNA TWAS results for all 49 GTEx tissues plotted to mirror the GWAS results. b eRNA-based and canonical-gene-based TWAS results across all 49 tissues. c The eRNA-based TWAS results from 13 brain-derived tissues plotted against the GWAS results. d Brain-based eRNA and canonical-gene TWAS results. e Hi-C contact data (10 kb resolution) from primary astrocytes of the cerebellum were used to identify enhancer-gene contacts in the brain. Significant transcribed enhancer and canonical gene associations from TWAS in physical contact were tested for putative causality. The heatmap on the left depicts all Hi-C contacts (from chr20) prior to filtering, with the color scale corresponding to normalized contact frequency. The righthand plot shows a subset of those contacts between a transcribed enhancer and canonical gene.
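The TWAS thresholds quoted in the Fig. 4 legend can be reproduced with a Bonferroni correction. The sketch below assumes a family-wise alpha of 0.05, which is not stated explicitly in the text but matches the reported values; it also suggests that the suggestive thresholds correspond to correcting for the number of unique transcripts rather than transcript-tissue pairs, which is an inference from the numbers rather than a statement by the authors.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold controlling the family-wise error rate at alpha."""
    return alpha / n_tests

# Test counts taken from the Fig. 4 legend; alpha = 0.05 is an assumption.
counts = {
    "eRNA-tissue pairs": 40_749,
    "unique eRNA transcripts": 14_471,
    "gene-tissue pairs": 344_814,
    "unique gene transcripts": 26_138,
}

for label, n_tests in counts.items():
    print(f"{label}: p < {bonferroni_threshold(n_tests):.2e}")
```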

Next, we ran TWAS using PrediXcan models of canonical gene expression previously trained in the same cell and tissue types14. Compared with the FUSION-based gene expression models, which included 10,669 unique genes across two brain regions (204 of which were associated with SCZ), the PrediXcan-based models included 26,133 unique genes expressed across 49 cell and tissue types, including 17,843 genes expressed in one or more brain regions. Across the cell and tissue types, we identified 2755 gene-tissue associations outside of the MHC region that reached genome-wide significance (p < 1.45 × 10−7), including 498 in the brain. Top hits within the brain include PRSS16 (p = 3.53 × 10−29 in cerebellar hemisphere and p = 5.30 × 10−26 in cerebellum), HFE (p = 3.71 × 10−27 in cortex), BTN3A3 (p = 1.58 × 10−26 in cortex), CNNM2 (p = 4.72 × 10−24 in frontal cortex BA9), and GNL3 (p = 8.15 × 10−24 in cortex), each of which has been previously implicated as a SCZ risk gene31–36.

We investigated whether the original GWAS could have indirectly implicated these TWAS associations. Overall, of the 206 independent non-MHC loci that reached genome-wide significance in the original GWAS (p < 5 × 10−8), 6 had only an eRNA association, and 92 had only a canonical gene association. A further 45 loci had both an eRNA and canonical gene association in at least one tissue type, indicating a notably high degree of mirroring between eRNA and canonical-gene-based TWAS associations (Fig. 4b and d). Collectively, the total number of eRNA TWAS associations in the brain (114) could not be fully accounted for by the findings from the original GWAS, indicating the presence of independent eRNA associations with disease.

Mendelian randomization illuminates causal mechanisms underlying SCZ risk

We utilized Mendelian randomization (MR) to investigate the causality of the eRNA and canonical gene associations with SCZ risk. Specifically, we conducted this analysis for each genome-wide significant association (p < 1.23 × 10−6 for eRNA associations and p < 1.45 × 10−7 for canonical gene associations).

Of the 392 significant TWAS eRNA associations outside of the MHC region, 222 (56.63%) showed evidence of causal effect on SCZ from MR, representing 104 unique transcribed enhancers (Supplementary Data 4). Of these putatively causal eRNA effects, 64 were in the brain, representing 34 unique enhancers. Out of the 2755 significant gene associations identified in the canonical-gene-based TWAS, 1297 (47.07%) were predicted to be causal, representing 396 unique genes (Supplementary Data 5). 226 of these putatively causal effects were in the brain, representing 130 unique genes.

The presence of putatively causal enhancers and canonical genes in a disease-associated locus raises the question of which transcript (eRNA or canonical gene) is causally “upstream” in the regulatory network. We asked whether an eRNA’s causal effect on SCZ risk was due to the influence of eRNA expression on contact with a canonical gene in cis or, alternatively, whether the eRNA association was not mediated through the expression of a canonical gene. To evaluate physical contact, we leveraged high-resolution enhancer-gene contact maps captured by Hi-C in astrocytes of the cerebellum. In brain, we identified 14 causal (based on MR) transcribed enhancers that were in physical contact with a causal canonical gene, representing 10 unique enhancer-gene pairs that were composed of four unique eRNAs and nine unique canonical genes (Supplementary Data 6). We also identified 57 causal enhancer-tissue pairs that were not in contact with a causal canonical gene in the Hi-C data, representing 33 unique eRNAs. Finally, we found 232 causal canonical gene-tissue pairs in the brain that were not in contact with a causal eRNA, representing 127 unique genes. The low proportion of physical interaction between causal enhancers and causal canonical genes (14 of 303, or only ~4.6% of all causal transcript-tissue pairs) suggests that the mechanisms of eRNAs and genes in the context of SCZ risk are largely independent.

Exploring the epigenomic context of causal eRNAs associated with SCZ

Several potential mechanisms of eRNA activity have previously been described in the literature2, which can largely be divided into two main classes: contact-dependent and contact-independent mechanisms. Under the contact-dependent model, eRNAs interact with the cohesin complex to promote interactions between the transcribed enhancer and nearby genes5,37. Under a contact-independent model, by contrast, transcribed eRNAs recruit transcription factors and other important chromatin modifiers to mediate chromatin state and activity level. For example, eRNA expression has been shown to increase recruitment of transcription factors such as YY1 (refs. 6,38). eRNAs can also interact with and recruit histone acetyltransferases CBP and p300 to increase H3K27ac (refs. 39,40). Transcribed enhancers may also interact with the PRC2 complex, inhibiting deposition of the repressive mark H3K27me3 (refs. 40,41). The resulting open chromatin state can then facilitate assembly of super enhancer complexes and/or transcription of nearby genes42.

To explore by which of these proposed mechanisms (contact-dependent or contact-independent) the brain-specific, causal eRNAs might influence SCZ risk, we utilized chromatin accessibility profiles (ATAC-seq and DNase-seq), as well as ChIP-seq of functionally informative histone modifications (H3K27ac, H3K27me3, and H3K4me1) and chromatin-associated proteins (RAD21, SMC3, CTCF, and EP300) in the SK-N-SH neuronal cell line to illuminate the regulatory landscape of these disease-associated transcripts. If these causal eRNAs influence SCZ risk via direct mediation of enhancer-gene interactions, we would expect to observe a strong enrichment of either cohesin complex subcomponents RAD21 and SMC3 or CTCF (Fig. 5a). However, of the 34 unique causal eRNAs tested, we observed RAD21 enrichment in only three (~9%), SMC3 enrichment in only one (~3%), and CTCF enrichment in only two (~6%) (Supplementary Fig. 10a). Notably, we did not find enrichment of any of these chromatin contact-associated proteins within the subset of these eRNAs in 3D contact with a causal gene. Based on this low enrichment of chromatin contact-associated proteins, in addition to the overall low proportion of causal eRNAs in physical contact with a causal canonical gene, we conclude that these results do not support the contact mediation model as a plausible explanation for eRNA associations with SCZ risk.

Fig. 5. Functional analysis of causal eRNAs suggests that these transcripts mediate SCZ risk via contact-independent rather than contact-dependent mechanisms.

a We hypothesized that if the causal eRNAs influenced SCZ risk via contact-dependent mechanisms, we should observe positive enrichment of cohesin complex sub-components RAD21 and SMC3 or CTCF. If the causal eRNAs influenced SCZ risk via contact-independent mechanisms, we hypothesized that one should observe enrichment of chromatin accessibility, histone marks H3K27ac and H3K4me1, and histone acetyltransferase EP300, as well as a depletion of H3K27me3. We would also expect an absence of RAD21, SMC3, and CTCF. Our functional analyses supported a contact-independent model. Expected enrichment is indicated with a + and marked in green, while expected depletion is indicated with a – and marked in red. b Transcription factor binding motif enrichment analysis using log likelihood ratio identified motifs for 135 unique factors that were significantly enriched in the sequences of the causal SCZ-associated eRNAs. Several of these TFs (EGR2 (refs. 43–45), SOX10 (refs. 46,47), TCF4/ITF2 (refs. 48,49), and SP4 (refs. 50,51)) have previously been implicated in SCZ. The significance threshold (FDR < 0.05) is indicated in red. Source data are provided as a Source Data file. a was created in BioRender118.

We next explored an alternative model to explain the potential mechanism by which the causal eRNAs mediate risk for SCZ. If causal eRNA expression plays a role in maintaining an open chromatin state, we should expect to observe an enrichment of ATAC-seq and DNase-seq peaks, as well as EP300 and H3K27ac, in addition to a depletion of H3K27me3. If there is super enhancer assembly in these regions, we would also expect to observe enrichment of histone mark H3K4me1. In concordance with these expectations, we observed overlap with either ATAC-seq or DNase-seq peaks in 12 (~35%) of the 34 causal eRNA regions, EP300 or H3K27ac enrichment in seven (~21%), and H3K4me1 enrichment in six (~18%) (Supplementary Fig. 10b). In total, we observed enrichment for at least one of these open chromatin-associated marks in 25 (~74%) of the 34 causal eRNAs. Notably, H3K27me3 was completely absent from these same causal enhancer regions. These results collectively support the contact-independent model of eRNA activity, in which causal transcribed enhancers influence SCZ risk via their role in establishing an open chromatin state.

Motif enrichment analysis within SCZ-associated eRNAs

In addition to recruitment of chromatin modifiers, contact-independent eRNAs can also “capture” transcription factors (TFs), increasing TF occupancy at the transcribed enhancer [38]. To determine whether SCZ-associated enhancers were enriched for specific transcription factor binding sites, we performed motif enrichment analysis for all putative causal eRNA associations in brain (Supplementary Data 7 and 8). Across these 34 unique sequences, we observed motif enrichment (FDR < 0.05) for 135 TFs (Fig. 5b), several of which (EGR2 [43–45], SOX10 [46,47], TCF4/ITF2 [48,49], and SP4 [50,51]) have previously been implicated in SCZ. Enrichment of these factors’ binding motifs at SCZ-associated eRNAs suggests a possible trans mechanism of eRNA-mediated SCZ risk, by which genetically regulated eRNA expression levels influence TF occupancy at disease-relevant enhancers.

Phenome-wide TWAS of eRNAs in the UK Biobank

We performed TWAS across 4,671 complex traits using the eRNA models. We identified 467 traits with at least one genome-wide significant (p < 2.60 × 10−10) eRNA association (Fig. 6a), representing 88,348 significant eRNA-tissue associations (Supplementary Data 9-11).

Fig. 6. eRNA eQTLs are associated with complex traits across the phenome and help to explain 63% more GWAS signals than canonical gene eQTLs alone.

a Depicted are the top 50 heritable UK Biobank traits ranked by number of significant (p < 2.60 × 10−10) eRNA-tissue TWAS associations. The color for each phenotype corresponds to the p-value of its most significant eRNA association. b Using a Bayesian framework, we identified 18,815 genome-wide significant (p < 5 × 10−8) GWAS signals within the UK Biobank that only colocalized (posterior probability > 0.7) with an eRNA eQTL, representing a 63% increase over the number of independent associations that can be explained by canonical gene eQTLs alone. Source data are provided as a Source Data file.

Traits most highly enriched for significant eRNA associations include mean signal-to-noise ratio, a measure of hearing ability (1430 significant eRNA-tissue associations, lead association p = 6.49 × 10−143), as well as sitting height, a well-studied phenotype known to be highly polygenic [52–54] (1094 significant eRNA-tissue associations, lead association p = 1.71 × 10−106). Additional notable phenotypes highly enriched for significant eRNA-tissue associations included blood cell traits such as platelet distribution width (676 associations), neutrophil percentage (625 associations), hemoglobin concentration (583 associations), lymphocyte count (562 associations), and leukocyte count (394 associations); as well as smoking status (554 associations), cholesterol (536 associations), manifestations of mania or irritability (374 associations), weight (357 associations), and skin color (221 associations). We present the TWAS summary statistics as a comprehensive, tissue-wide resource of eRNA associations across the human phenome.

Enhancer perturbation links eRNAs to canonical target gene expression

Epigenomic analysis of causal SCZ-associated eRNAs supported a contact-independent, rather than contact-dependent, model. These data alone, however, cannot fully explain the underlying mechanisms by which eRNA expression influences disease risk. To investigate whether disease-associated eRNAs have an effect on canonical gene expression, we utilized CRISPR perturbation assays in the K562 cell line targeting 109 complex trait-associated, transcribed enhancers identified by TWAS (Supplementary Data 12-13). Of these 109 enhancers, we identified 14 (12.84%) for which CRISPR perturbation resulted in a significant change in expression of a corresponding gene. Notably, we did not observe Hi-C contacts between any of these 14 eRNAs and their target gene(s). We also performed eQTL mapping in whole blood for both eRNAs and canonical genes. Among the 22 unique eRNA-gene pairs identified using CRISPR perturbation, we observed only one mapped SNP eQTL (FDR < 0.1) that was shared by both the eRNA and canonical gene in a pair. Thus, these eRNAs appear to mediate canonical gene expression independently of chromatin contacts or shared SNP eQTLs.

The disease-relevant eRNAs linked to canonical gene expression included ENSR00000013481 and ENSR00000032851, both associated with SCZ; ENSR00000117322, associated with manifestations of mania or irritability; and ENSR00000320642, associated with hypertension. Perturbation of SCZ-associated eRNAs ENSR00000013481 and ENSR00000032851 resulted in decreased expression of VPS45 (fold change = 0.75, p = 1.21 × 10−3) and NT5C2 (fold change = 0.74, p = 7.66 × 10−4), respectively. Both genes have previously been implicated in SCZ and reach significance in our canonical gene-based TWAS of SCZ [55–57].

In addition to the two SCZ-associated eRNAs, perturbation of ENSR00000117322 (associated with manifestations of mania or irritability) resulted in decreased expression of the SCZ-associated gene RTN4 (fold change = 0.83, p = 6.91 × 10−9), while perturbation of hypertension-associated eRNA ENSR00000320642 resulted in decreased MRPS10 expression (fold change = 0.82, p = 1.42 × 10−8). MRPS10 codes for a mitochondrial ribosomal protein and has previously been linked to cardiac disorders [58–60]. Neither of these eRNA-linked genes reached significance in a canonical gene-based TWAS.

Inclusion of eRNAs increases GWAS signals explained by an eQTL

Recent work reports systematic differences in identified genetic effects on complex traits (through GWAS) and gene expression (via eQTL mapping) [61], so that most GWAS signals are not explained by known eQTLs [62–64]. A number of strategies may bridge this colocalization gap, including utilizing larger sample sizes for increased eQTL mapping power, mapping eQTLs in more expansive sets of cell and tissue types, as well as considering QTLs modulating chromatin structure and splicing [61]. We investigated whether the use of eRNA eQTLs might improve GWAS signal interpretability.

Across all tissues, we observed a mean Jaccard similarity index of 0.05 between eRNA eQTLs (Methods) and canonical gene eQTLs, indicating a notably low overlap between the two eQTL sub-classes (Supplementary Fig. 11). Overlap was highest in the testis (Jaccard similarity index = 0.09) and lowest in brain putamen (Jaccard similarity index = 0.03).

The previously reported systematic discordance between GWAS signals and canonical gene eQTLs relies partly on the observation that eQTLs tend to be proximal to genes, while GWAS signals are more distal61. We found that, like GWAS hits, eQTLs specific to eRNAs were significantly more distal to a TSS than canonical gene eQTLs (median distance 41.93 kb versus 14.36 kb, Mann-Whitney U test p < 2.2 × 10−16).

We finally aimed to explore whether considering eRNA eQTLs might help to bridge the gap between eQTLs and GWAS signals. Using the same sets of mapped eQTLs, we performed colocalization analysis [65] across 996,391 independent, genome-wide significant (p < 5 × 10−8) associations from the UK Biobank [10], representing 4,671 complex traits. We identified 26,926 GWAS associations that colocalized with an eRNA eQTL (posterior probability ≥ 0.7) in at least one tissue, and 29,669 GWAS associations that colocalized with a canonical gene eQTL (Supplementary Data 14-15). In total, there were 48,484 GWAS associations that colocalized with either an eRNA or canonical gene eQTL, with 8,111 (16.75%) colocalizing with both and the remaining 40,303 (83.25%) colocalizing with only one of the eQTL sub-classes. Notably, we identified 18,815 GWAS signals exclusively colocalizing with an eRNA eQTL, a 63% increase over the number of GWAS signals that colocalize with canonical gene eQTLs alone (Fig. 6b). Collectively, these results show that eRNA analysis can substantially improve our ability to mechanistically interpret GWAS associations.
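As a brief arithmetic check of how the 63% figure follows from the counts reported above, the short sketch below applies inclusion-exclusion to the two colocalized sets. The variable names are illustrative only and are not taken from the authors' code.

```python
# Counts reported in the text above.
erna_coloc = 26_926   # GWAS associations colocalizing with an eRNA eQTL
gene_coloc = 29_669   # GWAS associations colocalizing with a canonical gene eQTL
both_coloc = 8_111    # associations colocalizing with both eQTL sub-classes

# Inclusion-exclusion gives the number colocalizing with either sub-class.
either_coloc = erna_coloc + gene_coloc - both_coloc

# Associations explained exclusively by an eRNA eQTL.
erna_only = erna_coloc - both_coloc

# Relative gain over what canonical gene eQTLs explain on their own.
percent_increase = 100 * erna_only / gene_coloc

print(either_coloc)             # 48484
print(erna_only)                # 18815
print(round(percent_increase))  # 63
```

The denominator of the percentage is the canonical gene count (29,669), which is why the gain is expressed relative to canonical gene eQTLs alone.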

Discussion

In this work, we develop in silico models of genetically regulated eRNA expression and present several notable findings, from chromatin contact frequency prediction to eRNA-based TWAS. Across all tissues, eRNA expression showed a significantly higher mean number of SNP eQTLs per transcript versus canonical gene expression. We propose two possible explanations. Since enhancers have a high degree of tissue-specific and context-dependent activity [66–70], eRNA expression likewise may be under a higher degree of fine-tuned genetic control than protein-coding genes. Alternatively, the high number of SNP eQTLs per eRNA could also be due to the rapid evolution of enhancer sequences relative to those of canonical genes [71]. Enhancers, as non-coding elements in the genome, may exhibit a higher tolerance to mutation than a typical protein-coding gene.

We imputed genetically determined eRNA expression and canonical gene expression across more than 70,000 individuals in a large-scale biobank, BioVU. The mean GReX values generated for two of the tissues (whole blood and cerebellum) were subsequently used to train two independent neural networks to predict 3D chromatin contact frequency. Previous work has demonstrated that genomic distance, Hi-C, and other 1D epigenomic datasets can be used to predict chromatin contact [27,72–77]. Here, we demonstrate that GReX, too, is predictive of chromatin contact frequency.

Our deep learning models, trained in cerebellum and whole blood, predicted tissue-dependent contact frequency with substantially greater performance than both the corresponding linear and non-linear regression models, which showed limited predictive power. Notably, the whole blood-based GReX model retained comparable predictive accuracy in a genome-wide nuclear run-on dataset from K562, indicating that the model captures key patterns underlying the relationship between expression and contact frequency.

While the cerebellum-based (deep learning) model appeared to be portable to a tissue not observed during training (R2 = 0.18 in whole blood), the whole blood-based model had limited cross-tissue predictive power (R2 = 0.01 in cerebellum). Two factors may explain this discrepancy in cross-tissue performance. First, relative to other tissue types, whole blood is highly heterogeneous, composed of a variety of different cell types and metabolites whose relative proportions can be highly variable within a given sample [78–80]. Second, gene expression in whole blood is highly dynamic relative to other tissues, with rapid changes in expression patterns induced in response to a wide range of environmental stimuli, including even the changing of the seasons [81]. The lower stochasticity of cerebellum expression could account, at least in part, for the greater cross-tissue generalizability of the cerebellum-based model, capturing fundamental patterns in the relationship between GReX and contact frequency.

The use of functional information enhances our ability to interpret the phenotypic consequences of complex genetic variation [82]. Because of cost, time, and limited sample volumes, it is often not feasible to generate rich omics datasets at a biobank scale. In recent years, approaches such as PrediXcan [13] have offered a means of circumventing this barrier in silico, allowing researchers to impute genetically regulated expression, both at the gene level and even for individual splice isoforms [83,84], using only germline variation. Generating a high-resolution tissue-specific chromatin contact map remains methodologically challenging due to the quadratic scale of the data. Here, we have utilized the GReX models to gain further insights into 3D chromatin organization, coupling genetically determined transcription and contact maps.

Because the vast majority of genetic associations with complex disease are within non-coding regions of the genome [85], TWAS have become a valuable tool for inferring the relevant gene(s) in a GWAS locus [86]. A previous TWAS of schizophrenia, for example, found that the disease-associated, non-coding genetic variation identified at locus 16p11.2 results in increased expression of the gene MAPK3 [87]. The mechanistic contribution of eRNAs to disease risk remains to be elucidated. Recent work focused on neuropsychiatric phenotypes strongly suggested that enhancer expression quantitative trait loci (EeQTLs), or SNPs regulating the transcription of enhancers in the brain, can be used to investigate disease mechanisms [8]. The authors concluded that most of the disease heritability captured by eRNAs is independent of that explained by eQTLs of protein-coding genes. Based on these initial findings in the brain, eRNA expression in other tissues is likely to be a broadly important contributor to disease biology across the human phenome. We therefore make available eRNA-based models trained in a large collection of tissues to facilitate downstream genomic applications.

Using eRNA-based TWAS, we identified 392 eRNA-tissue associations with schizophrenia, representing 104 unique eRNAs. This same approach was then applied to a set of phenome-wide GWAS traits in the UK Biobank [10], which we present as a curated resource for elucidating the phenomic consequences of enhancers. Among a sample of 109 complex trait-associated eRNAs, experimental perturbation resulted in a change in canonical gene expression for ~13% (14 of 109). Because these CRISPR perturbation results are derived from the leukemic K562 cell line, it is possible that a larger proportion of these eRNAs may regulate canonical gene expression in other cell types. Notably, among those that were linked to a canonical gene in K562, no eRNA-gene pairs showed evidence of physical interaction in Hi-C contact data. We also observed almost no eQTL sharing between eRNAs and canonical genes across linked pairs (a single shared SNP eQTL). These results suggest that rather than a pleiotropic model, in which the same set of eQTLs regulates GReX of both eRNA and canonical gene transcription at a given locus, eRNA and canonical GReX are under largely independent mechanisms of genetic control.

Due to the low frequency of observed contacts between complex trait-associated eRNAs and canonical genes, we explored the epigenomic context of these enhancer sequences to further illuminate potential mechanisms by which these causal eRNAs mediate disease risk. Among causal eRNAs associated with SCZ, we observed high enrichment for functional features associated with open chromatin state and relatively low enrichment of features associated with chromatin loop formation. One key caveat of these functional analyses is that the ChIP-seq assays utilized, particularly those of EP300, RAD21, SMC3, and CTCF, profile protein binding to DNA rather than RNA. Due to the unavailability of RIP-seq and CLIP-seq profiles of these factors in neuronal cells, we are unable to directly assess eRNA-protein interactions and must instead use DNA binding as a proxy. A more comprehensive characterization of the epigenomic consequences of eRNA perturbation represents an exciting direction for future work.

The functional results that we do present, however, along with significant binding motif enrichment within causal eRNAs for SCZ-associated TFs, strongly suggest that these causal disease-associated eRNAs act via contact-independent rather than contact-dependent mechanisms. Under this model, genetically regulated eRNA expression levels influence both the recruitment of chromatin modifiers and TF occupancy at corresponding enhancers. Genetic control of these contact-independent regulatory mechanisms would likely influence canonical gene expression downstream, as supported by CRISPR perturbation of some of these disease-associated enhancers.

Finally, within the comprehensive catalog of disease-associated eRNAs in the UK Biobank, we show that the use of eRNAs holds promise in closing the so-called colocalization gap with GWAS traits, facilitating a 63% increase in the number of significant GWAS associations (from the UK Biobank) that can be explained by eQTLs relative to the use of canonical genes alone.

Our study has some important limitations. The underlying eRNA expression data on which our GReX models are trained are enriched for polyadenylated transcripts [88]. While 1D eRNAs are similar in structure to mRNAs and lncRNAs (long, spliced, poly-adenylated, and unidirectionally transcribed), 2D eRNAs, which constitute the majority of eRNAs, are structurally distinct from canonical gene RNAs (short, unspliced, non-polyadenylated, and bidirectionally transcribed). Thus, our eRNA dataset likely underrepresents the full set of 2D eRNAs detectable using a direct quantification approach such as a nuclear run-on assay. However, transcript length characterization of our dataset suggests that over 90% of eRNA transcripts included in these models exhibit characteristics of 2D eRNAs, which comprise the majority of naturally occurring eRNAs [15]. In addition, approximately 85% of GTEx samples were derived from individuals of European ancestry [89]. Thus, the generalizability of these models to non-European ancestries remains to be investigated [90]. Furthermore, the neural network approach we implemented for contact frequency prediction captures only a proportion of the variation in 3D chromatin organization. As multiple enhancers may jointly regulate a target gene, additional local features in the form of GReX may further enhance the contact map prediction performance. Nevertheless, as demonstrated here, these eRNA models provide a powerful set of tools to explore a relatively understudied component of the genome that is ripe for new discoveries.

Methods

Ethics

This research includes genetic data from deceased human individuals from the GTEx Project9. The protected data for the GTEx Project (for example, genotype and RNA-seq data) are available via access request to dbGaP accession no. phs000424.v8.p2.

Analyses using BioVU data comply with all ethical regulations as approved by the Vanderbilt University Medical Center institutional review board (IRB 151187 and IRB 160372). All requests for raw (for example, genotype and phenotype) data and materials are reviewed by Vanderbilt University Medical Center to determine whether the request is subject to any intellectual property or confidentiality obligations. For example, patient-related data not included in the paper may be subject to patient confidentiality. Any such data and materials that can be shared will be released via a material transfer agreement.

In silico genetic models of eRNA expression

The Human enhancer RNA Atlas (HeRA) [91] (https://hanlab.uth.edu/HeRA/) provides a publicly available resource of eRNA expression values processed from the raw GTEx [9,92] RNA-seq reads. These expression values were first normalized using the first 5 principal components, the first 5 PEER covariates [93], age, sex, and sequencing platform. Next, using an implementation of PrediXcan [13] provided by MR-JTI [14], we trained models for each of the 49 GTEx cell and tissue types using whole genome sequencing (WGS) and the corresponding quantified eRNA expression data generated from 507 donors. PrediXcan models of canonical gene expression were previously trained using the implementation provided in the MR-JTI repository and retrieved from Zenodo under accession code 3842289 (JTI) [94].

Training set for nuclear run-on-based chromatin contact prediction

Because both eRNAs and canonical genes are 5′-capped [95], a K562 GRO-cap dataset enriched for capped RNAs was obtained from the ENCODE Data Portal [96] (accession number ENCSR363AKK) in bigWig format. Files quantifying transcription from the plus and minus strands were combined and converted to bedGraph format. Using bedtools [97] (v2.30.0), transcripts in the nuclear run-on assay that overlapped with a known human eRNA annotated by the ENSEMBL [98], FANTOM5 [99], or Roadmap [96] consortia or a known GENCODE [100,101] (v32) gene were identified. Transcript counts annotated with the same eRNA or gene were combined. Prior to model training, these final expression values were log1p-normalized.

Training set for GReX-based chromatin contact prediction

The trained models of eRNA and canonical gene expression were utilized to impute GReX in 72,828 BioVU samples across all 49 cell and tissue types. The BioVU data consisted of individuals of European ancestry (31,861 males, 40,584 females, and 383 with unknown sex) genotyped on the Illumina MEGA array, followed by genotype imputation using the HRC panel. Age among individuals in the sample ranged from 0-90, with a median age of 56.

High-resolution Hi-C datasets that had been generated in the K562 leukemic cell line (accession number 4DNFI18UHVRO) and astrocytes of the cerebellum (accession number 4DNFIWCAQUIK) were retrieved from the 4D Nucleome19 Data Portal (https://data.4dnucleome.org) in mcool format. Raw sequencing reads underwent initial processing, contact matrix aggregation, and normalization using the gold standard Hi-C processing pipeline detailed at https://data.4dnucleome.org/resources/data-analysis/hi_c-processing-pipeline. The Hi-C dataset representing whole blood (K562) included 907,136,828 filtered reads, while the cerebellum-based dataset included 428,475,763. Both datasets showed similar quality control metrics, such as cis/trans ratio, % long-range intrachromosomal reads, and very good convergence (Supplementary Data 16). Contacts were normalized using the ICE (iterative correction and eigenvalue decomposition) algorithm102.

With the processed mcool file, Cooler (v0.8.2) [103] was utilized to export contacts at a 10 kb resolution using the cooler dump function, and an R script was written to convert these outputs to BEDPE format (see GitHub [104]). We defined an initial set of contact regions as those with one or more normalized contacts in each Hi-C dataset. We then filtered these contact pairs down to a subset in which the respective 10 kb contact regions overlapped with either two annotated eRNAs or an eRNA and canonical gene. The eRNA annotations used were from the same collection utilized by the authors of the Human enhancer RNA Atlas (HeRA) [91], and consisted of human eRNAs annotated by the ENSEMBL [98], FANTOM5 [99], and Roadmap [96] consortia. Canonical gene annotations were obtained from GENCODE (v32) [100,101].
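The filtering step described above can be sketched in a few lines of Python. This is only an illustration of the overlap logic under simplifying assumptions (half-open BED-style coordinates, a linear scan instead of an interval index); the function names, feature IDs, and toy coordinates are hypothetical and do not reproduce the authors' R script.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]          # (chromosome, start, end), half-open
Feature = Tuple[Interval, str]           # (interval, feature name)
Contact = Tuple[Interval, Interval, float]  # (bin 1, bin 2, normalized frequency)

def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def hits(bin_: Interval, features: List[Feature]) -> List[str]:
    """Names of all features overlapping a contact bin."""
    return [name for interval, name in features if overlaps(bin_, interval)]

def filter_contacts(contacts: List[Contact],
                    ernas: List[Feature],
                    genes: List[Feature]) -> List[Contact]:
    """Keep pairs whose bins link two eRNAs or an eRNA and a canonical gene."""
    kept = []
    for bin1, bin2, freq in contacts:
        e1, e2 = hits(bin1, ernas), hits(bin2, ernas)
        g1, g2 = hits(bin1, genes), hits(bin2, genes)
        if (e1 and e2) or (e1 and g2) or (g1 and e2):
            kept.append((bin1, bin2, freq))
    return kept

# Toy annotations and 10 kb contact bins (all values hypothetical).
ernas = [(("chr1", 12_000, 12_500), "eRNA_A"), (("chr1", 55_000, 55_400), "eRNA_B")]
genes = [(("chr1", 31_000, 38_000), "GENE_X")]
contacts = [
    (("chr1", 10_000, 20_000), ("chr1", 50_000, 60_000), 3.2),  # eRNA-eRNA
    (("chr1", 10_000, 20_000), ("chr1", 30_000, 40_000), 1.1),  # eRNA-gene
    (("chr1", 30_000, 40_000), ("chr1", 70_000, 80_000), 0.7),  # gene only
]
kept = filter_contacts(contacts, ernas, genes)
print(len(kept))  # 2
```

For genome-wide data, a sorted or indexed interval structure would replace the linear scan; the selection rule itself would stay the same.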

These contact data, along with mean eRNA and canonical gene GReX values for whole blood and cerebellum in BioVU, respectively, comprised the training set. The resulting dataset contained 85,630 unique genome-wide enhancer-gene contact pairs in whole blood and 95,701 in cerebellum. Because the genomic distance distribution was left-skewed, these values underwent log1p transformation.

Linear regression models of chromatin contact frequency

Using scikit-learn [105] (v1.2.2), two linear regression models were fit to predict contact frequency for enhancer-enhancer and enhancer-gene pairs, one based on genomic distance between the two elements and the other based on GReX of the respective eRNA and gene. An 80/20 train-test split was used, meaning that the whole blood training set consisted of 68,504 unique contact pairs, while the held-out test set was composed of 17,126. The cerebellum training and test sets contained 76,560 and 19,141 contact pairs, respectively.

Non-linear models of chromatin contact frequency

Using scikit-learn105 (v1.2.2), four non-linear models (polynomial regression, random forest regression, support vector regression, and gradient boosting regression) were fit to predict enhancer-enhancer and enhancer-gene contact frequency using GReX data from both whole blood and cerebellum. An 80/20 train-validation split was used with grid search to select the optimal set of hyperparameters for each model.

Polynomial regression grid search utilized degrees ranging from 1 to 10. For random forest regression grid search, we tested tree numbers ranging from 10 to 100. Support vector regression grid search tested epsilon values ranging from 0.1 to 1. Gradient boosting regression grid search tested boosting stage numbers ranging from 10 to 300. R2 was used as the selection metric for determining each best-performing model.
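As an illustration of this selection procedure, the sketch below runs a small grid search over the number of trees in a random forest, scoring each candidate by R2 on a held-out 20% validation split. The synthetic features and target are stand-ins for GReX values and contact frequency, and the three-value grid is an assumption chosen for speed; it is not the authors' grid or code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: two GReX-like features per pair and a noisy target.
X = rng.normal(size=(200, 2))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)

# 80/20 train-validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate tree counts (a subset of the 10 to 100 range, for illustration).
grid = [10, 50, 100]
scores = {}
for n_trees in grid:
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    # For regressors, score() returns R^2 on the supplied data.
    scores[n_trees] = model.score(X_val, y_val)

# Select the setting with the highest validation R^2.
best_n_trees = max(scores, key=scores.get)
print(best_n_trees, scores[best_n_trees])
```

The same loop applies to the other model families by swapping the estimator and the hyperparameter being searched (polynomial degree, epsilon, or number of boosting stages).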

Training deep learning models based on eRNA and canonical gene GReX

All deep learning models were trained on an Nvidia Tesla K80 GPU running CUDA106. Using PyTorch107 (v1.13.1), two fully connected neural networks were trained to predict enhancer-gene contact frequency. The first model was trained using solely the mean genetically regulated eRNA and canonical gene expression imputed in BioVU, while the second model also included the genomic distance between the two transcripts.

Both models were trained using five-fold cross validation, and hyperparameters were tuned using grid search. An 80/20 train-test split was used. During grid search, each network was iteratively optimized for the number of hidden neurons, number of hidden layers, batch size, number of training epochs, optimizer, learning rate, weight initialization in the hidden and output layers, hidden layer activation function, dropout, weight constraint, and L1 and L2 regularization parameters. Model selection using either R2 or root mean squared error (RMSE) as the criterion gave comparable results (Supplementary Fig. 12). Optimal models were ultimately selected based on maximization of prediction R2.

Because this was a regression problem, mean squared error (MSE) served as the loss function:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \tag{1}$$

Here, $n$ is equal to the number of data points, $Y_i$ represents the observed outcome, and $\hat{Y}_i$ represents the predicted outcome. We assumed a combination of L1 and L2 regularization. Predictive performance was evaluated using the $R^2$ metric:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2}{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2} \tag{2}$$

Here, $n$ is equal to the number of data points, $Y_i$ represents the observed outcome, $\hat{Y}_i$ represents the predicted outcome, and $\bar{Y}$ represents the mean of the observed outcomes.
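The two quantities defined above can be written out directly. The following is a minimal pure-Python sketch on toy values; the `regularized_loss` helper and its `l1` and `l2` coefficients are hypothetical illustrations of how a combined L1 and L2 penalty is typically added to the loss, not the authors' PyTorch implementation.

```python
def mse(y_true, y_pred):
    """Mean squared error, as in Eq. (1)."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

def r_squared(y_true, y_pred):
    """Coefficient of determination, as in Eq. (2)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y_hat - y) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def regularized_loss(y_true, y_pred, weights, l1=0.0, l2=0.0):
    """MSE plus L1 and L2 penalties on the weights (hypothetical coefficients)."""
    penalty = l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)
    return mse(y_true, y_pred) + penalty

# Toy observed and predicted contact frequencies.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(mse(observed, predicted))        # 0.125
print(r_squared(observed, predicted))  # 0.9
```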

SCZ PrediXcan TWAS

Using the eRNA models and previously published canonical gene expression models [14], S-PrediXcan [108], implemented in the MetaXcan GitHub repository (https://github.com/hakyimlab/MetaXcan), was utilized to perform summary statistics-based TWAS on a recent GWAS meta-analysis of schizophrenia [30]. Multiple testing was accounted for using a Bonferroni correction. Thus, for the eRNA-based TWAS, the p-value threshold for genome-wide significance was 1.23 × 10−6 (0.05/40,749 eRNA-tissue pairs), while the threshold for suggestive significance was 3.46 × 10−6 (0.05/14,471 unique eRNAs). Meanwhile, for the canonical gene-based TWAS, genome-wide significant and suggestive p-value thresholds of 1.45 × 10−7 (0.05/344,814 gene-tissue pairs) and 1.91 × 10−6 (0.05/26,138 unique genes), respectively, were used.
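The four Bonferroni thresholds above follow from dividing 0.05 by the relevant number of tests. A short sketch reproducing them from the counts given in the text (the dictionary keys are illustrative labels only):

```python
ALPHA = 0.05

# Number of tests for each threshold, as stated in the text above.
n_tests = {
    "eRNA-tissue pairs": 40_749,   # eRNA TWAS, genome-wide significance
    "unique eRNAs": 14_471,        # eRNA TWAS, suggestive significance
    "gene-tissue pairs": 344_814,  # canonical gene TWAS, genome-wide
    "unique genes": 26_138,        # canonical gene TWAS, suggestive
}

# Bonferroni correction: divide the family-wise alpha by the number of tests.
thresholds = {label: ALPHA / n for label, n in n_tests.items()}

for label, threshold in thresholds.items():
    print(f"{label}: {threshold:.2e}")
# eRNA-tissue pairs: 1.23e-06
# unique eRNAs: 3.46e-06
# gene-tissue pairs: 1.45e-07
# unique genes: 1.91e-06
```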

Mendelian randomization of significant SCZ TWAS loci

Genome-wide significant associations from the respective eRNA-based and canonical gene-based TWAS were tested for causality using Mendelian randomization implemented in MR-JTI14 (https://github.com/gamazonlab/MR-JTI). After testing each eRNA-tissue and canonical gene-tissue association for causality, the identified causal associations were localized to their respective genetic loci to assess which loci had a causal eRNA only, canonical gene only, or both for SCZ.

Investigating potential functional mechanisms of causal eRNAs

ChIP-seq datasets generated in the SK-N-SH neuronal cell line were downloaded from the ENCODE Project96,109 data portal (https://www.encodeproject.org) in narrowPeak format for H3K27ac (ENCFF362OBM), H3K27me3 (ENCFF277NRX), H3K4me1 (ENCFF580GTZ), RAD21 (ENCFF051ZRW), SMC3 (ENCFF756DQA), CTCF (ENCFF244QKO), and EP300 (ENCFF654KAP). ATAC-seq (ENCFF716JUM) and DNase-seq (ENCFF752OZB) datasets were also retrieved for the same cell line. All datasets based on hg38 were lifted over to hg19 using liftOver110.

Transcription factor binding motif enrichment

FIMO111 (v5.4.1, https://meme-suite.org/meme/doc/fimo.html) was used to search for enrichment of HOCOMOCO112 v11 human TF binding motifs in the set of 34 causal SCZ-associated eRNAs in brain. Default parameters were used, and motifs with a q-value (FDR) < 0.05 were considered to have significant enrichment.

Phenome-wide eRNA-based TWAS in the UK Biobank

Using S-PrediXcan108, the eRNA-based models were applied to GWAS summary statistics across 4,671 traits in the UK Biobank10 to generate a comprehensive resource of tissue-specific enhancer associations. A genome-wide significant threshold of p < 2.60 × 10−10 was used, with the multiple testing threshold for genome-wide significance determined using Bonferroni correction:

$$p_{\mathrm{eRNA}} = \frac{0.05}{41{,}086\ \text{eRNA-tissue pairs} \times 4{,}671\ \text{complex traits}} \tag{3}$$

Enhancer perturbation analysis for disease-associated eRNAs

A previously published enhancer perturbation dataset was obtained for the K562 cell line. Using CRISPRi, 5920 human enhancers were perturbed, and resulting changes in gene expression were measured113. Perturbed enhancer coordinates were converted to UCSC BED format, and enhancers that overlapped with a known eRNA annotated by the ENSEMBL98, FANTOM599, or Roadmap96 consortia were identified using bedtools97 (v2.30.0, https://bedtools.readthedocs.io/en/latest/).

eRNA and canonical gene eQTL mapping

The same genotype and normalized expression datasets used for eRNA and canonical gene-based GReX model training were used to perform eQTL mapping. Cis-eQTLs were mapped using Matrix eQTL [114] (v2.3).

Comparing eRNA eQTLs and canonical gene eQTLs

For each tissue, the overlap between the eRNA eQTLs and the canonical gene eQTLs was assessed using the Jaccard similarity index.
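The Jaccard similarity index is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows, treating each eQTL set as a set of SNP identifiers; the rsIDs are placeholders, and representing eQTLs by SNP identifier is an assumption made here for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity index: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Placeholder SNP identifiers for a single tissue.
erna_eqtls = {"rs1", "rs2", "rs3", "rs4"}
gene_eqtls = {"rs4", "rs5", "rs6"}

# One shared SNP out of six distinct SNPs.
similarity = jaccard(erna_eqtls, gene_eqtls)
print(similarity)
```

An index of 0 means the two sets share no eQTLs and an index of 1 means they are identical, so the reported tissue means near 0.05 indicate largely distinct sets.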

To assess the distance from a canonical gene, we used the human TSS coordinates from refTSS115 version 4.1 (https://reftss.riken.jp/datafiles/4.1/human/refTSS_v4.1_human_coordinate.hg38.bed.txt.gz). The coordinates were lifted from hg38 to hg19 using liftOver110 (http://hgdownload.soe.ucsc.edu/admin/exe/liftOver.gz). Distance from the closest TSS for all eRNA and canonical gene eQTLs was computed using bedtools (v2.30.0, https://bedtools.readthedocs.io/en/latest/).
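The nearest-TSS computation can be approximated without bedtools by a binary search over sorted TSS positions. The sketch below assumes that each TSS and each eQTL is reduced to a single base-pair position and uses toy coordinates; it illustrates the idea only and does not reproduce the bedtools command the authors ran.

```python
from bisect import bisect_left
from statistics import median

def closest_tss_distance(chrom, pos, tss_by_chrom):
    """Absolute distance (bp) from a variant to the nearest TSS on its chromosome.

    tss_by_chrom maps chromosome -> non-empty list of TSS positions sorted ascending.
    """
    sites = tss_by_chrom[chrom]
    i = bisect_left(sites, pos)      # index of the first TSS at or after pos
    candidates = []
    if i < len(sites):
        candidates.append(sites[i] - pos)       # nearest TSS downstream
    if i > 0:
        candidates.append(pos - sites[i - 1])   # nearest TSS upstream
    return min(candidates)

# Toy TSS positions and eQTL positions (all hypothetical).
tss_by_chrom = {"chr1": [1_000, 50_000, 120_000]}
erna_eqtls = [("chr1", 1_200), ("chr1", 80_000), ("chr1", 119_000)]
gene_eqtls = [("chr1", 900), ("chr1", 50_500), ("chr1", 121_000)]

erna_dist = [closest_tss_distance(c, p, tss_by_chrom) for c, p in erna_eqtls]
gene_dist = [closest_tss_distance(c, p, tss_by_chrom) for c, p in gene_eqtls]

print(erna_dist, median(erna_dist))  # [200, 30000, 1000] 1000
print(gene_dist, median(gene_dist))  # [100, 500, 1000] 500
```

The two resulting distance distributions are what a rank-based test such as the Mann-Whitney U test would then compare.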

Colocalization of GWAS signals with eRNA and canonical gene eQTLs

Across 4671 complex traits in the UK Biobank10, 996,391 independent, genome-wide significant (p < 5 × 10−8) associations were identified. For each significant association, all GWAS SNPs and SNP eQTLs at the corresponding locus were utilized to perform colocalization analysis using coloc65 (v5.2.3). Colocalization was run in each of the 49 GTEx tissues for both eRNA eQTLs and canonical gene eQTLs. A GWAS signal was considered to colocalize with an eQTL if the posterior probability was >= 0.7 in at least one tissue.
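The decision rule in the last sentence can be expressed as a small classifier over per-tissue posterior probabilities. In this sketch the input dictionaries, tissue labels, and probability values are hypothetical; the posterior is assumed to be the probability of a shared causal variant reported by the colocalization software.

```python
PP_THRESHOLD = 0.7

def colocalizes(pp_by_tissue):
    """True if the posterior probability is >= 0.7 in at least one tissue."""
    return any(pp >= PP_THRESHOLD for pp in pp_by_tissue.values())

def classify_signal(erna_pp, gene_pp):
    """Label a GWAS signal by which eQTL sub-class(es) it colocalizes with."""
    erna_hit = colocalizes(erna_pp)
    gene_hit = colocalizes(gene_pp)
    if erna_hit and gene_hit:
        return "both"
    if erna_hit:
        return "eRNA only"
    if gene_hit:
        return "gene only"
    return "neither"

# Hypothetical per-tissue posteriors for one GWAS signal.
erna_pp = {"Whole_Blood": 0.82, "Brain_Cerebellum": 0.10}
gene_pp = {"Whole_Blood": 0.35, "Brain_Cerebellum": 0.41}

label = classify_signal(erna_pp, gene_pp)
print(label)  # eRNA only
```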

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Acknowledgements

This research was supported by National Institutes of Health (NIH) grants NHGRI R35HG010718 (E.R.G.), NHGRI R01HG011138 (E.R.G.), NIA AG068026 (E.R.G.), NIGMS R01GM140287 (E.R.G.), NIMH R01MH126459 (E.R.G.), NIA R56AG089926 (E.R.G.), and NIH/NCI U01CA253560 (M.C.A.). This research was supported by a gift to E.R.G. from the Scott Hamilton CARES Foundation (in honor of Carlee Vaughn).

Author contributions

M.J.B. and E.R.G. designed the study. M.J.B. performed the data analysis, with contributions from P.L., and wrote the manuscript, with contributions from E.R.G. and M.C.A. E.R.G. supervised the study.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Data availability

The eRNA GReX models, along with TWAS, Mendelian randomization, and colocalization results generated in this study, have been deposited in Zenodo under the accession code 14027849 (ref. 116). The trained contact frequency models are deposited on GitHub (https://github.com/mjbetti/erna-grex) and Zenodo under accession code 14557414 (ref. 104). The previously published canonical gene GReX models are deposited in Zenodo under the accession code 3842289 (ref. 94). Nuclear run-on data from the K562 cell line are available from the ENCODE Portal96 under accession code ENCSR363AKK. Hi-C datasets for K562 and astrocytes of the cerebellum are available from the 4D Nucleome Data Portal19 under accession codes 4DNFI18UHVRO and 4DNFIWCAQUIK, respectively. ChIP-seq and chromatin accessibility datasets for SK-N-SH are available from the ENCODE Portal96 under accession codes ENCFF362OBM, ENCFF277NRX, ENCFF580GTZ, ENCFF051ZRW, ENCFF756DQA, ENCFF244QKO, ENCFF654KAP, ENCFF716JUM, and ENCFF752OZB. The curated dataset of human TSS coordinates is available from refTSS115 (https://reftss.riken.jp/datafiles/4.1/human/refTSS_v4.1_human_coordinate.hg38.bed.txt.gz). The quantified eRNA expression data used to train GReX models are available from HeRA91 (https://hanlaboratory.com/HeRA/). Due to the nature of the GTEx donor consent agreement, raw genotype data from GTEx V8 (ref. 9) individuals are available under restricted access. These data are available via access request to dbGaP accession no. phs000424.v10.p2 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v10.p2]. Access to de-identified BioVU genotype data18 requires a collaboration with a Vanderbilt faculty member. Investigators may contact BioVU support (biovu@vumc.org) for additional information or assistance in establishing a collaboration.
All requests for BioVU data and materials are reviewed by Vanderbilt University Medical Center to determine whether the request is subject to any intellectual property or confidentiality obligations. Any such data and materials that can be shared will be released via a material transfer agreement. The compiled datasets used to train contact frequency models, TF binding motif enrichment results, perturbed TWAS eRNAs, Hi-C QC metrics, and underlying data used to plot figures are available in the Supplementary Information/Source Data file. Source data are provided with this paper.

Code availability

All code generated for this work is publicly accessible on GitHub (https://github.com/mjbetti/erna-grex). A permanent version of all code is deposited on Zenodo under accession code 14557414 (ref. 104).

Competing interests

Dr. Gamazon has performed consulting for Thryv Therapeutics. Dr. Gamazon is a co-inventor on patents for molecular signatures of cardiovascular phenotypes and metabolic health, and the use of RNAs as therapeutics and diagnostic biomarkers. This had no influence on the research presented in this study. The remaining authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Michael J. Betti, Email: michael.j.betti@vanderbilt.edu

Eric R. Gamazon, Email: eric.gamazon@vumc.org

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-025-58023-x.

References

  • 1.Panigrahi, A. & O’Malley, B. W. Mechanisms of enhancer action: the known and the unknown. Genome Biol.22, 108 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Arnold, P. R., Wells, A. D. & Li, X. C. Diversity and emerging roles of enhancer rna in regulation of gene expression and cell fate. Front Cell Dev. Biol.7, 377 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Ren, C. et al. Functional annotation of structural ncRNAs within enhancer RNAs in the human genome: implications for human disease. Sci. Rep.7, 15518 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Tsai, P.-F. et al. A muscle-specific enhancer rna mediates cohesin recruitment and regulates transcription in trans. Mol. Cell71, 129–141.e8 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Pezone, A. et al. RNA stabilizes transcription-dependent chromatin loops induced by nuclear hormones. Sci. Rep.9, 3925 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Sartorelli, V. & Lauberth, S. M. Enhancer RNAs are an important regulatory layer of the epigenome. Nat. Struct. Mol. Biol.27, 521–528 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Han, Z. & Li, W. Enhancer RNA: What we know and what we can achieve. Cell Prolif.55, e13202 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Dong, P. et al. Population-level variation in enhancer expression identifies disease mechanisms in the human brain. Nat. Genet.54, 1493–1503 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science369, 1318–1330 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Neale, B. UK Biobank GWAS Round 2. Neale Lab http://www.nealelab.is/uk-biobank/ (2018).
  • 11.Churchhouse, C. Rapid GWAS of thousands of phenotypes for 337,000 samples in the UK Biobank. Neale Lab http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank (2017).
  • 12.Zou, H. & Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B Stat. Methodol.67, 301–320 (2005). [Google Scholar]
  • 13.Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet.47, 1091–1098 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Zhou, D. et al. A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis. Nat. Genet.52, 1239–1246 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Li, Q. et al. Enhancer RNAs: mechanisms in transcriptional regulation and functions in diseases. Cell Commun. Signal.21, 191 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Koch, F. et al. Transcription initiation platforms and GTF recruitment at tissue-specific enhancers and promoters. Nat. Struct. Mol. Biol.18, 956–963 (2011). [DOI] [PubMed] [Google Scholar]
  • 17.Natoli, G. & Andrau, J.-C. Noncoding transcription at enhancers: general principles and functional models. Annu. Rev. Genet.46, 1–19 (2012). [DOI] [PubMed] [Google Scholar]
  • 18.Roden, D. M. et al. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin. Pharmacol. Ther.84, 362–369 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Dekker, J. et al. The 4D nucleome project. Nature549, 219–226 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Duchi, J., Hazan, E. & Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. (2011).
  • 21.Dozat, T. Incorporating Nesterov Momentum into Adam. Proceedings of the 4th International Conference on Learning Representations (2016).
  • 22.Sakaue, S. et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat. Genet.53, 1415–1424 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Chen, M.-H. et al. Trans-ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations. Cell182, 1198–1213.e14 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Amick, J., Tharkeshwar, A. K., Amaya, C. & Ferguson, S. M. WDR41 supports lysosomal response to changes in amino acid availability. Mol. Biol. Cell29, 2213–2227 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Sullivan, P. M. et al. The ALS/FTLD associated protein C9orf72 associates with SMCR8 and WDR41 to regulate the autophagy-lysosome pathway. Acta Neuropathol. Commun.4, 51 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv [cs.AI] (2017).
  • 27.Hahn, S. & Kim, D. Physical origin of the contact frequency in chromosome conformation capture data. Biophys. J.105, 1786–1795 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Schmitt, A. D. et al. A Compendium of Chromatin Contact Maps Reveals Spatially Active Regions in the Human Genome. Cell Rep.17, 2042–2059 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet.48, 245–252 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Trubetskoy, V. et al. Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature604, 502–508 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Girgenti, M. J., LoTurco, J. J. & Maher, B. J. ZNF804a regulates expression of the schizophrenia-associated genes PRSS16, COMT, PDE4B, and DRD2. PLoS One7, e32404 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Chen, P. et al. Polymorphism of Transferrin Gene Impacts the Mediating Effects of Psychotic Symptoms on the Relationship between Oxidative Stress and Cognition in Patients with Chronic Schizophrenia. Antioxidants (Basel)11, (2022). [DOI] [PMC free article] [PubMed]
  • 33.Buxton, D. S., Batten, D. J., Crofts, J. J. & Chuzhanova, N. Predicting novel genomic regions linked to genetic disorders using GWAS and chromosome conformation data - a case study of schizophrenia. Sci. Rep.9, 17940 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Rose, E. J. et al. Effects of a novel schizophrenia risk variant rs7914558 at CNNM2 on brain structure and attributional style. Br. J. Psychiatry204, 115–121 (2014). [DOI] [PubMed] [Google Scholar]
  • 35.Hederih, J. et al. Genetic underpinnings of schizophrenia-related electroencephalographical intermediate phenotypes: A systematic review and meta-analysis. Prog. Neuropsychopharmacol. Biol. Psychiatry104, 110001 (2021). [DOI] [PubMed] [Google Scholar]
  • 36.Yang, Z. et al. The genome-wide risk alleles for psychiatric disorders at 3p21.1 show convergent effects on mRNA expression, cognitive function, and mushroom dendritic spine. Mol. Psychiatry25, 48–66 (2020). [DOI] [PubMed] [Google Scholar]
  • 37.Li, W. et al. Functional roles of enhancer RNAs for oestrogen-dependent transcriptional activation. Nature498, 516–520 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Sigova, A. A. et al. Transcription factor trapping by RNA in gene regulatory elements. Science350, 978–981 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Bose, D. A. et al. RNA binding to CBP stimulates histone acetylation and transcription. Cell168, 135–149.e22 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Pnueli, L., Rudnizky, S., Yosefzon, Y. & Melamed, P. RNA transcribed from a distal enhancer is required for activating the chromatin at the promoter of the gonadotropin α-subunit gene. Proc. Natl Acad. Sci. USA112, 4369–4374 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Cao, R. et al. Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science298, 1039–1043 (2002). [DOI] [PubMed] [Google Scholar]
  • 42.Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell155, 934–947 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Hu, T.-M., Chen, C.-H., Chuang, Y.-A., Hsu, S.-H. & Cheng, M.-C. Resequencing of early growth response 2 (EGR2) gene revealed a recurrent patient-specific mutation in schizophrenia. Psychiatry Res228, 958–960 (2015). [DOI] [PubMed] [Google Scholar]
  • 44.Marballi, K. K. & Gallitano, A. L. Immediate early genes anchor a biological pathway of proteins required for memory formation, long-term depression and risk for schizophrenia. Front. Behav. Neurosci.12, 23 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Yamada, K. & Yoshikawa, T. From the EGR gene family to common pathways in schizophrenia: single genes versus convergent pathways. Future Neurol.2, 347–351 (2007). [Google Scholar]
  • 46.Iwamoto, K. et al. DNA methylation status of SOX10 correlates with its downregulation and oligodendrocyte dysfunction in schizophrenia. J. Neurosci.25, 5376–5381 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Yuan, A. et al. Effect of SOX10 gene polymorphism on early onset schizophrenia in Chinese Han population. Neurosci. Lett.521, 93–97 (2012). [DOI] [PubMed] [Google Scholar]
  • 48.Sepp, M. et al. The intellectual disability and schizophrenia associated transcription factor tcf4 is regulated by neuronal activity and protein kinase A. J. Neurosci.37, 10516–10527 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Xia, H. et al. Building a schizophrenia genetic network: transcription factor 4 regulates genes involved in neuronal development and schizophrenia risk. Hum. Mol. Genet.27, 3246–3256 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Zhou, X. Over-representation of potential SP4 target genes within schizophrenia-risk genes. Mol. Psychiatry27, 849–854 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Chen, J. et al. Role played by the SP4 gene in schizophrenia and major depressive disorder in the Han Chinese population. Br. J. Psychiatry208, 441–445 (2016). [DOI] [PubMed] [Google Scholar]
  • 52.You, C. et al. Polygenic scores and parental predictors: an adult height study based on the united kingdom biobank and the framingham heart study. Front. Genet.12, 669441 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet.42, 565–569 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Ge, T., Chen, C.-Y., Neale, B. M., Sabuncu, M. R. & Smoller, J. W. Phenome-wide heritability analysis of the UK Biobank. PLoS Genet13, e1006711 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Hervoso, J. L. et al. Splicing-specific transcriptome-wide association uncovers genetic mechanisms for schizophrenia. Am. J. Hum. Genet.111, 1573–1587 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Plooster, M., Brennwald, P. & Gupton, S. L. Endosomal trafficking in schizophrenia. Curr. Opin. Neurobiol.74, 102539 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Duarte, R. R. R. et al. The psychiatric risk gene NT5C2 regulates adenosine monophosphate-activated protein kinase signaling and protein translation in human neural progenitor cells. Biol. Psychiatry86, 120–130 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Gao, F. et al. A defect in mitochondrial protein translation influences mitonuclear communication in the heart. Nat. Commun.14, 1595 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Lanfear, D. E., Yang, J. J., Mishra, S. & Sabbah, H. N. Genome-wide approach to identify novel candidate genes for beta blocker response in heart failure using an experimental model. Discov. Med.11, 359–366 (2011). [PMC free article] [PubMed] [Google Scholar]
  • 60.Melka, M. G. et al. Genome-wide scan for loci of adolescent obesity and their relationship with blood pressure. J. Clin. Endocrinol. Metab.97, E145–50 (2012). [DOI] [PubMed] [Google Scholar]
  • 61.Mostafavi, H., Spence, J. P., Naqvi, S. & Pritchard, J. K. Systematic differences in discovery of genetic effects on gene expression and complex traits. Nat. Genet.55, 1866–1875 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Chun, S. et al. Limited statistical evidence for shared genetic effects of eQTLs and autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet.49, 600–605 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Umans, B. D., Battle, A. & Gilad, Y. Where are the disease-associated eQTLs? Trends Genet37, 109–124 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Connally, N. J. et al. The missing link between genetic association and regulatory function. Elife11, (2022). [DOI] [PMC free article] [PubMed]
  • 65.Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet10, e1004383 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Ko, J. Y., Oh, S. & Yoo, K. H. Functional enhancers as master regulators of tissue-specific gene regulation and cancer development. Mol. Cells40, 169–177 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Gosselin, D. et al. Environment drives selection and function of enhancers controlling tissue-specific macrophage identities. Cell159, 1327–1340 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Huang, J. et al. Dynamic control of enhancer repertoires drives lineage and stage-specific transcription during hematopoiesis. Dev. Cell36, 9–23 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Siersbæk, R. et al. Transcription factor cooperativity in early adipogenic hotspots and super-enhancers. Cell Rep.7, 1443–1455 (2014). [DOI] [PubMed] [Google Scholar]
  • 70.Whyte, W. A. et al. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell153, 307–319 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Villar, D. et al. Enhancer evolution across 20 mammalian species. Cell160, 554–566 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Dsouza, K. B. et al. Learning representations of chromatin contacts using a recurrent neural network identifies genomic drivers of conformation. Nat. Commun.13, 3704 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Yang, R. et al. Epiphany: predicting Hi-C contact maps from 1D epigenomic signals. Genome Biol.24, 134 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Zheng, X., Wang, J. & Wang, C. HiCArch: A Deep Learning-based Hi-C Data Predictor. bioRxiv 2021.11.26.470146 10.1101/2021.11.26.470146 (2021)
  • 75.Li, W., Wong, W. H. & Jiang, R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res47, e60 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Agarwal, A. & Chen, L. DeepPHiC: predicting promoter-centered chromatin interactions using a novel deep learning approach. Bioinformatics39, (2023). [DOI] [PMC free article] [PubMed]
  • 77.Zhang, S., Chasman, D., Knaack, S. & Roy, S. In silico prediction of high-resolution Hi-C interaction matrices. Nat. Commun.10, 5449 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Kleiveland, C. R. Peripheral Blood Mononuclear Cells. (Springer, 2015).
  • 79.Koh, E. T., Chi, M. S. & Lowenstein, F. W. Comparison of selected blood components by race, sex, and age. Am. J. Clin. Nutr.33, 1828–1835 (1980). [DOI] [PubMed] [Google Scholar]
  • 80.Kresovich, J. K., Parks, C. G., Sandler, D. P., Weinberg, C. R. & Taylor, J. A. The Role of Blood Cell Composition in Epidemiologic Studies of Telomeres. Epidemiology31, e34–e36 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.De Jong, S. et al. Seasonal changes in gene expression represent cell-type composition in whole blood. Hum. Mol. Genet.23, 2721–2728 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Young, D. L. & Fields, S. The role of functional data in interpreting the effects of genetic variation. Mol. Biol. Cell26, 3904–3908 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Liu, Y. et al. Gene Expression and RNA Splicing Imputation Identifies Novel Candidate Genes Associated with Osteoporosis. J. Clin. Endocrinol. Metab.105, e4742–57 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Bhattacharya, A. et al. Isoform-level transcriptome-wide association uncovers extensive novel genetic risk mechanisms for neuropsychiatric disorders in the human brain. bioRxiv 10.1101/2022.08.23.22279134 (2022). [DOI] [PMC free article] [PubMed]
  • 85.Giral, H., Landmesser, U. & Kratzer, A. Into the Wild: GWAS Exploration of Non-coding RNAs. Front Cardiovasc Med5, 181 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Wainberg, M. et al. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet.51, 592–599 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Gusev, A. et al. Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nat. Genet.50, 538–548 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Altmäe, S., Molina, N. M. & Sola-Leyva, A. Omission of non-poly(A) viral transcripts from the tissue level atlas of the healthy human virome. BMC Biol.18, 179 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Gay, N. R. et al. Impact of admixture and ancestry on eQTL analysis and GWAS colocalization in GTEx. Genome Biol.21, 233 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Chen, F. et al. Multi-ancestry transcriptome-wide association analyses yield insights into tobacco use biology and drug repurposing. Nat. Genet.55, 291–300 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Zhang, Z. et al. HeRA: an atlas of enhancer RNAs across human tissues. Nucleic Acids Res49, D932–D938 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.GTEx Consortium. et al. Genetic effects on gene expression across human tissues. Nature550, 204–213 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc.7, 500–507 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Gamazon, E. & Zhou, D. JTI Version 1.0. Zenodo 10.5281/zenodo.3842289 (2020).
  • 95.Napoli, S., Munz, N., Guidetti, F. & Bertoni, F. Enhancer RNAs (eRNAs) in cancer: The jacks of all trades. Cancers (Basel)14, 1978 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature583, 699–710 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics26, 841–842 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res49, D884–D891 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Abugessaisa, I. et al. FANTOM5 CAGE profiles of human and mouse reprocessed for GRCh38 and GRCm38 genome assemblies. Sci. Data4, 170107 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res22, 1760–1774 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Frankish, A. et al. GENCODE 2021. Nucleic Acids Res49, D916–D923 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods9, 999–1003 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Abdennur, N. & Mirny, L. A. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics36, 311–316 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Betti, M. J. & Gamazon, E. mjbetti/erna-grex: v1. Zenodo 10.5281/zenodo.14557414 (2024).
  • 105.Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. arXiv [cs.LG] (2012).
  • 106.Nickolls, J., Buck, I., Garland, M. & Skadron, K. Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? Queueing Syst.6, 40–53 (2008). [Google Scholar]
  • 107.Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv [cs.LG] (2019).
  • 108.Barbeira, A. N. et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun.9, 1825 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature489, 57–74 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Hinrichs, A. S. et al. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res34, D590–8 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111.Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics27, 1017–1018 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112.Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res46, D252–D259 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell176, 377–390.e19 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Shabalin, A. A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics28, 1353–1358 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.Abugessaisa, I. et al. refTSS: A Reference Data Set for Human and Mouse Transcription Start Sites. J. Mol. Biol.431, 2407–2422 (2019). [DOI] [PubMed] [Google Scholar]
  • 116.Betti, M. J. & Gamazon, E. eRNA GReX. Zenodo 10.5281/zenodo.14027849 (2024).
  • 117.Betti, M. https://BioRender.com/s62i616 (2025).
  • 118.Betti, M. https://BioRender.com/x12w509 (2025).

Articles from Nature Communications are provided here courtesy of Nature Publishing Group
