Abstract
Non-coding transcriptional regulatory elements are critical for controlling the spatiotemporal expression of genes. Here, we demonstrate that the sizes and number of enhancers linked to a gene reflect its disease pathogenicity. Moreover, genes with redundant enhancer domains are depleted of cis-acting genetic variants that disrupt gene expression, and they are buffered against the effects of disruptive non-coding mutations. Our results demonstrate that dosage-sensitive genes have evolved a robustness to the disruptive effects of genetic variation by expanding their regulatory domains. This solves a puzzle about why genes associated with human disease are depleted of cis-eQTLs (cis-expression quantitative trait loci), suggests that this relationship might complicate causal gene identification in genome-wide association studies (GWASs) that use eQTL information, and establishes a framework for identifying non-coding regulatory variation with phenotypic consequences.
Keywords: enhancer, gene regulation, eQTLs, pathogenicity, causal gene, intolerance, enhancer domains, Mendelian disease, EDS
Introduction
Non-coding regulatory elements, such as transcriptional enhancers, are critical for the precise spatiotemporal regulation of gene expression. Transcriptional regulation is a highly complex process that is often mediated by arrays of enhancer elements that can be separated from their regulated genes by more than one megabase.1 Studies in Drosophila have demonstrated that key developmental genes are often regulated by multiple “shadow” enhancers with redundant activity patterns and are thus protected against genetic perturbations.2, 3, 4 Recent work suggests that a similar organizational structure might be present in mammalian genomes. Mammalian developmental genes have been reported to be located near more enhancer elements than the average gene.5, 6, 7 Furthermore, targeted deletions of ultraconserved enhancers that regulate key developmental genes have led to viable mice, occasionally with subtle abnormal phenotypes.8, 9, 10 Finally, differences in the activity of enhancer elements are often not reflected by changes in gene expression.11, 12, 13
In keeping with the key role that non-coding enhancer elements play in gene regulation, many groups have shown that enhancers are enriched for disease-associated common genetic variants.14, 15, 16 In contrast, disease-associated rare variants have mostly been identified within the protein-coding regions of genes, and studies employing whole-exome sequencing (WES) have effectively implicated genes and disease-causing mutations in conditions including epilepsy, idiopathic pulmonary fibrosis, ALS, and others.17, 18, 19 Despite intense interest in developing complementary approaches to implicate rare non-coding disease mutations, progress has been limited, and many studies employing whole-genome sequencing (WGS) have failed to discover equivalently large sets of rare-variant signals in non-coding regions.20, 21, 22, 23 Our limited understanding of enhancer biology is a major roadblock in the successful application of WGS for disease diagnosis. In particular, disruptive non-coding mutations are difficult to recognize because the specific functional nucleotides within enhancers remain unknown.24,25 The target genes of enhancers are also poorly resolved because experimental methods have limited resolution and/or limited throughput, and computational methods often suffer from poorly understood accuracy.24,26, 27, 28, 29 These limitations complicate the systematic study of enhancers in rare-disease genetic-mapping studies. Thus, the development of a framework that allows us to study enhancers in human disease will have major implications for genetic-mapping studies.
Inspired in part by the shadow enhancer model, we hypothesized that the disease pathogenicity of a gene could be predicted by its regulatory landscape. Here, we use computational predictions of enhancer-gene interactions to develop a simple scoring system to rank genes by the number of functional regulatory nucleotides in their enhancer elements. Remarkably, the number of functional regulatory nucleotides is highly reflective of a gene’s pathogenicity and is independent of and complementary to existing metrics of intolerance and constraint. This result also provides a genome-wide assessment of the performance of computational methods for predicting which important enhancers regulate human genes. We combine features associated with enhancer-domain size to create an enhancer-domain score (EDS), which reflects the total size and redundancy of a gene’s non-coding regulatory architecture. Notably, this score negatively correlates with whether the associated gene carries a cis-acting eQTL, suggesting that, similar to observations in Drosophila, mammalian genes implicated in human disease have evolved a robustness to regulatory genetic variation. In addition, candidate causal genes at genome-wide association study (GWAS) loci have high EDSs, suggesting that these genes might be less likely to be discovered via expression quantitative trait loci (eQTLs). Indeed, we provide evidence showing that using eQTL information to implicate causal genes at GWAS loci is complicated by the EDS relationship. Finally, we show that the consideration of these enhancer regions of genes provides an appropriate framework for the identification of disease-causing mutations in regulatory sequences and emphasizes the importance of using approaches that assess the cumulative burden of genetic variation occurring in implicated enhancer regions.
Methods
Calculation of EDSs
Enhancer-Domain Sizes Per Gene
To maximize the set of transcriptional regulatory elements across tissues, we considered predicted enhancer elements from the Roadmap Epigenomics Consortium.30 Enhancer predictions from the 15-state ChromHMM model are available for 127 human tissues or cell types and were downloaded from the Roadmap Epigenomics website (Web Resources). We used four computational approaches to predict enhancer-gene interactions: activity-linking (primary method used in main figures, e.g., Figure 1), proximity-linking (independent approach that treats each tissue individually and does not use RNA-seq information; used in Supplemental Figures), JEME, and FOCS (both methods that were used for Figure S7). Activity-linking-based enhancer-gene links involving enhancers predicted with the 15-state ChromHMM model across the same 127 human tissues were obtained from Liu et al., 201731 (code originally developed in Ernst et al., 201132) and were downloaded from the Roadmap Epigenomics Gene-Enhancer Linking page hosted by the Ernst lab (see Web Resources). Proximity-based enhancer-gene links were generated with the GREAT v3.0.0 webtool (see Web Resources33) set to default settings (“Basal plus extension” linking method, 5 kb upstream, 1 kb downstream, plus distal up to 1,000 kb). We ran GREAT after splitting the set of ChromHMM Epigenome Roadmap enhancers into bins of 500,000 enhancers each. JEME enhancer-promoter interactions were obtained from the Yip lab at the Chinese University of Hong Kong (see Web Resources) via the elastic net networks from ENCODE and Roadmap.34 FOCS links were obtained from the “Roadmap Epigenomics Enhancer-Promoter links with annotations” file (see “FOCS enhancer-promoter interactions” in Web Resources).35 Finally, we also incorporated one experimental method to assign enhancer-promoter interactions. These promoter interactions were downloaded from the Gene Expression Omnibus (GEO: GSE86189), “all_interaction.po” Supplementary file.36
Figure 1.
Enhancer Domain Size Is Associated with Gene Pathogenicity
(A) Overview of the approach to quantify the enhancer domain size for a gene on the basis of linked enhancer elements.
(B) Comparison of enrichment for MGI mouse essential genes in five gene sets that rank highly by the following regulatory metrics: total number of enhancers, total count of enhancer nucleotides, total count of conserved enhancer nucleotides, total number of nucleotides within predicted cis-regulatory modules (CRM), and total number of nucleotides within transcription factor binding sites (TFBSs). The top 3,000 genes ranked by each metric are used for the enrichment foreground set. Conserved enhancer nucleotides represent the union of conserved nucleotides across vertebrates, placental mammals, and primates.
(C) Genes ranking highly by the size of their enhancer domains are enriched for disease-relevant gene sets. The top 3,000 genes are used as the foreground. p values are from Fisher's exact test.
(D and E) Genes with large enhancer domains are enriched for MGI essential genes (D) and OMIM disease genes (E) across a wide range of cut-offs. The red area corresponds to the top 3,000 genes. Points represent enrichment of genes with an enhancer nucleotide count greater than the cut-off.
(F) Enrichment of genes with large enhancer domains (top 3,000) within developmental disease gene sets is high across disease genes that affect a variety of human organs.
(G) Enrichment of MGI essential genes by enhancer domain size is not entirely due to filtering for highly evolutionarily conserved genes. “High TSS conservation” genes (bottom four bars) correspond to the 3,000 genes with the greatest TSS conservation.
To quantify enhancer domains across all tissues, we used mergeBed (BEDTools v2.26.037) to merge the set of regulatory elements linked to each gene. We considered evolutionarily conserved nucleotides identified by phastCons for three comparisons: phastCons100way (100 vertebrates), phastCons46wayPlacental (46 placental mammals), and phastCons46wayPrimates (46 primates). To count the number of evolutionarily conserved nucleotides, we downloaded BED files of evolutionarily conserved elements from the UCSC Genome Table Browser and assigned them to linked enhancer elements by using the intersectBed tool (BEDTools v2.26.0). Correlations between enhancer nucleotides and conserved nucleotides (Figures S8C and S8D) were calculated with the R software, and plots were generated with the hexbinplot function (hexbin package) with the xbins = 100 setting. We obtained BED files for cis-regulatory modules and transcription factor (TF) binding sites across human tissues from the UniBind database (see Web Resources).38
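The merging step can be illustrated with a minimal Python sketch. It is not the BEDTools implementation; it assumes half-open BED-style coordinates and merges overlapping or book-ended intervals, and the function names and input layout (`merge_intervals`, `enhancer_domain_size`, tuples of gene, chromosome, start, end) are hypothetical.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or book-ended half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of opening a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def enhancer_domain_size(links):
    """Total merged enhancer nucleotides per gene.

    links: iterable of (gene, chrom, start, end) enhancer-gene links,
    pooled across all tissues (hypothetical input layout).
    """
    by_gene_chrom = defaultdict(list)
    for gene, chrom, start, end in links:
        by_gene_chrom[(gene, chrom)].append((start, end))
    sizes = defaultdict(int)
    for (gene, _chrom), intervals in by_gene_chrom.items():
        sizes[gene] += sum(e - s for s, e in merge_intervals(intervals))
    return dict(sizes)
```

Merging before counting ensures that an enhancer called in many tissues contributes its nucleotides only once to the gene's domain size.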
EDS
We quantified the following 108 features related to enhancer domains per gene (see Figure 2):
(1) Size of enhancer domains (4 metrics, 2 linking methods = 8 features): total count of all nucleotides, all conserved nucleotides, all discrete enhancers, and all discrete conserved elements.
(2) Redundancy of TF motifs in enhancers (2 metrics, 2 linking methods = 4 features): redundancy was calculated with motifs for 651 TFs from the ENCODE TF motif dataset (see Web Resources). BED files for each motif were intersected with enhancer domains for each human gene, and the number of unique matches per TF was counted per gene. Each gene was assigned a vector with 651 elements; this vector corresponded to the number of motif matches for each of the 651 TFs, and redundancy was determined from the average number of matches for the five most abundant TF motifs. This step was performed for enhancer-promoter interactions predicted with activity linking and proximity linking, and we separately considered TF motifs in full-length enhancers, as well as only the evolutionarily conserved elements within enhancers. In our tests, we observed high correlation between redundancy scores that were calculated from the average of the top three, five, and ten motifs. Further, although the TF redundancy score per gene correlates with the size of a gene's enhancer domain, because genes with larger enhancer domains will have more opportunities to contain redundant TF motifs, we find that this redundancy score is informative for disease-gene enrichment even after we control for the size of the enhancer domain.
(3) Enhancer domain sizes per organ; groupings from Figure S9 were used (24 organ types, 2 metrics per type: all nucleotides and conserved nucleotides, 2 linking methods = 96 features).
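The TF motif redundancy described in item (2) can be sketched as follows. This is an illustrative Python function, not the original code; the name `tf_redundancy` and its `top_n` parameter are hypothetical, and the input is assumed to be the per-gene vector of motif match counts (one entry per TF).

```python
def tf_redundancy(motif_counts, top_n=5):
    """Average match count of the top_n most abundant TF motifs for one gene.

    motif_counts: sequence of per-TF motif match counts (651 entries in the
    setting described in the text). top_n=5 follows the text; 3 and 10 are
    the alternatives it mentions.
    """
    top = sorted(motif_counts, reverse=True)[:top_n]
    return sum(top) / len(top) if top else 0.0
```

For example, a gene whose five most abundant motifs have 10, 8, 6, 4, and 2 matches receives a redundancy score of 6.0 under this sketch.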
Figure 2.
Creation of an EDS and Comparison to Gene Constraint Metrics
(A) Overview of the approach to generating an EDS by using 108 enhancer features related to enhancer domain size and redundancy.
(B) Predictive accuracy for EDS compared to LOEUF, RVIS, and pLI on a held-out test set of genes not used for training EDS.
(C) Venn diagram overlap of the top 3,000 genes ranked by pLI score, RVIS, and EDS. High-EDS genes summed to 3,001 as the result of a tie.
(D) Top GO categories enriched in pLI and RVIS-only gene sets (top, red) and EDS-only gene sets (bottom, blue). A full list of GO enrichments is provided in Tables S2 and S3.
(E) Top, the top enriched UniProt categories in EDS-only gene sets; bottom, examples of homeobox TFs with high EDS rank and low pLI and RVIS ranks (outside the top 3,000 for both pLI and RVIS).
(F) RVIS and pLI scores are strongly dependent on gene CDS length. The shaded area corresponds to commonly applied cut-offs for gene intolerance.
(G) Association between EDSs and gene length. The red shaded area corresponds to the top 3,000 genes by EDS.
(H) Distribution of CDS lengths for protein-coding genes. The red line corresponds to 2 kb; 65% of human genes are less than 2 kb in length.
(I) Left, Venn diagram groups of gene sets corresponding to bars in bar plot. Middle, EDS in genes with CDS length < 2 kb is more predictive than pLI score or RVIS for genes in the OMIM database; right, in genes with CDS length > 2 kb, the greatest enrichment for OMIM genes was observed when genes ranked highly by all three metrics (pLI, RVIS, and EDS) were considered.
With these 108 features, we trained an elastic net classifier (glmnet package, R) on a training dataset of 14,224 genes. Training was performed with 10-fold cross-validation for which alpha was set to 0.9 and against a binary vector where OMIM, ClinVar, Mouse Genome Informatics (MGI) Mouse Essential, and Developmental Disorders Genotype-Phenotype Database (DDG2P) genes were set to 1 and all remaining genes were set to 0. Test set prediction was performed on a held-out test dataset of 6,096 randomly selected genes.
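For orientation, the elastic net penalty in the glmnet parameterization mixes an L1 and an L2 term through alpha, so alpha = 0.9 sits close to the lasso. The Python sketch below only evaluates that penalty for a given coefficient vector; it does not fit a model, and the function name and the example value of `lam` are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha=0.9):
    """glmnet-style penalty: lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2).

    alpha = 1 gives the lasso penalty and alpha = 0 the ridge penalty.
    lam is the regularization strength, which cross-validation would select.
    """
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)
```

With alpha at 0.9 the L1 term dominates, which tends to drive many of the 108 feature coefficients to exactly zero while the small L2 term stabilizes groups of correlated features.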
Enrichment of Genes Linked to Human Disease in High-EDS Genes
Sources of the gene lists used for the creation of Figure 1E are listed in Table S6. We first converted all gene lists to Ensembl gene IDs by using the Ensembl gene IDs and symbols listed in the GTF file available from Ensembl GRCh38. We generated the list of organs affected by developmental diseases (used for Figure 1H) by using grep on the “organ specificity list” column in DDG2P (see Web Resources). We calculated enrichment by considering the proportion of high-EDS genes in the gene set compared to the set of high-EDS genes in the entire human genome. All p values for this section and all other analyses in the paper were calculated with two-sided statistical tests.
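The enrichment and its p value can be sketched in plain Python as below. This is an illustrative reimplementation under stated assumptions, not the code used in the study: the two-sided Fisher's exact test sums the probabilities of all 2x2 tables that are no more likely than the observed one, and the function names and argument order are hypothetical.

```python
from math import comb


def fold_enrichment(n_set_high, n_set, n_high, n_genes):
    """Proportion of high-EDS genes in a gene set relative to the genome-wide proportion."""
    return (n_set_high / n_set) / (n_high / n_genes)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in probs if q <= p_obs * (1 + 1e-7)))
```

Here `a` would be the number of high-EDS genes inside the gene set, `b` the remaining genes in the set, and `c` and `d` the corresponding counts outside the set.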
Comparison of the EDS to pLI, RVIS, and TSS Conservation
Probability of being loss-of-function intolerant (pLI) scores were downloaded from ExAC v0.3.1 (see Web Resources), and residual-variance intolerance scores (RVISs) were downloaded from RVIS v3 (see Web Resources). Gene Ontology (GO) enrichments were performed with DAVID v6.8 (see Web Resources) with a background set of 20,047 genes for which EDS scores, pLI scores, and RVISs are annotated. All gene lists used for GO enrichment calculations and resulting lists of enriched GO categories are available in Tables S2 and S3.
Transcription start site (TSS) conservation scores were calculated on the 1 kb, 5 kb, and 10 kb windows upstream of annotated TSS elements via the UCSC Genome Browser Table Browser interface. Conserved elements in TSS elements were identified by intersection against phastCons conserved elements (the phastCons dataset was the same as that used for the enhancer analyses). Per-gene counts were calculated on the basis of the number of conserved nucleotides per TSS window; in cases where genes had multiple annotated TSSs, the maximum count across TSS windows was used.
Tissue-Specific EDSs
We grouped 121 of 127 human tissues and cell types from the Roadmap Epigenomics Project into 18 tissue groups based on groupings provided by the Roadmap Epigenomics Consortium (see Roadmap Epigenomics metadata in Web Resources). Six additional tissue groups (adrenal, bone, cervix, kidney, ovary, and spleen) were excluded because only one Roadmap tissue sample is available per group, which could lead to problems with high noise and a lack of enhancer sample size. A list of the groupings, including excluded samples, is available in Table S4. To identify tissue-specific enhancer elements for each tissue group, we merged all enhancers and quantified the number of additional tissues (outside of the original tissue group) where the enhancer is present, then selected enhancers active in fewer than 10 additional samples. Tissue-specific EDSs were calculated according to the same process that was followed for the organism-level EDSs. For any individual tissue-specific EDS, p values for enrichment of disease genes within the top 250 genes were calculated with Fisher’s exact test against the set of all genes. Multiple testing correction was performed with the Benjamini-Hochberg correction.
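Two steps in this paragraph lend themselves to a short sketch: the tissue-specificity filter and the Benjamini-Hochberg correction. The Python below is illustrative only; the function names and the representation of samples as sets of identifiers are assumptions.

```python
def is_tissue_specific(active_samples, group_samples, max_outside=10):
    """True if the enhancer is active in fewer than max_outside samples outside its tissue group."""
    outside = set(active_samples) - set(group_samples)
    return len(outside) < max_outside


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this setting, one p value per tissue-specific EDS and disease gene set would be passed to `benjamini_hochberg` together.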
Spatiotemporal Gene-Expression Patterns by EDS Bins
We used two sets of RNA-seq data across human and mouse tissues to assess spatiotemporal expression patterns for genes: gene expression across 57 human tissues profiled by the Roadmap Epigenomics project (downloaded from the Roadmap Epigenomics portal; see Web Resources) and mouse single-cell RNA-seq from different time points (embryonic, fetal, and adult, downloaded from Han et al., 2018)7. Because the Roadmap Epigenomics dataset samples different organs in different levels of detail (e.g., ten regions of the adult brain versus one sample each for the kidneys, spleen, and liver), we grouped tissues into 17 “tissue groups” to avoid biasing our analyses (note that for tissue-specific EDS scores, we used 18 tissue groups; the discrepancy is due to incomplete RNA-seq profiling for Roadmap tissues). Because many of the tissue groups represented adult time points, we also considered a separate set of eight embryonic tissue groups. Tissue-to-group assignments are listed in Table S4. We processed the mouse single-cell RNA-seq dataset to assign cells into 87 cell clusters from a variety of embryonic, fetal, and adult tissues. We generated a single RNA-seq expression vector per cell cluster by summing across all constituent cells, converted mouse gene expression data to human data by using Ensembl’s BioMart database (mouse gene GRCm38.p6), and for genes with multiple human orthologs, selected the ortholog with the highest gene expression. We then converted counts to reads per kilobase of transcript, per million mapped reads (RPKM) and set a minimum cut-off of 10 RPKM.
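The pseudo-bulk aggregation and RPKM conversion can be sketched as follows. The Python below is a minimal illustration, assuming per-cell counts are stored as dictionaries keyed by gene; the function names and data layout are hypothetical.

```python
def pseudobulk(cell_counts):
    """Sum read counts over all cells in one cluster.

    cell_counts: list of {gene: count} dictionaries, one per cell.
    """
    totals = {}
    for cell in cell_counts:
        for gene, count in cell.items():
            totals[gene] = totals.get(gene, 0) + count
    return totals


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript, per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def is_expressed(count, gene_length_bp, total_mapped_reads, cutoff=10.0):
    """Apply the minimum RPKM cut-off used to call a gene expressed."""
    return rpkm(count, gene_length_bp, total_mapped_reads) >= cutoff
```

As a worked example, 100 reads on a 2 kb transcript in a library of one million mapped reads gives 50 RPKM, which passes the 10 RPKM cut-off.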
Gene Expression Variability by EDS Bins
To measure the variability of gene expression, we used three datasets of gene expression across individuals: the Genotype-Tissue Expression (GTEx) v7 dataset (48 tissues with > 80 individuals each), gene expression across induced pluripotent stem cell (iPS)-derived cardiomyocytes (Knowles et al., 2017,39 42 individuals in control untreated group), and iPS-derived sensory neurons (Schwartzentruber et al., 2017,40 51 individuals). For GTEx samples, we calculated the coefficient of variation per gene (when expressed, transcripts per million (TPM) ≥1 cutoff) across individuals and aggregated across tissues by using the mean and minimum. Both the mean and minimum yielded the same trend (the EDS bin corresponding to the top 20% EDS had the greatest coefficient of variation), and we show violin plots for the mean coefficient of variation in Figure S8. We set minimum expression cut-offs of 10 RPKM for the Knowles et al. and Schwartzentruber et al. datasets (TPM data were not available), and we found the same trend as that observed in GTEx samples.
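A minimal Python sketch of the per-gene coefficient of variation follows. It makes two assumptions that the text does not settle: the expression cut-off is applied per individual, and the sample standard deviation is used. The function names are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values, min_tpm=1.0):
    """CV (standard deviation / mean) of one gene across individuals in one tissue.

    Assumption: individuals below min_tpm are dropped before computing the CV.
    Returns None when fewer than two values remain.
    """
    expressed = [v for v in values if v >= min_tpm]
    if len(expressed) < 2:
        return None
    return stdev(expressed) / mean(expressed)


def aggregate_across_tissues(cv_by_tissue):
    """Mean and minimum CV across tissues, ignoring tissues without a CV."""
    cvs = [cv for cv in cv_by_tissue.values() if cv is not None]
    if not cvs:
        return None, None
    return mean(cvs), min(cvs)
```

Because the CV is scaled by the mean, it allows variability to be compared between genes with very different expression levels.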
Proportion of eGenes by EDS Bins
We performed eQTL analyses by using processed data from 48 tissues generated by the GTEx Consortium (v7). Data were downloaded from the public GTEx portal (filename containing eQTL links for all tissues: GTEx_Analysis_v7_eQTL.tar.gz), and lists of eGenes (genes with a statistically significant eQTL association) and significant SNP-gene associations were taken from the ∗.v7.signif_variant_gene_pairs.txt.gz files. The sample sizes for the 48 tissues ranged from 80 to 491. For each tissue, we compared the number of eGenes discovered within each EDS bin to the total number of genes in the EDS bin. Top genes by EDS in Figures 3A and 3C and Figures S9A and S9C were considered to be those ranking in the top 3,000 by activity and proximity-linking, respectively.
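The per-bin comparison amounts to a simple proportion, sketched below in Python for a single tissue. The function name and the dictionary-based inputs are hypothetical.

```python
def egene_proportion_by_bin(gene_bins, egenes):
    """Proportion of genes in each EDS bin that are eGenes in one tissue.

    gene_bins: {gene: bin label}; egenes: set of eGene identifiers.
    """
    totals, hits = {}, {}
    for gene, label in gene_bins.items():
        totals[label] = totals.get(label, 0) + 1
        if gene in egenes:
            hits[label] = hits.get(label, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}
```

Running this once per tissue and then scaling each tissue to its own mean rate would make tissues with different sample sizes comparable.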
Figure 3.
Genes with High EDS and Redundant Enhancer Domains Are Resilient to Genetic Perturbations
(A) The top 3,000 EDS genes are significantly depleted of eQTL eGenes from the GTEx project. Error bars correspond to the standard deviation of depletion (normalized to the set of “all genes”) across 48 GTEx tissues with n ≥ 70 samples.
(B) eGene depletion depends on a gene’s EDS ranking; the top 20% of genes are most depleted. Error bars represent the standard deviation across 48 GTEx tissues tested. All values are scaled to the mean rate of eGenes across percentile bins for each tissue because GTEx sample sizes, and therefore statistical power, differ per tissue.
(C) Depletion of eGenes is partially independent of purifying selection. Genes with high EDS (orange) are further depleted of eQTLs across the range of constraint (pLI, left) and evolutionary-conservation (TSS conservation, right) bins. p values are from paired t-tests across 48 GTEx tissues, and error bars represent the standard deviation.
(D) Inverse relationship between EDS percentile and PrediXcan R2 for gene expression across the 48 tissues profiled by the GTEx consortium. The y-axis represents the proportion of genes with predicted gene expression R2 > 0.1; Wheeler et al., 201652 calculated R2 by using elastic net regression and local common genetic variation. Error bars represent the standard deviation across GTEx tissues.
(E) Inverse relationship between EDS percentile and median PrediXcan elastic net R2 of genes (values from Wheeler et al., 2016). Only the subset of genes with significant heritability (FDR < 0.1) was used for calculations. Error bars represent the standard deviation across GTEx tissues.
(F) Genes in the top 20% EDS bin (blue) have lower PrediXcan predicted expression R2 than all other genes (red). Density plots for all 48 GTEx tissues are available in Figures S10 and S11, and the p value is calculated from a Mann-Whitney U test for the two groups of genes (without binning genes with R2 > 0.25).
(G) Genes with high enhancer redundancy were identified by quantification of pairwise enhancer activity patterns across human tissues. Left, example gene with high enhancer redundancy. Right, example gene with low enhancer redundancy.
(H) Compared to low redundancy genes (blue), high enhancer redundancy is associated with a reduced rate of eGenes (orange). p values correspond to a paired t test between high and low Jaccard-index bins conducted across 48 GTEx tissues, and error bars represent the standard deviation. The proportion of eGene values was scaled to the mean value of all bins in Figure 3B.
To calculate pairwise Jaccard indices, we merged all linked enhancers across tissues per gene. For each linked enhancer, we constructed a binary activity vector of length 127, indicating whether the enhancer is active in each of the 127 human tissues profiled by the Epigenome Roadmap project. Activity for linked enhancers was defined as overlapping a ChromHMM-called enhancer element (15-state model) in the corresponding tissue. For each gene with more than three linked enhancers, we calculated the average pairwise Jaccard index across enhancers by using the binary activity vectors per enhancer.
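The redundancy calculation can be sketched in Python as follows. This is an illustration rather than the original code; the function names are hypothetical, and "more than three linked enhancers" is read as a minimum of four.

```python
from itertools import combinations


def jaccard(u, v):
    """Jaccard index of two binary activity vectors of equal length."""
    intersection = sum(1 for a, b in zip(u, v) if a and b)
    union = sum(1 for a, b in zip(u, v) if a or b)
    return intersection / union if union else 0.0


def mean_pairwise_jaccard(vectors, min_enhancers=4):
    """Average Jaccard index over all enhancer pairs linked to one gene.

    vectors: one binary activity vector per linked enhancer (length 127 in
    the setting described in the text). Returns None for genes with fewer
    than min_enhancers linked enhancers.
    """
    if len(vectors) < min_enhancers:
        return None
    pairs = list(combinations(vectors, 2))
    return sum(jaccard(u, v) for u, v in pairs) / len(pairs)
```

A high average indicates that a gene's enhancers tend to be active in the same tissues, which is the sense of redundancy used here.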
SNV Burden in Regulatory Regions
We compiled lists of active enhancers per tissue by using the same tissue groupings previously used for Figure S5 (groupings are listed in Table S4). For SNV burden analyses (Figure 4), we considered SNVs within the entire enhancer elements, as well as those within evolutionarily conserved regions (merger of phastCons elements across primates, vertebrates, and placental mammals). We used pre-processed genotype information from the GTEx project (dbGAP: phs000424.v7.p2) to select for rare variants. Because the GTEx sample size was modest (148 individuals), we used allele frequency information from the BRAVO database (TOPMed Freeze 5, URL in Web Resources) after mapping BRAVO variants from UCSC Genome Browser build hg38 to hg19 by using the LiftoverVCF tool in Picard tools (v2.9.0). We mapped variants with minor-allele frequency (MAF) < 0.01 to enhancer elements grouped by tissue group, and for each GTEx individual, we counted the number of rare variants per gene per tissue group for each of the four enhancer sets (activity-linked entire enhancers, proximity-linked entire enhancers, activity-linked conserved enhancer elements, and proximity-linked conserved enhancer elements).
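The rare-variant counting step can be sketched in Python for one individual, one gene, and one chromosome. The sketch assumes sorted, non-overlapping, half-open enhancer intervals and 0-based variant positions; the function name and input layout are hypothetical.

```python
from bisect import bisect_right


def count_rare_variants(variants, enhancers, maf_cutoff=0.01):
    """Count variants with MAF < maf_cutoff that fall inside enhancer intervals.

    variants: list of (position, maf) carried by one individual.
    enhancers: sorted, non-overlapping half-open (start, end) intervals
    linked to one gene on the same chromosome.
    """
    starts = [start for start, _end in enhancers]
    n_rare = 0
    for position, maf in variants:
        if maf >= maf_cutoff:
            continue  # common variant, not part of the burden
        i = bisect_right(starts, position) - 1
        if i >= 0 and position < enhancers[i][1]:
            n_rare += 1
    return n_rare
```

Repeating this for each of the four enhancer sets and each tissue group would yield the per-gene, per-individual burden counts described above.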
Figure 4.
Expression of High-EDS Genes Is Less Affected by Rare Non-Coding Genetic Variation
(A) Overview of framework for conducting rare SNV burden analyses. In brief, the number of rare SNVs (MAF < 0.01) within enhancer elements active in a tissue are counted and compared against the allele-specific expression of the gene. All analyses in Figure 4 were performed with enhancers assigned to genes by activity linking. Proximity-linking results are presented in Figure S21.
(B) Mean number of rare SNVs per gene across individuals. Analyses in (C)–(F) were performed separately for genes where mean burden across individuals was less than or greater than 1.
(C and D) A greater burden of rare SNVs is associated with higher rates of allele-specific expression (ASE) events for low-count genes (C) and high-count genes (D). Enhancers linked to genes by the activity-based linking method were used. The curve corresponds to the Loess curve across GTEx tissues, and the shaded region represents the 95% CI.
(E and F) Genes with high EDS scores (orange, top 3,000 by EDS) have lower rates of ASE events than genes with low EDS (all genes outside top 3,000, blue) at the same rare variant cut-off numbers, for both low-count genes (E) and high-count genes (F). ASE values are scaled to the set of all genes (C and D). The curve corresponds to the Loess curve across GTEx tissues, and the shaded region represents the 95% CI.
Next, we linked the SNV burden to allele-specific expression of genes. Allele-specific expression information per gene, per tissue, and per individual was obtained from the GTEx project (dbGAP: phs000424.v7.p2). For each tissue sampled from each GTEx individual, we filtered the processed allele-specific expression data to create (1) a tissue-specific list of genes with significant allelic expression (adjusted p value < 0.05), and (2) a tissue-specific background list of all genes tested for allelic expression (i.e., containing a heterozygous SNP in that individual). Because we wanted to assess whether a high burden of SNVs in linked enhancers was associated with allelic expression, we matched enhancer tissue groups to the corresponding tissues profiled by the GTEx project (see Table S4 for list). In total, we matched 37 GTEx tissues to 14 different enhancer groupings from the Roadmap Epigenomics project.
We first noticed that a small number of GTEx individuals had significant allelic expression for a very large number of genes (e.g., one individual had allele-specific expression for 9,462 genes in skin, but allele-specific expression for <1,000 genes in all other tissues). We hypothesized that this could be due to sequencing or data-processing artifacts. For each GTEx tissue, we therefore removed the ten individuals with the greatest number of genes showing significant allelic expression. A small number of genes also consistently showed allele-specific expression across a large number of individuals. A literature search revealed that many of these are imprinted genes (e.g., MEG3, PLAGL1, or L3MBTL1) or histocompatibility leukocyte antigen (HLA) genes. To avoid potentially confounding signals, we therefore removed genes listed in the Catalogue of Parent of Origin Effects (see Web Resources) and genes identified in a global survey of imprinting (Baran et al., 201541). The full list of 109 genes excluded from allele-specific expression analysis is listed in Table S7. Finally, we also removed samples marked as outliers by the GTEx project (see Web Resources).
In considering the distribution of regulatory SNV burdens per gene (e.g., Figure 4B and Figure S15A), we realized that genes with high and low average SNV burdens would need to be analyzed separately. In genes with low average SNV burden (“low-count genes”, Figure 4), most individuals have zero or one rare SNV in a linked enhancer. In these genes, the raw number of rare SNVs directly reflects the relative burden (e.g., observing five SNVs is high if most other individuals have <1). In contrast, the mean SNV burden for “high-count genes” has a wide spread, and the raw SNV counts need to be transformed to z scores before analysis (e.g., observing five SNVs in an individual could be either high or low, depending on whether the mean SNV count across individuals is 1.5 or 20). After we split genes into high- and low-count groups (mean SNV count > 1 and < 1, respectively), we calculated the proportion of gene-by-individual pairings with allele-specific expression at each SNV burden cut-off in each tissue. Because the proportion of genes showing allele-specific expression per tissue is different, we merged tissues together to generate the plots shown in Figure 4C–4F and Figures S15B–S15E after we had scaled each tissue to the allele-specific expression rate in all genes (“0+” bin for low-count genes and “All” for high-count genes). Loess curves and 95% confidence intervals (CIs) were generated with the geom_smooth function in ggplot2.
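The split into low- and high-count genes and the z-score transformation can be sketched as below. The Python is illustrative and rests on two assumptions not fixed by the text: a gene with a mean of exactly 1 is treated as low-count, and the population standard deviation is used. The function names are hypothetical.

```python
from statistics import mean, pstdev


def classify_gene(counts):
    """Label a gene by its mean rare-SNV count across individuals."""
    return "high" if mean(counts) > 1 else "low"


def burden_scores(counts):
    """Per-individual burden for one gene.

    Low-count genes keep raw counts; high-count genes are converted to
    z scores so that burdens are comparable between genes.
    """
    if classify_gene(counts) == "low":
        return list(counts)
    mu, sd = mean(counts), pstdev(counts)
    return [(c - mu) / sd if sd else 0.0 for c in counts]
```

Under this sketch, five SNVs in a gene averaging 1.5 per individual yields a large positive z score, whereas the same count in a gene averaging 20 yields a negative one.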
PrediXcan Prediction Performance in GTEx by EDS Bins
To supplement the eQTL/eGene analyses, we also assessed whether the coefficient of determination R2 of the fitted elastic net regression of gene expression on cis-genotypes in GTEx individuals was dependent on a gene’s EDS bin. We obtained R2 values from PrediXcan trained on GTEx v7 (available at the PredictDB Data Repository under the “download-by-tissue” folder; see Web Resources) and used the “pred.perf.R2” variable in the gtex_v7_[tissue]_imputed_europeans_tw_0.5_signif.db files accessed with the RSQLite package in R.
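Reading the R2 values can be sketched in Python with the standard sqlite3 module in place of RSQLite. The table name `extra` and the `gene` column are assumptions about the database layout and should be checked against the downloaded file; only the `pred.perf.R2` variable is named in the text. The function names are hypothetical.

```python
import sqlite3


def read_r2(conn, table="extra"):
    """Return {gene: R2} from an open PredictDB-style SQLite connection.

    Assumption: the table is named 'extra' and has a 'gene' column.
    The dotted column name must be quoted in SQL.
    """
    query = f'SELECT gene, "pred.perf.R2" FROM {table}'
    return {gene: r2 for gene, r2 in conn.execute(query)}


def proportion_above(r2_by_gene, genes, cutoff=0.1):
    """Proportion of the given genes (with a model) whose R2 exceeds cutoff."""
    values = [r2_by_gene[g] for g in genes if g in r2_by_gene]
    if not values:
        return None
    return sum(v > cutoff for v in values) / len(values)
```

A connection would be opened with `sqlite3.connect(path)` for each tissue file, and `proportion_above` would then be evaluated separately for the genes in each EDS bin.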
GWAS Analyses
From the European Bioinformatics Institute (EBI) GWAS catalog,42 we identified ten GWASs that had >75 loci and covered traits with a range of ages of onset and affected tissues. In the first part of our analysis (Figures 5A–5E), we defined GWAS candidate genes as those that the GREAT tool (v3.0.0) had identified as being the single nearest gene to the lead SNP at each locus.33 We identified eQTL target genes of these same loci by using eQTL data from the GTEx project (v7) and selecting significant eGenes by using the ∗_v7.signif_variant_gene_pairs.txt files. We identified eGenes by using eQTLs present in any tissue (main analyses), as well as only those in a trait-relevant tissue (used in Figure S15). Significant transcriptome-wide association study (TWAS) genes for each trait were downloaded from Mancuso et al., 2017.43 GO enrichments for all gene sets were performed with DAVID v6.8 (see Web Resources).
Figure 5.
Candidate Causal Genes at GWAS Loci Have High EDS and Are Different from eQTL Targets
(A) Candidate GWAS genes from ten GWASs are enriched for high EDS genes. Candidate GWAS genes were identified by selection of the single nearest gene to the lead SNP at each GWAS locus. p values are from Fisher’s exact test of the proportion of high-EDS genes within candidate GWAS genes compared to all genes.
(B) eQTL targets at GWAS loci have lower EDS scores than the nearest genes. Comparison of the proportion of high-EDS genes within the set of eQTL target genes at GWAS loci versus the nearest genes. p values are from Fisher’s exact test of the proportion of high-EDS genes in each gene set.
(C) Example comparison of the entire distribution of EDS scores for the nearest genes versus eQTL targets for one GWAS (resting heart rate). The p value is from a Mann-Whitney U test of EDS scores for genes in each gene set.
(D and E) Significantly enriched GO biological process terms for sets of nearest genes, eQTL target genes, and TWAS significant genes for two GWASs (height [D] and resting heart rate [E]). A full list of all enriched GO terms for the ten considered GWASs is available in Table S5.
(F) Literature-based causal genes at GWAS loci are mostly not eQTL targets. eQTLs were identified from the GTEx project and selected from any available tissue. The best eQTL gene was selected on the basis of the strongest p value. A list of loci and correct assignments is available in Table S5.
(G) Literature-based causal GWAS genes have higher EDS scores than eQTL targets of the GWAS loci. The p value is from a Mann-Whitney U test.
To compile a list of candidate literature-based causal genes at GWAS loci (Figures 5F and 5G), we collected putative causal genes from four sources (genes are listed in Table S5). (1) For the original set, we scanned the literature for instances where loci were near a corresponding Mendelian disease gene, where the locus was studied in a focused experimental study, or where the locus was near a gene targeted by a therapeutic for the same disease. For each lead SNP in these loci, we identified all SNPs in strong linkage disequilibrium (r2 > 0.8, European cohort from 1000 Genomes Project) and removed the subset of loci that overlapped a protein-coding exon in the putative causal gene (note that loci overlapping other non-causal genes were not removed). After filtering, we were left with 68 literature-based causal genes at 82 GWAS loci. (2) Metabolic GWAS is a recent preprint wherein Ndungu et al. reported collecting putative causal genes from a GWAS on 46 circulating metabolites.44 (3) UKBB GWAS is a variety of GWASs performed on UK Biobank traits. In this study, putative causal genes were manually curated and assigned on the basis of relevant nearby Mendelian traits and biochemical pathways; we accessed these data on September 25, 2019. (4) Lastly, ProGeM metabolite causal genes were downloaded from Stacey et al.45
Results
Genes with Large Functional Regulatory Domains Are Associated with Developmental Diseases
Our study rests on the hypothesis that the non-coding transcriptional regulatory landscape of a gene reflects properties of the gene itself. We therefore sought to construct an enhancer regulatory score for human genes and assess its ability to prioritize genes important in human disease (Figure 1A). Determining which cis-acting regulatory elements are associated with which target genes remains an outstanding problem in the field of genomics and gene regulation.24,26 Although many approaches have been developed to address this challenge, most methods are currently constrained to a limited number of enhancer elements or tissue types. To generate a genome-wide compendium of transcriptional enhancer elements and their target genes, we initially relied on predictions (“activity-linking,” Figure S1) that Ernst et al. and Liu et al. developed to quantify the correlation between predicted enhancer activity and gene expression from 127 human tissues profiled from the Roadmap Epigenomics Project (Figure S1).5,31
Under the shadow enhancer model, developmental genes in Drosophila have multiple redundant enhancers to buffer against deleterious regulatory mutations.2,3 We therefore hypothesized that human genes with more linked enhancers are more important in mammalian development and, consequently, human disease. To test this, we ranked genes by metrics that reflected the size of their transcriptional regulatory elements:
(1) The number of discrete enhancer elements linked to a gene
(2) The total number of nucleotides within all linked enhancers
(3) Within all linked enhancers, the total number of evolutionarily conserved nucleotides (the union of conserved elements across vertebrates, placental mammals, and primates; see Figure S2)
(4) Within all linked enhancers, the total number of nucleotides that the UniBind database predicts as being in cis-regulatory modules38
(5) Within all linked enhancers, the total number of nucleotides annotated by the UniBind database as being in TF binding sites38
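To make the first three metrics concrete, the following Python sketch computes them for a single gene from genomic intervals. It is an illustration under stated assumptions rather than the authors' implementation: coordinates are half-open `(start, end)` pairs on one chromosome, the conserved elements are assumed to have been pooled across conservation tracks beforehand, and all function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def total_length(intervals):
    """Number of distinct nucleotides covered by the intervals."""
    return sum(end - start for start, end in merge(intervals))

def overlap_length(a, b):
    """Number of nucleotides covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def enhancer_metrics(enhancers, conserved):
    """Metrics (1)-(3) for one gene's linked enhancers."""
    return {
        "n_enhancers": len(enhancers),                        # metric 1
        "total_nt": total_length(enhancers),                  # metric 2
        "conserved_nt": overlap_length(enhancers, conserved)  # metric 3
    }

enhancers = [(100, 300), (500, 650)]
conserved = [(250, 520), (600, 700)]
print(enhancer_metrics(enhancers, conserved))
# {'n_enhancers': 2, 'total_nt': 350, 'conserved_nt': 120}
```

Metrics (4) and (5) would follow the same pattern, with UniBind annotations taking the place of the conserved elements.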
Strikingly, genes highly ranked by these metrics are indeed significantly enriched for haploinsufficient genes (Figure S3), developmentally important genes in mice (Figure 1B), and genes deposited in the OMIM database and linked to human disease (Figure S3), thus suggesting a general biological principle whereby developmentally important genes have larger functional regulatory domains. Among the five metrics tested, the total number of conserved nucleotides within all linked enhancers (metric #3) was consistently the most enriched metric for disease-relevant genes (Figure 1B and Figure S3).
Genes linked to a high number of conserved enhancer nucleotides are consistently enriched for many Mendelian disease-relevant gene sets, including genes in OMIM, DDG2P genes,46 and genes with a “likely pathogenic” or “confirmed pathogenic” ClinVar variant (Figure 1C). These genes are also substantially more likely to be haploinsufficient and lead to embryonic lethality when knocked out in mice. This enrichment persists over a range of cut-offs and is strongest for genes with the greatest number of conserved regulatory nucleotides (Figures 1D and 1E). Furthermore, consistent with previous reports that mutations in olfactory receptors are tolerated in humans and do not lead to developmental disease,47 we note a depletion of olfactory receptor genes. We also took DDG2P genes reported to be associated with developmental disease and categorized them by the affected organ, and we observed that highly conserved enhancer nucleotide genes are enriched for developmental diseases affecting a wide range of human organs (Figure 1F),46 indicating that the relationship is not driven by any individual tissue but might represent a fundamental biological principle.
Although genes associated with human disease are enriched in non-evolutionary conservation gene sets (e.g., genes with high enhancer number, metric #1 above, or high total enhancer nucleotide number, metric #2), enrichment is greatest when the number of conserved enhancer nucleotides is counted. We therefore questioned whether this relationship simply reflects evolutionary conservation instead of regulatory architecture and possible shadow enhancer-like functions. Using the same fixed TSS window size for all genes, we first considered the relative evolutionary conservation of gene TSSs (Figure S4). Although greater TSS conservation is indeed associated with enrichment for genes associated with development and Mendelian disease, the combination of enhancer domain size and TSS conservation substantially outperforms TSS conservation alone (Figure 1G and Figure S5).
As a second test, we fixed the percentage of evolutionarily conserved nucleotides within enhancers while varying the total amount of these conserved enhancer nucleotides. We reasoned that purifying selection might manifest to increase the percentage of conserved nucleotides within enhancer regions, whereas other regulatory architecture properties might increase the absolute number of conserved nucleotides. We show that even when the proportion of conserved nucleotides in enhancers is held constant, an increase in total nucleotide count in the implicated regulatory sequences enriches for genes associated with Mendelian disease (Figure S6). Together, these results suggest that evolutionary conservation does not fully explain the observed relationship and that non-coding transcriptional regulatory architecture reflects gene function.
To control for possible artifacts from the enhancer-gene linking approach used, we repeated the analyses presented above by using four additional enhancer-gene linking approaches: (1) JEME34 and FOCS,35 two methods that also consider the correlation between enhancer activity and nearby gene expression; (2) “proximity-linking,” an independent approach that assigns enhancers to their nearest gene by using the GREAT algorithm and does not consider enhancer activity patterns or gene expression33 (Figure S1D); and (3) promoter capture Hi-C, a method for experimentally identifying genomic regions that physically interact with promoters across 27 human tissues and cell types.36 In all cases, we observed that genes with large enhancer domains are enriched for the disease-related gene sets and that the greatest enrichment was observed for genes that rank highly under the proximity-linking EDS (Figures S7 and S8). These results indicate that the enrichment observed for activity-linking EDS is not due to artifacts from the activity-correlation approach used above.
Tissue-Specific Enhancer Nucleotide Counts Reflect Disease Phenotypes
The activity of an enhancer is often restricted to a small set of tissues. We therefore reasoned that the number of functional nucleotides in tissue-specific enhancers could predict the affected tissues for genes associated with developmental disease. To quantify the size of tissue-specific enhancer domains, we split the tissues into 18 “tissue groups” that reflected different organs of the body (see Table S4 for groupings). We then counted the number of conserved regulatory nucleotides for tissue-specific enhancers present in each of the 18 tissue groups (Figure S9A, see Methods). Because any individual tissue group will have fewer linked enhancers than the full set of 127 tissues, we note that a median of 2,354 genes in the 18 tissue groups had a non-zero count for tissue-specific conserved enhancer nucleotides, in contrast to 19,038 genes that had a non-zero count for tissue-specific conserved enhancer nucleotides when we summed across all tissues (Figures S9B and S9C). Given the greater influence of noise, we therefore only considered the top 250 genes (top ~10%) ranked by the number of conserved enhancer nucleotides per tissue group as a proof-of-concept analysis.
To assess whether the tissue-specific enhancer domains can be informative for predicting phenotypes in developmental disease, we considered the affected organs for diseases involving the top 250 genes for each tissue-specific enhancer domain. We relied on annotations from the DDG2P database, which includes clinician-curated information on the organ specificity of developmental diseases.46 We observed that genes involved in diseases affecting different organs are enriched for having large tissue-specific enhancer domains in the corresponding tissue. For example, genes with associated musculature phenotypes are significantly more likely to rank highly by number of enhancer nucleotides in skeletal muscle-related tissues than are genes that do not have associated musculature phenotypes. Additionally, genes with associated “heart/cardiovascular/lymphatic” phenotypes are significantly more likely to rank highly by enhancer nucleotides in heart- and fat-specific tissues (Figure S9D) than are genes without such phenotypes. These results offer a demonstration that the count of functional nucleotides within tissue-specific enhancers can reflect individual organs affected by genes associated with developmental disease.
Creation of an EDS and Comparisons to Population-Based Metrics of Gene Constraint
We have shown that a variety of metrics reflecting the size of enhancer domains can enrich for genes associated with developmental disease. Next, we combined 108 features related to these enhancer domain properties; such features included the number of enhancers, number of nucleotides (conserved and total), number of redundant TF motifs, proportion of conserved nucleotides, and nucleotide counts per human tissue group. We trained an elastic net classifier by using these 108 enhancer features to predict whether a gene is involved in human development and disease (i.e., whether it is present in OMIM, ClinVar, DDG2P, or databases of autosomal-dominant or haploinsufficient genes or whether it leads to embryonic lethality when knocked out in mice; Figure 2A, EDS distribution in Figure S10). We refer to the prediction output of this classifier as the EDS. On a held-out test set (6,096 genes in the test set, 14,224 genes in the training set), the EDS predicts genes involved in human disease and embryonic lethality with area under the receiver operating characteristic curve (AUROC) ~0.66 and area under the precision-recall curve (AUPR) ~0.37 (Figure 2B). This is comparable to the values we obtained by using constraint-based metrics of gene essentiality as predictors; these metrics, which include pLI, RVIS, and LOEUF (the loss-of-function observed/expected upper bound fraction), reflect the absence of deleterious variants within 60,000–150,000 human exomes (Figure 2B and Figure S11).
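For readers less familiar with the AUROC, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen positive gene receives a higher score than a randomly chosen negative gene, with ties counted as one half. This quantity equals the Mann-Whitney U statistic divided by the product of the two group sizes. The function name and toy scores are illustrative assumptions; the pairwise loop is quadratic and is meant for clarity, not for large gene sets.

```python
def auroc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores: predicted scores (e.g., a gene-level score such as the EDS).
    labels: truthy for positives (e.g., disease genes), falsy for negatives.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Each positive-negative pair scores 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
print(round(auroc(toy_scores, toy_labels), 3))  # 0.833
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported value of ~0.66.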
When individual gene sets are considered, EDS performs slightly worse than pLI, LOEUF, and RVIS for prediction of genes that lead to embryonic lethality in mice after knockout (AUROC of 0.67 versus range of 0.67–0.72 for pLI, LOEUF, and RVIS; Figure S11A), consistent with the latter constraint metrics’ directly reflecting selection against deleterious LoF variants, and EDS performs slightly better for the prediction of whether genes listed in OMIM are associated with human disease (AUPR of 0.61 versus range of 0.48–0.56 for pLI, LOEUF, and RVIS, Figure S11D).
We find that although pLI, RVIS, and EDS prioritize overlapping sets of genes (Figure 2C), the EDS adds value beyond constraint-based gene metrics. For example, genes with high coding constraint but poor EDS (high pLI or high RVIS, low EDS) are enriched for core housekeeping cellular functions, including mRNA processing, DNA replication, and cell division (Figure 2D and Table S2). These genes are essential to cellular function but are not involved in a specific human developmental process. In contrast, the opposite gene set (low pLI or low RVIS, high EDS) is enriched for genes involved in GO categories related to pattern specification, embryonic development, and organ development (Figure 2D and Table S3). Because EDS is trained with a supervised learning approach on disease and developmental gene sets, we repeated these comparisons by using only the count of conserved nucleotides and observed the same result (Figure S12). As a particularly striking example of the discrepancy between EDS and constraint-based metrics, we noticed that genes with high EDS but poor pLI or RVIS are significantly enriched for the “Homeobox” UniProt category (Figure 2E, false discovery rate (FDR) < 1.2 × 10−37), including members of the HOX, IRX, NKX, and PITX families (Table S3). Many of these genes have well-documented roles in congenital disease. Examples include NKX2-5, involved in congenital heart disease; HOXD13, involved in synpolydactyly; and PITX1, involved in limb malformations48, 49, 50 (Figure 2E). In aggregate, these 100 homeobox genes have a median pLI of 0.40 and a RVIS percentile of 48%, which is dramatically weaker than the commonly applied cut-offs of 0.90 for pLI and 20% for RVIS, illustrating that these are not borderline genes that were narrowly missed by pLI and RVIS (Figure 2E).
We investigated the reason that many critical developmental TFs and other genes associated with developmental disease would rank poorly by pLI and RVIS, and we noticed that gene length is a major factor behind this discrepancy (Figure 2F). In particular, 65% of all human genes have coding sequences less than 2 kb in length (~667 amino acids, Figure 2H). These short genes have 3.1-fold and 4.4-fold fewer pLI and RVIS-defined highly constrained genes than do genes with coding sequences longer than 2 kb (p = 4.44 × 10−272 for pLI, p < 2.2 × 10−308 for RVIS, Fisher’s exact test). This is most likely due to the increased statistical power to identify a depletion, in comparison to expectations, of LoF and missense variants for long genes. This length-dependent effect is especially pronounced for genes with coding sequences less than 1 kb in length, where only 5.0% and 1.3% of genes reach the pLI and RVIS intolerance thresholds, respectively; for long genes (coding sequence [CDS] length > 2 kb), these percentages are 35.6% and 46.2%, respectively. In contrast, we note that the calculation of EDS does not incorporate any gene body properties, and consequently, although EDS is also associated with gene length (Figure 2G), the relationship is attenuated in comparison to pLI and RVIS (11.8% above threshold for CDS length < 1 kb; 25.2% above threshold for CDS length > 2 kb).
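The Fisher's exact tests reported here compare proportions in a 2 × 2 table (for example, constrained versus unconstrained genes among short versus long genes). A minimal two-sided implementation is sketched below. The function name and the toy table are illustrative assumptions, and the study's own counts are not reproduced. Extremely small p values such as those reported above would underflow ordinary floating point in this simple version.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))

# Toy table: 3 of 4 genes in group one and 1 of 4 in group two are "hits".
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```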
Furthermore, we observed that the predictive accuracies of pLI, LOEUF, and RVIS are influenced by gene length: when restricted to short genes (e.g., CDS length < 1 kb and < 2 kb), EDS performs substantially better than constraint-based metrics (Figures S11B, S11C, S11E, and S11F). To further investigate the effect of gene length on the prediction of disease genes, we compared the enrichment of OMIM genes for different gene sets from Figure 2C. For shorter genes (coding sequence < 2 kb), EDS is the only informative metric for OMIM gene enrichment (group “C” versus groups A, B, and D in the middle panel of Figure 2I), whereas for long genes, the greatest enrichment of OMIM genes is observed for the set of genes that rank highly by all three metrics (group D in the right panel of Figure 2I). Collectively, these results demonstrate that the EDS is an informative metric for prioritizing genes associated with Mendelian disease and that the EDS is complementary to population-based constraint metrics such as pLI and RVIS. Of particular importance to interpreting variation in patient genomes, the EDS metric provides an approach for recognizing disease-causing genes that are too small for intolerance and constraint scores that are based on currently available population genetic data to be able to provide sufficient guidance about pathogenicity.
Gene Expression Properties Associated with the EDS
Given the complexity of enhancer-gene regulatory processes, the strong relationship between EDS and disease relevance is probably due to a combination of many factors, including that genes with larger regulatory domains (1) have more complex spatiotemporal gene expression patterns, (2) are more resistant to environmental perturbations, and (3) are more resistant to perturbation from genetic variants.
Indeed, in considering expression patterns across adult human tissues, embryonic human tissues, and mouse single-cell clusters, we observed that genes with large enhancer domains are less likely to be ubiquitously expressed and are instead more likely to have expression in an intermediate number of tissues (Figure S13). This suggests that the EDS is in part connected with a gene’s spatiotemporal gene-expression pattern. We also investigated whether genes with greater EDS are more resistant to environmental perturbations. Using data of gene expression across human individuals from the GTEx Consortium51 (range: 70 to 491 individuals per tissue, 48 tissues), we surprisingly observed that stability in gene expression is inversely correlated with the size of the enhancer domain and that genes with the largest regulatory domains have the most variable expression across individuals (Figure S14A). To rule out the possibility that this is due to adult post-mortem time points collected by GTEx, we show that the same relationship exists in panels of human iPS-derived cardiomyocytes (42 individuals39) and iPS-derived sensory neurons (51 individuals40) (Figures S14B and S14C).
Resilience against Genetic Perturbations
We also investigated whether EDS is related to the shadow enhancer principle, which would predict that high EDS genes are more resistant to expression perturbation from genetic variants. We considered two forms of genetic variation: (1) common variation in the form of eQTLs that in consequence show allele-dependent differences in expression across human individuals,51 and (2) gene-expression perturbation from rare variation identified with matched whole-genome sequencing and RNA-seq. Under a null model with no regulatory buffering, we would expect that high EDS genes are more likely to be gene targets of eQTLs (“eGenes”) because of their larger regulatory elements that have more opportunities to overlap disruptive genetic variants. Indeed, we observed that genes not linked to any enhancers are significantly depleted of eGenes (Figure S15). However, among genes linked to at least one enhancer, we observed that high-EDS genes are ~20% less likely to be eGenes than all genes are (p = 8.65 × 10−23, paired t test across 48 GTEx tissues, Figures 3A and 3B). We also observed this effect when we grouped genes by the total number of conserved enhancer nucleotides (Figure S16A) and the total number of discrete enhancer elements (Figure S16B) by using either enhancer-promoter linking method. Furthermore, the depletion of eGenes is consistently observed in all profiled GTEx tissues (Figure S16C). We observed a similar trend for cell lines resembling earlier developmental time points when we used eQTL data from iPS-derived cardiomyocytes and iPS-derived sensory neurons (Figure S17), indicating that this trend is not specific to the adult developmental time points profiled by the GTEx project. 
We also compared the coefficient of determination for predicting gene expression across GTEx individuals by using local, common genetic variation (i.e., elastic net R2 in PrediXcan52), and we observed that high-EDS genes have significantly weaker R2 values than all genes combined (Figures 3D–3F and Figures S18 and S19). Together, these results suggest that a high regulatory nucleotide count buffers a gene’s expression against the effects of genetic variation.
We hypothesize that the inverse relationship between EDS and eQTL presence is driven by a combination of (1) greater purifying selection acting on high-EDS genes and (2) greater regulatory redundancy available when genes are linked to more cognate enhancer sequences. Indeed, consistent with a contribution from purifying selection, genes with a high pLI are depleted of eQTLs (Figure 3C). However, the incorporation of enhancer domain characteristics to gene sets with high pLI or high TSS conservation causes a further depletion of eGenes (Figure 3C). For example, among genes with high pLI or genes with low pLI, those with high conserved enhancer nucleotide count are further depleted relative to those with low conserved enhancer nucleotide count. Similar patterns were observed for genes with high and low conservation in their TSS regions. In addition, consistent with a role for regulatory redundancy, we observed a depletion of eGenes when we binned genes by the number of discrete enhancer elements (Figure S16B). To further pursue this direction, we created an additional metric of enhancer redundancy by scoring genes on the basis of the pairwise Jaccard index for enhancer activity across 127 human tissues. Consistent with our hypothesis, we observed that genes with a high pairwise Jaccard index (i.e., the gene has multiple enhancers with similar activity patterns across tissues) are less likely to be eGenes, across a wide range of EDS bins (Figure 3H for EDS bins and Figures S16D and S16E for enhancer conserved nucleotide count bins). Taken together, these results suggest a combined influence from regulatory buffering and purifying selection on reducing the likelihood of eQTL linkages for high-EDS genes.
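The redundancy metric based on the pairwise Jaccard index can be sketched as follows. The Jaccard index of two enhancers is the number of tissues in which both are active divided by the number of tissues in which either is active. How the pairwise values were summarized per gene is not specified in this section, so taking the mean over all enhancer pairs is an assumption, as are the function names and the toy tissue sets.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two collections of tissues (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_jaccard(active_tissues_by_enhancer):
    """Mean Jaccard index over all pairs of enhancers linked to one gene.

    active_tissues_by_enhancer: {enhancer id: tissues where it is active}.
    Returns None when the gene has fewer than two linked enhancers.
    """
    tissue_sets = list(active_tissues_by_enhancer.values())
    pairs = list(combinations(tissue_sets, 2))
    if not pairs:
        return None
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)

gene_enhancers = {
    "enh1": {"heart", "muscle", "liver"},
    "enh2": {"heart", "muscle"},
    "enh3": {"brain"},
}
print(round(mean_pairwise_jaccard(gene_enhancers), 3))  # 0.222
```

A gene whose enhancers are all active in the same tissues scores close to 1, which matches the intuition of multiple enhancers with similar activity patterns.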
Because analyses relying on eQTLs are influenced by purifying selection, we also considered the impact of rare variation on gene expression. A prediction from the regulatory buffering model is that genes associated with disease have larger enhancer domains and will be protected from rare non-coding point mutations. To test this prediction, we quantified the relationship between the burden of rare non-coding SNVs in regulatory regions and perturbed gene expression, as measured by the presence of allele-specific expression. We hypothesized that although environmental perturbations would often lead to bi-allelic changes in expression, the genetic perturbation of a cis-acting regulatory element should lead to a mono-allelic change in gene expression. These mono-allelic expression patterns can be identified by the detection of allele-specific expression on the basis of the imbalance of transcript expression in individuals with a heterozygous exonic variant.
We developed a framework to compare the burden of rare SNVs (MAF < 0.01) in enhancer elements linked to genes against the allele-specific expression rate (Figures 4A and 4B and Table S4). Involving 148 individuals profiled by both WGS and RNA-seq across 48 tissues (GTEx v7), the GTEx project offers an ideal dataset on which to apply this framework. Consistent with expectations, we observed that an increased non-coding-SNV burden is associated with greater allele-specific expression rates in a dose-dependent manner (Figures 4C and 4D, Figures S21A–S21C; see Methods for details on analysis). Moreover, this expression perturbation is dependent on a gene’s EDS: high-EDS genes are less likely to be disrupted by a high SNV burden, indicating that in nearby high EDS genes, a greater SNV burden is needed to achieve the same rate of transcriptional disruption as in low EDS genes (Figures 4E and 4F and Figures S21D and S21E). In summary, these results indicate that high-EDS genes, which are enriched in human disease, are more resistant to regulatory perturbation.
As a control, we shifted the position of the linked enhancers used in our analyses by 200 kb and 500 kb both upstream and downstream of their original positions, to maintain the total number of regulatory nucleotides linked to each gene. In all cases (different shifting windows, gene sets tested, and linking methods), the SNV burden in the shifted enhancers was more weakly associated with the allele-specific expression rate than in the true set of linked enhancers (Figure S22). Together, these results indicate that a high SNV burden in enhancers is more likely to result in allele-specific expression of the linked gene and suggest that quantifying regulatory burden in disease studies can aid in identification of potential causal genes.
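The logic of the shifted-enhancer control can be illustrated with a short sketch: every linked enhancer is moved by a fixed offset, which preserves the number and length of regulatory intervals, and the SNV burden is then recounted. The function names, the half-open coordinates on a single chromosome, and the toy positions are assumptions for illustration; chromosome boundaries are not handled.

```python
def shift_enhancers(enhancers, offset):
    """Shift every (start, end) interval by the same offset in base pairs."""
    return [(start + offset, end + offset) for start, end in enhancers]

def snv_burden(enhancers, snv_positions):
    """Count SNVs that fall inside any enhancer interval."""
    return sum(
        any(start <= pos < end for start, end in enhancers)
        for pos in snv_positions
    )

enhancers = [(1_000, 1_500), (4_000, 4_200)]
rare_snvs = [1_100, 1_499, 4_100, 201_200, 900_000]

print("true burden:", snv_burden(enhancers, rare_snvs))
# Offsets of 200 kb and 500 kb in both directions, as in the control.
for offset in (-500_000, -200_000, 200_000, 500_000):
    shifted = shift_enhancers(enhancers, offset)
    print(offset, snv_burden(shifted, rare_snvs))
```

Because shifting leaves interval lengths unchanged, any difference in association with allele-specific expression reflects the position of the intervals rather than the amount of sequence surveyed.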
EDS Framework Complicates eQTL Prioritization of Genes at GWAS Loci
Finally, our observations that high-EDS genes are enriched in their relevance in Mendelian disease but depleted of eQTLs led us to investigate the common practice of using eQTL overlap to prioritize causal genes associated with complex disease from GWASs. Because most GWAS loci are non-coding and are believed to influence gene expression, many investigators have used eQTL overlap with and without co-localization of causal variants as a method of prioritizing causal genes at individual GWAS loci53, 54, 55, 56, 57, 58, 59, 60, 61, 62 and have integrated eQTL and GWAS data to perform TWASs to discover causal genes genome-wide.43,63,64 In particular, because many genetic loci have eQTL links to a gene, eQTL overlap has become common practice for causal gene prioritization and is applied without co-localization in 70% of GWASs.65
We considered loci on the basis of the diversity of affected organ systems and the large number of significant genetic loci implicated in each of ten GWASs (>75 loci per study; studies listed in Table S5).53, 54, 55, 56, 57, 58, 59, 60, 61, 62 Consistent with the results we obtained for Mendelian diseases (e.g., Figures 1D and 1G), we observed that in nine of the ten GWASs tested, candidate GWAS genes (initially defined as the single nearest gene to each “lead SNP,” commonly selected as the SNP with the strongest p value per locus) are significantly enriched for high-EDS genes (Figure 5A). Because EDS is trained in part with genes associated with Mendelian disease, we replicated these analyses by using only the size of the enhancer domains (conserved enhancer nucleotide count from activity linking; Figure S23). These results confirm that genes involved in complex human diseases have a high EDS and large enhancer domains.
Although applying eQTL information to prioritize biologically relevant causal genes is appealing, our data indicate that (1) GWAS genes have a higher-than-normal EDS (Figure 5A) and (2) genes with high EDS are less likely to be eGenes (Figure 3). Furthermore, GWAS loci tend to have more SNPs in linkage disequilibrium,16 leading to a greater likelihood that the locus is also associated with additional eQTL regulatory signals unrelated to the trait of interest. When considered together, these observations set a prior expectation that eQTLs at GWAS loci are more likely to point to genes that have a low EDS and are not associated with disease, therefore complicating the use of eQTL data for the identification of disease-associated genes.
To test this possibility, we first compared the EDSs of eQTL target genes at GWAS loci against the EDSs of the nearest genes. Assigning the nearest gene is the default but imprecise approach for linking genes to each GWAS locus, so this comparison is made against a widely recognized poor benchmark for causal gene assignment.27,29 Consistent with our hypothesis, the EDSs of eQTL target genes were significantly lower than those of the nearest genes (n-fold difference where n < 1 for all ten GWASs, p < 0.05 for eight out of ten GWASs, Mann-Whitney U test). This trend persisted when only eQTLs identified in specific disease-relevant tissues were considered, and also when enhancer domain size was used instead of EDS (Figures S24 and S25). These data indicate that eQTLs have a tendency to implicate low-EDS genes with small enhancer domains (Figures 5B and 5C).
The discrepancy between the nearest gene targets and eQTL targets can be due to two effects:
(1) Current eQTL studies often implicate incorrect, non-causal genes. Detecting true causal genes requires greater eQTL sample sizes because these genes have higher EDSs.
(2) Current eQTL studies tend to implicate the correct causal genes. These causal genes currently identified by eQTLs are biased toward low EDSs because these are most easily identified at current levels of statistical power.
We reasoned that if scenario two is true, GO enrichments of eQTL target genes will yield more biologically relevant categories than GO enrichments of the genes that are nearest to GWAS loci, in light of the fact that the nearest gene approach is recognized to be imprecise.27,29 In contrast, under scenario one we expect GO enrichment results of eQTL target genes to be less biologically relevant. Across the GWAS studies we tested, we observed a consistent trend where eQTL target genes have weaker or no enrichment for disease-relevant GO categories when these genes are compared to the nearest ones; this finding is consistent with scenario one (eQTLs implicate non-causal genes) and suggests that eQTL target genes are generally not biologically relevant (example GO enrichments can be found in Figures 5D and 5E; a full list of GO enrichments for all ten GWAS traits can be found in Table S5). We also observed that genes with significant associations in TWASs (which integrate GWAS and eQTL information) have low EDS distributions and are likewise not enriched for disease-relevant GO categories (Figure 5D, Figures S26 and S27 and Table S5).43 Together, these results suggest that assigning causal genes through eQTL overlap often nominates non-causal genes at GWAS loci. To prioritize causal genes, we also considered co-localization of causal variants by using the LocusCompare tool;65 however, very few genes were co-localized for all sets of GWAS loci (e.g., two genes, including one pseudogene, out of 289 loci for obesity, posterior probability > 0.75; 14 genes, including one pseudogene, out of 697 loci for height, Table S5).
To further investigate whether eQTLs implicate the correct causal genes, we manually compiled a list of 400 literature-based putative causal genes at 467 unique GWAS lead SNPs selected from published GWASs (Table S5, Sheets 3 and 4). We note that these are not the set of nearest genes used in Figures 5A–5E. We compiled these loci from four sources. (1) Genes compiled for this study (“original set”) were chosen if the candidate causal gene satisfied one of the following three criteria: (i) the gene is implicated in the Mendelian form of the GWAS trait in OMIM (e.g., chondrodysplasia for height, long-QT syndrome for cardiac QT interval length, or Wolfram syndrome for type II diabetes), (ii) the gene was implicated at the GWAS locus by a focused experimental study (e.g., IRX3 and IRX5 at the rs1421085/FTO obesity locus66), or (iii) the gene is targeted by a therapeutic that treats the disease studied in the GWAS (e.g., HMGCR for cholesterol or DRD2 for neuropsychiatric disease). (2) We used genes compiled for a study of causal genes for metabolite traits (“metabolite GWAS”).44 (3) We used genes compiled from a variety of UK Biobank traits (“UKBB GWAS,” see Table S5). (4) We used metabolite causal genes from expert curation in ProGeM (“ProGeM genes”).45 We further excluded loci where the lead SNP or any SNP in strong linkage disequilibrium (r² > 0.8, CEU cohort from the 1000 Genomes Project) overlapped a protein-coding exon of the candidate causal gene because these loci could act through non-regulatory mechanisms. A list of these genes is provided in Table S5 (Sheets 3 and 4).
At these GWAS lead SNPs, the literature-based causal gene is the top eQTL target (by p value) in only 26% of cases and is one of the targets (FDR < 0.05 in GTEx project, any tissue) in 38% of cases (including the 26% where it is the best hit; Figure 5F). At a further 21% of loci, eQTL information points to the wrong gene: the literature-based causal gene is not an eQTL target, but a different gene at the locus is. In contrast, assigning the nearest gene to each GWAS locus correctly identified the literature-based causal gene at 73% of loci (Table S5). Many of the selected GWAS causal genes are implicated in Mendelian disease and are therefore subject to stronger purifying selection, which might bias these results. We therefore repeated this analysis while considering only putative causal genes not listed in OMIM and observed the same proportional breakdown of loci per group (Figure S28).
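The per-locus breakdown described above can be sketched as a simple classification. The block below is a minimal illustration under stated assumptions: each locus record lists its literature-based causal gene, its nearest gene, and the genes that pass the eQTL significance threshold with their p values. The record layout, category labels, and gene names are hypothetical and are not taken from the study's code or data.

```python
from collections import Counter


def classify_locus(causal_gene, eqtl_targets):
    """Assign a locus to one eQTL category.

    eqtl_targets maps each significant eQTL target gene at the locus to its
    p value (thresholding is assumed to have been applied upstream).
    """
    if not eqtl_targets:
        return "no_eqtl_target"
    top_gene = min(eqtl_targets, key=eqtl_targets.get)  # smallest p value
    if top_gene == causal_gene:
        return "causal_is_top_target"
    if causal_gene in eqtl_targets:
        return "causal_is_other_target"
    return "only_other_genes_targeted"


def summarize(loci):
    """Return the fraction of loci falling into each category."""
    counts = Counter(classify_locus(l["causal"], l["eqtl"]) for l in loci)
    return {category: n / len(loci) for category, n in counts.items()}


def nearest_gene_accuracy(loci):
    """Fraction of loci where the nearest gene is the causal gene."""
    return sum(l["nearest"] == l["causal"] for l in loci) / len(loci)


# Toy loci with placeholder gene names (illustrative only).
toy_loci = [
    {"causal": "GENE_A", "nearest": "GENE_A", "eqtl": {"GENE_A": 1e-12, "GENE_B": 1e-5}},
    {"causal": "GENE_C", "nearest": "GENE_C", "eqtl": {"GENE_C": 1e-4, "GENE_D": 1e-9}},
    {"causal": "GENE_E", "nearest": "GENE_F", "eqtl": {"GENE_F": 1e-8}},
    {"causal": "GENE_G", "nearest": "GENE_G", "eqtl": {}},
]
toy_summary = summarize(toy_loci)
toy_nearest = nearest_gene_accuracy(toy_loci)
```

In this scheme, the share of loci where the causal gene is any eQTL target is the sum of the first two categories, which mirrors how the 38% figure includes the 26% of top hits.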
Finally, we note that literature-based causal genes have significantly higher EDSs and larger enhancer domains than eQTL targets of the 467 GWAS loci (Figure 5G). This relationship also persists when one considers only causal genes that are not implicated in Mendelian disorders (Figures S29 and S30) or that have a low pLI (pLI < 0.2, Figure S31) and are subject to weaker purifying selection. We also individually considered each of the four sources of putative causal genes and found the same direction of effects in each case, suggesting that these effects are consistent across independent methods for assigning putative causal GWAS genes (Figures S29A and S30A). These data further indicate that eQTLs preferentially implicate non-causal, low-EDS gene targets. In summary, our results suggest that the genes most relevant for disease have higher EDSs and are consequently less likely to be implicated by eQTLs.
Discussion
In this study, we show that the size and redundancy of a gene’s regulatory domains are closely related to the gene’s importance in development and disease. We combined multiple metrics reflecting enhancer domain size and redundancy to construct an EDS that is predictive of disease-causing genes and has accuracy comparable to that of commonly used metrics of gene essentiality, including pLI, LOEUF, and RVIS, which were generated from human exome sequencing data. When EDS disagrees with these population-scale metrics of gene constraint, EDS is often more effective at identifying developmental genes and genes associated with human disease, especially in the case of genes with short coding sequences, at current cohort sizes.
We investigated the factors that could contribute to the relationship between EDS and gene pathogenicity and show that a higher EDS is associated with differences in robustness and spatiotemporal patterns of gene expression. Furthermore, high-EDS genes are depleted of eQTLs and are less affected by rare variants in regulatory regions. Because genes with the greatest EDS and largest enhancer domains are enriched for developmentally important genes, they are also under higher purifying selection, which leads to challenges in distinguishing between the effects of purifying selection and regulatory robustness. For example, work from the GTEx Consortium has shown that genes that are intolerant to missense and loss-of-function protein-coding mutations are also depleted of common and rare non-coding variants that disrupt gene expression.67,68 To control for the confounding effects of purifying selection, throughout this study we replicated the EDS enrichment analyses by using different bins of gene pLI and promoter evolutionary conservation (e.g., Figures 1G and 3C and Figure S5) or by specifically considering genes that are not linked to Mendelian diseases (e.g., Figures S28–S30) or that have low pLI (Figure S31). These results suggest that the transcriptional regulatory regions of high-EDS genes are subject to purifying selection and have evolved robustness to regulatory variation.
The computational approaches we used to link enhancers to genes are perceived to have substantial false-positive and false-negative rates.27, 28, 29 Thus, it is striking that the EDS we calculated is effective for predicting genes associated with mammalian development and human disease. One explanation for the success of the EDS is that although any individual predicted enhancer-gene link is unreliable, aggregation of all predicted enhancer-gene links across 127 human tissues results in a robust and informative score. We observed the highest enrichments when we used the computational activity-linking and proximity-linking approaches, and we observed lower enrichments when we used FOCS and JEME, both approaches that also incorporate activity correlation.34,35 This might be due to the fact that FOCS has substantially fewer enhancers assigned per gene than do activity-linking approaches (FOCS, median 990 regulatory base pairs assigned per gene versus 33,800 bp assigned by activity-linking) and to the addition of a supervised learning step that trains on eQTL links in JEME. We also observed that enrichment for genes associated with development and disease is highest when we counted the number of evolutionarily conserved nucleotides, but it also occurred when we considered other metrics that reflect enhancer functionality. For example, such metrics include the number of all enhancer nucleotides, the number of nucleotides in TF binding sites, and the number of discrete enhancer elements that regulate a gene.
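To make the idea of aggregating enhancer-gene links across tissues concrete, the block below sketches one plausible way to count the regulatory base pairs assigned to each gene: pool the linked enhancer intervals from all tissues and merge overlaps so that each base is counted once. This section does not state whether the study merges intervals in exactly this way, so treat the approach, the tuple layout, and the example coordinates as assumptions.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total bases covered by half-open [start, end) intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the current merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def regulatory_bp_per_gene(links):
    """links: iterable of (tissue, gene, chrom, start, end) enhancer-gene links."""
    by_gene_chrom = defaultdict(list)
    for _tissue, gene, chrom, start, end in links:
        by_gene_chrom[(gene, chrom)].append((start, end))
    totals = defaultdict(int)
    for (gene, _chrom), intervals in by_gene_chrom.items():
        totals[gene] += merged_length(intervals)
    return dict(totals)


# Hypothetical links (placeholder genes and coordinates).
toy_links = [
    ("heart", "GENE_A", "chr1", 100, 200),
    ("liver", "GENE_A", "chr1", 150, 250),
    ("liver", "GENE_A", "chr1", 400, 450),
    ("heart", "GENE_B", "chr2", 10, 20),
]
toy_bp = regulatory_bp_per_gene(toy_links)
```

Restricting the input intervals to evolutionarily conserved bases or to TF binding sites before merging would give the analogous counts for the other metrics mentioned above.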
Because the majority of computationally predicted enhancer elements do not display activity in vivo, we believe that the success of the EDS when conserved nucleotides are used therefore reflects filtering for high-confidence enhancer elements.69 In addition to the computational predictions, it is notable that we observed significant enrichment for genes associated with development and disease when we used enhancer-promoter predictions from an experimental approach (promoter capture Hi-C) that called regulatory interactions in only 27 human tissues and cell types.36 This further supports the role of the size of the enhancer domain as a predictor of gene pathogenicity and indicates that the enrichments we observed in this study were not due to computational artifacts or other confounders. As high-quality genome-wide predictions of enhancers and enhancer-gene links become available in more tissue types and time points, the accuracy of the EDS and tissue-specific EDS can be improved, and non-conserved nucleotides can also be incorporated into the EDS. Other regulatory elements, including promoters and open-chromatin sites, can also be included alongside enhancers so that a comprehensive regulatory score can be constructed. Additionally, a machine-learning model can be trained to incorporate additional regulatory features, including the number of enhancers, pairwise Jaccard index, and site-frequency spectrum of genetic variation at regulatory regions. As more regulatory predictions become available, future studies can also combine EDS and tissue-specific EDS with metrics of genic constraint to create an integrated score to predict gene pathogenicity and tissue-specific phenotypes.
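The pairwise Jaccard index mentioned above as a possible model feature can be illustrated as follows. This is a minimal sketch that assumes each enhancer is represented by the set of tissues in which it is active; the choice of representation, the averaging over pairs, and the names are hypothetical rather than the study's definition.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(enhancer_tissues):
    """Average Jaccard index over all pairs of a gene's enhancers.

    enhancer_tissues maps an enhancer ID to the tissues where it is active.
    """
    pairs = list(combinations(enhancer_tissues.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical enhancers linked to one gene.
toy_enhancers = {
    "enh1": {"heart", "liver"},
    "enh2": {"heart", "liver"},
    "enh3": {"brain"},
}
toy_redundancy = mean_pairwise_jaccard(toy_enhancers)
```

Under this representation, a higher average indicates that a gene's enhancers tend to be active in the same tissues, which is one way to quantify overlapping activity patterns.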
To illustrate the practical consequences of the relationship between EDS and gene pathogenicity, we focused on the problem of identifying GWAS genes that causally impact complex diseases. We show that candidate causal genes at GWAS loci have a high EDS and are distinct from eQTL targets of the same loci, thus providing a biological framework for understanding why cis-eQTL approaches to gene prioritization have struggled to identify causal genes from GWASs despite widespread applications. We note that under an omnigenic model where most genes in the genome generate association signals, we expect the relationship between EDS and GWAS genes to weaken as more genes are implicated by GWASs. Other recent studies have commented on the perils of using regular eQTL overlaps or TWASs to prioritize causal genes from GWASs.70,71 Our results support a conjecture raised by Hormozdiari et al.,72 who propose that the causal eQTL signals at GWAS loci might be “secondary signals in comparison to the stronger associations found in current eQTL studies.” As eQTL study sample sizes grow and the causal GWAS eQTL interactions are detected, we expect the concurrent discovery of additional non-causal signals that obscure prioritization of the causal gene. Co-localization analyses that aim to identify instances where GWAS and eQTL loci share the same causal genetic variant should reduce the false-positive rate when eQTL data are used for prioritization of causal genes; however, the degree of improvement will depend on the prevalence of gene expression pleiotropy in eQTL data (i.e., whether the GWAS causal variant(s) influence expression of more than one gene). Recent co-localization studies have noted a limited causal variant overlap between GWAS and eQTL signals,73 and our own analyses show that current eQTL sample sizes identify few GWAS loci with co-localized eQTL signals. 
However, as eQTL study sample sizes increase, and more diverse cell types and environmental conditions are profiled, we expect the co-localization approach to be more fruitful for the prioritization of causal genes. Collectively, these results provide new insights into the identification of genes associated with complex diseases, as well as the disruption of gene regulation by regulatory variants.
Declaration of Interests
D.B.G. is a founder of and holds equity in Pairnomix and Praxis, serves as a consultant to AstraZeneca, and has received research support from Janssen, Gilead, Biogen, AstraZeneca, and UCB. X.W. declares no competing interests.
Acknowledgments
We wish to thank Abhishek Sarkar, Hae Kyung Im, and Tuuli Lappalainen for very helpful discussions and Athma Pai, Sarah A. Dugger, Michael Wainberg, Nasa Sinnott-Armstrong, and Andrew S. Allen for helpful comments on the manuscript. This work was supported by funds from the NIH under grant U01 MH105670 and by a genome-sequencing and -analysis grant from Biogen.
Published: February 6, 2020
Footnotes
Supplemental Data can be found online at https://doi.org/10.1016/j.ajhg.2020.01.012.
Contributor Information
Xinchen Wang, Email: xw2553@cumc.columbia.edu.
David B. Goldstein, Email: dg2875@cumc.columbia.edu.
Web Resources
Roadmap Epigenomics Enhancer-Gene Linking, www.biolchem.ucla.edu/labs/ernst/roadmaplinking
BRAVO Database, https://bravo.sph.umich.edu/freeze5/hg38/
DAVID GO Enrichment Tool, https://david.ncifcrf.gov/
Developmental Disorders Genotype-Phenotype Database, https://decipher.sanger.ac.uk/ddd#ddgenes
Roadmap Epigenomics Enhancer Annotations, https://egg2.wustl.edu/roadmap/web_portal
Roadmap Epigenomics Metadata, https://egg2.wustl.edu/roadmap/web_portal/meta.html
Roadmap Epigenomics RNA-Seq Data, http://egg2.wustl.edu/roadmap/data/byDataType/rna/expression
ENCODE TF Motifs, http://compbio.mit.edu/encode-motifs/
ExAC pLI Scores, ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/functional_gene_constraint/fordist_cleaned_exac_r03_march16_z_pli_rec_null_data.txt
GTEx Project, https://www.gtexportal.org/home/documentationPage#staticTextAnalysisMethods
Catalogue of Parent of Origin Effects, http://igc.otago.ac.nz
JEME Enhancer-Promoter Interactions, http://yiplab.cse.cuhk.edu.hk/jeme/
Mouse scRNA-Seq, https://figshare.com/s/865e694ad06d5857db4b
PredictDB Data Repository, https://predictdb.org
GREAT Tool, http://great.stanford.edu/public/html/index.php
RVIS Intolerance Scores, http://genic-intolerance.org/data/GenicIntolerance_v3_12Mar16.txt
UniBind Database, https://unibind.uio.no
Supplemental Data
List of Genes with Enhancer Domain Scores, pLI Scores, and RVIS
List of Enriched GO Categories for Genes with a High pLI Score or RVIS and a Low EDS
List of Enriched GO Categories for Genes with a Low pLI Score or RVIS and a High EDS
List of Tissue Groupings and Matched Groupings between Epigenome Roadmap Enhancers and GTEx Tissues
GWAS GO Enrichment Categories and Assignment of Literature-Based Causal Genes
Sources of Gene Lists Used in Enrichment Analyses
List of Literature-Based Imprinted Genes Removed during Allele-Specific Expression Analysis
References
- 1.Fullwood M.J., Liu M.H., Pan Y.F., Liu J., Xu H., Mohamed Y.B., Orlov Y.L., Velkov S., Ho A., Mei P.H. An oestrogen-receptor-α-bound human chromatin interactome. Nature. 2009;462:58–64. doi: 10.1038/nature08497.
- 2.Hong J.W., Hendrix D.A., Levine M.S. Shadow enhancers as a source of evolutionary novelty. Science. 2008;321:1314. doi: 10.1126/science.1160631.
- 3.Cannavò E., Khoueiry P., Garfield D.A., Geeleher P., Zichner T., Gustafson E.H., Ciglar L., Korbel J.O., Furlong E.E.M. Shadow Enhancers Are Pervasive Features of Developmental Regulatory Networks. Curr. Biol. 2016;26:38–51. doi: 10.1016/j.cub.2015.11.034.
- 4.Perry M.W., Boettiger A.N., Bothma J.P., Levine M. Shadow enhancers foster robustness of Drosophila gastrulation. Curr. Biol. 2010;20:1562–1567. doi: 10.1016/j.cub.2010.07.043.
- 5.Ernst J., Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 2010;28:817–825. doi: 10.1038/nbt.1662.
- 6.Osterwalder M., Barozzi I., Tissières V., Fukuda-Yuzawa Y., Mannion B.J., Afzal S.Y., Lee E.A., Zhu Y., Plajzer-Frick I., Pickle C.S. Enhancer redundancy provides phenotypic robustness in mammalian development. Nature. 2018;554:239–243. doi: 10.1038/nature25461.
- 7.Han X., Chen S., Flynn E., Wu S., Wintner D., Shen Y. Distinct epigenomic patterns are associated with haploinsufficiency and predict risk genes of developmental disorders. Nat. Commun. 2018;9:2138. doi: 10.1038/s41467-018-04552-7.
- 8.Ahituv N., Zhu Y., Visel A., Holt A., Afzal V., Pennacchio L.A., Rubin E.M. Deletion of ultraconserved elements yields viable mice. PLoS Biol. 2007;5:e234. doi: 10.1371/journal.pbio.0050234.
- 9.Dickel D.E., Ypsilanti A.R., Pla R., Zhu Y., Barozzi I., Mannion B.J., Khin Y.S., Fukuda-Yuzawa Y., Plajzer-Frick I., Pickle C.S. Ultraconserved Enhancers Are Required for Normal Development. Cell. 2018;172:491–499.e15. doi: 10.1016/j.cell.2017.12.017.
- 10.Nolte M.J., Wang Y., Deng J.M., Swinton P.G., Wei C., Guindani M., Schwartz R.J., Behringer R.R. Functional analysis of limb transcriptional enhancers in the mouse. Evol. Dev. 2014;16:207–223. doi: 10.1111/ede.12084.
- 11.Pai A.A., Pritchard J.K., Gilad Y. The genetic and mechanistic basis for variation in gene regulation. PLoS Genet. 2015;11:e1004857. doi: 10.1371/journal.pgen.1004857.
- 12.Deplancke B., Alpern D., Gardeux V. The Genetics of Transcription Factor DNA Binding Variation. Cell. 2016;166:538–554. doi: 10.1016/j.cell.2016.07.012.
- 13.Waszak S.M., Delaneau O., Gschwind A.R., Kilpinen H., Raghav S.K., Witwicki R.M., Orioli A., Wiederkehr M., Panousis N.I., Yurovsky A. Population Variation and Genetic Control of Modular Chromatin Architecture in Humans. Cell. 2015;162:1039–1050. doi: 10.1016/j.cell.2015.08.001.
- 14.Wang X., Tucker N.R., Rizki G., Mills R., Krijger P.H., de Wit E., Subramanian V., Bartell E., Nguyen X.-X., Ye J. Discovery and validation of sub-threshold genome-wide association study loci using epigenomic signatures. eLife Sciences. 2016;5:e10557. doi: 10.7554/eLife.10557.
- 15.Maurano M.T., Humbert R., Rynes E., Thurman R.E., Haugen E., Wang H., Reynolds A.P., Sandstrom R., Qu H., Brody J. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794.
- 16.Finucane H.K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.-R., Anttila V., Xu H., Zang C., Farh K., ReproGen Consortium. Schizophrenia Working Group of the Psychiatric Genomics Consortium. RACI Consortium Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 2015;47:1228–1235. doi: 10.1038/ng.3404.
- 17.Cirulli E.T., Lasseigne B.N., Petrovski S., Sapp P.C., Dion P.A., Leblond C.S., Couthouis J., Lu Y.-F., Wang Q., Krueger B.J., FALS Sequencing Consortium Exome sequencing in amyotrophic lateral sclerosis identifies risk genes and pathways. Science. 2015;347:1436–1441. doi: 10.1126/science.aaa3650.
- 18.Allen A.S., Berkovic S.F., Cossette P., Delanty N., Dlugos D., Eichler E.E., Epstein M.P., Glauser T., Goldstein D.B., Han Y., Epi4K Consortium. Epilepsy Phenome/Genome Project De novo mutations in epileptic encephalopathies. Nature. 2013;501:217–221. doi: 10.1038/nature12439.
- 19.de Ligt J., Willemsen M.H., van Bon B.W.M., Kleefstra T., Yntema H.G., Kroes T., Vulto-van Silfhout A.T., Koolen D.A., de Vries P., Gilissen C. Diagnostic exome sequencing in persons with severe intellectual disability. N. Engl. J. Med. 2012;367:1921–1929. doi: 10.1056/NEJMoa1206524.
- 20.Turner T.N., Coe B.P., Dickel D.E., Hoekzema K., Nelson B.J., Zody M.C., Kronenberg Z.N., Hormozdiari F., Raja A., Pennacchio L.A. Genomic Patterns of De Novo Mutation in Simplex Autism. Cell. 2017;171:710–722.e12. doi: 10.1016/j.cell.2017.08.047.
- 21.Short P.J., McRae J.F., Gallone G., Sifrim A., Won H., Geschwind D.H., Wright C.F., Firth H.V., FitzPatrick D.R., Barrett J.C., Hurles M.E. De novo mutations in regulatory elements in neurodevelopmental disorders. Nature. 2018;555:611–616. doi: 10.1038/nature25983.
- 22.Wright C.F., FitzPatrick D.R., Firth H.V. Paediatric genomics: diagnosing rare disease in children. Nat. Rev. Genet. 2018;19:253–268. doi: 10.1038/nrg.2017.116.
- 23.Natarajan P., Peloso G.M., Zekavat S.M., Montasser M., Ganna A., Chaffin M., Khera A.V., Zhao W., Bloom J.M., Engreitz J.M. Deep-coverage whole genome sequences and blood lipids among 16,324 individuals. Nat. Commun. 2018;9:3391. doi: 10.1038/s41467-018-05747-8.
- 24.Shlyueva D., Stampfel G., Stark A. Transcriptional enhancers: from properties to genome-wide predictions. Nat. Rev. Genet. 2014;15:272–286. doi: 10.1038/nrg3682.
- 25.Pennacchio L.A., Bickmore W., Dean A., Nobrega M.A., Bejerano G. Enhancers: five essential questions. Nat. Rev. Genet. 2013;14:288–295. doi: 10.1038/nrg3458.
- 26.Fulco C.P., Munschauer M., Anyoha R., Munson G., Grossman S.R., Perez E.M., Kane M., Cleary B., Lander E.S., Engreitz J.M. Systematic mapping of functional enhancer-promoter connections with CRISPR interference. Science. 2016;354:769–773. doi: 10.1126/science.aag2445.
- 27.Jin F., Li Y., Dixon J.R., Selvaraj S., Ye Z., Lee A.Y., Yen C.-A., Schmitt A.D., Espinoza C.A., Ren B. A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature. 2013;503:290–294. doi: 10.1038/nature12644.
- 28.Javierre B.M., Burren O.S., Wilder S.P., Kreuzhuber R., Hill S.M., Sewitz S., Cairns J., Wingett S.W., Várnai C., Thiecke M.J., BLUEPRINT Consortium Lineage-Specific Genome Architecture Links Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell. 2016;167:1369–1384.e19. doi: 10.1016/j.cell.2016.09.037.
- 29.Whitaker J.W., Nguyen T.T., Zhu Y., Wildberg A., Wang W. Computational schemes for the prediction and annotation of enhancers from epigenomic assays. Methods. 2015;72:86–94. doi: 10.1016/j.ymeth.2014.10.008.
- 30.Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M.J., Roadmap Epigenomics Consortium Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
- 31.Liu Y., Sarkar A., Kheradpour P., Ernst J., Kellis M. Evidence of reduced recombination rate in human regulatory domains. Genome Biol. 2017;18:193. doi: 10.1186/s13059-017-1308-x.
- 32.Ernst J., Kheradpour P., Mikkelsen T.S., Shoresh N., Ward L.D., Epstein C.B., Zhang X., Wang L., Issner R., Coyne M. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473:43–49. doi: 10.1038/nature09906.
- 33.McLean C.Y., Bristor D., Hiller M., Clarke S.L., Schaar B.T., Lowe C.B., Wenger A.M., Bejerano G. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 2010;28:495–501. doi: 10.1038/nbt.1630.
- 34.Cao Q., Anyansi C., Hu X., Xu L., Xiong L., Tang W., Mok M.T.S., Cheng C., Fan X., Gerstein M. Reconstruction of enhancer-target networks in 935 samples of human primary cells, tissues and cell lines. Nat. Genet. 2017;49:1428–1436. doi: 10.1038/ng.3950.
- 35.Hait T.A., Amar D., Shamir R., Elkon R. FOCS: a novel method for analyzing enhancer and gene activity patterns infers an extensive enhancer-promoter map. Genome Biol. 2018;19:56. doi: 10.1186/s13059-018-1432-2.
- 36.Jung I., Schmitt A., Diao Y., Lee A.J., Liu T., Yang D., Tan C., Eom J., Chan M., Chee S. A compendium of promoter-centered long-range chromatin interactions in the human genome. Nat. Genet. 2019;51:1442–1449. doi: 10.1038/s41588-019-0494-8.
- 37.Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
- 38.Gheorghe M., Sandve G.K., Khan A., Cheneby J., Ballester B., Mathelier A. A map of direct TF-DNA interactions in the human genome. Nucleic Acids Res. 2019;47:e21. doi: 10.1093/nar/gky1210.
- 39.Knowles D.A., Burrows C.K., Blischak J.D., Patterson K.M., Serie D.J., Norton N., Ober C., Pritchard J.K., Gilad Y. Determining the genetic basis of anthracycline-cardiotoxicity by molecular response QTL mapping in induced cardiomyocytes. eLife. 2018;7 doi: 10.7554/eLife.33480.
- 40.Schwartzentruber J., Foskolou S., Kilpinen H., Rodrigues J., Alasoo K., Knights A.J., Patel M., Goncalves A., Ferreira R., Benn C.L., HIPSCI Consortium Molecular and functional variation in iPSC-derived sensory neurons. Nat. Genet. 2018;50:54–61. doi: 10.1038/s41588-017-0005-8.
- 41.Baran Y., Subramaniam M., Biton A., Tukiainen T., Tsang E.K., Rivas M.A., Pirinen M., Gutierrez-Arcelus M., Smith K.S., Kukurba K.R., GTEx Consortium The landscape of genomic imprinting across diverse adult human tissues. Genome Res. 2015;25:927–936. doi: 10.1101/gr.192278.115.
- 42.Hindorff L.A., Sethupathy P., Junkins H.A., Ramos E.M., Mehta J.P., Collins F.S., Manolio T.A. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA. 2009;106:9362–9367. doi: 10.1073/pnas.0903103106.
- 43.Mancuso N., Shi H., Goddard P., Kichaev G., Gusev A., Pasaniuc B. Integrating Gene Expression with Summary Association Statistics to Identify Genes Associated with 30 Complex Traits. Am. J. Hum. Genet. 2017;100:473–487. doi: 10.1016/j.ajhg.2017.01.031.
- 44.Ndungu, A., Payne, A., Torres, J., van de Bunt, M., and McCarthy, M.I. A multi-tissue transcriptome analysis of human metabolites guides the interpretability of associations based on multi-SNP models for gene expression. bioRxiv.org, doi: 10.1101/773630.
- 45.Stacey D., Fauman E.B., Ziemek D., Sun B.B., Harshfield E.L., Wood A.M., Butterworth A.S., Suhre K., Paul D.S. ProGeM: a framework for the prioritization of candidate causal genes at molecular quantitative trait loci. Nucleic Acids Res. 2019;47:e3. doi: 10.1093/nar/gky837.
- 46.Wright C.F., Fitzgerald T.W., Jones W.D., Clayton S., McRae J.F., van Kogelenberg M., King D.A., Ambridge K., Barrett D.M., Bayzetinova T., DDD study Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet. 2015;385:1305–1314. doi: 10.1016/S0140-6736(14)61705-0.
- 47.Lek M., Karczewski K.J., Minikel E.V., Samocha K.E., Banks E., Fennell T., O’Donnell-Luria A.H., Ware J.S., Hill A.J., Cummings B.B., Exome Aggregation Consortium Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
- 48.Bruneau B.G. The developmental genetics of congenital heart disease. Nature. 2008;451:943–948. doi: 10.1038/nature06801.
- 49.Zhao X., Sun M., Zhao J., Leyva J.A., Zhu H., Yang W., Zeng X., Ao Y., Liu Q., Liu G. Mutations in HOXD13 underlie syndactyly type V and a novel brachydactyly-syndactyly syndrome. Am. J. Hum. Genet. 2007;80:361–371. doi: 10.1086/511387.
- 50.Spielmann M., Brancati F., Krawitz P.M., Robinson P.N., Ibrahim D.M., Franke M., Hecht J., Lohan S., Dathe K., Nardone A.M. Homeotic arm-to-leg transformation associated with genomic rearrangements at the PITX1 locus. Am. J. Hum. Genet. 2012;91:629–635. doi: 10.1016/j.ajhg.2012.08.014.
- 51.GTEx Consortium Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–660. doi: 10.1126/science.1262110.
- 52.Wheeler H.E., Shah K.P., Brenner J., Garcia T., Aquino-Michaels K., Cox N.J., Nicolae D.L., Im H.K., GTEx Consortium Survey of the Heritability and Sparse Architecture of Gene Expression Traits across Human Tissues. PLoS Genet. 2016;12:e1006423. doi: 10.1371/journal.pgen.1006423.
- 53.Warren H.R., Evangelou E., Cabrera C.P., Gao H., Ren M., Mifsud B., Ntalla I., Surendran P., Liu C., Cook J.P., International Consortium of Blood Pressure (ICBP) 1000G Analyses. BIOS Consortium. Lifelines Cohort Study. Understanding Society Scientific group. CHD Exome+ Consortium. ExomeBP Consortium. T2D-GENES Consortium. GoT2DGenes Consortium. Cohorts for Heart and Ageing Research in Genome Epidemiology (CHARGE) BP Exome Consortium. International Genomics of Blood Pressure (iGEN-BP) Consortium. UK Biobank CardioMetabolic Consortium BP working group Genome-wide association analysis identifies novel blood pressure loci and offers biological insights into cardiovascular risk. Nat. Genet. 2017;49:403–415.
- 54.Eppinga R.N., Hagemeijer Y., Burgess S., Hinds D.A., Stefansson K., Gudbjartsson D.F., van Veldhuisen D.J., Munroe P.B., Verweij N., van der Harst P. Identification of genomic loci associated with resting heart rate and shared genetic predictors with all-cause mortality. Nat. Genet. 2016;48:1557–1563. doi: 10.1038/ng.3708.
- 55.Nielsen J.B., Thorolfsdottir R.B., Fritsche L.G., Zhou W., Skov M.W., Graham S.E., Herron T.J., McCarthy S., Schmidt E.M., Sveinbjornsson G. Biobank-driven genomic discovery yields new insight into atrial fibrillation biology. Nat. Genet. 2018;50:1234–1239. doi: 10.1038/s41588-018-0171-3.
- 56.The LifeLines Cohort Study, The ADIPOGen Consortium, The AGEN-BMI Working Group, The CARDIOGRAMplusC4D Consortium, The CKDGen Consortium, The GLGC, The ICBP, The MAGIC Investigators, The MuTHER Consortium, The MIGen Consortium Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518:197–206. doi: 10.1038/nature14177.
- 57.Zhao W., Rasheed A., Tikkanen E., Lee J.-J., Butterworth A.S., Howson J.M.M., Assimes T.L., Chowdhury R., Orho-Melander M., Damrauer S., CHD Exome+ Consortium. EPIC-CVD Consortium. EPIC-Interact Consortium. Michigan Biobank Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. Nat. Genet. 2017;49:1450–1457. doi: 10.1038/ng.3943.
- 58.Li Z., Chen J., Yu H., He L., Xu Y., Zhang D., Yi Q., Li C., Li X., Shen J. Genome-wide association analysis identifies 30 new susceptibility loci for schizophrenia. Nat. Genet. 2017;49:1576–1583. doi: 10.1038/ng.3973.
- 59.de Lange K.M., Moutsianas L., Lee J.C., Lamb C.A., Luo Y., Kennedy N.A., Jostins L., Rice D.L., Gutierrez-Achury J., Ji S.-G. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat. Genet. 2017;49:256–261. doi: 10.1038/ng.3760.
- 60.Day F.R., Thompson D.J., Helgason H., Chasman D.I., Finucane H., Sulem P., Ruth K.S., Whalen S., Sarkar A.K., Albrecht E., LifeLines Cohort Study. InterAct Consortium. kConFab/AOCS Investigators. Endometrial Cancer Association Consortium. Ovarian Cancer Association Consortium. PRACTICAL consortium Genomic analyses identify hundreds of variants associated with age at menarche and support a role for puberty timing in cancer risk. Nat. Genet. 2017;49:834–841. doi: 10.1038/ng.3841.
- 61.Wood A.R., Esko T., Yang J., Vedantam S., Pers T.H., Gustafsson S., Chu A.Y., Estrada K., Luan J., Kutalik Z., Electronic Medical Records and Genomics (eMERGE) Consortium. MIGen Consortium. PAGE Consortium. LifeLines Cohort Study Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 2014;46:1173–1186. doi: 10.1038/ng.3097. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Willer C.J., Schmidt E.M., Sengupta S., Peloso G.M., Gustafsson S., Kanoni S., Ganna A., Chen J., Buchkovich M.L., Mora S., Global Lipids Genetics Consortium Discovery and refinement of loci associated with lipid levels. Nat. Genet. 2013;45:1274–1283. doi: 10.1038/ng.2797. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Gusev A., Ko A., Shi H., Bhatia G., Chung W., Penninx B.W.J.H., Jansen R., de Geus E.J.C., Boomsma D.I., Wright F.A. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 2016;48:245–252. doi: 10.1038/ng.3506. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Mancuso N., Gayther S., Gusev A., Zheng W., Penney K.L., Kote-Jarai Z., Eeles R., Freedman M., Haiman C., Pasaniuc B. Large-scale transcriptome-wide association study identifies new prostate cancer risk regions. Nat. Commun. 2018;9:4079. doi: 10.1038/s41467-018-06302-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Liu B., Gloudemans M.J., Rao A.S., Ingelsson E., Montgomery S.B. Abundant associations with gene expression complicate GWAS follow-up. Nat. Genet. 2019;51:768–769. doi: 10.1038/s41588-019-0404-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Claussnitzer M., Dankel S.N., Kim K.-H., Quon G., Meuleman W., Haugen C., Glunk V., Sousa I.S., Beaudry J.L., Puviindran V. FTO Obesity Variant Circuitry and Adipocyte Browning in Humans. N. Engl. J. Med. 2015;373:895–907. doi: 10.1056/NEJMoa1502214. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Li X., Kim Y., Tsang E.K., Davis J.R., Damani F.N., Chiang C., Hess G.T., Zappala Z., Strober B.J., Scott A.J., GTEx Consortium. Laboratory, Data Analysis & Coordinating Center (LDACC)—Analysis Working Group. Statistical Methods groups—Analysis Working Group. Enhancing GTEx (eGTEx) groups. NIH Common Fund. NIH/NCI. NIH/NHGRI. NIH/NIMH. NIH/NIDA. Biospecimen Collection Source Site—NDRI. Biospecimen Collection Source Site—RPCI. Biospecimen Core Resource—VARI. Brain Bank Repository—University of Miami Brain Endowment Bank. Leidos Biomedical—Project Management. ELSI Study. Genome Browser Data Integration & Visualization—EBI. Genome Browser Data Integration & Visualization—UCSC Genomics Institute, University of California Santa Cruz The impact of rare variation on gene expression across tissues. Nature. 2017;550:239–243. doi: 10.1038/nature24267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Battle A., Brown C.D., Engelhardt B.E., Montgomery S.B., GTEx Consortium. Laboratory, Data Analysis & Coordinating Center (LDACC)—Analysis Working Group. Statistical Methods groups—Analysis Working Group. Enhancing GTEx (eGTEx) groups. NIH Common Fund. NIH/NCI. NIH/NHGRI. NIH/NIMH. NIH/NIDA. Biospecimen Collection Source Site—NDRI. Biospecimen Collection Source Site—RPCI. Biospecimen Core Resource—VARI. Brain Bank Repository—University of Miami Brain Endowment Bank. Leidos Biomedical—Project Management. ELSI Study. Genome Browser Data Integration & Visualization—EBI. Genome Browser Data Integration & Visualization—UCSC Genomics Institute, University of California Santa Cruz. Lead analysts. Laboratory, Data Analysis & Coordinating Center (LDACC) NIH program management. Biospecimen collection. Pathology. eQTL manuscript working group Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. [Google Scholar]
- 69.Visel A., Minovitsky S., Dubchak I., Pennacchio L.A. VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Res. 2007;35:D88–D92. doi: 10.1093/nar/gkl822. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Nica A.C., Montgomery S.B., Dimas A.S., Stranger B.E., Beazley C., Barroso I., Dermitzakis E.T. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS Genet. 2010;6:e1000895. doi: 10.1371/journal.pgen.1000895. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Wainberg M., Sinnott-Armstrong N., Knowles D., Golan D., Ermel R., Ruusalepp A., Quertermous T., Hao K., Björkegren J.L.M., Rivas M.A. Transcriptome-wide association studies: opportunities and challenges. Nat. Genet. 2019;51:592–599. doi: 10.1038/s41588-019-0385-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Hormozdiari F., van de Bunt M., Segrè A.V., Li X., Joo J.W.J., Bilow M., Sul J.H., Sankararaman S., Pasaniuc B., Eskin E. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am. J. Hum. Genet. 2016;99:1245–1260. doi: 10.1016/j.ajhg.2016.10.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Chun S., Casparino A., Patsopoulos N.A., Croteau-Chonka D.C., Raby B.A., De Jager P.L., Sunyaev S.R., Cotsapas C. Limited statistical evidence for shared genetic effects of eQTLs and autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet. 2017;49:600–605. doi: 10.1038/ng.3795. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
Supplementary Materials
List of Genes with Enhancer Domain Scores, pLI Scores, and RVIS
List of Enriched GO Categories for Genes with a High pLI Score or RVIS and a Low EDS
List of Enriched GO Categories for Genes with a Low pLI Score or RVIS and a High EDS
List of Tissue Groupings and Matched Groupings between Epigenome Roadmap Enhancers and GTEx Tissues
GWAS GO Enrichment Categories and Assignment of Literature-Based Causal Genes
Sources of Gene Lists Used in Enrichment Analyses
List of Literature-Based Imprinted Genes Removed during Allele-Specific Expression Analysis





