Summary
Unknown SNP-to-gene regulatory architecture complicates efforts to link noncoding GWAS associations with genes implicated by sequencing or functional studies. eQTLs are often used to link SNPs to genes, but expression in bulk tissue explains a small fraction of disease heritability. A simple but successful approach has been to link SNPs with nearby genes via base pair windows, but genes may often be regulated by SNPs outside their window. We propose the abstract mediation model (AMM) to estimate (1) the fraction of heritability mediated by the closest or kth-closest gene to each SNP and (2) the mediated heritability enrichment of a gene set (e.g., genes with rare-variant associations). AMM jointly estimates these quantities by matching the decay in SNP enrichment with distance from genes in the gene set. Across 47 complex traits and diseases, we estimate that the closest gene to each SNP mediates 27% (SE: 6%) of heritability and that a substantial fraction is mediated by genes outside the ten closest. Mendelian disease genes are strongly enriched for common-variant heritability; for example, just 21 dyslipidemia genes mediate 25% of LDL heritability (211× enrichment, p = 0.01). Among brain-related traits, genes involved in neurodevelopmental disorders are only about 4× enriched, but gene expression patterns are highly informative, as they have detectable differences in per-gene heritability even among weakly brain-expressed genes.
Keywords: GWAS, heritability, SNP-to-gene architecture, eQTLs
Introduction
A common challenge in genome-wide association studies (GWASs) is mapping disease-associated variants to the target genes that mediate their effects. Associated loci often contain multiple genes,1,2 fine-mapped causal variants are mostly noncoding,3,4 and functional data, including from expression quantitative trait loci (eQTLs), are often inconclusive.5, 6, 7
Conversely, it is often unknown which SNPs regulate putative disease genes. Disease genes with large-effect rare variants sometimes localize near common-variant associations from GWASs,2,8, 9, 10 suggesting possible mediation, but unknown SNP-to-gene architecture complicates mediation analyses. Window-based enrichment approaches (such as LDSC-SEG11 and MAGMA12) can be used to show that a set of genes is enriched for nearby SNP associations.11, 12, 13, 14, 15, 16 However, gene-set-proximal SNPs may not actually regulate their proximal gene,17 and gene-set-proximal enrichments are attenuated accordingly.
A different approach is to use eQTL data because eQTLs can identify the set of SNPs regulating expression of a nearby gene.18, 19, 20, 21, 22, 23 However, eQTL data may fail to recapitulate disease-relevant cell types and cellular states, which are often unknown. While measured expression from bulk tissue is more readily available, a recent study estimated that the fraction of heritability mediated in cis by tissue-level gene expression is only 11%.17 Moreover, colocalization analyses suggest that eQTLs and nearby GWAS loci usually arise from different causal SNPs.24
We propose the abstract mediation model (AMM) for the relationship between SNPs, genes, and a disease or complex trait. Under the model, the effect of a disease-associated SNP is mediated by one or more nearby genes (SNP → gene → trait),
where the genes in this model are abstractions for whatever biological variables actually have a causal effect on disease risk (for example, expression levels in a certain cell type). If multiple gene-associated variables can potentially affect a trait (for example, its expression levels in different cell types), the abstract gene in this model is the union of those variables. Unlike in eQTL studies, these variables are not observed and neither are the SNP-to-gene or gene-to-disease effect sizes. Instead, we observe proxies for both effects: the proximity of SNPs to genes and an annotation for each gene, such as its membership in a disease-relevant gene set (Figure 1A). This limited information is sufficient to partition mediated heritability across genes. In particular, it allows us to estimate (1) the fraction of heritability mediated by the kth-closest gene to each SNP and (2) the proportion of heritability mediated by a specified gene set.
Figure 1.
The abstract mediation model
(A) Under AMM, the gene-mediated effect of a SNP on a trait is equal to the product of the SNP-to-gene and gene-to-trait effects. Neither of these quantities is observed directly, but we do observe proxies for each: SNP-to-gene proximity and gene set membership, respectively.
(B) AMM partitions mediated heritability across SNP-gene pairs, each an entry in the SNPs-by-genes matrix. The heritability mediated by a gene set is a row sum (dark blue horizontal rectangles), and similarly, the heritability mediated by the closest gene to each SNP is the sum across SNP-closest-gene pairs (blue diagonal cells).
(C) Example of mediated heritability enrichment under AMM. The orange arrows represent SNP-to-gene effect sizes, p(k); larger values of p(k) are denoted with arrow thickness. The red arrows represent gene-to-trait effect sizes. The gray arrows represent SNP-to-trait effect sizes. SNP2 has a large effect on trait Y (thick gray arrow) because its closest gene (Gene2) is a member of an enriched gene set and has a large effect on Y (thick red arrow).
Material and methods
Abstract mediation model: Intuition
Across the genome, some fraction of heritability, p(1), is mediated by the closest gene to each SNP. Suppose that every disease-associated SNP regulated its closest gene, such that 100% of heritability was mediated by the closest gene to each SNP (p(1) = 1). Under this model, the proportion of heritability mediated by a set of genes would be equal to the heritability explained by the SNPs whose closest gene is in that gene set.
More generally, suppose that we knew the proportion of heritability mediated by the closest, second-closest, and kth-closest gene. Even without knowing the specific disease gene(s) for individual disease-associated SNPs, we could estimate the proportion of heritability mediated by a set of genes: the heritability of SNPs whose kth-closest gene is in the gene set, multiplied by p(k), summed across k. That is, if the SNP-to-gene architecture were known, it would allow us to estimate the proportion of heritability mediated by a set of genes.
Conversely, suppose that p(k) were unknown but that we did know a set of genes enriched for heritability (like constrained genes; see below). The set of SNPs whose kth-closest gene is in the gene set would be enriched for heritability, and this enrichment would be proportional to p(k). For example, if SNPs are three times as likely to affect their closest gene compared with their second-closest gene, then SNPs adjacent to a known disease gene will have triple the heritability enrichment of those one gene away. By comparing the enrichments of gene-set-proximal SNPs with that of the gene set itself, we would obtain estimates of p(k).
Definition of the abstract mediation model
The abstract mediation model (AMM) partitions heritability across SNP-gene pairs. The genes in the model are abstractions for whatever genic variables actually modulate disease risk (mRNA levels, enzyme activity, etc.), hence “abstract”; the heritability assigned to a set of SNP-gene pairs is referred to as “mediated heritability.” It is helpful to think of mediated heritability in terms of a SNP-to-gene effect size and a gene-to-trait effect size (Figure 1A), although these quantities do not need to be explicitly defined (see below). The SNP-to-gene effect-size variance depends on whether that gene is the closest, second-closest, or kth-closest gene to that SNP; the gene-to-trait effect-size variance depends on whether the gene is in an enriched gene set. The heritability mediated by a gene set is the sum of the mediated heritability across all SNP-gene pairs where the gene is in that set, and similarly, the heritability mediated by the closest gene to each SNP is the sum across SNP-closest-gene pairs (Figure 1B). We assume that all heritability is mediated by some gene in cis to each SNP (assumptions). We define genes in cis to a SNP as those closer than K genes away; unless otherwise specified we use K = 50. AMM does not make any assumptions about the number of genes that are affected by each SNP.
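As a rough illustration of the proximity ranking described above, the following sketch builds the indicator annotations for the K closest genes on a single chromosome. The function name and toy coordinates are hypothetical, and using gene midpoints as gene locations follows the convention stated later in the methods; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def proximity_annotations(snp_pos, gene_mid, in_set, k_max):
    """Return an (n_snps, k_max) 0/1 matrix; column k-1 flags SNPs whose
    kth-closest gene (by midpoint distance) is in the gene set."""
    snp_pos = np.asarray(snp_pos, dtype=float)
    gene_mid = np.asarray(gene_mid, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    # Distance from every SNP to every gene midpoint (single chromosome assumed).
    dist = np.abs(snp_pos[:, None] - gene_mid[None, :])
    # Indices of the k_max nearest genes for each SNP, closest first.
    ranked = np.argsort(dist, axis=1, kind="stable")[:, :k_max]
    return in_set[ranked].astype(int)

# Toy example: two SNPs, three genes, genes 0 and 2 are in the set.
annot = proximity_annotations([100, 900], [0, 150, 1000], [True, False, True], k_max=2)
```

In the toy example, the closest gene to the first SNP is outside the set and its second-closest gene is inside it, so its row is `[0, 1]`.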
In detail, AMM jointly estimates the SNP-to-gene architecture (p(k)) and the mediated heritability enrichment of a gene set. We first train AMM p(k) estimates on genes intolerant of heterozygous disruption (constrained genes), a large gene set enriched for common variant heritability across a range of traits.25 Specifically, the fraction of heritability mediated by the kth-closest gene, p(k), is equal to the heritability of SNPs whose kth-closest gene is constrained, divided by the heritability of all SNPs with a cis constrained gene, conditioned on SNP-level functional annotations in the baselineLD model (assumptions).26 We assume that p(k) is the same for genes inside and outside of the gene set (assumptions). This estimated SNP-to-gene architecture is used to estimate the heritability enrichment of other gene sets, some of which are much smaller. The estimates of heritability enrichment of other gene sets are robust to the use of constrained genes to estimate p(k) (Figure S1). This approach allows for estimation of SNP-to-gene architecture without incorporating eQTL data.
If the gene set is enriched for heritability, then SNPs whose closest genes are in the gene set are more likely to be associated with the trait (Figure 1C). For a SNP i whose kth-ranked gene is in an enriched gene set A, its expected heritability is

$$E\big[h_i^2\big] = \tau(0) + \tau(A) \sum_{k=1}^{K} p(k)\, a_i^{(k)},$$

where τ(0) is the expected heritability of a SNP with no cis genes in annotation A, τ(A) is the additional per-SNP heritability mediated by genes in A, p(k) is the proportion of cis-mediated heritability explained by the kth-closest gene for each SNP, and $a_i^{(k)}$ indicates whether the kth-closest gene to SNP i is in A.
This expression leads to a stratified linkage disequilibrium (LD) score regression equation26 (see estimation details):

$$E\big[\chi_i^2\big] = N\,\tau(A) \sum_{k=1}^{K} p(k)\, \ell\big(i, a^{(k)}\big) + N\,\tau(0)\, \ell_i + 1,$$

where N is the GWAS sample size, $\ell(i, a^{(k)})$ is the stratified LD score for SNP i to the annotation a(k), where a(k) is the SNP-length vector denoting whether the kth-closest gene to each SNP is in A, and $\ell_i$ is the unstratified LD score of SNP i.
The above regression can be used to estimate the heritability enrichment of gene set A (see derivation of AMM estimation equation):

$$e(A) = \frac{\tau(0) + \tau(A)}{\tau(0) + \tau(A)\, N(A)/N},$$

where N(A)/N is the fraction of genes in gene set A (here N counts genes, not GWAS samples). If the gene set A is unenriched, then τ(A) = 0 and e(A) = 1. τ(A) can also be negative, in which case A is depleted. Finally, we define the fraction of heritability mediated by a gene set as the heritability enrichment times the fraction of genes in A:

$$\text{fraction of heritability mediated by } A = e(A)\, \frac{N(A)}{N}.$$
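The two definitions above can be sketched numerically. The snippet assumes the enrichment takes the form e(A) = (τ(0) + τ(A)) / (τ(0) + τ(A)·N(A)/N), which is our reading of the surrounding definitions; the function names and toy values are hypothetical.

```python
def mediated_enrichment(tau0, tau_a, gene_fraction):
    """e(A): per-gene mediated heritability in A relative to the average gene.
    Assumed form: (tau0 + tau_a) / (tau0 + tau_a * N(A)/N)."""
    return (tau0 + tau_a) / (tau0 + tau_a * gene_fraction)

def fraction_mediated(tau0, tau_a, gene_fraction):
    """Fraction of heritability mediated by A: enrichment times gene fraction."""
    return mediated_enrichment(tau0, tau_a, gene_fraction) * gene_fraction

e_null = mediated_enrichment(1.0, 0.0, 0.1)      # unenriched set: tau(A) = 0
e_enriched = mediated_enrichment(0.5, 1.0, 0.1)  # toy enriched set
frac_enriched = fraction_mediated(0.5, 1.0, 0.1)
```

With τ(A) = 0 the enrichment is exactly 1, matching the unenriched case described in the text; in the toy enriched case, 10% of genes mediate 25% of heritability.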
Derivation of AMM estimation equation
The abstract mediation model decomposes the effect size of each SNP as a sum across the cis genes that mediate its effects. Let $g_j^{(1)}, g_j^{(2)}, \ldots, g_j^{(K)}$ be the closest, second-closest, and Kth-closest genes to SNP j, and let $\beta_j(g)$ denote the effect size of SNP j mediated by gene g. Under the “all-heritability-mediated” assumption (see below), the total effect size of each SNP j can be decomposed across its K nearest genes:

$$\beta_j = \sum_{k=1}^{K} \beta_j\big(g_j^{(k)}\big). \qquad \text{(Equation 1)}$$

Similarly, the heritability of SNP j can be decomposed as

$$h_j^2 = \sum_{k=1}^{K} h_j^2\big(g_j^{(k)}\big), \qquad \text{(Equation 2)}$$

where $h_j^2 = \beta_j^2$.
The heritability of SNP j mediated by gene g is defined as $h_j^2(g) = \beta_j(g)\beta_j$. This differs from $\beta_j(g)^2$, the squared effect size of SNP j mediated through gene g, for SNPs that affect multiple causal genes. Even for such a SNP, what matters is the expected value of $h_j^2(g)$, and this will still be equal to the expected value of $\beta_j(g)^2$ as long as there is no tendency for coregulated causal genes to have causal effects in the same (or opposite) direction. That is, unless causal genes that are coregulated in cis tend to have coordinated directions of effect, $h_j^2(g)$ can be interpreted as the effect-size variance of SNP j acting through gene g.
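The argument in this paragraph can be written out in two lines. This is only a restatement under the definitions already given (the total effect of SNP j is the sum of its gene-mediated effects over cis genes), not an additional result:

$$h_j^2(g) = \beta_j(g)\,\beta_j = \beta_j(g)^2 + \sum_{g' \neq g} \beta_j(g)\,\beta_j(g'),$$

$$E\big[h_j^2(g)\big] = E\big[\beta_j(g)^2\big] + \sum_{g' \neq g} E\big[\beta_j(g)\,\beta_j(g')\big] = E\big[\beta_j(g)^2\big] \quad \text{when every cross term has expectation zero.}$$

The cross terms vanish in expectation exactly when coregulated causal genes have no systematic tendency toward concordant or discordant directions of effect.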
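Let $a_j^{(1)}, \ldots, a_j^{(K)}$ denote whether genes $g_j^{(1)}, \ldots, g_j^{(K)}$ (the closest through Kth-closest genes to SNP j) are in a gene set A, respectively. Suppose that $h_j^2(g_j^{(1)}), \ldots, h_j^2(g_j^{(K)})$ follow a joint distribution that depends on $a_j^{(1)}, \ldots, a_j^{(K)}$, satisfying two assumptions (see below).
(1) No spurious enrichment: $E[h_j^2(g_j^{(k)})]$ only depends on $a_j^{(k)}$, such that $E[h_j^2(g_j^{(k)}) \mid a_j^{(1)}, \ldots, a_j^{(K)}] = E[h_j^2(g_j^{(k)}) \mid a_j^{(k)}]$.
(2) No special relationship: $E[h_j^2(g_j^{(k)}) \mid a_j^{(k)}]$ factors into a function of k and a function of $a_j^{(k)}$.
Under these assumptions, it follows that $E[h_j^2(g_j^{(k)}) \mid a_j^{(1)}, \ldots, a_j^{(K)}]$ factors into a function of k and a function of $a_j^{(k)}$. The function of k is denoted p(k), and it is normalized so that $\sum_{k=1}^{K} p(k) = 1$. The function of $a_j^{(k)}$ is denoted $f(a_j^{(k)})$, such that

$$E\big[h_j^2(g_j^{(k)}) \mid a_j^{(1)}, \ldots, a_j^{(K)}\big] = p(k)\, f\big(a_j^{(k)}\big), \qquad \text{(Equation 3)}$$

and we define

$$\tau(0) = f(0), \qquad \tau(A) = f(1) - f(0),$$

such that

$$E\big[h_j^2(g_j^{(k)}) \mid a_j^{(1)}, \ldots, a_j^{(K)}\big] = p(k)\,\big[\tau(0) + \tau(A)\, a_j^{(k)}\big]. \qquad \text{(Equation 4)}$$

τ(0) is interpreted as the heritability of a SNP with no nearby genes in A. The additional heritability explained by a SNP whose kth-closest gene is in A is equal to τ(A) multiplied by p(k).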
Substituting Equation 4 into Equation 2, we obtain the estimation equation:

$$E\big[h_j^2 \mid a_j^{(1)}, \ldots, a_j^{(K)}\big] = \tau(0) + \tau(A) \sum_{k=1}^{K} p(k)\, a_j^{(k)}. \qquad \text{(Equation 5)}$$

We define the mediated heritability enrichment of a gene set in terms of τ(0) and τ(A). If N(A)/N is the fraction of genes in A, then the mediated heritability enrichment of A, e(A), is given by

$$e(A) = \frac{\tau(0) + \tau(A)}{\tau(0) + \tau(A)\, N(A)/N}, \qquad \text{(Equation 6)}$$

and the total proportion of heritability mediated by genes in A is defined as e(A)(N(A)/N). We note that this definition avoids assigning additional heritability to a gene set just because it has a larger number of nearby SNPs; we would view this source of enrichment as being spurious.
Assumptions
Estimation requires three primary assumptions.
The first assumption is the “no-spurious-enrichment” assumption. We assume that the heritability enrichment of SNPs mapped to genes in A is actually mediated by those genes. This assumption is violated if genes in A lie in disease-relevant regions of the genome without being disease relevant themselves. It could also be violated if A were chosen on the basis of the GWAS itself: for example, suppose A is the set of genes that are the closest gene to a GWAS lead SNP. Even if none of these genes are actually the mediating gene, the set of SNPs whose closest gene is in A would be highly enriched for heritability, and we would wrongly infer large values for e(A) and p(1). We do not apply AMM to gene sets that were constructed from GWAS data.
This assumption also motivates us to condition on potential confounders in the baselineLD model.26,27 Some annotations may be enriched near genes in the gene set; for example, brain-expressed genes tend to be long, and their nearby SNPs may be enriched for the exonic and intronic annotations. It has previously been observed that conditioning on genomic annotations can explain certain gene-set enrichments.28
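The second assumption is the “no-special-relationship” assumption. We assume that cis-regulatory architecture, with respect to the cis-gene ranking, is the same for genes in A and genes not in A. More precisely, we assume that:

$$p\big(k \mid a^{(k)} = 1\big) = p\big(k \mid a^{(k)} = 0\big) = p(k), \qquad \text{(Equation 7)}$$

where $p(k \mid a^{(k)})$ denotes the proportion of mediated heritability attributable to the kth-closest gene among SNP-gene pairs whose gene has set membership $a^{(k)}$.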
This assumption would be violated, for example, if A were the set of genes that are only regulated by promoter eQTLs and never by more distal enhancer eQTLs. It could also be violated if both the gene ranking and the gene set were cell type specific: for example, a ranking obtained from eQTL data in T cells may be highly informative for the set of T cell-expressed genes, but not for B cell-specific autoimmune disease genes. This source of model violations is not relevant to the proximity ranking (which is the same in all cell types).
The third assumption is the “all-heritability-mediated” assumption. We assume that genes in cis mediate 100% of heritability. This assumption is plausible, as trans-regulatory effects are most likely mediated by cis-regulatory effects,29 but it could be violated if the number of cis genes was too small or if some trans-regulatory effects are not mediated by any cis gene.
Violations of this assumption would affect our estimates in two ways. First, estimates of p(k) are interpreted as proportions of cis-mediated heritability rather than as proportions of all heritability. Second, our estimates of τ(0) would be upwardly biased, leading to downward bias in estimated heritability enrichments.
Estimation details
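Let $a_i^{(k)}$ be the indicator variable for $g_i^{(k)} \in A$, where $g_i^{(k)}$ is the kth-closest gene to SNP xi. Suppose that genes inside and outside A have different effect-size variances, such that the heritability of SNP xi mediated by its rank-k gene is

$$E\big[h_i^2(g_i^{(k)})\big] = p(k)\,\big[\tau(0) + \tau(A)\, a_i^{(k)}\big]. \qquad \text{(Equation 8)}$$

This equation relies on assumption 2 (see above). Summing over genes,

$$E\big[h_i^2\big] = \tau(0) + \tau(A) \sum_{k=1}^{K} p(k)\, a_i^{(k)}. \qquad \text{(Equation 9)}$$

(We have used the definition that $\sum_{k} p(k) = 1$.) We derive an LD score regression equation from Equation 9 (see supplemental notes). In brief, let $\chi_j^2$ be the chi-square statistic of SNP xj calculated from a GWAS of size N, and let $\ell(j, a^{(k)}) = \sum_i r_{ij}^2\, a_i^{(k)}$ be the LD score for SNP xj with the annotation $a_i^{(k)}$ for all SNPs xi. Then

$$E\big[\chi_j^2\big] = N\,\tau(A) \sum_{k=1}^{K} p(k)\, \ell\big(j, a^{(k)}\big) + N\,\tau(0)\, \ell_j + 1, \qquad \text{(Equation 10)}$$

where $\ell_j$ is the unstratified LD score. This equation leads to the following estimation procedure:
(1) construct annotations $a^{(k)}$ for each SNP and stratified LD scores for these annotations;
(2) perform LD score regression on these LD scores jointly, possibly with additional covariates such as the baselineLD model (see below), obtaining regression coefficients τ(k) = τ(A)p(k) for each annotation;
(3) estimate $\tau(A) = \sum_k \tau(k)$;
(4) if τ(A) is significantly greater than zero, estimate $p(k) = \tau(k)/\tau(A)$.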
We use the output of Equation 10 to estimate the mediated heritability enrichment of the gene set A as noted above.
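Steps (3) and (4) of the procedure amount to summing the per-annotation coefficients and normalizing. A minimal sketch, assuming the per-rank regression coefficients are already available as an array; the function name and toy coefficients are hypothetical, and the significance test on τ(A) is replaced here by a simple positivity check.

```python
import numpy as np

def estimate_architecture(tau_k):
    """Given per-rank regression coefficients tau(k), return tau(A) and p(k).
    tau(A) is the sum of the coefficients; p(k) is each coefficient's share."""
    tau_k = np.asarray(tau_k, dtype=float)
    tau_a = tau_k.sum()
    if tau_a <= 0:
        # The text requires tau(A) significantly > 0 (needs a standard error);
        # this sketch only guards against a non-positive denominator.
        raise ValueError("tau(A) must be positive to estimate p(k)")
    return tau_a, tau_k / tau_a

tau_a_hat, p_hat_k = estimate_architecture([2.0e-8, 1.4e-8, 0.6e-8])
```

By construction the estimated p(k) sum to one, so noisy coefficients for distant genes directly trade off against the share assigned to closer genes.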
For all estimation with real data, standard errors are derived from block jackknife with 200 partitions of adjacent SNPs, as previously described for LD score regression.26
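The block jackknife can be sketched as follows. This is a simplified illustration using unweighted least squares on toy data; LD score regression itself uses regression weights that are omitted here, and the function name and simulated inputs are assumptions.

```python
import numpy as np

def block_jackknife(x, y, n_blocks=200):
    """Delete-one-block jackknife over contiguous row blocks.
    Returns full-data OLS coefficients and their jackknife standard errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    full = np.linalg.lstsq(x, y, rcond=None)[0]
    loo = np.empty((n_blocks, x.shape[1]))
    # Blocks are runs of adjacent rows, mimicking partitions of adjacent SNPs.
    for b, idx in enumerate(np.array_split(np.arange(n), n_blocks)):
        keep = np.ones(n, dtype=bool)
        keep[idx] = False
        loo[b] = np.linalg.lstsq(x[keep], y[keep], rcond=None)[0]
    # Pseudovalues and their variance give the jackknife standard error.
    pseudo = n_blocks * full - (n_blocks - 1) * loo
    se = np.sqrt(pseudo.var(axis=0, ddof=1) / n_blocks)
    return full, se

rng = np.random.default_rng(0)
ld = rng.uniform(1, 50, 2_000)                    # toy LD scores
design = np.column_stack([np.ones_like(ld), ld])  # intercept + LD score
y = 0.02 * ld + rng.standard_normal(2_000)        # toy chi-square-like response
coef, se = block_jackknife(design, y)
```

Deleting contiguous blocks rather than single SNPs is what lets the standard error reflect correlation between neighboring SNPs.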
Simulations
Simulation of p(k)
We first assigned SNP effects to be mediated by their kth-closest genes with probability p(k) (probabilities: [0.5, 0.35, 0.15] for k = 1, 2, 3). We then randomly assigned 10% of 18,000 genes to membership in an enriched gene set A and assigned these genes larger gene-trait squared effect sizes: effect sizes in A were drawn from N(mean = 0, var = 1.5), while effect sizes for all other genes were drawn from N(mean = 0, var = 0.5). We then randomly assigned each of 900,000 SNPs its three closest genes; the only constraint was that each SNP was assigned three unique genes. We then simulated SNP-trait effects βj as the effect size of the gene mediating the effect of SNP j, normalized such that the sum of $\beta_j^2$ equals the heritability of 0.1. We then generated GWAS χ2 statistics by multiplying βj by the square root of the GWAS sample size (N = 500,000), adding noise from a standard normal distribution, and squaring that quantity. We then regressed (χ2 − 1) on three indicator vectors denoting whether the kth-closest gene to SNP j is in the enriched gene set (indicators are sufficient because there is no LD in this simulation). To estimate p(k), we divided the kth proximity annotation coefficient from this regression by the sum of the three proximity coefficients.
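The simulation described above can be sketched at reduced scale. The numbers of genes and SNPs below are scaled down from the values in the text to keep the example fast, and the random seed, variable names, and the resampling scheme used to enforce unique genes are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_snps = 2_000, 100_000   # scaled down from 18,000 genes / 900,000 SNPs
n_gwas, h2 = 500_000, 0.1
p_true = np.array([0.5, 0.35, 0.15])

# 10% of genes are in the enriched set A and get larger effect-size variance.
in_a = np.zeros(n_genes, dtype=bool)
in_a[rng.choice(n_genes, n_genes // 10, replace=False)] = True
gene_effect = rng.normal(0.0, np.sqrt(np.where(in_a, 1.5, 0.5)))

def has_duplicate(g):
    return (g[:, 0] == g[:, 1]) | (g[:, 0] == g[:, 2]) | (g[:, 1] == g[:, 2])

# Assign three unique "closest" genes per SNP, resampling rows with repeats.
genes = rng.integers(0, n_genes, size=(n_snps, 3))
bad = has_duplicate(genes)
while bad.any():
    genes[bad] = rng.integers(0, n_genes, size=(int(bad.sum()), 3))
    bad = has_duplicate(genes)

# Each SNP effect is mediated by its kth-closest gene with probability p(k).
k_med = rng.choice(3, size=n_snps, p=p_true)
beta = gene_effect[genes[np.arange(n_snps), k_med]]
beta *= np.sqrt(h2 / np.sum(beta ** 2))      # sum of beta^2 equals h2

# Chi-square statistics without LD, then regress (chi2 - 1) on the indicators.
chi2 = (np.sqrt(n_gwas) * beta + rng.standard_normal(n_snps)) ** 2
design = np.column_stack([np.ones(n_snps), in_a[genes].astype(float)])
coef = np.linalg.lstsq(design, chi2 - 1.0, rcond=None)[0]
p_hat = coef[1:] / coef[1:].sum()
```

Because the same realized gene effects enter every proximity rank, noise in the gene-level effect sizes largely cancels in the ratio that defines the estimated p(k).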
Simulation of mediated heritability enrichments
We simulated three different values of e(A) by varying the gene-to-trait effect sizes: effect sizes of genes outside the enriched gene set were drawn from N(mean = 0, var = 0.5), while effect sizes of genes in the enriched gene set were drawn from N(mean = 0, var = 1.01) in the first simulation, N(mean = 0, var = 1.49) in the second simulation, and N(mean = 0, var = 2.44) in the third simulation. In each of the three simulations, we calculated τ(0) as the mean simulated squared effect of genes not in the enriched set, while τ(A) is the mean of the squared effect sizes of genes in the gene set minus τ(0). Note that τ(0) is defined as the expected heritability of a SNP with no cis genes in annotation A and τ(A) is the additional per-SNP heritability mediated by genes in A. We then calculated e(A) as defined in the main text by using τ(0), τ(A), and the fraction of genes in the enriched gene set (10%). We then used AMM to estimate e(A). Specifically, we first estimated the kth proximity annotation coefficients as described above and then estimated τ(A) as the sum of the regression coefficients and τ(0) as equal to the regression intercept.
Simulations without LD and with multiple mediating genes per SNP
An alternative interpretation of p(k) is that multiple genes mediate SNP effects. To simulate this, we made each SNP affect all three proximate genes, with SNP-gene squared effect sizes proportional to p(k). Specifically, the SNP-to-gene effect for the closest gene is drawn from N(mean = 0, var = 1), for the second-closest gene from N(mean = 0, var = 0.7), and for the third-closest gene from N(mean = 0, var = 0.3). All other aspects of the simulation are as above.
Simulations with LD
We performed simulations with real LD from 74,440 SNPs from the 1000 Genomes Project. Simulations with LD differ in a few ways. With LD, GWAS Z scores are generated as

$$Z = \sqrt{n}\, R\, \beta + R'\, \epsilon, \qquad \text{(Equation 11)}$$

where n is the number of samples, R is the LD matrix, β are SNP-trait effects, R′ is the Cholesky decomposition of the LD matrix, and ε is statistical noise drawn from a standard normal distribution. The resulting χ2 statistics are regressed on stratified LD scores estimated from the LD matrix and the gene proximity annotations, plus an LD score for the base annotation of all SNPs.
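A small sketch of generating Z scores with LD, assuming the form Z = √n·Rβ + R′ε with R′ the lower Cholesky factor of R (so that the noise term has covariance R). The toy LD matrix with geometrically decaying correlation, the function name, and the parameter values are assumptions for illustration; the text uses real LD from the 1000 Genomes Project.

```python
import numpy as np

def simulate_z(beta, ld, n, rng):
    """Draw Z = sqrt(n) * R @ beta + R' @ eps, with R = R' R'^T."""
    chol = np.linalg.cholesky(ld)            # lower-triangular factor R'
    eps = rng.standard_normal(len(beta))     # independent standard normal noise
    return np.sqrt(n) * ld @ beta + chol @ eps

rng = np.random.default_rng(1)
m = 50
idx = np.arange(m)
ld = 0.6 ** np.abs(idx[:, None] - idx[None, :])   # toy positive-definite LD matrix
beta = rng.normal(0.0, np.sqrt(0.1 / m), m)       # toy SNP-trait effects
z = simulate_z(beta, ld, 10_000, rng)
chi2_ld = z ** 2                                  # chi-square statistics
ld_scores = (ld ** 2).sum(axis=1)                 # unstratified LD scores
```

Multiplying independent noise by the Cholesky factor is a standard way to obtain correlated noise whose covariance equals the LD matrix.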
Specific methods implemented in manuscript
Estimation of p(k) in bins
We binned gene proximity annotations to increase power. For instance, by binning the third- through fifth-closest genes, we obtained an average proportion of heritability mediated by each of the third- through fifth-closest genes. We obtained binned p(k) estimates through the following modified procedure.
(1) Construct annotations $a^{(k)}$ for each SNP and stratified LD scores for each of the k annotations. We defined gene location as the gene body midpoint.
(2) For each bin range (e.g., third- through fifth-closest genes), sum the LD scores for each annotation in the range, for example: $\ell(j, a^{(3\text{-}5)}) = \ell(j, a^{(3)}) + \ell(j, a^{(4)}) + \ell(j, a^{(5)})$.
(3) Perform LD score regression with these binned LD scores, plus additional covariates (such as the baselineLD model), obtaining a coefficient τ(bin) for each bin.
(4) Estimate $\tau(A) = \sum_{\text{bins}} g(\text{bin})\, \tau(\text{bin})$, where g(bin) is the number of genes in the bin.
(5) Estimate binned $p(\text{bin}) = \tau(\text{bin}) / \tau(A)$.
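The binning procedure can be sketched as below, assuming τ(A) is recovered as the sum over bins of the bin size times the bin coefficient and the binned p is each bin coefficient divided by τ(A). The function names, the (start, stop) bin encoding, and the toy inputs are hypothetical.

```python
import numpy as np

def bin_ld_scores(ld_scores, bins):
    """Sum per-rank LD score columns within each bin.
    `bins` holds 1-based inclusive (start, stop) gene ranks, e.g. (3, 5)."""
    ld_scores = np.asarray(ld_scores, dtype=float)
    return np.column_stack([ld_scores[:, a - 1:b].sum(axis=1) for a, b in bins])

def binned_architecture(tau_bin, bins):
    """Return tau(A) and per-gene binned p from per-bin coefficients."""
    tau_bin = np.asarray(tau_bin, dtype=float)
    sizes = np.array([b - a + 1 for a, b in bins])   # g(bin)
    tau_a = np.sum(sizes * tau_bin)
    return tau_a, tau_bin / tau_a

bins = [(1, 1), (2, 2), (3, 5)]
toy_ld = np.arange(10.0).reshape(2, 5)               # 2 SNPs x 5 proximity ranks
binned = bin_ld_scores(toy_ld, bins)
tau_a_bin, p_bin = binned_architecture([0.4, 0.3, 0.1], bins)
```

Under this convention the binned p is a per-gene average, so multiplying each value by its bin size and summing across bins gives one.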
Estimation of p(k) with covariates and meta-analysis
We condition on baselineLD model annotations to eliminate potential sources of confounding, ensuring that gene set enrichments are not driven by genes that have larger numbers of nearby SNPs in one of these enriched annotations (for example, longer genes may have more nearby coding SNPs). The modified regression equation is:

$$E\big[\chi_j^2\big] = N\,\tau(A) \sum_{k=1}^{K} p(k)\, \ell\big(j, a^{(k)}\big) + N \sum_{n} \tau_n\, \ell(j, n) + 1, \qquad \text{(Equation 12)}$$

where $\tau_n$ is the regression coefficient for baselineLD model annotation n and $\ell(j, n)$ is the LD score for SNP j with respect to that annotation.
When estimating p(k) in constrained genes, we constructed a set of “baselineLD minus” annotations in order to avoid controlling for genic elements relevant to constrained genes; these excluded features including conservation, minor allele frequency, and ancient sequence annotation (Table S1).
We estimated binned p(k) by using constrained genes and meta-analyzing across 47 traits. For each annotation bin (i.e., kth-closest gene is in the constrained set), we obtained binned estimates of τ(k) in a joint regression as noted above and estimates of τ(A) by summing across binned annotations. To estimate binned p(k) meta-analyzed across traits,
| (Equation 13) |
AMM with known p(k)
AMM can simultaneously estimate p(k) and mediated heritability enrichments for gene sets. However, we observed that power to estimate mediated heritability enrichments increases when pre-trained p(k) are used (i.e., AMM only estimated mediated heritability enrichments and not p(k) for the focal gene set) (Figure S1). To use pre-trained p(k), we use AMM to estimate a single heritability coefficient instead of a coefficient for each SNP annotation. To estimate this coefficient, we calculate a linear combination of LD scores for each of the original annotations, where the weights are the pre-trained p(k) estimates. For instance, if analysis consisted of two annotations with pre-trained p(k) values of [0.7, 0.3] with annotation LD scores l1 and l2, the new regression LD score would be lpre-trained = 0.7l1 + 0.3l2. In the manuscript, we used pre-trained p(k) values from the set of constrained genes given its mediated heritability enrichment across a range of traits, which facilitates meta-analysis.
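The linear combination described above is a single matrix-vector product. A minimal sketch, with a hypothetical function name and toy LD scores:

```python
import numpy as np

def pretrained_ld_score(ld_scores, p_k):
    """Combine per-annotation LD scores (columns) into one regression LD score,
    weighting each column by its pre-trained p(k)."""
    ld_scores = np.asarray(ld_scores, dtype=float)
    p_k = np.asarray(p_k, dtype=float)
    return ld_scores @ p_k

# Two annotations with pre-trained p(k) = [0.7, 0.3]; rows are SNPs.
toy_scores = np.array([[10.0, 20.0],
                       [0.0, 5.0]])
l_pretrained = pretrained_ld_score(toy_scores, [0.7, 0.3])
```

Regressing on this single combined LD score yields one coefficient for the focal gene set, rather than one coefficient per proximity annotation.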
We estimate τ(0) from the covariate annotations in the model (i.e., from the baselineLD model27). Specifically:

$$\tau(0) = \frac{1}{M} \sum_{n} m_n\, \tau_n, \qquad \text{(Equation 14)}$$

where mn is the number of SNPs in the nth covariate annotation, τn is the LD score regression coefficient for the nth covariate annotation, and M is the number of SNPs in the model.
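As a small worked example of this average, assuming τ(0) is the annotation-size-weighted sum of covariate coefficients divided by the total number of SNPs (our reading of the definitions of mn, τn, and M); the function name and toy values are hypothetical.

```python
import numpy as np

def tau0_from_covariates(m_n, tau_n, n_snps):
    """Average per-SNP heritability implied by the covariate annotations:
    sum over annotations of (SNP count x coefficient), divided by M."""
    return float(np.dot(np.asarray(m_n, float), np.asarray(tau_n, float)) / n_snps)

# Toy model: a base annotation covering all 1,000 SNPs plus one 100-SNP annotation.
tau0_hat = tau0_from_covariates([1_000, 100], [1.0e-7, 5.0e-7], n_snps=1_000)
```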
GTEx gene sets
For analysis of mediated heritability of specifically expressed genes, we used previously defined gene sets based on bulk RNA expression across a range of tissues from GTEx.11 Briefly, in a given tissue, for each gene a t-statistic of specific expression as compared to other tissues was calculated. In a given tissue, specifically expressed genes are defined as the top 10% of genes with the greatest specific expression t-statistic. Note that for brain tissues (cortex), we used specific expression compared to non-brain tissues.
For analysis of mediated heritability enrichments of top expressed genes in cortex and liver, we downloaded from GTEx median gene-level TPM by tissue. The two focal tissues in this analysis were cortex (GTEx: “brain – cortex”) and liver (GTEx: “liver”). We defined total expression in a focal tissue as the sum of per-gene median expression across genes. We then calculated the fraction of total expression in a given tissue for a gene as the median expression of that gene in the tissue divided by the total expression in that tissue. We ranked genes by fraction of total expression in descending order and calculated cumulative expression as the fraction of total expression explained by that gene and all genes with greater total expression. In this specific analysis, we estimated a modified version of enrichment, defined as the fraction of remaining heritability mediated in the bin divided by the fraction of remaining genes in the bin. In detail, for each gene bin, we used AMM to estimate a mediated heritability enrichment. For each bin, we then calculated the fraction of mediated heritability in that bin as the fraction of total genes in that bin multiplied by the mediated heritability enrichments. Starting at the first bin (genes 1–1,000 by top expression), we then estimated the fraction of remaining heritability as the fraction of mediated heritability in the bin divided by fraction of mediated heritability explained by that bin and all bins of lower gene rank (i.e., lower mean expression).
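The ranking and the modified enrichment can be sketched as follows. This reflects one reading of the description above, in which bins are ordered from highest to lowest expression and "remaining" refers to a bin together with all lower-expression bins; the function names and toy inputs are hypothetical.

```python
import numpy as np

def cumulative_expression(median_tpm):
    """Rank genes by their fraction of total expression (descending) and
    return the ranking plus the cumulative fraction of total expression."""
    tpm = np.asarray(median_tpm, dtype=float)
    frac = tpm / tpm.sum()
    order = np.argsort(-frac, kind="stable")
    return order, np.cumsum(frac[order])

def remaining_enrichment(gene_frac, enrichment):
    """Modified enrichment per bin: share of remaining heritability in the bin
    divided by share of remaining genes in the bin."""
    gene_frac = np.asarray(gene_frac, dtype=float)
    h2_frac = gene_frac * np.asarray(enrichment, dtype=float)
    # Reverse cumulative sums: this bin plus all bins of lower expression.
    h2_left = np.cumsum(h2_frac[::-1])[::-1]
    genes_left = np.cumsum(gene_frac[::-1])[::-1]
    return (h2_frac / h2_left) / (gene_frac / genes_left)

order, cum = cumulative_expression([10.0, 70.0, 20.0])
mod = remaining_enrichment([0.25] * 4, [2.0, 1.0, 0.5, 0.5])
```

The last bin always has a modified enrichment of one, because it contains all of the remaining genes and all of the remaining heritability.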
Window-based enrichments
Using sets of genes specifically expressed in tissues, we compared mediated heritability enrichments with window-based enrichments. Estimation of window-based enrichments via stratified LD score regression followed the standard approach described previously.11 Briefly, we constructed SNP annotations covering ±100 kb around each gene in the set of specifically expressed genes (see GTEx gene sets). We then estimated LD scores for this annotation and regressed χ2 statistics on these LD scores jointly with the baseline model to control for potential confounding.26
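A window annotation of this kind can be sketched for a single chromosome as below. The function name, the coordinate conventions (inclusive gene start and end positions in base pairs), and the toy positions are assumptions for illustration.

```python
import numpy as np

def window_annotation(snp_pos, gene_start, gene_end, window=100_000):
    """Flag SNPs lying within `window` bp of any gene in the set
    (single chromosome; inclusive boundaries)."""
    snp_pos = np.asarray(snp_pos)
    lo = np.asarray(gene_start) - window
    hi = np.asarray(gene_end) + window
    inside = (snp_pos[:, None] >= lo[None, :]) & (snp_pos[:, None] <= hi[None, :])
    return inside.any(axis=1)

# One gene spanning 200,000-220,000 bp; the window covers 100,000-320,000 bp.
in_window = window_annotation([50_000, 250_000, 500_000], [200_000], [220_000])
```

Unlike the proximity annotations used by AMM, this annotation ignores which gene is closest to a SNP and only records whether any gene in the set is nearby.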
Consensus gene list
We defined a list of consensus genes to use as input for the gene-by-SNP proximity-matrix. We constructed the consensus gene list as the intersection of (1) genes from the gnomAD browser that were autosomal, had estimated pLI (probability of loss-of-function intolerance),25 and were protein coding; (2) genes with a specific expression t-statistic from Finucane et al., 2018;11 and (3) genes with a median TPM (transcripts per million) measured from GTEx v8. The intersection of these three lists left 17,661 genes. Before analyzing any gene set, we first intersected the gene set with these 17,661 genes to ensure that all genes in the set were represented in our proximity matrix. We use GRCh37 in all analyses.
Results
Performance of AMM in simulations
We evaluated AMM in simulations with no LD. Stratified LD score regression, which we use to estimate fractions of heritability, is known to account for LD appropriately.26,27 We chose simulation parameters to approximately match UK Biobank (GWAS N = 500,000; M = 900,000 SNPs). We simulated 18,000 genes, 10% of which were assigned at random to an enriched gene set. SNP-to-gene and gene-to-trait effect sizes were drawn from normal distributions with different variance parameters depending on SNP-to-gene proximity and on gene-set membership, respectively (see material and methods).
AMM produced unbiased estimates of p(k), the fraction of heritability mediated by the kth-closest gene (Figure 2A), and of gene-set enrichment (Figure 2B). In these simulations, every SNP had an effect on each of its nearby genes, with effect-size variance proportional to p(k). We obtained similar results when each SNP affected only one cis gene, with probability p(k) (Figure S2). We also performed simulations with LD and obtained similar results (Figure S3). These simulations indicate that AMM can partition gene-mediated heritability without observing the genes themselves.
Figure 2.
Performance of AMM in simulations
(A) AMM produces unbiased estimates of the proportion of heritability mediated by the kth-closest gene. Horizontal black segments are the mean of the true heritability proportion across simulation iterations.
(B) AMM produces unbiased estimates of mediated heritability enrichment of gene sets. Error bars represent standard deviations across 50 simulation runs. See material and methods for additional details.
We performed four secondary simulations to assess the effect of potential model violations (see material and methods). First, we show that AMM infers upwardly biased estimates of e(A) and p(1) if genes in A are in disease-relevant regions of the genome without being disease-relevant themselves (violating the “no-spurious-enrichment” assumption; Figure S4). Second, we show that estimates of p(k) are biased if the cis-regulatory architecture of genes in A differs from those outside of A (violating the “no-special-relationship” assumption; Figure S5). Third, we show that estimates of e(A) are downwardly biased if genes in cis do not mediate 100% of heritability (violating the “all-heritability-mediated” assumption) but estimates of p(k) remain unbiased (Figure S6). Finally, we show that non-independence of SNP-to-gene and gene-to-trait effect sizes does not lead to bias (Figure S7).
SNP-to-gene architecture of 47 complex traits
Constrained genes (pLI ≥ 0.9, n = 2,776, Table S2), which are intolerant of heterozygous loss-of-function variation, are enriched for heritability across a wide range of complex traits.25 We applied AMM to this gene set in conjunction with well-powered GWAS summary statistics for 47 traits and common diseases (median N = 419,236) (Table S3). We estimated the proportion of heritability explained by the closest, second-closest, and kth-closest gene to each SNP, and we meta-analyzed across traits (material and methods).
On average, the closest gene to each SNP mediates 27.1% (SE: 6.4%) of heritability (Figure 3A). This estimate is approximately concordant with eQTL data22 (Figure S8) and epigenomic data.19,30,31 Per-gene heritability decays quickly from the closest gene to the third-closest gene, but detectable mediation persists as far as the 11th- through 20th-closest genes, where on average, each of the 11th- through 20th-closest genes mediate 2.3% (SE: 0.7%) of heritability.
Figure 3.
SNP-to-gene architecture of 47 complex traits
(A) Mediated heritability across gene-proximity bins, meta-analyzed across 47 traits, estimated with the constrained gene set (pLI ≥ 0.9). The estimate of p(k) is the average for genes in that bin; per-bin p(k) multiplied by the number of genes in the bin, summed across bins, equals 100% of heritability. Error bars represent standard errors.
(B) As in (A) but using the set of genes specifically expressed in liver from GTEx (n = 1,766) and meta-analyzed across eight traits with positive mediated enrichments in liver (Table S4).
(C) As in (A) but using the set of genes specifically expressed in cortex from GTEx (n = 1,766) and meta-analyzed across nine traits with positive mediated enrichments in cortex (Table S5). For numerical results, see Table S6.
We replicated these results by using previously defined gene sets derived from tissue-specific gene expression data in GTEx.11,22 To maximize power in the meta-analysis, we chose two tissues, cortex and liver, that were enriched across several tissue-relevant traits (see below and Tables S5 and S4). Estimates of SNP-to-gene architecture were concordant: using the cortex gene set, we estimate the closest gene mediates 31.4% (SE: 11.7%) of heritability, and using the liver gene set, we estimate 30.1% (SE: 7.0%) (Figures 3B and 3C).
We performed three secondary analyses. First, we verified that the constrained gene set is broadly enriched across traits (median = 2.1×), confirming that it is appropriate to view our estimates as an average across traits (not only across a subset of enriched traits) (Figure S9). Second, we considered the possibility that apparent signal in distant genes could be driven by physical clustering of constrained genes, which could violate the no-spurious-enrichment assumption (see material and methods); however, constrained genes do not cluster next to each other (and moreover, we jointly model all nearby genes in the regression) (Figure S10). Third, we estimated p(k) as a function of SNP-to-gene distance. We did not detect a statistically significant difference in SNP-to-gene architecture after incorporating distance (Figure S11).
Mediated heritability enrichment in Mendelian and drug target gene sets
Although most complex-trait heritability is explained by common regulatory variation, rare coding variants can have much larger effect sizes on these traits or on closely related Mendelian phenotypes. It is unclear to what extent common-variant heritability is mediated by Mendelian disease genes.
We applied AMM to estimate common-variant enrichment in Mendelian gene sets associated with rare forms of common diseases or traits. We analyzed 21 Mendelian dyslipidemia genes (also including lipid drug targets),32 50 Mendelian diabetes genes (also including drug targets),32 206 skeletal growth disorder genes,2 and 251 developmental disorder genes33 (predominantly neurodevelopmental) (Table S2). None of these gene sets were defined via GWAS results (which would lead to bias; see material and methods).
Mendelian gene sets were highly enriched for common-variant heritability (Figure 4). The 21 dyslipidemia genes mediate 25.1% (SE: 11.4%, p = 0.01) of common-variant heritability for low-density lipoprotein (LDL), an enrichment of 211.5× (SE: 96.0×). By comparison, the set of constrained genes is 132 times larger, and it mediates approximately the same proportion of LDL heritability. For diabetes, a similar fraction of heritability is mediated by 50 Mendelian diabetes genes, representing a 64.4× enrichment (SE: 27.0×, p = 9.5e−3). These gene sets were not strongly enriched for height, but height enrichment was observed for skeletal growth disorder genes: we estimate that this gene set mediates 12.7% of common-variant heritability, a 10.9× enrichment (SE: 2.7×, p = 1.1e−4). There was a surprising 8.6× enrichment (SE: 2.7×, p = 2.3e−3) for neuroticism in the diabetes gene set; other brain-related traits were not enriched in this gene set (Table S7), and S-LDSC26 is known to occasionally produce false positives for small annotations (see discussion).34
Figure 4.
Mediated heritability enrichment of genes implicated in Mendelian disorders
Horizontal bars are the fraction of mediated heritability, calculated as the product of AMM enrichment and the fraction of total genes in the gene set (red line). For trait-gene set pairs with enrichments significantly greater than 1, the bars are colored blue and point estimates of enrichments are listed. AMM enrichment estimates are also listed for enrichments significantly less than 1. Error bars represent standard errors of the fraction of mediated heritability. See Table S2 for gene set references. For numerical results, see Table S7.
Finally, we analyzed mediated heritability enrichment in 251 genes associated with developmental disorders (mostly neurodevelopmental). We find enrichments for cognitive traits in this gene set, including schizophrenia (3.5×, p = 0.01) and neuroticism (3.5×, p = 0.01). However, these enrichments are similar to those of constrained genes more broadly (3.0× and 2.9×, respectively). The constrained gene set is approximately ten times larger and thus mediates much more heritability than the neurodevelopmental gene set (47.4% versus 4.9% in schizophrenia, for example). This modest enrichment could be related to the much higher polygenicity of brain-related traits compared with lipid and metabolic traits (see further discussion below).
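The relationship used in Figure 4 (fraction of mediated heritability equals enrichment times the fraction of genes in the set) can be checked against the numbers in this section. The following is a minimal sketch, not the authors' code. The total number of genes is not stated in this section, so it is backed out from the LDL result and should be read as an implied value; the function and variable names are illustrative.

```python
def mediated_fraction(enrichment, n_set, n_total):
    """Fraction of heritability mediated = enrichment x fraction of genes in the set."""
    return enrichment * n_set / n_total

# ASSUMPTION: the total gene count is not given in this section; it is implied by
# the LDL result (21 genes, 211.5x enrichment, 25.1% of heritability mediated).
n_total_implied = 21 * 211.5 / 0.251

# Skeletal growth disorder genes and height: 206 genes, 10.9x enrichment.
skeletal_height = mediated_fraction(10.9, 206, n_total_implied)

# Size of the constrained gene set (n = 2,776) relative to the 21 dyslipidemia genes.
size_ratio = 2776 / 21

print(round(n_total_implied), f"{skeletal_height:.1%}", round(size_ratio))
```

With an implied total of roughly 17,700 genes, the same relationship recovers the reported 12.7% for skeletal growth disorder genes and height, and the ratio of gene-set sizes matches the 132-fold difference noted above.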
Mediated heritability enrichment of specifically expressed genes
SNPs near genes specifically expressed in trait-relevant tissues are enriched for heritability.11 However, it is unknown what fraction of heritability is actually mediated by these genes. We applied AMM to the top 10% of specifically expressed genes across GTEx tissues, as previously defined.11
AMM highlighted trait-relevant tissues (Figure 5A). Genes specifically expressed in cortex (versus non-brain tissues) mediate 21.9% (SE: 3.5%, p = 2.7e−4) of schizophrenia heritability (2.2× enrichment), while genes specifically expressed in blood mediate 3.0% (SE: 3.5%, p = 0.02 for depletion). Genes specifically expressed in liver mediate 54.0% (SE: 18.4%, p = 8.3e−3) (5.4× enrichment) of LDL heritability. Genes specifically expressed in whole blood mediate 35.5% of Alzheimer disease heritability (SE: 12.3%, p = 0.02), while genes specifically expressed in cortex mediate 14.4% (SE: 7.5%, p = 0.28).
Figure 5.
Mediated heritability enrichment of specifically expressed genes
(A) Horizontal bars are the fraction of mediated heritability, calculated as the product of AMM enrichment and the fraction of total genes in the gene set (red line). For trait-gene set pairs with enrichments significantly greater than 1, the bars are colored blue and point estimates of enrichments are listed. For enrichments significantly less than 1, point estimates of enrichments are listed. Error bars represent standard errors of the fraction of mediated heritability.
(B) Contrasting window-based enrichment (gene body ± 100 kb) with mediated heritability enrichments (AMM) for the same 15 trait-gene set pairs as in (A) (same color code).
(C) We compared the statistical power of mediated enrichments from AMM against window-based enrichments ± 100 kb for the same 15 trait-gene set pairs. The AMM enrichment Z scores are defined as (e(A) − 1)/SE(e(A)). The S-LDSC Z score is from the regression coefficient of annotation of specifically expressed genes. For numerical results, see Tables S4, S5, S8, and S9.
Using the same gene sets and traits, we estimated the heritability enrichment of SNPs within 100 kb of these genes (“gene-window heritability”), defined as the fraction of heritability in the gene window divided by the fraction of SNPs, as described by Finucane et al. (Figure 5B).11 These estimates differ from the AMM estimates in two ways. First, gene-window enrichment estimates are attenuated relative to gene-mediated enrichments for significantly enriched trait-tissue pairs, as genes can be regulated by SNPs outside their window. Second, several trait-tissue pairs are depleted for mediated heritability but not for gene-window heritability. This difference results from the fact that gene-proximal SNPs are more likely to be functional (due to being coding or conserved, etc.), increasing their gene-window enrichment. This limitation affects point estimates of gene-window enrichment but not their p values, which were defined more stringently (conditioning on a union-of-gene-windows annotation).11 AMM point estimates and p values are unaffected by this phenomenon, as every SNP has exactly one closest gene, regardless of distance from it. In contrast to the point estimates, p values for enrichment were similar between AMM and the window-based approach (Figure 5C).
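The AMM Z score defined for Figure 5C, (e(A) − 1)/SE(e(A)), can be recomputed from the summary values quoted in this section. The sketch below assumes that each gene set comprises 10% of genes (the top 10% of specifically expressed genes), that the standard error of the enrichment is the standard error of the mediated fraction divided by that same 10%, and that p values are one-sided under a standard normal distribution. The last two points are assumptions made for illustration rather than a description of the authors' procedure, although they give values close to those reported.

```python
import math

def enrichment_z(frac_h2, se_frac_h2, frac_genes):
    """Z score (e - 1) / SE(e), with e = mediated fraction / fraction of genes."""
    e = frac_h2 / frac_genes
    se_e = se_frac_h2 / frac_genes  # ASSUMPTION: SE rescaled by the same constant
    return (e - 1.0) / se_e

def one_sided_p(z, depletion=False):
    """Standard normal tail probability (ASSUMPTION: one-sided normal test)."""
    tail = -z if depletion else z
    return 0.5 * math.erfc(tail / math.sqrt(2.0))

FRAC_GENES = 0.10  # top 10% of specifically expressed genes

# Liver-specific genes and LDL: 54.0% of heritability (SE: 18.4%).
z_liver_ldl = enrichment_z(0.540, 0.184, FRAC_GENES)
# Blood-specific genes and schizophrenia: 3.0% of heritability (SE: 3.5%).
z_blood_scz = enrichment_z(0.030, 0.035, FRAC_GENES)

print(f"{z_liver_ldl:.2f}", f"{one_sided_p(z_liver_ldl):.4f}")
print(f"{z_blood_scz:.2f}", f"{one_sided_p(z_blood_scz, depletion=True):.4f}")
```

Under these assumptions the liver-LDL pair gives a Z score of about 2.4 with a one-sided p value near 0.008, and the blood-schizophrenia pair gives a depletion p value near 0.02, in line with the values quoted above.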
Contrasting mediated heritability enrichment of top expressed genes in cortex and liver
Among complex traits, the most polygenic are brain related and some of the least polygenic are liver related.35 We investigated whether different gene expression patterns in the cortex compared with the liver from GTEx were related to these differences.22 First, we ranked genes by their fraction of (TPM-normalized) expression in the focal tissue (see material and methods) (Figure 6A). Expression in liver is skewed toward the most highly expressed genes: 50% of total expression in liver comes from the 131 most expressed genes, while in cortex, the top 946 genes represent 50% of total expression. The most expressed gene in liver (ALB, albumin) represents 5.7% of total liver expression, compared to 0.7% for the most expressed gene in cortex.
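The concentration statistic used here (how many of the most expressed genes are needed to reach 50% of total expression) can be illustrated with a short sketch. The expression values below are hypothetical toy numbers, not GTEx data, and the function names are illustrative.

```python
from itertools import accumulate

def expression_fractions(median_tpm):
    """Each gene's share of total expression, sorted from most to least expressed."""
    total = sum(median_tpm)
    return sorted((x / total for x in median_tpm), reverse=True)

def genes_to_reach(median_tpm, target=0.5):
    """Smallest number of top-expressed genes whose cumulative share reaches target."""
    cumulative = accumulate(expression_fractions(median_tpm))
    for n_genes, share in enumerate(cumulative, start=1):
        if share >= target:
            return n_genes
    return len(median_tpm)

# Hypothetical toy tissues: one skewed toward a single gene, one evenly spread.
skewed_tissue = [60, 20, 10, 5, 5]
flat_tissue = [25, 25, 25, 25]

print(genes_to_reach(skewed_tissue), genes_to_reach(flat_tissue))
```

In the toy example, the skewed tissue reaches half of its total expression with fewer genes than the flat tissue, which is the same qualitative contrast as between liver (131 genes) and cortex (946 genes).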
Figure 6.
Contrasting expression and enrichment in cortex and liver
(A) We calculated the fraction of total expression for each gene as its median expression divided by the total across genes, and we plot the cumulative fraction of expression across genes from most to least expressed.
(B) Across genes from most to least expressed, we calculated the cumulative mediated heritability enrichment of tissue-appropriate traits. Cumulative mediated enrichment for a bin was defined as its fraction of remaining heritability (subtracting more highly expressed genes) divided by its fraction of remaining genes; in the last bin, it is null by definition. Error bars represent standard errors. For numerical results, see Tables S10 and S11.
We estimated the cumulative mediated heritability enrichment for top expressed genes in each tissue, binning genes by expression from greatest to least expressed (Figure 6B). Cumulative mediated heritability enrichment was defined as the fraction of remaining heritability mediated by genes in the bin divided by the fraction of remaining genes in a meta-analysis of tissue-enriched traits (see material and methods and Tables S4 and S5).
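The definition of cumulative mediated heritability enrichment can be made concrete with a small sketch. The per-bin heritability shares and gene counts below are hypothetical, chosen only to show the calculation; "remaining" is taken to include the current bin, which is what makes the last bin equal to 1 by definition.

```python
def cumulative_enrichment(h2_by_bin, genes_by_bin):
    """Per-bin (share of remaining heritability) / (share of remaining genes).

    Bins are ordered from most to least expressed. "Remaining" counts the current
    bin and all less expressed bins, so the final bin is always exactly 1.
    """
    remaining_h2 = sum(h2_by_bin)
    remaining_genes = sum(genes_by_bin)
    out = []
    for h2, n_genes in zip(h2_by_bin, genes_by_bin):
        out.append((h2 / remaining_h2) / (n_genes / remaining_genes))
        remaining_h2 -= h2
        remaining_genes -= n_genes
    return out

# Hypothetical example: three expression bins, heritability in percent.
toy_h2 = [40, 30, 30]
toy_genes = [100, 300, 600]

print(cumulative_enrichment(toy_h2, toy_genes))
```

In this toy example the top bin holds 40% of heritability in 10% of genes (4× enrichment), the second bin holds half of what remains in one-third of the remaining genes (1.5×), and the last bin is 1 by construction.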
Like total expression, mediated heritability enrichment in liver is more highly concentrated among highly expressed genes: the 500 most-expressed genes were enriched in heritability (4.4×, p = 1.0e−3) and were significantly more enriched than the top 500 genes in cortex (p = 3.9e−3 for comparison). In contrast, the 5,000th–10,000th and 10,000th–15,000th most-expressed genes were enriched for heritability in cortex (p = 2.8e−6 and p = 8.4e−3, respectively) but unenriched in liver (p = 0.49 and p = 0.43, respectively; p < 0.05 for both comparisons). These estimates raise the question of whether differences in polygenicity among brain- versus liver-related traits are explained by differences in gene expression patterns.36
Discussion
We have presented the abstract mediation model for partitioning gene-mediated heritability and applied it to a set of 47 phenotypes and a range of biologically and biomedically important gene sets. Our results demonstrate that it is possible to estimate mediated heritability without observing underlying mediating variables, instead observing enriched proxies. This approach bypasses the need to measure gene expression levels in causal cell types and states.
Using AMM, we estimated that the closest gene mediates the most heritability (approximately 30%) and that a substantial fraction is mediated by genes outside the ten closest. On the one hand, this estimate is concordant with the distances between significant eQTL-eGene pairs from GTEx (Figure S8), as well as with previous estimates with eQTL and GWAS data.19,30,31 On the other hand, an analysis of genome-wide significant metabolite QTLs estimated that 69% of candidate causal genes were the closest gene to the lead variant.37 Moreover, a recent UK Biobank analysis found that genes with rare coding associations were strongly enriched for being the closest gene to a genome-wide significant common variant, suggesting that common-variant associations are frequently mediated by their closest gene.38 These larger estimates were obtained by analyzing only genome-wide significant loci, which have larger-than-average effect sizes. SNP-to-gene architecture may differ between loci with small and large effect sizes, for example if large-effect loci are disproportionately enriched in coding regions. If so, we would expect genome-wide significant loci to implicate their closest gene much more often than subthreshold associations, explaining the discrepancy between these estimates and ours for p(k).
Our Mendelian gene set enrichment estimates raise the question of how much rare-variant heritability is mediated by the same genes, and more generally, how common- and rare-variant heritability enrichments compare. Recently released exome sequencing results38,39 may facilitate estimation of rare coding variant heritability. These estimates will not be affected by uncertain SNP-to-gene architecture, and they can be compared directly with mediated heritability estimates from AMM.
While we have utilized SNP-to-gene proximity to rank genes, other strategies could be considered. Hi-C contact,40 epigenetic co-expression,41 and fine-mapped eQTL data,42 by themselves or together, could be used to rank genes for each SNP. The SNP-to-gene ranking should not be specifically informative for the gene set being tested, as this would violate the “no-special-relationship” assumption (see material and methods). An important goal should be to produce a maximally enriched SNP-to-gene mapping, such that the top ranked gene for each SNP explains the majority of disease heritability.
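To make the notion of a SNP-to-gene ranking concrete, the sketch below orders genes for a single SNP by proximity. It is an illustration under stated assumptions rather than the released AMM implementation: distance is measured to the nearest edge of the gene body (zero if the SNP lies inside the gene), all genes are on the same chromosome as the SNP, and the gene names and coordinates are hypothetical.

```python
def distance_to_gene(snp_pos, gene_start, gene_end):
    """Base pairs from the SNP to the gene body (ASSUMPTION: 0 if inside the gene)."""
    if gene_start <= snp_pos <= gene_end:
        return 0
    return min(abs(snp_pos - gene_start), abs(snp_pos - gene_end))

def rank_genes_by_proximity(snp_pos, genes):
    """Return gene names ordered from closest (k = 1) to farthest.

    `genes` is a list of (name, start, end) tuples on the SNP's chromosome.
    """
    ordered = sorted(genes, key=lambda g: distance_to_gene(snp_pos, g[1], g[2]))
    return [name for name, _, _ in ordered]

# Hypothetical genes and SNP position.
toy_genes = [("GENE_A", 100, 200), ("GENE_B", 500, 600), ("GENE_C", 260, 300)]
ranking = rank_genes_by_proximity(250, toy_genes)

print(ranking)  # first entry is the closest gene, second the second-closest, and so on
```

An alternative ranking of the kind discussed above would replace `distance_to_gene` with another per-gene score (for example, a contact frequency sorted in decreasing order), leaving the rest of the ranking logic unchanged.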
Our study has several limitations. First, AMM does not identify causal gene(s) for a given SNP, and more generally, putative disease gene sets must be prespecified. Second, AMM does not identify the cell type or context in which gene mediation occurs, potentially in contrast with eQTL-based approaches. Third, AMM often has limited power for individual genes or very small gene sets, and p(k) estimates must be meta-analyzed across traits for power. Fourth, AMM requires some assumptions about SNP-to-gene architecture (see material and methods). Fifth, AMM cannot determine whether SNPs are mediated by single or multiple genes, though in simulations we show that AMM can estimate unbiased p(k) under either model. Finally, stratified LD score regression has elevated false positive rates for small annotations, which may affect our analysis of small gene sets.34 Despite these limitations, this study presents a novel approach for estimating gene-mediated heritability and informs our understanding of the convergence of common and rare genetic variation. We have released a command-line software implementation of AMM (see web resources).
Acknowledgments
This work was supported by a grant from NLM (T15LM007092, D.J.W.), a grant from The Simons Foundation (704413, L.J.O. and E.B.R.), and a grant from the NHGRI (R00 HG010160 to S.G.). We are grateful to Jenna Ballard, Hilary Finucane, Sasha Gusev, Ajay Nadig, Alkes Price, Pouria Salehi, Katie Siewert, and Doug Yao for their input on this manuscript.
Declaration of interests
The authors declare no competing interests.
Published: February 9, 2022
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2022.01.010.
Contributor Information
Daniel J. Weiner, Email: dweiner@broadinstitute.org.
Luke J. O’Connor, Email: loconnor@broadinstitute.org.
Web resources
AMM command line tool and reference files, https://github.com/danjweiner/AMM21
References
- 1. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511:421–427. doi: 10.1038/nature13595.
- 2. Lango Allen H., Estrada K., Lettre G., Berndt S.I., Weedon M.N., Rivadeneira F., Willer C.J., Jackson A.U., Vedantam S., Raychaudhuri S., et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467:832–838. doi: 10.1038/nature09410.
- 3. Fachal L., Aschard H., Beesley J., Barnes D.R., Allen J., Kar S., Pooley K.A., Dennis J., Michailidou K., Turman C., et al. Fine-mapping of 150 breast cancer risk regions identifies 191 likely target genes. Nat. Genet. 2020;52:56–73. doi: 10.1038/s41588-019-0537-1.
- 4. Huang H., Fang M., Jostins L., Umićević Mirkov M., Boucher G., Anderson C.A., Andersen V., Cleynen I., Cortes A., Crins F., et al. Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature. 2017;547:173–178. doi: 10.1038/nature22969.
- 5. Hormozdiari F., van de Bunt M., Segrè A.V., Li X., Joo J.W.J., Bilow M., Sul J.H., Sankararaman S., Pasaniuc B., Eskin E. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am. J. Hum. Genet. 2016;99:1245–1260. doi: 10.1016/j.ajhg.2016.10.003.
- 6. Fulco C.P., Nasser J., Jones T.R., Munson G., Bergman D.T., Subramanian V., Grossman S.R., Anyoha R., Doughty B.R., Patwardhan T.A., et al. Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 2019;51:1664–1669. doi: 10.1038/s41588-019-0538-0.
- 7. Wainberg M., Sinnott-Armstrong N., Mancuso N., Barbeira A.N., Knowles D.A., Golan D., Ermel R., Ruusalepp A., Quertermous T., Hao K., et al. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet. 2019;51:592–599. doi: 10.1038/s41588-019-0385-z.
- 8. Lange L.A., Hu Y., Zhang H., Xue C., Schmidt E.M., Tang Z.Z., Bizon C., Lange E.M., Smith J.D., Turner E.H., et al. Whole-exome sequencing identifies rare and low-frequency coding variants associated with LDL cholesterol. Am. J. Hum. Genet. 2014;94:233–245. doi: 10.1016/j.ajhg.2014.01.010.
- 9. Flannick J., Mercader J.M., Fuchsberger C., Udler M.S., Mahajan A., Wessel J., Teslovich T.M., Caulkins L., Koesterer R., Barajas-Olmos F., et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature. 2019;570:71–76. doi: 10.1038/s41586-019-1231-2.
- 10. Freund M.K., Burch K.S., Shi H., Mancuso N., Kichaev G., Garske K.M., Pan D.Z., Miao Z., Mohlke K.L., Laakso M., et al. Phenotype-Specific Enrichment of Mendelian Disorder Genes near GWAS Regions across 62 Complex Traits. Am. J. Hum. Genet. 2018;103:535–552. doi: 10.1016/j.ajhg.2018.08.017.
- 11. Finucane H.K., Reshef Y.A., Anttila V., Slowikowski K., Gusev A., Byrnes A., Gazal S., Loh P.R., Lareau C., Shoresh N., et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 2018;50:621–629. doi: 10.1038/s41588-018-0081-4.
- 12. de Leeuw C.A., Mooij J.M., Heskes T., Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 2015;11:e1004219. doi: 10.1371/journal.pcbi.1004219.
- 13. Weeks E.M., Ulirsch J.C., Cheng N.Y., Trippe B.L., Fine R.S., Miao J., Patwardhan T.A., Kanai M., Nasser J., Fulco C.P., et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. medRxiv. 2020. doi: 10.1101/2020.09.08.20190561.
- 14. Fine R.S., Pers T.H., Amariuta T., Raychaudhuri S., Hirschhorn J.N. Benchmarker: An Unbiased, Association-Data-Driven Strategy to Evaluate Gene Prioritization Algorithms. Am. J. Hum. Genet. 2019;104:1025–1039. doi: 10.1016/j.ajhg.2019.03.027.
- 15. de Leeuw C.A., Neale B.M., Heskes T., Posthuma D. The statistical properties of gene-set analysis. Nat. Rev. Genet. 2016;17:353–364. doi: 10.1038/nrg.2016.29.
- 16. Zhu X., Stephens M. Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nat. Commun. 2018;9:4361. doi: 10.1038/s41467-018-06805-x.
- 17. Yao D.W., O’Connor L.J., Price A.L., Gusev A. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nat. Genet. 2020;52:626–633. doi: 10.1038/s41588-020-0625-2.
- 18. Grundberg E., Small K.S., Hedman Å.K., Nica A.C., Buil A., Keildson S., Bell J.T., Yang T.P., Meduri E., Barrett A., et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat. Genet. 2012;44:1084–1089. doi: 10.1038/ng.2394.
- 19. Zhu Z., Zhang F., Hu H., Bakshi A., Robinson M.R., Powell J.E., Montgomery G.W., Goddard M.E., Wray N.R., Visscher P.M., Yang J. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 2016;48:481–487. doi: 10.1038/ng.3538.
- 20. Gusev A., Ko A., Shi H., Bhatia G., Chung W., Penninx B.W., Jansen R., de Geus E.J., Boomsma D.I., Wright F.A., et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 2016;48:245–252. doi: 10.1038/ng.3506.
- 21. Gamazon E.R., Wheeler H.E., Shah K.P., Mozaffari S.V., Aquino-Michaels K., Carroll R.J., Eyler A.E., Denny J.C., Nicolae D.L., Cox N.J., Im H.K., GTEx Consortium. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 2015;47:1091–1098. doi: 10.1038/ng.3367.
- 22. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–1330. doi: 10.1126/science.aaz1776.
- 23. Giambartolomei C., Vukcevic D., Schadt E.E., Franke L., Hingorani A.D., Wallace C., Plagnol V. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 2014;10:e1004383. doi: 10.1371/journal.pgen.1004383.
- 24. Chun S., Casparino A., Patsopoulos N.A., Croteau-Chonka D.C., Raby B.A., De Jager P.L., Sunyaev S.R., Cotsapas C. Limited statistical evidence for shared genetic effects of eQTLs and autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet. 2017;49:600–605. doi: 10.1038/ng.3795.
- 25. Lek M., Karczewski K.J., Minikel E.V., Samocha K.E., Banks E., Fennell T., O’Donnell-Luria A.H., Ware J.S., Hill A.J., Cummings B.B., et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
- 26. Finucane H.K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.R., Anttila V., Xu H., Zang C., Farh K., et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 2015;47:1228–1235. doi: 10.1038/ng.3404.
- 27. Gazal S., Finucane H.K., Furlotte N.A., Loh P.R., Palamara P.F., Liu X., Schoech A., Bulik-Sullivan B., Neale B.M., Gusev A., Price A.L. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 2017;49:1421–1427. doi: 10.1038/ng.3954.
- 28. Kim S.S., Dai C., Hormozdiari F., van de Geijn B., Gazal S., Park Y., O’Connor L., Amariuta T., Loh P.R., Finucane H., et al. Genes with High Network Connectivity Are Enriched for Disease Heritability. Am. J. Hum. Genet. 2019;105:1302. doi: 10.1016/j.ajhg.2019.11.009.
- 29. Pierce B.L., Tong L., Chen L.S., Rahaman R., Argos M., Jasmine F., Roy S., Paul-Brutus R., Westra H.J., Franke L., et al. Mediation analysis demonstrates that trans-eQTLs are often explained by cis-mediation: a genome-wide analysis among 1,800 South Asians. PLoS Genet. 2014;10:e1004818. doi: 10.1371/journal.pgen.1004818.
- 30. Porcu E., Rüeger S., Lepik K., Santoni F.A., Reymond A., Kutalik Z., eQTLGen Consortium, BIOS Consortium. Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nat. Commun. 2019;10:3300. doi: 10.1038/s41467-019-10936-0.
- 31. Maurano M.T., Humbert R., Rynes E., Thurman R.E., Haugen E., Wang H., Reynolds A.P., Sandstrom R., Qu H., Brody J., et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794.
- 32. Forgetta V., Jiang L., Vulpescu N.A., Hogan M.S., Chen S., Morris J.A., Grinek S., Benner C., Jang D.-K., Hoang Q., et al. An Effector Index to Predict Causal Genes at GWAS Loci. bioRxiv. 2021. doi: 10.1101/2020.06.28.171561.
- 33. Kaplanis J., Samocha K.E., Wiel L., Zhang Z., Arvai K.J., Eberhardt R.Y., Gallone G., Lelieveld S.H., Martin H.C., McRae J.F., et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature. 2020;586:757–762. doi: 10.1038/s41586-020-2832-5.
- 34. Tashman K.C., Cui R., O’Connor L.J., Neale B.M., Finucane H.K. Significance testing for small annotations in stratified LD-Score regression. medRxiv. 2021. doi: 10.1101/2021.03.13.21249938.
- 35. O’Connor L.J., Schoech A.P., Hormozdiari F., Gazal S., Patterson N., Price A.L. Extreme Polygenicity of Complex Traits Is Explained by Negative Selection. Am. J. Hum. Genet. 2019;105:456–476. doi: 10.1016/j.ajhg.2019.07.003.
- 36. Boyle E.A., Li Y.I., Pritchard J.K. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell. 2017;169:1177–1186. doi: 10.1016/j.cell.2017.05.038.
- 37. Stacey D., Fauman E.B., Ziemek D., Sun B.B., Harshfield E.L., Wood A.M., Butterworth A.S., Suhre K., Paul D.S. ProGeM: a framework for the prioritization of candidate causal genes at molecular quantitative trait loci. Nucleic Acids Res. 2019;47:e3. doi: 10.1093/nar/gky837.
- 38. Backman J.D., Li A.H., Marcketta A., Sun D., Mbatchou J., Kessler M.D., Benner C., Liu D., Locke A.E., Balasubramanian S., et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature. 2021;599:628–634. doi: 10.1038/s41586-021-04103-z.
- 39. Wang Q., Dhindsa R.S., Carss K., Harper A.R., Nag A., Tachmazidou I., Vitsios D., Deevi S.V.V., Mackay A., Muthas D., et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature. 2021;597:527–532. doi: 10.1038/s41586-021-03855-y.
- 40. Rao S.S.P., Huntley M.H., Durand N.C., Stamenova E.K., Bochkov I.D., Robinson J.T., Sanborn A.L., Machol I., Omer A.D., Lander E.S., Aiden E.L. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159:1665–1680. doi: 10.1016/j.cell.2014.11.021.
- 41. Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M.J., et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
- 42. Hormozdiari F., Gazal S., van de Geijn B., Finucane H.K., Ju C.J., Loh P.R., Schoech A., Reshef Y., Liu X., O’Connor L., et al. Leveraging molecular quantitative trait loci to understand the genetic architecture of diseases and complex traits. Nat. Genet. 2018;50:1041–1047. doi: 10.1038/s41588-018-0148-2.