Summary
Functional enrichment results typically implicate tissue or cell-type-specific biological pathways in disease pathogenesis and as therapeutic targets. We propose generalized linkage disequilibrium score regression (g-LDSC), which requires only genome-wide association study (GWAS) summary-level data to estimate functional enrichment. The method adopts the same assumptions and regression model formulation as stratified linkage disequilibrium score regression (s-LDSC). Whereas s-LDSC uses LD information only partially, our method uses the whole LD matrix and accounts for possible correlated error structure via feasible generalized least-squares estimation. We demonstrate through simulation studies under various scenarios that g-LDSC provides more precise estimates of functional enrichment than s-LDSC, regardless of model misspecification. In an application to GWAS summary statistics of 15 traits from the UK Biobank, estimates of functional enrichment using g-LDSC were lower and more realistic than those obtained from s-LDSC. In addition, g-LDSC detected more significantly enriched functional annotations among 24 functional annotations for the 15 traits than s-LDSC (118 vs. 51).
Keywords: generalized LD score regression, partition heritability, functional enrichment, GWAS summary statistics, complex human traits
Xiong et al. describe a methodology for estimating functional enrichment that requires only GWAS summary-level data. They demonstrate that this method provides more precise fold enrichment estimates than the state-of-the-art method.
Introduction
Functional annotations1,2 provide valuable information for understanding and elucidating the molecular mechanisms of the effects of genetic variants. To date, the profiles of tens of thousands of functional annotations across the human genome have been documented in databases.3,4,5,6 However, extracting the most valuable information from so many functional annotations is challenging. Enrichment analysis, which determines whether genomic regions with the same functional tag have an overrepresentation of disease-associated variants, is commonly used to identify the most informative functional annotations for a disease or group of diseases.7 Such enrichment results typically implicate tissue or cell-type-specific signaling or biological pathways in disease pathogenesis and as therapeutic targets.8 Integrating functional annotations with genome-wide association study (GWAS) results may also improve causal variant identification by fine-mapping, and individual risk prediction by polygenic risk scores.9,10,11,12
Functional enrichment may be evaluated by simply testing whether the genome-wide significant SNPs from GWAS fall disproportionately in genomic regions with specific types of functional annotations.13,14,15 However, many causal SNPs with small effect size may not reach the stringent threshold for genome-wide significance.16,17 More powerful methods have been developed that use information from all interrogated variants to partition genome-wide SNP heritability16 into components attributable to various functional annotations.18,19,20,21,22
Several approaches have been proposed for estimating and partitioning genome-wide SNP heritability, including genomic relatedness restricted maximum likelihood (GREML) implemented in genome-wide complex trait analysis (GCTA) software and linkage disequilibrium score regression (LDSC).23,24 Genome-wide SNP heritability is estimated under a linear mixed model via the genetic relatedness matrix (GRM) in GCTA, and from the regression coefficient of χ2 statistics on the per-SNP LD scores via weighted least squares (WLS) estimation in LDSC. Subsequently, GREML was extended to partition SNP heritability with multiple random effects characterized by annotation-specific GRMs,18 and similarly, LDSC was extended to stratified LD score regression (s-LDSC) by multiple regression on annotation-specific LD scores.19 One disadvantage of GREML is that it requires individual-level genotype data, which are often difficult to obtain, especially for large-scale, multicenter studies.25 Although s-LDSC requires only GWAS summary-level data, it uses WLS estimation, which does not take full account of the correlations among the residuals. Since s-LDSC uses χ2 statistics of SNPs as the dependent variable, residual correlations are present for SNPs in LD, and this reduces the precision of the estimates of SNP heritability and functional enrichment.22,26,27
In recognition of these limitations, we have developed a new approach, generalized LD score regression (g-LDSC), for estimating functional enrichments. The method uses information on the relation between χ2 statistics and the squared LD matrix, and differs from s-LDSC in using feasible generalized least-squares (FGLS) estimation,28 which accounts for possible correlated error structure, instead of WLS. Like s-LDSC, it requires only summary statistics from GWAS and can be applied to large-scale datasets. We demonstrate through simulation studies under various scenarios that g-LDSC provides less biased and more precise estimates of functional enrichment than s-LDSC, regardless of model misspecification. In an application to GWAS summary statistics of 15 traits from the UK Biobank, estimates of functional enrichment using g-LDSC were lower and more realistic than those obtained from s-LDSC. In addition, g-LDSC detected more significantly enriched functional annotations for the 15 traits than s-LDSC (118 vs. 51).
Material and methods
The g-LDSC method
Let $z$ be an $M \times 1$ vector of z statistics for $M$ SNPs from a GWAS of sample size $N$ and $\beta$ be an $M \times 1$ vector of underlying effect sizes for the SNPs, where phenotypes and genotypes are standardized to zero mean and unit variance. We assume that $\beta \sim N(0, D)$, where the covariance matrix $D$ is a diagonal matrix such that $\mathrm{diag}(D) = A\tau$. Here, $\tau = (\tau_1, \dots, \tau_K)^T$ is a vector of contributions of $K$ functional annotations ($C_1, \dots, C_K$) to effect size deviations from the mean 0, and $A$ is an $M \times K$ matrix that indicates the status (0,1) of the $M$ SNPs for the $K$ annotations. Thus, the phenotypic heritability contributed by the jth SNP is given by matrix multiplication of the jth row of $A$ with the vector $\tau$; in other words,
| $\mathrm{Var}(\beta_j) = D_{jj} = \sum_{k=1}^{K} a_{jk}\tau_k$ | (Equation 1) |
where $a_{jk}$ represents the status of the jth SNP in $C_k$, and $\tau_k$ represents the effect size contribution of $C_k$. The z statistics then follow a multivariate normal distribution (for derivation, see the supplemental information):
| $z \sim N\left(0,\; N R D R + aR\right)$ | (Equation 2) |
where $R$ is the $M \times M$ LD matrix containing pairwise correlations $r_{ij}$ between any two SNPs i, j, and $a$ represents a scalar contribution to the variance of z statistics reflecting confounding biases such as population stratification. We construct a linear model relating the χ² statistic of a SNP to its expectation, given by the corresponding diagonal element of the covariance matrix in Equation 2, plus an error term:
| $\chi_j^2 = z_j^2 = N \sum_{k=1}^{K} \tau_k\, \ell(j,k) + a + \epsilon_j$ | (Equation 3) |
where $\ell(j,k) = \sum_{i=1}^{M} a_{ik} r_{ij}^2$ is known as the LD score24 of SNP j with respect to $C_k$, and $\epsilon_j$ denotes the random error term for the jth SNP. In s-LDSC, estimates of $\tau$ in Equation 3 are obtained via WLS, assuming that the error terms for different SNPs are independent of one another:
| $\hat{\tau}_{\mathrm{WLS}} = \arg\min_{\tau} \sum_{j=1}^{M} w_j \left(\chi_j^2 - N \sum_{k=1}^{K} \tau_k\, \ell(j,k) - a\right)^2$ | (Equation 4) |
where $w_j$ is a weight that accounts for nonindependence and heteroscedasticity.19 However, WLS does not fully account for nonindependence of the residuals in a linear model.
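The relation between the annotation matrix, the LD matrix, and the expected χ² statistic in Equation 3 can be illustrated on a toy example. The sketch below is not taken from the g-LDSC software; the function names, the three-SNP LD matrix `R`, the annotation matrix `A`, and the coefficient values are all hypothetical, and the LD score of SNP j for annotation k is assumed to be the sum of squared correlations between SNP j and the SNPs in that annotation.

```python
def partitioned_ld_scores(R, A):
    """Return the M x K matrix with entry [j][k] = sum_i A[i][k] * R[i][j]**2."""
    M, K = len(R), len(A[0])
    return [[sum(A[i][k] * R[i][j] ** 2 for i in range(M)) for k in range(K)]
            for j in range(M)]


def expected_chi2(ld_scores, tau, N, a=1.0):
    """Expected chi-square per SNP: N * sum_k tau_k * l(j, k) + a."""
    return [N * sum(l * t for l, t in zip(row, tau)) + a for row in ld_scores]


# Toy example (hypothetical values): SNPs 1 and 2 are in LD, SNP 3 is independent.
R = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
# Column 0 is a base annotation containing all SNPs; column 1 marks SNP 1 only.
A = [[1, 1],
     [1, 0],
     [1, 0]]

ld = partitioned_ld_scores(R, A)
chi2 = expected_chi2(ld, tau=[1e-4, 2e-4], N=10_000)
print(ld)
print(chi2)
```

In this toy setting, SNP 2 has a nonzero LD score for the second annotation only because it is correlated with SNP 1, which is how LD lets an unannotated SNP carry signal from an annotated neighbor.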
To address the issue of correlated residuals, we propose to estimate $\tau$ by GLS, which requires the covariance matrix of the error terms of the linear model to be specified. The theoretical expression of the covariance matrix $\Omega$ can be derived (see the supplemental information) as follows:
| $\Omega = \mathrm{Cov}(\epsilon) = 2\left(N R D R + aR\right)^{\circ 2}$ | (Equation 5) |
where the notation $X^{\circ 2}$ denotes the Hadamard square29 of a matrix $X$; for example, $R^{\circ 2}$ means the matrix with entries $r_{ij}^2$.
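The elementwise square and the factor of 2 in the error covariance are what one would expect from a standard moment identity for zero-mean Gaussian vectors (Isserlis' theorem). As a brief sketch, write $\Sigma$ for the covariance matrix of the z statistics in Equation 2 (the symbol $\Sigma$ is introduced here only for illustration) and note that the error term of SNP j is $z_j^2$ minus its expectation:

$$
\begin{aligned}
E\left[z_i^2 z_j^2\right] &= \Sigma_{ii}\Sigma_{jj} + 2\Sigma_{ij}^2, \\
\mathrm{Cov}\left(z_i^2, z_j^2\right) &= E\left[z_i^2 z_j^2\right] - E\left[z_i^2\right]E\left[z_j^2\right] = 2\Sigma_{ij}^2 .
\end{aligned}
$$

Under this identity, the residuals of two SNPs are correlated whenever their z statistics are, and uncorrelated when $\Sigma_{ij} = 0$.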
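The GLS estimator of $\tau$ and its sampling variance-covariance matrix are given by:
| $\hat{\tau} = \frac{1}{N}\left(L^T \tilde{\Omega}^{-1} L\right)^{-1} L^T \tilde{\Omega}^{-1} y, \qquad \mathrm{Var}(\hat{\tau}) = \frac{1}{N^2}\left(L^T \tilde{\Omega}^{-1} L\right)^{-1}$ | (Equation 6) |
where $L$ is an $M \times K$ matrix containing the partitioned LD scores $\ell(j,k)$, $y = \chi^2 - a$ is a vector of χ² statistics corrected for confounding bias $a$, and $\tilde{\Omega}$ is an approximation of $\Omega$.
The functional enrichment for $C_k$ is defined as the ratio of the proportion of heritability explained by $C_k$ to the proportion of SNPs annotated to $C_k$.19 The standard error of the enrichment estimate can be estimated using a block jackknife procedure.30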
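To make the GLS step concrete, the following is a minimal pure-Python sketch on a three-SNP toy problem. It is not the g-LDSC implementation: the helper names (`solve`, `error_covariance`, `gls`) and all numeric inputs are hypothetical, the error covariance is assumed to be twice the elementwise square of the z-statistic covariance, and the regression model is assumed to be y = N L τ + error, where y is the χ² vector minus the intercept.

```python
def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [float(v) for v in B[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def error_covariance(R, d, N, a=1.0):
    """2 * (N R D R + a R), squared elementwise, with D = diag(d)."""
    M = len(R)
    sigma = [[N * sum(R[i][m] * d[m] * R[m][j] for m in range(M)) + a * R[i][j]
              for j in range(M)] for i in range(M)]
    return [[2.0 * s * s for s in row] for row in sigma]


def gls(L, y, omega, N):
    """GLS estimate of tau and its covariance for the model y = N L tau + error."""
    Lt = transpose(L)
    info = matmul(Lt, solve(omega, L))                 # L' Omega^-1 L
    rhs = matmul(Lt, solve(omega, [[v] for v in y]))   # L' Omega^-1 y
    tau = [row[0] / N for row in solve(info, rhs)]
    K = len(info)
    identity = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    cov = [[v / N ** 2 for v in row] for row in solve(info, identity)]
    return tau, cov


# Toy inputs (hypothetical): 3 SNPs, 2 annotations, noise-free chi-square values.
N = 10_000
R = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
L = [[1.25, 1.0], [1.25, 0.25], [1.0, 0.0]]    # partitioned LD scores
true_tau = [1e-4, 2e-4]
d = [3e-4, 1e-4, 1e-4]                          # per-SNP effect-size variances
y = [N * sum(l * t for l, t in zip(row, true_tau)) for row in L]  # chi2 minus intercept
omega = error_covariance(R, d, N)
tau_hat, cov_hat = gls(L, y, omega, N)
print(tau_hat)
```

Because the toy χ² values contain no noise, the estimator returns the coefficients used to generate them; with real data the off-diagonal entries of the weight matrix are what distinguish this estimator from WLS.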
Approximation of covariance matrix
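From Equation 5, we note that the covariance matrix $\Omega$ is a matrix containing unknown parameters $\tau$ and $a$. To simplify the minimization of the squared Mahalanobis distance of residuals28 in the presence of these parameters in GLS, we approximate $\Omega$ by a constant matrix:
| $\tilde{\Omega} = 2\left(N R \tilde{D} R + \tilde{a} R\right)^{\circ 2}$ | (Equation 7) |
Following the framework of s-LDSC, we set $\tilde{a}$ to unity and all $\tilde{D}_{jj}$ to be $(\overline{\chi^2} - 1)/(N\bar{\ell})$, where $\overline{\chi^2}$ and $\bar{\ell}$ denote the means of $\chi_j^2$ and the total LD score $\ell_j = \sum_i r_{ij}^2$ over all regression SNPs j, respectively.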
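A small sketch may help show how such a constant approximation can be built from observable quantities only. The code below is an illustration under stated assumptions, not the authors' code: it assumes every SNP is given the same effect-size variance, equal to (mean χ² minus 1) divided by (sample size times mean total LD score), that the intercept is fixed at 1, and that the error covariance is twice the elementwise square of the resulting z-statistic covariance. Function names and inputs are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def average_per_snp_variance(chi2, total_ld_scores, N):
    """(mean chi2 - 1) / (N * mean LD score): the equal-contribution plug-in value."""
    return (mean(chi2) - 1.0) / (N * mean(total_ld_scores))


def approximate_omega(R, chi2, N):
    """Constant approximation: 2 * (N * tau_bar * R R + R), squared elementwise."""
    M = len(R)
    total_ld = [sum(R[i][j] ** 2 for i in range(M)) for j in range(M)]
    tau_bar = average_per_snp_variance(chi2, total_ld, N)
    sigma = [[N * tau_bar * sum(R[i][m] * R[m][j] for m in range(M)) + R[i][j]
              for j in range(M)] for i in range(M)]
    return tau_bar, [[2.0 * s * s for s in row] for row in sigma]


# Toy inputs (hypothetical): 3 SNPs, the first two in LD.
R = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
chi2 = [4.25, 2.75, 2.0]
tau_bar, omega_tilde = approximate_omega(R, chi2, N=10_000)
print(tau_bar)
print(omega_tilde)
```

The resulting matrix depends only on the LD matrix, the sample size, and the mean χ² statistic, so it can be computed once per LD block before any regression coefficients are estimated.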
Confounding bias
Our model can accommodate confounding bias from population stratification by adding a constant intercept term $a$. This is implemented by adding $a$ to the vector of regression coefficients $\tau$ and a corresponding column of ones to the matrix of partitioned LD scores $L$.
The confounding bias parameter $a$ could then be estimated along with the other regression coefficients by
| $(\hat{\tau}^T, \hat{a})^T = \left(X^T \tilde{\Omega}^{-1} X\right)^{-1} X^T \tilde{\Omega}^{-1} \chi^2, \qquad X = [\,N L,\ \mathbf{1}_M\,]$ | (Equation 8) |
A larger $a$ indicates greater bias, and $a = 1$ indicates no bias.
Significance testing
Because $\tau_k$ is not directly interpretable in Equation 5, we transform it into the ratio of the proportion of heritability explained by $C_k$ to the proportion of SNPs annotated to $C_k$, which gives a quantitative measure of functional enrichment (Equation 9), as follows:
| $\hat{E}_k = \dfrac{\hat{h}_k^2 / \hat{h}^2}{M_k / M}$ | (Equation 9) |
where $\hat{h}_k^2$ is the estimated partitioned heritability of $C_k$, $M_k$ is the number of SNPs in $C_k$, $M$ is the total number of SNPs, and $\hat{h}^2$ is the estimated SNP heritability.
Since large values of $\hat{E}_k$ indicate greater functional enrichment in $C_k$ and $\hat{E}_k = 1$ indicates no enrichment, we used the test statistic $W$ of Finucane et al.19 to determine whether there is significant enrichment in $C_k$, using the hypotheses H0: W = 0 versus H1: W > 0, which is equivalent to testing H0: $E_k$ = 1 versus H1: $E_k$ > 1.
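The enrichment ratio can be worked through on a toy example. The sketch below assumes, following the s-LDSC convention, that the heritability attributed to an annotation is the sum of per-SNP effect-size variances over the SNPs in that annotation, and that each per-SNP variance is the sum of the coefficients of the annotations the SNP belongs to. The function name and inputs are hypothetical and this is not the g-LDSC code.

```python
def enrichment(A, tau):
    """Fold enrichment per annotation: (h2_k / h2) / (M_k / M)."""
    M, K = len(A), len(tau)
    # Per-SNP effect-size variance implied by the annotation coefficients.
    per_snp = [sum(a * t for a, t in zip(row, tau)) for row in A]
    h2 = sum(per_snp)
    result = []
    for k in range(K):
        h2_k = sum(v for row, v in zip(A, per_snp) if row[k] == 1)
        M_k = sum(row[k] for row in A)
        result.append((h2_k / h2) / (M_k / M))
    return result


# Toy inputs (hypothetical): a base annotation (all SNPs) and one annotation
# that contains only the first of three SNPs.
A = [[1, 1],
     [1, 0],
     [1, 0]]
fold = enrichment(A, tau=[1e-4, 2e-4])
print(fold)
```

Here the second annotation holds one-third of the SNPs but three-fifths of the heritability, giving a fold enrichment of about 1.8, while the base annotation has a fold enrichment of 1 by construction.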
Functional annotations and associated LD scores
Following the baseline model in Finucane et al.,19 we considered 52 functional annotations along with a base annotation containing all SNPs. These annotations comprised 24 main annotations that are not specific to any cell type, such as coding, UTR, promoter, and intronic regions.18,31 In addition, the 500-bp window around each annotation and the 100-bp window around chromatin immunoprecipitation sequencing peaks were also included as functional annotations when appropriate. Partitioned LD scores were calculated using the LDSC software.24
Simulation settings
Summary statistics from GWAS were generated directly32 with varying genetic architecture parameters, including total SNP heritability ($h^2$), functional enrichment of a specific annotation ($E$), the proportion of causal SNPs ($\pi$), and sample size ($N$). The joint effect sizes were simulated from a spike-and-slab distribution: $\beta_j \sim \pi\, N(0, \sigma_j^2/\pi) + (1 - \pi)\,\delta_0$, where $\delta_0$ is the point mass at zero, and $\sigma_j^2 = \mathrm{Var}(\beta_j)$ is the variance of SNP j under the polygenic model as defined in Equation 1. We performed 50 simulation replicates for each parameter configuration. The proportion of causal SNPs varied from 1% to 100%. The functional enrichment for a specified annotation (DNase I hypersensitivity sites, DHS) was set from 0 to 3, for which the corresponding values of $\tau$ were calculated (see supplemental information). The total SNP heritability was set at 0.04, 0.2, 0.4, or 0.6, and the sample size was set at 2,500, 5,000, 10,000, 20,000, or 50,000.
The overall process of generating z statistics given $\beta$ for GWAS summary statistics is shown in Equation 10.
| $z \mid \beta \sim N\left(\sqrt{N}\, R \beta,\; R\right)$ | (Equation 10) |
To select SNPs for this simulation, we used the intersection of the SNP sets from the 1000 Genomes Project33 and the HapMap 334 dataset. After filtering out SNPs with minor allele frequency less than 5%, 54,282 SNPs from chromosome 10 were retained. We downloaded the block-wise LD matrices for 489 Europeans from the 1000 Genomes Project from the PRS-CSx website.35
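Generating summary statistics directly from an LD matrix, without individual-level genotypes, can be sketched as follows. This is an illustration under assumptions rather than the simulation code used in the study: the slab variance is assumed to be the per-SNP variance divided by the causal proportion so that expected heritability does not depend on polygenicity, and the z statistics are assumed to equal the square root of the sample size times the LD matrix times the effect vector, plus Gaussian noise whose covariance is the LD matrix. All names and inputs are hypothetical.

```python
import math
import random


def cholesky(R):
    """Lower-triangular C with C C^T = R (R must be positive definite)."""
    n = len(R)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(C[i][m] * C[j][m] for m in range(j))
            if i == j:
                C[i][j] = math.sqrt(R[i][i] - s)
            else:
                C[i][j] = (R[i][j] - s) / C[j][j]
    return C


def simulate_z(R, sigma2, pi, N, rng):
    """Draw spike-and-slab effects, then z = sqrt(N) R beta + e with e ~ N(0, R)."""
    M = len(R)
    # A SNP is causal with probability pi; non-causal SNPs get an exact zero.
    beta = [rng.gauss(0.0, math.sqrt(s2 / pi)) if rng.random() < pi else 0.0
            for s2 in sigma2]
    C = cholesky(R)
    g = [rng.gauss(0.0, 1.0) for _ in range(M)]
    e = [sum(C[i][m] * g[m] for m in range(i + 1)) for i in range(M)]  # e = C g
    z = [math.sqrt(N) * sum(R[i][m] * beta[m] for m in range(M)) + e[i]
         for i in range(M)]
    return beta, z


# Toy inputs (hypothetical): 3 SNPs, the first two in LD.
R = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
sigma2 = [3e-4, 1e-4, 1e-4]
beta, z = simulate_z(R, sigma2, pi=0.5, N=10_000, rng=random.Random(2024))
print(beta)
print(z)
```

In a block-wise setting, the same routine would be applied to each LD block separately, which keeps the Cholesky factorization small.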
Enrichment analysis simulation study
Based on the simulated summary statistics on 54,282 SNPs, we compared the performance of the g- and s-LDSC approaches under varying genetic architectures, including total SNP heritability, functional enrichment of DHS, proportion of causal SNPs, and GWAS sample sizes. The simulation was repeated to generate estimates of DHS enrichment, the empirical SE and root-mean-square error (RMSE) of DHS enrichment estimates, enrichment p values, and estimates for SNP heritability and confounding bias.
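The two summary measures used to compare the methods can be computed from replicate estimates as in the sketch below. It assumes that the empirical SE is the sample standard deviation of the estimates across replicates and that the RMSE is taken around the true enrichment; the replicate values shown are made up for illustration.

```python
import math
import statistics


def rmse(estimates, truth):
    """Root-mean-square error of replicate estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def empirical_se(estimates):
    """Sample standard deviation of the estimates across replicates."""
    return statistics.stdev(estimates)


# Hypothetical enrichment estimates from four replicates with true enrichment 3.
estimates = [2.5, 3.5, 3.0, 3.0]
error = rmse(estimates, truth=3.0)
spread = empirical_se(estimates)
print(error, spread)
```

The RMSE reflects both bias and variability, whereas the empirical SE reflects variability only, so a method can have a small SE and still a large RMSE if its estimates are biased.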
Model misspecification simulation study
In this simulation, we compared g- and s-LDSC in terms of enrichment estimation in the presence of model misspecification, which refers to a scenario in which not all true causal annotations are included in the model. We considered the situation in which DHS is the only causal annotation wherein 80% of the causal SNPs lie, with $\tau$ specified to yield true DHS enrichments of , and . Data generated under this scenario were analyzed by two correctly specified models: (1) a 2-annotation (DHS and non-DHS) model and (2) a 52-annotation (baseline) model including DHS. To investigate the effect of misspecification, we considered scenarios in which DHS is not a causal annotation and in which the only causal annotation is located at (1) 500-bp windows around DHS, yielding a true DHS enrichment or (2) Enhancer_Anderson, which overlaps partially with DHS, yielding a true or DHS enrichment. Data generated under these scenarios were analyzed by two incorrectly specified models: (1) a 2-annotation (DHS and non-DHS) model and (2) a 50-annotation model (excluding Enhancer_Anderson and its extended regions from the baseline model).
Reference sample versus true LD matrix simulation study
In the above simulations, the LD matrix from the 1000 Genomes Project was used to simulate the genotype data as well as to calculate LD scores and the estimated residual covariance matrix for the g- and s-LDSC analyses. Thus, the LD matrix from the 1000 Genomes Project matches perfectly with the LD pattern in the simulated samples and can be considered to be a true LD matrix. To evaluate the performance of g- and s-LDSC using an external reference sample LD matrix, we used LD matrices generated either from (1) a sample of 361,194 individuals of White British ancestry from the UK Biobank, assuming the same LD blocks as the 1000 Genomes Project LD matrix, or (2) a sample of 7,507 individuals of African ancestry from UK Biobank, assuming different LD blocks, and used these to perform g- and s-LDSC analyses on GWAS data generated under the 1000 Genomes Project LD matrix.
Performance of $\tilde{\Omega}$ versus $\Omega$ simulation study
The GLS estimator contains the error covariance matrix $\Omega$, which contains the parameters $\tau$ to be estimated. Since these parameters are unknown, we used an approximation $\tilde{\Omega}$, which assumes the contributions ($\tau$) of all annotations to be equal. We performed simulations to assess the accuracy of enrichment estimates obtained using $\tilde{\Omega}$ compared to those obtained using $\Omega$. Unless otherwise specified, all of the simulation studies used $\tilde{\Omega}$ in g-LDSC estimation to reflect the fact that the true $\Omega$ is not known in real applications.
Application to the UK Biobank GWAS data
We applied g-LDSC and s-LDSC to 15 sets of real summary statistics obtained from the Neale lab.36 Fifteen traits were selected: age at menarche, anorexia nervosa, bipolar disorder, body mass index (BMI), coronary atherosclerosis, Crohn disease, high-density lipoprotein level, low-density lipoprotein level, rheumatoid arthritis, schizophrenia/delusion, standing height, triglycerides, type 2 diabetes, ulcerative colitis, and year-ended full-time education.
All GWAS analyses were performed on a sample of 361,194 individuals of White British ancestry from the UK Biobank, with covariates that included 20 genetic principal components, age, age², sex, age × sex, and age² × sex.
We applied g- and s-LDSC to summary statistics on the 15 traits to estimate heritability and functional enrichment under the 52-annotation baseline model. In addition to evaluating the mean and variance of the enrichment of the 24 main annotations across multiple traits, we conducted a meta-analysis of the enrichment analysis results of 5 traits with low phenotypic correlation (age at menarche, BMI, coronary atherosclerosis, height, and ulcerative colitis), to more precisely estimate the enrichment associated with the various functional annotations across multiple traits.
Computational efficiency
We recorded the central processing unit (CPU) times (in minutes) taken by g- and s-LDSC for the analyses of the 15 traits. The analyses were performed on an Intel Xeon E5-2650 processor, 2.20 GHz, and 12 cores with 120 GB of memory. The g-LDSC program was run with a parallelized code on a four-core cluster, and s-LDSC was run using the default setting of the s-LDSC software.
Results
Enrichment analysis simulation study
In simulations, g-LDSC achieved higher precision than s-LDSC in estimating fold enrichment across different values of true fold enrichment and sample size when polygenicity was 5% (Figure 1A; Table S1). Both methods had lower accuracy (higher RMSE) under a lower level of polygenicity (1%), but g-LDSC continued to produce more precise estimates than s-LDSC (Tables 1 and S2; Figures S4, S5, and S13).
Figure 1.
Evaluation of functional enrichment estimates and performance of hypothesis tests between s-LDSC and g-LDSC through simulations
(A) Functional enrichment estimates at different GWAS sample sizes. Dots represent point estimates and error bars represent 95% confidence intervals. Dashed lines indicate where the estimated value matches the true value.
(B) The percentage of replicates in which the null hypothesis was rejected at different fold enrichments. g-LDSC and s-LDSC rejected 0.70% and 2.00% of 150 replicates, respectively, under the 1-fold enrichment scenario. Polygenicity is set at 5% throughout.
Table 1.
RMSE and SE of enrichment estimates with g-LDSC and s-LDSC in simulations under 20,000 GWAS sample size
| True fold enrichment | Polygenicity, % | g-LDSC RMSE | g-LDSC SE | s-LDSC RMSE | s-LDSC SE |
|---|---|---|---|---|---|
| 1 | 1 | 0.27 | 0.27 | 0.54 | 0.44 |
| 1 | 5 | 0.27 | 0.24 | 0.43 | 0.38 |
| 1 | 100 | 0.25 | 0.20 | 0.27 | 0.26 |
| 2 | 1 | 0.34 | 0.32 | 0.57 | 0.47 |
| 2 | 5 | 0.25 | 0.24 | 0.42 | 0.41 |
| 2 | 100 | 0.23 | 0.19 | 0.26 | 0.26 |
| 3 | 1 | 0.61 | 0.50 | 0.72 | 0.68 |
| 3 | 5 | 0.25 | 0.26 | 0.49 | 0.47 |
| 3 | 100 | 0.23 | 0.19 | 0.31 | 0.26 |
At a GWAS sample size of 20,000, g-LDSC rejected the null hypothesis of no enrichment at a rate of 0.7%, below the specified type I error rate of 5%, when no enrichment existed (fold enrichment ≤1). In comparison, s-LDSC rejected the null hypothesis at a somewhat higher rate of 2.0% when no enrichment existed, which is also below the specified rate (Figure 1B). When enrichment existed (fold enrichment >1), the statistical power of both tests increased with increasing fold enrichment, with g-LDSC achieving greater power than s-LDSC when the true fold enrichment was ≥1.4. The difference in power was particularly large (90% vs. 48%) when the true fold enrichment was 2.
Model misspecification simulation study
In the absence of model misspecification (where DHS is the only causal annotation in the simulation and is included in the analysis model), the estimates from both g- and s-LDSC were unbiased, although g-LDSC estimates were more precise (Figure 2A). This was true whether the analysis model contained 2 annotations or 52 annotations.
Figure 2.
Simulation results for model misspecification
Comparison of DHS enrichment estimates between g- and s-LDSC using the 2-annotation (DHS and non-DHS) model and the 52-annotation model. (A) represents the scenario in which the true causal annotation was correctly specified in the model. The 50-annotation model in (B) is the 52-annotation model excluding Enhancer_Anderson and its extended regions. Dashed lines indicate where the estimated value matches the true value. GWAS sample size is set at 50,000 throughout.
In the presence of model misspecification (where DHS is not the causal annotation and the true causal annotation was not included in the analysis model), the estimates of DHS enrichment from s-LDSC were severely biased, except when the true fold DHS enrichment is 0 (no DHS SNP was causal) under the 50-annotation model, whereas the estimates from g-LDSC were unbiased under the 50-annotation model for all assigned DHS enrichments, and only moderately biased under the 2-annotation model (Figure 2B).
Reference sample versus true LD matrix simulation study
Using the reference sample LD matrix calculated from White British UK Biobank subjects did not lead to biased enrichment estimates in either g- or s-LDSC, but slightly reduced the precision of the estimates from g-LDSC when the true enrichment was 2 or 3 compared to estimates produced using true LD matrix. Using the reference sample LD matrix calculated from African-ancestry UK Biobank subjects led to serious downward bias in enrichment estimates, when the true enrichment was 2 or 3, for both g- and s-LDSC, without reducing the precision of the estimates (Figures 3 and S12).
Figure 3.
Simulation results for reference LD versus true LD matrix
Comparison of DHS enrichment estimates between g- and s-LDSC using LD scores and $\tilde{\Omega}$ calculated from the true 1000 Genomes Project European, reference UK Biobank European, or reference UK Biobank African LD matrix. Dashed lines represent the true value at 1×, 2×, and 3× fold enrichment. Diamonds indicate means in boxplots. The colors of the boxes differentiate the estimation methods with different LD matrices. SNP heritability is set at 0.4. Polygenicity is set at 5%. GWAS sample size is set at 50,000.
The performance of $\tilde{\Omega}$ versus $\Omega$ simulation study
When polygenicity in the simulation studies is high, both the RMSE and SE of enrichment estimates are quite stable, being only slightly larger when estimated assuming $\tilde{\Omega}$ than assuming $\Omega$ (Table 2). When polygenicity is low, the RMSE and SE of the enrichment estimates obtained under $\tilde{\Omega}$ become more markedly greater than those obtained under $\Omega$, but the difference is nevertheless not very large (69% larger for RMSE and 29% larger for SE when polygenicity is 1%).
Table 2.
RMSE and SE of DHS enrichment estimates under $\Omega$ and $\tilde{\Omega}$ with g-LDSC at different polygenicity levels from 50 replicates of simulated GWAS data on 10,000 subjects
| Polygenicity, % | RMSE ($\Omega$) | SE ($\Omega$) | RMSE ($\tilde{\Omega}$) | SE ($\tilde{\Omega}$) |
|---|---|---|---|---|
| 1 | 0.49 | 0.33 | 0.83 | 0.43 |
| 5 | 0.44 | 0.22 | 0.59 | 0.33 |
| 20 | 0.33 | 0.23 | 0.36 | 0.32 |
| 40 | 0.29 | 0.20 | 0.28 | 0.28 |
| 60 | 0.24 | 0.20 | 0.25 | 0.25 |
| 80 | 0.26 | 0.22 | 0.29 | 0.30 |
| 100 | 0.28 | 0.22 | 0.29 | 0.28 |
True enrichment set at 3-fold.
The UK Biobank functional enrichment analysis
In the enrichment analysis of 15 traits on 24 functional annotations (360 trait-annotation pairs) in the UK Biobank GWAS data, g-LDSC identified 118 significant functional enrichments at the 5% significance level after applying a Bonferroni correction, whereas s-LDSC detected 51 such significant enrichments (Figure 4; Table S5). The two methods produced SNP heritability estimates that were highly correlated across the 15 traits (Figure S9), and SNP heritability estimates obtained from g-LDSC were higher than those obtained from s-LDSC (Figure S9), which was consistent with the results obtained in simulations (Figures S1–S3). The two methods produced similar estimates for the inflation term $a$: mean = 1.027, SD = 0.074 for g-LDSC and mean = 1.040, SD = 0.049 for s-LDSC (Table S3; Figures S6–S8).
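For orientation, the per-test threshold implied by a Bonferroni correction can be worked out as below. This is only a sketch: it assumes the correction is applied over all 360 trait-annotation pairs, which the text does not state explicitly, and the variable names are illustrative.

```python
import math

n_traits = 15
n_annotations = 24
n_tests = n_traits * n_annotations     # 360 trait-annotation pairs
alpha = 0.05

threshold = alpha / n_tests            # per-test p value threshold
neg_log10_threshold = -math.log10(threshold)
print(n_tests, threshold, neg_log10_threshold)
```

Under this assumption, a trait-annotation pair would be declared significant when its enrichment p value falls below roughly 1.4e−4, or about 3.86 on the −log10 scale.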
Figure 4.
Trait-specific enriched annotations identified by g-LDSC and s-LDSC among 15 traits in UK Biobank
Comparison of p values ($-\log_{10} p$) for each trait-specific annotation. Point colors indicate enriched annotations identified by the different methods at the 5% significance level after applying a Bonferroni correction: black (neither), red (g-LDSC), blue (s-LDSC), and purple (both). Dashed lines depict the significance thresholds for s-LDSC (blue) and g-LDSC (red). All p values less than 1e−15 are shown as 1e−15.
In the meta-analysis of the enrichment analysis of the five nearly independent traits, g-LDSC yielded lower (and more reasonable) estimates with smaller SEs compared to s-LDSC (Figure 5). Both methods identified coding regions, conserved regions, and H3K9ac as enriched annotations. However, only g-LDSC identified H3K4me1, transcription start site (TSS), 3′ UTR, and weak enhancers as enriched annotations. Among the enriched annotations reported only by g-LDSC, H3K4me1 and TSS were reported to show enrichment for 25-hydroxyvitamin D levels,37 3′ UTR was reported to show enrichment for schizophrenia,38 and weak enhancers were reported to show enrichment for depression.39 However, s-LDSC but not g-LDSC identified H3K27ac and H3K4me3 as enriched annotations (Figure 6).
Figure 6.
Enrichment estimates with g-LDSC and s-LDSC for the 24 main functional categories in meta-analysis of 5 independent traits
Error bars represent meta-SEs around the estimates of enrichment. Dashed line depicts the critical value of enrichment. “∗” indicates significance at 5% significance level after applying a Bonferroni correction.
Figure 5.
Enrichment analysis for selected annotations and traits for g-LDSC and s-LDSC
Error bars represent jackknife SE around the estimates of enrichment. Dashed line depicts the critical value of enrichment.
Computational efficiency
The average computation time across the 15 traits in our analysis of UK Biobank data was ∼20 min for g-LDSC and ∼1 min for s-LDSC. The g-LDSC approach required additional running time for analysis because it introduced additional matrix operations. When the number of annotations used in g-LDSC was much smaller than the number of SNPs in an LD block (), the computational complexity in this LD block could be approximated by (Figure S11). Thus, the total computational complexity of completing an enrichment analysis in g-LDSC is approximately , where is the total number of LD blocks. Other factors influencing the computational complexity include the number of eigenvectors used in matrix inversion (see supplemental information) and the number of functional annotations contained in the model.
Discussion
We have developed a GWAS summary statistics-based method, g-LDSC, to estimate functional enrichment with increased accuracy and precision. In this study, we performed a systematic comparison of g-LDSC with s-LDSC, a state-of-the-art method for functional enrichment estimation. Through extensive simulations, g-LDSC produced enrichment estimates with lower RMSEs and smaller SEs, and achieved greater power to detect enrichment when it is present, than s-LDSC. Furthermore, in an application to UK Biobank GWAS data, g-LDSC detected more significantly enriched functional annotations for the 15 traits than s-LDSC (118 vs. 51). In addition, estimates of functional enrichment using g-LDSC were generally lower and more realistic (maximum 4-fold) than those obtained from s-LDSC (maximum 15-fold).
The g-LDSC method differs from s-LDSC in that it uses FGLS instead of WLS to estimate functional enrichment. In addition to allowing for differences in residual variance among observations, FGLS also allows for residual correlations between observations. For this reason, g-LDSC produces more precise estimates for SNP heritability as well as functional enrichment, than s-LDSC. However, unlike other recently developed methods for estimating SNP heritability from GWAS summary statistics, including high-definition likelihood27 and LD eigenvalue regression,26 g-LDSC models only the squares of the z statistics but not their cross-products. For this reason, g-LDSC is expected to produce less precise SNP heritability estimates than these other methods that model both the cross-products and the squares of z statistics. However, the extension of the methods that model the cross-products of z statistics to estimate functional enrichment is not straightforward. Thus, g-LDSC may be the best currently available method for functional enrichment estimation based on GWAS summary statistics.
In practice, when estimating annotation enrichment using GWAS summary-level data, the true LD matrix is generally unavailable, and it is necessary to use an LD matrix from a reference panel that is not a perfect match of the true LD structure of the GWAS sample. Our simulations showed that using a mismatched LD matrix can lead to biased or less precise estimates. However, the bias and loss of precision caused by LD mismatch could be reduced by choosing a reference panel that has very similar ancestry to the GWAS sample.
In our studies we estimated $\tilde{\Omega}$ assuming no confounding bias exists. However, previous work has shown that fine-scale structure in ethnically homogeneous populations may have an impact on GWAS, especially in large samples.40 Like s-LDSC, g-LDSC also includes an intercept term $a$ (shown in Equation 8) to remove confounding bias. However, confounding bias also has an influence on the residual covariance matrix $\Omega$, and ignoring this (by setting $a$ to 1 when the true $a$ is actually greater than 1) reduces the precision of g-LDSC enrichment estimates (Table S4). To address this issue, we can apply g-LDSC iteratively: $a$ is set to 1 in the first iteration, and the intercept estimated in that iteration is then used as the input value for the second iteration to obtain final enrichment estimates. In simulations we found that this two-stage procedure significantly improved the precision of enrichment estimates in samples with substantial confounding bias (Table S4).
The selection of annotations is crucial for obtaining accurate g-LDSC enrichment estimates. The LDSC framework assumes that the expected heritability of SNPs differs depending on certain “causal” annotations. When an annotation has an effect on SNP heritability but is not included in the model, the model is misspecified, leading to biased enrichment estimates for annotations that are correlated with the omitted annotation. To reduce bias caused by this type of model misspecification, one solution is to jointly estimate the enrichments for a large number of annotations, which means that a causal annotation is less likely to be omitted, and that even if it is omitted, the bias introduced by its omission is likely to be reduced because of its correlations with annotations included in the model. Indeed, with the 50-annotation model, the estimates from g-LDSC were unbiased in the presence of model misspecification. Therefore, in practice, we recommend that g-LDSC users jointly consider a comprehensive range of annotations (e.g., the 52-annotation model) in each analysis.
There are some limitations to our method. First, g-LDSC uses an approximate conditional covariance matrix $\tilde{\Omega}$ that produces estimates with inflated RMSE and SE when polygenicity is low (Table 2). This may be because $\tilde{\Omega}$ assumes an underlying polygenic model with equal contributions ($\tau$) from all of the annotations. In principle, an iterative procedure can be adopted, in which the current $\tilde{\Omega}$ is used in FGLS to revise $\hat{\tau}$, which is in turn used to update $\tilde{\Omega}$ until convergence. However, we abandoned this approach because we found that the iterative procedure often produces $\tilde{\Omega}$ matrices that are not positive definite and therefore not invertible. If there is a way of obtaining a more accurate $\tilde{\Omega}$, for example, by ensuring that the iterative procedure always produces an updated $\tilde{\Omega}$ that is positive definite, then this should improve the performance of g-LDSC. Second, g-LDSC required approximately 20-fold additional CPU time to complete the analysis for a whole-genome sequencing GWAS compared to s-LDSC, the most time-consuming operation being the inversion of the error covariance matrix for each LD block. One possible way to reduce the computational cost of g-LDSC is to adopt sparse LD precision matrices.41 Third, like s-LDSC, the current version of g-LDSC can perform enrichment analysis only for binary functional annotations. Further adaptation of g-LDSC is necessary for it to be applicable to continuous functional annotations such as protein functional scores,42,43 evolutionary conservation scores,44,45 and epigenetic measures.1,46
The g-LDSC method is a substantial advance that enables more accurate estimation of functional enrichment for complex diseases. It has the potential to yield biological insights, especially when applied to fine-grained, high-resolution functional annotations such as cell-type-specific, developmental-stage-specific, or environment-specific annotations. Finally, the g-LDSC framework can be extended to jointly model multiple complex traits to partition the genetic correlations by functional annotation.
Data and code availability
g-LDSC software (R package) and analysis scripts are available at https://github.com/xzw20046/gldsc.
Web resources
1000 Genomes Project, http://www.1000genomes.org
LDSC, https://github.com/bulik/ldsc
PRScsx, https://github.com/getian107/PRScsx
PLINK, https://www.cog-genomics.org/plink/
UK Biobank, https://www.ukbiobank.ac.uk
Neale lab UK Biobank GWAS summary statistics, http://www.nealelab.is/uk-biobank/
Acknowledgments
This research has been conducted using data from the UK Biobank Resource, accessed under application no. 28732. The computations were performed using research computing facilities offered by Information Technology Services, The University of Hong Kong, and the High-Performance Computing Facility of Bioinformatics Core, Centre for PanorOmic Sciences, LKS Faculty of Medicine, The University of Hong Kong. This work was supported by Hong Kong Research Grants Council Collaborative Research Grant C7044-19G, Hong Kong Innovation and Technology Bureau funding for the State Key Laboratory of Brain and Cognitive Sciences, and the National Natural Science Foundation of China (32170637).
Author contributions
Z.X. conceived the idea, analyzed the data, performed the analyses, provided software, and drafted the manuscript. T.-T.Q. conceived the idea and drafted the manuscript. Y.D.Z. conceived the idea and drafted the manuscript. P.C.S. conceived the idea and drafted the manuscript. All of the authors provided input and revisions for the final manuscript.
Declaration of interests
The authors declare no competing interests.
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xhgg.2024.100272.
Contributor Information
Yan Dora Zhang, Email: doraz@hku.hk.
Pak Chung Sham, Email: pcsham@hku.hk.
References
- 1. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
- 2. Roadmap Epigenomics Consortium, Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
- 3. Kanehisa M., Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
- 4. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. Gene ontology: tool for the unification of biology. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556.
- 5. O'Leary N.A., Wright M.W., Brister J.R., Ciufo S., Haddad D., McVeigh R., Rajput B., Robbertse B., Smith-White B., Ako-Adjei D., et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733–D745. doi: 10.1093/nar/gkv1189.
- 6. Frankish A., Diekhans M., Jungreis I., Lagarde J., Loveland J.E., Mudge J.M., Sisu C., Wright J.C., Armstrong J., Barnes I., et al. GENCODE 2021. Nucleic Acids Res. 2021;49:D916–D923. doi: 10.1093/nar/gkaa1087.
- 7. Yao H. Functional Annotation, Prioritization and Enrichment Analysis of Human Regulatory Variants. The University of Hong Kong, Pokfulam; 2021.
- 8. Wijesooriya K., Jadaan S.A., Perera K.L., Kaur T., Ziemann M. Urgent need for consistent standards in functional enrichment analysis. PLoS Comput. Biol. 2022;18:e1009935. doi: 10.1371/journal.pcbi.1009935.
- 9. Hirschhorn J.N., Daly M.J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 2005;6:95–108. doi: 10.1038/nrg1521.
- 10. Trynka G., Westra H.J., Slowikowski K., Hu X., Xu H., Stranger B.E., Klein R.J., Han B., Raychaudhuri S. Disentangling the effects of colocalizing genomic annotations to functionally prioritize non-coding variants within complex-trait loci. Am. J. Hum. Genet. 2015;97:139–152. doi: 10.1016/j.ajhg.2015.05.016.
- 11. Kichaev G., Pasaniuc B. Leveraging functional-annotation data in trans-ethnic fine-mapping studies. Am. J. Hum. Genet. 2015;97:260–271. doi: 10.1016/j.ajhg.2015.06.007.
- 12. Márquez-Luna C., Gazal S., Loh P.R., Kim S.S., Furlotte N., Auton A., 23andMe Research Team, Price A.L. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 2021;12:6052. doi: 10.1038/s41467-021-25171-9.
- 13. Pickrell J.K. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am. J. Hum. Genet. 2014;94:559–573. doi: 10.1016/j.ajhg.2014.03.004.
- 14. Maurano M.T., Humbert R., Rynes E., Thurman R.E., Haugen E., Wang H., Reynolds A.P., Sandstrom R., Qu H., Brody J., et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794.
- 15. Huang D., Wang Z., Zhou Y., Liang Q., Sham P.C., Yao H., Li M.J. vSampler: fast and annotation-based matched variant sampling tool. Bioinformatics. 2021;37:1915–1917. doi: 10.1093/bioinformatics/btaa883.
- 16. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W., et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608.
- 17. Stahl E.A., Wegmann D., Trynka G., Gutierrez-Achury J., Do R., Voight B.F., Kraft P., Chen R., Kallberg H.J., Kurreeman F.A.S., et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat. Genet. 2012;44:483–489. doi: 10.1038/ng.2232.
- 18. Gusev A., Lee S.H., Trynka G., Finucane H., Vilhjálmsson B.J., Xu H., Zang C., Ripke S., Bulik-Sullivan B., Stahl E., et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 2014;95:535–552. doi: 10.1016/j.ajhg.2014.10.004.
- 19. Finucane H.K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.R., Anttila V., Xu H., Zang C., Farh K., et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 2015;47:1228–1235. doi: 10.1038/ng.3404.
- 20. Yang J., Manolio T.A., Pasquale L.R., Boerwinkle E., Caporaso N., Cunningham J.M., de Andrade M., Feenstra B., Feingold E., Hayes M.G., et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 2011;43:519–525. doi: 10.1038/ng.823.
- 21. Davis L.K., Yu D., Keenan C.L., Gamazon E.R., Konkashbaev A.I., Derks E.M., Neale B.M., Yang J., Lee S.H., Evans P., et al. Partitioning the heritability of Tourette syndrome and obsessive compulsive disorder reveals differences in genetic architecture. PLoS Genet. 2013;9:e1003864. doi: 10.1371/journal.pgen.1003864.
- 22. Speed D., Balding D.J. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat. Genet. 2019;51:277–284. doi: 10.1038/s41588-018-0279-5.
- 23. Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011.
- 24. Bulik-Sullivan B.K., Loh P.R., Finucane H.K., Ripke S., Yang J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson N., Daly M.J., Price A.L., Neale B.M. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211.
- 25. Zheng J., Erzurumluoglu A.M., Elsworth B.L., Kemp J.P., Howe L., Haycock P.C., Hemani G., Tansey K., Laurin C., Early Genetics and Lifecourse Epidemiology EAGLE Eczema Consortium, et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics. 2017;33:272–279. doi: 10.1093/bioinformatics/btw613.
- 26. Song S., Jiang W., Zhang Y., Hou L., Zhao H. Leveraging LD Eigenvalue Regression to Improve the Estimation of SNP Heritability and Confounding Inflation. Am. J. Hum. Genet. 2022;109:802–811. doi: 10.1016/j.ajhg.2022.03.013.
- 27. Ning Z., Pawitan Y., Shen X. High-definition likelihood inference of genetic correlations across human complex traits. Nat. Genet. 2020;52:859–864. doi: 10.1038/s41588-020-0653-y.
- 28. Fomby T.B., Johnson S.R., Hill R.C. Feasible generalized least squares estimation. In: Advanced Econometric Methods. Springer; 1984. pp. 147–169.
- 29. Reams R. Hadamard inverses, square roots and products of almost semidefinite matrices. Lin. Algebra Appl. 1999;288:35–43.
- 30. Patterson N. A Modification to the Jackknife to Deal with Adjacent Blocks. 2020.
- 31. Kent W.J., Sugnet C.W., Furey T.S., Roskin K.M., Pringle T.H., Zahler A.M., Haussler D. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102.
- 32. Zhang Y., Qi G., Park J.H., Chatterjee N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet. 2018;50:1318–1326. doi: 10.1038/s41588-018-0193-x.
- 33. Siva N. 1000 Genomes project. Nat. Biotechnol. 2008;26:256–257. doi: 10.1038/nbt0308-256b.
- 34. International HapMap 3 Consortium, Altshuler D.M., Gibbs R.A., Peltonen L., Altshuler D.M., Gibbs R.A., Peltonen L., Dermitzakis E., Schaffner S.F., Yu F., et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–58. doi: 10.1038/nature09298.
- 35. Ge T., Chen C.Y., Ni Y., Feng Y.C.A., Smoller J.W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 2019;10:1776. doi: 10.1038/s41467-019-09718-5.
- 36. NealeLab UK Biobank GWAS Result. 2022. http://www.nealelab.is/uk-biobank/
- 37. Jiang X., O'Reilly P.F., Aschard H., Hsu Y.H., Richards J.B., Dupuis J., Ingelsson E., Karasik D., Pilz S., Berry D., et al. Genome-wide association study in 79,366 European-ancestry individuals informs the genetic architecture of 25-hydroxyvitamin D levels. Nat. Commun. 2018;9:260. doi: 10.1038/s41467-017-02662-2.
- 38. Wang Y., Thompson W.K., Schork A.J., Holland D., Chen C.H., Bettella F., Desikan R.S., Li W., Witoelar A., Zuber V., et al. Leveraging genomic annotations and pleiotropic enrichment for improved replication rates in schizophrenia GWAS. PLoS Genet. 2016;12:e1005803. doi: 10.1371/journal.pgen.1005803.
- 39. Howard D.M., Adams M.J., Clarke T.K., Hafferty J.D., Gibson J., Shirali M., Coleman J.R.I., Hagenaars S.P., Ward J., Wigmore E.M., et al. Genome-wide meta-analysis of depression identifies 102 independent variants and highlights the importance of the prefrontal brain regions. Nat. Neurosci. 2019;22:343–352. doi: 10.1038/s41593-018-0326-7.
- 40. Cook J.P., Mahajan A., Morris A.P. Fine-scale population structure in the UK Biobank: implications for genome-wide association studies. Hum. Mol. Genet. 2020;29:2803–2811. doi: 10.1093/hmg/ddaa157.
- 41. Salehi Nowbandegani P., Wohns A.W., Ballard J.L., Lander E.S., Bloemendal A., Neale B.M., O'Connor L.J. Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies. Preprint at bioRxiv. 2022. doi: 10.1038/s41588-023-01487-8.
- 42. Adzhubei I.A., Schmidt S., Peshkin L., Ramensky V.E., Gerasimova A., Bork P., Kondrashov A.S., Sunyaev S.R. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
- 43. Ng P.C., Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814. doi: 10.1093/nar/gkg509.
- 44. Pollard K.S., Hubisz M.J., Rosenbloom K.R., Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–121. doi: 10.1101/gr.097857.109.
- 45. Siepel A., Bejerano G., Pedersen J.S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050. doi: 10.1101/gr.3715005.
- 46. Li X., Li Z., Zhou H., Gaynor S.M., Liu Y., Chen H., Sun R., Dey R., Arnett D.K., Aslibekyan S., et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 2020;52:969–983. doi: 10.1038/s41588-020-0676-4.