PLOS Genetics. 2023 Dec 28;19(12):e1011104. doi: 10.1371/journal.pgen.1011104

SparsePro: An efficient fine-mapping method integrating summary statistics and functional annotations

Wenmin Zhang 1,*, Hamed Najafabadi 1,2,3, Yue Li 1,4,*
Editor: Gao Wang 5
PMCID: PMC10781022  PMID: 38153934

Abstract

Identifying causal variants from genome-wide association studies (GWAS) is challenging due to widespread linkage disequilibrium (LD) and the possible existence of multiple causal variants in the same genomic locus. Functional annotations of the genome may help to prioritize variants that are biologically relevant and thus improve fine-mapping of GWAS results. Classical fine-mapping methods conducting an exhaustive search of variant-level causal configurations have a high computational cost, especially when the underlying genetic architecture and LD patterns are complex. SuSiE provided an iterative Bayesian stepwise selection algorithm for efficient fine-mapping. In this work, we build connections between SuSiE and a paired mean field variational inference algorithm through the implementation of a sparse projection, and propose effective strategies for estimating hyperparameters and summarizing posterior probabilities. Moreover, we incorporate functional annotations into fine-mapping by jointly estimating enrichment weights to derive functionally-informed priors. We evaluate the performance of SparsePro through extensive simulations using resources from the UK Biobank. Compared to state-of-the-art methods, SparsePro achieved improved power for fine-mapping with reduced computation time. We demonstrate the utility of SparsePro through fine-mapping of five functional biomarkers of clinically relevant phenotypes. In summary, we have developed an efficient fine-mapping method for integrating summary statistics and functional annotations. Our method can have wide utility in understanding the genetics of complex traits and increasing the yield of functional follow-up studies of GWAS. SparsePro software is available on GitHub at https://github.com/zhwm/SparsePro.

Author summary

Accurately identifying causal variants from genome-wide association study summary statistics is important for understanding the genetic architecture of complex traits and identifying therapeutic targets. Functional annotations are commonly used as additional evidence for prioritizing causal variants. In this study, we present SparsePro to integrate summary statistics and functional annotations for accurate identification of causal variants. SparsePro extends the capabilities of a popular fine-mapping method, SuSiE, with important contributions in hyperparameter estimation, posterior summaries and integration of functional annotations. Through extensive simulations, we demonstrate that our proposed approach can effectively integrate summary statistics and functional annotations, leading to improved power for identifying causal variants. Furthermore, we evaluate the benefits of incorporating functional annotations through real data analyses of five functional biomarkers. In summary, by improving power and providing valuable insights into complex disease genetics, SparsePro will have wide utility in advancing our knowledge and facilitating follow-up discoveries.

Introduction

Establishment of large biobanks and advances in genotyping and sequencing technologies have enabled large-scale genome-wide association studies (GWAS) [1–3]. Although GWAS have revealed hundreds of thousands of associations between genetic variants and traits of interest, understanding the genetic architecture underlying these associations remains challenging [4–6], mainly because GWAS rely on univariate regression models that are not able to distinguish causal variants from other variants in linkage disequilibrium (LD) [5, 7, 8].

Several fine-mapping methods have been proposed to identify causal variants from GWAS. For instance, BIMBAM [9], CAVIAR [10] and CAVIARBF [11] estimate the posterior inclusion probabilities (PIP) for variants in a genomic locus by exhaustively evaluating likelihoods of all possible causal configurations. FINEMAP [12] accelerates such calculations with a stochastic shotgun search focusing on the most likely subset of causal configurations. However, the total number of variant-level causal configurations can grow exponentially with the number of causal variants, which can lead to a very high computational cost in classical fine-mapping methods. Starting from the motivation of quantifying uncertainty in selecting variants for constructing credible sets, SuSiE introduced a novel sum of single effect model and proposed an efficient iterative Bayesian stepwise selection (IBSS) algorithm [13, 14]. The IBSS algorithm points to a promising approach for improving fine-mapping efficiency.

Additionally, functional annotations are commonly used as auxiliary evidence for prioritizing causal variants. PAINTOR [15] uses a probabilistic framework that integrates GWAS summary statistics with functional annotation data to improve accuracy of fine-mapping. Similarly, TORUS [16] incorporates highly informative genomic annotations to help with quantitative trait loci discoveries. Recently, PolyFun [17] was developed to use genome-wide heritability estimates from LD score regression to set the functional priors for fine-mapping methods. Given the computational efficiency of SuSiE, integrating functional annotations into similar algorithms can be desirable.

In this work, we present SparsePro for efficient fine-mapping with the ability to incorporate functional annotations. We connect the SuSiE IBSS algorithm with earlier work on a paired mean field variational inference algorithm [18] through the implementation of a sparse projection. We further propose effective strategies for estimating hyperparameters and summarizing posterior probabilities. We assess the performance of our proposed approach via simulation studies and examine the utility of SparsePro by fine-mapping five functional biomarkers of clinically relevant phenotypes.

Results

SparsePro method overview

To fine-map causal variants, SparsePro integrates two lines of evidence (Fig 1). First, with GWAS summary statistics and matched LD information, we jointly infer both the variant representations and the effect sizes for effect groups. Second, we estimate the functional enrichment of causal variants and use the estimates to further prioritize causal variants. As outputs, our method yields variant-level PIP estimates, set-level posterior summaries as well as enrichment estimates of functional annotations.

Fig 1. SparsePro for integrating summary statistics and functional annotations.

The data generative process in SparsePro is depicted in this graphical model. Green shaded nodes represent observed variables: functional annotation information $A_g$ for the $g$th variant, genotype $X_i$ and trait $y_i$ for the $i$th individual. The orange unshaded nodes represent latent variables. Specifically, $\tilde{\pi}_g$ is the prior inclusion probability for the $g$th variant derived from functional annotation information; $s_k$ is a sparse indicator specifying the variant representation of the $k$th effect group and $\beta_k$ represents the effect size of the $k$th effect group. As a result, posterior summaries can be obtained from the posterior distribution of $s_k$. Here, we assume individual-level data are available; adaptation to GWAS summary statistics is detailed in the S1 Text.

Locus simulation

To evaluate the performance of SparsePro, we conducted simulations using the UK Biobank genotype data from multiple genomic loci (Methods), considering different numbers of causal variants and functional enrichment settings. As an example, the results from one locus (chr22: 31,000,000–32,000,000, Genome Assembly hg19) under a specific simulation setting: K = 5 (causal variants) and W = 2 (enrichment intensity) are shown in Fig 2. A reliable method should accurately identify true causal variants (represented by red dots) with high posterior inclusion probabilities (PIP) and assign low PIP to non-causal variants (represented by black dots), thereby achieving a high area under the precision-recall curve (AUPRC). Moreover, an ideal method should generate credible sets that have a high coverage and power while maintaining a small size.

Fig 2. Locus simulation in the setting of K = 5 (number of causal variants) and W = 2 (enrichment intensity).

(A) Comparison of posterior inclusion probabilities (PIP) obtained using different methods. Each dot represents a variant. True causal variants are colored red and non-causal variants are colored black. (B) Precision-recall curves. The area under the precision-recall curve (AUPRC) for each method is indicated. (C) Calibration curves. Variants are grouped into five bins according to their PIP values. Each dot represents one bin. The actual precision (y-axis) is plotted against the expected precision (x-axis) calculated by mean PIP values across all variants in the bin.

We found that variant-level PIP obtained from FINEMAP, SuSiE, and SparsePro- (SparsePro implementation without functional information) were highly similar for most variants. Compared to FINEMAP, SparsePro- had fewer false positives; compared to SuSiE, SparsePro- was able to detect more causal variants with higher PIP (Fig 2A). Aggregating all simulation replicates, SparsePro- achieved an AUPRC of 0.91 in identifying true causal variants, while the second-best statistical fine-mapping method (SuSiE) achieved an AUPRC of 0.87 and PAINTOR- achieved an AUPRC of 0.82 (Fig 2B). Incorporating functional annotations improved fine-mapping performance, with SparsePro+ achieving an AUPRC of 0.98, while PAINTOR+ achieved an AUPRC of 0.93 (Fig 2B). Additionally, PIP generated by most methods exhibited good calibration, as the actual precision was close to the expected precision calculated by the mean PIP values (Fig 2C).
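The two variant-level metrics used above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the study: the function names, the step-wise (average precision) estimator of AUPRC, the equal-width PIP bins, and the toy PIP values are all assumptions made for the example.

```python
import numpy as np

def average_precision(pip, is_causal):
    """AUPRC via the step-wise average-precision estimator (an assumed choice)."""
    pip = np.asarray(pip, dtype=float)
    is_causal = np.asarray(is_causal, dtype=bool)
    order = np.argsort(-pip, kind="stable")          # rank variants by decreasing PIP
    hits = is_causal[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision at each rank where a causal variant is recovered.
    return float(precision[hits].sum() / hits.sum())

def calibration_bins(pip, is_causal, n_bins=5):
    """Return (mean PIP, observed causal fraction) for each non-empty PIP bin."""
    pip = np.asarray(pip, dtype=float)
    is_causal = np.asarray(is_causal, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(pip, edges[1:-1])              # bin index in 0..n_bins-1
    out = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            out.append((float(pip[mask].mean()), float(is_causal[mask].mean())))
    return out

# Toy example: six variants, three of them truly causal (hypothetical values).
pip = [0.99, 0.85, 0.65, 0.10, 0.05, 0.01]
causal = [1, 1, 0, 1, 0, 0]
ap = average_precision(pip, causal)
bins = calibration_bins(pip, causal)
```

A well-calibrated method would place each bin close to the diagonal, that is, with the observed causal fraction close to the mean PIP in the bin.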

SparsePro yielded better set-level summaries compared to SuSiE due to its effective strategies for estimating hyperparameters and summarizing posterior probabilities (S1 Text). Overall, credible sets from SparsePro-, SparsePro+, and SuSiE exhibited similar coverage (Fig 3A). However, both SparsePro- and SparsePro+ consistently demonstrated a higher power and smaller size compared to SuSiE (Fig 3A). In simulation settings with functional enrichment, SparsePro+ demonstrated an additional increase in power and reduction in set size compared to SparsePro- (Fig 3A). For example, with K = 5 (causal variants) and W = 2 (enrichment intensity), SparsePro- and SuSiE achieved a similar mean coverage of 0.99 for their 95% credible sets (Fig 3A and S1 Table). However, SparsePro- outperformed SuSiE by achieving a higher mean power of 0.82 at a smaller mean size of 1.8, while SuSiE achieved a power of 0.68 at a mean size of 2.0 (Fig 3A and S1 Table). When functional information was incorporated, SparsePro+ further increased power and reduced the size of credible sets, achieving a power of 0.94 at a mean size of 1.4 (Fig 3A and S1 Table).
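For readers who want to reproduce set-level summaries of this kind, the sketch below computes coverage, power and size from a list of credible sets. The definitions used here are assumptions for illustration (coverage as the fraction of credible sets containing at least one causal variant, power as the fraction of causal variants captured by any credible set, size as the mean number of variants per set); the function name and the toy variant indices are hypothetical.

```python
def credible_set_summary(credible_sets, causal_variants):
    """Summarise credible sets against a known set of causal variants."""
    causal = set(causal_variants)
    sets = [set(cs) for cs in credible_sets]
    n_sets = len(sets)
    covered = sum(1 for cs in sets if cs & causal)   # sets holding a causal variant
    found = set().union(*sets) & causal              # causal variants captured by any set
    nan = float("nan")
    return {
        "coverage": covered / n_sets if n_sets else nan,
        "power": len(found) / len(causal) if causal else nan,
        "size": sum(len(cs) for cs in sets) / n_sets if n_sets else nan,
    }

# Three credible sets, three simulated causal variants (indices are made up).
summary = credible_set_summary([[3], [10, 11], [42, 43, 44]], causal_variants=[3, 11, 70])
```

In this toy case two of three sets contain a causal variant and two of three causal variants are captured, so coverage and power are both 2/3 with a mean size of 2.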

Fig 3. Summary of locus simulations.

(A) Coverage, power and size of 95% credible sets. (B) Area under the precision-recall curve (AUPRC). (C) Computation time in seconds.

SparsePro also consistently demonstrated the highest AUPRC across all simulation settings (Fig 3B and S2 Table). When the number of causal variants was small (e.g. K = 1), the AUPRC of FINEMAP was comparable to SparsePro-; however, as the number of causal variants increased, the performance of FINEMAP deteriorated (Fig 3B). Notably, without simulated functional enrichment (W = 0), none of the annotations demonstrated significance in the SparsePro G-test (Methods) (S3 Table). Therefore, SparsePro+ was not run in this setting. In contrast, PAINTOR did not select annotations by relevance, which could lead to worse performance of PAINTOR+ compared to PAINTOR- when there was no actual functional enrichment. For both PAINTOR+ and SparsePro+, stronger functional enrichment resulted in higher estimated enrichment weights (S3 and S4 Tables and S1 and S2 Figs). Consequently, the magnitude of their performance improvement became more evident compared to SparsePro- and PAINTOR-, respectively (Fig 3B).

In these simulations, SparsePro also achieved great computational efficiency (Fig 3C), with the computation cost only increasing marginally as the number of causal variants increased. In contrast, while FINEMAP was the fastest approach when there was only one causal variant, its computational cost increased drastically with more causal variants (Fig 3C).

Genome-wide simulation

In addition to locus simulations, we also performed genome-wide simulations to compare SparsePro with PolyFun, which relies on LD score regression to derive functionally-informed priors (Methods). A reliable functionally-informed fine-mapping method should accurately estimate the intensity of functional enrichment and incorporate this information into deriving functionally-informed priors. Additionally, the method should also demonstrate robustness against annotation misspecification or measurement errors.

SparsePro accurately estimated functional enrichment weights in different simulation settings. Specifically, when functional enrichment was simulated (W = 1 and W = 2), the SparsePro G-test identified the enriched annotations as well as annotations that overlapped with the enriched annotations (S5 Table, S3 Fig). In settings without simulated functional enrichment (W = 0), none of the G-test results were statistically significant (S5 Table, S3 Fig). Furthermore, even without selecting relevant annotations based on a p-value threshold, the joint enrichment estimates in SparsePro yielded small weights for non-enriched annotations and large weights for enriched annotations, which closely aligned with their simulated values (S6 Table, S4 Fig). In comparison, the annotation coefficients obtained from LD score regression in PolyFun were small values that are difficult to interpret (S7 Table).

With accurate estimates of enrichment weights, SparsePro obtained well-informed priors that enhanced the power of causal variant prioritization. For instance, in the simulation setting with W = 2, SparsePro+ outperformed SparsePro- and other methods by identifying more causal variants with higher PIP (Fig 4A). The improved performance is demonstrated by achieving a higher AUPRC (S8 Table) and generating credible sets with an increased power and reduced size (S9 Table). The performance of SparsePro+PolyFun was worse than SparsePro+, with a lower AUPRC (S8 Table) and credible sets with a reduced power (S9 Table).

Fig 4. Genome-wide simulations.

(A) Comparison of posterior inclusion probabilities (PIP) obtained using different methods in the simulation setting of W = 2. True causal variants are colored red and non-causal variants are colored black. (B) The logarithmic relative ratio (logRR) between the largest and smallest prior inclusion probabilities. (C) Coverage, power and size for 95% credible sets.

The key distinction between the functionally-informed priors derived from SparsePro and PolyFun lay in their adaptability to the actual functional enrichment, as demonstrated through the logarithmic relative ratio (logRR) between the largest and smallest prior inclusion probabilities (Fig 4B). In SparsePro+, in settings with no simulated functional enrichment (W = 0), the relative ratio was 1 (logRR = 0) since a flat prior was used because no annotations were selected by the G-test for prior inclusion probability calculation (S10 Table, Fig 4B). As the enrichment intensity (W) increased, logRR also increased accordingly and was close to the simulated values (W = 1 or W = 2; S10 Table, Fig 4B). In contrast, in PolyFun, the logRR remained approximately log(100) as per the default setting of the algorithm, regardless of the actual functional enrichment (S10 Table, Fig 4B).

In the case of annotation misspecification (Methods), SparsePro demonstrated greater robustness compared to PolyFun. When the “conserved sequences” [19] annotation (the simulated misspecified annotation) was included in functional fine-mapping, the enrichment weights estimated by SparsePro+Misspecified were attenuated compared to SparsePro+ when the “non-synonymous” [20] annotation (the simulated enriched annotation) was included (Fig 4B). Consequently, SparsePro+Misspecified identified fewer causal variants with higher PIP compared to SparsePro+ (Fig 4A, S5 and S6 Figs). However, SparsePro+Misspecified maintained a comparable performance to SparsePro- with a similar overall AUPRC (S8 Table) and similar coverage, power and size for credible sets (Fig 4C). In contrast, with the misspecified annotation, PolyFun still generated a strong functionally-informed prior (Fig 4B). As a result, a large number of non-causal variants were prioritized with high PIP (Fig 4A, S5 and S6 Figs), and the obtained credible sets had a lower coverage and reduced power (Fig 4C).

Fine-mapping of functional biomarkers of clinically relevant phenotypes

We performed GWAS using data from the UK Biobank [1] for five functional biomarkers: FEV1-FVC ratio (FFR; lung function), estimated glomerular filtration rate (eGFR; kidney function), pulse rate (heart function), blood gamma-glutamyl transferase (gamma-GT; liver function) and blood glucose level (pancreatic islet function) (Methods). The fine-mapping results for these biomarkers demonstrated that functional annotations were informative in prioritizing causal variants. Notably, for all five biomarkers, the “non-synonymous” [20] annotation consistently exhibited the highest enrichment weights compared to other annotations (S7 Fig, S11 Table). Specifically, for eGFR, the non-synonymous annotation had an enrichment weight of 3.19 (95% CI: 2.88–3.50) (S11 Table). This indicates that non-synonymous variants were 24.3 (95% CI: 17.8–33.1) times more likely to be causal variants compared to variants that are not non-synonymous (S11 Table).
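The conversion from an enrichment weight to a fold enrichment follows from the softmax prior described in Methods: holding other annotations fixed, an annotated variant has exp(w) times the prior probability of an unannotated one. A minimal check of the eGFR numbers quoted above, with a hypothetical helper name:

```python
import math

def enrichment_fold(weight, ci_low, ci_high):
    """Exponentiate an enrichment weight and its confidence limits."""
    return tuple(round(math.exp(v), 1) for v in (weight, ci_low, ci_high))

# Non-synonymous annotation for eGFR: weight 3.19 (95% CI: 2.88-3.50).
fold = enrichment_fold(3.19, 2.88, 3.50)   # -> (24.3, 17.8, 33.1)
```

Because the exponential is monotone, the confidence limits of the weight map directly to the confidence limits of the fold enrichment.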

We used a different set of tissue-specific annotations to evaluate the biological relevance of the fine-mapping results (Methods). By estimating enrichment weights for these annotations based on variant-level PIP, we found that the tissue-specific annotations corresponding to each biomarker exhibited the highest enrichment weights (Fig 5A, S12 Table). For example, for eGFR, where the kidney-related annotation was the most relevant, the estimated enrichment weight from PIP derived from SparsePro- was 1.42 (95% CI: 1.25–1.59) while the estimated enrichment weight from PIP derived from SparsePro+ was 2.14 (95% CI: 1.95–2.33) (S12 Table).

Fig 5. Biological relevance of fine-mapping results for functional biomarkers of clinically relevant phenotypes.

(A) Enrichment fold in tissue-specific annotations. Each row denotes a tissue-specific annotation derived from histone marks (Methods) and each column denotes a functional biomarker. Error bars represent 95% confidence intervals for enrichment estimates. (B) Proportion of top variants from 95% credible sets mapped to tissue-specific annotations. Rows denote relevant tissue-specific annotations and columns denote functional biomarkers.

Furthermore, the top variants from 95% credible sets also demonstrated tissue specificity (Fig 5B). For example, approximately 36% of the top variants for eGFR were annotated to kidney-specific annotations, while only 18% of the top variants for pulse rate were annotated to kidney-specific annotations (p-value from Fisher’s exact test: 8.8 × 10^−7) (S13 and S14 Tables).

Evidence of genetic coordination of clinically relevant phenotypes

The top variants from 95% credible sets mapped to genes in both core and regulatory pathways for these biomarkers (S15–S19 Tables). Four genes harbored top variants for four out of the five biomarkers (Fig 6A). Interestingly, we found that rs1260326 (Fig 6B), a missense variant (Leu446Pro) in gene GCKR, was fine-mapped for eGFR (PIP = 0.99), blood glucose level (PIP = 0.99), gamma-GT level (PIP = 1.00) and pulse rate (PIP = 0.85). Notably, this specific variant has been significantly associated with several glycemic traits [21] and quantitative traits for metabolic syndromes and comorbidities [22, 23], and has been implicated in the functions of the liver and other vital organs [24–26]. Other highly pleiotropic genes are transcription factors GLIS3, RREB1 and ZBT38. These findings may present promising genetic targets for experimental validations in a larger effort towards understanding the mechanisms of genetic coordination among clinically relevant phenotypes.

Fig 6. Genes harboring causal sets for five functional biomarkers of clinically relevant phenotypes.

(A) Genome-wide distribution of genes harboring causal sets for at least two functional biomarkers. (B) GCKR locus with fine-mapped variant rs1260326. This variant was deemed causal for eGFR, glucose, gamma-GT and pulse rate. P-values from GWAS and posterior inclusion probabilities inferred from SparsePro+ are illustrated. Variants within a ±500kb window are colored by their linkage disequilibrium r^2 with rs1260326.

Discussion

Accurately identifying causal variants is fundamental to human genetics research and particularly important for interpreting GWAS results [5, 8]. In this work, we presented SparsePro to help prioritize causal variants by integrating GWAS summary statistics and functional information. We showcased the improved performance of our proposed approach through simulation studies. By fine-mapping genetic associations in five biomarkers of clinically relevant phenotypes, we demonstrated that functional annotations were useful in prioritizing biologically relevant variants.

SparsePro builds upon SuSiE [13] and extends the capabilities of SuSiE with several important contributions. First, we proposed an effective strategy for estimating hyperparameters. Specifically, local heritability-based estimates can reduce the number of parameters to be estimated by the fine-mapping algorithm, resulting in improved power and efficiency. To showcase its utility, we also applied this strategy to SuSiE and observed substantial improvement of fine-mapping power (S2 Table) with calibrated PIP (S8 Table).

Moreover, we provided an alternative attainable coverage-based approach for posterior summaries. Specifically, we calculated attainable coverage for each effect group and only effect groups with attainable coverage greater than ρ were summarized to ρ-level credible sets. We also applied this approach to SuSiE, which yielded improved set-level summaries compared to its original implementation with purity-based filtering (S9 Fig). As expected, if both strategies for estimating hyperparameters and summarizing posterior probabilities were incorporated in SuSiE, its performance could be comparable to that of SparsePro (S1 and S2 Tables).
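To make the idea of an attainable coverage-based summary concrete, the sketch below shows one simplified reading of it. The exact definition is given in the S1 Text and may differ; here we assume that each variant contributes only to the effect group in which its posterior probability is largest, that the attainable coverage of a group is the sum of those contributions, and that a ρ-level credible set is the smallest set of top-ranked variants whose cumulative contribution reaches ρ. The function name and the toy posterior matrix are hypothetical.

```python
import numpy as np

def attainable_coverage_sets(gamma, rho=0.95):
    """gamma: (variants x effect groups) posterior probabilities, columns sum to 1."""
    gamma = np.asarray(gamma, dtype=float)
    n_variants, n_groups = gamma.shape
    # Assumption: a variant is counted only in its highest-posterior effect group.
    rows = np.arange(n_variants)
    top_group = gamma.argmax(axis=1)
    assigned = np.zeros_like(gamma)
    assigned[rows, top_group] = gamma[rows, top_group]
    attainable = assigned.sum(axis=0)
    sets = {}
    for k in range(n_groups):
        if attainable[k] <= rho:
            continue                      # this group cannot reach the requested level
        order = np.argsort(-assigned[:, k], kind="stable")
        cum = np.cumsum(assigned[order, k])
        n_keep = int(np.searchsorted(cum, rho)) + 1
        sets[k] = order[:n_keep].tolist()
    return attainable, sets

# Four variants, two effect groups (made-up posteriors).
gamma = [[0.90, 0.05],
         [0.08, 0.05],
         [0.01, 0.50],
         [0.01, 0.40]]
attainable, sets = attainable_coverage_sets(gamma, rho=0.95)
```

In this toy example the first group has an attainable coverage of 0.98 and yields a two-variant credible set, while the second group only attains 0.90 and is not summarized at the 95% level.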

Importantly, we provided a framework to integrate GWAS summary statistics and functional annotations. Functional annotations are widely used as additional evidence to prioritize causal variants together with statistical associations, with the possibility to elucidate the causal mechanisms [15–17]. In this study, we proposed an integrated approach for functional fine-mapping by jointly estimating enrichment weights for functional annotations and subsequently incorporating enrichment weights to derive functionally-informed priors. Therefore, the obtained priors were adaptive to functional enrichment based on the data, which allows the use of functional annotations in a cautious manner. We additionally introduced a G-test to assess the relevance of annotations. This G-test evaluates whether the causal signals are significantly enriched in the annotation of interest. In simulations, the G-test has shown its effectiveness in accurately identifying the enriched annotations (S3 and S5 Tables). However, in SparsePro, filtering annotations by the G-test does not impact the fine-mapping results dramatically (S2 and S8 Tables). This is because when irrelevant annotations are included in the joint estimates, their estimated enrichment weights are typically small (S6 Table), thus having a limited impact on the functionally-informed priors. Nonetheless, in SparsePro, screening annotations by G-test leads to simpler, more interpretable models. While we used a p-value threshold of 1 × 10^−5 in both simulations and fine-mapping of functional biomarkers, users can adjust this threshold based on their preference for a more complicated model or a sparser model. Additionally, for other functionally-informed methods that are sensitive to annotation specifications, particularly those deriving strong priors from annotations, incorporating our proposed G-test can be useful to mitigate the impact of annotation misspecification.

In real data analyses, the “non-synonymous” [20] annotation is highly relevant in fine-mapping (S7 Fig) and indeed, by using this annotation to prioritize variants, we were able to identify rs1260326 as a causal variant for pulse rate when statistical evidence alone was not able to distinguish it from other variants in high LD (S10 Fig). However, future investigations are still needed to elucidate the roles of many other functional annotations.

Similar to existing fine-mapping algorithms, there are caveats in fine-mapping analysis using SparsePro. First, there are challenges related to allele flipping and LD rank deficiency when using summary statistics for fine-mapping. SparsePro, similar to Zou et al. [14], does not require a full-rank LD matrix as it does not require matrix inversion throughout the algorithm. However, allele flipping can lead to algorithm convergence issues. To address this, it is recommended that users closely monitor the convergence of the algorithm and utilize the scripts we provide to automate the formatting of GWAS summary statistics to match alleles in the LD reference panel. By taking these precautions, the potential convergence issues caused by allele flipping can be mitigated. Additionally, the identification of causal variants in fine-mapping relies on the rigor of the GWAS study design, and may be biased if unmeasured confounding factors such as population stratification are not properly controlled for.

In summary, SparsePro is an accurate and efficient fine-mapping method integrating statistical evidence and functional annotations. We envision its wide utility in understanding the genetic architecture of complex traits, identifying target genes, and increasing the yield of functional follow-up studies of GWAS.

Methods

SparsePro for efficient fine-mapping integrating summary statistics and functional annotations

In SparsePro, we use a generative model to integrate GWAS summary statistics and functional annotations (Fig 1 and S1 Text). First, we specify the prior inclusion probability $\tilde{\pi}_g$ for the $g$th variant:

$$\tilde{\pi}_g = \frac{\exp(A_g^\top w)}{\sum_{g'=1}^{G} \exp(A_{g'}^\top w)}$$

where $A_g$ is the $M \times 1$ vector of $M$ annotations for the $g$th variant and $w$ is an $M \times 1$ vector of enrichment weights. Here, we use the softmax function to ensure the prior probabilities are normalized. If no functional information is provided, the prior inclusion probability is considered equal for all variants.
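The softmax prior can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions rather than the SparsePro implementation: the function name, the binary annotation matrix and the weight values are made up for the example.

```python
import numpy as np

def prior_inclusion_probabilities(annotations, weights):
    """Softmax over variants of the annotation-weighted scores A_g^T w."""
    logits = np.asarray(annotations, dtype=float) @ np.asarray(weights, dtype=float)
    logits = logits - logits.max()        # subtract the maximum for numerical stability
    expd = np.exp(logits)
    return expd / expd.sum()

# Four variants, two binary annotations (hypothetical values).
A = np.array([[1, 0],
              [0, 0],
              [0, 1],
              [0, 0]])
prior = prior_inclusion_probabilities(A, [2.0, 0.0])   # first annotation enriched
flat = prior_inclusion_probabilities(A, [0.0, 0.0])    # no enrichment: uniform prior
```

With a weight of 2 on the first annotation, the annotated variant receives exp(2) times the prior probability of an unannotated variant, so the log ratio between the largest and smallest prior equals the weight; with all weights at zero the prior is flat.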

Subsequently, we assume the high dimensional genotype matrix $X_{N \times G}$ can be represented by altogether $K$ effect groups via a sparse projection $S_{G \times K} = [s_1, \ldots, s_K]$ with

$$s_k \sim \mathrm{Multinomial}(1, \tilde{\pi})$$

Then the effect sizes for effect groups can be represented by $\beta = [\beta_1, \ldots, \beta_K]$ where

$$\beta_k \sim \mathcal{N}(0, \tau_\beta^{-1})$$

Finally, for a continuous trait $y_{N \times 1}$ over $N$ individuals, we have:

$$y \sim \mathcal{N}(X S \beta, \tau_y^{-1} I)$$
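The generative process can be simulated forward in a few lines, which may help clarify the role of the sparse projection: each effect group selects one variant, draws one effect size, and the trait is the sum of the selected effects plus noise. The dimensions, precision values, flat prior and the use of standard normal draws as a stand-in for genotypes are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, G, K = 500, 50, 3                 # individuals, variants, effect groups (toy sizes)
tau_beta, tau_y = 4.0, 1.0           # precisions, i.e. inverse variances (assumed)

X = rng.standard_normal((N, G))      # stand-in for a standardized genotype matrix
prior = np.full(G, 1.0 / G)          # flat prior inclusion probabilities

# Each column s_k is a one-hot indicator drawn from Multinomial(1, prior).
S = rng.multinomial(1, prior, size=K).T                  # shape (G, K)
beta = rng.normal(0.0, tau_beta ** -0.5, size=K)         # beta_k ~ N(0, 1 / tau_beta)
y = X @ S @ beta + rng.normal(0.0, tau_y ** -0.5, size=N)
```

Note that two effect groups can select the same variant in a forward draw; inference works in the opposite direction, recovering the posterior distribution of each column of S from the data.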

For inference, we use an efficient paired mean field variational inference algorithm [18] adapted for GWAS summary statistics, which we show is equivalent to the SuSiE IBSS algorithm [13] (detailed in the S1 Text). We estimate hyperparameters for effect sizes $\tau_\beta$ and residual errors $\tau_y$ using local heritability-based estimates from HESS [27] (S1 Text) and propose an attainable coverage-based strategy for summarizing posterior probabilities (S1 Text). Additionally, we use joint estimates of enrichment weights to derive functionally-informed priors to further prioritize causal variants (S1 Text) and introduce a G-test to screen relevant functional annotations (S1 Text).
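As background for the annotation screening step, the sketch below implements the textbook G-test of independence on a 2 × 2 table. How the table is constructed in SparsePro (for example, whether counts are weighted by posterior probabilities) is specified in the S1 Text and is not reproduced here; the function name and the example counts are hypothetical.

```python
import math

def g_test_2x2(table):
    """Textbook G-test of independence for a 2x2 table; returns (G, p-value)."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            if observed > 0:
                g += 2.0 * observed * math.log(observed / expected)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value

# Rows: annotated / not annotated; columns: causal / non-causal (made-up counts).
g_enriched, p_enriched = g_test_2x2([[30.0, 970.0], [20.0, 8980.0]])
g_null, p_null = g_test_2x2([[5.0, 995.0], [45.0, 8955.0]])
```

In the first table causal signals are concentrated in the annotation and the p-value falls far below a 1 × 10^−5 threshold; in the second table the annotation carries no information and the statistic is essentially zero.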

Locus simulation studies

We conducted locus simulations to evaluate the performance of fine-mapping methods under different settings. We randomly selected three 1-Mb regions, and obtained genotypes for 353,570 unrelated UK Biobank White British ancestry individuals [1]. For each locus, we generated 50 replicates for each combination of parameters: K ∈ {1, 2, 5, 10} (number of causal variants) and W ∈ {0, 1, 2} (enrichment intensity) among variants that were annotated as “conserved sequences” [19], “DNase I hypersensitive sites” (DHS) [28], “non-synonymous” [20], or overlapping with histone marks H3K27ac [29] or H3K4me3 [28]. In the simulated weight vector w, the entries that correspond to these enriched annotations had a value of W. Causal variants in each simulation replicate were randomly assigned. Then, we used the GCTA GWAS simulation pipeline [30] to simulate a continuous trait with a total heritability of K × 10^−4. We performed association tests between each variant and the simulated trait, and obtained GWAS summary statistics using the fastGWA software [31].

Next, we ran the different fine-mapping programs with the GWAS summary statistics and in-sample LD as inputs. For methods using functional annotations, we provided the aforementioned five annotations with enrichment of causal variants as well as five additional annotations without enrichment: “actively transcribed regions” [32], “transcription start sites” [32], “promoter regions” [33], “5’-untranslated regions” [20], and “3’-untranslated regions” [20]. The statistical fine-mapping results obtained from SparsePro without annotation information were denoted as “SparsePro-”. Annotations with a G-test p-value < 1 × 10^−5 were selected for functionally-informed fine-mapping, and the results were referred to as “SparsePro+”. Additionally, we performed functionally-informed fine-mapping by including all annotations (i.e., a G-test p-value < 1.0) without G-test screening, denoted as “SparsePro+1.0”. Moreover, we conducted statistical fine-mapping using the stochastic shotgun search mode of FINEMAP (V1.4) and the function “susie_rss” from SuSiE (V0.12.16). The mcmc mode for PAINTOR (V3.0) was used to obtain the baseline model results and the annotated model results, separately denoted as “PAINTOR-” and “PAINTOR+”. The largest K used for SparsePro, SuSiE and FINEMAP was 10. Due to the high computation cost, PAINTOR only allows up to 3 causal variants per locus. Computation time was recorded on a 2.1 GHz CPU for fine-mapping programs including all procedures.

Furthermore, we investigated the benefits of our proposed strategies for estimating hyperparameters and summarizing posterior probabilities (detailed in the S1 Text) by incorporating them into SuSiE. Specifically, local heritability-based estimates for effect size variance and residual variance were provided to “scaled_prior_variance” and “residual_variance” respectively in both SuSiE+HESS and SuSiE+SparsePro while the default empirical Bayes based hyperparameter estimates were used in SuSiE. The posterior summaries obtained from SuSiE with heritability-based hyperparameters were denoted as “SuSiE+HESS” while the posterior summaries obtained using our proposed approach were denoted as “SuSiE+SparsePro” (S1 Text).

Genome-wide simulation studies

We conducted genome-wide simulations to compare SparsePro+ with other methods that require genome-wide GWAS summary statistics for functional fine-mapping. We obtained genotypes of 353,570 unrelated UK Biobank White British individuals on chromosome 22 and sampled 100 causal variants with W ∈ {0, 1, 2} (enrichment intensity) among variants that were annotated as “non-synonymous” [20]. We used the GCTA GWAS simulation pipeline [30] to simulate a continuous trait with a per-chromosome heritability of 0.01. We tested the association between each variant and the simulated trait, and obtained GWAS summary statistics using the fastGWA software [31]. This process was repeated 22 times to obtain genome-wide GWAS summary statistics. Additionally, we obtained LD information calculated using the UK Biobank participants from Weissbrod et al. [17]. These LD matrices were generated for genome-wide variants binned into sliding windows of 3 Mb with neighboring windows having a 2-Mb overlap.

We applied SparsePro to the GWAS summary statistics with the aforementioned LD information, iterating over all sliding windows initially without any functional annotation. The fine-mapping results obtained were referred to as “SparsePro-”. Next, the 10 annotations used in locus simulations were used to derive functional priors. The fine-mapping results from SparsePro with a prior derived from PolyFun were denoted as “SparsePro+PolyFun”. Additionally, results from SparsePro with a functional prior estimated from annotations with a G-test p-value less than 1 × 10^-5 were denoted as “SparsePro+”, while results from SparsePro with a functional prior estimated from all 10 annotations were denoted as “SparsePro+1.0”. In these fine-mapping analyses, variants in each 3-Mb sliding window were fine-mapped jointly. However, we only retained PIP for variants located in the 1-Mb region central to the window, as well as credible sets with top variants located in this 1-Mb region. Therefore, every retained variant was fine-mapped together with its neighboring variants within at least 1 Mb, which mitigates boundary effects.
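The windowing scheme described above can be sketched as follows. This is a minimal illustration, not the SparsePro implementation: the function names, the 0-based half-open coordinates, and the choice to start the first window at position 0 are assumptions made for the example, and the handling of the outermost 1 Mb at each chromosome end is not specified in this section.

```python
from typing import Iterator, List, Sequence, Tuple

MB = 1_000_000

def sliding_windows(chrom_length: int, size: int = 3 * MB,
                    step: int = 1 * MB) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) windows of `size` bp.

    Consecutive windows overlap by size - step (2 Mb with the defaults).
    """
    start = 0
    while True:
        yield start, start + size
        if start + size >= chrom_length:
            break
        start += step

def central_region(window: Tuple[int, int], flank: int = 1 * MB) -> Tuple[int, int]:
    """Return the central part of a window, leaving `flank` bp on each side."""
    start, end = window
    return start + flank, end - flank

def retained_pips(positions: Sequence[int], pips: Sequence[float],
                  window: Tuple[int, int]) -> List[Tuple[int, float]]:
    """Keep (position, PIP) pairs only for variants in the central region."""
    lo, hi = central_region(window)
    return [(pos, pip) for pos, pip in zip(positions, pips) if lo <= pos < hi]

if __name__ == "__main__":
    for win in sliding_windows(5 * MB):
        print(win, "-> keep PIP in", central_region(win))
```

Because the step equals the width of the central region, the central regions of consecutive windows tile the chromosome without overlap, so each interior variant receives its PIP from exactly one window.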

To further investigate the impact of annotation misspecification or annotation measurement errors on functionally-informed fine-mapping, we utilized the “conserved sequences” [19] annotation, which partly overlaps with the simulated enriched “non-synonymous” [20] annotation. We used this annotation for deriving the functional prior using both SparsePro and PolyFun, and the corresponding results were labeled as “SparsePro+Misspecified” and “SparsePro+Misspecified PolyFun”, respectively. These analyses allowed us to evaluate the robustness of the functionally-informed fine-mapping approach to annotation misspecifications and potential measurement errors.

Fine-mapping of functional biomarkers of clinically relevant phenotypes

To investigate potential genetic coordination mechanisms, we performed GWAS in the UK Biobank [1], focusing on five functional biomarkers: forced expiratory volume in one second to forced vital capacity (FEV1-FVC) ratio for lung function, estimated glomerular filtration rate for kidney function, pulse rate for heart function, gamma-GT for liver function and blood glucose level for pancreatic islet function. For each biomarker, we first regressed out the effects of age, age^2, sex, genotyping array, recruitment centre, and the first 20 genetic principal components before inverse normal transforming the residuals to z-scores that had a zero mean and unit variance. We then performed GWAS analysis on the resulting z-scores with the fastGWA software [30, 31] to obtain summary statistics.
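The inverse normal transformation step can be illustrated with a short rank-based sketch. It assumes the covariate-adjusted residuals have already been computed; the function names and the rank offset of 0.5 are choices made for this example (other offsets, such as Blom's 3/8, are also common), and the section does not state which variant was used.

```python
from statistics import NormalDist
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def inverse_normal_transform(residuals: Sequence[float],
                             offset: float = 0.5) -> List[float]:
    """Map residuals to standard normal quantiles of their adjusted ranks."""
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
            for r in average_ranks(residuals)]

if __name__ == "__main__":
    print(inverse_normal_transform([3.2, -1.0, 0.5, 10.0, 0.7]))
```

The transformed values preserve the ordering of the residuals, are symmetric around zero when there are no ties, and have a variance close to (though not exactly) one in finite samples.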

Using the summary statistics and the matched LD information [17], we performed genome-wide fine-mapping with “SparsePro-”, “SparsePro+” and “SparsePro+PolyFun” as described in Section 5.3 with annotations from the “baselineLF2.2.UKB” model [17] provided by PolyFun.

To assess the biological relevance of fine-mapping results, we used 10 tissue-specific annotations derived from four histone marks H3K4me1, H3K4me3, H3K9ac, and H3K27ac by Finucane et al. [34]. This set of annotations was not used by any functional fine-mapping methods. To assess tissue specificity of the obtained PIP values, we ran G-test and estimated enrichment weight (S1 Text) for each tissue-specific annotation. Additionally, we examined whether the top variants from 95% credible sets identified for a trait were more enriched for relevant tissue-specific annotations compared to the top variants identified for other traits by Fisher’s exact tests.
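The Fisher's exact test comparison can be sketched on a 2 × 2 table of top variants. The table layout (rows: trait of interest versus other traits; columns: inside versus outside the tissue-specific annotation), the one-sided alternative, and the function name are assumptions made for this illustration; the section does not state whether a one- or two-sided test was used.

```python
from math import comb

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    a: top variants of the trait of interest inside the annotation
    b: top variants of the trait of interest outside the annotation
    c: top variants of other traits inside the annotation
    d: top variants of other traits outside the annotation

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # top variants for the trait of interest
    col1 = a + c          # top variants inside the annotation
    n = a + b + c + d     # all top variants
    denom = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / denom

if __name__ == "__main__":
    # Toy counts, not taken from the study.
    print(fisher_exact_greater(3, 1, 1, 3))
```

A small p-value indicates that top variants for the trait of interest fall inside the annotation more often than expected given the margins.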

We used phenogram [35] to illustrate genes that harbored causal variants for at least two biomarkers to explore possible pleiotropic effects.

Supporting information

S1 Text. Supplementary notes.

(PDF)

S1 Table. Summary of coverage, power and size of 95% credible sets in locus simulations.

(XLSX)

S2 Table. Summary of AUPRC in locus simulations.

(XLSX)

S3 Table. Annotation enrichment weights and G-test p-values from SparsePro in locus simulations.

(XLSX)

S4 Table. Annotation enrichment weights estimated by PAINTOR+ in locus simulations.

(XLSX)

S5 Table. Annotation enrichment weights and G-test p-values from SparsePro in genome-wide simulations.

(XLSX)

S6 Table. Annotation enrichment weights estimated jointly from SparsePro in genome-wide simulations.

(XLSX)

S7 Table. Annotation coefficients estimated by PolyFun in genome-wide simulations.

(XLSX)

S8 Table. Summary of AUPRC in genome-wide simulations.

(XLSX)

S9 Table. Summary of coverage, power and size of 95% credible sets in genome-wide simulations.

(XLSX)

S10 Table. Summary of the relative ratio between the largest and smallest prior inclusion probabilities in genome-wide simulations.

(XLSX)

S11 Table. Annotation enrichment weights and G-test p-values from SparsePro in fine-mapping functional biomarkers.

(XLSX)

S12 Table. Annotation enrichment weights for tissue-specific annotations and G-test p-values from SparsePro-, SparsePro+ and SparsePro+PolyFun in fine-mapping functional biomarkers.

(XLSX)

S13 Table. Percentage of top variants from 95% credible sets annotated to tissue-specific annotations in fine-mapping functional biomarkers.

(XLSX)

S14 Table. Fisher’s exact tests for tissue specificity in fine-mapping functional biomarkers.

(XLSX)

S15 Table. 95% credible sets for eGFR from SparsePro+.

(XLSX)

S16 Table. 95% credible sets for FFR from SparsePro+.

(XLSX)

S17 Table. 95% credible sets for gamma-GT from SparsePro+.

(XLSX)

S18 Table. 95% credible sets for glucose from SparsePro+.

(XLSX)

S19 Table. 95% credible sets for pulse rate from SparsePro+.

(XLSX)

S1 Fig. Annotation enrichment weights estimated by SparsePro in locus simulations.

Each grid corresponds to a different simulation setting of K (number of causal variants) and W (enrichment intensity). Error bars represent 95% confidence intervals for enrichment estimates. Blue dots are estimated values and yellow triangles are simulated values.

(TIFF)

S2 Fig. Annotation enrichment weights estimated by PAINTOR in locus simulations.

Each grid corresponds to a different simulation setting of K (number of causal variants) and W (enrichment intensity). PAINTOR does not provide confidence intervals for enrichment weights. Blue dots are estimated values and yellow triangles are simulated values.

(TIFF)

S3 Fig. Annotation enrichment weights estimated by SparsePro in genome-wide simulations.

Each row represents a different simulation setting with W (enrichment intensity) = 0, 1, or 2. Error bars represent 95% confidence intervals for enrichment estimates. Blue dots are estimated values and yellow triangles are simulated values.

(TIFF)

S4 Fig. Enrichment weights estimated jointly by SparsePro without filtering annotations in genome-wide simulations.

Each row represents a different simulation setting with W (enrichment intensity) = 0, 1, or 2. Error bars represent 95% confidence intervals for enrichment estimates. Blue dots are estimated values and yellow dots are simulated values.

(TIFF)

S5 Fig. Comparison of posterior inclusion probabilities (PIP) obtained using different methods in the simulation setting of W (enrichment intensity) = 1.

True causal variants are colored red and non-causal variants are colored black.

(TIFF)

S6 Fig. Comparison of posterior inclusion probabilities (PIP) obtained using different methods in the simulation setting of W (enrichment intensity) = 0.

True causal variants are colored red and non-causal variants are colored black.

(TIFF)

S7 Fig. Enrichment fold of annotations in fine-mapping functional biomarkers.

Each row denotes an annotation and each column denotes a functional biomarker. Error bars represent 95% confidence intervals for enrichment estimates.

(TIFF)

S8 Fig. Calibration curves for SuSiE and SuSiE with hyperparameters estimated from HESS.

Variants are grouped into five bins according to their PIP values. Each dot represents one bin. The actual precision (y-axis) is plotted against the expected precision (x-axis) calculated by mean PIP values across all variants in the bin.

(TIFF)

S9 Fig. Summary of coverage, power and size of 95% credible sets in different simulation settings for SuSiE, SuSiE with hyperparameters estimated from HESS (SuSiE+HESS) and SuSiE with hyperparameters estimated from HESS and posterior summaries proposed in SparsePro (SuSiE+SparsePro).

(TIFF)

S10 Fig. Fine-mapping the GCKR locus for pulse rate.

(A) GWAS summary statistics for pulse rate at the GCKR locus. (B) Fine-mapping results from SparsePro-. (C) Fine-mapping results from SparsePro+. (D) Fine-mapping results from SparsePro+PolyFun. P-values from GWAS and inferred posterior inclusion probabilities from fine-mapping are illustrated. Variants within a ±500-kb window are colored by their linkage disequilibrium r^2 with rs1260326.

(TIFF)

S11 Fig. Comparison of posterior inclusion probabilities (PIP) in the simulation setting of K (number of causal variants) = 5 and W (enrichment intensity) = 2 between SuSiE and SuSiE with hyperparameters estimated from HESS (SuSiE+HESS).

True causal variants are colored red and non-causal variants are colored black.

(TIFF)

S12 Fig. Comparison of posterior inclusion probabilities (PIP) in the simulation setting of K (number of causal variants) = 5 and W (enrichment intensity) = 2 between SparsePro- (SparsePro without functional annotations) and SuSiE with hyperparameters estimated from HESS (SuSiE+HESS).

True causal variants are colored red and non-causal variants are colored black.

(TIFF)

S13 Fig. Comparison of posterior inclusion probabilities (PIP) in the simulation setting of K (number of causal variants) = 5 and W (enrichment intensity) = 2 between SuSiE with hyperparameters estimated from HESS (SuSiE+HESS) and SuSiE with hyperparameters estimated from HESS and posterior summaries proposed in SparsePro (SuSiE+SparsePro).

True causal variants are colored red and non-causal variants are colored black.

(TIFF)

Acknowledgments

This study has been conducted using UK Biobank Resources under Application Number 45551 and we thank NeuroHub for providing access to data resources. This study was enabled, in part, by support from Calcul Québec and Compute Canada. We thank Dr. Robert Sladek and Dr. Josée Dupuis for helpful discussion and suggestions.

Data Availability

SparsePro is open-access software publicly available at https://github.com/zhwm/SparsePro. The simulation scripts are deposited at https://github.com/zhwm/SparsePro_analysis. Individual-level phenotype and genotype data from the UK Biobank are available upon successful application at https://www.ukbiobank.ac.uk. GCTA was downloaded from https://cnsgenomics.com/software/gcta/bin/gcta_1.93.2beta.zip. FINEMAP was downloaded from http://www.christianbenner.com/finemap_v1.4_x86_64.tgz. SuSiE (version 0.12.16) was installed from CRAN. PolyFun was installed from https://github.com/omerwe/polyfun. UK Biobank LD information was downloaded from https://alkesgroup.broadinstitute.org/UKBB_LD/. Tissue-specific annotation was downloaded from https://alkesgroup.broadinstitute.org/LDSCORE/.

Funding Statement

W.Z. has been supported by a doctoral training fellowship from the FRQNT (319188) and the Healthy Brains, Healthy Lives Program, funded by the Canada First Research Excellence Fund (CFREF), Quebec’s Ministère de l’Économie et de l’Innovation (MEI), and the Fonds de recherche du Québec (FRQS, FRQSC and FRQNT). H.N. holds a Canada Research Chair funded by the Canadian Institutes of Health Research. Y.L. is supported by Natural Sciences and Engineering Research Council (NSERC) Discovery Grant (RGPIN-2019-0621), Fonds de recherche Nature et technologies (FRQNT) New Career (NC-268592), and Canada First Research Excellence Fund Healthy Brains for Healthy Life (HBHL) initiative New Investigator start-up award (G249591). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–209. doi: 10.1038/s41586-018-0579-z
  • 2. Canela-Xandri O, Rawlik K, Tenesa A. An atlas of genetic associations in UK Biobank. Nature Genetics. 2018;50(11):1593–1599. doi: 10.1038/s41588-018-0248-z
  • 3. Loh PR, Kichaev G, Gazal S, Schoech AP, Price AL. Mixed-model association for biobank-scale datasets. Nature Genetics. 2018;50(7):906–908. doi: 10.1038/s41588-018-0144-6
  • 4. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics. 2017;101(1):5–22. doi: 10.1016/j.ajhg.2017.06.005
  • 5. Schaid DJ, Chen W, Larson NB. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics. 2018;19(8):491–504. doi: 10.1038/s41576-018-0016-z
  • 6. Stranger BE, Stahl EA, Raj T. Progress and promise of genome-wide association studies for human complex trait genetics. Genetics. 2011;187(2):367–383. doi: 10.1534/genetics.110.120907
  • 7. Benner C, Havulinna AS, Järvelin MR, Salomaa V, Ripatti S, Pirinen M. Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. The American Journal of Human Genetics. 2017;101(4):539–551. doi: 10.1016/j.ajhg.2017.08.012
  • 8. Spain SL, Barrett JC. Strategies for fine-mapping complex traits. Human Molecular Genetics. 2015;24(R1):R111–R119. doi: 10.1093/hmg/ddv260
  • 9. Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLOS Genetics. 2007;3(7):e114. doi: 10.1371/journal.pgen.0030114
  • 10. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. Genetics. 2014;198(2):497–508. doi: 10.1534/genetics.114.167908
  • 11. Chen W, Larrabee BR, Ovsyannikova IG, Kennedy RB, Haralambieva IH, Poland GA, et al. Fine mapping causal variants with an approximate Bayesian method using marginal test statistics. Genetics. 2015;200(3):719–736. doi: 10.1534/genetics.115.176107
  • 12. Benner C, Spencer CC, Havulinna AS, Salomaa V, Ripatti S, Pirinen M. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics. 2016;32(10):1493–1501. doi: 10.1093/bioinformatics/btw018
  • 13. Wang G, Sarkar A, Carbonetto P, Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2020;82(5):1273–1300. doi: 10.1111/rssb.12388
  • 14. Zou Y, Carbonetto P, Wang G, Stephens M. Fine-mapping from summary data with the “Sum of Single Effects” model. PLOS Genetics. 2022;18(7):e1010299. doi: 10.1371/journal.pgen.1010299
  • 15. Kichaev G, Yang WY, Lindstrom S, Hormozdiari F, Eskin E, Price AL, et al. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLOS Genetics. 2014;10(10):e1004722. doi: 10.1371/journal.pgen.1004722
  • 16. Wen X. Molecular QTL discovery incorporating genomic annotations using Bayesian false discovery rate control. The Annals of Applied Statistics. 2016; p. 1619–1638.
  • 17. Weissbrod O, Hormozdiari F, Benner C, Cui R, Ulirsch J, Gazal S, et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nature Genetics. 2020;52(12):1355–1363. doi: 10.1038/s41588-020-00735-5
  • 18. Titsias M, Lázaro-Gredilla M. Spike and slab variational inference for multi-task and multiple kernel learning. Advances in Neural Information Processing Systems. 2011;24:2339–2347.
  • 19. Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478(7370):476–482. doi: 10.1038/nature10530
  • 20. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research. 2010;38(16):e164–e164. doi: 10.1093/nar/gkq603
  • 21. Chen J, Spracklen CN, Marenne G, Varshney A, Corbin LJ, Luan J, et al. The trans-ancestral genomic architecture of glycemic traits. Nature Genetics. 2021;53(6):840–860. doi: 10.1038/s41588-021-00852-9
  • 22. Huang LO, Rauch A, Mazzaferro E, Preuss M, Carobbio S, Bayrak CS, et al. Genome-wide discovery of genetic loci that uncouple excess adiposity from its comorbidities. Nature Metabolism. 2021;3(2):228–243. doi: 10.1038/s42255-021-00346-2
  • 23. Vuckovic D, Bao EL, Akbari P, Lareau CA, Mousas A, Jiang T, et al. The polygenic and monogenic basis of blood traits and diseases. Cell. 2020;182(5):1214–1231. doi: 10.1016/j.cell.2020.08.008
  • 24. Chen VL, Du X, Chen Y, Kuppa A, Handelman SK, Vohnoutka RB, et al. Genome-wide association study of serum liver enzymes implicates diverse metabolic and liver pathology. Nature Communications. 2021;12(1):1–13. doi: 10.1038/s41467-020-20870-1
  • 25. Pazoki R, Vujkovic M, Elliott J, Evangelou E, Gill D, Ghanbari M, et al. Genetic analysis in European ancestry individuals identifies 517 loci associated with liver enzymes. Nature Communications. 2021;12(1):1–12. doi: 10.1038/s41467-021-22338-2
  • 26. Bell S, Rigas AS, Magnusson MK, Ferkingstad E, Allara E, Bjornsdottir G, et al. A genome-wide meta-analysis yields 46 new loci associating with biomarkers of iron homeostasis. Communications Biology. 2021;4(1):1–14. doi: 10.1038/s42003-020-01575-z
  • 27. Shi H, Kichaev G, Pasaniuc B. Contrasting the genetic architecture of 30 complex traits from summary association data. The American Journal of Human Genetics. 2016;99(1):139–153. doi: 10.1016/j.ajhg.2016.05.013
  • 28. Trynka G, Sandor C, Han B, Xu H, Stranger BE, Liu XS, et al. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nature Genetics. 2013;45(2):124–130. doi: 10.1038/ng.2504
  • 29. Hnisz D, Abraham BJ, Lee TI, Lau A, Saint-André V, Sigova AA, et al. Super-enhancers in the control of cell identity and disease. Cell. 2013;155(4):934–947. doi: 10.1016/j.cell.2013.09.053
  • 30. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics. 2011;88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011
  • 31. Jiang L, Zheng Z, Qi T, Kemper KE, Wray NR, Visscher PM, et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nature Genetics. 2019;51(12):1749–1755. doi: 10.1038/s41588-019-0530-8
  • 32. Hoffman MM, Ernst J, Wilder SP, Kundaje A, Harris RS, Libbrecht M, et al. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Research. 2013;41(2):827–841. doi: 10.1093/nar/gks1284
  • 33. Villar D, Berthelot C, Aldridge S, Rayner TF, Lukk M, Pignatelli M, et al. Enhancer evolution across 20 mammalian species. Cell. 2015;160(3):554–566. doi: 10.1016/j.cell.2015.01.006
  • 34. Finucane HK, Reshef YA, Anttila V, Slowikowski K, Gusev A, Byrnes A, et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics. 2018;50(4):621–629. doi: 10.1038/s41588-018-0081-4
  • 35. Wolfe D, Dudek S, Ritchie MD, Pendergrass SA. Visualizing genomic information across chromosomes with PhenoGram. BioData Mining. 2013;6(1):1–12. doi: 10.1186/1756-0381-6-18

Decision Letter 0

Xiaofeng Zhu, Gao Wang

15 Mar 2023

Dear Dr Zhang,

Thank you very much for submitting your Methods entitled 'SparsePro: an efficient fine-mapping method integrating summary statistics and functional annotations' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. The main concern raised by all reviewers is the strong similarity between the proposed variational inference framework and the SuSiE model (Wang et al., 2020). Reviewers are unsure whether the subtle changes made to SuSiE in the proposed framework are a misinterpretation of the original model or have well-justified reasons. To address this concern, the editors suggest submitting a new manuscript that uses the current work as a starting point. This new manuscript should explicitly connect the proposed framework to the SuSiE model and algorithm, explain the motivation for modifications made to SuSiE before incorporating annotations, and provide comments on why these changes are necessary. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We therefore suggest major revision with these important details clarified. We cannot, of course, promise publication at that time. 

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Gao Wang

Guest Editor

PLOS Genetics

Xiaofeng Zhu

Section Editor

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Zhang et al. present their extension of the SuSiE model to incorporate functional annotations. SparsePro prevents irrelevant annotations from contaminating the model by testing the functional annotations before integrating them into the model. It estimates the enrichment coefficients within the IBSS algorithm. Moreover, it estimates some hyperparameters outside the IBSS algorithm to avoid convergence issues. The manuscript presents simulation results to support the method and also shows application results on UK Biobank traits. It is well written. As SuSiE becomes a popular approach for fine-mapping, it is desirable to have a version supporting functional annotations.

Major comments:

1. The SparsePro model is essentially the SuSiE model with functional annotations, with some changes to the prior and residual variance estimation. As most readers are familiar with SuSiE, I suggest using the terminology and concepts from SuSiE in the manuscript, for example, replacing 'effect group' with 'credible set' and using a credible set definition similar to that in SuSiE. The derivations in supplementary section 1 before annotation estimation are the same as in SuSiE, but the SuSiE paper is not cited.

2. The statement on line 214 said 'in SuSiE, Bayes factors were normalized by sum of Bayes factors … which increased power for identifying causal variants.' This statement is not accurate. The softmax is the same as normalizing weighted Bayes factors (weighted by the prior probability \pi_g). So the log probability in the third equation on page 37 is the same as log(\pi_g * Bayes factor). This is not the cause of SparsePro's higher power in the simulation. The higher power of SparsePro- may be caused by estimating hyperparameters outside the iterative algorithm and by the different PIP definition.

3. The PIP for each variant is computed as the maximum among groups. Why not compute it as 1 - \prod_k (1 - \gamma_{kg}), which is the theoretical definition of PIP?

4. From lines 310-313, it seems that a 95% causal set (one set containing all causal variants) is used in the set-level comparisons, not 95% credible sets as defined in SuSiE. Is there any reason for not using credible sets for comparisons? SparsePro, SuSiE and FINEMAP all output 95% credible sets. It is unnecessary to combine them into one causal set. I suggest conducting set-level comparisons using 95% credible sets (coverage / power / size of each single 95% credible set).

5. The method estimates the enrichment coefficients using 'one-at-a-time' coordinate ascent. Is this the same as jointly estimating the coefficients? Is it possible to estimate them jointly? For your reference, TORUS (Wen, X., 2016) estimates the coefficients jointly using EM.

Minor comments:

1. The credible sets from SuSiE have a purity > 0.5 filter by default. Is the same purity filter applied in other methods? If not, the comparison is unfair.

2. The simulation has a per-variant heritability 10^(-4). Does this mean '--simu-hsq' is K * 10^(-4) in GCTA simulation? Or does it mean the simulated causal variants have similar effect size?

3. The relationship between the simulation parameter W and the relative enrichment vector w in section 4.1 page 10 is unclear.

4. Line 242 'causal effect sizes may vary across different subpopulations', where does the subpopulation come from? This is just single study fine-mapping.

5. Do the computational times in figures include estimating \tau_\beta and \tau_y and testing of functional annotations?

6. Any reason to use log20 for entropy difference cutoff? Any reason to use 10^(-5) p value threshold for G-test?

7. In the genome-wide simulation, the results in the 1-Mb center of each 3-Mb window are considered. But a signal could also be at or close to the boundary of the 1-Mb center. How are these signals analyzed?

8. What's the coefficient w scale in Fig S7?

Reviewer #2: Comment,

Zhang et al. proposed an interesting enhancement of the SuSiE model proposed by Wang et al. JRSSB 2021 to perform informed fine-mapping using side information/annotation. The authors show some convincing evidence that their approach (SparsePro) has greater power than some competing methods, such as SuSiE+Polyfun. Overall the approach is sound, and I am mostly positive regarding the manuscript. The level of detail provided by the authors is satisfactory. Still, some typos are disrupting the overall quality of the manuscript as well as some statements that need to be clarified.

Major comment:

1) While most of the manuscript reads quite well, there are a couple of sentences that do not flow well or are grammatically incorrect, e.g., in the last paragraph of the introduction, the authors wrote

"In line with the idea of grouping correlated variants together into effect groups, we proposed Sparse Projections to Causal Effects (SparsePro) to further improve fine-mapping efficiency and accuracy. First, within each effect group, we additionally incorporate" in this case, first needs to be followed by a sentence with an active verb before additionally. I am not a native English speaker, and I am aware that it can be hard to draft a manuscript. However, before considering the manuscript for acceptance, the authors need to proofread the manuscript to correct some of these problematic sentences.

2) The authors wrote: "Second, we use an efficient variational inference algorithm to further simplify the intuitive algorithm proposed in SuSiE and improve computation efficiency." I have had a close look at the algorithm proposed by Zhang. While the authors do not explicitly compute the marginal Bayes factor for each variant, given that we do not use any annotation, the coordinate ascent seems very similar to the one proposed by Wang et al.

As Wang and colleagues provided the complexity of each coordinate ascent update, I think it would be interesting for the authors to provide the complexity of each coordinate ascent update of SparsePro. This would make it more explicit that the gain in computational speed is not only due to the implementation, because for the moment it is not clear to me how this approach differs from the IBSS. Perhaps the authors could elaborate on the computational complexity of their VA of the Single Effect Model proposed by Wang et al.

3) I would be interested in seeing a set of simulations in which the annotations are misspecified or measured with noise, with a potential bias toward non-causal SNPs. While substantial efforts are made to obtain high-quality annotations, it is not unlikely that many of them are poorly measured or biased. I would like to see simulations in which the authors consider poorly measured annotations, and whether these can generate low-coverage credible sets. My overall question is: given that you have some annotations, is it worth including them in a fine-mapping procedure, or could that potentially "harm" your results? Could you generate a new set of simulations in which annotations are measured with noise? Furthermore, could you try to come up with a set of simulations that could lead to problematic coverage due to the annotations? For example, suppose that you use the following annotations (constructed to be problematic) in a model with K causal SNPs. Define K annotations as follows: for each non-causal SNP, set annotation k to its correlation with causal SNP k; for each causal SNP, set each of its K annotations by sampling a random number between 0 and 1.
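The problematic annotation in comment 3 can be sketched as follows. This is only an illustration of the construction described above, not code from SparsePro: the function name `adversarial_annotations`, the toy LD matrix, and the choice of a uniform draw on [0, 1) for causal SNPs are assumptions made for the example.

```python
import random

def adversarial_annotations(ld, causal_idx, seed=0):
    """Build K deliberately misleading annotations (hypothetical helper).

    ld         : square LD (correlation) matrix as a list of lists
    causal_idx : indices of the K causal SNPs
    Returns one row per SNP, each row holding K annotation values.
    """
    rng = random.Random(seed)
    causal = set(causal_idx)
    annotations = []
    for snp in range(len(ld)):
        if snp in causal:
            # Causal SNPs receive uninformative noise in [0, 1).
            row = [rng.random() for _ in causal_idx]
        else:
            # Non-causal SNPs: annotation k is the correlation with causal SNP k,
            # so tagging variants in high LD look strongly "annotated".
            row = [ld[snp][c] for c in causal_idx]
        annotations.append(row)
    return annotations

# Toy 4-SNP locus with two LD blocks; SNPs 0 and 2 are taken as causal.
ld = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
annot = adversarial_annotations(ld, causal_idx=[0, 2])
```

Feeding such a matrix to an annotation-informed fine-mapping run, and comparing credible-set coverage with the annotation-free run, would address the question of whether annotations can harm the results.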

Minor:

1) There is a typo after equation 2 in the supplement for the condition variational approximation. It is written $s_{kg=1}$ whereas it should be $s_{kg}=1$; please go through the equations to correct the other typos.

2) The equation below "Therefore, the posterior probability of the gth variant being causal in the kth effect group can be estimated as:" does not seem correct. The input of the softmax function is a scalar, whereas it should be a vector. The posterior probability of the gth variant being causal in the kth effect group should be the gth component of this softmax.

3) In the SuSiE-rss manuscript, Zou and colleagues spend a substantial amount of work dealing with problematic LD. I would like the authors to state explicitly what is implemented in SparsePro to circumvent problems related to LD matrices that are not full rank, or to allele flipping.

4) Could you show whether the G-test used for testing the annotations correctly controls the type I error?
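As a rough illustration of the check requested in minor comment 4, the sketch below simulates a null in which annotation and selection are independent and counts how often a generic G-test rejects at the 5% level. It does not reproduce SparsePro's own test statistic: the 2x2 table layout, the function names, and all simulation settings are assumptions for the example.

```python
import math
import random

def g_test_2x2(table):
    """Generic G-test (likelihood-ratio test) of independence on a 2x2 count table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][j] + table[1][j] for j in range(2)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            obs = table[i][j]
            if obs > 0:
                expected = row_tot[i] * col_tot[j] / n
                g += 2.0 * obs * math.log(obs / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function with 1 degree of freedom.
    return g, math.erfc(math.sqrt(g / 2.0))

def null_rejection_rate(n_snps=1000, p_annot=0.3, p_sel=0.1,
                        reps=300, alpha=0.05, seed=1):
    """Fraction of null replicates (annotation independent of selection) rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        table = [[0, 0], [0, 0]]
        for _ in range(n_snps):
            annotated = int(rng.random() < p_annot)
            selected = int(rng.random() < p_sel)
            table[annotated][selected] += 1
        _, p_value = g_test_2x2(table)
        rejections += p_value < alpha
    return rejections / reps

rate = null_rejection_rate()
```

A rejection rate close to the nominal level under such a null is what "correctly controlling the type I error" would look like; in practice the null replicates would come from the fine-mapping pipeline itself rather than from independent coin flips.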

Reviewer #3: The authors propose a method for estimating the hyperparameters of the sum of single effects regression. In particular, they leverage functional annotation enrichments to specify informative prior inclusion probabilities for different variants, and leverage heritability estimates to set the effect size and residual variance hyperparameters. By incorporating this enrichment information they are able to demonstrate improvement over fine-mapping methods that either do not leverage functional annotations, or leverage functional annotations through different means (e.g. PolyFun, which computes prior inclusion probabilities given partitioned heritability estimates across a set of annotations). These are important contributions, as the selection of these hyperparameters can greatly influence the calibration and power of fine-mapping.

While I am generally positive about the work put forth here, my main recommendations are to focus the discussion and commentary on the benefits of including functional annotations, and to provide a clearer rationale for the use of heritability information to set the effect and error variance hyperparameters. The authors should modify their discussion of the algorithmic differences between SuSiE and SparsePro because it does not seem accurate: as far as I can tell, for a fixed set of hyperparameters (prior inclusion probability, effect variance, residual variance) the coordinate ascent variational inference (CAVI) employed for SparsePro and SuSiE's IBSS algorithm (which is also CAVI) are the same.

**Estimating prior inclusion probabilities** (SparsePro vs PolyFun) To my understanding, PolyFun provides a heuristic for forming the prior inclusion probabilities based on a heritability partition. SparsePro takes a less heuristic approach by directly estimating enrichment of selected variants, and using those enrichments to refine the posterior approximations made by SuSiE. I very much support this approach, and the authors successfully show through extensive simulations how their method improves on the heuristic approach to estimating the prior inclusion probabilities developed in PolyFun.

**Estimating variance hyperparameters** (SparsePro vs SuSiE) SuSiE uses a variational empirical Bayes approach to estimate the effect variance and residual variance; this just means optimizing the objective with respect to these hyperparameters. In contrast, SparsePro fixes these hyperparameters to values informed by heritability estimates. For example, the residual variance is set to 1-h2.

The residual variance is fixed to 1 - h2 where h2 is a locus level heritability estimate. In contrast, a conservative approach would be to set the residual variance to 1. I'm concerned that setting the residual variance to 1-h2 may disrupt calibration of the posterior. Basically, while h2 is an estimate of the heritability in the locus, finemapped association signals will only explain a portion of this heritability. 1-h2 may be too small, and encourage the model to select variants in a way that is anti-conservative.
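The concern can be made concrete with a single-variant sketch. It assumes standardized genotypes and phenotype, so that the sampling variance of a marginal effect estimate is roughly the residual variance divided by the sample size, and it uses a Wakefield-style approximate Bayes factor. The function name and all numeric values are illustrative assumptions, not settings from the manuscript.

```python
import math

def log_abf(beta_hat, n, resid_var, prior_var):
    """Log approximate Bayes factor (alternative vs null) for one variant.

    Assumes standardized data, so Var(beta_hat) is about resid_var / n.
    """
    v = resid_var / n                      # sampling variance of beta_hat
    z2 = beta_hat ** 2 / v                 # squared z-score
    shrink = prior_var / (v + prior_var)   # posterior shrinkage factor
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z2 * shrink

beta_hat, n, prior_var = 0.02, 50_000, 0.001   # illustrative values only
h2_locus = 0.1                                 # deliberately large, for visibility

conservative = log_abf(beta_hat, n, resid_var=1.0, prior_var=prior_var)
heritability_informed = log_abf(beta_hat, n, resid_var=1.0 - h2_locus,
                                prior_var=prior_var)
```

For a variant with a sizeable z-score, lowering the residual variance inflates the z-score and hence the Bayes factor, which is the mechanism by which an underestimated residual variance could make the posterior overconfident.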

**The variational approximations are identical** The discussion and supplemental materials emphasize the differences in computation between SuSiE's IBSS algorithm and the variational updates derived in this paper. However, it is important to note that the variational approximations for SparsePro and SuSiE are identical: $q(\beta, S) = \prod_k q(\beta_k, s_k)$. Consequently, all differences in performance between SuSiE and SparsePro- (without annotation) should be explained by (1) differences in the hyperparameters/hyperparameter estimation procedure and (2) implementation details (e.g. convergence criteria, order of coordinate updates, etc., which may influence which local optimum of the variational objective is found).

In particular, the following does not seem correct (lines 214:215): "In SuSiE, Bayes factors were normalized by sum of Bayes factors across all variants while SparsePro uses the softmax function to normalize posterior probabilities which increased power for identifying causal variants". I believe the marginal log Bayes factors are equal (up to a constant) to the posterior (log) probabilities referenced here (the 4th expression in the supplementary material, page 27). Thus normalizing Bayes factors is equivalent to applying softmax to the log probabilities.
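The equivalence is easy to verify numerically. The sketch below uses made-up log Bayes factors and a uniform prior; the function names are hypothetical. It normalizes prior-weighted Bayes factors directly, and separately applies a softmax to the vector of log Bayes factors plus log priors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a vector of log-scale scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize_bayes_factors(log_bfs, priors):
    """Posterior inclusion weights by normalizing prior-weighted Bayes factors."""
    weighted = [p * math.exp(lb) for lb, p in zip(log_bfs, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

log_bfs = [0.3, 5.1, 4.8, -0.2]   # illustrative per-variant log Bayes factors
priors = [0.25] * 4               # uniform prior inclusion probabilities

by_normalizing = normalize_bayes_factors(log_bfs, priors)
by_softmax = softmax([lb + math.log(p) for lb, p in zip(log_bfs, priors)])
```

The two vectors agree up to floating-point error. Note that the softmax takes the whole vector over variants as input, and the posterior probability for variant g is its gth component.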

**Suggested Revisions**

- Clarify the similarities and differences between SparsePro and SuSiE. I believe the variational approximations are the same, but the real contributions here are the annotation- and heritability-informed hyperparameter settings, which are important contributions that can stand on their own.

- SparsePro- and SuSiE should be identical up to the setting of the effect variance and residual variance hyperparameters. Commentary attempting to explain the difference in performance between SuSiE and SparsePro- should be revised, because at times it implies a difference in the algorithm/optimization procedure, which does not seem to be correct.

- Please discuss/justify the heritability-based estimates for effect variance and residual variance. In particular, I am concerned that using 1-h2 for the residual variance will make the algorithm anti-conservative (see above) by underestimating the residual variance in the regression problem.

- Assess the calibration of PIPs for SparsePro+ (e.g. Figure S1 in the SuSiE manuscript). The AUC plots tell us that ranking variants by PIP is good, but they do not tell us that thresholding at some nominal PIP value controls the false positive rate. Good PIP calibration would go a long way in addressing my concerns about the choice of residual variance parameter.
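The calibration assessment in the last point can be sketched as follows: variants are grouped into equal-width PIP bins, and the mean PIP in each bin (expected precision) is compared with the observed fraction of causal variants. The helper name `calibration_bins` and the synthetic data, in which causal status is drawn with probability equal to the PIP so that the PIPs are calibrated by construction, are assumptions for illustration.

```python
import random

def calibration_bins(pips, is_causal, n_bins=5):
    """Return (mean PIP, observed causal fraction, count) per non-empty PIP bin."""
    bins = [[] for _ in range(n_bins)]
    for pip, causal in zip(pips, is_causal):
        idx = min(int(pip * n_bins), n_bins - 1)  # PIP == 1.0 goes in the last bin
        bins[idx].append((pip, causal))
    curve = []
    for members in bins:
        if members:
            expected = sum(p for p, _ in members) / len(members)
            observed = sum(c for _, c in members) / len(members)
            curve.append((expected, observed, len(members)))
    return curve

# Synthetic, perfectly calibrated example: P(causal) equals the PIP.
rng = random.Random(0)
pips = [rng.random() for _ in range(20_000)]
is_causal = [rng.random() < p for p in pips]
curve = calibration_bins(pips, is_causal)
```

For well-calibrated PIPs the points (expected, observed) lie near the diagonal; bins where the observed fraction falls clearly below the mean PIP would indicate anti-conservative behaviour.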

**Minor points**

- Maybe a simpler enrichment analysis for the UKBB biomarkers would be to compare (1) causal variants in this phenotype vs (2) causal variants discovered in other phenotypes. It would more clearly highlight that the enrichment of the tissue-specific annotation in the relevant biomarker is above and beyond the background level of enrichment across causal variants discovered in all phenotypes.

- Were causal variants defined as the top variant per credible set or all variants in the credible set?

- It is not clear to me which annotations are used in the UKBB biomarkers analysis. This should be clearly stated in 4.4 or methods.

- To clarify the comparison between PolyFun and SparsePro, it may be good to (1) run SparsePro with the prior inclusion probabilities derived from PolyFun and (2) fit SparsePro with the exact same annotations used in PolyFun (without screening annotations based on significance first). (1) vs SparsePro+ would demonstrate that SparsePro+ is making better use of the annotation information. (2) vs SparsePro+ would emphasize the benefit of selecting annotations based on the G-test.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Attachment

Submitted filename: sparsepro_review.pdf

Decision Letter 1

Xiaofeng Zhu

29 Aug 2023

Dear Dr Zhang,

Thank you very much for submitting your Research Article entitled 'SparsePro: an efficient fine-mapping method integrating summary statistics and functional annotations' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you to address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Xiaofeng Zhu

Section Editor

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Thanks for addressing the issues. I have the following follow-up questions:

1. SparsePro uses a different formulation, but the underlying model is the same as SuSiE's. The K effect groups are the same as the K single effects. So I suggest removing the discussion on 'Equivalence between the SuSiE IBSS algorithm and a paired mean field variational inference algorithm'. The main contributions of the manuscript lie in the functional annotation, hyperparameter estimation, and posterior summary.

2. FINEMAP provides credible sets in the output cred file.

3. In the supplementary note, line 82, the authors said 'it might be challenging to find the appropriate threshold' for purity. However, it is important to note that the threshold for entropy (log(20)) is also arbitrary. There could be more than 50 highly correlated variants in complex regions. Does this threshold, 20, correspond to a purity level in your simulations? Could you summarize the purity for the output CSs? What does the result look like if SuSiE uses the corresponding purity filter?

4. I'm still unclear about the CSs around the boundary of the central 1Mb region. A 3Mb window has 3 parts: left 1Mb, central 1Mb, and right 1Mb. The result for the central 1Mb part is used. What about a CS with SNPs at the right end of the central 1Mb and the left end of the right 1Mb? How do you handle such results?

5. Is $\tau_\beta$ the same for all effect groups? SuSiE allows different effect priors for each single effect.

6. Extracting information from the large supplementary table S1 according to lines 100-107 is challenging. Consider presenting these results in a figure format or incorporating them into Figure 3 for better clarity.

7. What's the largest K used when fitting the SparsePro model in simulations and applications? What are the parameter settings for SuSiE, FINEMAP and PAINTOR?
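To make the two quantities in question 3 concrete, the sketch below computes the entropy of an effect group's posterior over variants and the purity of a set of variants, taken here as the minimum absolute pairwise correlation (the definition used for SuSiE credible sets). How SparsePro applies its entropy threshold is not reproduced; the function names and the toy LD matrix are assumptions for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a posterior distribution over variants."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def purity(ld, members):
    """Minimum absolute pairwise correlation among the variants in a set."""
    pairs = [(i, j) for i in members for j in members if i < j]
    if not pairs:
        return 1.0  # a single-variant set is perfectly pure
    return min(abs(ld[i][j]) for i, j in pairs)

# A posterior spread evenly over 20 variants has entropy exactly log(20).
uniform_20 = [1.0 / 20] * 20
h_uniform = entropy(uniform_20)

# A posterior concentrated on two variants has much lower entropy.
h_concentrated = entropy([0.6, 0.38, 0.01, 0.01])

ld = [
    [1.0, 0.95, 0.3],
    [0.95, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
tight_set = purity(ld, [0, 1])
loose_set = purity(ld, [0, 1, 2])
```

Entropy depends only on how the posterior mass is spread, whereas purity depends only on the LD among set members, so a fixed entropy cutoff need not map to a single purity level. This is why summarizing purity for the reported credible sets would be informative.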

Reviewer #2: I am positive about publishing this manuscript in PLOS Genetics.

I would like to apologize to the authors for having taken a long time to read through their revision; I have tried to do it seriously when I had the time to do so. I think the authors answered my concerns, as well as the other reviewers' concerns, in a satisfactory way and put a substantial amount of work into improving the manuscript.

Reviewer #3: Uploaded as attachment.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Attachment

Submitted filename: Sparsepro_Revision_Review.pdf

Decision Letter 2

Xiaofeng Zhu, Gao Wang

11 Dec 2023

Dear Dr Zhang,

We are pleased to inform you that your manuscript entitled "SparsePro: an efficient fine-mapping method integrating summary statistics and functional annotations" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Gao Wang

Guest Editor

PLOS Genetics

Xiaofeng Zhu

Section Editor

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Thanks for the response. I don't have any additional concerns.

Reviewer #2: While I think that the points raised by the other reviewers are interesting, from my perspective the manuscript is still good enough for publication in PLOS Genetics.

Reviewer #3: I thank the authors for addressing the issues that were raised. The authors have answered my concerns and those of the other reviewers. The paper makes an important contribution by providing a way to incorporate annotations into SuSiE-style fine-mapping. I support acceptance of the paper.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-23-00072R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Xiaofeng Zhu, Gao Wang

21 Dec 2023

PGENETICS-D-23-00072R2

SparsePro: an efficient fine-mapping method integrating summary statistics and functional annotations

Dear Dr Zhang,

We are pleased to inform you that your manuscript entitled "SparsePro: an efficient fine-mapping method integrating summary statistics and functional annotations" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    Supplementary Materials

    S1 Text. Supplementary notes.

    (PDF)

    S1 Table. Summary of coverage, power and size of 95% credible sets in locus simulations.

    (XLSX)

    S2 Table. Summary of AUPRC in locus simulations.

    (XLSX)

    S3 Table. Annotation enrichment weights and G-test p-values from SparsePro in locus simulations.

    (XLSX)

    S4 Table. Annotation enrichment weights estimated by PAINTOR+ in locus simulations.

    (XLSX)

    S5 Table. Annotation enrichment weights and G-test p-values from SparsePro in genome-wide simulations.

    (XLSX)

    S6 Table. Annotation enrichment weights estimated jointly from SparsePro in genome-wide simulations.

    (XLSX)

    S7 Table. Annotation coefficients estimated by PolyFun in genome-wide simulations.

    (XLSX)

    S8 Table. Summary of AUPRC in genome-wide simulations.

    (XLSX)

    S9 Table. Summary of coverage, power and size of 95% credible sets in genome-wide simulations.

    (XLSX)

    S10 Table. Summary of the relative ratio between the largest and smallest prior inclusion probabilities in genome-wide simulations.

    (XLSX)

    S11 Table. Annotation enrichment weights and G-test p-values from SparsePro in fine-mapping functional biomarkers.

    (XLSX)

    S12 Table. Annotation enrichment weights for tissue-specific annotations and G-test p-values from SparsePro-, SparsePro+ and SparsePro+PolyFun in fine-mapping functional biomarkers.

    (XLSX)

    S13 Table. Percentage of top variants from 95% credible sets annotated to tissue-specific annotations in fine-mapping functional biomarkers.

    (XLSX)

    S14 Table. Fisher’s exact test for tissue specificity in fine-mapping functional biomarkers.

    (XLSX)

    S15 Table. 95% credible sets for eGFR from SparsePro+.

    (XLSX)

    S16 Table. 95% credible sets for FFR from SparsePro+.

    (XLSX)

    S17 Table. 95% credible sets for gamma-GT from SparsePro+.

    (XLSX)

    S18 Table. 95% credible sets for glucose from SparsePro+.

    (XLSX)

    S19 Table. 95% credible sets for pulse rate from SparsePro+.

    (XLSX)

    S1 Fig. Annotation enrichment weights estimated by SparsePro in locus simulations.

    Each grid corresponds to a different simulation setting of K (number of causal variants) and W (enrichment intensity). Error bars represent 95% confidence intervals for enrichment estimates. Blue dots are estimated values and yellow triangles are simulated values.

    (TIFF)

    S2 Fig. Annotation enrichment weights estimated by PAINTOR in locus simulations.

    Each grid corresponds to a different simulation setting of K (number of causal variants) and W (enrichment intensity). PAINTOR does not provide confidence intervals for enrichment weights. Blue dots are estimated values and yellow triangles are simulated values.

    (TIFF)

    S3 Fig. Annotation enrichment weights estimated by SparsePro in genome-wide simulations.

    Each row represents a different simulation setting with W (enrichment intensity) = 0, 1, or 2. Error bars represent 95% confidence intervals for enrichment estimates. Blue dots are estimated values and yellow triangles are simulated values.

    (TIFF)

    S4 Fig. Enrichment weights estimated jointly by SparsePro without filtering annotations in genome-wide simulations.

    Each row represents a different simulation setting with W (enrichment intensity) = 0, 1, or 2. Error bars represent 95% confidence intervals for enrichment estimates. Blue dots are estimated values and yellow dots are simulated values.

    (TIFF)

    S5 Fig. Comparison of posterior inclusion probabilities (PIP) obtained using different methods in the simulation setting of W (enrichment intensity) = 1.

    True causal variants are colored red and non-causal variants are colored black.

    (TIFF)

    S6 Fig. Comparison of posterior inclusion probabilities (PIP) obtained using different methods in the simulation setting of W (enrichment intensity) = 0.

    True causal variants are colored red and non-causal variants are colored black.

    (TIFF)

    S7 Fig. Enrichment fold of annotations in fine-mapping functional biomarkers.

    Each row denotes an annotation and each column denotes a functional biomarker. Error bars represent 95% confidence intervals for enrichment estimates.

    (TIFF)

    S8 Fig. Calibration curves for SuSiE and SuSiE with hyperparameters estimated from HESS.

    Variants are grouped into five bins according to their PIP values. Each dot represents one bin. The actual precision (y-axis) is plotted against the expected precision (x-axis) calculated by mean PIP values across all variants in the bin.

    (TIFF)

    S9 Fig. Summary of coverage, power and size of 95% credible sets in different simulation settings for SuSiE, SuSiE with hyperparameters estimated from HESS (SuSiE+HESS) and SuSiE with hyperparameters estimated from HESS and posterior summaries proposed in SparsePro (SuSiE+SparsePro).

    (TIFF)

    S10 Fig. Fine-mapping the GCKR locus for pulse rate.

    (A) GWAS summary statistics for pulse rate at the GCKR locus. (B) Fine-mapping results from SparsePro-. (C) Fine-mapping results from SparsePro+. (D) Fine-mapping results from SparsePro+PolyFun. P-values from GWAS and inferred posterior inclusion probabilities from fine-mapping are illustrated. Variants within a ±500kb window are colored by their linkage disequilibrium r2 with rs1260326.

    (TIFF)

    S11 Fig. Comparison of posterior inclusion probabilities (PIP) in the simulation setting of K (number of causal variants) = 5 and W (enrichment intensity) = 2 between SuSiE and SuSiE with hyperparameters estimated from HESS (SuSiE+HESS).

    True causal variants are colored red and non-causal variants are colored black.

    (TIFF)

    S12 Fig. Comparison of posterior inclusion probabilities (PIP) in the simulation setting of K (number of causal variants) = 5 and W (enrichment intensity) = 2 between SparsePro- (SparsePro without functional annotations) and SuSiE with hyperparameters estimated from HESS (SuSiE+HESS).

    True causal variants are colored red and non-causal variants are colored black.

    (TIFF)

    S13 Fig. Comparison of posterior inclusion probabilities (PIP) in the simulation setting of K (number of causal variants) = 5 and W (enrichment intensity) = 2 between SuSiE with hyperparameters estimated from HESS (SuSiE+HESS) and SuSiE with hyperparameters estimated from HESS and posterior summaries proposed in SparsePro (SuSiE+SparsePro).

    True causal variants are colored red and non-causal variants are colored black.

    (TIFF)

    Attachment

    Submitted filename: sparsepro_review.pdf

    Attachment

    Submitted filename: plos_genetics_response_letter_20230628.pdf

    Attachment

    Submitted filename: Sparsepro_Revision_Review.pdf

    Attachment

    Submitted filename: plos_genetics_second_response_letter.pdf

    Data Availability Statement

    SparsePro is open-access software publicly available at https://github.com/zhwm/SparsePro. The simulation scripts are deposited at https://github.com/zhwm/SparsePro_analysis. Individual-level phenotype and genotype data from the UK Biobank are available upon successful application at https://www.ukbiobank.ac.uk. GCTA was downloaded from https://cnsgenomics.com/software/gcta/bin/gcta_1.93.2beta.zip. FINEMAP was downloaded from http://www.christianbenner.com/finemap_v1.4_x86_64.tgz. SuSiE (version 0.12.16) was installed from CRAN. PolyFun was installed from https://github.com/omerwe/polyfun. UK Biobank LD information was downloaded from https://alkesgroup.broadinstitute.org/UKBB_LD/. Tissue-specific annotation was downloaded from https://alkesgroup.broadinstitute.org/LDSCORE/.


    Articles from PLOS Genetics are provided here courtesy of PLOS
