A more accurate method for colocalisation analysis allowing for multiple causal variants

Chris Wallace

PLoS Genet. 2021 Sep 29;17(9):e1009440. doi: 10.1371/journal.pgen.1009440

Editor: Heather J Cordell

PMCID: PMC8504726 PMID: 34587156

Abstract

In genome-wide association studies (GWAS) it is now common to search for, and find, multiple causal variants located in close proximity. It has also become standard to ask whether different traits share the same causal variants, but one of the popular methods to answer this question, coloc, makes the simplifying assumption that only a single causal variant exists for any given trait in any genomic region. Here, we examine the potential of the recently proposed Sum of Single Effects (SuSiE) regression framework, which can be used for fine-mapping genetic signals, for use with coloc. SuSiE is a novel approach that allows evidence for association at multiple causal variants to be evaluated simultaneously, whilst separating the statistical support for each variant conditional on the causal signal being considered. We show this results in more accurate coloc inference than other proposals to adapt coloc for multiple causal variants based on conditioning. We therefore recommend that coloc be used in combination with SuSiE to optimise accuracy of colocalisation analyses when multiple causal variants exist.

Author summary

Genetic association studies have found evidence that human disease risk or other traits are under the influence of genetic variants. As results of studies are made publicly available, more research focuses on whether different traits are under influence of the same variants, which may help us understand how variants lead to differences in disease risk. However, one of the popular methods to answer this question, coloc, makes the simplifying assumption that no two members of the set of causal variants for any one trait are close to each other. Here, we examine the potential of the recently proposed Sum of Single Effects (SuSiE) regression framework, for use with coloc. SuSiE is a novel approach that allows evidence for association at multiple causal variants in proximity to be evaluated simultaneously. We show this results in more accurate coloc inference than other proposals to adapt coloc for multiple causal variants based on conditioning. We therefore recommend that coloc be used in combination with SuSiE to optimise accuracy of colocalisation analyses when multiple causal variants exist.

This is a PLOS Genetics Methods paper.

Introduction

Colocalisation is a technique used for assessing whether two traits share a causal variant in a region of the genome, typically limited by linkage disequilibrium (LD). In its original form, it made the simplifying assumption that the region harboured at most one causal variant per trait [1], and we begin by explaining how that enables inference to be made quickly, using only GWAS summary statistics, and without information about LD. The approach begins by enumerating all variant-level hypotheses (the possible pairs of causal variants, or none, for the two traits) and the relative support for each in terms of Bayes factors, calculated from GWAS effect estimates at each SNP and their standard errors [2]. Thanks to the single causal variant assumption, each one of these combinations is associated with exactly one global hypothesis:

H₀: no association with either trait in the region
H₁: association with trait 1 only
H₂: association with trait 2 only
H₃: both traits are associated, but have different single causal variants
H₄: both traits are associated and share the same single causal variant
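The per-SNP Bayes factors referred to above can be illustrated with a minimal Python sketch of the approximate Bayes factor of [2], computed from an effect estimate and its standard error. The function name `log_abf`, the prior standard deviation of 0.2 and the input values are illustrative assumptions, not values taken from this manuscript.

```python
import math

def log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor in favour of association at one SNP.

    prior_sd is the assumed standard deviation of a N(0, prior_sd^2) prior
    on the true effect; 0.2 is an illustrative choice only.
    """
    z = beta / se            # z score of the effect estimate
    v = se ** 2              # sampling variance of the estimate
    w = prior_sd ** 2        # prior variance of the true effect
    r = w / (v + w)          # shrinkage factor
    return 0.5 * (math.log(1.0 - r) + r * z ** 2)

# Made-up effect estimates and standard errors for three SNPs
betas = [0.02, 0.15, 0.01]
ses = [0.03, 0.03, 0.03]
labfs = [log_abf(b, s) for b, s in zip(betas, ses)]
```

Only the effect estimate and its standard error are needed for each SNP, which is why this step requires no LD information.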

The second step calculates Bayes factors for each of these global hypotheses by summing the Bayes factors for all corresponding variant-level hypotheses (equivalently, a log-sum-exp of the variant-level log Bayes factors). Finally, standard combination of Bayes factors with prior probabilities of each hypothesis allows us to calculate posterior probabilities. A full exposition of these steps is found in [3]. Note that the per-SNP Bayes factors relate closely to fine mapping, because they are proportional to fine mapping posterior probabilities of causality under a single causal variant assumption [4]. Thus we can calculate fine mapping posterior probabilities from the single trait Bayes factors, or from the coloc Bayes factors if we are sufficiently convinced of H₄ to produce probabilities that combine information from both traits.
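The combination step can be sketched as follows, assuming per-SNP log Bayes factors for each trait are already available. The function names, the prior values (p1, p2, p12) and the example inputs are illustrative assumptions; this is a sketch of the single causal variant calculation, not the coloc package code.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def logdiffexp(a, b):
    """log(exp(a) - exp(b)); -inf when the difference is not positive."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))

def coloc_posteriors(lbf1, lbf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-SNP log Bayes factors.

    p1, p2, p12 are assumed per-SNP prior probabilities of association
    with trait 1 only, trait 2 only, and both traits.
    """
    both = [a + b for a, b in zip(lbf1, lbf2)]      # same SNP causal for both
    s1, s2, s12 = logsumexp(lbf1), logsumexp(lbf2), logsumexp(both)
    log_support = [
        0.0,                                                     # H0
        math.log(p1) + s1,                                       # H1
        math.log(p2) + s2,                                       # H2
        math.log(p1) + math.log(p2) + logdiffexp(s1 + s2, s12),  # H3
        math.log(p12) + s12,                                     # H4
    ]
    total = logsumexp(log_support)
    return [math.exp(x - total) for x in log_support]

def finemap_pp(lbf):
    """Per-SNP posterior probability of causality, given one causal variant."""
    total = logsumexp(lbf)
    return [math.exp(x - total) for x in lbf]

# Made-up log Bayes factors: both traits peak at the second SNP
pp_shared = coloc_posteriors([0.1, 9.0, 0.2, -0.5], [0.0, 8.0, 0.3, -0.2])
pp_snp = finemap_pp([0.1, 9.0, 0.2, -0.5])
```

The H₃ term subtracts the same-SNP pairs from all pairs, which is valid only because the variant-level hypotheses are mutually exclusive under the single causal variant assumption.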

The single causal variant assumption implies that the events of each pair of variants being causal for the two traits are mutually exclusive. However, the assumption is unrealistic, as multiple causal variants may exist in proximity, which also challenges the definition of colocalisation as presented above, as none of the global hypotheses encompasses multiple causal variants. Alternative methods for colocalisation have been developed which do not make this assumption. eCAVIAR [5] uses the CAVIAR [6] approach (which accommodates multiple causal variants) to fine map each trait, and gives probabilities that any variant is causal for both traits as the product of the single trait causal probabilities. However, this treats causality at each trait as independent events, when there is abundant evidence that a SNP causal for one trait is more likely to be causal for another. Alternatively, HEIDI/SMR [7] uses a frequentist framework, treating the null hypothesis as colocalisation, and rejecting this when there is evidence against it. Here, multiple causal variants are dealt with by requiring colocalisation across all causal variants in a region, and that the effects of each causal variant on the two traits are proportional. That is, if one causal variant has a two-fold greater effect on trait 1 compared to trait 2, then all other causal variants are assumed to also have a two-fold greater effect.
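The product rule described above for eCAVIAR can be illustrated with a short sketch. The function name and the posterior probabilities are made up for illustration; this is not the eCAVIAR implementation.

```python
def shared_causal_probability(pp_trait1, pp_trait2):
    """Per-variant probability of being causal for both traits,
    treating causality in the two traits as independent events."""
    return [a * b for a, b in zip(pp_trait1, pp_trait2)]

# Made-up single-trait fine-mapping posterior probabilities for three SNPs
pp1 = [0.1, 0.8, 0.1]
pp2 = [0.2, 0.7, 0.1]
shared = shared_causal_probability(pp1, pp2)
```

Because the two probabilities are simply multiplied, no prior expectation that causal variants tend to be shared between traits enters the calculation.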

Unlike these, coloc works with a single pair of causal variants at a time, and explicitly allows incorporating any expectation that causal variants are likely to be shared through prior probabilities. In previous work [3], we allowed for multiple causal variants in coloc by using conditional regression to distinguish lead variants, with the added requirement of supplying an LD matrix for the variants under test. Each pair of lead variants could be examined by a single coloc run, leading to multiple colocalisation comparisons. Thus, if trait 1 had two causal variants tagged by SNPs A and B and trait 2 had one, tagged by SNP C, we would conduct two colocalisation analyses, to ask whether A and C corresponded to a shared causal variant, and whether B and C corresponded to a shared causal variant. This allows the simple combination of log Bayes factors through summation, but explicitly assumes that data can be decomposed into layers corresponding to the causally distinct signals. The stepwise regression approach upon which conditioning is based is known generally to produce potentially unreliable results [8], a phenomenon that can be exacerbated by the extensive correlation between genetic variants caused by LD [9]. Thus, this solution remains unsatisfactory.

A suite of Bayesian fine-mapping methods have been developed recently which calculate posterior probabilities of sets of causal variants for a given trait [6, 10, 11]. However, the marginal posterior probabilities calculated from these are no longer mutually exclusive events, so they could not be easily adapted to the colocalisation framework. An alternative would be to consider all possible combinations of models between two traits, but this combinatorial problem is computationally expensive [9]. Recently, the Sum of Single Effects (SuSiE) regression framework [12] was developed which reformulates the multivariate regression and variable selection problem as the sum of individual regressions each representing one causal variant of unknown identity. This allows the distinct signals in a region to be estimated simultaneously, and enables quantification of the strength of evidence for each variant being responsible for that signal. Conditional on the regression being considered, the variant-level hypotheses are again mutually exclusive. Here we describe the adaptation of coloc, allowing for multiple labelled comparisons in a region, to use the SuSiE framework and demonstrate improved efficacy over the previously proposed approaches. While SuSiE is written in terms of the full genotype matrix, it has been extended to require only summary statistics by combination with a “regression with summary statistics” likelihood formulation [13]. We use the summary statistic module of SuSiE, susie_rss(), so that the format of data currently expected by coloc, GWAS summary statistics for each trait and an LD matrix, is unchanged.

Methods

Adaptation of coloc approach

The new coloc.susie() function in the coloc package (https://github.com/chr1swallace/coloc/tree/susie) takes a pair of summary datasets in the form expected by other coloc functions, runs SuSiE on each and performs colocalisation as described below. We use the susie_rss() function in the susieR package to fine-map each summary statistic dataset, run with default options, although the susie.args argument in coloc.susie() allows arguments to be supplied to susie_rss(). susie_rss() returns a matrix of variant-level Bayes factors for each modelled signal and a list of signals for which a 95% credible set could be formed, corresponding to a subset of rows in the matrix of Bayes factors. These rows are then analysed in the standard coloc approach, for every pair of regressions with a detectable signal across traits. Explicitly, if L₁ and L₂ signals are detected (have a credible set returned) for traits 1 and 2 respectively, then the colocalisation algorithm is run L₁ × L₂ times. Thus, the user is presented with a list of signals for each trait, and the L₁ × L₂ matrix of pairwise posterior probabilities of H₄ may be examined to infer which pair of tags, if any, represent the same signal.
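The pairwise structure of this analysis can be sketched in Python, assuming each trait supplies one row of per-SNP log Bayes factors per detected signal. The function names, priors and inputs are illustrative assumptions; the sketch does not call susieR or coloc and is not their code.

```python
import math

def _lse(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def pp_h4(lbf1, lbf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probability of H4 for one pair of signals (assumed priors)."""
    s1, s2 = _lse(lbf1), _lse(lbf2)
    s12 = _lse([a + b for a, b in zip(lbf1, lbf2)])
    a = s1 + s2
    # log(exp(a) - exp(s12)): support for pairs of distinct variants
    distinct = a + math.log1p(-math.exp(s12 - a)) if s12 < a else float("-inf")
    log_h = [0.0,
             math.log(p1) + s1,
             math.log(p2) + s2,
             math.log(p1) + math.log(p2) + distinct,
             math.log(p12) + s12]
    return math.exp(log_h[4] - _lse(log_h))

def pairwise_h4(signals1, signals2):
    """L1 x L2 matrix of PP(H4), one entry per pair of signals."""
    return [[pp_h4(r1, r2) for r2 in signals2] for r1 in signals1]

# Made-up example: trait 1 has one signal (SNP 2); trait 2 has two (SNPs 2 and 4)
trait1 = [[0.0, 20.0, 0.1, 0.2]]
trait2 = [[0.1, 18.0, 0.0, 0.3],
          [0.2, 0.0, 0.1, 19.0]]
h4 = pairwise_h4(trait1, trait2)   # 1 x 2 matrix
```

In this toy example the first entry is close to 1 and the second close to 0, so the single trait 1 signal would be matched to the first trait 2 signal only.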

Simulation strategy

We examined the performance of using SuSiE with coloc by simulation. We downloaded haplotypes for EUR samples in the 1000 Genomes phase 3 data [14], phased by IMPUTE2 [15], from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html. We used lddetect [16] to divide the genome into approximately LD-independent blocks, and extracted haplotypes consisting of 1000 contiguous SNPs with MAF > 0.01. We simulated case-control GWAS summary statistics for a study with 10,000 cases and 10,000 controls, corresponding to the LD and MAF calculated from these haplotypes using simGWAS [17], with one or two common causal variants (MAF > 0.05) chosen at random and log odds ratios sampled from N(0, 0.2²). We discarded any datasets which did not have a minimum p < 10⁻⁶ to match our expectation that fine-mapping and colocalisation are only conducted when there is at least a nominal signal of association. We simulated 100 such datasets for each of 100 randomly selected LD blocks, and sampled from these sets of summary data for all the simulations detailed below.
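Two of the steps above, choosing causal variants and discarding underpowered datasets, can be sketched as follows. The function names and the MAF vector are made up, and the generation of summary statistics by simGWAS is not reproduced here.

```python
import random

def choose_causal(mafs, n_causal, rng, min_maf=0.05, effect_sd=0.2):
    """Pick n_causal common variants at random and draw their log odds
    ratios from N(0, effect_sd^2). Returns {SNP index: log odds ratio}."""
    eligible = [i for i, maf in enumerate(mafs) if maf > min_maf]
    chosen = rng.sample(eligible, n_causal)
    return {i: rng.gauss(0.0, effect_sd) for i in chosen}

def passes_filter(pvalues, threshold=1e-6):
    """Keep a simulated dataset only if its minimum p value is below threshold."""
    return min(pvalues) < threshold

rng = random.Random(1)                      # seeded for reproducibility
mafs = [0.02, 0.30, 0.06, 0.04, 0.20]       # made-up allele frequencies
causal = choose_causal(mafs, n_causal=2, rng=rng)
```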

We repeatedly simulated GWAS summary data for a single trait with one or two causal variants in small or large genomic regions (1000 or 3000 SNPs, where 3000 SNP regions were constructed by concatenating three 1000 SNP datasets). We constructed pairs of simulated data for two traits, such that each trait had one or two causal variants and each pair of traits shared zero, one or two causal variants. We simulated 10,000 examples from each collection, with each example analysed independently. Analysis compared different approaches:

single: single causal variant coloc analysis of every pair of traits
cond_it: multiple causal variant coloc analysis using a conditioning approach to allow for multiple causal variants, iterative mode
cond_abo: multiple causal variant coloc analysis using a conditioning approach to allow for multiple causal variants, “all but one” mode
susie: multiple causal variant coloc analysis using SuSiE to allow for multiple causal variants

Conditioning can be run in two modes. Assume that stepwise regression detects two signals, tagged by SNPs A and B. In the iterative mode, we use the raw data in a first step, and then the data conditioned on A in a second step. This corresponds to how stepwise identification of independent signals in GWAS is commonly approached. The alternative, “all but one” mode conditions on B in the first step and on A in the second step, attempting to isolate the separate signals. This corresponds more closely to the hope in multiple causal variant coloc that we can decompose the data into layers corresponding to the separate signals. However, because the identification of the second signal B is likely to be more uncertain than A (because it is weaker, and was detected through conditioning on the already uncertain A), it may introduce further error.
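The difference between the two modes lies only in which signals are conditioned out at each step, as the following sketch shows. The function name and the mode strings are illustrative, and the extension to more than two signals is our assumption rather than something described above.

```python
def conditioning_sets(signals, mode):
    """For each step, list the tag SNPs that are conditioned out.

    signals: tag SNPs in the order found by stepwise regression.
    mode: "iterative" or "allbutone" (labels chosen for this sketch).
    """
    if mode == "iterative":
        # step k conditions on the signals found before it
        return [signals[:k] for k in range(len(signals))]
    if mode == "allbutone":
        # each step conditions on every signal except the one of interest
        return [[s for s in signals if s != target] for target in signals]
    raise ValueError("unknown mode: " + mode)

print(conditioning_sets(["A", "B"], "iterative"))   # [[], ['A']]
print(conditioning_sets(["A", "B"], "allbutone"))   # [['B'], ['A']]
```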

In order to assess the accuracy of each coloc analysis, we needed to assess whether the comparison corresponded to a case of shared or distinct causal variants. For each signal passed to coloc, we identified the variant with the highest posterior probability of causality, v₁ and v₂ for traits 1 and 2 respectively (it is possible that v₁ = v₂). We then labelled the variant v_i (i = 1, 2) according to the rules:

A: r²(v_i, A) > 0.5 and r²(v_i, A) > r²(v_i, B)
B: r²(v_i, B) > 0.5 and r²(v_i, B) > r²(v_i, A)
-: otherwise

If either of the variants was labelled “-” then the comparison was labelled “unknown”. Otherwise it was labelled by the concatenation of the two labels. We compared the average posterior probability profiles between methods, stratified according to this labelling scheme.
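The labelling rules above translate directly into code. The sketch below assumes the r² values between the lead variant and each true causal variant have already been computed; the function names are ours.

```python
def label_variant(r2_a, r2_b, threshold=0.5):
    """Label a lead variant by the true causal variant (A or B) it tags."""
    if r2_a > threshold and r2_a > r2_b:
        return "A"
    if r2_b > threshold and r2_b > r2_a:
        return "B"
    return "-"

def label_comparison(label1, label2):
    """Label a coloc comparison from the labels of its two lead variants."""
    if "-" in (label1, label2):
        return "unknown"
    return label1 + label2

# Made-up r² values: v1 tags A strongly, v2 tags B strongly
comparison = label_comparison(label_variant(0.9, 0.1), label_variant(0.2, 0.8))
```

Here `comparison` is "AB", a comparison whose lead variants tag distinct causal variants.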

Results in this manuscript were generated using R version 4.0.4 with packages susieR version 0.11.42 and coloc version 5.1.0.

Results

Summary results of the coloc simulation study are given in S1 Table, and presented graphically in Fig 1 and S1 Fig. When both traits really did contain only a single causal variant, we found that single coloc generally performed best (top two rows of Fig 1). SuSiE-based analysis appeared to lose a little power (lower bar heights indicating fewer comparisons performed) but was equally accurate amongst comparisons performed. The situations when coloc-SuSiE did not perform any comparisons corresponded to cases where SuSiE did not identify any credible sets for one or both traits, which were likely to be examples with higher minimum p values (Fig 2). A hybrid approach, running coloc-SuSiE if possible and coloc-single if not, outperformed any other strategy. When either one or both traits had two causal variants (bottom two rows of Fig 1), SuSiE outperformed all other methods in terms of accurately calling “AB” comparisons distinct (H₃) rather than shared (H₄) and performed as many or more comparisons than the other coloc methods. Hybrid SuSiE-single-coloc was very similar to SuSiE-coloc, or marginally better. In the two causal variant cases, single coloc tended to equivocate between H₃ and H₄ when testing AB-like signals in the presence of a shared causal variant (i.e. where the peak signals in each trait related to distinct causal variants), which should be inferred as H₃. This relates to a known feature of coloc, which may detect the colocalising signal even when additional non-colocalising signals are present [1].

Fig 2. Each dataset was summarised by its maximum -log₁₀ p value, and the pair of datasets by the minimum of these. A dashed line shows the conventional GWAS significance threshold of 5 × 10⁻⁸. This shows that when coloc-SuSiE does not produce any results it is generally in cases of lower power.

This feature also presents problems for the conditioning approach cond_it, as demonstrated by the high average posterior probability for H₄ in the “AB” comparisons, one of which is examined in detail in Fig 3. In this example, trait 1 has one causal variant, A, whilst trait 2 has two, A and B, with B having slightly greater significance. In the first round of analysis by the conditioning method, the original sets of summary statistics are passed to coloc. Because A is the stronger effect for trait 1, the test is labelled “AB”, but gives a high posterior to H₄ because there is one shared causal variant (A). Then the stronger effect, B, is conditioned out, and the analysis rerun with trait 1, and trait 2 conditioned on B. This test again gives a high posterior for H₄. This situation is confusing, because the same signal in trait 1 appears to colocalise with different signals in trait 2. SuSiE models both signals simultaneously, so we can attempt to colocalise trait 1 with each signal independently, finding high H₃ for one and high H₄ for the other. If we were confident we could infer both the exact number of independent signals and their identity correctly by conditioning, we could attempt to emulate this in the conditioning, using the “all but one” rather than “iterative” mode. This does result in better average performance than the iterative mode (Fig 1). However, it is often outperformed by SuSiE. S2 Fig shows an example where the stepwise approach cond_abo is less able to correctly identify the separate signals. The A signal is not well identified, and is therefore not adequately conditioned out, which may result in two apparently different comparisons with trait 1 which both produce a high H₄. In this example too, SuSiE more correctly produces two comparisons, one with high H₃ and one with high H₄.

Fig 3. a and b show the “observed” data (simulated from 1000 SNPs with MAF > 0.01) as -log₁₀ p values for traits 1 and 2 respectively. Trait 1 has one causal variant, A, and trait 2 has two, A and B. Conditioning identifies a second independent signal for trait 2, and the results of conditioning on the strongest signal are shown in c. Coloc comparisons are based on (a, b) and (a, c) and both find the posterior probability (PP) of the shared causal variant hypothesis H₄ is > 0.8. SuSiE analysis of the same data finds one credible set in trait 1, and log₁₀ Bayes factors (BF) for this are shown in d. It finds two credible sets for trait 2, and the log₁₀ BF for these are shown in e and f. Coloc comparisons are based on (d, e) and (d, f) and find PP of H₄ of > 0.9 and < 10⁻⁴ respectively. Blue and green points are used to highlight SNPs in LD with (r² > 0.8) the true causal variants A and B respectively. The data underlying this figure are available in S1 Data.

Finally, we compared the different approaches in terms of their ability to pinpoint the causal variant by SNP-level posterior probabilities of causality conditional on colocalisation. This vector of posterior probabilities is returned as a side-effect of every coloc comparison, and we would expect the posterior probability at the causal variant to increase when colocalisation (H₄) is called correctly. We took all simulation results which gave P(H₄|Data) > 0.9, and examined the distribution of posterior probabilities at the causal variant (Fig 4). We found the expected pattern for single and SuSiE based coloc, but conditioning did not generally result in a higher posterior probability, presumably because of difficulties with these approaches such as that exemplified in Fig 3.

Fig 4. Each point represents one causal variant in a dataset; its x location shows its maximum fine mapping posterior probability (PP) in either single trait, and its y location shows its PP after coloc. Results are divided by rows into those from datasets with 1 (top) or 2 (bottom) causal variants, and by columns according to method. The text in red shows the percentage of datasets which led to an increase in PP at causal variants after coloc analysis.
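The kind of summary shown in red in Fig 4 can be computed as in the sketch below: the percentage of retained results in which the PP at the causal variant rose after coloc. The record fields and values are invented for illustration and are not simulation output.

```python
def percent_pp_increased(records, h4_threshold=0.9):
    """Among results with PP(H4) above the threshold, the percentage in
    which the PP at the causal variant is higher after coloc than the
    maximum single-trait fine-mapping PP."""
    kept = [r for r in records if r["pp_h4"] > h4_threshold]
    if not kept:
        return None
    increased = sum(1 for r in kept if r["pp_coloc"] > r["pp_single_max"])
    return 100.0 * increased / len(kept)

# Made-up records, one per causal variant
records = [
    {"pp_h4": 0.95, "pp_single_max": 0.40, "pp_coloc": 0.70},
    {"pp_h4": 0.99, "pp_single_max": 0.80, "pp_coloc": 0.75},
    {"pp_h4": 0.50, "pp_single_max": 0.30, "pp_coloc": 0.90},
]
print(percent_pp_increased(records))   # 50.0
```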

Discussion

While coloc has been a popular method for identifying sharing of causal variants between traits, the common simplifying assumption of a single causal variant has been criticised, because it does not accord with findings that causal variants for the same trait may cluster in location (e.g. because they act via the same gene) [18]. Using the new SuSiE framework to partition the problem into multiple coloc comparisons and assuming the single causal variant assumption holds in each appears to resolve this issue better than the previously proposed conditional approach. It allows multiple signals to be distinguished, and then colocalisation analysis conducted on all possible pairs of signals between the traits. However, when no credible sets can be detected with confidence by SuSiE, single-coloc may still be able to make some inference. This can improve power when there really is one causal variant per trait, but does not appear to cause incorrect inference in the low powered multiple-causal cases. Thus we recommend a hybrid approach be adopted, using coloc-SuSiE where possible, but falling back on coloc-single when SuSiE cannot identify any credible sets.
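The recommended hybrid strategy amounts to a simple dispatch rule, sketched below. The two analysis functions are passed in as hypothetical callables standing in for the SuSiE-based and single causal variant analyses; they are not functions of the coloc package.

```python
def hybrid_coloc(credible_sets1, credible_sets2, run_susie, run_single):
    """Use the SuSiE-based analysis when both traits have at least one
    credible set, otherwise fall back to single causal variant coloc.

    run_susie and run_single are placeholder callables for this sketch.
    """
    if credible_sets1 and credible_sets2:
        return "susie", run_susie(credible_sets1, credible_sets2)
    return "single", run_single()

# Toy stand-ins: count the pairwise comparisons, or report one comparison
method, n_comparisons = hybrid_coloc(
    ["cs1"], ["cs1", "cs2"],
    run_susie=lambda a, b: len(a) * len(b),
    run_single=lambda: 1,
)
```

With one credible set for trait 1 and two for trait 2, the SuSiE branch is taken and two comparisons result; if either list were empty, the single comparison would be run instead.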

Note that in earlier preprints of this manuscript, we suggested an approach based on trimming input data to decrease the computational time required to run susie_rss, but more recent versions of susieR, including the one used here, are faster and so we no longer consider that approach to be required.

This manuscript presents one approach to colocalisation in the case of multiple causal variants, that assumes that distinct signals can be decomposed even if physically proximal, which SuSiE appears to do admirably well. This framing of the colocalisation problem implicitly assumes there are a finite number of causal variants for any trait which can be identified, and that traits may be compared in terms of their causal variants to identify shared variants. However, the concept of regional colocalisation can be approached in other ways in the multiple causal variant scenario. One approach reduces the possible hypotheses to two, with the alternative hypothesis corresponding to the existence of a causal variant in a region shared by two (or more) traits [19]. Another focuses on a variant-level definition of colocalisation, estimating the probability that each variant in turn is causal for two traits, whilst allowing that other causal variants (shared or non-shared) may exist in the vicinity [5]. In contrast, the approach proposed here allows the number of hypotheses tested to be determined by the data: it is the product of the number of credible sets identified by SuSiE for each trait. Whilst it relaxes the assumption of a single causal variant, one obvious caveat is that we have not yet reached (nor may we ever reach) sample sizes which enable all causal variants to be identified. Missed causal variants will provide incomplete comparisons of traits. It is also established that in lower power situations, even Bayesian fine-mapping methods that simultaneously model causal variants may identify a single SNP which tags two or more causal variants [9] and the interpretation of non-colocalisation at such false signals is likely to be misleading. On the other hand, it does seem useful to go beyond asking whether at least one causal variant is shared, and the attempt to both isolate and count the distinct causal variants per trait may be useful in designing follow-up experiments. As we better understand the architecture of complex traits, and design methods that accommodate the multiple causal variants that have been discovered, it is important to bear in mind that results will continue to be limited by sample size, and limited ability to detect rarer variants or those in regions of particular allelic heterogeneity, which even sophisticated methods such as SuSiE may find challenging.

Supporting information

S1 Table. Results of colocalisation simulations.

The columns shown are: scenario: the simulated causal variants in traits 1 and 2, for example A-AB indicates trait 1 has causal variant A and trait 2 has causal variants A and B; nsnps_in_region: number of SNPs in simulated region (1000, 3000); method: method used for coloc analysis; inferred_cv_pair: estimated pair of causal variants under test; H0, H1, H2, H3, H4: average posterior support for each hypothesis. This is calculated as the sum of posterior probabilities for each hypothesis / number of simulations run. As some variant pairs are unlikely to be tested (e.g. the pair AA is unlikely to be tested in the scenario A-B) this is not the expected posterior support given AA is tested.

(CSV)

S1 Fig. Companion to Fig 1, showing the results for simulated datasets with 3000 SNPs.

Legend otherwise as for Fig 1.

(TIF)

S2 Fig. Example where the conditional coloc approach, run in “all but one” mode finds misleading results.

a and b show the observed data (-log₁₀ p values) for traits 1 and 2 respectively. Conditioning identifies two independent signals for trait 2, and the results of conditioning on the signal closest to causal variants A and B are shown in c and d respectively. Coloc comparisons are based on (a, c) and then (a, d). SuSiE analysis of the same data finds one signal in trait 1, and log₁₀ Bayes factors (BF) for this signal are shown in e. It finds two signals for trait 2, and the log₁₀ BF for these are shown in f and g. Coloc comparisons are based on (e, f) and (e, g). The boxes on the lower plots show the results of running coloc analysis on that dataset against the data for trait 1 shown in a or e as appropriate. The data underlying this figure are available in S1 Data.

(TIF)

S1 Data. Datasets plotted in Figs 3 and S2, including summary statistics and the underlying LD and MAF.

(ZIP)

Acknowledgments

We thank Stasia Grinberg and Anna Hutchinson for comments on an earlier version of this manuscript, and Matthew Stephens for detailed explanation of the computational complexities in the susie_rss function.

Data Availability

Code to perform the simulations may be found at https://github.com/chr1swallace/coloc-susie-paper. A version of coloc including SuSiE is available from CRAN at https://cran.r-project.org/package=coloc.

Funding Statement

CW is funded by the Wellcome Trust (WT107881, WT220788) and the MRC (MC_UU_00002/4). This study was also supported by the NIHR Cambridge BRC (BRC-1215-20014). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, et al. Bayesian Test for Colocalisation between Pairs of Genetic Association Studies Using Summary Statistics. PLOS Genetics. 2014 May;10(5):e1004383. Available from: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004383.
2. Wakefield J. Bayes Factors for Genome-Wide Association Studies: Comparison with P-Values. Genet Epidemiol. 2009 Jan;33(1):79–86. doi: 10.1002/gepi.20359.
3. Wallace C. Eliciting Priors and Relaxing the Single Causal Variant Assumption in Colocalisation Analyses. PLOS Genetics. 2020 Apr;16(4):e1008720. Available from: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008720.
4. The Wellcome Trust Case Control Consortium, Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, et al. Bayesian Refinement of Association Signals for 14 Loci in 3 Common Diseases. Nat Genet. 2012 Oct;44(12):1294–1301. doi: 10.1038/ng.2435.
5. Hormozdiari F, van de Bunt M, Segre AV, Li X, Joo JWJ, Bilow M, et al. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet. 2016;99(6):1245–1260. doi: 10.1016/j.ajhg.2016.10.003.
6. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying Causal Variants at Loci with Multiple Signals of Association. Genetics. 2014 Oct;198(2):497–508. doi: 10.1534/genetics.114.167908.
7. Wu Y, Zeng J, Zhang F, Zhu Z, Qi T, Zheng Z, et al. Integrative Analysis of Omics Summary Data Reveals Putative Mechanisms Underlying Complex Traits. Nat Commun. 2018;9. doi: 10.1038/s41467-018-03371-0.
8. Miller AJ. Selection of Subsets of Regression Variables. J R Stat Soc Ser A. 1984;147(3):389–425. Available from: http://www.jstor.org/stable/2981576.
9. Asimit JL, Rainbow DB, Fortune MD, Grinberg NF, Wicker LS, Wallace C. Stochastic Search and Joint Fine-Mapping Increases Accuracy and Identifies Previously Unreported Associations in Immune-Mediated Diseases. Nature Communications. 2019 Jul;10(1):3216. Available from: https://www.nature.com/articles/s41467-019-11271-0.
10. Benner C, Spencer CCA, Havulinna AS, Salomaa V, Ripatti S, Pirinen M. FINEMAP: Efficient Variable Selection Using Summary Data from Genome-Wide Association Studies. Bioinformatics. 2016 May;32(10):1493–1501. doi: 10.1093/bioinformatics/btw018.
11. Newcombe PJ, Conti DV, Richardson S. JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects. Genet Epidemiol. 2016;40:188–201. doi: 10.1002/gepi.21953.
12. Wang G, Sarkar A, Carbonetto P, Stephens M. A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2020;82(5):1273–1300. Available from: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12388.
13. Zhu X, Stephens M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. The Annals of Applied Statistics. 2017;11(3):1561. doi: 10.1214/17-AOAS1046.
14. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A Global Reference for Human Genetic Variation. Nature. 2015 Oct;526(7571):68–74. doi: 10.1038/nature15393.
15. Howie BN, Donnelly P, Marchini J. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLOS Genetics. 2009 Jun;5(6):1–15. doi: 10.1371/journal.pgen.1000529.
16. Berisa T, Pickrell JK. Approximately Independent Linkage Disequilibrium Blocks in Human Populations. Bioinformatics. 2016 Jan;32(2):283–285. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4731402/.
17. Fortune M, Wallace C. simGWAS: A Fast Method for Simulation of Large Scale Case-Control GWAS Summary Statistics. Bioinformatics. 2018 Oct. doi: 10.1093/bioinformatics/bty898.
18. Yang J, Ferreira T, Morris AP, Medland SE, Genetic Investigation of ANthropometric Traits (GIANT) Consortium, DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium, et al. Conditional and Joint Multiple-SNP Analysis of GWAS Summary Statistics Identifies Additional Variants Influencing Complex Traits. Nat Genet. 2012 Apr;44(4):369–75, S1–3. doi: 10.1038/ng.2213.
19. Deng Y, Pan W. A powerful and versatile colocalization test. PLoS Computational Biology. 2020 Apr;16:e1007778. doi: 10.1371/journal.pcbi.1007778.

PLoS Genet. doi: 10.1371/journal.pgen.1009440.r001

Decision Letter 0

David Balding, Heather J Cordell

29 Apr 2021

Dear Dr Wallace,

Thank you very much for submitting your Research Article entitled 'A more accurate method for colocalisation analysis allowing for multiple causal variants' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers made positive comments but also raised substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: Please see attached Word doc.

Reviewer #2: Review of Wallace

Summary

This paper introduces an extension of the "coloc" method for colocalization to deal with multiple causal variants in a region. This extension exploits a recently-introduced method for fine mapping (SuSiE). The extension is attractive in its simplicity, and simulations show it to perform better than some alternative approaches. The paper also suggests a way to speed up computations by pre-filtering out "non-significant" SNPs.

The key idea of combining SuSiE and coloc is nice, and I think that with some improvements to the presentation it will make a nice publishable contribution.

The idea of speeding up SuSiE by pre-filtering SNPs is also attractive from a practical point of view, but it has some potential downsides that I feel are not sufficiently emphasized and explored (even though the manuscript does end with a statement that trimming might not be beneficial in general fine mapping). Specifically, trimming out non-significant SNPs could increase the potential for false positive identifications, and indeed such a result has been previously reported in https://www.biorxiv.org/content/10.1101/631390v3 (their Figure S7). It's not clear to me how, if at all, this is reflected in the results shown here. Maybe it is simply the case, as the paper suggests in the discussion, that "Coloc benefits from comparing posterior probabilities across... two traits". But the overall way that the manuscript deals with false positive (or indeed false negative) identifications is not clear. (Maybe methods are applied with some knowledge of the true number of causal effects? It isn't clear to me.) Since there are also other potential ways to speed up computation (see comments below) I am not really convinced that the pre-filtering approach is really the way to go, and would like to see at least a stronger assessment of the potential downsides.

Main Comments

1. The presentation of the method requires more details, including more precise equations showing how quantities computed by SuSiE are used/combined. For example you could introduce $\alpha_{lj}$ for the matrix of posterior probabilities output by susie and then give explicit expressions for the Bayes Factors being computed ($BF_{lj}$) in terms of $\alpha_{lj}$. I'm not sure what $P_0$ is (is it something output by SuSiE?) Is $\pi=1/p$ where p is the number of SNPs in the region, or something else? How do you set the maximum number of effects in SuSiE (L in the SuSiE paper)? Do you get SuSiE to estimate the number of effects by estimating the prior variance, or do you fix the prior variance? If $L_g$ is the number of effects identified by SuSiE in the GWAS and $L_e$ the number identified by SuSiE in the eQTL study, do you end up running coloc $L_g * L_e$ times? (as suggested by "for every pair of regressions across traits" on p3). How do you combine/summarise the results from all these different runs of coloc?

2. Presentation of colocalization results also needs more details. Can you say explicitly what is an "AA" or "BB" comparison and an "AB-like signal"? From the description on p3 I thought the simulations would include settings where there were 2 causal variants in each trait, but no sharing. But Fig 3 seems to suggest only a small portion of potential configurations of up to 2 signals in each trait are actually included - is that right? (why?) And in Fig 3, what happens if SuSiE finds a signal in one trait and not in the other - what comparison do you make? (Or do you force SuSiE to find the right number of effects in each trait by fixing L to the true value? If so, is that cheating?) Is the smaller height of the AA bar for susie_0 compared with other methods -- and indeed the slightly smaller height of all bars -- something to be concerned about? Are all methods equally applicable if (as is always the case) you do not know the true number of causal signals in each trait?

3. Figure 1 compares only the PIPs at causal variants. Since in practice we don't know the causal variants, one should also care about PIPs at non-causal variants. Is there a tendency for SuSiE to inflate PIPs at non-causal variants when trimming?

4. It seems there are many potential ways to improve computation other than filtering out non-significant SNPs, and many of them may ultimately be better choices (although filtering is obviously very simple to implement!) I don't think the discussion in the paper really adequately reflects the options available or the many issues involved.

Although I did not see it explicitly said anywhere, I believe the paper is using the susie_rss function for applying SuSiE to summary data. The details of this function are not included in the original SuSiE publication, but at time of writing this function works by performing an initial eigendecomposition of the reference LD matrix R, which makes it possible to convert the summary data into "transformed data" to which regular SuSiE can be applied. This approach is appealing from a software engineering point of view, but not necessarily the most efficient, computationally. The eigendecomposition of R is quite expensive, being O(p^3) where p is the number of SNPs. The subsequent application of SuSiE to the transformed data is O(p^2) per iteration. Thus if p is sufficiently large the eigendecomposition step will likely dominate the susie_rss computation (and Figure 2 does indeed suggest computation may increase something like p^3?)

One way to reduce computational complexity would therefore be to avoid the eigendecomposition step, and we are currently actively exploring this in our development of susie_rss. However, note that computing R itself is already an O(np^2) operation, where $n$ is the number of samples in the reference sample used to compute R. So if n is big then this computation (which is basically considered free in this paper since R is precomputed) could be the dominant computational cost. Alternatively, if n<<p, then perhaps one should avoid forming R entirely -- in the case n<<p one could instead use an SVD of the reference genotypes (O(n^2p)), which will be cheaper than forming R (O(np^2)) when n<<p.

In the future it seems quite likely that pre-computed R and eigen(R) could be made available for some large panels, avoiding the need for each user to compute them. Once these pre-computations are done there may no longer be any need to filter SNPs.
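The scaling argument in comment 4 can be illustrated with a rough operation-count model. This is only a minimal sketch: it uses the leading-order terms quoted in the comment (O(np^2) to form R, O(p^3) for its eigendecomposition, O(n^2p) for an SVD of the genotypes when n is at most p), ignores all constants, and says nothing about actual susie_rss run times. The function names and the example sizes are hypothetical.

```python
def cost_form_ld(n, p):
    """Leading-order cost of forming the p x p LD matrix R from n reference samples."""
    return n * p ** 2


def cost_eigen_ld(p):
    """Leading-order cost of an eigendecomposition of the p x p matrix R."""
    return p ** 3


def cost_svd_genotypes(n, p):
    """Leading-order cost of an SVD of the n x p genotype matrix (model assumes n <= p)."""
    return n ** 2 * p


def cheapest_route(n, p):
    """Compare 'form R, then eigendecompose' against 'SVD of the genotypes directly'."""
    via_ld = cost_form_ld(n, p) + cost_eigen_ld(p)
    via_svd = cost_svd_genotypes(n, p)
    name = "svd_of_genotypes" if via_svd < via_ld else "eigen_of_ld"
    return name, via_ld / via_svd


# Hypothetical sizes: 500 reference samples, 3000 SNPs in the region.
print(cheapest_route(500, 3000))  # -> ('svd_of_genotypes', 42.0)

# Doubling the number of SNPs multiplies the eigendecomposition term by 8.
print(cost_eigen_ld(2000) / cost_eigen_ld(1000))  # -> 8.0
```

Under this crude model the eigendecomposition term dominates once p exceeds n, which is the regime the comment describes; real timings depend on constants and on the linear algebra library used.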

Other comments/details

- p3 although the number of potential models increases exponentially, SuSiE computation does not increase exponentially.

- p4: "We labelled each comparisons considered...." I did not understand this sentence.

- p4: "... having strongest posterior support for H_4" - this should be H_3?

- p8: "this does apply to single trait" - missing *not*?

- In the second row-set of Figure 3, is the figure on the LHS wrong? (The methods suggest colocalization but the figure shows no shared variant...)

- on p7 the r2 threshold is 0.8 but on p4 it is 0.5. Are these referring to different thresholds?

This review is signed: Matthew Stephens

Reviewer #3: This is an interesting paper. The method is solid and implements M Stephens' group's SuSiE method in the coloc framework, with some simulation-based comparisons with other methods (and "trimming" rather than shrinkage to help compute time). Expanding coloc to multiple variants is a useful advance to the field, and that is what PLoS Genetics Methods section papers are supposed to do.

I only have minor comments.

The formatting of figure 3 - the scenarios - seems to have gone slightly awry and needs to be fixed.

I suggest the discussion could be extended slightly - it is rather brief (although sufficient).

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Matthew Stephens

Reviewer #3: No

Attachment

Submitted filename: review wallace plos genetics.docx

Click here for additional data file.^{(14.8KB, docx)}

PLoS Genet. 2021 Sep 29;17(9):e1009440. doi: 10.1371/journal.pgen.1009440.r002

Author response to Decision Letter 0

18 May 2021

Attachment

Submitted filename: coloc-susie response.pdf

Click here for additional data file.^{(932.3KB, pdf)}

PLoS Genet. doi: 10.1371/journal.pgen.1009440.r003

Decision Letter 1

David Balding, Heather J Cordell

2 Jul 2021

Dear Dr Wallace,

Thank you very much for submitting your Research Article entitled 'A more accurate method for colocalisation analysis allowing for multiple causal variants' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you to address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: Please see attached Word doc.

Reviewer #2: Review of Wallace (revised)

Thank you for the responsiveness to issues raised in the previous review.

I have just two points to be addressed.

1. As noted, SuSiE is under active development, and since v0.10.1 (March 16th 2021) the susie_rss function no longer performs eigen-decomposition of R. This fact could be noted in the discussion, and the version of susie used to produce the results reported here should be reported.

2. I found Figure 1 hard to read. Most of the ink is not very informative, and one has to read the actual numbers to extract the information. Also, the change in total PIP is probably less relevant than changes in individual PIPs (e.g. if all PIPs increase a very small amount, the total change can be big, but it probably doesn't matter much). I think there should be better ways to convey the information. Possibly a scatterplot of PIPs for each SNP, with vs without trimming, might work - most of the points will presumably be near (0,0) but any outliers should be immediately apparent?
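The distinction drawn in point 2, between the change in total PIP and changes in individual PIPs, can be made concrete with a small sketch. It assumes the usual SuSiE definition of a posterior inclusion probability, PIP_j = 1 - prod_l (1 - alpha_lj), and assumes that a SNP removed by trimming is treated as having PIP 0 in the trimmed analysis. The function names, the tolerance `tol`, and the toy numbers are hypothetical and are not taken from the manuscript.

```python
def pips_from_alpha(alpha):
    """PIP per SNP from alpha[l][j], the posterior probability that effect l is SNP j."""
    n_snps = len(alpha[0])
    pips = []
    for j in range(n_snps):
        prob_not_included = 1.0
        for row in alpha:
            prob_not_included *= 1.0 - row[j]
        pips.append(1.0 - prob_not_included)
    return pips


def compare_pips(pip_full, pip_trimmed, kept, tol=0.1):
    """Pair up PIPs without and with trimming.

    pip_trimmed[i] belongs to SNP index kept[i]; SNPs that were trimmed out
    are assigned PIP 0 (an assumption of this sketch). Returns the (x, y)
    points for a scatterplot, the change in total PIP, and the indices of
    SNPs whose individual PIP moved by more than tol.
    """
    expanded = [0.0] * len(pip_full)
    for pos, j in enumerate(kept):
        expanded[j] = pip_trimmed[pos]
    points = list(zip(pip_full, expanded))
    total_change = sum(t - f for f, t in points)
    outliers = [j for j, (f, t) in enumerate(points) if abs(t - f) > tol]
    return points, total_change, outliers


# Toy case: 50 SNPs each gain 0.01 in PIP and one strong SNP is unchanged.
pip_full = [0.01] * 50 + [0.9]
pip_trim = [0.02] * 50 + [0.9]
points, total_change, outliers = compare_pips(pip_full, pip_trim, kept=list(range(51)))
print(round(total_change, 3), outliers)  # total PIP moves by about 0.5, yet no SNP is an outlier
```

The `points` list holds exactly the coordinates the suggested scatterplot would show, so any plotting library could be used to draw it.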

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Matthew Stephens

Attachment

Submitted filename: plos genetics review.docx

Click here for additional data file.^{(15.8KB, docx)}

PLoS Genet. 2021 Sep 29;17(9):e1009440. doi: 10.1371/journal.pgen.1009440.r004

Author response to Decision Letter 1

3 Aug 2021

Attachment

Submitted filename: coloc-susie revised, response(1).pdf

Click here for additional data file.^{(97.6KB, pdf)}

PLoS Genet. doi: 10.1371/journal.pgen.1009440.r005

Decision Letter 2

David Balding, Heather J Cordell

2 Sep 2021

Dear Chris,

Thank you very much for submitting your Research Article entitled 'A more accurate method for colocalisation analysis allowing for multiple causal variants' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by one independent peer reviewer. The reviewer now recommends acceptance, but identified some minor concerns that we ask you address in a further revised version before we can formally accept your manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Reviewer #1: The manuscript is greatly improved since I last saw it. I have a few minor comments below, mostly to clarify some points that were not clear, but I trust the author will address them so I don’t need to see another revision of the manuscript before publication.

Minor comments:

I didn’t understand “ratio of effects” in this sentence: “Here, multiple causal variants are dealt with by requiring colocalisation across all causal variants in a region, and that the ratio of effects of each causal variant on the two traits is constant across variants.”

I didn’t follow this sentence, or how it connects with the rest of the paragraph: “Thus, the user is presented with a list of tag SNPs per signal for each trait, and the matrix of pairwise posterior probabilities of H4 may be examined to infer which, if any, pairs of tags represent the same signal.”
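To make the pairwise matrix in the quoted sentence concrete, here is a minimal sketch. It assumes each fine-mapped signal is summarised by a vector of per-SNP log Bayes factors over the same SNPs, and it applies the hypothesis weights of the original single-variant coloc (Giambartolomei et al.) to every pair of signals across the two traits. The prior values p1, p2 and p12 are illustrative assumptions, the function names are hypothetical, and this is not the coloc package's code.

```python
import math
from itertools import product


def logsumexp(xs):
    """log(sum(exp(x))) computed stably."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def logdiffexp(a, b):
    """log(exp(a) - exp(b)); requires a > b."""
    return a + math.log1p(-math.exp(b - a))


def coloc_posteriors(lbf1, lbf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 for one pair of signals.

    lbf1 and lbf2 are per-SNP log Bayes factors (at least two SNPs).
    """
    s1 = logsumexp(lbf1)
    s2 = logsumexp(lbf2)
    s12 = logsumexp([a + b for a, b in zip(lbf1, lbf2)])
    log_h = [
        0.0,                                                      # H0: no association
        math.log(p1) + s1,                                        # H1: trait 1 only
        math.log(p2) + s2,                                        # H2: trait 2 only
        math.log(p1) + math.log(p2) + logdiffexp(s1 + s2, s12),   # H3: distinct variants
        math.log(p12) + s12,                                      # H4: shared variant
    ]
    denom = logsumexp(log_h)
    return [math.exp(h - denom) for h in log_h]


def coloc_all_pairs(signals_trait1, signals_trait2):
    """One coloc comparison per pair of signals, keyed by (signal index 1, signal index 2)."""
    pairs = product(enumerate(signals_trait1), enumerate(signals_trait2))
    return {(i, j): coloc_posteriors(a, b) for (i, a), (j, b) in pairs}


# Toy region of 4 SNPs: trait 1 has signals at SNP 0 and SNP 2, trait 2 has one at SNP 2.
trait1 = [[12.0, 0.0, 0.0, 0.0], [0.0, 0.0, 12.0, 0.0]]
trait2 = [[0.0, 0.0, 12.0, 0.0]]
for pair, post in coloc_all_pairs(trait1, trait2).items():
    print(pair, "H3=%.3f" % post[3], "H4=%.3f" % post[4])
```

In this toy example the pair whose signals peak at the same SNP has nearly all of its posterior mass on H4, while the other pair favours H3, which is the pattern one would look for when reading such a matrix.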

“This situation is confusing, because the same signal in trait 1 appears to colocalise with different signals in trait 1.” Should this read “…signals in trait 2”?

I didn’t see where Fig. 2 is referred to in the text.

Why is there no green line in Fig. 3a? And why are there no green points shown in Fig. 3a, d?

For completeness, I think Fig. 3 should also show trait 1, susie signal 2? (And similarly for Fig. S2.)

In the Fig. 3 caption you should also make clear what the truth is (I recognize that this is given in the text).

“S2 Fig shows an example where the stepwise approach is less able to correctly identify the separate signals.” By “stepwise approach” do you mean cond_abo?

The example in S2 Fig seems interesting and instructive—maybe it is worth putting in the main text? If I understand correctly, one difference is that susie iteratively improves the fit, whereas cond_abo does not iterate—it conditions B on A, then A on B, then stops. So perhaps one important improvement in susie is that it iteratively improves the fit until convergence? Perhaps this could explain why susie better identifies the signals in S2 Fig?

“However, when no credible sets can be detected with confidence by SuSiE, single-coloc may still be able to make some inference.” Do you know why susie fails in these cases? If there is an explanation, it would be helpful to add it here. I’m guessing that this failure occurs in cases where the support for association is not strong? On the surface the need for the “hybrid” method is a bit surprising, but there could very well be a good reason for it.

Under “Availability”, you might want to mention the susie vignette in the coloc package, which seems particularly helpful for those interested in applying the new susie-based coloc methods.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS Genet. 2021 Sep 29;17(9):e1009440. doi: 10.1371/journal.pgen.1009440.r006

Author response to Decision Letter 2

9 Sep 2021

Attachment

Submitted filename: coloc-susie revised, revised, response.pdf

Click here for additional data file.^{(63.2KB, pdf)}

PLoS Genet. doi: 10.1371/journal.pgen.1009440.r007

Decision Letter 3

David Balding, Heather J Cordell

12 Sep 2021

Dear Dr Wallace,

We are pleased to inform you that your manuscript entitled "A more accurate method for colocalisation analysis allowing for multiple causal variants" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly:

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-21-00266R3

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

PLoS Genet. doi: 10.1371/journal.pgen.1009440.r008

Acceptance letter

David Balding, Heather J Cordell

23 Sep 2021

PGENETICS-D-21-00266R3

A more accurate method for colocalisation analysis allowing for multiple causal variants

Dear Dr Wallace,

We are pleased to inform you that your manuscript entitled "A more accurate method for colocalisation analysis allowing for multiple causal variants" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Andrea Szabo

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Table. Results of colocalisation simulations.

(CSV)

Click here for additional data file.^{(17.1KB, csv)}

S1 Fig. Companion to Fig 1, showing the results for simulated datasets with 3000 SNPs.

Legend otherwise as for Fig 1.

(TIF)

Click here for additional data file.^{(177.9KB, tif)}

S2 Fig. Example where the conditional coloc approach, run in “all but one” mode finds misleading results.

(TIF)

Click here for additional data file.^{(235.3KB, tif)}

S1 Data. Datasets plotted in Figs 4 and S2, including summary statistics and the underlying LD and MAF.

(ZIP)

Click here for additional data file.^{(12.2MB, zip)}

Data Availability Statement

[pgen.1009440.ref001] 1. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, et al. Bayesian Test for Colocalisation between Pairs of Genetic Association Studies Using Summary Statistics. PLOS Genetics. 2014. May;10(5):e1004383. Available from: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004383. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009440.ref002] 2. Wakefield J. Bayes Factors for Genome-Wide Association Studies: Comparison with P -Values. Genet Epidemiol. 2009. Jan;33(1):79–86. Available from: 10.1002/gepi.20359. [DOI] [PubMed] [Google Scholar]

[pgen.1009440.ref003] 3. Wallace C. Eliciting Priors and Relaxing the Single Causal Variant Assumption in Colocalisation Analyses. PLOS Genetics. 2020. Apr;16(4):e1008720. Available from: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008720. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009440.ref004] 4. The Wellcome Trust Case Control Consortium, Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, et al. Bayesian Refinement of Association Signals for 14 Loci in 3 Common Diseases. Nat Genet. 2012. Oct;44(12):1294–1301. Available from: 10.1038/ng.2435. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1009440.ref005] 5. Hormozdiari F, van de Bunt M, Segre AV, Li X, Joo JWJ, Bilow M, et al. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet. 2016;99(6):1245–1260. Available from: 10.1016/j.ajhg.2016.10.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

6. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying Causal Variants at Loci with Multiple Signals of Association. Genetics. 2014 Oct;198(2):497–508. doi: 10.1534/genetics.114.167908

7. Wu Y, Zeng J, Zhang F, Zhu Z, Qi T, Zheng Z, et al. Integrative Analysis of Omics Summary Data Reveals Putative Mechanisms Underlying Complex Traits. Nat Commun. 2018;9. doi: 10.1038/s41467-018-03371-0

8. Miller AJ. Selection of Subsets of Regression Variables. J R Stat Soc Ser A. 1984;147(3):389–425. Available from: http://www.jstor.org/stable/2981576.

9. Asimit JL, Rainbow DB, Fortune MD, Grinberg NF, Wicker LS, Wallace C. Stochastic Search and Joint Fine-Mapping Increases Accuracy and Identifies Previously Unreported Associations in Immune-Mediated Diseases. Nature Communications. 2019 Jul;10(1):3216. Available from: https://www.nature.com/articles/s41467-019-11271-0.

10. Benner C, Spencer CCA, Havulinna AS, Salomaa V, Ripatti S, Pirinen M. FINEMAP: Efficient Variable Selection Using Summary Data from Genome-Wide Association Studies. Bioinformatics. 2016 May;32(10):1493–1501. doi: 10.1093/bioinformatics/btw018

11. Newcombe PJ, Conti DV, Richardson S. JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects. Genet Epidemiol. 2016;40:188–201. doi: 10.1002/gepi.21953

12. Wang G, Sarkar A, Carbonetto P, Stephens M. A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2020;82(5):1273–1300. Available from: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12388.

13. Zhu X, Stephens M. Bayesian Large-Scale Multiple Regression with Summary Statistics from Genome-Wide Association Studies. The Annals of Applied Statistics. 2017;11(3):1561. doi: 10.1214/17-AOAS1046

14. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A Global Reference for Human Genetic Variation. Nature. 2015 Oct;526(7571):68–74. doi: 10.1038/nature15393

15. Howie BN, Donnelly P, Marchini J. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLOS Genetics. 2009 Jun;5(6):1–15. doi: 10.1371/journal.pgen.1000529

16. Berisa T, Pickrell JK. Approximately Independent Linkage Disequilibrium Blocks in Human Populations. Bioinformatics. 2016 Jan;32(2):283–285. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4731402/.

17. Fortune M, Wallace C. simGWAS: A Fast Method for Simulation of Large Scale Case-Control GWAS Summary Statistics. Bioinformatics. 2018 Oct. doi: 10.1093/bioinformatics/bty898

18. Yang J, Ferreira T, Morris AP, Medland SE, Genetic Investigation of ANthropometric Traits (GIANT) Consortium, DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium, et al. Conditional and Joint Multiple-SNP Analysis of GWAS Summary Statistics Identifies Additional Variants Influencing Complex Traits. Nat Genet. 2012 Apr;44(4):369–75, S1–3. doi: 10.1038/ng.2213

19. Deng Y, Pan W. A Powerful and Versatile Colocalization Test. PLoS Computational Biology. 2020 Apr;16:e1007778. doi: 10.1371/journal.pcbi.1007778

