Skip to main content
PLOS Genetics logoLink to PLOS Genetics
. 2020 Apr 20;16(4):e1008720. doi: 10.1371/journal.pgen.1008720

Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses

Chris Wallace 1,*
Editor: Michael P Epstein2
PMCID: PMC7192519  PMID: 32310995

Abstract

Horizontal integration of summary statistics from different GWAS traits can be used to evaluate evidence for their shared genetic causality. One popular method to do this is a Bayesian method, coloc, which is attractive in requiring only GWAS summary statistics and no linkage disequilibrium estimates and is now being used routinely to perform thousands of comparisons between traits. Here we show that while most users do not adjust default software values, misspecification of prior parameters can substantially alter posterior inference. We suggest data driven methods to derive sensible prior values, and demonstrate how sensitivity analysis can be used to assess robustness of posterior inference. The flexibility of coloc comes at the expense of an unrealistic assumption of a single causal variant per trait. This assumption can be relaxed by stepwise conditioning, but this requires external software and an LD matrix aligned to study alleles. We have now implemented conditioning within coloc, and propose a new alternative method, masking, that does not require LD and approximates conditioning when causal variants are independent. Importantly, masking can be used in combination with conditioning where allelically aligned LD estimates are available for only a single trait. We have implemented these developments in a new version of coloc which we hope will enable more informed choice of priors and overcome the restriction of the single causal variant assumptions in coloc analysis.

Author summary

Determining whether two traits share a genetic cause can be helpful to identify mechanisms underlying genetically-influenced risk of disease or other traits. One method for doing this is “coloc”, which updates prior knowledge about the chance of two traits sharing a causal variant with observed genetic association data in a Bayesian statistical framework. To do this using only summary genetic association data that is commonly shared, the method makes certain assumptions, in particular about the number of genetic causal variants that may underlie each measured trait in a genomic region. We walk through several data-driven approaches to summarise the prior knowledge required for this technique, and propose sensitivity analysis as a means of checking that inference is robust to uncertainty about that prior knowledge. We also show how the assumptions about number of causal variants in a region may be relaxed, and that this improves inferential accuracy.

Introduction

As genome-wide association studies (GWAS) have considered a greater diversity of traits in greater numbers of samples, comparative analyses of GWAS results have become a useful tool to explore the aetiological connections between different traits. For example, estimates of genetic correlation obtained via LD score regression quantify the average proportion of genetic variance of two traits that is shared across the genome, [1] although typically large sample sizes are required in both trait studies for accuracy. [2] Linking traits through genetics overcomes at least one major challenge of observational studies, reverse causality, and with careful design, can also address confounding. Epidemiologists have developed and widely deployed the technique of Mendelian randomization (MR), [3] which has been used, for example, to establish causal effects of factors such as alcohol intake on aspects of health. [4] The method uses a genetic variant or variants with established effects on one trait, and assesses whether a second trait is (proportionally) associated with these instrumental variables. Assuming certain assumptions hold true, [5] this provides evidence that the first trait is somehow causal for the second. While MR was originally envisaged as a test of causality of specific risk factors for which tests of causality might be confounded in observational studies, MR has been extended to routinely assess the potential for any GWAS trait to mediate another. [6] However, the ubiquity of genetic effects on some measurable aspect of human physiology or health, which have prompted suggestions of an omnigenic model, [7] raise concerns that LD between causal variants can violate the MR assumption that the instrumental variable is only associated with the outcome through the “mediating” trait. [8] This routine testing of all possible mediators is similar in design to the assessment of potential molecular causes of disease, which has been addressed through alternative approaches that focus not on whether one trait is causal for another, but whether two traits share the same causal variants in a single, LD-defined, genetic region, termed colocalisation.

While one such method is built on MR [9] and proceeds by filtering MR-positive associations via a test of heterogeneity in the estimated proportional effect across multiple SNPs in the region, another popular colocalisation method, coloc, [10] avoids MR assumptions altogether. Instead, coloc enumerates every possible configuration of causal variants for each of two traits, and calculates the support for that causal model in the form of a Bayes factor can be calculated under an assumption that at most one causal variant per trait exists in the region (see S1 Text). Each configuration corresponds to exactly one of five mututally exclusive hypotheses about association and genetic sharing in the region:

H0:noassociationH1:associationtotrait1onlyH2:associationtotrait2onlyH3:associationtobothtraits,distinctcausalvariantsH4:associationtobothtraits,sharedcausalvariant

The coloc approach has also been extended beyond pairs of traits, although computational efficiency scales poorly with numbers of traits [11, 12] unless decisions are binarised [13] and to deal with GWAS data that share controls, though at the expense of requiring raw genotype data [11].

As a Bayesian method, coloc requires specification of three informative prior probabilities: p1, p2, p12 are, respectively, the prior probabilities that any random SNP in the region is associated with exactly trait 1, trait 2, or both traits (Fig 1). Although values for these were suggested in the initial proposal, [12] appropriate values should depend on specific datasets used, particularly for p12, and no specific guidance on how this choice should be made was given.

Fig 1. Each hypothesis for coloc analysis H0H4 may be enumerated by configurations, one configuration per row shown grouped by hypothesis.

Fig 1

Each circle in this figure represents one of n genetic variants, and is shaded orange if causal for trait 1, blue if causal for trait 2. There are different numbers of configurations for each hypothesis, depending on the number of SNPs in a region, and the prior is set according to three prior probabilities so that all configurations within a hypothesis are equally likely.

One of the strengths of coloc is the simplicity of data required. The assumption of at most one causal variant per trait allows inference to be made through reconstructing joint models across all SNPs from univariate (single SNP) GWAS summary data. [14, 15] Importantly, this requires no reference LD matrix and allows combining data from traits studied in differently structured populations. Further, p-values will suffice if internal or external estimates of minor allele frequency (MAF) are available, so that (unsigned) effect estimates and their standard errors can be re-constructed. However, the single causal variant assumption is convenient rather than realistic and when it does not hold colocalisation effectively tests whether the strongest signals for the two traits colocalise [10] which has been shown to be conservative [16].

e-CAVIAR [17] removes the assumption of a single causal variant per trait by integrating over the fine mapping posteriors for two traits, but requires signed effect estimates that are aligned to a reference LD matrix, that the traits are studied in the same population, and does not allow using any prior knowledge that shared causal variants are more or less likely than distinct variants. Perhaps the most challenging of these is the alignment of signed effect estimates to a reference LD matrix. This can be impossible in the case that signed estimates are not provided due to privacy concerns, [18] or that alleles are not provided. Even where alleles are available, palindromic SNPs (A/T, C/G) cannot be aligned unambiguously particularly for MAF ≈ 0.5.

The assumption of a single causal variant in coloc may be relaxed by successively conditioning on the most significant variants for each trait, and testing for colocalisation between each pair of conditioned signals, although this requires either complete genotype data or use of external software such as CoJo [19] together with signed and LD-aligned effect estimates to allow reconstruction of conditional regression effect estimates.

To support more accurate coloc analyses, we explored a variety of data-driven approaches to inform prior choice across a range of traits and developed a framework to explore sensitivity of conclusions to the priors used. Further, we implemented an existing conditioning approach in the coloc package, but also developed an alternative approach to conditioning which does not require aligned LD and effect estimates, to offer an option to deal with multiple causal variants which preserves the simplicity of the data required for coloc analyses.

Results

We used Scopus to identify 60 papers which cited coloc [10] and were published in 2018. Out of these, we extracted the subset of 25 papers that were both applied papers (rather than methodological) and for which full text could be accessed (S1 Table). The studies covered a variety of trait pairs, generally integrating a disease GWAS with molecular quantitative trait loci (QTL) data, [2039] but also comparing pairs of disease GWAS, [40] eQTL and pQTL [41, 42] or eQTL and other molecular traits. [43, 44] Only four studies considered the potential for multiple causal variants in a region, either discussing the implications on their results, or using conditioning in at least one trait, and 22 out of 25 studies used the software default priors across this diverse range of trait pairs.

Given that it is likely that the prior probability of colocalisation may depend on the trait pairs under consideration, we decided to evaluate the effect of mis-specifying prior parameters and/or not conditioning when multiple causal variants exist.

The importance and elicitation of prior parameter values

Before examining the robustness of inference to changes in prior values, we elucidate some properties of prior parameters. While priors are expressed per SNP, our hypotheses and posterior relate to a region—a set of n neighbouring SNPs. The prior that one SNP in the region is causally associated with trait 1 is ≈ np1 (and similarly np2 for trait 2, np12 for colocalisation). All these scale with the number of SNPs—the larger the set of SNPs we consider, the greater the chance one of them is causal for any trait. Despite this, the prior odds for H4/H1—colocalisation compared to association of trait 1 only—remains constant at p12/p1.

The prior for H3 (two distinct variants for the two traits) is ≈ n(n − 1)p1 p2 which scales with the square of n. This means that prior odds of the two hypotheses of greatest interest, H4/H3, depends not only on the per SNP prior of causality for one or other trait, but also on the number of SNPs in a region, to the extent that the same p1, p2, p12 may favour either H3 or H4 as larger regions are considered (Fig 2). This effect can be understood by noting that both H3 and H4 imply that each trait has exactly one causal variant in the region. Simple combinatorics implies that as the number of SNPs in a region increases, then the number of ways two different SNPs can be causal for the two traits (H3) increases more rapidly than the number of ways one SNP can be causal for both (H4). Hence, H3 becomes relatively more likely than H4 as the number of SNPs in the region increases.

Fig 2. Effects of varying p12 on the prior for H4 (coloured lines) compared to H3 (dashed line) as a function of the number of SNPs in the region.

Fig 2

For all plots p1 = p2 = 10−4 is constant. The coloured squares highlight points P(H3) = P(H4) for different p12.

Marginal priors

To elicit values for p1, p2, we reparameterise, focusing on the possible marginal events for any SNP:

A1:SNPiscausallyassociatedtotrait1Probq1=p1+p12A2:SNPiscausallyassociatedtotrait2Probq2=p2+p12

Note that in this notation, A1 and A2 are not mutually exclusive, so that colocalisation is A1A2. q1, q2 can be estimated empirically by considering evidence from the wealth of single trait association data that already exists. For eQTLs, we use GTeX data [45] and find that q. is dependent on the MAF of SNPs considered, which reflects variable power with fewer true eQTL variants detectable at lower MAF, and search window around the gene considered as previously noted, tending to 10−4 for common SNPs and windows ∼1 mb (Fig 3).

Fig 3. Determining plausible priors q1, q2.

Fig 3

a q. estimated for eQTLs as the ratio of estimated number of LD-independent significant eQTL variants divided by number of SNPs considered for an eQTL analysis in GTeX whole blood samples in successively larger windows around a gene TSS. Separate lines show findings in 5 equal groups of MAF, with the top and bottom groups labelled. b The number of hits claimed per study according to the GWAS catalog. q. could be estimated as number of hits / number of common SNPs (∼ 2, 000, 000). c Posterior probability of association at a single SNP as a function of -log10 p values for varying values of q.. We considered both case/control and quantitative trait designs, and a range of MAF (0.05-0.5) and sample size (2000,5000,10000). The relationship between -log10 p (x axis) and posterior probability of association (y axis) is consistent across all designs, affected only by the prior probability of association (q1, q2). The vertical line indicates p = 5 × 10−8, the conventional genome-wide significance threshold in European populations.

The GWAS Catalog [46] enables us to consider something similar by aggregating over 5000 GWAS studies. We find, as expected, and again as previously noted,[47] that the number of hits per study increases steadily with increasing sample size (Fig 3), but that the count also depends on the class of trait considered, with “harder” endpoints such as breast cancer and heel bone mineral density identifying orders of magnitude more associations compared to “weaker” endpoints such as tendency to strenuous sports or activity levels. The largest studies find ∼ 100–1000 hits out of ∼ 2 million common SNPs leading to estimates that 5 in 10,000–100,000 common SNPs are detectably causal for these traits which corresponds to q. ∈ [5 ×10−5, 5 × 10−4]. Even with the largest studies, these estimates must be considered likely to continue to increase with sample size, and therefore conservative. Using conservative priors for p1, p2 in colocalisation analysis is likely to reduce power to detect either shared or distinct causal variants, because weaker signals may be wrongly interpreted as trait-unique or null. However, estimates from the largest available studies also represent at upper bound on the proportion of variants likely to be detectably associated in any new study from the same class of traits, and therefore relaxing the priors further might result in over-stating the evidence for causal variants and erring towards false detection of shared or distinct causal variants.

An alternative approach is to choose the prior according to the p-value that we would consider significant. The threshold of p < 5 × 10−8 has been widely adopted as “genome-wide significant” for GWAS studies in European populations. Across a range of designs (case/control or quantitative trait, with varying MAF and sample size), we see that a prior of q. = 10−4 gives a strong posterior probability of association (≈ 0.94).

The default coloc marginal prior of q1 = q2 = 10−4 + p12 ≈ 10−4 is thus supported by the convergence of these three approaches to values of the order of 10−4.

Prior probability of joint or conditional causality

q1 and q2 themselves place some constraints on p12. On the one hand, the chance of joint causality cannot be greater than the chance of causal association with either trait. On the other hand, if traits were independent, then causal variants for each trait would happen to co-occur at the same location with probability q1 × q2. However, simulations show that the distribution of expected posterior probabilities vary considerably with p12 over this range (Fig 4), indicating that we need to make some effort to elicit plausible values. The results suggest that the coloc default of p12= = 10−5 may be overly liberal, with data simulated under H3 having posterior support for H4, particularly for smaller samples, and that p12 = 5 × 10−6 may be a more generally robust choice.

Fig 4. Distribution of expected posterior probabilities across a wide range of simulated data.

Fig 4

In all analyses we fixed p2 = p1 = 10−4 and varied p12. Coloured bar heights represent the average posterior probability for each hypothesis over the set of simulations for a given simulated hypothesis and sample size.

We consider different approaches to determine data-driven estimation of p12. First, we can set a lower bound if we take into account that not all of the genome is understood to be functional. Estimates of the functional proportion vary considerably, from 25% [48]–80%. [49] Even for traits that are genetically independent, knowing that a SNP is causal for one trait implies it is functional, and thus more likely to be causal for another trait then a random SNP that may or may not be functional. Assuming the proportion of genetic variants that are functional is f, the probability of co-occurence by chance alone is q1 q2/f (see S1 Text).

In the case of comparing two GWAS studies, it may be possible to estimate the genetic correlation, rg. We show in S1 Text that, when shared variants do not have any systematically different distribution of allele frequencies or effects compared to non-shared variants,

|rg|n12(n12+n1)(n12+n2)=p12q1q2

where n12, n1, n2 are the number of variants shared, distinct to trait 1 and distinct to trait 2.

Putting these together, we find

q1q2f<p12,|rg|q1q2<p12,p12<min(q1,q2).

Second, where studies of both traits are well powered, then methods for joint analysis of trait pairs may be informative. For example, gwas-pw [50] extends the original coloc by using empirical Bayes to estimate per-hypothesis priors via joint analysis of all regions genomewide. However, this comes at a cost of ignoring the dependence of per-hypothesis priors on the number of SNPs in a region, and even in simulated data did not generate consistent estimates. This latter may reflect the limited information that exists in any pair of GWAS (the number of regions where detectable signals exist for both traits). Nonetheless, such an approach can probably give a useful order of magnitude estimate for p12.

Finally, in the absence of data about joint trait association at the genome-wide level, it is necessary to rely more on investigator judgement, and here it may helpful to consider conditional probabilities

p12=P(A1A2)=P(A1|A2)×P(A2)=q1|2×q2

The term q1|2 represents the probability that a SNP, already known to be causal for trait 2, is also causal for trait 1. In asymmetric analysis such as GWAS and eQTL, it may be simpler to condition on one event rather than the other—does the investigator have a clearer idea of the chance that a SNP that causally regulates gene expression in a given tissue is causally associated with a disease or the chance that a SNP that is causally associated with a disease does so via transcriptional regulation in that same tissue?

To aid translation of priors between the two parameterisations discussed here, we have created an online tool “coloc explorer” at https://chr1swallace.shinyapps.io/coloc-priors.

Sensitivity analysis

In the expected case that an investigator does not have a strong prior belief in a single value for p12 we can use sensitivity analysis to consider whether conclusions are robust over a range of plausible values. Helpfully, it is not necessary to reanalyse the complete dataset multiple times. Given that

P(Hi|D,π)BFi×P(Hi|π)=P(D|Hi)P(D|H0)×P(Hi|π)

where D represents study data and π = (p1, p2, p12) is the prior parameter vector used for analysis, we can derive posterior probabilities under an alternative prior parameter π* as

P(Hi|D,π*)P(Hi|D,π)×P(Hi|π*)P(Hi|π

and so we can rapidly explore sensitivity of inference to changes to p12. Fig 5 shows an example where conclusions depend heavily on the relative prior belief in H3 and H4 and a conclusion of colocalisation by a decision rule of P(H4|D, π) > 0.5 is only valid if prior beliefs are that H4 is at least as likely as H3. An alternative example where results are robust over a wide range of p12 is shown in S1 Fig. Detailed instructions to run a sensitivity analysis are given at http://chr1swallace.github.io/coloc/articles/a04_sensitivity.html.

Fig 5. Example of sensitivity analysis on a dataset which shows evidence for colocalisation at a predefined rule of posterior P(H4) > 0.5 only when the prior beliefs in H3 and H4 are approximately equal.

Fig 5

The left hand panels show local Manhattan plots for the two traits, while the right hand panels show prior and posterior probabilities for H0-H4 as a function of p12. The dashed vertical line indicates the value of p12 used in initial analysis (the value about which sensitivity is to be checked). H0 is omitted from the prior plot to enable the relative difference for the other hypotheses to be seen.

Conditioning and masking to allow for multiple causal variants

In order to deal with multiple causal variants in a region, we implemented the CoJo approach [19] within the coloc package. We also propose an alternative to conditioning which does not depend on allelic alignment and can be used with p-values alone: masking. Stepwise regression proceeds by identifying the top SNP, and then re-estimating association statistics across all other SNPs to test whether they provide any additional information to infer the trait of interest. Conditional effect estimates at SNPs in LD with the top SNP(s) differ from their unconditional values, so that they capture the residual evidence for association, but conditional and unconditional effect estimates are (effectively) the same at SNPs independent from the top SNP(s). Our proposed masking algorithm relaxes the assumption of a single causal variant by instead assuming that if multiple causal variants exist for any individual trait, they are in linkage equilibrium. It therefore first identifies lead SNPs, then successively masks all SNPs in LD with the top signals(s), testing for significant association in the remainder, and adding SNPs sequentially while residual association remains (Fig 6). When colocalising, each lead SNP is taken in turn, and any SNPs in LD with any other lead SNP are masked, by setting the per-SNP Bayes factor to 1 for any SNP-specific hypothesis relating to that SNP/trait pair. We have implemented both approaches in the development version of the coloc package, https://github.com/chr1swallace/coloc/tree/condmask, and document their use at http://chr1swallace.github.io/coloc/articles/a05_conditioning.html.

Fig 6. Masking as an alternative strategy to conditioning when attempting to colocalise trait signals with multiple causal variants in a region.

Fig 6

Top panel: input local Manhattan plots, with causal variants for each trait highlighted in red. We can use conditioning (left column) to perform multiple colocalisation analyses in a region. First, lead SNPs for each signal are identified through successively conditioning on selected SNPs and adding the most significant SNP out of the remainder, until some significance threshold is no longer reached. Then we condition on all but one lead SNP for each parallel coloc analysis. Note that when multiple lead SNPs are identified for each trait, eg n and m for traits 1 and 2 respectively, then n × m coloc analyses are performed. When an allele-aligned LD matrix is not available, an alternative is masking (right column) which differs by successively restricting the search space to SNPs not in LD with any lead SNPs instead of conditioning. Multiple coloc analyses are again performed, but setting the per SNP Bayes factor to 1 for hypotheses containing SNPs in LD with any but one of the lead SNPs. Note that for convenience of display, all SNPs in r2 > α with the lead SNP are assumed to be in a contiguous block, shaded gray.

We compared conditioning and masking to single coloc analysis across a variety of simulated datasets (Figs 7 and 8). A single coloc comparison generally relates to the strongest signals for each of the two traits, as previously reported, [10] which can miss colocalising signals that are secondary to a primary independent signal (Fig 7, row 3) or that have differently ordered effect sizes (Fig 8, row 5). Conditioning allows more distinct comparisons and shows a marked improvement on single coloc, in particular being able to identify a greater proportion of the truly colocalising signals. Masking increases the number of comparisons compared to single coloc, but is less informative than conditioning. In particular, the number of comparisons that cannot be clearly assigned to a specific causal variant pair (at least one lead SNP does not have r2 > 0.8 with a causal variant) increases when multiple causal variants are in LD (S2 and S3 Figs) and this fraction of comparisons are often inaccurate, finding posterior support for H3 when H4 is true.

Fig 7. Average posterior probabilities for each hypothesis under different analysis strategies when trait 1 has two causal variants, A and B, and trait 2 has just one.

Fig 7

The left column shows the identity of causal variants for each trait and their relative effect sizes under four different models. The right column shows the average posterior that can be assigned to specific comparisons for of variants for trait 1: trait 2. We exploit our knowledge of the identity of the causal variants in simulated data to label each comparison according to LD between the lead SNP for each trait and the simulated causal variants. When labels cannot be unambiguously assigned (r2 < 0.8 with any causal variant) we use “?”.

Fig 8. Average posterior probabilities for each hypothesis under different analysis strategies when both traits have two causal variants.

Fig 8

Information is displayed as described in Fig 7.

Discussion

This paper has focused on two practical aspects of Bayesian colocalisation analysis that hitherto have not received detailed attention. The ability of Bayesian methods to incorporate prior knowledge and beliefs is a strength of the coloc approach, but also places onus on a researcher to evaluate their prior beliefs. Elicitation of informative priors is a subject that has received much attention in the statistical literature [51] but rather less within the genetics community. Nonetheless, the use of Bayesian methods in genomics is growing in popularity, as a natural way to fit joint models to large and complex data sets and to enable integrative analysis over different traits or datasets. When data are large, and the number of events is also large, then empirical Bayes can enable an analyst to learn the prior from the same data used for testing. However, in the case of smaller studies or less common events, the wealth of existing information from other large studies as well as investigators’ own beliefs can be used.

For coloc, the choice of marginal prior parameter values can be readily informed in this way. For joint causality this is harder and while we suggest and walk through several alternative ways of doing this the conclusions we draw are not universally applicable; each investigator should use both available data and their own judgement to elicit their own prior beliefs and those of their co investigators. Perhaps the most widely applicable are the results of simulations, that suggest values of the order p12 ≈ 5×10−6 lead to robust inference over a range of scenarios, but the adoption of sensitivity analysis will help evalutate robustness of inference to changes in prior parameter values.

Attempts to colocalise disease and eQTL signals have ranged from underwhelming [52] to positive.[53] One key difference between outcomes is the disease-specific relevance of the cell types considered, which is consistent with variable chromatin state enrichment in different GWAS according to cell type.[54] For example, studies considering the overlap of open chromatin and GWAS signals have convincingly shown that tissue relevance varies by up to 10 fold, [55] with pancreatic islets of greatest relevance for traits like insulin sensitivity and immune cells for immune-mediated diseases.[54] This suggests that p12 should depend explicitly on the specific pair of traits under consideration, including cell type in the case of eQTL or chromatin mark studies. One avenue for future exploration is whether fold change in enrichment of open chromatin/GWAS signal overlap between cell types could be used to modulate p12 and select larger values for more a priori relevant tissues.

The other focus of this paper is on dealing with multiple causal variants for single traits in a single region. Single coloc can be misleading when there are completely shared causal variants in the two traits, but with different effect sizes, such that colocalisation concludes there are single effects in each trait, different to each other (e.g. row 5 of Fig 8). Inference is much improved with conditioning, and we hope that by including the conditioning method within coloc we will enable more widespread use of this step. Note that if the two traits are measured in different populations, then colocalisation can still be performed, with a separate LD matrix for each. However, if the summary statistics from a single trait are the results of meta analysis of different populations, then conditioning needs to be performed in each population separately.

One advantage of coloc has been the minimal amount of data pre-processing required. In particular, there is no need to harmonize alleles between the two datasets or to some reference dataset. However, harmonization cannot be avoided if multiple causal variants are to be dealt with via conditioning. Here, we propose successively masking most associated SNPs and SNPs in LD with them. This has conceptual similarities to clumping, used in polygenic risk score construction to select the strongest signal in each LD-independent set of SNPs [56], and to the division of the genome into LD-independent blocks, [57] but differs to each. Our motivation is inverted compared to that for clumping: We aim to identify the set of SNPs whose GWAS summary statistics are likely to be unrelated to the masked signal, rather than select a single SNP from the masked group. We also select smaller sets of SNPs than found by dividing the genome into blocks, because we select SNPs only according to LD with the sentinel SNP, rather than finding breakpoints such that every SNP in a block is likely to have minimal LD with any SNP outside that block. While masking loses accuracy in comparison to conditioning, it improves on single coloc, and importantly doesn’t appear to lead to erroneous positive conclusions for H4 when H3 is true, although the reverse—supporting H3 for a secondary comparison when H4 is true—can occur when causal variants are themselves in LD. Therefore secondary H3 conclusions should be treated with some caution, but secondary H4 conclusions may signal true colocalisations that would have otherwise been missed. Often a researcher may be colocalising results from one dataset for which they have complete information (e.g. because it was generated in their lab) with a public disease GWAS with less information, and here we recommend the hybrid strategy of conditioning in the dataset with full information and masking in the public dataset. Masking is also likely to avoid substantial errors in the results of approximate conditioning that can occasionally result from small deviations from LD estimated in a reference population to that in the study sample, particularly when the reference population is smaller than that used to the GWAS [58].

While we have discussed the thought process required to consider prior parameter values, thought is also required to interpret partially colocalising signals (i.e. a convincing mixture of one colocalising and one non-colocalising variant). When the two datasets are different disease GWAS, it may be reasonable that they share only one signal, with the alternate signal operating through a different mechanism. But if there are two signals for an eQTL only one of which colocalises with a disease signal, then this should be interpreted with greater caution than complete colocalisation. It suggests that there are two ways of modifying expression of a gene but that only one of those ways is also associated with variable disease risk. This might mean that the right gene has been identified in the wrong tissue, given the overlap in eQTL signals between tissues, [45] but it might also indicate incidental colocalisation. Similarly, lack of colocalisation may indicate only that the correct tissue or state has not been assayed. We anticipate that systematic analysis of multiple tissues and genes with a single disease may lead to a set of posterior probabilities that are jointly more amenable to interpretion than a single isolated analysis. However, colocalisation will always be limited by its basis in analysis of observational data, and experimental manipulation through CRISPR or through genotype-targeted assays will be required to establish causality.

In summary, we find that coloc default values for the prior probabilities of single trait association, p1, p2, are well supported by data across a range of data types, but that the choice of p12 needs careful thought, and is expected to vary according to the pair of traits being considered. We recommend taking some time to do this before any analysis, documenting and justifying choices, using the coloc explorer app to translate between per-SNP and per-hypothesis values. The simulations here (Fig (4)) suggest that p12 = 5×10−5 provides a reasonable balance between power and false positive calls, but it is unlikely that any single point distribution on p12 captures all prior knowledge. As varying p12 can sometimes have a substantial impact on inference, we strongly advise users to perform sensitivity analysis for key results. Both the justification of choices and the results of sensitivity analyses should be presented to accompany any published results.

Materials and methods

Code to run the simulations and analyses described below is available at https://github.com/chr1swallace/coloc-mask-paper.

A statistical description of the coloc method, including calculation of per-SNP and per-hypothesis Bayes factors and posterior probabilities is given in S1 Text.

To calculate the posterior probability of association shown in Fig 3c and 3d, we use the Bayes factor for association at a single SNP defined in S1 Text, BF1. We calculate the posterior probability for association as a function of the prior probability that a SNP is associated with the trait, π, as

P(J1|Data)=P(Data|J1)πP(Data|J1)π+P(Data|J0)(1-π0)=BF1πBF1π+(1-π0)

where we use J0, J1 to denote the competing hypotheses of association and non-association at this SNP.

Simulations

We evaluated different prior parameter settings, sensitivity analysis, or strategies for dealing with multiple causal variants by simulation. In each case, we simulated GWAS data by sampling 2N haplotypes of length M SNPs for N individuals from 1000 Genomes samples (either EUR or YRI), and selected one or two causal variants at random from amongst common SNPs (MAF>5%) according to the question being addressed.

Effect estimates at each variant were sampled from the set {0.17, 0.33, 0.50, 0.67, 0.83, 1.00, 1.17, 1.33, 1.50}, sample sizes N from the set {100, 200, 500, 1000, 2000, 5000, 10000} and number of SNPs M from {250, 500, 750}. Quantitative traits with residual standard deviation 1 were then simulated according to linear models, i.e. as

Y=ibiGi+e

where i indexes causal variants, bi and Gi the effect estimate and genotype at variant i, and eN(0, 1).

For all analyses, we used p1 = p2 = 10−4 and varied p12 as described in the text.

GTEx analysis

We used GTEx data to estimate the probability that a random SNP could be causally associated with the expression of a gene within some bp-defined window. We analysed GTEx v7 Whole Blood significant eQTLs, downloaded from https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz on 25 June 2019. We used masking to define independent signals within this set for each gene (r2 < 0.01) using 1000 Genomes EUR samples to estimate LD. We estimated q as the ratio of the number of significant lead eQTLs in multiples of 100 kb windows around the TSS to the number of SNPs in 1000 Genomes with SNPs grouped by MAF into 5 groups: [0, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.4], (0.4, 0.5].

GWAS catalog analysis

We used the GWAS summaries in the GWAS catalog (https://www.ebi.ac.uk/gwas/api/search/downloads/full, download date: 12 June 2019) to estimate the proportion of common SNPs that were independently associated with any given case/control or quantitative trait and examined how this varied according to reported sample size.

Supporting information

S1 Table. Summary of applied papers from 2018 using coloc.

(PDF)

S1 Text. Supporting mathematical derivations.

(PDF)

S1 Fig. Example of sensitivity analysis on a dataset which shows evidence for colocalisation at a predefined rule of posterior P(H4) > 0.5 across a wide range of p12.

(TIF)

S2 Fig. Average posterior probabilities for each hypothesis when trait 1 has two causal variants, and trait 2 has just one, according to whether the maximum r2 between multiple causal variants is ≤ 0.01 or > 0.01.

(TIF)

S3 Fig. Average posterior probabilities for each hypothesis when both traits have two causal variants, according to whether the maximum r2 between multiple causal variants is ≤ 0.01 or > 0.01.

(TIF)

Acknowledgments

We thank Stasia Grinberg and members of the BSU for helpful discussions during the preparation of this manuscript.

The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS.

The NHGRI-EBI GWAS Catalog is funded by NHGRI Grant Number 2U41HG007823, and delivered by collaboration between the NHGRI, EMBL-EBI and NCBI.

Data Availability

Code to run all the simulations and analyses is available at https://github.com/chr1swallace/coloc-mask-paper. The updated coloc package may be installed as described at https://chr1swallace.github.io/coloc.

Funding Statement

CW is supported by the Wellcome Trust https://wellcome.ac.uk/ (WT107881) and the MRC https://mrc.ukri.org/ (MC UU 00002/4). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Schizophrenia Working Group of the Psychiatric Genomics Consortium, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–295. 10.1038/ng.3211 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Ni G, Moser G, Schizophrenia Working Group of the Psychiatric Genomics Consortium, Wray NR, Lee SH. Estimation of Genetic Correlation via Linkage Disequilibrium Score Regression and Genomic Restricted Maximum Likelihood. Am J Hum Genet. 2018;102(6):1185–1194. 10.1016/j.ajhg.2018.03.021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Gray R, Wheatley K. How to avoid bias when comparing bone marrow transplantation with chemotherapy. Bone Marrow Transplant. 1991;7 Suppl 3:9–12. [PubMed] [Google Scholar]
  • 4. Chen L, Smith GD, Harbord RM, Lewis SJ. Alcohol intake and blood pressure: a systematic review implementing a Mendelian randomization approach. PLoS Med. 2008;5(3):e52 10.1371/journal.pmed.0050052 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Haycock PC, Burgess S, Wade KH, Bowden J, Relton C, Davey Smith G. Best (but oft-forgotten) practices: the design, analysis, and interpretation of Mendelian randomization studies. Am J Clin Nutr. 2016;103:965–978. 10.3945/ajcn.115.118216 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Hemani G, Zhengn J, Elsworth B, Wade KH, Haberland V, Baird D, et al. The MR-Base platform supports systematic causal inference across the human phenome. Elife. 2018;7:e34408 10.7554/eLife.34408 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Boyle EA, Li YI, Pritchard JK. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell. 2017;169(7):1177–1186. 10.1016/j.cell.2017.05.038 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Smith GD, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol. 2003;32(1):1–22. 10.1093/ije/dyg070 [DOI] [PubMed] [Google Scholar]
  • 9. Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet. 2016;48(5):481–487. 10.1038/ng.3538 [DOI] [PubMed] [Google Scholar]
  • 10. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, et al. Bayesian Test for Colocalisation between Pairs of Genetic Association Studies Using Summary Statistics. PLoS Genet. 2014;10(5):e1004383 10.1371/journal.pgen.1004383 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Fortune MD, Guo H, Burren O, Schofield E, Walker NM, Ban M, et al. Statistical colocalization of genetic risk variants for related autoimmune diseases in the context of common controls. Nat Genet. 2015; p. 839–849. 10.1038/ng.3330 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Giambartolomei C, Liu JZ, Zhang W, Hauberg M, Shi H, Boocock J, et al. A Bayesian framework for multiple trait colocalization from summary association statistics. Bioinformatics. 2018;34(15):2538–2545. 10.1093/bioinformatics/bty147 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Foley CN, Staley JR, Breen PG, Sun BB, Kirk PDW, Burgess S, et al. A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits; 2019. Available from: https://www.biorxiv.org/content/10.1101/592238v1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Wakefield J. Bayes factors for genome-wide association studies: comparison with P -values. Genet Epidemiol. 2009;33(1):79–86. 10.1002/gepi.20359 [DOI] [PubMed] [Google Scholar]
  • 15. Wellcome Trust Case Control Consortium, Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet. 2012;44(12):1294–1301. 10.1038/ng.2435 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016;48(3):245–252. 10.1038/ng.3506 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Hormozdiari F, van de Bunt M, Segre AV, Li X, Joo JWJ, Bilow M, et al. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet. 2016;99(6):1245–1260. 10.1016/j.ajhg.2016.10.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4(8):e1000167 10.1371/journal.pgen.1000167 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Yang J, Ferreira T, Morris AP, Medland SE, Genetic Investigation of ANthropometric Traits (GIANT) Consortium, DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012;44(4):369–75, S1–3. 10.1038/ng.2213 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun. 2018;9:1825 10.1038/s41467-018-03621-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Bhalala OG, Nath AP, Inouye M, Sibley CR, Consortium UKBE. Identification of expression quantitative trait loci associated with schizophrenia and affective disorders in normal brain tissue. PLoS Genet. 2018;14(8):e1007607 10.1371/journal.pgen.1007607 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Bryois J, Garrett ME, Song L, Safi A, Giusti-Rodriguez P, Johnson GD, et al. Evaluation of chromatin accessibility in prefrontal cortex of individuals with schizophrenia. Nat Commun. 2018;9:3121 10.1038/s41467-018-05379-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Endo C, Johnson TA, Morino R, Nakazono K, Kamitsuji S, Akita M, et al. Genome-wide association study in Japanese females identifies fifteen novel skin-related trait associations. Sci Rep. 2018;8:8974 10.1038/s41598-018-27145-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Gusev A, Mancuso N, Won H, Kousi M, Finucane HK, Reshef Y, et al. Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nat Genet. 2018;50(4):538–548. 10.1038/s41588-018-0092-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Hannon E, Schendel D, Ladd-Acosta C, Grove J, Hansen CS, Andrews SV, et al. Elevated polygenic burden for autism is associated with differential DNA methylation at birth. Genome Med. 2018;10 10.1186/s13073-018-0527-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Hirata T, Koga K, Johnson TA, Morino R, Nakazono K, Kamitsuji S, et al. Japanese GWAS identifies variants for bust-size, dysmenorrhea, and menstrual fever that are eQTLs for relevant protein-coding or long non-coding RNAs. Sci Rep. 2018;8:8502 10.1038/s41598-018-25065-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. James T, Linden M, Morikawa H, Fernandes SJ, Ruhrmann S, Huss M, et al. Impact of genetic risk loci for multiple sclerosis on expression of proximal genes in patients. Hum Mol Genet. 2018;27(5):912–928. 10.1093/hmg/ddy001 [DOI] [PubMed] [Google Scholar]
  • 28. Knowlest DA, Burrowet CK, Blischak JD, Patterson KM, Serie DJ, Norton N, et al. Determining the genetic basis of anthracycline-cardiotoxicity by molecular response QTL mapping in induced cardiomyocytes. Elife. 2018;7:e33480 10.7554/eLife.33480 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Lamontagne M, Berube JC, Obeidat M, Cho MH, Hobbs BD, Sakornsakolpat P, et al. Leveraging lung tissue transcriptome to uncover candidate causal genes in COPD genetic associations. Hum Mol Genet. 2018;27(10):1819–1829. 10.1093/hmg/ddy091 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Li J, Loebel A, Meltzer HY. Identifying the genetic risk factors for treatment response to lurasidone by genome-wide association study: A meta-analysis of samples from three independent clinical trials. Schizophr Res. 2018;199:203–213. 10.1016/j.schres.2018.04.006 [DOI] [PubMed] [Google Scholar]
  • 31. Mo A, Marigorta UM, Arafat D, Chan LHK, Ponder L, Jang SR, et al. Disease-specific regulation of gene expression in a comparative analysis of juvenile idiopathic arthritis and inflammatory bowel disease. Genome Med. 2018;10:48 10.1186/s13073-018-0558-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Morrow JD, Glass K, Cho MH, Hersh CP, Pinto-Plata V, Celli B, et al. Human Lung DNA Methylation Quantitative Trait Loci Colocalize with Chronic Obstructive Pulmonary Disease Genome-Wide Association Loci. Am J Respir Crit Care Med. 2018;197(10):1275–1284. 10.1164/rccm.201707-1434OC [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Mullin BH, Zhu K, Xu J, Brown SJ, Mullin S, Tickner J, et al. Expression Quantitative Trait Locus Study of Bone Mineral Density GWAS Variants in Human Osteoclasts. J Bone Miner Res. 2018;33(6):1044–1051. 10.1002/jbmr.3412 [DOI] [PubMed] [Google Scholar]
  • 34. Richard AC, Peters JE, Savinykh N, Lee JC, Hawley ET, Meylan F, et al. Reduced monocyte and macrophage TNFSF15/TL1A expression is associated with susceptibility to inflammatory bowel disease. PLoS Genet. 2018;14(9):e1007458 10.1371/journal.pgen.1007458 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Sun BB, Maranville JC, Peters JE, Stacey D, Staley JR, Blackshaw J, et al. Genomic atlas of the human plasma proteome. Nature. 2018;558(7708):73–79. 10.1038/s41586-018-0175-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Theriault S, Gaudreault N, Lamontagne M, Rosa M, Boulanger MC, Messika-Zeitoun D, et al. A transcriptome-wide association study identifies PALMD as a susceptibility gene for calcific aortic valve stenosis. Nat Commun. 2018;9:988 10.1038/s41467-018-03260-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Wang L, Pittman KJ, Barker JR, Salinas RE, Stanaway IB, Williams GD, et al. An Atlas of Genetic Variation Linking Pathogen-Induced Cellular Traits to Human Disease. Cell Host Microbe. 2018;24(2):308–323. 10.1016/j.chom.2018.07.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Wyss AB, Sofer T, Lee MK, Terzikhan N, Nguyen JN, Lahousse L, et al. Multiethnic meta-analysis identifies ancestry-specific and cross-ancestry loci for pulmonary function. Nat Commun. 2018;9:2976 10.1038/s41467-018-05369-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Xue A, Wu Y, Zhu Z, Zhang F, Kemper KE, Zheng Z, et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat Commun. 2018;9:2941 10.1038/s41467-018-04951-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Venkateswaran S, Prince J, Cutler DJ, Marigorta UM, Okou DT, Prahalad S, et al. Enhanced Contribution of HLA in Pediatric Onset Ulcerative Colitis. Inflamm Bowel Dis. 2018;24(4):829–838. 10.1093/ibd/izx084 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Dobbyn A, Huckins LM, Boocock J, Sloofman LG, Glicksberg BS, Giambartolomei C, et al. Landscape of Conditional eQTL in Dorsolateral Prefrontal Cortex and Co-localization with Schizophrenia GWAS. Am J Hum Genet. 2018;102(6):1169–1184. 10.1016/j.ajhg.2018.04.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Yao C, Chen G, Song C, Keefe J, Mendelson M, Huan T, et al. Genome-wide mapping of plasma protein QTLs identifies putatively causal genes and pathways for cardiovascular disease. Nat Commun. 2018;9:3268 10.1038/s41467-018-05512-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Alasoo K, Rodrigues J, Mukhopadhyay S, Knights AJ, Mann AL, Kundu K, et al. Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response. Nat Genet. 2018;50(3):424–431. 10.1038/s41588-018-0046-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Pierce BL, Tong L, Argos M, Demanelis K, Jasmine F, Rakibuz-Zaman M, et al. Co-occurring expression and methylation QTLs allow detection of common causal variants and shared biological mechanisms. Nat Commun. 2018;9:804 10.1038/s41467-018-03209-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Aguet F, Brown AA, Castel SE, Davis JR, He Y, Jo B, et al. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204–213. 10.1038/nature24277 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47(D1):D1005–D1012. 10.1093/nar/gky1120 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet. 2017;101(1):5–22. 10.1016/j.ajhg.2017.06.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Graur D. An Upper Limit on the Functional Fraction of the Human Genome. Genome Biol Evol. 2017;9(7):1880–1885. 10.1093/gbe/evx121 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Kellis M, Wold B, Snyder MP, Bernstein BE, Kundaje A, Marinov GK, et al. Defining functional DNA elements in the human genome. Proc Natl Acad Sci U S A. 2014;111(17):6131–6138. 10.1073/pnas.1318948111 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Pickrell JK, Berisa T, Liu JZ, Segurel L, Tung JY, Hinds DA. Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet. 2016;48(7):709–717. 10.1038/ng.3570 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Johnson SR, Tomlinson GA, Hawker GA, Granton JT, Feldman BM. Methods to elicit beliefs for Bayesian priors: a systematic review. J Clin Epidemiol. 2010;63(4):355–369. 10.1016/j.jclinepi.2009.06.003 [DOI] [PubMed] [Google Scholar]
  • 52. Guo H, Fortune MD, Burren OS, Schofield E, Todd JA, Wallace C. Integration of disease association and eQTL data using a Bayesian colocalisation approach highlights six candidate causal genes in immune-mediated diseases. Hum Mol Genet. 2015;24(12):3305–3313. 10.1093/hmg/ddv077 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Bossini-Castillo L, Glinos DA, Kunowska N, Golda G, Lamikanra A, Spitzer M, et al. Immune disease variants modulate gene expression in regulatory CD4+ T cells and inform drug targets; 2019. Available from: https://www.biorxiv.org/content/10.1101/654632v1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Trynka G, Westra HJ, Slowikowski K, Hu X, Xu H, Stranger BE, et al. Disentangling the Effects of Colocalizing Genomic Annotations to Functionally Prioritize Non-coding Variants within Complex-Trait Loci. Am J Hum Genet. 2015;97(1):139–152. 10.1016/j.ajhg.2015.05.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Iotchkova V, Ritchie GRS, Geihs M, Morganella S, Min JL, Walter K, et al. GARFIELD classifies disease-relevant genomic features through integration of functional annotations with association signals. Nat Genet. 2019;51(2):343–353. 10.1038/s41588-018-0322-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Wray NR, Lee SH, Mehta D, Vinkhuyzen AAE, Dudbridge F, Middeldorp CM. Research review: Polygenic methods and their application to psychiatric traits. J Child Psychol Psychiatry. 2014;55(10):1068–1087. 10.1111/jcpp.12295 [DOI] [PubMed] [Google Scholar]
  • 57. Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32(2):283–285. 10.1093/bioinformatics/btv546 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Benner C, Havulinna AS, Järvelin MR, Salomaa V, Ripatti S, Pirinen M. Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies. Am J Hum Genet. 2017;101(4):539–551. 10.1016/j.ajhg.2017.08.012 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Michael P Epstein, Hua Tang

26 Feb 2020

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Wallace,

Thank you very much for submitting your Research Article entitled 'Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved.

We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Michael P. Epstein

Associate Editor

PLOS Genetics

Hua Tang

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Colocalization is an increasingly important aspect of genetic fine mapping efforts (>60 papers in 2018) but, unusually in statistical genetics, the most popular software (“coloc”) implements a Bayesian analysis with subjective priors. This paper demonstrates the potential sensitivity of coloc to the prior probability of colocalization, examines a huge amount of data to elicit suggestions for setting reasonable values, and provides software for performing sensitivity analysis. In addition, the assumption of one causal variant per region per trait is examined and a new approach, called masking, is suggested for situations in which current methods cannot be applied. Overall this paper gives useful guidance to users of coloc and provides insights into the method that should be of value.

Minor comments

1. P4 L52 “ubiquity of genetic effects … concordant with an omnigenic model” – suggests that such ubiquity has been established when it remains a conjecture. Some rewording needed.

2. The Introduction starts off by introducing MR and appears to motivate colocalization primarily as a way to validate instruments in MR studies. But, as seen elsewhere in the paper, most of the applications are in delineating molecular pathways to disease. I’d suggest reworking the opening paragraph to better reflect the broader motivations for colocalization.

3. P7 L101 full text could be accessed for only 25 of 60 papers. Was this due to limitations of institutional subscriptions? Could not the corresponding authors provide manuscripts for research purposes?

4. P7 L104 it would be interesting to know also how many papers used eCaviar or some other method to deal with multiple causal variants. Also, how often did the original discovery studies perform conditional analyses and rule out additional causal variants? So that when going to colocalization, the single causal variant assumption can be justified to some extent.

5. P7 L107 “prior probability … will depend” – should say “may depend” since at this point we haven’t established this, and anyway since priors are subjective the user is free to believe that there is no dependence on the traits (but may then draw the wrong conclusion).

6. P8 L124 “more likely” should be “relatively more likely”, otherwise this sentence is confusing. Initially I found this sentence counter-intuitive – seems that by looking at fewer SNPs we are more likely to find colocalization – but the point is that the prior probability of colocalization is higher relative to distinct variants when fewer SNPs are considered. However the lower number of SNPs would provide less evidence for colocalization so this is a false economy. Anyway some interpretation should be added to this and the previous paragraph as it is unclear what one should conclude from the observations.

7. P8 L132 note that all the estimates of p’s and q’s are based on statistically significant SNPs, and the number of truly associated variants must be larger. So the elicited priors must be lower bounds. What implications does this have for the final inferences?

8. P9 L145 not clear how to get a posterior probability of association from just a prior and a p-value.

9. P10 L163, 165 the Appendix was not available to review.

10. P12 L208 “unlinked” -> “not in linkage disequilibrium”. There is a difference between linkage and LD.

11. P12 The masking method still needs an LD matrix, so the only real advantage over CoJo is that there is no need to align the alleles.

12. P12 The masking method looks a lot like “clumping” as often used, for example, in constructing polygenic risk scores. Please clarify the difference, or use the same term to prevent jargon creep.

13. P213 Figure 6 caption, “setting to 1 the Bayes factor” –the main text suggests setting the log Bayes factor to -3. Log in what base?

14. P14 L244 is it feasible to make the sensitivity analysis a default action in coloc, with the results being returned in the same object as the posteriors?

15. P16 the Discussion would benefit from a summary take-home message, such as that the default values of p1 and p2 are OK but p12 needs more thought (and a summary of how to do this would also help).

Typos etc

1. P3 L32 “underly” -> “underlie”

2. P4 L59 “For example…” – the sentence has no active verb.

3. P7 L117 delete “a”; change final “,” to “.”

4. P8 L140 the double “-“ is confusing, suggest just saying “to” or writing as an interval.

5. P9 L150 “One” -> “On”

6. P11 L178 in the equation below, can delete the intersection with A2 in the third expression.

7. P11 L180 spelling of “asymmetric”

8. P12 Figure 5 caption line 2, “belief” -> “beliefs”. What does the dotted line marked “results” mean?

9. P12 L212 “is” -> “are”

10. P14 L234 “are” -> “is”

11. P16 L288 “interpretable” -> “interpretation”

12. References are a bit sloppy, eg page numbers for refs 11 and 14.

13. Supplement P1, footnote 4 “P=1105” etc looks incorrect.

Reviewer #2: This paper considers two important extensions to the currently most popular

and influential colocalization method/software "coloc": a more suitable

prior specification (than the current default) and relaxing the assumption

of only one causal SNP. In particular, the first problem has been largely

ignored in practice while its implication is significant, as the

author has clearly shown in the paper. Although the proposed methods are

not technically sophisticated, they can be tremendously useful as implemented

in the "coloc" software. The paper was well written. I only have two very

minor comments.

Minor comments:

1. Prior elicitation is a well known and general problem in Bayesian statistics,

both important and challenging. I agree with the author on all her points,

and commend the author for providing a useful online tool "coloc explorer".

However, without a "default" prior, I am not sure how useful it would be to

a "typical" biologist without deep understanding of Bayesian statistics or

"coloc" method; in fact, I would be a bit worried that someone might do

"prior mining" to try to get more significant results. Some comments or

guidelines might be helpful to a typical user.

2. I completely agree with the author on both the advantages and limitations

of the conditioning approach as compared to the proposed "masking" approach.

However, if I understand correctly, with a typical small genomic region of

interest, one would potentially mask out ALL SNPs in the region that are in

LD with the lead SNP; in other words, is the new assumption simply

that there is at most only one causal SNP in EACH LD block? If true, it is

still like doing coloc analysis under the single causal SNP assumption for

each LD block, which can be too restrictive given that there are only about

two thousand (approximately independent) LD blocks in the huiman genome.

Some clarifications and comments would be helpful.

Reviewer #3: This manuscript investigated how to derive data driven priors for best power of COLOC, provided a sensitivity analysis framework to assess the robustness of priors, and proposed a new masking approach for dealing with scenarios with multiple signals per region. It is very useful to provide guidelines for users of COLOC about how to setup priors to achieve the best power. However, this paper does not provide a clear guideline to readers. I have the following comments:

1) It would help refresh reader’s mind if a brief description about the statistical procedure of the COLOC tool could be provided either in the Introduction section along with the five stated hypotheses, or at the beginning of the Results section.

2) I think it would be helpful to make a clear guideline table for readers, e.g., suggestive p1, p2, p12 prior values for a few different combinations of number of SNPs in the test region, total number of trait signals, if multiple signals exist in the test region. Or a such table could be provided for GTEx expression traits of different tissue types, which will provide readers a concrete example.

3) It would be helpful if the authors could provide some descriptions about “coloc explorer” and “condmask coloc” and how to implement these two tools in the supplementary text.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Decision Letter 1

Michael P Epstein, Hua Tang

17 Mar 2020

Dear Dr Wallace,

We are pleased to inform you that your manuscript entitled "Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Michael P. Epstein

Associate Editor

PLOS Genetics

Hua Tang

Section Editor: Natural Variation

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-02090R1

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Michael P Epstein, Hua Tang

9 Apr 2020

PGENETICS-D-19-02090R1

Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses

Dear Dr Wallace,

We are pleased to inform you that your manuscript entitled "Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Kaitlin Butler

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Summary of applied papers from 2018 using coloc.

    (PDF)

    S1 Text. Supporting mathematical derivations.

    (PDF)

    S1 Fig. Example of sensitivity analysis on a dataset which shows evidence for colocalisation at a predefined rule of posterior P(H4) > 0.5 across a wide range of p12.

    (TIF)

    S2 Fig. Average posterior probabilities for each hypothesis when trait 1 has two causal variants, and trait 2 has just one, according to whether the maximum r2 between multiple causal variants is ≤ 0.01 or > 0.01.

    (TIF)

    S3 Fig. Average posterior probabilities for each hypothesis when both traits have two causal variants, according to whether the maximum r2 between multiple causal variants is ≤ 0.01 or > 0.01.

    (TIF)

    Attachment

    Submitted filename: coloc response.pdf

    Data Availability Statement

    Code to run all the simulations and analyses is available at https://github.com/chr1swallace/coloc-mask-paper. The updated coloc package may be installed as described at https://chr1swallace.github.io/coloc.


    Articles from PLoS Genetics are provided here courtesy of PLOS

    RESOURCES