PLOS Genetics. 2021 Sep 20;17(9):e1009733. doi: 10.1371/journal.pgen.1009733

Identifying causal variants by fine mapping across multiple studies

Nathan LaPierre 1,#, Kodi Taraszka 1,#, Helen Huang 2, Rosemary He 3, Farhad Hormozdiari 6, Eleazar Eskin 1,4,5,*
Editor: Eleftheria Zeggini 7
PMCID: PMC8491908  PMID: 34543273

Abstract

Increasingly large Genome-Wide Association Studies (GWAS) have yielded numerous variants associated with many complex traits, motivating the development of “fine mapping” methods to identify which of the associated variants are causal. Additionally, GWAS of the same trait for different populations are increasingly available, raising the possibility of refining fine mapping results further by leveraging different linkage disequilibrium (LD) structures across studies. Here, we introduce multiple study causal variants identification in associated regions (MsCAVIAR), a method that extends the popular CAVIAR fine mapping framework to a multiple study setting using a random effects model. MsCAVIAR only requires summary statistics and LD as input, accounts for uncertainty in association statistics using a multivariate normal model, allows for multiple causal variants at a locus, and explicitly models the possibility of different SNP effect sizes in different populations. We demonstrate the efficacy of MsCAVIAR in both a simulation study and a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL).

Author summary

Genome-Wide Association Studies (GWAS) have successfully identified numerous genetic variants associated with a variety of complex traits in humans. However, most variants that are associated with traits do not actually cause those traits, but rather are correlated with the truly causal variants through Linkage Disequilibrium (LD). This problem is addressed by so-called “fine mapping” methods, which attempt to prioritize putative causal variants for functional follow-up studies. In this work, we propose a new method, MsCAVIAR, which improves fine mapping performance by leveraging data from multiple studies, such as GWAS of the same trait using individuals with different ethnic backgrounds (“trans-ethnic fine mapping”), while taking into account the possibility that causal variants may affect the trait more or less strongly in different studies. We show in simulations that our method reduces the number of variants needed for functional follow-up testing versus other methods, and we also demonstrate the efficacy of MsCAVIAR in a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL).

Introduction

Genome-Wide Association Studies (GWAS) have successfully identified numerous genetic variants associated with a variety of complex traits in humans [1–3]. However, most of these associated variants are not causal, and are simply in Linkage Disequilibrium (LD) with the true causal variants. Identifying these causal variants is a crucial step towards understanding the genetic architecture of complex traits, but testing all associated variants at each locus using functional studies is cost-prohibitive. This problem is addressed by statistical “fine mapping” methods, which attempt to prioritize a small subset of variants for further testing while accounting for LD structure [4].

The classic approach to fine mapping involves simply selecting a given number of SNPs with the strongest association statistics for follow-up, but this performs sub-optimally because it does not account for LD structure [5]. Bayesian methods that did account for LD structure were developed [6, 7], but were based upon the simplifying assumption that each locus only harbors a single causal variant, which is not true in many cases [8]. Additionally, many early methods required individual-level genetic data, whereas many human GWAS often provide only summary statistics due to privacy concerns. CAVIAR [8] introduced a Bayesian approach that relied only on summary statistics and LD, accounted for uncertainty in association statistics using a multivariate normal (MVN) distribution, and allowed for the possibility of multiple causal SNPs at a locus. This approach was widely adopted and later made more efficient by methods such as CAVIARBF [9], FINEMAP [10], and JAM [11].

There is growing interest in improving fine mapping by leveraging information from multiple studies. One of the most important examples of this is trans-ethnic fine mapping, which can significantly improve fine mapping power and resolution by leveraging the distinct LD structures in each population [12–14], as seen in methods such as trans-ethnic PAINTOR [15] and MR-MEGA [16]. Intuitively, the set of SNPs that are tightly correlated with the causal SNP(s) will be different in different populations, allowing more SNPs to be filtered out as potential candidates. However, the varying LD patterns also present a unique challenge in the multiple study setting that trans-ethnic fine mapping methods must handle. Additionally, while there is evidence that the same SNPs drive association signals across populations [12, 17, 18], there is also heterogeneity in their effect sizes [13, 17, 18], presenting another challenge. Existing methods either assume a single causal SNP at each locus [16, 19] or do not explicitly model heterogeneity [15], limiting their power [20].

In this paper, we present MsCAVIAR, a novel method that addresses these challenges. We retain the Bayesian MVN framework of CAVIAR while introducing a novel approach to explicitly account for the heterogeneity of effect sizes between studies using a Random Effects (RE) model. Our method requires only summary statistics and LD matrices as input, allows for multiple causal variants at a locus, and models uncertainty in association statistics and between-study heterogeneity. The output is a set of SNPs that, with a user-set confidence threshold (e.g. 95%), contains all causal SNPs at the locus.

We show in simulation studies that MsCAVIAR outperforms existing trans-ethnic fine mapping methods [15] and extensions of methods such as CAVIAR [8] to the multiple study setting. We further demonstrate the efficacy of MsCAVIAR in a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein. MsCAVIAR is freely available at https://github.com/nlapier2/MsCAVIAR.

Results

MsCAVIAR overview

Our method, MsCAVIAR, takes as input the association statistics (e.g. Z-scores) and linkage disequilibrium (LD) matrix for SNPs at the same locus in each study (Fig 1A). The LD matrix can be computed from in-sample genotyped data or appropriate reference panels such as the 1000 Genomes project [21] or HapMap project [22]. MsCAVIAR computes and outputs a minimal-sized “causal set” of SNPs that, with probability at least ρ, contains all causal SNPs, and ideally contains far fewer SNPs than the set of significant SNPs obtained via meta-analysis (Fig 1B).

Fig 1. Overview of MsCAVIAR.

Fig 1

(A) Simulated Z-scores for SNPs at a quantitative trait locus in two different populations: East Asians (top) and Europeans (bottom), shown by their −log10(p-value). LD matrices for these populations were derived using data from the 1000 Genomes project. These are the input files to MsCAVIAR. (B) Meta-analysis results for this locus, showing many significant SNPs. Also displayed are the SNPs that are in the causal set that MsCAVIAR returns (red stars) and the truly causal SNPs (black stars).

By our definition of a causal set, every causal SNP must be contained in the set with high probability, but not every SNP in the set needs to be causal. Concretely, each SNP can be assigned a binary causal status: 1 for causal or 0 for non-causal. So long as none of the SNPs outside of the causal set are set to 1, the assignments are compatible with our definition of a causal set. We can represent these causal status assignments in a binary vector with one entry for each SNP denoting its causal status; we call such a vector a “configuration” and denote it as C. For each configuration C compatible with the causal set, we compute its (posterior) probability in a Bayesian manner: the probability of a configuration of SNPs being causal given the association statistics can be computed by modeling a prior probability for that configuration and a likelihood function for the association statistics given the assumed causal SNPs given by C (see Methods for details).

The overall likelihood function can be decomposed into a product over the likelihood function for each study, since we assume that the studies are independent. More specifically, we assume that there is a true global effect size for a SNP over all possible populations, around which the effect sizes for that SNP in different studies are independently drawn according to a heterogeneity variance parameter (Methods). This allows MsCAVIAR to model the fact that effect sizes of a SNP across different studies are related, but not equal. Because we expect the summary statistics to be a function of their LD with the causal SNPs, the parameters of the likelihood function for each study are different, assuming the studies have different LD patterns. By computing the product over the likelihood of each study, we are able to account for their different LD patterns to determine the likelihood over all the studies.

The posterior probability for a causal set is then computed by summing the posterior probabilities of all compatible configurations, and then dividing by the sum of the posterior probabilities for all possible configurations. We start by assessing causal sets containing only one SNP, and then causal sets containing two SNPs, and then three SNPs, and so on until a causal set exceeds the posterior probability threshold ρ. In practice, ρ is set to a high value such as 95%.

MsCAVIAR improves fine mapping resolution in a simulation study

We now describe our simulation study to evaluate the performance of MsCAVIAR as compared with other methods. We selected two samples of 9,000 unrelated individuals from the UK Biobank [23], one with European ancestry and the other with Asian ancestry. In order to generate realistic fine mapping scenarios, we centered 100kbp windows around SNPs that reached genome-wide significant association with High-Density Lipoprotein cholesterol in the UK Biobank [23] summary statistics released by the Neale lab [24]. From these windows, we selected three loci that reflected high, medium, and low patterns of LD as defined by the proportion of SNPs with at least 90% LD (32%, 25%, and 8%, respectively). We then obtained the imputed genotype data for these loci for our samples in the UK Biobank. The loci were filtered for missing genotypes (> 0%) and low minor allele frequency (< 1%). The loci with low, medium, and high LD had 144, 126, and 154 SNPs, respectively.

We then simulated causal SNPs and their effect sizes $\beta_i \sim N(5.2/\sqrt{9000}, 1)$, for the cases of 1, 2, or 3 causal SNPs randomly chosen within each locus. For simplicity, we take the absolute value of the effect size and restrict causal SNPs to being positively correlated with each other. We then used GCTA [25] to simulate phenotypes using different heritability levels: 0.2%, 0.4%, 0.6%, 0.8%, and 1%, times the number of causal SNPs. Concretely, GCTA simulates the phenotypes $y$ according to $y = X\beta + e$, where $X$ is the standardized genotype matrix for the causal variant(s), $\beta$ is the vector of causal variant effect sizes, and $e$ is a vector of environmental noise terms where each $e_i$ has variance $\sigma_g^2(1/h^2 - 1)$, with $\sigma_g^2$ the variance of the genetic component $X\beta$ and $h^2$ the heritability. In other words, the environmental variance is scaled to achieve the desired heritability. Thus, modulating the heritability affects the strength of the association signal between variants and the phenotype, while drawing different $\beta_i$ for different causal variants allows for the modeling of heterogeneity.
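The phenotype model above can be illustrated with a minimal NumPy sketch. This is not GCTA itself: the function name `simulate_phenotype`, the toy genotypes, and the random seed are our own assumptions, and the effect-size mean 5.2/√9000 follows our reading of the distribution stated above.

```python
import numpy as np

def simulate_phenotype(X, beta, h2, rng):
    """Sketch of y = X @ beta + e, with residual variance var_g * (1/h2 - 1)."""
    g = X @ beta                      # genetic component per individual
    var_g = g.var()                   # empirical genetic variance (sigma_g^2)
    var_e = var_g * (1.0 / h2 - 1.0)  # environmental variance for the target h2
    e = rng.normal(0.0, np.sqrt(var_e), size=X.shape[0])
    return g + e, var_g, var_e

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
n = 9000                              # sample size used in the simulation study
# Toy genotypes for two causal SNPs (hypothetical minor allele frequency 0.3).
G = rng.binomial(2, 0.3, size=(n, 2)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
# Absolute value of normally distributed effects, as described in the text.
beta = np.abs(rng.normal(5.2 / np.sqrt(n), 1.0, size=2))
# 0.2% heritability per causal SNP times two causal SNPs.
y, var_g, var_e = simulate_phenotype(X, beta, h2=0.004, rng=rng)
```

By construction, `var_g / (var_g + var_e)` equals the requested heritability, which is the sense in which the environmental variance is "scaled".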

Finally, we ran a linear regression using fastGWA [26] to generate the summary statistics. We simulated 20 replicates (re-drawing the causal SNPs and their effect sizes) for each combination of locus, heritability level, and number of causal SNPs, for a total of 900 simulations (20 replicates × 3 loci × 5 heritability levels × 3 causal SNP counts).

Using these data, we compared MsCAVIAR to the trans-ethnic mode of PAINTOR [15] and to CAVIAR [8] run on Asians and Europeans, individually (Fig 2). For each number of causal SNPs (1, 2, or 3), we averaged the results across all simulated scenarios. For each method, we provided the in-sample LD and the summary statistics described above. All methods were run with posterior probability threshold ρ* = 0.95, so methods with 95% or higher sensitivity were considered “well-calibrated” (dashed line in Fig 2A). MsCAVIAR’s heterogeneity parameter was set to τ² = 0.52 (Methods). We also evaluated methods for the size of their returned causal sets (Fig 2B) because, conditioned on having a well-calibrated recall, it is preferable to return a small causal set. This can be thought of as higher “precision”, as non-causal SNPs in the causal sets can be thought of as “false positives”.
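To make the two evaluation criteria concrete, the following sketch computes sensitivity (the fraction of simulations whose returned set contains every true causal SNP) and the mean set size. The helper names and the four toy runs are hypothetical and are not taken from the study.

```python
def covers_all_causal(causal_set, true_causal):
    """True when every truly causal SNP is inside the returned causal set."""
    return set(true_causal) <= set(causal_set)

def summarize(runs):
    """runs: list of (returned causal set, true causal SNPs) pairs."""
    hits = [covers_all_causal(returned, truth) for returned, truth in runs]
    sizes = [len(returned) for returned, _ in runs]
    sensitivity = sum(hits) / len(runs)   # compared against rho* = 0.95
    mean_size = sum(sizes) / len(sizes)   # smaller is better once calibrated
    return sensitivity, mean_size

# Hypothetical toy runs: SNPs are identified by integer index.
runs = [
    ({3, 7, 9}, {7}),
    ({1, 2}, {2}),
    ({4, 5, 6, 8}, {5, 8}),
    ({0, 1}, {9}),          # a miss: the causal SNP is outside the set
]
sensitivity, mean_size = summarize(runs)
```

A method would count as well-calibrated in the sense used here when `sensitivity` is at least the threshold ρ*.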

Fig 2. Comparison of sensitivity, precision, and set sizes using simulated data.

Fig 2

We compare MsCAVIAR, PAINTOR, and CAVIAR with c ∈ {1, 2, 3} causal variants implanted, with results averaged over 20 replicates for 3 loci and 5 levels of heritability for all 3 values of c. (A) Bar graph indicating the sensitivity of each method, with a dashed line to reflect the expected posterior probability, ρ, of recovering all causal SNPs. (B) Box plots showing the average set sizes returned by the methods. Each box is the interquartile range of causal set sizes, with the middle black line representing the median and the white crosses showing the mean. (C) Bar graph displaying the average number of SNPs, taken in descending order of posterior inclusion probability (PIP), needed until 1, 2, or 3 causal SNPs are identified. Stacked bars represent increasing numbers of causal SNPs identified, until the true number of causal SNPs (x-axis) is identified.

All of the methods in this assessment were well-calibrated (Fig 2A), which is expected, as previously shown for CAVIAR [8] and PAINTOR [15]. For each number of causal SNPs, MsCAVIAR and PAINTOR returned substantially smaller set sizes than CAVIAR run on either population individually, highlighting the benefit of utilizing information from multiple studies.

With one causal SNP in the locus, MsCAVIAR and PAINTOR had similar causal set sizes, with MsCAVIAR’s mean and median set sizes being 18.7 and 15.0 and PAINTOR’s being 17.6 and 14.0, respectively. When there were two causal SNPs simulated, MsCAVIAR’s causal sets were smaller on average than PAINTOR’s, and the difference increased when three causal SNPs were simulated. When two causal SNPs were simulated, MsCAVIAR’s mean and median set sizes were 41.6 and 36.5, respectively, while PAINTOR’s mean and median set sizes were 46.4 and 45.5, respectively. Finally, with three causal SNPs, MsCAVIAR had mean and median set sizes of 52.4 and 50.0, respectively, and PAINTOR’s were 60.1 and 64.5, respectively.

As the goal of most statistical fine mapping methods is to prioritize variants for functional follow-up, this raises the question of how informative a variant’s posterior probability is about its causal status. We therefore sort the SNPs in descending order of posterior probability to determine, on average, how many SNPs must be added to the causal set before the causal SNPs are included.

We evaluated this quantity for MsCAVIAR, PAINTOR, and CAVIAR run on the Asian and European populations (Fig 2C). MsCAVIAR and PAINTOR were generally better at prioritizing variants than CAVIAR, again highlighting the importance of utilizing multiple studies when possible. On average, MsCAVIAR was able to capture the causal variant(s) with fewer SNPs than PAINTOR.
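The ranking metric can be sketched as follows: sort SNPs by descending posterior probability and record how far down the list each causal SNP sits. The function name and the toy posterior values are hypothetical, and ties are broken by SNP order as an arbitrary assumption.

```python
import numpy as np

def snps_needed(pip, causal_idx):
    """Number of top-ranked SNPs needed to capture the 1st, 2nd, ... causal SNP."""
    order = np.argsort(-np.asarray(pip, dtype=float), kind="stable")
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(1, len(order) + 1)   # rank 1 = highest posterior
    return sorted(rank[list(causal_idx)].tolist())

# Hypothetical posterior probabilities for five SNPs; SNPs 1 and 4 are causal.
pip = [0.05, 0.60, 0.10, 0.90, 0.20]
needed = snps_needed(pip, causal_idx=[1, 4])
```

Here `needed[k-1]` is the number of SNPs that must be taken before k causal SNPs are captured, which corresponds to the stacked bars in Fig 2C.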

Trans-Biobank fine mapping of high density lipoprotein loci

In order to evaluate the performance of MsCAVIAR on real data, we performed a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL) using summary statistics from the UK Biobank (UKB) [23, 24] and Biobank Japan (BBJ) [27, 28] projects. These studies involved 361,194 and 70,657 people, respectively. The UKB summary statistics, obtained from the Neale lab [24], were generated using only White British individuals.

To generate loci for fine mapping, we centered 1 megabase windows around genome-wide significant peak SNPs (p-value ≤ 5 × 10⁻⁸), discarding all SNPs that did not reach even marginal significance (p > 0.05), as they were highly unlikely to be informative and would slow down analyses. We also excluded all loci with fewer than ten SNPs in each study after filtering SNPs with p > 0.05, as fine mapping may not be seen as necessary or may even be trivial for existing methods when there are only a few strongly associated SNPs. Two very large loci were excluded for computational reasons. We excluded loci from chromosome six, where there were numerous statistically significant SNP effect sizes due to the presence of human leukocyte antigen (HLA) regions.

The procedures described above yielded 185 loci consisting of 29,479 SNPs in total. Individual locus sizes ranged from 11 to 755 SNPs. All but two SNPs in the loci had a minor allele frequency of at least 1% in at least one of the studies. Linkage disequilibrium (LD) matrices were generated from the 1000 Genomes project [21], with “European” and “East Asian” as the population names, using the “CalcLD_1KG_VCF.py” script from the PAINTOR [29] GitHub repository. We used the 1000 Genomes project to generate LD to reflect the common situation where summary statistics are available but not the full genotyped data [8, 10].

We ran CAVIAR [8], the trans-ethnic mode of PAINTOR [15], and MsCAVIAR on these loci, and evaluated their causal set sizes, since these methods have been shown to be well-calibrated and no ground truth is available (Fig 3). For MsCAVIAR, we set the heterogeneity parameter τ² (Methods) to its default value of 0.52. For CAVIAR, we evaluated its performance when applying it to only the Asian (BBJ) data or to only the European (UKB) data. For all methods, we set the posterior probability threshold ρ* to 95% and set the maximum number of causal SNPs to 3.

Fig 3. Comparing fine mapping resolution in trans-ethnic HDL analysis.

Fig 3

Comparison of the results of MsCAVIAR when applied to 185 loci from two high-density lipoprotein (HDL) GWAS, White European people from the UK Biobank [23, 24] and Japanese people from Biobank Japan [27, 28], versus trans-ethnic PAINTOR [15] and applying CAVIAR [8] to each population individually. The y-axis is the size of the causal set for each locus. The boxes represent the interquartile range of causal set sizes identified by each tool, the lines inside the boxes represent the median, and the whiskers extend to the non-outlier extremes. Outliers are represented as dots above or below the whiskers.

While the original loci totaled 29,479 SNPs, averaging 159.3 SNPs per locus, the causal sets returned by MsCAVIAR totaled 9,390 SNPs, averaging 50.8 SNPs per locus with a median of 31 SNPs. Meanwhile, PAINTOR’s causal sets totaled 9,118 SNPs (49.3 average, 34 median), CAVIAR’s sets using the UKB data totaled 11,538 SNPs (62.4 average, 44 median), and CAVIAR’s sets using the BBJ data totaled 18,520 SNPs (100.0 average, 70 median). Thus, similarly to our simulation study’s findings, MsCAVIAR and PAINTOR generally returned smaller causal set sizes than CAVIAR, and MsCAVIAR’s median causal set size was slightly smaller than PAINTOR’s. In contrast with the simulation study, MsCAVIAR’s average causal set size was slightly larger than PAINTOR’s. A full list of the loci we identified and the causal set sizes returned can be found in Table A in S1 Text.

As an additional way of viewing the results, we generated scatter plots of the causal set sizes at each locus for MsCAVIAR compared to those of PAINTOR and CAVIAR (Fig 4). This visualizes the comparative causal set sizes at individual loci. The scatter plots and their associated lines of equality reveal that MsCAVIAR’s set sizes were consistently smaller than CAVIAR’s across almost all loci, with one notable exception in which CAVIAR’s causal set size was substantially smaller than MsCAVIAR’s. The comparison with PAINTOR illustrates how MsCAVIAR’s median causal set size was smaller than PAINTOR’s but its average was higher: MsCAVIAR returned slightly smaller causal set sizes than PAINTOR for most loci, but in some cases, MsCAVIAR’s causal set size was much larger than PAINTOR’s, dragging MsCAVIAR’s average causal set size above that of PAINTOR.

Fig 4. Comparison of methods’ set sizes for each locus in the trans-ethnic HDL analysis.

Fig 4

Comparison of the returned causal set sizes of MsCAVIAR when applied to two high-density lipoprotein (HDL) GWAS, White European people from the UK Biobank [23, 24] and Japanese people from Biobank Japan [27, 28], versus trans-ethnic PAINTOR [15] and applying CAVIAR [8] to each population individually. In each scatter plot, each point reflects a specific locus, and the x-coordinate is MsCAVIAR’s returned causal set size, while the y-coordinate is a different method’s causal set size. Diagonal lines representing equal set sizes were plotted for each scatter plot. Points above the line represent loci where the alternate method had a larger causal set size than MsCAVIAR, while points below the line indicate the opposite.

Discussion

In this work, we introduced MsCAVIAR, a method for identifying causal variants in associated regions while leveraging information from multiple studies. Our approach requires only summary statistics as opposed to genotype data and handles heterogeneity of effect sizes, differing sample sizes, and different LD structures between studies, making trans-ethnic fine mapping an ideal application. We demonstrated that our method is well-calibrated and improves fine-mapping resolution in simulation studies. MsCAVIAR is available as free and open source software at https://github.com/nlapier2/MsCAVIAR.

We make several important assumptions in this model, which may not always be true. It has been shown that many causal SNPs are shared across populations [12, 17, 18]. MsCAVIAR is designed to leverage this phenomenon for increased power; however, causal variants may be unique to one population. In those instances, MsCAVIAR’s model doesn’t match the data, so it may not be well-calibrated or it may return large causal sets. If one population has an obvious GWAS signal while the other population(s) lack even a marginally significant signal in the same locus, applying CAVIAR to the population with signal may be more appropriate.

We also assume that all studies are drawn with equal heterogeneity τ². This is unlikely to be true if multiple studies are from a single population while another study is from a different population. In such a scenario, we recommend grouping the studies by population, running fixed effects meta-analysis on each group, and then running MsCAVIAR on the results for the different groups. Concretely, the input summary statistics for MsCAVIAR should be the results from the meta-analysis of each population, and the input LD matrices should be derived from either the genotype data (if available) or the appropriate reference panels for each population. However, it is still possible that even ostensibly different populations may be more similar to each other at certain loci than other populations. Therefore, we plan to extend our method to handle this case in future work.
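The per-population meta-analysis step could look like the sketch below. The text does not specify a weighting scheme, so this assumes a sample-size-weighted combination of Z-scores, one common fixed effects approach; the function name and the example numbers are hypothetical.

```python
import numpy as np

def meta_z(z_scores, sample_sizes):
    """Sample-size-weighted fixed effects combination of per-study Z-scores.

    z_scores: array of shape (studies, SNPs) for ONE population group.
    sample_sizes: one sample size per study.
    """
    z = np.asarray(z_scores, dtype=float)
    w = np.sqrt(np.asarray(sample_sizes, dtype=float))  # weight = sqrt(N)
    return (w[:, None] * z).sum(axis=0) / np.sqrt((w ** 2).sum())

# Two hypothetical studies from the same population, three SNPs each.
z_group = [[3.0, 0.5, -1.0],
           [3.0, 1.5, -2.0]]
combined = meta_z(z_group, sample_sizes=[5000, 5000])
```

Under this sketch, each population contributes one combined Z-score vector, which would then be passed to MsCAVIAR together with that population's LD matrix.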

In practice, we set the τ² parameter to a fixed value, which was chosen to give power to detect both small and large amounts of heterogeneity (Methods, “Parameter Setting in Practice”). This value could, in principle, be adjusted based on the apparent heterogeneity present in the data. However, care would have to be taken to not overfit the parameter to the summary statistics in each locus, since the heterogeneity of different causal SNPs can vary across loci and some causal SNPs may be missed when the heterogeneity parameter is overfitted. Future work could develop a procedure for fitting this parameter.

Several methodological extensions to MsCAVIAR are possible as well. MsCAVIAR aims to return a causal set that contains all causal SNPs in a locus, while another fine mapping method, SuSiE [30], solves a complementary problem by returning one or more credible sets that each contain at least one causal SNP. The advantage of the former approach is its completeness in terms of identifying all causal signals, while the advantage of the latter approach is its ability to separate distinct causal signals within a locus into separate sets. A future extension to MsCAVIAR could aim to combine the benefits of both by returning a causal set with all causal SNPs, and then partitioning this set into distinct subsets with separate causal signals.

Functional information can in principle be factored into MsCAVIAR’s model by modifying the prior distribution P(C) so that not every variant has the same prior probability of being causal, as described in the CAVIAR paper [8]. However, setting these priors arbitrarily can yield misleading results, and future work is needed to determine how best to model various functional priors in the context of MsCAVIAR’s model.

Finally, stochastic search could be used to speed up MsCAVIAR in cases where there are possibly many causal variants [10, 31]. MsCAVIAR’s runtime is largely determined by the number of SNPs in the locus and the number of causal SNPs allowed: if there are $M$ total SNPs and up to $K$ are allowed to be causal, then there are potentially up to $\binom{M}{K}$ causal status vectors to evaluate. Thus, runtime can become an issue when there are many SNPs in a locus or many studies, and especially when users desire to allow for more than three possibly causal SNPs at a locus. Stochastic search can help reduce the search space by not evaluating every possible combination of causal SNPs, though this involves managing the risk of missing the optimally minimal causal set.
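A quick way to see how the search space grows is to count the causal status vectors directly. The sketch below counts every vector with at most K causal SNPs, including the all-zero vector; whether MsCAVIAR evaluates the all-zero vector is an assumption on our part, and the function name is hypothetical.

```python
from math import comb

def n_configurations(M, K):
    """Count binary causal status vectors of length M with at most K ones."""
    return sum(comb(M, k) for k in range(K + 1))

# Locus sizes from the simulation study, with the cap of 3 causal SNPs.
counts = {M: n_configurations(M, 3) for M in (126, 144, 154)}
# Raising the cap to 5 for the largest real-data locus (755 SNPs).
large = n_configurations(755, 5)
```

The count is dominated by the largest term, the binomial coefficient of M and K, which is why raising the cap on causal SNPs is far more expensive than adding a few SNPs to the locus.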

Methods

Overview of the MsCAVIAR model

In this section, we expand upon the high-level discussion of the method given in “MsCAVIAR Overview”. We briefly describe the search for the minimal-sized causal set of SNPs and the generative model behind it. In the following subsections, we describe the computational details in depth.

As discussed in the “MsCAVIAR Overview” section, our method takes as input the association statistics (i.e. Z-scores) and linkage disequilibrium (LD) matrix at the same locus for each study. MsCAVIAR computes and outputs a minimal-sized “causal set” of SNPs that, with probability at least ρ, contains all causal SNPs. By our definition of a causal set, every causal SNP must be contained in the set with high probability, but not every SNP in the set needs to be causal. Concretely, each SNP can be assigned a binary causal status: 1 for causal or 0 for non-causal. So long as none of the SNPs outside of the causal set are set to 1, the assignments are compatible with our definition of a causal set. We can represent these causal status assignments in a binary vector with one entry for each SNP denoting its causal status; we call such a vector a “configuration” and denote it as $C$. For a putative causal set of SNPs $K$, we define $\mathcal{C}_K$ as the set of causal configurations that are compatible with this causal set, that is, the set of vectors with no ‘1’ entries for SNPs not in $K$.

For each configuration C* compatible with the causal set, we compute its (posterior) probability in a Bayesian manner:

$$P(C^* \mid S) = \frac{P(S \mid C^*)\,P(C^*)}{\sum_{C \in \mathcal{C}} P(S \mid C)\,P(C)} \tag{1}$$

where $S$ denotes the summary statistics for all input studies, and $\mathcal{C}$ is the space of possible causal status indicator vectors (including those not compatible with the causal set).

We must now define the likelihood P(S|C) and the prior P(C). For the prior, we assume that each variant is equally likely to be causal, with probability γ, and thus the prior probability P(C) is

$$P(C) = \prod_{j=1}^{m} \gamma^{C_j}(1-\gamma)^{1-C_j} \tag{2}$$

where $C_j$ is the $j$th entry (SNP) in $C$. The likelihood for $P(S \mid C)$ can be written as

$$S \mid C \sim N(0, \Sigma + \Sigma\Sigma_C\Sigma) \tag{3}$$

where $\Sigma$ is a block-diagonal matrix where each block corresponds to one study’s LD matrix and $\Sigma_C$ is a matrix modeling the covariance structure between the causal SNPs. Further computational details on the model are provided in the subsections below, but we will make two statements here for clarity.

The first is that the likelihood function in Eq 3 depends on the assumption that the summary data $S_q$ for each study $q$ are independently distributed as $S_q \mid \Lambda_q \sim N(\Sigma_q\Lambda_q, \Sigma_q)$, where $\Sigma_q$ is the LD matrix for the study and $\Lambda_q$ is its vector of non-centrality parameters. This is then coupled with the assumed prior $\Lambda_q \mid C \sim N(0, \Sigma_{C_q})$, where $\Sigma_{C_q}$ is the covariance structure between causal SNPs for study $q$. Using the distribution of $\Lambda_q$ as a conjugate prior, the overall distribution in a single study is $S_q \mid C \sim N(0, \Sigma_q + \Sigma_q\Sigma_{C_q}\Sigma_q)$. This is restated and more fully described in the following section (“Fine mapping in a single study”), particularly in Eqs 8–11.

The second is that Eq 3 also depends on the assumption of how the causal variants in each study relate to one another. We began by concatenating the non-centrality parameters across $Q$ studies, each containing $M$ SNPs, to create the $QM$-length vector $\mathrm{vec}(\Lambda)$. The distribution of $\mathrm{vec}(\Lambda)$ is $\mathrm{vec}(\Lambda) \sim N(0, \Sigma_C)$, where $\Sigma_C$ can be written using the following Kronecker product ($\otimes$):

$$\Sigma_C = (\tau^2 I_Q + \sigma^2 1_Q 1_Q^T) \otimes \mathrm{diag}(1_{causal})_M \tag{4}$$

where $\sigma^2$ corresponds to the non-centrality parameter’s variance as seen in the per-study setting (see Eq 10), which we assume is identical across studies, and where $\tau^2$ captures the heterogeneity between studies (see Eq 21). Let $1_Q 1_Q^T$ be a $(Q \times Q)$ matrix of all 1s, $I_Q$ a $(Q \times Q)$ identity matrix, and $\mathrm{diag}(1_{causal})_M$ an $(M \times M)$ diagonal matrix whose diagonal entries are given by the $(1 \times M)$ indicator vector $1_{causal}$, whose entry $m$ is 1 if SNP $m$ is causal and 0 otherwise. The Kronecker product for $\Sigma_C$ is stated again as Eq 26 in the section “Efficient meta-analysis”, where it is more fully described.
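As an illustration of Eqs 3 and 4, the sketch below builds Σ_C with a Kronecker product and evaluates the multivariate normal log-likelihood for a toy two-study, three-SNP locus. It is not the MsCAVIAR implementation: the function names, the LD matrices, the Z-scores, and the value σ² = 5.2 are illustrative assumptions, and the SNPs are ordered study by study to match vec(Λ).

```python
import numpy as np

def sigma_c(causal, n_studies, sigma2, tau2):
    """Eq 4: (tau^2 I_Q + sigma^2 1 1^T) kron diag(causal indicator)."""
    between = tau2 * np.eye(n_studies) + sigma2 * np.ones((n_studies, n_studies))
    return np.kron(between, np.diag(np.asarray(causal, dtype=float)))

def block_diagonal(blocks):
    """Stack per-study LD matrices into one block-diagonal matrix."""
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    start = 0
    for b in blocks:
        stop = start + b.shape[0]
        out[start:stop, start:stop] = b
        start = stop
    return out

def log_likelihood(S, ld_blocks, causal, sigma2, tau2):
    """Eq 3: log density of S under N(0, Sigma + Sigma Sigma_C Sigma)."""
    Sigma = block_diagonal(ld_blocks)
    Sc = sigma_c(causal, len(ld_blocks), sigma2, tau2)
    cov = Sigma + Sigma @ Sc @ Sigma
    _, logdet = np.linalg.slogdet(cov)
    quad = S @ np.linalg.solve(cov, S)
    return -0.5 * (len(S) * np.log(2 * np.pi) + logdet + quad)

# Hypothetical LD matrices for two studies over the same three SNPs.
lds = [np.array([[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]]),
       np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.5], [0.1, 0.5, 1.0]])]
S = np.array([4.1, 5.0, 1.2, 1.9, 4.6, 2.4])   # study 1 Z-scores, then study 2
sigma2, tau2 = 5.2, 0.52                        # sigma2 is an illustrative value
Sc = sigma_c([0, 1, 0], n_studies=2, sigma2=sigma2, tau2=tau2)
ll = log_likelihood(S, lds, [0, 1, 0], sigma2, tau2)
```

In `Sc`, the entry linking the causal SNP to itself within a study is σ² + τ², while the entry linking it across the two studies is σ²; this is how the model encodes effect sizes that are related but not equal.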

With these clarifications on the likelihood function P(S|C), we now proceed to how we calculate the posterior probability that K contains all the causal SNPs:

$$P(\mathcal{C}_K \mid S) = \sum_{C^* \in \mathcal{C}_K} P(C^* \mid S) \tag{5}$$

The goal is then to find the minimum-sized set K* that has a posterior probability of at least ρ*, called the “ρ* confidence set”:

$$P(\mathcal{C}_{K^*} \mid S) \geq \rho^* \tag{6}$$

This is done by evaluating causal configuration vectors with only one non-zero element, and then those with two non-zero elements, and so on until the end condition above is met. In practice, we limit the search space $\mathcal{C}$ by capping the number of causal SNPs allowed; this maximum is user-settable and defaults to 3.
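The search described above can be sketched as a brute-force procedure for toy loci: score every configuration with at most a fixed number of causal SNPs (Eqs 1 and 2), then grow the candidate set size until some set reaches the threshold (Eqs 5 and 6). The exhaustive enumeration over sets, the single-study toy likelihood, and the values γ = 0.01 and σ² = 5.2 are assumptions made for illustration; the actual software may search differently.

```python
import itertools
import numpy as np

def enumerate_configs(M, max_causal):
    """All binary vectors of length M with 1..max_causal causal SNPs."""
    for k in range(1, max_causal + 1):
        for idx in itertools.combinations(range(M), k):
            c = np.zeros(M, dtype=int)
            c[list(idx)] = 1
            yield c

def config_posteriors(log_lik, M, gamma, max_causal):
    """Eq 1 with the prior of Eq 2, normalized over the enumerated space."""
    configs = list(enumerate_configs(M, max_causal))
    log_post = np.array([
        log_lik(c) + c.sum() * np.log(gamma) + (M - c.sum()) * np.log1p(-gamma)
        for c in configs
    ])
    post = np.exp(log_post - log_post.max())   # subtract max for stability
    return configs, post / post.sum()

def minimal_causal_set(configs, post, rho):
    """Smallest SNP set whose compatible configurations sum to >= rho."""
    M = len(configs[0])
    for size in range(1, M + 1):
        best_set, best_prob = None, -1.0
        for K in itertools.combinations(range(M), size):
            outside = np.ones(M, dtype=bool)
            outside[list(K)] = False
            # Eq 5: sum posteriors of configurations with no causal SNP outside K.
            prob = sum(p for c, p in zip(configs, post) if not c[outside].any())
            if prob > best_prob:
                best_set, best_prob = K, prob
        if best_prob >= rho:                    # Eq 6
            return set(best_set), best_prob
    return set(range(M)), 1.0

# Toy single-study example: four uncorrelated SNPs, one strong signal.
Sigma = np.eye(4)
S = np.array([0.1, 6.0, -0.3, 0.2])
sigma2 = 5.2                                    # illustrative value

def log_lik(c):
    cov = Sigma + Sigma @ np.diag(sigma2 * c) @ Sigma
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (logdet + S @ np.linalg.solve(cov, S))  # constant dropped

configs, post = config_posteriors(log_lik, M=4, gamma=0.01, max_causal=2)
causal_set, prob = minimal_causal_set(configs, post, rho=0.95)
```

Enumerating every candidate set is only feasible for a handful of SNPs; it is used here because it maps directly onto the definitions, not because it is efficient.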

As stated previously, the following subsections explain the derivation of $S \mid C \sim N(0, \Sigma + \Sigma\Sigma_C\Sigma)$, the structure of $\Sigma_C$, and computational efficiency details. We begin by reviewing fine mapping in a single study, and then proceed to the multiple study case.

Fine mapping in a single study

We now describe a standard approach for fine mapping significant variants from a genome-wide association study (GWAS). In the GWAS, let there be $N$ individuals, all of whom have been genotyped at $M$ variants. For each individual $n$, we measure a quantitative trait $y_n$, resulting in the $N \times 1$ column vector $Y$ of phenotypic values. We denote $G$ as the $N \times M$ matrix of the genotypes where $g_{nm} \in \{0, 1, 2\}$ is the minor allele count for the $n$th individual at variant $m$. We standardize $G$ according to the population proportion $p$ of the minor allele and denote this as $X$, where $x_{ij} \in \left\{ \frac{-2p}{\sqrt{2p(1-p)}}, \frac{1-2p}{\sqrt{2p(1-p)}}, \frac{2-2p}{\sqrt{2p(1-p)}} \right\}$.
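The standardization of the genotype matrix can be written in a few lines. In this sketch the minor allele proportion is estimated from the sample when it is not supplied, which is our assumption; the text refers to the population proportion. The function name and toy genotypes are hypothetical.

```python
import numpy as np

def standardize_genotypes(G, p=None):
    """Map minor allele counts {0, 1, 2} to (g - 2p) / sqrt(2p(1 - p))."""
    G = np.asarray(G, dtype=float)
    if p is None:
        p = G.mean(axis=0) / 2.0      # per-variant sample allele proportion
    return (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

# Toy genotypes: 4 individuals (rows) by 2 variants (columns).
G = np.array([[0, 2],
              [1, 1],
              [2, 0],
              [1, 1]])
X = standardize_genotypes(G)
```

Each column of `X` then takes one of the three values listed above, depending on whether the individual carries 0, 1, or 2 copies of the minor allele.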

We assume Fisher’s polygenic model, which means Y is normally distributed and each variant xm has a linear effect on Y. We, therefore, have the following model:

$$Y = \mu 1 + \sum_{m=1}^{M} \beta_m x_m + e \tag{7}$$

where $\beta_m$ is the effect size of variant $x_m$ and $e$ is the variation in $Y$ not explained by additive genetic effects and follows the Gaussian distribution $e \sim N(0, \sigma_e^2 I)$.

We now model the observed summary statistics $S = [s_1, \ldots, s_M]$ according to

$$S \mid \Lambda_C \sim N(\Sigma\Lambda_C, \Sigma) \tag{8}$$

where $\Sigma$ represents the pairwise Pearson correlations between the genotypes. $\Lambda_C = [\lambda_{C_1}, \ldots, \lambda_{C_M}]$ represents the true standardized causal effect sizes of each SNP, where each entry $\lambda_{C_m} = 0$ if SNP m is non-causal and $\lambda_{C_m} \neq 0$ otherwise.

The distribution of $\Lambda_C$ can be defined as:

$$\Lambda_C \mid C \sim N(0, \Sigma_C) \tag{9}$$

where $C \in \{0, 1\}^M$ is an M × 1 binary vector indicating whether each variant is causal, and

$$(\Sigma_C)_{ij} = \begin{cases} 0, & \text{if } i \neq j \\ \sigma^2, & \text{if } i = j \text{ and SNP } i \text{ is causal} \\ \epsilon, & \text{if } i = j \text{ and SNP } i \text{ is not causal} \end{cases} \tag{10}$$

and where ϵ is a small constant to ensure that the matrix $\Sigma_C$ is full rank. (We later relax the need for $\Sigma_C$ to be full rank in “Handling low rank LD matrices”.) Here, and below, we use the shorthand $\sigma^2$ to represent the variance of the $\lambda_{C_m}$ (see the subsection “Extending MsCAVIAR to different sample sizes” for details on this parameter). The off-diagonals of $\Sigma_C$ are zero because the effect sizes of causal variants are independent of one another.

We use the shorthand $\Lambda = \Sigma\Lambda_C$ to refer to the non-centrality parameters (NCPs) of the statistics of all SNPs, which are induced by Linkage Disequilibrium (LD) with the causal SNPs. Thus, $S \mid \Lambda \sim N(\Lambda, \Sigma)$. Since $\Lambda = \Sigma\Lambda_C$ and LD structure is symmetric ($\Sigma = \Sigma^T$), we have the following distribution for $\Lambda \mid C$:

$$(\Lambda \mid C) \sim N(0, \Sigma\Sigma_C\Sigma) \tag{11}$$

We now define γ as the probability of a variant being causal, which makes the causal status of the mth variant a Bernoulli random variable with the following probability mass function: $f(c_m; \gamma) = \gamma^{c_m}(1-\gamma)^{1-c_m}$. We assume the causal status of each variant is independent of the other variants, leading to the following prior for our indicator vector: $P(C) = \prod_{m=1}^{M} \gamma^{C_m}(1-\gamma)^{1-C_m}$. The joint prior can then be written as follows:

$$P(\Lambda, C) = P(\Lambda \mid C)P(C) = f(\Lambda, 0, \Sigma_C) \prod_{m=1}^{M} \gamma^{C_m}(1-\gamma)^{1-C_m} \tag{12}$$

where $f(\Lambda, 0, \Sigma_C)$ is the probability density function shown in Eq 11.

We determine which variants are causal by calculating the posterior probability of each configuration $C^* \in \mathcal{C}$, where $\mathcal{C}$ is the set of all possible configurations, given the set of summary statistics:

$$P(C^* \mid S) = \frac{P(S \mid C^*)P(C^*)}{\sum_{c \in \mathcal{C}} P(S \mid c)P(c)} = \frac{\int_{\Lambda_{C^*}} P(S \mid \Lambda, C^*)\, P(\Lambda = \Sigma\Lambda_{C^*}, C^*)\, d\Lambda_{C^*}}{\sum_{c \in \mathcal{C}} \int_{\Lambda_c} P(S \mid \Lambda, c)\, P(\Lambda = \Sigma\Lambda_c, c)\, d\Lambda_c} \tag{13}$$

To calculate the posterior probability of C* given S, we need to integrate over all possible values of the non-centrality parameters of the causal variants in Λ, so that the likelihood of S accounts for every value these parameters could take.

Efficient computation of likelihood functions

The integral above is intractable in the absence of parametric assumptions about the data. Fortunately, a closed-form solution is available because, when the conjugate prior on the mean of a multivariate normal likelihood is itself multivariate normal, the predictive distribution is also multivariate normal. As shown above, $S \mid \Lambda \sim N(\Lambda, \Sigma)$ and $(\Lambda \mid C) \sim N(0, \Sigma\Sigma_C\Sigma)$. The predictive form of S is then

$$S \sim N(0, \Sigma + \Sigma\Sigma_C\Sigma) \tag{14}$$
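For a small locus, the closed form in Eq 14 can be combined with the prior and normalized as in Eq 13 by brute-force enumeration. The sketch below is illustrative only: the function names are hypothetical, the defaults `sigma2 = 5.2 ** 2` and `gamma = 0.01` are assumptions chosen for the example, ϵ is omitted from $\Sigma_C$, and the LD matrix is assumed to be full rank.

```python
from itertools import combinations
import numpy as np

def log_mvn_zero_mean(S, cov):
    """Log density of N(0, cov) evaluated at S (the MVN density with mu = 0)."""
    M = S.shape[0]
    logdet = np.linalg.slogdet(cov)[1]
    quad = S @ np.linalg.solve(cov, S)
    return -0.5 * (M * np.log(2.0 * np.pi) + logdet + quad)

def config_posteriors(S, ld, sigma2=5.2 ** 2, gamma=0.01, max_causal=2):
    """Posterior of each causal configuration (Eq 13), enumerated naively."""
    M = S.shape[0]
    configs, log_post = [], []
    for k in range(1, max_causal + 1):
        for cfg in combinations(range(M), k):
            idx = np.array(cfg, dtype=int)
            sigma_c = np.zeros((M, M))
            sigma_c[idx, idx] = sigma2                 # causal variances, no epsilon
            cov = ld + ld @ sigma_c @ ld               # Eq 14
            log_prior = k * np.log(gamma) + (M - k) * np.log(1.0 - gamma)
            configs.append(cfg)
            log_post.append(log_mvn_zero_mean(S, cov) + log_prior)
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())           # subtract max for stability
    return dict(zip(configs, post / post.sum()))
```

Each configuration here costs a full (M × M) solve and determinant, which is the expense that motivates the remainder of this subsection.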

However, computing the likelihood of S with this distribution is still computationally expensive. Consider the multivariate normal probability density function, assuming the variable Z below is MVN distributed with mean μ and covariance matrix Σ:

$$f(Z; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^M |\Sigma|}} \exp\left(-\frac{1}{2}(Z-\mu)^T \Sigma^{-1} (Z-\mu)\right) \tag{15}$$

For S, the covariance matrix is $\Sigma + \Sigma\Sigma_C\Sigma$, which has dimension (M × M), where M is the number of SNPs in each study. Taking the determinant or inverse of this covariance matrix, as required by the above likelihood function, would take $O(M^3)$ time. Here, we demonstrate how to compute this likelihood efficiently, leveraging insights from several studies that have explored this topic [9, 10, 32].

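We need to compute $S^T(\Sigma + \Sigma\Sigma_C\Sigma)^{-1}S$ and $|\Sigma + \Sigma\Sigma_C\Sigma|$ (note that our μ is 0). Since $\Sigma + \Sigma\Sigma_C\Sigma = \Sigma(I + \Sigma_C\Sigma)$, we can factor out Σ from both expressions:

$$S^T(\Sigma + \Sigma\Sigma_C\Sigma)^{-1}S = S^T(I + \Sigma_C\Sigma)^{-1}\Sigma^{-1}S \tag{16}$$
$$|\Sigma + \Sigma\Sigma_C\Sigma| = |\Sigma|\,|I + \Sigma_C\Sigma| \tag{17}$$

Notably, $\Sigma^{-1}S$ and $|\Sigma|$ can be computed once and re-used for every causal configuration $\Sigma_C$. Below, we assume Σ is of full rank; Lozano et al. [32] show how to address the low-rank case.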

We use the Woodbury matrix identity [33], below, to speed up the matrix inversion equation:

$$(A + UEV)^{-1} = A^{-1} - A^{-1}U\left(E^{-1} + VA^{-1}U\right)^{-1}VA^{-1} \tag{18}$$

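Here, we set $A = I_{M \times M}$, $E = I_{K \times K}$ where K is the number of causal SNPs per study, and $UV = \Sigma_C\Sigma$. In particular, U is the (M × K) matrix formed by the columns of $\Sigma_C$ that correspond to causal SNPs. We are taking advantage of the fact that the rows and columns of $\Sigma_C$ corresponding to non-causal SNPs are zeros and thus do not affect the matrix multiplication. Similarly, V consists of the corresponding rows of Σ, and is (K × M). Applying the Woodbury matrix identity to our case, we get:

$$(I_{M \times M} + \Sigma_C\Sigma)^{-1} = (I_{M \times M} + UV)^{-1} = I_{M \times M}^{-1} - I_{M \times M}^{-1}U\left(I_{K \times K}^{-1} + VI_{M \times M}^{-1}U\right)^{-1}VI_{M \times M}^{-1} = I_{M \times M} - U(I_{K \times K} + VU)^{-1}V \tag{19}$$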

Crucially, we are now inverting a (K × K) matrix instead of an (M × M) matrix, where $K \ll M$ since most SNPs are not causal [32]. We use Sylvester’s determinant identity [34] to speed up the determinant computation as follows:

$$|I_{M \times M} + UV| = |I_{K \times K} + VU| \tag{20}$$

Similarly, we are computing the determinant of a (K × K) matrix instead of an (M × M) matrix. Using these speedups, the computation of the likelihood function of S is reduced from $O(M^3)$ to $O(K^3)$ plus some $O(MK^2)$ matrix multiplication operations, which is tractable under the reasonable assumption that each locus has at most K = 3 causal SNPs. In the “Efficient meta-analysis” subsection below, we discuss the computational complexity and the use of these efficient matrix computations in the multiple study setting.
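The two identities can be checked numerically against the direct computation. The sketch below is an illustration rather than the MsCAVIAR code: function names are hypothetical, ϵ is omitted from $\Sigma_C$, and it uses the factorization $\Sigma + \Sigma\Sigma_C\Sigma = \Sigma(I + \Sigma_C\Sigma)$, so the inverse is $(I + \Sigma_C\Sigma)^{-1}\Sigma^{-1}$. For self-containment, $\Sigma^{-1}S$ and $\log|\Sigma|$ are recomputed on every call; in practice they would be cached across configurations.

```python
import numpy as np

def quad_and_logdet_fast(S, ld, causal, sigma2):
    """Quadratic form and log-determinant for cov = ld + ld Sigma_C ld,
    inverting only a K x K matrix (K = number of causal SNPs)."""
    causal = np.asarray(causal, dtype=int)
    M, K = ld.shape[0], causal.size
    U = np.zeros((M, K))
    U[causal, np.arange(K)] = sigma2          # nonzero columns of Sigma_C
    V = ld[causal, :]                         # matching rows of the LD matrix
    small = np.eye(K) + V @ U                 # K x K matrix shared by both identities
    y = np.linalg.solve(ld, S)                # Sigma^{-1} S, reusable across configs
    # Woodbury: (I + UV)^{-1} y = y - U (I + VU)^{-1} V y
    correction = U @ np.linalg.solve(small, V @ y)
    quad = S @ (y - correction)
    # Sylvester: |I_M + UV| = |I_K + VU|
    logdet = np.linalg.slogdet(ld)[1] + np.linalg.slogdet(small)[1]
    return quad, logdet

def quad_and_logdet_naive(S, ld, causal, sigma2):
    """Reference computation using the full M x M covariance matrix."""
    idx = np.asarray(causal, dtype=int)
    sigma_c = np.zeros(ld.shape)
    sigma_c[idx, idx] = sigma2
    cov = ld + ld @ sigma_c @ ld
    return S @ np.linalg.solve(cov, S), np.linalg.slogdet(cov)[1]
```

Both functions return the pair needed by the multivariate normal log-density, so agreement between them is a quick sanity check on the algebra.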

Fine mapping across multiple studies

As GWAS continue to grow in size, frequency, and diversity, there is an increasing need for fine mapping methods that leverage results from multiple studies of the same trait. A simple approach is to assume that there is one true non-centrality parameter for every variant; therefore $\Lambda_C$ is identical across studies. This approach is referred to as a fixed effects model. In this case, the qth study’s $\Lambda_{C_q} = \Lambda_C$.

While there is evidence that many causal SNPs are shared across populations [12, 17, 18], the assumption that the true causal non-centrality vector $\Lambda_C$ is the same across studies is unrealistic, especially when the studies are conducted in different ethnic groups [13, 17, 18].

We relax this assumption by utilizing a random effects model, in which each study q is allowed to have a different $\Lambda_{C_q}$. Under this model, a causal SNP m has an overall mean non-centrality parameter, which we denote with the scalar $\lambda_{C_m}$, from which the non-centrality parameter for SNP m in each study q, denoted by the scalar $\lambda_{C_{mq}}$, is drawn with heterogeneity (variance) $\tau^2$. According to the polygenic model, $\lambda_{C_m}$ is distributed as $\lambda_{C_m} \sim N(0, \sigma^2)$; therefore, $\lambda_{C_{mq}}$ is distributed as $\lambda_{C_{mq}} \sim N(\lambda_{C_m}, \tau^2)$. Consequently, the vector $\Lambda_{C_m}$ for this SNP across all studies will have the following distribution:

$$\Lambda_{C_m} \sim N(0, \sigma^2\mathbf{1}\mathbf{1}^T + \tau^2 I) \tag{21}$$

where Q is the number of studies, $\mathbf{1}$ is the (Q × 1) vector of 1s, so that $\mathbf{1}\mathbf{1}^T$ is the (Q × Q) matrix of 1s, and I is the (Q × Q) identity matrix. Intuitively, since the mean non-centrality parameter of SNP m was drawn with variance $\sigma^2$, this variance component is shared across studies, while the variance component $\tau^2$ is study-specific and is therefore only present along the diagonal of the covariance matrix. If a variant is not causal, its true effect size should be zero. We construct a matrix $\Lambda_C$ of size (M × Q), where M is the number of SNPs and each row is the Q-length vector $\Lambda_{C_m}$ corresponding to SNP m. In practice, we ensure that the corresponding covariance matrix is full rank by drawing the non-causal SNPs according to $\Lambda_{C_m} \sim N(0, \epsilon I)$, where ϵ is a small constant.

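From this we now build the posterior probability $P(C^* \mid S_q)$ similarly to Eq 13. Instead of $\Lambda_q = \Sigma_q\Lambda_C$ for study q, we have to account for $\Lambda_q = \Sigma_q\Lambda_{C_q}$, where $\Lambda_{C_q}$ is drawn from a multivariate normal distribution. This means we have to integrate over the domain of $\Lambda_{C_q}$ as well as $\Lambda_C$ to describe $P(C^* \mid S_q) = \frac{P(S_q \mid C^*)P(C^*)}{\sum_{c \in \mathcal{C}} P(S_q \mid c)P(c)}$:

$$P(C^* \mid S_q) = \frac{\int_{\Lambda_{C_q^*}} P(S_q \mid \Lambda_q, C^*) \int_{\Lambda_{C^*}} P(\Lambda_q = \Sigma_q\Lambda_{C_q^*} \mid \Lambda_{C^*}, C^*)\, P(\Lambda_{C^*}, C^*)\, d\Lambda_{C^*}\, d\Lambda_{C_q^*}}{\sum_{c \in \mathcal{C}} \int_{\Lambda_{c_q}} P(S_q \mid \Lambda_q, c) \int_{\Lambda_c} P(\Lambda_q = \Sigma_q\Lambda_{c_q} \mid \Lambda_c, c)\, P(\Lambda_c, c)\, d\Lambda_c\, d\Lambda_{c_q}} \tag{22}$$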

Efficient meta-analysis

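Now that we have described the distribution of each SNP in our meta-analysis, we show how to jointly analyze them. We begin by explicitly defining the structure of the covariance matrix between studies by way of a small example with three SNPs at a locus in two different studies. Since the covariance of a matrix is undefined, we denote $\mathrm{vec}(\Lambda_C)$ as the vectorized form of the original matrix $\Lambda_C$. Concretely:

$$\mathrm{vec}(\Lambda_C) = \mathrm{vec}\left(\begin{bmatrix} \lambda_{C_{11}} & \lambda_{C_{21}} \\ \lambda_{C_{12}} & \lambda_{C_{22}} \\ \lambda_{C_{13}} & \lambda_{C_{23}} \end{bmatrix}\right) = \begin{bmatrix} \lambda_{C_{11}} & \lambda_{C_{12}} & \lambda_{C_{13}} & \lambda_{C_{21}} & \lambda_{C_{22}} & \lambda_{C_{23}} \end{bmatrix}^T \tag{23}$$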

Assume SNPs 1 and 3 are causal and SNP 2 is not causal. Then the vectorized form of the non-centrality parameters given the causal statuses has the following multivariate normal distribution:

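$$\left(\mathrm{vec}(\Lambda_C) \mid C\right) \sim N\left(\mathbf{0},\ \begin{bmatrix} \tau^2+\sigma^2 & 0 & 0 & \sigma^2 & 0 & 0 \\ 0 & \epsilon & 0 & 0 & 0 & 0 \\ 0 & 0 & \tau^2+\sigma^2 & 0 & 0 & \sigma^2 \\ \sigma^2 & 0 & 0 & \tau^2+\sigma^2 & 0 & 0 \\ 0 & 0 & 0 & 0 & \epsilon & 0 \\ 0 & 0 & \sigma^2 & 0 & 0 & \tau^2+\sigma^2 \end{bmatrix}\right) \tag{24}$$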

We call the covariance matrix above $\Sigma_C$. Viewing $\Sigma_C$ as having a block structure, the blocks along the diagonal represent SNPs from the same study, while off-diagonal blocks represent SNPs from different studies. Here $\Sigma_C$ is $(3 \cdot 2 \times 3 \cdot 2) = (6 \times 6)$; in general, for M SNPs and Q studies, $\Sigma_C$ will be (MQ × MQ). In other words, there will be a (Q × Q) grid of (M × M) blocks. Within each block, the diagonal represents each SNP’s variance, while the off-diagonal represents covariation between different SNPs. As SNPs are assumed to be independent, these off-diagonal entries are always 0. There are two variance components: the global genetic variance $\sigma^2$ from which the global mean non-centrality parameter for a SNP is drawn, and the heterogeneity between studies $\tau^2$. When a SNP is causal, its variance (its covariance with itself in the same study) will contain both variance components $(\tau^2 + \sigma^2)$, while its covariance with the same SNP in a different study will be $\sigma^2$, because they were drawn from the same overall non-centrality parameter with variance $\sigma^2$ but were drawn separately with variance $\tau^2$.

The $\Sigma_C$ above, leaving aside ϵ for now, can alternately be written in the more compact form

$$\Sigma_C = \begin{bmatrix} \tau^2+\sigma^2 & \sigma^2 \\ \sigma^2 & \tau^2+\sigma^2 \end{bmatrix} \otimes \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{25}$$

where ⊗ represents the Kronecker product operator. This can be further condensed and generalized into:

$$\Sigma_C = \left(\tau^2 I_Q + \sigma^2 1_Q 1_Q^T\right) \otimes \mathrm{diag}(1_{causal})_M \tag{26}$$

where Q is the number of studies, M is the number of SNPs, $1_Q 1_Q^T$ is the (Q × Q) matrix of all 1s, $I_Q$ is the (Q × Q) identity matrix, and $\mathrm{diag}(1_{causal})_M$ is an (M × M) diagonal matrix whose diagonal entries are given by the (1 × M) indicator vector $1_{causal}$, whose m-th entry is 1 if SNP m is causal and 0 otherwise.

As with CAVIAR, the ϵ entries along the diagonal are small numbers to ensure full rank. Also note that the CAVIAR model is a specific case of this model, in which there is only one study and thus there is no $\tau^2$ component. The CAVIAR $\Sigma_C$ has the same structure as the upper left block in the $\Sigma_C$ above, when there are 3 SNPs and $\tau^2$ is set to 0.

The efficient computation properties for the single-study case also apply to the multiple-study case. In the latter setting, the matrices that need to be inverted are (MQ × MQ) instead of (M × M), where M and Q are the number of SNPs in a locus and the number of studies, respectively. Consequently, in the Woodbury matrix identity equations, U and V are (MQ × KQ) and (KQ × MQ), respectively, where $K \ll M$ is the number of causal SNPs, and the matrix given by the Woodbury identity is (KQ × KQ). Sylvester’s determinant identity gives a matrix of this size as well. The computation time is thus reduced from $O(M^3Q^3)$ to $O(K^3Q^3)$.
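The Kronecker form in Eq 26 maps directly onto `numpy.kron`. The sketch below rebuilds the three-SNP, two-study example with ϵ left out; the function name and the numeric values of $\sigma^2$ and $\tau^2$ are arbitrary choices for illustration.

```python
import numpy as np

def multi_study_sigma_c(causal, num_studies, sigma2, tau2):
    """Eq 26: (tau2 * I_Q + sigma2 * 1 1^T) kron diag(1_causal), without epsilon."""
    causal = np.asarray(causal, dtype=float)
    Q = num_studies
    study_cov = tau2 * np.eye(Q) + sigma2 * np.ones((Q, Q))
    return np.kron(study_cov, np.diag(causal))

# SNPs 1 and 3 causal, SNP 2 not causal, two studies (illustrative values).
sigma_c = multi_study_sigma_c([1, 0, 1], num_studies=2, sigma2=4.0, tau2=1.0)
```

The result is a (Q × Q) grid of (M × M) blocks: causal SNPs carry $\tau^2 + \sigma^2$ on the diagonal of within-study blocks and $\sigma^2$ on the diagonal of between-study blocks.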

Handling low rank LD matrices

The methods described above assume that the Linkage Disequilibrium (LD) matrix is full rank, in order to invert this matrix in the process of computing the Multivariate Normal (MVN) likelihood function. In practice, this is often not the case, because SNPs are sometimes in perfect LD. This can even happen when SNPs are not in perfect LD due to many highly correlated SNPs being a linear function of each other. CAVIAR [8] employs a method to add a small amount of random noise to the diagonal of the LD matrix to avoid this, but we found this adjustment to be insufficient to avoid the latter situation when LD matrices were sufficiently large, especially with blocks of high-LD.

Lozano et al. [32] developed a method for computing the MVN likelihood function when the LD matrix is low rank. MsCAVIAR implements this method and thereby avoids the aforementioned low rank issue. We briefly describe the method below on an intuitive level, but readers should refer to the work by Lozano et al. [32] for the full derivation.

Since the LD matrix Σ is positive semi-definite, it can be eigendecomposed as follows:

$$\Sigma = W\Omega W^T \tag{27}$$

where W is the matrix of eigenvectors, such that the i-th column of W is the i-th eigenvector of Σ, and Ω is a diagonal matrix of the eigenvalues of Σ, where the i-th diagonal element of Ω is the i-th eigenvalue of Σ. Lozano et al. then introduce a new set of summary statistics $S' = \Omega^{-1/2}W^TS$ which, using some algebra, is shown to have the joint distribution

$$S' = \Omega^{-1/2}W^TS \sim N(0, I + mB\Sigma_CB^T) \tag{28}$$

where I is the identity matrix, m is the number of SNPs, and $B = \Omega^{-1/2}W^T$. Since $I + mB\Sigma_CB^T$ is full rank, we can compute the likelihood function for S′, even when Σ is not full rank.

In order to evaluate the likelihood function for our original summary statistics S, we first transform the original summary statistics S to S′ via $S' = \Omega^{-1/2}W^TS$, and then apply the above procedure to evaluate the likelihood function for S′. This obviates the need for the ϵ parameter previously used to ensure full rank in the definition of $\Sigma_C$, so we now define $\Sigma_C$ in the single study setting as

$$(\Sigma_C)_{ij} = \begin{cases} 0, & \text{if } i \neq j \text{ or SNP } i \text{ is not causal} \\ \sigma^2, & \text{if } i = j \text{ and SNP } i \text{ is causal} \end{cases} \tag{29}$$
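The change of variables $S' = \Omega^{-1/2}W^TS$ can be sketched with an eigendecomposition. This covers only the transformation step, not the full likelihood of Lozano et al. [32]. Dropping eigenvalues at or below a tolerance is an assumption of this sketch (so that $\Omega^{-1/2}$ is defined when Σ is rank deficient), and the function name and `tol` parameter are hypothetical.

```python
import numpy as np

def whiten_summary_stats(S, ld, tol=1e-8):
    """Return S' = Omega^{-1/2} W^T S together with B = Omega^{-1/2} W^T.

    Assumption: eigen-directions with eigenvalue <= tol are discarded.
    """
    eigvals, W = np.linalg.eigh(ld)              # ld = W diag(eigvals) W^T
    keep = eigvals > tol
    B = (W[:, keep] / np.sqrt(eigvals[keep])).T  # scale each eigenvector column
    return B @ S, B
```

On the retained directions, $B\Sigma B^T$ is the identity, which is what makes the identity term in the transformed covariance well defined even when two SNPs are in perfect LD.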

Extending MsCAVIAR to different sample sizes

In “Fine mapping across multiple studies”, we discussed the MsCAVIAR model, in which the non-centrality parameters $\lambda_{C_{mq}}$ for SNP m in each study q are drawn around a global mean non-centrality parameter $\lambda_{C_m} \sim N(0, \sigma^2)$ with variance $\tau^2$, such that $\lambda_{C_{mq}} \sim N(\lambda_{C_m}, \tau^2)$. We note that $\lambda_{C_m}$ is itself a function of the non-standardized effect size $\beta_m$, where $\lambda_{C_m} = \frac{\beta_m\sqrt{N}}{\sigma_e}$ and $\beta_m \sim N(0, \sigma_g^2)$. Thus, $\lambda_{C_m}$ and its variance $\sigma^2$ are functions of the sample size N. Since the sample size may not be consistent across the studies, this $\lambda_{C_m}$ is an oversimplification that cannot be used when different studies have different sample sizes. Below, we show how to model the $\lambda_{C_{mq}}$ for each study while taking into account possibly different sample sizes.

We will again draw the qth study’s non-centrality parameter for variant m according to this model. Each study q has its own sample size $N_q$ and environmental component $\sigma_{e_q}$, and we draw it with heterogeneity parameter $\tau^2$ as previously defined, so

$$\lambda_{C_{mq}} \sim N\left(\frac{\beta_m\sqrt{N_q}}{\sigma_{e_q}}, \tau^2\right) \tag{30}$$

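We will now operate under the standard assumption that the trait has unit variance and the variance explained by any particular SNP is small, thus $\sigma_{e_q} \approx 1$ and Eq 30 simplifies to

$$\lambda_{C_{mq}} \sim N\left(\beta_m\sqrt{N_q}, \tau^2\right) \tag{31}$$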

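Using our previous definition for a single study, we now have

$$\Lambda_C \mid C \sim N(0, \Sigma_C) \tag{32}$$

where

$$(\Sigma_C)_{ij} = \begin{cases} 0, & \text{if } i \neq j \text{ or SNP } i \text{ is not causal} \\ \sigma^2, & \text{if } i = j \text{ and SNP } i \text{ is causal} \end{cases} \tag{33}$$

We now define $\sigma^2$ more formally to be $\sigma_g^2 N_q$ for the qth study, in the single study setting. In the multiple study setting, when we consider our matrix

$$\Sigma_C = \begin{bmatrix} \tau^2+\sigma^2 & \sigma^2 \\ \sigma^2 & \tau^2+\sigma^2 \end{bmatrix} \otimes \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{34}$$

the $\sigma^2$ along the diagonal is defined identically to the precise single study definition; however, when modeling multiple studies, this adjustment changes the covariance between the non-centrality parameters of a causal variant in two studies. We now define $\sigma^2 = \sqrt{N_{q_1}}\sqrt{N_{q_2}}\,\sigma_g^2$ for two studies $q_1$ and $q_2$ with sample sizes $N_{q_1}$ and $N_{q_2}$. Note that if two studies have the same sample size N, we recover the original definition of $\sigma^2 = \sqrt{N}\sqrt{N}\,\sigma_g^2 = N\sigma_g^2$.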
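The sample-size adjustment changes only the (Q × Q) study-level factor of the covariance. A minimal sketch, assuming the $\tau^2$ term stays on the diagonal exactly as in Eq 34; the function name and example numbers are hypothetical.

```python
import numpy as np

def study_covariance(sample_sizes, sigma_g2, tau2):
    """Q x Q study-level covariance for one causal SNP.

    Entry (q1, q2) is sigma_g2 * sqrt(N_q1 * N_q2); tau2 is added on the diagonal.
    """
    root_n = np.sqrt(np.asarray(sample_sizes, dtype=float))
    return sigma_g2 * np.outer(root_n, root_n) + tau2 * np.eye(root_n.size)
```

Taking the Kronecker product of this matrix with $\mathrm{diag}(1_{causal})_M$, for instance via `np.kron`, would then give the multi-study $\Sigma_C$; with equal sample sizes the off-diagonal entries reduce to $N\sigma_g^2$.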

Parameter setting in practice

Traditionally, the effect size $\beta \sim N(0, \sigma_g^2)$ would be derived from a notion of per-SNP heritability. Here we do not define $\sigma_g^2$ as such, but rather treat it as an abstraction: we avoid making any assumptions on how heritable the given trait is and how that heritability is partitioned between loci. In practice, we set this parameter to control statistical power. If study $q_1$ has the smallest sample size, we set this value such that $\sigma = \sqrt{\sigma_g^2 N_{q_1}} = 5.2$ for all variants. This value corresponds to the traditional genome-wide significant Z-score of 5.2, for which the two-sided Wald test p-value is $5 \times 10^{-8}$, which is considered significant by (conservatively) correcting for multiple testing [35]. Then the NCP for variant m in the corresponding study $q_1$ is $\lambda_{C_{q_1,m}} \sim N(5.2, \tau^2)$. For another study $q_2$ with larger sample size, its NCP is drawn as $\lambda_{C_{q_2,m}} \sim N\left(5.2\frac{\sqrt{N_{q_2}}}{\sqrt{N_{q_1}}}, \tau^2\right)$.

This value of $\sigma_g^2$ may not represent the actual heritability partitioning, but we set the parameter this way in our method for the practical purpose of giving MsCAVIAR power to fine map borderline significant variants in the smallest study. Similarly, we set $\tau^2 = 0.52$ by default, i.e. 10% of the value of $\sigma = \sqrt{\sigma_g^2 N_{q_1}}$, with the value chosen to give power to detect both small and large amounts of heterogeneity. We empirically observed that small misspecifications in the heterogeneity parameter do not have a substantial adverse effect (Fig G in S1 Text).
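The arithmetic behind these defaults can be written out as follows. This is an illustrative sketch with hypothetical names, not the MsCAVIAR interface: it derives $\sigma_g^2$ from the smallest study and scales the mean NCP of every other study by the square root of the sample size ratio.

```python
import numpy as np

def ncp_settings(sample_sizes, z_threshold=5.2, tau2=0.52):
    """Return (sigma_g2, per-study mean NCPs, tau2) from the defaults in the text."""
    n = np.asarray(sample_sizes, dtype=float)
    sigma_g2 = z_threshold ** 2 / n.min()           # sqrt(sigma_g2 * N_min) = z_threshold
    ncp_means = z_threshold * np.sqrt(n / n.min())  # 5.2 * sqrt(N_q / N_min)
    return sigma_g2, ncp_means, tau2
```

For example, a study with four times the sample size of the smallest study would get a mean NCP twice as large.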

Supporting information

S1 Text. Additional simulations and a table of the real data results.

(PDF)

Data Availability

MsCAVIAR is free and open source, and the source code is available on GitHub: (https://github.com/nlapier2/MsCAVIAR). Code and instructions to replicate our results are also available on GitHub: (https://github.com/nlapier2/mscaviar_replication). The UK Biobank HDL Cholesterol dataset can be downloaded from https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/30760_raw.gwas.imputed_v3.both_sexes.tsv.bgz. The Biobank Japan HDL Cholesterol dataset can be downloaded by accessing http://jenger.riken.jp/en/result and clicking the "Download" button next to "High-density-lipoprotein cholesterol (HDL-C) (autosome)". The 1000 Genomes data was downloaded by using the following script https://github.com/gkichaev/PAINTOR_V3.0/blob/master/PAINTOR_Utilities/CalcLD_1KG_VCF.py; instructions are available at https://github.com/gkichaev/PAINTOR_V3.0/wiki/2a.-Computing-1000-genomes-LD.

Funding Statement

NL would like to acknowledge the support of National Science Foundation grant DGE-1829071 and National Institutes of Health grant T32 EB016640. EE is supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676, 1065276, 1302448, 1320589 and 1331176, and National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568, P01-HL28481, R01-GM083198, R01-ES021801, R01-MH101782, and R01-ES022282. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Ikeda M, Takahashi A, Kamatani Y, Okahisa Y, Kunugi H, Mori N, et al. A genome-wide association study identifies two novel susceptibility loci and trans population polygenicity associated with bipolar disorder. Molecular psychiatry. 2018;23(3):639. doi: 10.1038/mp.2016.259
  • 2. Fritsche LG, Igl W, Bailey JNC, Grassmann F, Sengupta S, Bragg-Gresham JL, et al. A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nature genetics. 2016;48(2):134. doi: 10.1038/ng.3448
  • 3. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):197. doi: 10.1038/nature14177
  • 4. Schaid DJ, Chen W, Larson NB. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature reviews Genetics. 2018. doi: 10.1038/s41576-018-0016-z
  • 5. Faye LL, Machiela MJ, Kraft P, Bull SB, Sun L. Re-ranking sequencing variants in the post-GWAS era for accurate causal variant identification. PLoS genetics. 2013;9(8):e1003609. doi: 10.1371/journal.pgen.1003609
  • 6. Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, Su Z, et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature genetics. 2012;44(12):1294. doi: 10.1038/ng.2435
  • 7. Beecham AH, Patsopoulos NA, Xifara DK, Davis MF, Kemppinen A, Cotsapas C, et al. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nature genetics. 2013;45(11):1353. doi: 10.1038/ng.2770
  • 8. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. Genetics. 2014; p. genetics–114. doi: 10.1534/genetics.114.167908
  • 9. Chen W, Larrabee BR, Ovsyannikova IG, Kennedy RB, Haralambieva IH, Poland GA, et al. Fine mapping causal variants with an approximate Bayesian method using marginal test statistics. Genetics. 2015;200(3):719–736. doi: 10.1534/genetics.115.176107
  • 10. Benner C, Spencer CC, Havulinna AS, Salomaa V, Ripatti S, Pirinen M. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics. 2016;32(10):1493–1501. doi: 10.1093/bioinformatics/btw018
  • 11. Newcombe PJ, Conti DV, Richardson S. JAM: a scalable Bayesian framework for joint analysis of marginal SNP effects. Genetic epidemiology. 2016;40(3):188–201. doi: 10.1002/gepi.21953
  • 12. Peterson RE, Kuchenbaecker K, Walters RK, Chen CY, Popejoy AB, Periyasamy S, et al. Genome-wide association studies in ancestrally diverse populations: Opportunities, methods, pitfalls, and recommendations. Cell. 2019. doi: 10.1016/j.cell.2019.08.051
  • 13. Wojcik GL, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570(7762):514–518. doi: 10.1038/s41586-019-1310-4
  • 14. Consortium DS, Consortium DM, Mahajan A, Go MJ, Zhang W, Below JE, et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nature genetics. 2014;46(3):234. doi: 10.1038/ng.2897
  • 15. Kichaev G, Pasaniuc B. Leveraging functional-annotation data in trans-ethnic fine-mapping studies. The American Journal of Human Genetics. 2015;97(2):260–271. doi: 10.1016/j.ajhg.2015.06.007
  • 16. Mägi R, Horikoshi M, Sofer T, Mahajan A, Kitajima H, Franceschini N, et al. Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Human molecular genetics. 2017;26(18):3639–3650. doi: 10.1093/hmg/ddx280
  • 17. Lam M, Chen CY, Li Z, Martin AR, Bryois J, Ma X, et al. Comparative genetic architectures of schizophrenia in East Asian and European populations. Nature genetics. 2019;51(12):1670–1678. doi: 10.1038/s41588-019-0512-x
  • 18. Marigorta UM, Navarro A. High trans-ethnic replicability of GWAS results implies common causal variants. PLoS genetics. 2013;9(6):e1003566. doi: 10.1371/journal.pgen.1003566
  • 19. Morris AP. Transethnic meta-analysis of genomewide association studies. Genetic epidemiology. 2011;35(8):809–822. doi: 10.1002/gepi.20630
  • 20. Han B, Eskin E. Random-effects model aimed at discovering associations in meta-analysis of genome-wide association studies. The American Journal of Human Genetics. 2011;88(5):586–598. doi: 10.1016/j.ajhg.2011.04.014
  • 21. Consortium GP, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68. doi: 10.1038/nature15393
  • 22. Consortium IH, et al. The international HapMap project. Nature. 2003;426(6968):789. doi: 10.1038/nature02168
  • 23. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine. 2015;12(3):e1001779. doi: 10.1371/journal.pmed.1001779
  • 24. Neale Lab B. UK Biobank Summary Statistics; 2018. http://www.nealelab.is/uk-biobank/.
  • 25. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics. 2011;88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011
  • 26. Jiang L, Zheng Z, Qi T, Kemper KE, Wray NR, Visscher PM, et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nature genetics. 2019;51(12):1749–1755. doi: 10.1038/s41588-019-0530-8
  • 27. Sakaue S, Kanai M, Tanigawa Y, Karjalainen J, Kurki M, Koshiba S, et al. A global atlas of genetic associations of 220 deep phenotypes. medRxiv. 2020.
  • 28. Kanai M, Akiyama M, Takahashi A, Matoba N, Momozawa Y, Ikeda M, et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nature genetics. 2018;50(3):390–400. doi: 10.1038/s41588-018-0047-6
  • 29. Kichaev G, Yang WY, Lindstrom S, Hormozdiari F, Eskin E, Price AL, et al. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS genetics. 2014;10(10):e1004722. doi: 10.1371/journal.pgen.1004722
  • 30. Wang G, Sarkar AK, Carbonetto P, Stephens M. A simple new approach to variable selection in regression, with application to genetic fine-mapping. J. R. Stat. Soc. B. 2020;82:1273–1300. doi: 10.1111/rssb.12388
  • 31. Asimit JL, Rainbow DB, Fortune MD, Grinberg NF, Wicker LS, Wallace C. Stochastic search and joint fine-mapping increases accuracy and identifies previously unreported associations in immune-mediated diseases. Nature communications. 2019;10(1):1–15. doi: 10.1038/s41467-019-11271-0
  • 32. Lozano JA, Hormozdiari F, Joo JWJ, Han B, Eskin E. The Multivariate Normal Distribution Framework for Analyzing Association Studies. bioRxiv. 2017; p. 208199.
  • 33. Henderson HV, Searle SR. On deriving the inverse of a sum of matrices. Siam Review. 1981;23(1):53–60. doi: 10.1137/1023004
  • 34. Akritas AG, Akritas EK, Malaschonok GI. Various proofs of Sylvester’s (determinant) identity. Mathematics and Computers in Simulation. 1996;42(4-6):585–593. doi: 10.1016/S0378-4754(96)00035-3
  • 35. Panagiotou OA, Ioannidis JP, Project GWS. What should the genome-wide significance threshold be? Empirical replication of borderline genetic associations. International journal of epidemiology. 2012;41(1):273–286. doi: 10.1093/ije/dyr178

Decision Letter 0

David Balding, Eleftheria Zeggini

5 Oct 2020

Dear Dr Eskin,

Thank you very much for submitting your Research Article entitled 'Identifying Causal Variants by Fine Mapping Across Multiple Studies' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

You can use the link below to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Eleftheria Zeggini

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Summary: Minority population GWAS and trans-ethnic finemapping have increased in popularity due to the diversity in LD patterns across populations and how this diversity can be exploited to improve fine-mapping resolution. To this end, the authors present the statistical finemapping method “MsCAVIAR”, which is an extension of the previously developed Bayesian fine-mapping software CAVIAR. The primary advantage and novelty of the proposed approach is that it simultaneously addresses both effect heterogeneity and multiple causal variants at a given locus of interest for trans-ethnic finemapping. The authors perform a number of simulation studies as well as a real data application to demonstrate the performance of their approach against leading methods that accommodate multiple causal variants. The method performs as well or better than its primary competitor (PAINTOR) in many regards. Overall, the paper is clear and well-written. The authors carefully outline their statistical methodology, addressing a number of computational hurdles. They address a number of limitations in their assumptions and highlight the strengths/weaknesses of their method. The authors also make code publicly available on Github, which appears to be complete and well-documented. I do have some questions regarding computational efficiency, and there are some issues with respect to consistency of notation – although most of my comments are minor.

As the authors note, there are a few limitations on their assumptions regarding tau and how it is fixed a priori in application. This corresponds both across loci and across studies, where imbalance in contributing studies vis a vis ancestral population may be a concern (as published GWAS are in predominantly European populations). Regarding the latter, one advantage of MR-MEGA in trans-ethnic finemapping is how it leverages meta-regression to account for distribution of genetic ancestry across the contributing individual studies, thus decomposing the effect heterogeneity into ancestral and random components. To this end, the authors suggest selecting one study from each population for inclusion. However, this naturally reduces finemapping efficiency depending on the distribution of study sizes. Would an alternative strategy using MsCAVIAR be to initially conduct fixed-effects meta-analyses within homogeneous populations, where warranted, and then combine diverse population results for trans-ethnic finemapping?

Speaking of which – the authors initially mention MR-MEGA in the introduction but do not discuss it any further or include it in their comparisons. Given that the MR-MEGA paper demonstrates improved finemapping over PAINTOR, I assume the rationale for its exclusion from the comparative performance analyses is its limiting assumption regarding the number of underlying causal variants?

How would the authors additionally integrate functional annotation into their fine-mapping method, similar to PAINTOR?

It may be useful to quickly mention in “Parameter setting in practice” the relevance of 5.2 as a Z-score in relation to the traditional genome-wide significance criterion (i.e., the corresponding p under a two-sided Wald test would be ~5e-08). Similarly - I’m not finding any clear justification for MsCAVIAR’s default value of tau^2 = 0.52 – I assume the connection lies with just forcing a mean/variance relationship in effect heterogeneity for Z-scores at the genome-wide significance threshold?
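The link between a Z-score and its two-sided p-value can be checked in a few lines. The following is a minimal sketch using only the Python standard library, with illustrative function names. Under a standard normal reference distribution, |Z| = 5.2 gives a two-sided p-value on the order of 1e-7, while a two-sided p-value of 5e-8 corresponds to |Z| of about 5.45.

```python
import math
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def z_for_two_sided_p(p):
    """Absolute Z-score whose two-sided p-value equals p."""
    return -NormalDist().inv_cdf(p / 2.0)

p_at_5_2 = two_sided_p(5.2)            # p-value implied by Z = 5.2
z_at_gws = z_for_two_sided_p(5e-8)     # Z implied by the 5e-8 threshold
```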

Equation 10. It’s kind of confusing to use T to denote both the number of studies as well as the transpose operator. Similarly, later in the methods n is used to denote the number of studies, but also the number of subjects within a study. And then again it seems that ‘m’ is used to index study in “Extending MsCAVIAR to different sample sizes”. Some care should be used in maintaining consistent notation.

It’s not immediately clear how the computational cost scales with respect to the number of included studies. Given the single study setting is O(k^3) plus some O(mk^2) – do we replace k with (k*n) to get the computational costs, where n = # of studies? Thus, is # of studies a similarly strong limiting factor in computational burden as k, which would potentially motivate the population-wise study aggregation strategy mentioned above?

Reviewer #2: LaPierre and colleagues present a novel approach for fine-mapping loci using summary statistics from multiple studies whilst accounting for heterogeneity in the effects of causal variants between them. The methodology can allow for multiple causal variants at a locus, and can be applied in the context of trans-ethnic fine-mapping by allowing for study-specific patterns of LD between variants. The methodology tackles an important challenge in human genetics, is extremely timely, and likely to be of great interest to the readership of PLoS Genetics. I am looking forward to trying the software! The manuscript is generally well written, although some additional details of the simulation study and the applications to trans-ethnic GWAS of type 2 diabetes and cholesterol would be useful.

Comments on methodology

1. Presumably it would be straightforward to incorporate a non-uniform prior of causality? Given the availability of enrichment in associations in specific annotations, it would be really useful to allow this flexibility in the method/software.

2. In the context of trans-ethnic meta-analysis, could the authors provide some guidance as to whether each study should be included separately in the meta-analysis, or whether a fixed-effects meta-analysis of each ethnic group should be undertaken first, and then used as input to msCAVIAR? Presumably this would be computationally more demanding, but are there advantages in allowing for heterogeneity between studies from the same ancestry?

3. Heterogeneity is modelled under a random-effects model, but would it be feasible (and beneficial) to consider alternative models (such as less heterogeneity between more genetically similar studies)? Could the tau^2 parameter be considered as a hyperparameter to be estimated, and could this give some intuition as to the extent of heterogeneity?

4. It wasn’t totally clear to me whether the (maximum) number of causal variants needs to be specified in advance.

5. Some details of computational efficiency would be beneficial. In simulations and data applications, SNPs are thinned by various criteria. Is this because the methodology/software does not work well if there are SNPs in strong LD, or is computationally demanding if the number of SNPs is large?

6. One of the nice features of SuSiE is that it effectively gives a credible set for each causal variant (reflecting the fact that these each represent a distinct association signal). If we were then keen to colocalise association signals with eQTLs, this could be done for each credible set separately. However, for msCAVIAR, it does not seem that we could extract equivalent information. For example, if there were two signals at a locus (and two causal variants), the first signal might be easy to fine-map, so that we are clear of the causal variant for that signal, but for the other signal, there might be several variants with equivalent fine-mapping support. Is there any way to distinguish the fact that the variants in the credible set are somehow grouped by distinct associations, and if not, do the authors view this as a disadvantage?

Comments on the simulation study

1. Details of the simulation study are rather scant. How large are the two regions (physical distance and number of SNPs)? I can understand removing SNPs in perfect LD to select causal variants (although in practice, I guess two causal variants could be in perfect LD), but then this only leaves 48/38 SNPs, which does not seem like a “realistic” fine-mapping scenario. I also was not clear if the SNPs included in the analysis were also LD pruned, or if all SNPs in the region were considered as potentially causal. Could there ever be a situation where a causal variant was specific to just one population (i.e. monomorphic in the other) – or did causal variants have to be present in both populations (at some frequency)?

2. What is Sigma_i on line 118? It might be described later, but it is not clear what it is at this point in the manuscript.

3. The authors suggest that the improved performance of msCAVIAR over PAINTOR could be because of the modelling of heterogeneity. Could the authors investigate this further by simulating effects of causal SNPs that are homogenous across studies?

4. Line 151. The authors state that the fact that msCAVIAR performs better than CAVIAR applied to each population is an indication of the improved fine-mapping resolution offered by trans-ethnic meta-analysis. However, could this actually be a reflection of the larger total sample size used by msCAVIAR? Could the authors also run simulations where the sample sizes of the population-specific studies are the same as in a trans-ethnic study (i.e. just simulate two European studies of equal size, and compare with one European and one East Asian study of equal size)?

Comments on data applications

1. Effective sample size is a more useful way of representing the sample size for a disease phenotype – it would actually be better to give the number of diabetes cases and controls for each study.

2. It would be good to give information on the loci used in comparisons in supplementary information (for both applications), together with the numbers of credible causal variants for each locus with the different methods.

3. Centering the loci 50kb up- and down-stream of the lead SNP seems rather restrictive – we know that LD often extends over greater distances, and I think using 500kb up- and down-stream would be much more realistic. Was this done for computational reasons?

4. I didn’t follow the motivation for removing SNPs with p>0.0001. There could be examples where a causal variant does not have strong association in a single SNP analysis, and is only revealed when considering multiple causal variants (depending on patterns of LD between causal SNPs and directions of effect on the outcome). Was this done for computational reasons?

5. Figures 4 and 5. I did not find the violin plots useful in Figure 4. I understand that it is hard to summarise results when there are just five loci in Figure 4. It would be useful to have three scatter plots where the x-axis was the credible set size in msCAVIAR and the y-axis was the credible set size in one of each of the three other methods (i.e. each point is a locus) – this will provide useful information on within locus comparisons that could not be assessed with the current presentation. The box and whisker plots in Figure 5 are more useful, but I think could also be presented alongside scatter plots as described above.

6. I wasn’t totally clear about the final sentence of the last paragraph (line 259). In particular, I wasn’t clear about why it mattered that msCAVIAR models heterogeneity as being the same at each locus. As far as I am aware, PAINTOR also models heterogeneity as being the same at each locus (i.e. fixed effects, so no heterogeneity). So do these results imply that msCAVIAR does not perform well if effects are homogenous across studies? I think this emphasizes the importance of running some simulations under a model in which effects are the same in the two studies.

Comments on the discussion

1. I think it would be beneficial to expand somewhat on the comment about “equal heterogeneity” – presumably this is just the underlying assumption of a random effects model? I agree that this model is not appropriate when several studies are from one population, and one study is from another (because less heterogeneity would be expected between studies from the same population). However, I disagree with the recommendation to use a single study from each population. It would be much better to meta-analyse studies from the same population/ethnicity together, and use those as input to msCAVIAR (assuming you can use the same LD matrix for all studies from the same population).

Reviewer #3: Review of LaPierre et al

The paper presents an extension of the fine-mapping method (CAVIAR) to deal with multiple studies, allowing for heterogeneity in effects among studies. The paper uses simulation to show that this extension (MsCAVIAR) produces better localization, in that it reduces the size of the "causal set" produced compared with CAVIAR (and other methods) run on the individual studies. Results on real data show similar trends (smaller causal sets).

This is a potentially useful -- if conceptually fairly straightforward -- extension of the CAVIAR method. However, there are several important issues that would need addressing to make it suitable for publication.

Main Issues

1. While the main text is very clearly written and nicely presented, the Methods section is confusing and difficult to follow. I suggest the following:

First, there needs to be a very simple and clear statement of the model and prior distribution used. This should be separated from any "derivation" of this model (which can likely be mostly justified by appropriate citations of previous work) and also separated from the computational tricks (which also seem like straightforward extensions of previous work). At the moment the model is very hard to extract from the text.

Lambda is used in different places in different ways: at the top of p11 lambda_i = beta_i sqrt(n_i)/sigma_e, but then later in the same paragraph it is used for the mean of S, which is not the same thing. Maybe because of this there appear to be circular definitions (Lambda_C is defined in terms of Lambda at (2), and then Lambda is defined as \Sigma Lambda_C at the top of p15). [I actually don't think you need both Lambda and Lambda_C: you can just use Lambda for the true non-centrality parameters (which will be 0 for non-causal SNPs) and then directly use \Sigma Lambda for the expectation of S, so

S | Lambda \sim N(\Sigma Lambda, \Sigma)

or something like this?]

The extension to different sample sizes is described imprecisely in words (top of p23), and needs equations to make it precise. I would suggest just giving the model for different sample sizes directly, since the case where they are the same is then a special case. The model seems to be a "matrix normal" model, and making that explicit could help.

Second, the definitions of the summary data S need to be made clear. At the moment they are defined as beta-hat_i \sqrt(n_i)/sigma_e but sigma_e is unknown. And at line 444 you say "we now operate... that sigma_e has been standardized (sigma_e=1)". But there is no way to standardize to ensure sigma_e=1 because we do not know the true residual variance. It is common to standardize $y$ to have unit variance, but this does not imply the residual variance is 1. (I think maybe in the model you are assuming $y$ has been standardized to have variance 1, and then making the approximation that sigma_e \approx 1 under the assumption of low heritability? But not sure whether you are also doing this for s_i, so taking s_i = beta-hat_i \sqrt(n_i)? Or using an estimate of the residual variance? In any case these kinds of details, assumptions and approximations need to be more precise.)

2. The simulation study is rather too favorable to the method, and should be made more realistic. In particular i) the simulation is performed under the assumed summary data model, rather than under a more realistic full data model (ie the regression model (1)); ii) the simulation is performed with "effect sizes" (actually, non-centrality parameters) with a narrow range centered on 5.2, which not only seems also to be used in the prior, but also seems unrealistic - it will seldom produce either small difficult-to-detect effects or very large effects, both of which are likely to occur frequently in practice; iii) the simulation is done assuming the same LD structure in both the study and the inference, whereas the real data analysis uses a panel to approximate the study LD.

It would seem easy to generate more realistic simulated data by simulating outcomes Y from the full data model (1), using real genotype data (X) and a range of beta values (eg randomly drawn from N(0,sigma^2_g) for some sigma^2_g) so that both small and big effects occur.
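A simulation of the kind suggested here might look like the following minimal sketch: draw causal effects from N(0, sigma^2_g), generate the phenotype from the linear model with unit residual variance, and compute marginal association statistics per SNP. All names are hypothetical, the toy genotypes are independent draws that merely stand in for real genotype data, and nothing here is taken from the authors' simulation code.

```python
import math
import random

def standardize(column):
    """Center a list of values and scale it to unit variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / sd for v in column]

def simulate_z_scores(genotypes, causal, sigma2_g, rng):
    """Simulate y = X beta + e and return (beta, marginal statistics).

    genotypes: list of SNP columns, each a list of per-individual dosages.
    causal:    indices of causal SNPs; beta_j ~ N(0, sigma2_g) for these.
    """
    columns = [standardize(col) for col in genotypes]
    n = len(columns[0])
    beta = [0.0] * len(columns)
    for j in causal:
        beta[j] = rng.gauss(0.0, math.sqrt(sigma2_g))
    # Phenotype under the full data model with unit residual variance.
    y = [sum(beta[j] * columns[j][i] for j in causal) + rng.gauss(0.0, 1.0)
         for i in range(n)]
    y = standardize(y)
    stats = []
    for col in columns:
        r = sum(a * b for a, b in zip(col, y)) / n  # Pearson correlation
        # Simple-regression t-statistic, close to a Z-score for large n.
        stats.append(r * math.sqrt((n - 2) / (1.0 - r * r)))
    return beta, stats

rng = random.Random(0)
# Toy stand-in for real genotypes: 500 individuals, 5 independent SNPs.
toy_genotypes = [[sum(rng.random() < 0.3 for _ in range(2)) for _ in range(500)]
                 for _ in range(5)]
beta, z = simulate_z_scores(toy_genotypes, causal=[1, 3], sigma2_g=0.05, rng=rng)
```

Because the effects are drawn from a distribution centered on zero, both small and large effects arise across replicates, in contrast to fixing the non-centrality parameter near a single value.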

3. As argued in Wang et al, the idea of outputting a "causal set" that, with high probability, contains *all* causal SNPs is flawed. There are two reasons for this. First, it can ignore a lot of useful information. For example, suppose there are two causal SNPs, and that one is in LD with just itself, but the other is in LD with 50 others. Then the causal set will contain (at least) 52 SNPs, and does not include the information that one SNP is very precisely mapped (which a user could clearly find helpful!).

Second, if we allow that SNPs may have small effects, which is realistic, then it becomes impossible for any method to be confident to include all the causal SNPs in a set (at least, not without the set being very large). The paper sidesteps this issue by avoiding simulations with small effects, which should be rectified (see 2 above).

In light of this it seems unsatisfactory to rely on the size of the causal set as the only indicator of improved performance, and I think the paper should also provide other evidence for the superiority of the multi-study approach. One possibility would be to demonstrate the benefits in terms of PIPs (posterior inclusion probabilities). For example, does MsCAVIAR show a better precision-recall curve (equivalently, true-positive rate vs false discovery rate) as PIP threshold is modified?
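The PIP-based evaluation proposed here can be sketched as follows: for each threshold, call every SNP whose PIP meets the threshold and compare the calls with the simulated truth. The function name, the PIP values and the truth labels below are made up for illustration only.

```python
def precision_recall_by_threshold(pips, is_causal, thresholds):
    """Return (threshold, precision, recall) for each PIP threshold.

    A SNP is called when its PIP is >= the threshold. Precision is
    reported as 1.0 when nothing is called.
    """
    total_causal = sum(is_causal)
    rows = []
    for t in thresholds:
        called = [c for p, c in zip(pips, is_causal) if p >= t]
        true_pos = sum(called)
        precision = true_pos / len(called) if called else 1.0
        recall = true_pos / total_causal if total_causal else 0.0
        rows.append((t, precision, recall))
    return rows

# Hypothetical PIPs for six SNPs, two of which are truly causal.
pips = [0.95, 0.60, 0.40, 0.10, 0.05, 0.01]
truth = [True, False, True, False, False, False]
curve = precision_recall_by_threshold(pips, truth, [0.9, 0.5, 0.3])
```

Pooling the (PIP, truth) pairs over all simulated loci before calling the function would give one curve per method, which can then be compared directly.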

4. The filtering in the real data analysis seems very ad hoc, and it is not clear why it is done or whether it is necessary. Is it necessary to make the method's performance look good compared with other methods? If so, this seems worrying. If not, why not present results on much less filtered data (even if it may be more computationally intensive)?

To comment in more detail on the filters:

i) Discarding SNPs with marginal p values >0.0001 could miss signals as SNPs can become more significant once one controls for other SNPs in LD.

ii) The logic that if the peak SNP is genome-wide significant in one population and >0.0001 in another then the second population won't help with localization isn't clear: first, the potentially different patterns of LD in the two populations mean that a second population could still help with localization even without a genome-wide significant association; second, maybe there are secondary SNPs that will only show up as significant when one analyzes both populations.

iii) You say "fine mapping is not as useful when there are few strongly associated SNPs". Why? I would think these loci may give the potential to fine map quite precisely! Surely the problem cases are where there are many SNPs in strong LD, all strongly associated, which makes fine mapping difficult?

iv) "As a final step, we pruned groups of SNPs that were perfectly correlated with each other in both studies... would cause the LD matrix to be low rank". Does this mean you pruned if they were perfectly correlated in the *study* samples or in the LD panel? It could make sense to pool together SNPs that are perfectly correlated in the study, but if they are perfectly correlated in the panel but not in the study then it seems you would want to keep them both. (In that case perhaps you need the methods that allow for low rank LD matrices in the panel, eg using the methods you cite from Lozano et al.)

Other issues

- the caption to Figure 2 should include some explanation of the fact that the calibration of SuSiE can't be compared to the other methods because its sets have a different goal.

- at line 110-111, I understand you pruned SNPs in perfect LD to reduce computation and possibly to reduce low-rank issues for MsCAVIAR. However, while pruning may initially seem innocuous, it raises several concerns. For example, a group of 10 SNPs in complete LD should have approximately 10 times the probability that at least one of them is causal compared with a single SNP that is in LD only with itself. Most analysis methods would take that into account if the SNPs were simply in "very high LD", but this is hard to do if you have pooled/pruned the SNPs. And the pruning will understate the size of typical causal sets (and so overstate performance) for all methods. Also posterior quantities of interest (eg posterior inclusion probabilities) may be difficult to correct for this pooling. The bottom line: if pooling is just a way to reduce computation, can you show that results of analyzing data with pooling are similar to results without pooling?
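The "approximately 10 times" figure in this point can be reproduced with a short calculation, assuming each SNP is causal independently with the same prior probability. The per-SNP prior of 0.01 is a hypothetical value chosen only for illustration; it is not a default taken from MsCAVIAR.

```python
def prob_at_least_one_causal(group_size, prior_per_snp):
    """Prior probability that a group of SNPs holds at least one causal SNP,
    assuming each SNP is causal independently with the same prior."""
    return 1.0 - (1.0 - prior_per_snp) ** group_size

gamma = 0.01  # hypothetical per-SNP prior, for illustration only
single = prob_at_least_one_causal(1, gamma)
group = prob_at_least_one_causal(10, gamma)
ratio = group / single  # close to, but slightly below, 10
```

Collapsing the group to one representative SNP replaces the group probability with the single-SNP probability, which is the loss of prior mass this point describes.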

- lines 131-3 suggest that Fig 2 boxplots are for only a subset of the simulations (even though the Figure caption does not mention it). That seems dangerous, and I do not see why not to include all simulations in the boxplot.

- line 148-9; they are only equivalent under the *assumption* that there is 1 causal SNP, and not when methods are applied to data where the truth is only 1 causal SNP but they do not assume only 1 causal SNP.

- line 218: Do the reported set sizes include all SNPs that were pruned for being in complete LD with selected SNPs? It seems that they should in order to give an accurate impression of the effectiveness of fine mapping in practice.

- In the real data, how many causal SNPs do you estimate/identify at each locus?

- l273: this advice seems useless because how would one know? How much worse is MsCAVIAR than CAVIAR if effects are unique to one population? One might hope that it would be robust to this because of the heterogeneity in the model, and this robustness seems worth assessing in simulations.

- Please provide a stable link to the code used to perform the simulations and data analysis

Details:

- l99: "dividing by the sum of posterior probabilities of all configurations": isn't this necessarily 1?

- l101: "continue increasing the size of the causal set" how is this done?

- l107-8: do you mean 20%/80% of SNP *pairs*?

- l114: don't use effect sizes and non-centrality parameter interchangeably as they are different.

- l115: casual -> causal

- l153 effect size -> non-centrality parameter

- l190: Should "White European" be "White British"?

- In the methods section using (lower-case) sigma for a variance (eg in equation (4) and subsequently) is confusing. Use sigma^2 for a variance. Related to this, line 294 I think should be sigma_e^2, and line 448 \sigma_g should be squared.

- l338 "the integral above is intractable..." but then you say closed form is available!

- l339,341 posterior predictive -> predictive

- l352: "rows ... are zero" isn't this only true if you set epsilon=0? It would be helpful to be more consistent throughout about treatment of epsilon and what value it takes.

- in equation before l438 sqrt(n_m) should be n_m?

- line 459, say where 5.2 comes from

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: No: I don't think the specific details of the data in Figures 4 and 5 have been presented - I've made some suggestions in my comments about providing additional supplemental information to deal with this.

Reviewer #3: No: 

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Matthew Stephens

Decision Letter 1

David Balding, Eleftheria Zeggini

11 Jun 2021

Dear Dr Eskin,

Thank you very much for submitting your Research Article entitled 'Identifying Causal Variants by Fine Mapping Across Multiple Studies' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you to address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by reviewer 3.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Eleftheria Zeggini

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: The authors have largely addressed all of my concerns. I have no further comment.

Reviewer #2: The authors have made extensive revisions to the manuscript and have dealt with all comments. I look forward to trying out the software!

Reviewer #3: Review of LaPierre et al (revised)

The authors have been very responsive to the previous reviews, and the manuscript is much improved.

I have just one major remaining issue, and a few remaining minor comments I would ask the authors to address.

Main issue:

1. Although the new simulation study is much improved, it still shows results only for "in sample" LD matrices. As the authors note, when using summary data it is very common to use "out of sample" LD matrices (eg LD matrices from a similar population reference panel like 1000 Genomes), and indeed the real data analysis used that out-of-sample strategy. It is important for readers and users to understand whether this impacts accuracy, and so the simulation study should show results for both in-sample and out-of-sample LD matrices.

Other issues:

2. The MsCAVIAR model is essentially specified by the equation after l296, S|C \sim N(0, \Sigma + \Sigma \Sigma_C \Sigma). I suggest numbering that equation, since it is key. Further, please add at this point the precise descriptions of \Sigma and \Sigma_C. \Sigma is a block-diagonal matrix (not a diagonal matrix) with each block containing the LD matrix from a study. And \Sigma_C is given in the equation after l429, which you could give here. It might seem a minor point, but it took me quite a while to collect those pieces all together to convince myself the model had been fully specified. Putting them all in the same place will help the reader see the model immediately, and numbering the key equations would help future writers refer to them if necessary.

3. I believe it would be further useful to state here that this model comes from assuming S_q | \Lambda_q \sim N(\Sigma_q \Lambda_q, \Sigma_q) independently for each q, where S_q are the summary data in study q, \Sigma_q is the LD matrix in study q and \Lambda_q is the NCP in study q. (This basic assumption is stated in words in the main text, but seems worth repeating here.) This is combined with a prior on \Lambda, vec(\Lambda) \sim N(0, \Sigma_C), where vec(\Lambda) is the KM vector of concatenated NCPs from the K different studies, to give the final model.
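The pieces listed in points 2 and 3 can be collected into one compact statement. The following is a sketch assembled only from the description in this review, using the same notation (K studies indexed by q); the explicit form of \Sigma_C is left unspecified because it is defined in the manuscript rather than here.

```latex
\begin{align}
S_q \mid \Lambda_q &\sim \mathcal{N}(\Sigma_q \Lambda_q,\ \Sigma_q),
  \qquad q = 1, \dots, K \quad \text{(independently)} \\
\Sigma &= \operatorname{blockdiag}(\Sigma_1, \dots, \Sigma_K) \\
\operatorname{vec}(\Lambda) \mid C &\sim \mathcal{N}(0,\ \Sigma_C) \\
S \mid C &\sim \mathcal{N}(0,\ \Sigma + \Sigma\, \Sigma_C\, \Sigma)
\end{align}
```

The last line follows from the first three: stacking the studies gives S = \Sigma vec(\Lambda) + e with e \sim N(0, \Sigma) independent of \Lambda, so S has mean zero and covariance \Sigma \Sigma_C \Sigma^T + \Sigma, and \Sigma is symmetric.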

4. In equation (10) I would assume S_q denotes the data for study q. But then this equation seems irrelevant as you want p(C* | S) [ie posterior given all the data] and not p(C* | S_q).

********** 

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: None

********** 

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Andrew Morris

Reviewer #3: No

Decision Letter 2

David Balding, Eleftheria Zeggini

21 Jul 2021

Dear Dr Eskin,

We are pleased to inform you that your manuscript entitled "Identifying Causal Variants by Fine Mapping Across Multiple Studies" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Eleftheria Zeggini

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-01186R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

David Balding, Eleftheria Zeggini

3 Sep 2021

PGENETICS-D-20-01186R2

Identifying Causal Variants by Fine Mapping Across Multiple Studies

Dear Dr Eskin,

We are pleased to inform you that your manuscript entitled "Identifying Causal Variants by Fine Mapping Across Multiple Studies" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Andrea Szabo

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


    Supplementary Materials

    S1 Text. Additional simulations and a table of the real data results.

    (PDF)

    Attachment

    Submitted filename: mscaviar_response_to_reviewers.pdf

    Attachment

    Submitted filename: MsCaviar_Response_to_Reviewers.docx

    Data Availability Statement

    MsCAVIAR is free and open source, and the source code is available on GitHub (https://github.com/nlapier2/MsCAVIAR). Code and instructions to replicate our results are also available on GitHub (https://github.com/nlapier2/mscaviar_replication). The UK Biobank HDL Cholesterol dataset can be downloaded from https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/30760_raw.gwas.imputed_v3.both_sexes.tsv.bgz. The Biobank Japan HDL Cholesterol dataset can be downloaded by accessing http://jenger.riken.jp/en/result and clicking the "Download" button next to "High-density-lipoprotein cholesterol (HDL-C) (autosome)". The 1000 Genomes data were downloaded using the script at https://github.com/gkichaev/PAINTOR_V3.0/blob/master/PAINTOR_Utilities/CalcLD_1KG_VCF.py; instructions are available at https://github.com/gkichaev/PAINTOR_V3.0/wiki/2a.-Computing-1000-genomes-LD.


    Articles from PLoS Genetics are provided here courtesy of PLOS
