Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2016 Sep 22;33(2):248–255. doi: 10.1093/bioinformatics/btw615

Improved methods for multi-trait fine mapping of pleiotropic risk loci

Gleb Kichaev 1,†,*, Megan Roytman 1,, Ruth Johnson 2, Eleazar Eskin 1,3,4, Sara Lindström 5, Peter Kraft 6, Bogdan Pasaniuc 1,3,7
PMCID: PMC5254076  PMID: 27663501

Abstract

Motivation: Genome-wide association studies (GWAS) have identified thousands of regions in the genome that contain genetic variants that increase risk for complex traits and diseases. However, the variants uncovered in GWAS are typically not biologically causal, but rather, correlated to the true causal variant through linkage disequilibrium (LD). To discern the true causal variant(s), a variety of statistical fine-mapping methods have been proposed to prioritize variants for functional validation.

Results: In this work we introduce a new approach, fastPAINTOR, that leverages evidence across correlated traits, as well as functional annotation data, to improve fine-mapping accuracy at pleiotropic risk loci. To improve computational efficiency, we describe an new importance sampling scheme to perform model inference. First, we demonstrate in simulations that by leveraging functional annotation data, fastPAINTOR increases fine-mapping resolution relative to existing methods. Next, we show that jointly modeling pleiotropic risk regions improves fine-mapping resolution compared to standard single trait and pleiotropic fine mapping strategies. We report a reduction in the number of SNPs required for follow-up in order to capture 90% of the causal variants from 23 SNPs per locus using a single trait to 12 SNPs when fine-mapping two traits simultaneously. Finally, we analyze summary association data from a large-scale GWAS of lipids and show that these improvements are largely sustained in real data.

Availability and Implementation: The fastPAINTOR framework is implemented in the PAINTOR v3.0 package which is publicly available to the research community http://bogdan.bioinformatics.ucla.edu/software/paintor

Contact: gkichaev@ucla.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Genome-wide association studies (GWAS) have identified thousands of regions in the genome containing risk variants for complex traits and diseases (Global Lipids Genetics Consortium et al., 2013; Locke et al., 2015; Okada et al., 2014; Type et al., 2014; Wood et al., 2014). However, the vast majority of the GWAS reported variants are not biologically causal, but rather, correlated to the true causal variants through linkage disequilibrium (LD) (Hormozdiari et al., 2014; Kichaev et al., 2014; Visscher et al., 2012). Fine mapping studies gather detailed genetic information within the loci that have been implicated in GWAS (Kote-Jarai et al., 2013; Meyer et al., 2013; Wu et al., 2013) and statistically dissect these regions to prioritize variants according to probability of causality. The top variants resulting from this procedure may become candidates for functional validation (Claussnitzer et al., 2015; Musunuru et al., 2010).

Many statistical methods for fine-mapping have been developed for the prioritization of causal variants. Standard approaches range from a simple ranking of SNPs based on their p-values to more sophisticated LD-aware ranking algorithms that quantify probabilities for variants to be causal (Benner et al., 2015; Chen et al., 2015; Hormozdiari et al., 2014; Kichaev et al., 2014). Initial probabilistic methods have assumed a simple model in which only one variant per locus is biologically causal (Maller et al., 2012), with more recent methods extending the statistical frameworks to accommodate multiple casual variants at risk regions (Chen et al., 2015; Hormozdiari et al., 2014; Kichaev et al., 2014; Kichaev and Pasaniuc, 2015). Although modeling multiple causal variants drastically increases performance, particularly at loci with evidence of multiple signals of association, it also presents a combinatorially challenging problem in performing inference in the model. That is, the likelihood formulation contains a model space size exponential in the number of variants at a locus, which clearly cannot be enumerated over for even a modestly sized locus. To account for this combinatorial explosion, initial methods approximated the full likelihood by restricting the maximum number of causal variants allowed at a risk locus to a small number (Chen et al., 2015; Hormozdiari et al., 2014; Kichaev et al., 2014; Kichaev and Pasaniuc, 2015) More recent works (Benner et al., 2015) further improved computational efficiency by sampling likely causal models using stochastic search, leveraging the intuition that most of the terms in the likelihood computation have near negligible contribution. The authors demonstrated that this achieves drastic reduction in runtime with comparable fine-mapping accuracy relative to enumerative methods (Benner et al., 2015). However, this was done in the context of a single fine-mapping locus and did not integrate multiple sources of information.

Many GWAS loci are known to be implicated in multiple related traits—a phenomenon that is observed in many phenotypic classes. For example, breast cancer and mammographic density (Lindström et al., 2014), high density lipoprotein (HDL) and low density lipoprotein (LDL) (Global Lipids Genetics Consortium et al., 2013), or rheumatoid arthritis and irritable bowel disease (Liu et al., 2015; Okada et al., 2014) are all pairs of traits that share overlapping GWAS signals. Combining association signals at these pleiotropic regions may strengthen the signal from the causal variants that are impacting both traits. A standard approach used when combining association information across multiple studies is fixed-effects meta-analysis, which assumes that causal variants across studies share the same effect sizes. The random-effects model does allow for effect size heterogeneity, but it is poorly suited for situations in which the variant has opposite effect sizes in the various phenotypes (Solovieff et al., 2013). For this reason, multivariate analyses that jointly analyze association data from multiple phenotypes and account for effect size heterogeneity are beneficial—particularly for related traits that have opposing phenotypic consequences such as HDL and LDL (Global Lipids Genetics Consortium et al., 2013).

Considerable effort has been put forth into characterizing the chromatin landscape across the entire spectrum of human tissues (ENCODE Project Consortium et al., 2012; Kundaje et al., 2015; Zhou et al., 2011). Most recently, the Roadmap Epigenomics consortium interrogated 111 cell types, charting histone modifications, DNA accessibility, DNA methylation and gene expression, to produce genome-wide maps of functional elements (Kundaje et al., 2015). Previous works have demonstrated that principled integration of such data can aid fine-mapping performance in the context of single and multi-population fine-mapping studies (Kichaev et al., 2014; Kichaev and Pasaniuc, 2015). Since related traits have been shown to share an underlying genetic basis (Bulik-Sullivan et al., 2015) that localizes within similar functional classes (Finucane et al., 2015), it is plausible that functional annotation data can also augment cross-trait fine-mapping.

In this work we propose a unified framework to perform fast, integrative fine-mapping across multiple traits. We integrate the strength of association across multiple traits with functional annotation data to improve performance in the prioritization of causal variants. Our approach makes the assumption that the same variants at the risk loci impact both traits though with potentially distinct effect sizes. A key advantage of our approach is that it requires only summary association data for each trait, thus avoiding the restrictions that arise from the sharing of individual-level data. To balance computational efficiency and accuracy we propose an Importance Sampling technique that provides guarantees for convergence, while relaxing the assumption of the maximum number of causal variants allowed at each risk locus.

Through simulations we show that our integrative method delivers well-calibrated probabilities for SNPs to be causal and improves fine-mapping performance relative to current state-of-the-art strategies. To our knowledge, the only existing method that performs joint mapping for pleiotropy while incorporating functional annotation data is GPA (Chung et al., 2014). We show that our approach provides superior accuracy to GPA, likely due to the explicit modeling of LD in our framework. We illustrate the benefit of our proposed methodologies by fine-mapping pleiotropic regions of lipid traits in a GWAS of over 180K individuals (Global Lipids Genetics Consortium et al., 2013).

2 Methods

2.1 Overview

Here, we introduce statistical methods for fine-mapping of pleiotropic loci with functional annotation data (see Fig. 1). We build upon previous works (Hormozdiari et al., 2014; Kichaev et al., 2014; Kichaev and Pasaniuc, 2015) that make use of a Multivariate Normal (MVN) distribution to jointly model association statistics at all SNPs at the locus. This not only allows for the possibility of multiple causal variants at any risk locus, but also avoids the need to access individual level genotype data as LD can be approximated using the appropriate population-matched reference panel (1000 Genomes Project Consortium et al., 2012). We integrate relevant functional annotation data through a prior probability for SNPs to be causal. We introduce an Importance Sampling procedure to improve computational efficiency over methods that enumerate all possible models of causal configurations. The primary output of our approach are posterior probabilities for SNPs to be casual in both traits which can subsequently be used to prioritize SNPs individually (Kichaev et al., 2014) or used to compute fine-mapping credible sets (Maller et al., 2012).

Fig. 1.

Fig. 1

Example of input and output of fastPAINTOR at locus chr4:35Mb for LDL and TG. As input, fastPAINTOR receives an LD matrix, functional annotations and multiple sets of Z-scores at the given locus. fastPAINTOR performs inference and outputs posterior probabilities for each SNP that quantifies the likelihood that the SNP is causal for both traits (Color version of this figure is available at Bioinformatics online.)

2.2 A statistical framework for fine-mapping

The standard approach to connect genotype to phenotype is through a linear model. For individual i, let yi be the trait value and gi be their vector of genotypes spanning m SNPs. The trait can be modeled as yi=giT β+ϵi, where ϵiN(0,σe2) is random environmental noise. The vector, β, represents the allelic effects whose entries will be non-zero only at the causal SNPs. Given N individuals with measured genotypes and trait values, the effect size at SNP j is typically estimated using standard linear regression as βj^=(gjTgj)1gjTY. The strength of association is then quantified using the Wald statistic (Buse, 1982):

Zj=β^jSE(β^j) (1)

which asymptotically follows a normal distribution ZjN(λj,1) with mean

λj=βjVar(gj)σeN. (2)

Here, λj, is referred to as the Non-Centrality Parameter (NCP) and dictates of power of finding a significant association and, by extension, the power to distinguish causal from non-causal SNPs (i.e. βj0 versus βj=0). When the jth SNP is causal, the effect sizes are non-zero and therefore the association statistic (Z-score) corresponding to that SNP will be drawn from a non-central Normal distribution. However, LD (i.e. correlations between SNPs at each locus) will induce non-zero NCPs at non-causals variants through tagging. Therefore, neighboring non-causal SNPs will appear to be significantly associated to a trait indirectly through LD. Previous works (Hormozdiari et al., 2014; Kichaev et al., 2014; Kichaev and Pasaniuc, 2015) have shown that the NCPs at any SNP can be approximated from the NCPs at the causal SNPs:

Λj=crj,cλc (3)

where rj,c denotes the Pearson correlation between SNP j and causal SNP c. If we collect all the pairwise correlations into a matrix, Σ, and let λC be the vector of standardized effects sizes at the causal SNPs given by the indicator vector C, the entire set of regional summary statistics, Z, can be approximated by a Multivariate Normal distribution (MVN)) (Hormozdiari et al., 2014; Kichaev et al., 2014):

Z|λC,ΣN(ΣλC,Σ) (4)

However, the causal effect sizes (λC) are typically unknown a priori and must be either approximated (Kichaev et al., 2014; Kichaev and Pasaniuc, 2015) or integrated out (Hormozdiari et al., 2014). Leveraging the standard infinitesimal model (Yang et al., 2011), Hormorzdiari et al. (2014) proposed to use a normal prior on the causal NCPs which, due to conjugacy, can be conveniently integrated analytically as follows:

λC|C,σ2N(0,ΣC)) (5)
ΣC=σ2Diag(C)+Diag(ϵ) (6)
Z|Σ,C(N(ΣλC,Σ)N(0,ΣC)dλC)P(C) (7)
=N(0,Σ+ΣΣCΣ)P(C) (8)

Here the prior probability of the causal set vector (P(C)) can be set to be uniform (Maller et al., 2012), hypergeometric (Hormozdiari et al., 2014), or can be estimated empirically using more sophisticated approaches that incorporate functional genomic data(Kichaev et al., 2014; Kichaev and Pasaniuc, 2015; see Section 2.4). Chen et al. (2015) made the observation that the marginal likelihood in (Eq. 8) is approximately proportional to a Bayes Factor comparing a causal and null model which depends on the Z-scores and LD only at the causal SNPs. This effectively reduces the computational burden from cubic in the number of SNPs to cubic in the number of causal variants considered at each likelihood evaluation. This not only improves efficiency, but also improves numerical stability since a much smaller matrix is inverted thus alleviating the need for stringent regularizations. In this work, we follow the Chen et al. implementation of the likelihood computations (Benner et al., 2015; Chen et al., 2015).

2.3 Fine-mapping pleiotropic loci

Next, we extend the framework to exploit pleiotropy across related traits. Given multiple phenotypic measurements across T traits, one can compute Z-scores for each trait independently. If a locus harbors a significant association for multiple traits, a reasonable assumption would be that the underlying causal variants driving this association are shared. It follows that the vectors of association statistics are conditionally independent given the causal variants (C), thus the joint distribution for all T sets of Z-scores decomposes into product:

P(Z1ZT|Σ,C)=t=1TP(Zt|Σt,C,σt2) (9)

To simplify notation we hereafter refer to the collection of Z-scores at a fine-mapping locus as Z*={Z1ZT}. We assume that all trait measurements have been performed in a single population and therefore assume that Σt,=Σ for all t. Importantly, we note that our formulation makes no assumptions on the coupling between effect sizes at causal SNPs across traits which allows for arbitrary levels of heterogeneity. Accommodating this effect size heterogeneity could be important for related traits that have opposing phenotypic consequences.

Under the assumption that causal variants are shared across pleiotropic loci, the marginal likelihood of the data can be written as a summation across all possible causal sets, C:

L(Z*|Σ,σ2)=CCt=1TP(Zt|Σ,C,σt2)P(C) (10)

We can now use this to obtain the posterior probability of any causal set with a straightforward application of Bayes’ rule:

P(C|Z*,Σ)=t=1TP(Zt|Σ,C,σt2)P(C)L(Z*|Σ,σ2) (11)

which can be marginalized to yield per-SNP posterior probabilities:

P(Cj=1|Z*,Σ,γ)=C:Cj=1P(C|Z*,Σ) (12)

2.4 Incorporating functional genomic data

To integrate functional annotation data within this framework, we use a logistic function to connect a SNP’s functional genomic context to its causal status as follows:

P(Cj=1|γ,A)=exp(γAj)1+exp(γAj) (13)
P(C|γ,A)=j=1mP(Cj|γ,A)Cj(1P(Cj|γ,A))1Cj (14)

The vector Aj is the set of annotations corresponding to the jth SNP and γk is the prior-log odds that a SNP in annotation k is causal. We note that γ can be estimated directly from the data through an Empirical Bayes approach first described in Kichaev et al. (2014). However, this restricts functional enrichment estimation to only the fine-mapping loci under investigation. Alternatively, one could exploit potentially more powerful, genome-wide approaches such as stratified LD-score regression (Finucane et al., 2015) that can infer global functional genomic enrichments using only summary data. Our framework is amenable to both approaches, and we allow the user to estimate γ from all the fine-mapping loci jointly using the EM algorithm proposed in Kichaev and Pasaniuc (2015) or supply it from external analyses.

2.5 Model inference via Importance Sampling

The marginal likelihood in (Eq. 10) requires enumeration of O(2m) possible causal sets (C). This rapidly becomes intractable as the number of SNPs grows large, and strategies for dealing with this computational bottleneck need to be considered. Earlier frameworks (Chen et al., 2015; Kichaev et al., 2014; Kichaev and Pasaniuc, 2015) avoided this problem by simply restricting the total number of potential casual variants to a small number (km), thus reducing the computational burden to O(mk). However, even in this reduced model space, enumerating over all possible combinations is inefficient as most causal configurations will contribute minimally to the overall likelihood of the data. Recent works have shown that sampling can circumvent brute-force enumeration by efficiently exploring likely causal configurations through stochastic search (Benner et al., 2015)—though this still requires pre-specifying a subjective prior that explicitly upper-bounds the maximum number of causal variants considered at the locus.

In this work, we make use of Importance Sampling, a variance reduction technique commonly used in Monte Carlo integration (Glynn and Iglehart, 1989), to provide an efficient approximation of the marginal likelihood (Eq. 10). Unlike other recently proposed sampling techniques, Importance Sampling comes with asymptotic convergence guarantees and allows us to drop the hard cutoff on the maximum number of potential causal variants considered. The summation given in (Eq. 10) could naively be approximated by sampling directly from the prior and computing a simple Monte Carlo average:

CjBern(P(Cj|γ,A)) (15)
L(Z*|Σ,σ2)1Ss=1St=1TP(Zt|Σ,C(s),σt2) (16)

However, this is inefficient as highly probable causal sets in the posterior may not necessarily be reflected in the prior. To better guide the sampling of highly probable causal sets, we build off the intuition that SNPs with stronger associations (i.e. large Z-scores) are more likely to be casual than ones with weak associations. We can thus construct a discrete proposal distribution, G, to take this into account by simulating causal sets (C(s)) at iteration s as independent Bernoulli draws with probabilities given by:

G(Cj|Z*)Bern(t(Ztj)2itT(Zti)2) (17)
G(C(s)|Z*)=j=1mG(Cj|Z*)Cj(1G(Cj|Z*))1Cj (18)

Accumulating evidence across multiple traits by summing the chi-square statistics (i.e (Zj)2), and normalizing by the total sum across all SNPs and traits, creates a probability distribution with the desirable property that it will favor selecting SNPs that have strong evidence of association in multiple traits. By operating in the space of the chi-square statistics (as opposed to Z-scores), we have additional flexibility that allows for strongly associated SNPs to have opposing directional effects in different traits. We can then compute importance weights and re-adjust the bias introduced by sampling from G as follows:

L(Z*|Σ,σ2)s=1St=1TP(Zt|Σ,C(s),σt2)W(C(s))s=1SW(C(s)) (19)
W(C(s))=P(C(s)|γ,A)G(C(s)|Z*) (20)

Which we can then use to approximate the per-SNP probabilities using the same S samples:

P(Cj=1)s=1S1(Cj(s)=1)t=1TP(Zt|Σ,C(s),σt2)W(C(s))s=1St=1TP(Zt|Σ,C(s),σt2)W(C(s)) (21)

2.6 Simulation setup

To mimic real genotype data, we used HAPGEN2 (Su et al., 2011) and the 1000 Genomes (1000 Genomes Project Consortium et al., 2012) European samples, to simulate 20 000 haplotypes for a number of randomly selected 25KB loci from chromosome 1. We filtered rare SNPs (MAP < 0.01) and normalized genotypes to be mean-centered with unit variance. We overlapped our simulated regions with DNase Hypersensitivity (DHS) sites spanning 217 cell types and tissues (Gusev et al., 2014). Using these annotations, we drew causal status for each SNP according to the logistic model described previously, setting the DHS enrichment to 5.1 to reflect what was reported in (Gusev et al., 2014). Each locus harbored one causal variant in expectation, though the random assignment of causal status could yield zero or multiple casual variants for a given locus. In experiments that were done over 50 loci simultaneously, this typically resulted in an average of 18 loci with a single causal variant and 14 loci with multiple causals. Once we established the causal SNPs, we simulated phenotypes under a linear model such that for individual i, their phenotype value Yi was given by Yi=j=1Ncβj·gij+ϵi, where Nc is the number of causal variants, βj is the effect size of the jth causal SNP, and gij is number of copies of the risk allele j for individual i. We drew ϵi for each individual from a normal distribution N(0,σe2), where σe2 was given by the formula hg2=σg2σg2+σe2, setting σg2 to the empirically observed genetic component.

We computed Z-scores for all the SNPs within causal loci by regressing the phenotype vector Y on each genotype vector Gj and then taking the Wald statistic. To simulate correlated traits, the effect sizes (β1c,β2c) at the shared causal variants were drawn from an MVN distribution:

[β1cβ2c]N(0,[hg2/Ncρhg2/Ncρhg2/Nchg2/Nc]) (22)

where ρ represents the desired genetic correlation. We chose a ρ of 0.4, consistent with typical correlations for lipids data reported in (Bulik-Sullivan et al., 2015).

For computational efficiency, we also performed simulations in which the vectors of association statistics where drawn directly from an MVN distribution (Eq. 4). In this scenario the NCP (λC) was set to 5 at all causal SNPs.

2.7 Existing methods

We compared our approach to several existing fine-mapping methods. For single-trait fine-mapping, we compared to FINEMAP and CAVIARBF (Benner et al., 2015; Chen et al., 2015), two methods based on the CAVIAR (Hormozdiari et al., 2014) model that do not incorporate functional annotation data. We ran CAVIARBF v1.4 using the default settings, setting prior variance explained to be 0.05 and the maximum number of causal variants in the model to 3. After CAVIARBF computed Bayes factors for each SNP, we ran their model search algorithm, which outputs posterior probabilities based on Bayes factors. In this step, we set the prior probability of each SNP being causal to 1/m, where m is the number of variants in the locus. We ran the FINEMAP v1.1 software using default settings, allowing for 3 causal SNPs per locus with prior probabilities of (0.6, 0.3, 0.1) for 1, 2 and 3 causals respectively.

For multi-trait fine-mapping, we compared to GPA (Chung et al., 2014). To our knowledge, GPA is the only other method that performs multi-trait fine-mapping while leveraging functional annotation data. As GPA requires P-values as input, we converted Z-scores from our simulations to P-values for each SNP. We provided GPA with the same DHS annotation data as we did for our approach. On multi-trait analyses, GPA outputs 4 posterior probabilities for each variant, indicating the probability that the SNP is causal for neither trait, Trait 1, Trait 2, or both traits. When evaluating accuracy, we considered the SNP to be deemed causal by GPA if it was implicated in both traits. In addition, we explored traditional meta-analysis techniques to combine information across traits by computing inverse variance fixed effects association statistics (Evangelou and Ioannidis, 2013). We then used these Z-scores in fine-mapping under the assumption of a single causal variant (Maller et al., 2012) as well as within our framework as a single trait.

2.8 Empirical lipids data

We downloaded GWAS summary data across four blood lipids phenotypes: High Density Lipoprotein, Low Density Lipoprotein and Triglycerides (Global Lipids Genetics Consortium et al., 2013). For each of the traits, we used Imp-G summary (Pasaniuc et al., 2014) to impute Z-scores up to the latest version (V3) of the 1000 Genomes European reference panel (1000 Genomes Project Consortium et al., 2012) yielding approximately 7.6 million SNPs per trait in total. We then compiled a list of 24 pleiotropic regions which we defined as a GWAS hit that was observed in least two traits of the three traits. For each of these regions, we centered a 250KB window around the lead SNP and overlapped these regions with two functional marks derived from the Roadmap Project: Liver H3K4me1 and Liver H3K27ac (Kundaje et al., 2015).

3 Results

3.1 Fast and reliable performance in single trait fine-mapping

We first sought to empirically assess how our sampling-based approach compared to fine-mapping methods CAVIARBF and FINEMAP. These previous approaches can model multiple causal variants, but were not designed to exploit pleiotropy. As such, in order to make the comparisons fair, we conducted our initial investigation in the context of a single trait. Furthermore, because these methods, as well as our proposed approach, are faster generalizations of the underlying CAVIAR model, we chose not to compare to CAVIAR nor PAINTOR, both of which would predictably have slower computational performance but similar accuracies.

We first assessed performance on the basis of CPU runtime. The number of samples that are drawn to approximate the posterior distribution is invariably connected to the resulting runtime for our method, fastPAINTOR. Therefore, we determined the number of samples required to yield approximately unbiased credible sets and find that one million samples was typically sufficient across a wide-range of locus sizes (Supplementary Fig. S1). We then compared to existing approaches and, not surprisingly, discover that methods that approximate the posterior model space through sampling vastly outperform methods that enumerate over all possible combinations (Supplementary Fig. S2). For example, both fastPAINTOR and FINEMAP scale favorably with the size of the locus, with average run times of (11.5 s, 10.8 s) per 25KB locus and (186 s, 31 s) per 250KB locus. The added computational overhead of fastPAINTOR is due to the fact that functional enrichments must be iteratively estimated using an EM-algorithm. If these estimates are supplied from external analyses, running fastPAINTOR* takes an average of 75 s per 250KB locus to produce probabilities.

We next evaluated the accuracy of these methods in resolving causal variants to ensure that our sampling approximation did not deflate performance. We simulated 100KB regions with various levels of DHS enrichment to reflect a wide diversity of potential functional genetic architectures. In general, we see that leveraging functional annotation data improves fine-mapping resolution relative to non-integrative approaches (Fig. 2)—particularly as causal variants localize within smaller fractions of the genome (i.e. increasing enrichment). For example, the average rank of the causal SNPs was around 21.9 and 21.4 for CAVIARBF and FINEMAP across all functional genetics architectures. On the other hand, when causal variants are diffusely enriched within DHS, their average rank based on fastPAINTOR probabilities is 21.4 while strong functional enrichment yields an average rank of 15.0. Taken together, these results suggest that sampling-based, integrative methods are both scalable and achieve greater accuracy than current state-of-the-art methodologies.

Fig. 2.

Fig. 2

fastPAINTOR effectively leverages functional annotation data. We simulated fifty 100KB loci under various functional genetic architectures by drawing summary statistics directly from an MVN distribution. We applied all three methods using default settings and report the average ranks of the causal variants across all simulated loci

3.2 Multi-trait fine-mapping

Having established that our new computationally efficient approach compared favorably in standard fine-mapping scenarios, we next sought to investigate how leveraging information across related traits as well as functional annotation data affected fine-mapping performance. We simulated two genetically correlated traits with 10K individuals where the causal variants are shared between the traits but have heterogeneous effects sizes (see Methods section). To control for the effect of sample size, we also simulated a single trait with 20K individuals. We find that by borrowing information across related traits, we are able to improve fine-mapping performance with greater efficiency than just simply increasing sample size for any single trait (see Fig. 3). In our multi-trait analysis with fastPAINTOR, we required (1.4, 12.4) SNPs per locus for follow-up in order to capture (50%, 90%) of the true causal variants, as compared with (1.9, 23.1) SNPs in a single-trait analysis. Intuitively, this is due to the fact that power to detect causal variants grows with the square root of the sample size, while growing linear with the allelic effects (see Eq. 2). Therefore leveraging multiple genetically correlated traits (i.e. traits that share casual effects) will, on average across multiple loci, be more beneficial than simply increasing the sample size for one of the traits.

Fig. 3.

Fig. 3

Integrative methods improve fine-mapping resolution in multiple traits. We simulated fifty 25KB loci for two traits with shared causal variants at each locus. We measure accuracy as the proportion of causal variants identified as we increase the size of our candidate SNP set

We next explored principled strategies for assembling data spanning multiple traits. Our main comparator was GPA—a method specifically proposed to use pleiotropy and functional data to prioritize variants. In addition, we ran two meta-analysis approaches using fixed effects association statistics—a standard meta-analysis that assumes a single causal variant (Maller et al., 2012), as well as running fastPAINTOR using these fixed effects association statistics as a single trait, which allows for multiple causal variants. In general, our approach is more accurate and robust than previously proposed methods, requiring (1.4, 12.4) SNPs per locus for follow-up in order to identify (50%, 90%) of the causal variants compared to (2.3, 25.1) for fastPAINTOR with FE or (11.6, 32.3) for GPA (Fig. 3). Furthermore, our approach outperforms competing strategies at loci that contain both a single and multiple casual variants (see Supplementary Fig. S3) as well as varying effect size distributions (see Supplementary Fig. S4). One of the critical model assumptions of GPA is that SNPs are independent. Clearly, in the context of fine-mapping, this assumption is strongly violated which explains the sub-optimal performance. Alternatively, FE can be viewed as simply a weighted-average of the effect sizes. In the extreme, though not implausible, scenario where causal effects are going in opposite directions, FE will provide weak evidence that a SNP is causal. We validated this hypothesis in simulations by examining scenarios where the FE assumption was met and when it was strongly violated. In simulations where the effect size of the causal SNPs were identical in both traits (ρ = 1) our the multi-trait framework gives identical performance to fastPAINTOR with FE. However, when effect sizes are anti-correlated, fastPAINTOR with FE statistics leads to severely deflated fine-mapping accuracy (see Supplementary Fig. S5), highlighting the benefit of the multi-trait joint-MVN formulation.

Finally, we developed our framework with the assumption that causal variants are shared across traits. This may not always hold in practice and we wanted to understand how our method responds to violations of this assumption. We performed simulations in which causal variants for the two traits were drawn independently leading to potentially distinct causal SNPs and uncorrelated effect sizes. We find that our joint fine-mapping method is robust to pleiotropic loci with differing causals, yielding relatively small mis-calibration of the credible sets on the order of 10% (see Table 1). We predict that, in cases where the effect sizes among distinct causal variants are correlated, the disparity between the shared causal and distinct causal cases would be even less. We can thus conclude that our proposed framework that jointly models sets of association statistics, explicitly accounts for local correlation structure, and integrates functional data prioritizes variants robustly and accurately.

Table 1.

The performance of fastPAINTOR is largely sustained when the assumption of shared causal variants across traits is violated

Method Proportion of causals identified SNPs selected (s.e.)
Trait 1 0.96 46.01 (0.27)
Trait 2 0.96 45.54 (0.27)
Differing causals 0.86 28.42 (0.22)
Same causals 0.97 26.00 (0.17)

As compared with fine-mapping single traits independently, the reduction in the 95% credible set size is sustained while still capturing a large proportion of the causal variants. We define an 95% confidence set as the number of SNPs we need to select in order to accumulate 95% of the total posterior probability mass per locus.

3.3 Multi-trait fine-mapping in lipids data

In order to demonstrate that the gains in our multi-trait fine-mapping approach are realized in real data, we analyzed summary association data from a large-scale GWAS of lipids (Global Lipids Genetics Consortium et al., 2013). High Density Lipoprotein (HDL), Low Density Lipoprotein (LDL) and Total Triglycerides (TG) are prototypical pleiotropic traits, sharing 24 GWAS hits for at least two. To showcase our pleiotropic fine-mapping framework, we obtained GWAS data over these traits spanning 180K individuals (Global Lipids Genetics Consortium et al., 2013) and did integrative fine-mapping across putative pleiotropic regions. Functional annotation selection was guided by the genome-wide heritability-based functional enrichments reported in Finucane et al. (2015). The authors analyzed HDL, LDL and TG and found that the H3K4me1 mark in liver tissue had the strongest enrichment of heritability across all three traits. Their result provides strong support for the key assumption that causal variants are shared across traits in our model. In addition to liver H3K4me1, we also used the liver H3K27ac mark, which displayed strong enrichment for multiple traits. In addition to a joint analysis, we applied our framework with and without functional data as well as on each trait independently. To quantify fine-mapping resolution we use 99% credible sets (Maller et al., 2012; Kichaev et al., 2014) which are defined as the set of variants that aggregate to capture 99% of the posterior probability mass. Consistent with simulations, pleiotropic fine-mapping provided a reduction in the size of the credible set as compared with investigating individual traits alone (see Table 2). Additional functional data helps refine the signal, though only marginally, since exceedingly strong associations at these regions dominate the prior evidence. Moreover, we show that the 99% credible sets obtained from the cross-trait analysis contained 13 novel SNPs not found in any of the single-trait analyses alone (See Supplementary Fig. S6). This suggests that, for some loci, leveraging association strength across related traits may increase our power to detect more weakly associated causal variants in the individual traits. In conclusion, these encouraging results illustrate that carefully merging related traits can improve the resolution of statistical fine-mapping.

Table 2.

Pleiotropic fine-mapping is superior to single locus fine-mapping

95% Credible Set
99% Credible Set
Annotations + +
HDL 4.6 4.6 4.8 5.1
LDL 5.9 5.9 14.3 11.4
TG 4.2 4.2 5.4 5.4
Multi-trait 3.7 3.7 4.7 4.7

Presented here are the mean number of SNPs that are in the 95 and 99% fine-mapping credible sets.

4 Discussion

In this work, we introduced a fast fine-mapping method that integrates several sources of genetic data to efficiently and accurately prioritize causal variants. Our Importance Sampling strategy dramatically reduces runtime due to its ability to efficiently sample high probability causal configurations, demonstrating that enumerating over complex model spaces is not necessary for integrative fine-mapping. We generalized this approach to leverage multiple traits simultaneously and demonstrated, both in simulations and real data, that this strategy can improve the ability to detect causal variants impacting both traits. As GWAS data accumulate and evidence for the abundance of pleiotropic risk loci mounts, there is a need for fine-mapping methods that can perform large-scale integrative analyses. Moreover, efforts by large consortia such as ENCODE will continue to provide genomic annotation data that will improve the accuracy of fine-mapping studies. A key advantage to our method is that it requires only summary association data, overcoming the issues that arise when sharing individual data that would otherwise limit sample sizes. In light of these developments, our proposed methodology will become increasingly applicable in the future, particularly where multiple genetically correlated traits show at least suggestive evidence of association at a locus. Furthermore, our approach could even be applied to fine-map seemingly disparate traits such as height and educational attainment, which, nonetheless, share a genetic component (Bulik-Sullivan et al., 2015).

We conclude by highlighting some caveats and limitations of our proposed framework. The power of our multi-trait fine-mapping framework hinges on the assumption that causal variants are shared at pleiotropic risk regions. While this notion is supported by the fact that related traits have shared functional genetic architectures (Finucane et al., 2015), it is unknown whether this holds in general when doing fine-mapping. Reassuringly, we demonstrated in simulations that the coverage of the resulting credible sets is reduced by a modest 10% when this assumption is violated. Second, most large-scale GWAS have overlapping samples and the conditional independence assumption given in (Eq. 9) may be violated. However, it is unclear whether this violation will bias the results dramatically if the underlying causal variants are shared across traits. Finally, while our Importance Sampling scheme does not explicitly upper-bound the number of causal variants at a fine-mapping regions, it favors exploring parsimonious models over complex ones. We therefore advocate that fine-mapping using our approach be undertaken where there is evidence of only moderate allelic heterogeneity.

Funding

This research has been supported by U01-CA194393, CA182821, and R01 GM053275 from the National Institute of Health. GK and MR are supported by the Biomedical Big Data Training Program (NIH-NCI T32CA201160).

Conflict of Interest: none declared.

References

  1. 1000 Genomes Project Consortium. et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Benner C. et al. (2015) Finemap: efficient variable selection using summary data from genome-wide association studies. bioRxiv, 027342.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bulik-Sullivan B. et al. (2015) An atlas of genetic correlations across human diseases and traits. Nat. Genet., 47, 1236–1241. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Buse A. (1982) The likelihood ratio, Wald, and Lagrange multiplier tests: an expository note. Am. Stat., 36, 153–157. [Google Scholar]
  5. Chen W. et al. (2015) Fine mapping causal variants with an approximate Bayesian method using marginal test statistics. Genetics, 200, 719–736. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Chung D. et al. (2014) GPA: a statistical approach to prioritizing GWAS results by integrating pleiotropy and annotation. PLoS Genet., 10, e1004787. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Claussnitzer M. et al. (2015) FTO obesity variant circuitry and adipocyte browning in humans. N. Engl. J. Med., 373, 895–907. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. ENCODE Project Consortium. et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Evangelou E., Ioannidis J.P. (2013) Meta-analysis methods for genome-wide association studies and beyond. Nat. Rev. Genet., 14, 379–389. [DOI] [PubMed] [Google Scholar]
  10. Finucane H.K. et al. (2015) Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet., 47, 1228–1235. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Global Lipids Genetics Consortium. et al. (2013) Discovery and refinement of loci associated with lipid levels. Nat. Genet., 45, 1274–1283. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Glynn P.W., Iglehart D.L. (1989) Importance sampling for stochastic simulations. Manag. Sci., 35, 1367–1392. [Google Scholar]
  13. Gusev A. et al. (2014) Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet., 95, 535–552. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Hormozdiari F. et al. (2014) Identifying causal variants at loci with multiple signals of association. Genetics, 198, 497–508. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Kichaev G., Pasaniuc B. (2015) Leveraging functional-annotation data in trans-ethnic fine-mapping studies. Am. J. Hum. Genet., 97, 260–271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Kichaev G. et al. (2014) Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet., 10, e1004722.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Kote-Jarai Z. et al. (2013) Fine-mapping identifies multiple prostate cancer risk loci at 5p15, one of which associates with TERT expression. Hum. Mol. Genet., 22, 2520–2528. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Kundaje A. et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Lindström S. et al. (2014) Genome-wide association study identifies multiple loci associated with both mammographic density and breast cancer risk. Nat. Commun., 5, 5303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Liu J.Z. et al. (2015) Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet., 47, 979–986. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Locke A.E. et al. (2015) Genetic studies of body mass index yield new insights for obesity biology. Nature, 518, 197–206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Maller J.B. et al. (2012) Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat. Genet., 44, 1294–1301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Meyer K.B. et al. (2013) Fine-scale mapping of the fgfr2 breast cancer risk locus: putative functional variants differentially bind foxa1 and e2f1. Am. J. Hum. Genet., 93, 1046–1060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Musunuru K. et al. (2010) From noncoding variant to phenotype via sort1 at the 1p13 cholesterol locus. Nature, 466, 714–719. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Okada Y. et al. (2014) Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature, 506, 376–381. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Pasaniuc B. et al. (2014) Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics, btu416.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Solovieff N. et al. (2013) Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet., 14, 483–495. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Su Z. et al. (2011) Hapgen2: simulation of multiple disease SNPs. Bioinformatics., 27, 2304–2305. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Type,A.G.E.N. et al. (2014) Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat. Genet., 46, 234–244. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Visscher P.M. et al. (2012) Five years of GWAS discovery. Am. J. Hum. Genet., 90, 7–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Wood A.R. et al. (2014) Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet., 46, 1173–1186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Wu Y. et al. (2013) Trans-ethnic fine-mapping of lipid loci identifies population-specific signals and allelic heterogeneity that increases the trait variance explained. PLoS Genet., 9, e1003379. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Yang J. et al. (2011) Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet., 43, 519–525. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Zhou V.W. et al. (2011) Charting histone modifications and the functional organization of mammalian genomes. Nat. Rev. Genet., 12, 7–18. [DOI] [PubMed] [Google Scholar]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES