Efficient Integrative Multi-SNP Association Analysis via Deterministic Approximation of Posteriors

Xiaoquan Wen; Yeji Lee; Francesca Luca; Roger Pique-Regi

doi:10.1016/j.ajhg.2016.03.029

2016 May 26;98(6):1114–1129. doi: 10.1016/j.ajhg.2016.03.029

PMCID: PMC4908152 PMID: 27236919

Abstract

With the increasing availability of functional genomic data, incorporating genomic annotations into genetic association analysis has become a standard procedure. However, the existing methods often lack rigor and/or computational efficiency and consequently do not maximize the utility of functional annotations. In this paper, we propose a rigorous inference procedure to perform integrative association analysis incorporating genomic annotations for both traditional GWASs and emerging molecular QTL mapping studies. In particular, we propose an algorithm, named deterministic approximation of posteriors (DAP), which enables highly efficient and accurate joint enrichment analysis and identification of multiple causal variants. We use a series of simulation studies to highlight the power and computational efficiency of our proposed approach and further demonstrate it by analyzing the cross-population eQTL data from the GEUVADIS project and the multi-tissue eQTL data from the GTEx project. In particular, we find that genetic variants predicted to disrupt transcription factor binding sites are enriched in cis-eQTLs across all tissues. Moreover, the enrichment estimates obtained across the tissues are correlated with the cell types for which the annotations are derived.

Introduction

Association analysis has become a powerful tool for identifying genetic variants that impact complex traits at both the organismal and molecular levels: in the past decade, genome-wide association studies (GWASs) have successfully identified a rich catalog of genetic variants that are linked to many human diseases. Most recently, molecular QTL mapping has revealed an abundance of quantitative trait loci (QTLs) for cellular phenotypes such as gene expression,¹,² chromatin accessibility,³ histone modifications,⁴ and DNA methylation.⁵ Nevertheless, the causal molecular pathways from genetic variants to complex phenotypes remain poorly understood.⁶ This is mainly because a good proportion of identified trait-associated variants are located in the non-coding regions of the genome, and our knowledge of the functional roles of non-coding variants is generally lacking. With the recent advancements in high-throughput experimental technologies, functional annotations for regulatory variants have become increasingly available.¹,⁷,⁸ As a consequence, it is now feasible to perform association analysis incorporating functional genomic annotations. The integrative analysis strategy presents two obvious advantages: first, it improves the power of association analysis by prioritizing functional variants; second, it helps to reveal the underlying molecular mechanisms that lead to the observed associations.

In the past, integrative analysis was typically performed by searching for overlaps between putative association signals and SNP annotations. This analysis strategy implicitly assumes that a SNP with specific genomic annotations is probably causal. To justify the results from the post hoc overlapping analysis, quantitatively validating this implicit assumption from the observed association data, which essentially requires estimating the enrichment levels of the annotations in the association signals, is critical. This point becomes particularly crucial when multiple types of annotations are used, and a rigorous quantitative enrichment analysis should help to determine which annotations are relevant and how much we should weigh each annotation. The availability of functional annotations also enables high-resolution multi-SNP genetic association analysis. From both GWAS and molecular QTL mapping studies, it is increasingly evident that multiple independent association signals can co-exist in a relatively small genomic region. Multi-SNP fine-mapping analysis has now become a standard procedure to tease out potential multiple association signals. It is only natural that genomic annotations are integrated into this process.

Recently, a few computational approaches for integrative enrichment and association analysis have been proposed and successfully demonstrated in molecular QTL mapping⁹,¹⁰ and GWASs.¹¹,¹² However, these existing approaches make simplifying assumptions for either enrichment analysis¹² or multi-SNP fine-mapping analysis.⁹,¹¹ Therefore, the power of integrative analysis has not been maximized and can be further improved. In addition, computational efficiency has always been a hurdle in terms of applying probabilistic integrative analysis approaches to genetic data at the genome-wide scale.

In this paper, we propose a probabilistic hierarchical model that is generalized from our recent work¹³ to describe multi-SNP genetic associations while accounting for functional genomic annotations. Based on this model, we consider analyzing genetic association data in two settings: traditional GWASs and molecular cis-QTL mapping studies. Note that a distinct feature of molecular QTL mapping is that tens of thousands (or hundreds of thousands) of molecular phenotypes (e.g., gene expression, DNA methylation) are simultaneously measured and analyzed, which imposes some unique statistical challenges. In addition, the candidate genomic region for each molecular phenotype is typically defined in the proximity of relevant genomic landmarks of the corresponding molecular phenotypes (e.g., transcription start site of a target gene for expression phenotypes) and is much smaller in length (usually spanning 1 to 2 Mb) compared to GWASs. We outline a three-stage inference procedure to sequentially perform enrichment analysis, QTL discovery, and multi-SNP fine mapping. One of our main contributions is a computationally efficient algorithm for Bayesian multi-SNP association analysis. This fast fitting algorithm, named deterministic approximation of posteriors (DAP), facilitates the proposed rigorous integrative inference procedure. Compared to the alternative fitting algorithm, i.e., the Markov Chain Monte Carlo (MCMC) algorithm, we show that the DAP is several hundred times faster and more accurate for genetic association analysis. Taking full advantage of the DAP algorithm, we lay out the analytic strategies for analyzing genetic association data from GWASs and molecular cis-QTL mapping studies, and we demonstrate the proposed procedures through a series of simulation studies and real data applications.

Material and Methods

Model and Notation

First, we consider a generic setting of association analysis of a single quantitative trait and p SNPs, both measured for n unrelated individuals. We model the genotype-phenotype association using a multiple linear regression model,

\vec{y} = μ \vec{1} + \sum_{i = 1}^{p} β_{i} {\vec{g}}_{i} + \vec{e}, \vec{e} \sim N (0, σ^{2} I) .

(Equation 1)

For each SNP i, we denote its binary association status, γ_i, by dichotomizing its corresponding genetic effect β_i, i.e., γ_i = 1 if β_i ≠ 0 and 0 otherwise. In particular, we refer to the causal SNPs for which γ_i = 1 as the quantitative trait nucleotides (QTNs).⁹ Our primary interest for association analysis is the inference of $\vec{γ} : = (γ_{1}, \dots, γ_{p})$ . To integrate genomic annotation into the association analysis, we assume that having certain genomic features will increase (or decrease) the odds that a particular SNP is a QTN. Equivalently, certain genomic features are enriched (or depleted) in QTNs. We quantitatively represent this assumption using an a priori independent logistic model for each γ_i, i.e.,

log [\frac{Pr (γ_{i} = 1)}{Pr (γ_{i} = 0)}] = α_{0} + \sum_{k = 1}^{q} α_{k} d_{i k},

(Equation 2)

where ${\vec{d}}_{i} : = (d_{i 1}, \dots, d_{i q})$ denotes q genomic annotations that are specific to SNP i at a particular locus and α₁, …, α_q are referred to as the enrichment parameters. Note that the annotations can be either categorical or continuous in this framework. We assume that the phenotype data, $\vec{y}$ , the genotype data, $G : = ({\vec{g}}_{1}, \dots, {\vec{g}}_{p})$ , and the annotation data, $D : = ({\vec{d}}_{1}, \dots, {\vec{d}}_{p})$ , are observed, whereas the enrichment parameters, $\vec{α} : = (α_{0}, α_{1}, \dots, α_{q})$ , are unknown.
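As a minimal illustration of Equation 2, the following Python sketch converts enrichment parameters and SNP annotations into prior inclusion probabilities. The function name and the numeric values are hypothetical and are not part of the DAP software.

```python
import math

def prior_inclusion_prob(alpha, d):
    """Pr(gamma_i = 1) implied by the logistic prior of Equation 2.

    alpha: [alpha_0, alpha_1, ..., alpha_q]
    d:     annotations of SNP i, [d_i1, ..., d_iq]
    """
    log_odds = alpha[0] + sum(a * x for a, x in zip(alpha[1:], d))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values: baseline log-odds of -4 and one binary annotation
# with enrichment parameter 1.
alpha = [-4.0, 1.0]
p_unannotated = prior_inclusion_prob(alpha, [0])
p_annotated = prior_inclusion_prob(alpha, [1])
```

With these illustrative values, an annotated SNP receives a prior odds that is exp(1) times that of an unannotated SNP, which is the sense in which a positive α_k represents enrichment.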

For molecular QTL mapping, tens of thousands of phenotypes are simultaneously measured, and we denote the collection of all measured phenotypes by $Y : = ({\vec{y}}_{1}, \dots, {\vec{y}}_{L})$ . For each phenotype, a small genomic region, typically spanning 1 to 2 Mb and on average containing a few thousand SNPs, is pre-defined as the candidate locus in the proximity of relevant genomic landmarks of the corresponding molecular phenotypes, and we denote the union of the SNP genotypes from all candidate loci by $G : = (G_{1}, \dots, G_{L})$ . Similarly, we use $D : = (D_{1}, \dots, D_{L})$ and $Γ : = ({\vec{γ}}_{1}, \dots, {\vec{γ}}_{L})$ to denote the collections of annotations and latent association status, respectively.

In GWASs, there is usually only one phenotype of interest, which can be viewed as a special case of molecular QTL mapping. Nevertheless, it is important to note that the candidate region for GWASs spans the whole genome.

Inference Procedure

We propose an inference procedure consisting of three inter-related stages to fit the proposed hierarchical model. Sequentially, these stages are as follows:

1. Estimating the enrichment parameter $\vec{α}$ using the full data $Y, G$ , and $D$ for enrichment analysis
2. Screening candidate loci for QTL discovery
3. Performing multi-SNP fine mapping for the high-priority loci identified in step 2

The maximum likelihood estimate (MLE) of $\vec{α}$ can be obtained by the EM algorithm proposed in our recent work.¹³ In brief, the EM algorithm treats $Γ$ as missing data and pools information across all available loci. In the E-step, the posterior inclusion probability (PIP) for each SNP i at each locus l (namely, $Pr (γ_{l_{i}} = 1 | {\vec{y}}_{l}, G_{l}, {\vec{α}}^{(t)})$ ) is computed given the current estimate of $\vec{α}$ ; in the M-step, a logistic regression model is fit by plugging in the PIPs as the response variables and SNP annotations as predictors. The estimate of $\vec{α}$ is subsequently updated by the corresponding fitted regression coefficients.
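To make the M-step concrete, the sketch below treats the simplest case of a single binary annotation. It is an illustration under that assumption, not the general logistic regression fit used in the EM algorithm: with one binary predictor the model is saturated, so fitting a logistic regression with PIPs as fractional responses reduces to taking the logit of the mean PIP in each annotation group. The function names and PIP values are hypothetical.

```python
import math

def logit(x):
    return math.log(x / (1.0 - x))

def m_step_binary(pips, annot):
    """M-step for one binary annotation.

    pips:  current PIPs, used as fractional responses
    annot: 0/1 annotation of each SNP
    Returns (alpha_0, alpha_1).
    """
    g0 = [w for w, d in zip(pips, annot) if d == 0]
    g1 = [w for w, d in zip(pips, annot) if d == 1]
    p0 = sum(g0) / len(g0)   # mean PIP among unannotated SNPs
    p1 = sum(g1) / len(g1)   # mean PIP among annotated SNPs
    alpha0 = logit(p0)
    alpha1 = logit(p1) - alpha0
    return alpha0, alpha1

# Hypothetical PIPs from an E-step, pooled across loci.
pips = [0.01, 0.03, 0.02, 0.10, 0.06]
annot = [0, 0, 0, 1, 1]
alpha0, alpha1 = m_step_binary(pips, annot)
```

For continuous or multiple annotations no closed form exists, and an iterative logistic regression fit would be needed instead.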

Given the MLE of the enrichment parameter, $\hat{\vec{α}}$ , we then attempt to identify genomic loci that are likely to harbor causal QTNs. This is achieved by testing the null hypothesis, $H_{0} : {\vec{γ}}_{l} = 0$ , for each candidate locus l via a Bayesian false discovery rate (FDR) control procedure. Specifically, the null hypothesis is rejected if the locus-level posterior probability $Pr ({\vec{γ}}_{l} = 0 | {\vec{y}}_{l}, G_{l}, \hat{\vec{α}})$ is smaller than the pre-defined threshold determined by the observed data and desired FDR control level.¹⁴ At the end of this stage, we gather a list of potential QTLs for fine mapping.
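The rejection rule can be sketched as follows. This is one common form of Bayesian FDR control, assumed here for illustration; the exact procedure of the cited reference is not reproduced. Loci are ranked by their posterior null probabilities, and the largest set whose average null probability does not exceed the target FDR level is rejected. The function name and input values are hypothetical.

```python
def bayesian_fdr_reject(null_probs, fdr_level=0.05):
    """Return the indices of loci at which the null hypothesis is rejected.

    null_probs[l] is the locus-level posterior probability of the null.
    Loci are ranked by increasing null probability; the largest prefix whose
    mean null probability is at most fdr_level is rejected.
    """
    order = sorted(range(len(null_probs)), key=lambda l: null_probs[l])
    rejected, running_sum = [], 0.0
    for rank, l in enumerate(order, start=1):
        running_sum += null_probs[l]
        # The running mean is non-decreasing, so we can stop at the
        # first violation.
        if running_sum / rank > fdr_level:
            break
        rejected.append(l)
    return rejected

# Hypothetical locus-level posterior null probabilities.
null_probs = [0.001, 0.60, 0.02, 0.95, 0.10]
rejected = bayesian_fdr_reject(null_probs, fdr_level=0.05)
```

The implied probability threshold is data dependent: it is the largest null probability among the rejected loci.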

Finally, we perform multi-SNP fine-mapping analysis for the identified QTLs. In particular, we compute the posterior distribution for each locus l, namely, $Pr ({\vec{γ}}_{l} | {\vec{y}}_{l}, G_{l}, \hat{\vec{α}})$ , to (1) identify potentially multiple independent association signals within locus l and (2) assess the importance of each SNP by computing its PIP, i.e., $Pr (γ_{l_{i}} = 1 | {\vec{y}}_{l}, G_{l}, \hat{\vec{α}})$ . A credible set of potential causal SNPs for each independent signal can then be constructed from the resulting PIPs in a manner similar to previously proposed methods.¹³,¹⁵ This Bayesian approach for multi-SNP analysis has been known to present some unique advantages over the traditional conditional analysis approach. For example, it fully accounts for patterns of linkage disequilibrium (LD) and shows superior power in discovering independent association signals.¹³,¹⁶

This three-stage procedure represents a coherent empirical Bayes strategy to fit the proposed hierarchical model for inference. In all three stages, the computational difficulty lies in the efficient evaluation of the posterior probability $Pr ({\vec{γ}}_{l} | {\vec{y}}_{l}, G_{l}, \vec{α})$ . We propose an algorithm to tackle this problem in the following sections. The software package implementing the computational approaches (in C++ programming language) is freely available (Web Resources).

Deterministic Approximation of Posteriors

The computation of the target posterior probability $Pr ({\vec{γ}}_{l} | {\vec{y}}_{l}, G_{l}, \vec{α})$ is conceptually straightforward by applying the Bayes theorem, i.e.,

Pr ({\vec{γ}}_{l} = \vec{γ} | {\vec{y}}_{l}, G_{l}, \vec{α}) = \frac{Pr (\vec{γ} | \vec{α}) BF (\vec{γ})}{\sum_{\vec{γ}'} Pr ({\vec{γ}}^{'} | \vec{α}) BF ({\vec{γ}}^{'})},

(Equation 3)

where the Bayes factor

BF (\vec{γ}) : = \frac{P ({\vec{y}}_{l} | G_{l}, {\vec{γ}}_{l} = \vec{γ})}{P ({\vec{y}}_{l} | G_{l}, {\vec{γ}}_{l} \equiv 0)}

represents the marginal likelihood function of ${\vec{γ}}_{l}$ evaluated at $\vec{γ}$ . Based on Equation 3, the PIP of each candidate SNP can be subsequently marginalized from $Pr ({\vec{γ}}_{l} | {\vec{y}}_{l}, G_{l}, \vec{α})$ .

For any given $\vec{γ}$ value, both the Bayes factor (whose computation involves integrating out the nuisance parameters μ, β, and σ²) and the prior probability can be analytically evaluated.¹⁷,¹⁸ The difficulty lies in evaluating the normalizing constant

C : = \sum_{\vec{γ}} Pr ({\vec{γ}}_{l} = \vec{γ} | \vec{α}) BF (\vec{γ}) .

For a locus consisting of p candidate SNPs, the exact computation requires enumerating all 2^p possible $\vec{γ}$ values; hence, it is intractable even for modest p. Previously, the only feasible solution was to employ a Markov Chain Monte Carlo (MCMC) algorithm.¹³,¹⁶,¹⁹ However, the MCMC algorithm is computationally too costly in our grand scheme for integrative genetic association analysis: the evaluation of $Pr ({\vec{γ}}_{l} | {\vec{y}}_{l}, G_{l}, \vec{α})$ for every locus is required for each E-step in the EM algorithm for enrichment analysis. Furthermore, the inherent stochastic variation in the MCMC algorithm can affect the performance and reproducibility of the overall analysis.
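The cost of the exact calculation is easy to see in a brute-force sketch of Equation 3. The code below enumerates all 2^p models for a small locus and returns the normalizing constant C together with the PIPs. The Bayes factor is replaced by a toy stand-in function, which is an assumption made purely for illustration; the analytic Bayes factor used in the paper is not reproduced here.

```python
import itertools
import math

def exact_pips(prior_probs, log_bf):
    """Enumerate all 2^p models and return (C, PIPs).

    prior_probs[i] = Pr(gamma_i = 1 | alpha)
    log_bf(gamma)  = log Bayes factor of the model given as a 0/1 tuple
    """
    p = len(prior_probs)
    C = 0.0
    pip_num = [0.0] * p
    for gamma in itertools.product((0, 1), repeat=p):
        prior = 1.0
        for g, pi in zip(gamma, prior_probs):
            prior *= pi if g else (1.0 - pi)
        w = prior * math.exp(log_bf(gamma))   # unnormalized posterior
        C += w
        for i, g in enumerate(gamma):
            if g:
                pip_num[i] += w
    return C, [x / C for x in pip_num]

# Toy stand-in for the Bayes factor (hypothetical): SNP 0 carries strong
# evidence of association and the remaining SNPs carry none.
def toy_log_bf(gamma):
    return 5.0 * gamma[0]

C, pips = exact_pips([0.02] * 4, toy_log_bf)
```

The loop body runs 2^p times, so each additional candidate SNP doubles the work; this is what makes the exact calculation impractical for loci with thousands of SNPs.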

Here, we present an alternative algorithm to perform deterministic approximation of posteriors (DAP) for each locus and efficiently compute PIPs for all candidate SNPs. This algorithm is mainly motivated by two observations in genetic association analysis. First, in almost all genetic applications, the number of convincing QTLs (i.e., those that have relatively large effect sizes) discovered from the association data is typically small compared with the number of candidate SNPs within a candidate locus (typically 1 to 2 Mb). In molecular QTL mapping, this observation is also supported by many recent experimental works.²⁰,²¹,²² It implies that the vast majority of the posterior probability mass in the space of all possible combinations of SNPs must be concentrated in a much lower-dimensional subspace. That is, only association models containing a few SNPs are likely to have non-negligible posterior probabilities within a locus. Second, noteworthy QTL SNPs, as reflected by their non-negligible PIP values, are thought to typically show modest to strong marginal association signals in either single-SNP or conditional analysis. Based on the above observations, we design the DAP algorithm to adaptively select a small subset of noteworthy candidate QTL SNPs and thoroughly explore the low-dimensional model space composed by these SNPs within each candidate locus. In addition, the DAP algorithm applies a combinatorial approximation to estimate the posterior probability mass from the unexplored model space. Unlike the MCMC, the DAP algorithm is highly parallelizable, and our implementation takes full advantage of this property. More specifically, the proposed DAP algorithm approximates the normalizing constant C by

C^{*} = \sum_{{\vec{γ}}^{'} \in Ω} Pr ({\vec{γ}}_{l} = {\vec{γ}}^{'} | \vec{α}) BF ({\vec{γ}}^{'}) + ε,

(Equation 4)

where Ω denotes a subset of the selected most plausible models to be explored explicitly and ε is an estimate of the approximation error $C - \sum_{{\vec{γ}}^{'} \in Ω} Pr ({\vec{γ}}_{l} = {\vec{γ}}^{'} | \vec{α}) BF ({\vec{γ}}^{'})$ . The key to the DAP algorithm is the construction of the set Ω: it is desirable that models in Ω capture the vast majority of the posterior probability mass; on the other hand, Ω should be compact enough for efficient exploration. In this paper, we propose two different approaches to construct Ω. In both cases, we define the size of the association model, $‖ {\vec{γ}}_{l} ‖$ , as the number of assumed QTNs (also known as the 0-norm of the vector ${\vec{γ}}_{l}$ ), i.e., $‖ {\vec{γ}}_{l} ‖ = \sum_{i = 1}^{p} γ_{l_{i}}$ , and partition the complete model space of ${\vec{γ}}_{l}$ by the size of association models, i.e., $\{ {\vec{γ}}_{l} \} = \{ ‖ {\vec{γ}}_{l} ‖ = 0 \} \cup \{ ‖ {\vec{γ}}_{l} ‖ = 1 \} \cup \dots \cup \{ ‖ {\vec{γ}}_{l} ‖ = p \}$ .

Adaptive DAP Algorithm

The first approach, named adaptive DAP, includes the null model and all the single SNP association models in the candidate set Ω. For a larger size of candidate models, it approximates $C_{s} : = \sum_{‖ \vec{γ} ‖ = s} Pr (\vec{γ} | \vec{α}) BF (\vec{γ})$ by a corresponding estimate $C_{s}^{*} = \sum_{\vec{γ} \in Ω_{s}} Pr (\vec{γ} | \vec{α}) BF (\vec{γ})$ , where Ω_s consists of a subset of association models with size s but is constructed only from a set of adaptively selected high-priority SNPs. The adaptive selection of the high-priority SNPs is similar to a Bayesian version of conditional analysis²³ that naturally accounts for LD. More specifically, suppose that a “best” model with the maximum posterior probability for $‖ \vec{γ} ‖ = s - 1$ has been identified. The SNP selection procedure then goes through all candidate SNPs, adding a single SNP at a time to the existing best model, and evaluates their posterior probabilities of being the sole additional QTN (see details in Appendix A). Note that this procedure is similar to single-SNP analysis and is computationally trivial. The candidate SNPs whose posterior probabilities in the conditional analysis are greater than a pre-defined threshold λ, which is a valid probability measure (by default, we set λ = 0.01), are then added to the existing subset of high-priority SNPs. Finally, the DAP algorithm enumerates the updated subset of priority SNPs for all combinations of $‖ \vec{γ} ‖ = s$ to compute $C_{s}^{*}$ and, in the process, records the “best” posterior model with the increased model size.
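The enumeration step at a given model size can be sketched as follows. This is a simplified illustration, not the implementation in the DAP software: the high-priority SNP set is taken as given (the conditional-analysis selection with threshold λ is not shown), the Bayes factor is a toy additive stand-in, and model weights are expressed relative to the null model so that the common factor shared by all prior probabilities cancels. All names and values are hypothetical.

```python
import itertools
import math

def partial_sum_size_s(priority_snps, s, prior_odds, log_bf):
    """Sum prior-odds x BF over all size-s models built from priority SNPs.

    Returns the partial sum (an estimate of C_s up to a constant factor)
    and the best model of size s.
    """
    total, best_model, best_w = 0.0, None, -1.0
    for model in itertools.combinations(sorted(priority_snps), s):
        odds = math.prod(prior_odds[i] for i in model)
        w = odds * math.exp(log_bf(model))
        total += w
        if w > best_w:
            best_model, best_w = model, w
    return total, best_model

# Toy inputs (hypothetical): per-SNP prior odds and an additive log BF.
prior_odds = {0: 0.02, 1: 0.02, 2: 0.02}
evidence = {0: 5.0, 1: 3.0, 2: 0.0}

def toy_log_bf(model):
    return sum(evidence[i] for i in model)

# SNP 2 was not selected as high priority, so it is never enumerated.
C1, best1 = partial_sum_size_s({0, 1}, 1, prior_odds, toy_log_bf)
C2, best2 = partial_sum_size_s({0, 1}, 2, prior_odds, toy_log_bf)
```

Because only combinations of the priority SNPs are visited, the number of evaluated models at size s is a binomial coefficient in the size of the priority set rather than in p.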

Additionally, the adaptive DAP extensively explores only the model partitions with relatively small sizes. Suppose that there are truly K QTLs in p candidate SNPs. It should be clear that {C_s} becomes a (sharply) decreasing sequence for s > K and that the behavior of this decreasing sequence is mathematically predictable (Appendix B). This behavior occurs because the marginal likelihood becomes saturated as the model size exceeds the number of true associations and because the additional prior term imposes a hefty penalty on the overall product. Utilizing this fact, we derive an approximate recursive relationship between C_s and C_{s+1} for s ≥ K (Appendix B). Based on this relationship, the stopping rule for explicit exploration is determined, and we estimate ε by

ε = \sum_{s = t + 1}^{p} R_{s}^{*} \text{ with } R_{s + 1}^{*} = \frac{p - s}{s + 1} ω R_{s}^{*} \text{ for } s = t + 1, \dots, p,

(Equation 5)

where t is the stopping point of the extensive exploration, $R_{t}^{*} = C_{t}^{*}$ , and $ω = (1 / p) \sum_{i = 1}^{p} exp (α_{0} + \sum_{l = 1}^{q} α_{l} d_{i l})$ represents the average prior odds ratio across SNPs. This estimation essentially assumes that the marginal likelihood is completely saturated for the partitions with s > t, and the overall contribution to the normalizing constant from each size partition can be roughly estimated by re-calibrating the prior changes (see details in Appendix B). To ensure a high accuracy for the approximation, we also build in an optional criterion on top of the stopping rule by monitoring the convergence of the partial sum $S_{k} = \sum_{i}^{k} C_{i}^{*}$ and enforcing the exploration until

{log}_{10} [\frac{S_{t}}{S_{t - 1}}] < κ, κ > 0,

or, equivalently $(C_{t}^{*} / \sum_{i}^{t - 1} C_{i}^{*}) < 10^{κ} - 1$ . By default, we set $κ = 0.01$ . This additional criterion makes a difference only for the partitions whose model sizes barely exceed the estimated size of the saturated models: instead of using the combinatorial estimate of the corresponding $C_{s}^{*}$ , it enforces additional DAP explorations for more accurate evaluations.
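The combinatorial tail estimate and the optional convergence criterion can be sketched together. The code assumes one reading of the recursion, in which it is started from $R_{t}^{*} = C_{t}^{*}$ and applied for s = t, …, p − 1 so that the terms $R_{t+1}^{*}, \dots, R_{p}^{*}$ are generated and summed. It also assumes that the list of partial sums starts with the null-model partition. Function names and numeric inputs are hypothetical.

```python
import math

def average_prior_odds(alpha, annotations):
    """omega: mean over SNPs of exp(alpha_0 + sum_k alpha_k * d_ik)."""
    vals = [math.exp(alpha[0] + sum(a * x for a, x in zip(alpha[1:], d)))
            for d in annotations]
    return sum(vals) / len(vals)

def tail_estimate(C_t, t, p, omega):
    """epsilon = R_{t+1} + ... + R_p, with R_t = C_t and
    R_{s+1} = (p - s) / (s + 1) * omega * R_s."""
    eps, R = 0.0, C_t
    for s in range(t, p):
        R *= (p - s) / (s + 1) * omega
        eps += R
    return eps

def converged(C_star, kappa=0.01):
    """Check log10(S_t / S_{t-1}) < kappa, where C_star = [C_0*, ..., C_t*]."""
    S_prev = sum(C_star[:-1])
    S_t = S_prev + C_star[-1]
    return math.log10(S_t / S_prev) < kappa

omega = average_prior_odds([-4.0, 1.0], [[0], [1]])
eps = tail_estimate(C_t=1.0, t=1, p=3, omega=0.1)
keep_exploring = not converged([1.0, 5.0])
```

The factor (p − s)/(s + 1) is the ratio of the number of size-(s + 1) models to the number of size-s models, so each step rescales the previous term by the growth in model count times the average prior odds.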

Finally, it should be recognized that the built-in tuning parameters $(λ, κ)$ enable great flexibility to run the adaptive DAP. As both $λ \to 0$ and $κ \to 0$ , the adaptive DAP enumerates all models and becomes an exact calculation with no loss of precision, whereas when λ is very large, the behavior of the DAP algorithm becomes very similar to the commonly applied stepwise conditional analysis that has very high computational efficiency. In practice, we attempt to strike a good balance between the precision and efficiency.

DAP-K Algorithm

Instead of adaptively selecting a subset of high-priority SNPs from all the model size partitions, the DAP algorithm can also be applied by pre-fixing the maximum model size (namely, K) while allowing the exploration of all possible SNP combinations under the restriction. We refer to this variant of the algorithm as the DAP-K algorithm. In the special case of K = 1 (DAP-1), the algorithm essentially assumes that at most one causal QTL exists in the region of interest. Although this very assumption has been successfully utilized by many other approaches,⁹,¹¹,¹⁷,²³ it has always been formulated as an explicit prior assumption and hence requires a somewhat unnatural parameterization that also complicates the maximization step when used in the EM algorithm for enrichment analysis (Appendix C). The DAP-1 algorithm provides the advantage of considerably faster computation, even when compared with the adaptive version of the DAP algorithm. More importantly, it can be applied using only summary statistics from single-SNP association analysis (in the form of the marginal estimate of the genetic effect and its standard error for each SNP). This feature is particularly attractive, especially when the individual-level genotype and phenotype information is difficult to access. We provide the derivation and other technical details for the DAP-K algorithm in Appendix C.
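The use of summary statistics can be illustrated with a simplified sketch. Two substitutions are made here and should be read as assumptions: the Bayes factor is replaced by the Wakefield approximate Bayes factor, which depends only on the effect estimate and its standard error, and the correction term for unexplored models is omitted, so that only the null model and the single-SNP models contribute. The prior effect standard deviation and all input values are hypothetical.

```python
import math

def wakefield_log_abf(beta_hat, se, prior_sd):
    """Log approximate Bayes factor (alternative vs. null) for one SNP."""
    V, W = se ** 2, prior_sd ** 2
    z2 = (beta_hat / se) ** 2
    return 0.5 * math.log(V / (V + W)) + 0.5 * z2 * W / (V + W)

def dap1_pips(beta_hats, ses, prior_probs, prior_sd=0.6):
    """PIPs when only the null and single-SNP models are explored."""
    weights = []
    for b, s, pi in zip(beta_hats, ses, prior_probs):
        odds = pi / (1.0 - pi)   # prior odds relative to the null model
        weights.append(odds * math.exp(wakefield_log_abf(b, s, prior_sd)))
    norm = 1.0 + sum(weights)    # 1.0 is the relative weight of the null
    return [w / norm for w in weights]

# Hypothetical single-SNP summary statistics for a three-SNP locus.
pips = dap1_pips(beta_hats=[0.5, 0.05, -0.02],
                 ses=[0.08, 0.08, 0.08],
                 prior_probs=[0.02, 0.02, 0.02])
```

Because each single-SNP model is weighted by its own prior odds, annotation-informed priors enter the calculation directly, without individual-level data.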

Applying DAP in Inference

We use both variants of the DAP algorithms in our inference procedure. Specifically, we propose applying the DAP-1 algorithm in the EM algorithm for enrichment analysis and the adaptive DAP for multi-SNP fine mapping at the last stage.

The performance of the enrichment analysis mostly relies on the average accuracy of the PIP estimates. We show, both theoretically (Appendix E) and numerically (Figure 2), that the DAP-1 algorithm provides on average precise estimates suitable for enrichment analysis. Most importantly, the DAP-1 algorithm exhibits the best computational efficiency among the appropriate alternatives (e.g., adaptive DAP, MCMC).

Figure 2.

Assessment of the Accuracy of the Adaptive DAP Algorithm at Different Threshold Values

In the top panel, the individual PIP approximations from the DAP are compared to the exact calculations. In the bottom panel, the distribution of C^∗/C is plotted. The simulation results are obtained for threshold values λ = 0.01, 0.02, 0.05 for the DAP algorithm.

For the multi-SNP analysis in the final fine-mapping stage, we strongly recommend applying the adaptive DAP algorithm. Although the DAP-1 algorithm yields inferior results only for the small proportion of loci that harbor multiple QTNs, we argue that identifying multiple independent association signals from those loci is of particular importance for the overall analysis. To achieve better accuracy for all loci, the adaptive DAP seems a logical choice for multi-SNP fine-mapping analysis.

Application to GWASs

In practice, the DAP works well for small genomic regions harboring a handful of QTNs. This is typically the case in molecular QTL mapping, where candidate loci usually span no more than 2 Mb. When there are more QTNs (e.g., >5) in a locus, the adaptive DAP exploration with high precision may become time consuming because the size of the candidate set Ω grows exponentially with the number of independent signals. In GWAS applications, however, we essentially consider a single locus that spans the whole genome, and for a single trait, the number of independent association signals can range from hundreds to thousands.

To apply the DAP to GWASs (or molecular QTL mapping with considerably larger candidate loci), we propose an additional approximation that factorizes $Pr ({\vec{γ}}_{l} | {\vec{y}}_{l}, G_{l}, \vec{α})$ (where locus l spans a much larger genomic region) into

Pr ({\vec{γ}}_{l} | {\vec{y}}_{l}, G_{l}, \vec{α}) \approx \prod_{k = 1}^{K} Pr ({\vec{γ}}_{[k]} | {\vec{y}}_{l}, G_{l}, \vec{α}),

(Equation 6)

where $\{ {\vec{γ}}_{[k]} : k = 1, \dots, K \}$ represents a partition of ${\vec{γ}}_{l}$ by sets of non-overlapping LD blocks. This factorization is based on previous theoretical results.¹⁸,²⁴ Recently, Berisa and Pickrell²⁵ provided a working recipe to segment the full genome based on the population-specific LD structures. Based on these results, we provide mathematical arguments to justify the factorization (Appendix D). In brief, applying the analytic approximation of the Bayes factors,¹⁸ it can be shown that

BF (\vec{γ}) \approx \prod_{k = 1}^{K} BF ({\vec{γ}}_{[k]}) .

This result, along with the fact that our priors are independent across SNPs, naturally leads to the approximate factorization of the posterior probability. As an important consequence, the factorization (Equation 6) suggests that the DAP can be applied to each LD block independently.
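The step from the factorized Bayes factor to the factorized posterior can be written out explicitly. The following is a short worked derivation that uses only Equation 3, the independence of the priors across SNPs, and the approximate factorization of the Bayes factor; the full argument is given in Appendix D.

```latex
\Pr(\vec{\gamma}_l \mid \vec{y}_l, G_l, \vec{\alpha})
  \;\propto\; \Pr(\vec{\gamma}_l \mid \vec{\alpha}) \, BF(\vec{\gamma}_l)
  \;\approx\; \prod_{k=1}^{K} \Pr(\vec{\gamma}_{[k]} \mid \vec{\alpha}) \, BF(\vec{\gamma}_{[k]}) .

% The normalizing constant is a sum over all block configurations of a
% product of block-specific terms, so it factorizes into block-wise sums:
\sum_{\vec{\gamma}'} \prod_{k=1}^{K}
    \Pr(\vec{\gamma}'_{[k]} \mid \vec{\alpha}) \, BF(\vec{\gamma}'_{[k]})
  \;=\; \prod_{k=1}^{K} \sum_{\vec{\gamma}'_{[k]}}
    \Pr(\vec{\gamma}'_{[k]} \mid \vec{\alpha}) \, BF(\vec{\gamma}'_{[k]}) .

% Dividing term by term gives a product of block-level posteriors:
\Pr(\vec{\gamma}_l \mid \vec{y}_l, G_l, \vec{\alpha})
  \;\approx\; \prod_{k=1}^{K}
  \frac{\Pr(\vec{\gamma}_{[k]} \mid \vec{\alpha}) \, BF(\vec{\gamma}_{[k]})}
       {\sum_{\vec{\gamma}'_{[k]}}
        \Pr(\vec{\gamma}'_{[k]} \mid \vec{\alpha}) \, BF(\vec{\gamma}'_{[k]})} .
```

Each factor in the last line has the same form as Equation 3 restricted to a single LD block, so the block-level normalizing constants can be approximated separately.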

Results

First, we perform a series of simulation studies to examine the accuracy and efficiency of the proposed DAP algorithms in our inference procedure. We then apply the proposed approach to analyze two large-scale eQTL datasets.

Simulation Studies

Enrichment Analysis with DAP

The integration of DAP into the EM algorithm enables the efficient estimation of enrichment parameters using large-scale QTL datasets. To investigate the performance of the enrichment analysis, we simulate a modest-scale eQTL dataset to mimic the genome-wide investigation of cis-eQTLs. Specifically, in each simulation, we select a subset of 1,500 random genes from the GEUVADIS data.² For each gene, the real genotypes of 50 cis-SNPs from 343 European individuals are used in the simulation. We annotate 20% of the SNPs with a binary feature. For each SNP, we determine its binary association status by performing a Bernoulli trial with the success rate $p = exp (- 4 + α_{1} d) / (1 + exp (- 4 + α_{1} d))$. Given the QTNs, we then simulate the expression levels according to a multiple linear regression model with residual error variance set to 1. More specifically, the genetic effect of each QTN is drawn from an independent normal distribution N(0,0.6²). As a result, the simulated datasets resemble the practically observed cis-eQTL data (Figure S1). We vary the α₁ values from 0.00 to 1.00, and we generate 100 datasets for each α₁ value.
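The data-generating scheme for a single gene can be sketched as below. One assumption departs from the simulation described above and is made only to keep the example self-contained: genotypes are drawn independently with an arbitrary allele frequency of 0.3 rather than taken from the real GEUVADIS data, so the sketch contains no LD. The function name, the seed, and the value of α₁ are hypothetical.

```python
import math
import random

def simulate_gene(n=343, p=50, alpha0=-4.0, alpha1=0.5,
                  annot_frac=0.2, effect_sd=0.6, seed=0):
    rng = random.Random(seed)
    # Assumption: independent genotypes (0/1/2 minor-allele counts) stand in
    # for the real genotypes.
    G = [[sum(rng.random() < 0.3 for _ in range(2)) for _ in range(p)]
         for _ in range(n)]
    # Annotate 20% of the SNPs with a binary feature.
    annotated = set(rng.sample(range(p), round(annot_frac * p)))
    d = [1 if i in annotated else 0 for i in range(p)]
    gamma, beta = [], []
    for i in range(p):
        # Bernoulli trial with the logistic success rate.
        prob = 1.0 / (1.0 + math.exp(-(alpha0 + alpha1 * d[i])))
        g = 1 if rng.random() < prob else 0
        gamma.append(g)
        # QTN effects are drawn from N(0, effect_sd^2); others are zero.
        beta.append(rng.gauss(0.0, effect_sd) if g else 0.0)
    # Multiple linear regression with residual error variance 1.
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0)
         for row in G]
    return y, G, d, gamma, beta

y, G, d, gamma, beta = simulate_gene(seed=1)
```

With an intercept of −4, the expected number of QTNs per gene is roughly one among 50 SNPs, so many simulated genes carry no QTN at all.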

We analyze the simulated datasets using two different implementations of the EM algorithm with the E-step approximated by the DAP-1 and the adaptive DAP. For evaluation, we also estimate α₁ by fitting a logistic regression model using the true association status of each SNP. This analysis represents a theoretical best-case scenario, and its results should be regarded as a bound on the best achievable outcome of any analysis that infers the latent association status (Γ) from observed data.

Figure 1 shows that the estimates from the adaptive DAP and DAP-1 are both seemingly unbiased. As expected, the variability of the point estimates from both DAP implementations is higher than that from the best-case method because of the uncertainty in determining the true association status of each SNP. The estimates of the 95% confidence intervals from the individual simulations also confirm this finding (Figure S2). Although the adaptive DAP seemingly generates more accurate estimates on average, we conclude that the numerical performance of DAP-1 is very comparable. Importantly, DAP-1 provides superior computational efficiency: the average running time for the DAP-1-embedded EM algorithm (with 10 parallel threads in the E-step) is 65.05 s; in comparison, the adaptive DAP-embedded EM runs for 387.30 s on average (which is a combination of slightly more iterations and longer running times per iteration).

Figure 1.

Point Estimates of the Enrichment Parameter Produced using Various Analysis Methods in Different Simulation Settings

The point estimate of the α₁ ± standard error (obtained from 100 simulated datasets) for each method is plotted for each simulation setting. The “best case” method uses the true association status and represents the optimal performance for any enrichment analysis method. Both the adaptive DAP and DAP-1 methods yield unbiased estimates in all settings, although the adaptive DAP-embedded EM algorithm generates slightly smaller standard errors.

Finally, we note that both the adaptive DAP and DAP-1 algorithms underestimate the α₀ parameter: on average, DAP-1 estimates ${\hat{α}}_{0} = - 4.62$ , and the adaptive DAP yields ${\hat{α}}_{0} = - 4.32$ (recall that the truth is α₀ = −4.00). This is fully expected, largely because of the limitation of the statistical power in detecting weak association signals. The practical consequence is that the empirical Bayes priors constructed for the final stage of multi-SNP fine-mapping analysis are slightly conservative. However, we argue that the conservative priors generally lead to reduced false discoveries and may be welcomed in practice for fine-mapping analysis.

Accuracy of the Adaptive DAP Algorithm

In the second numerical experiment, we compare the performance of the adaptive DAP algorithm with the exact Bayesian computation. In particular, we are interested in evaluating the accuracy of the approximation of $Pr ({\vec{γ}}_{l} | {\vec{y}}_{l}, G_{l}, \vec{α})$ and the induced SNP-level PIP values from the adaptive DAP algorithm. The simulation setting mimics multi-SNP fine-mapping analysis at the final stage of our proposed inference procedure.

For the exact Bayesian computation with reasonable computational cost, we have to limit the number of candidate SNPs in a locus. Specifically, in each simulation, we randomly select genotypes of p = 15 neighboring cis-SNPs of a gene from the GEUVADIS dataset. We then uniformly select one to five QTNs and generate the phenotype measure using a multiple linear regression model.

We apply both the adaptive DAP algorithm and the exact Bayesian posterior computation on a total of 1,250 simulated datasets using the identical prior specification. The exact computation evaluates all 2¹⁵ = 32,768 association models for each simulated dataset. We apply the adaptive DAP algorithm by varying the threshold value for selecting high-priority candidate SNPs, λ, from 0.01 to 0.05.

First, we compare the true normalizing constant C with the estimated value C^∗ from the adaptive DAP by computing the ratio C^∗/C in each simulated dataset. Utilizing all SNPs of all the simulated datasets, we also calculate the root-mean-square error (RMSE) to characterize the precision of the PIP approximations. The results indicate that for stringent λ values, the DAP can indeed estimate the normalizing constant with very high accuracy (Table 1 and Figure 2), which ensures the high precision of the estimated PIPs. As the λ threshold is relaxed, the approximation of C becomes less accurate in some cases; nevertheless, we observe that the overall precision level of the approximate PIPs is still reasonably high.

Table 1.

Numerical Comparison of the Exact Calculation and the Adaptive DAP Algorithm at Different Threshold Values in the Second Simulation Study

λ	Mean of C^∗/C	RMSE of Approximate PIP
0.01	0.994	2.36 × 10⁻³
0.02	0.986	5.32 × 10⁻³
0.03	0.963	9.83 × 10⁻³
0.04	0.921	1.40 × 10⁻²
0.05	0.854	2.42 × 10⁻²

Next, we examine the derived stopping rule and the analytic estimation of the approximation error. Overall, we find that the stopping rule and the error approximation work extremely well for these simulations, and we summarize the results in Figure S3.

Using the simulated datasets, we also benchmark the average computational time for each simulation/analysis setting and present the results in Table 2. All runs are performed with 10 parallel threads using the OpenMP library. For the exact calculation, the average time remains constant regardless of the number of true QTNs. The DAP algorithm requires much less computational time than the exact calculation. The general trend of the DAP running time is also clear (albeit with a few small deviations): the running time increases with the number of true QTNs and decreases as the λ value is relaxed.

Table 2.

Benchmark of the Average Computational Time Required for the DAP and Exact Computation

Running Time (s) by Number of True QTLs
Method	1	2	3	4	5
DAP (λ = 0.01)	0.097 (0.234)	0.275 (1.180)	0.733 (3.704)	1.276 (7.140)	2.527 (13.181)
DAP (λ = 0.02)	0.093 (0.268)	0.208 (0.776)	0.663 (3.128)	1.275 (6.816)	2.368 (12.965)
DAP (λ = 0.03)	0.087 (0.238)	0.133 (0.408)	0.252 (1.060)	0.844 (4.644)	1.422 (7.876)
DAP (λ = 0.04)	0.063 (0.116)	0.122 (0.312)	0.230 (0.732)	0.615 (3.064)	0.571 (2.596)
DAP (λ = 0.05)	0.050 (0.072)	0.120 (0.280)	0.139 (0.320)	0.184 (0.448)	0.180 (0.276)
Exact	19.8 (121.4) for any number of true QTLs

The running time is measured in seconds by the UNIX utility program “time.” In each cell, we show the actual running time (“real” time), which is greatly reduced by parallel processing with ten threads; in the parentheses, the “user” time is reported, which objectively reflects the actual computational cost, i.e., this measurement is not reduced by the parallelization.

Power Comparison of the Multi-SNP Analysis Algorithms

In the final simulation study, we compare the performance of the adaptive DAP with other existing algorithms in identifying multiple association signals. Specifically, we directly use the simulated multiple-population eQTL datasets from Wen et al.,¹³ where a genomic locus consisting of 100 relatively independent LD blocks (with 25 neighboring SNPs per block) is artificially assembled using real genotype data from the GEUVADIS project and 1 to 4 QTNs are randomly assigned to different LD blocks per simulation.

In Wen et al.,¹³ we compared three competing approaches: (1) a single SNP analysis method, (2) a conditional analysis method, and (3) a multi-SNP analysis method based on an MCMC algorithm, regarding their abilities to correctly identify the QTN-harboring LD blocks. We run the adaptive DAP algorithm on the simulated datasets and compare the results with the three existing methods. Our results indicate that the adaptive DAP algorithm presents a significant improvement in performance (Figure 3) and a remarkable reduction in computational time compared with the MCMC algorithm (Table S1), and both approaches outperform the single SNP analysis and conditional analysis approaches. In addition, Figure 3 also shows that with prolonged sampling steps, the MCMC outputs seemingly “converge” to the DAP results. We also run a fast version of the adaptive DAP algorithm with tuning parameter λ = 0.05 (Figure S4), and the results indicate that the decrease in performance from the default setting (λ = 0.01) is minimal.

Comparison of DAP and MCMC Algorithms in Simulation Study III

(A) Performance comparisons for multi-SNP QTL mapping. We apply different analytical approaches to a simulated dataset reported in Wen et al.¹³ to evaluate their abilities to identify multiple independent LD blocks harboring true QTLs. The methods compared include a single-SNP analysis approach (navy blue line), a forward selection-based conditional analysis approach, the MCMC algorithm described in Wen et al.,¹³ and the DAP algorithm. Each plotted point represents the number of true positive findings (of LD blocks) versus the false positives obtained by a given method at a specific threshold. The MCMC algorithm and the DAP algorithm are based on the Bayesian hierarchical model and clearly outperform the other two commonly applied approaches. Most importantly, the DAP algorithm presents a significant performance improvement compared with the MCMC in both accuracy and computational efficiency.

(B–E) Comparison of PIP values estimated by adaptive DAP and MCMC with various running lengths. We randomly selected 10 simulated datasets and ran MCMC with 4 different lengths of sampling steps, ranging from 15,000 to 1 million (the results shown in A are based on 75,000 sampling steps for each dataset). With the prolonged MCMC runs, the MCMC outcomes seemingly “converge” to the DAP results.

Re-analysis of the GEUVADIS Data

We re-analyze the cross-population eQTL dataset generated from the GEUVADIS project (Web Resources) via the proposed 3-stage inference procedure. In this re-analysis, we focus on examining two types of genomic annotations that are known to impact the enrichment of eQTNs: the SNP distance to the transcription start site (TSS) of the target gene and annotations assessing the ability of a point mutation to disrupt transcription factor (TF) binding. Following Wen et al.,¹³ we group all SNPs within 100 kb of a gene into 1 kb non-overlapping bins according to their distances from the TSS and use the label of the corresponding bin for each SNP to represent its distance to TSS (DTSS) as a categorical variable. In addition, a SNP is classified as a binding SNP if it is computationally predicted to strongly disrupt TF binding by the CENTIPEDE model using the ENCODE DNaseI data²⁶^,²⁷ (Web Resources). If a SNP is located in a DNaseI footprint region but there is no strong evidence for disrupting TF binding, it is classified as a footprint SNP; otherwise, the SNP is labeled as a baseline SNP. Because of computational constraints, our previous enrichment analysis reported in Wen et al.¹³ was based on a single iteration of the MCMC-within-EM (or EM-MCMC) algorithm (i.e., the E-step is carried out by the MCMC algorithm), as our main goal was enrichment testing. Although the evidence is sufficiently strong for testing purposes, the enrichment parameters were known to be severely underestimated.
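
The annotation coding described above (1 kb distance bins within 100 kb of the TSS, plus a three-level TF binding class) can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the analysis: the function names, the use of signed bin indices, and the half-open window boundaries are assumptions, and strand orientation is ignored.

```python
import math

BIN_WIDTH = 1_000    # 1 kb non-overlapping bins, as described in the text
WINDOW = 100_000     # SNPs within 100 kb of the TSS


def dtss_bin(snp_pos, tss_pos):
    """Return a signed bin label for the SNP distance to TSS, or None if outside the window.

    Assumption: distances in [-WINDOW, WINDOW) are kept, and the bin label
    is floor(distance / BIN_WIDTH), giving labels -100, ..., 99.
    """
    d = snp_pos - tss_pos
    if not -WINDOW <= d < WINDOW:
        return None
    return math.floor(d / BIN_WIDTH)


def annotation_class(in_footprint, disrupts_binding):
    """Three-level TF binding annotation: binding, footprint, or baseline."""
    if disrupts_binding:
        return "binding"
    if in_footprint:
        return "footprint"
    return "baseline"


print(dtss_bin(1_500, 0), dtss_bin(-1, 0), dtss_bin(250_000, 0))
print(annotation_class(in_footprint=True, disrupts_binding=False))
```

Treating the bin label as a categorical variable means each bin receives its own enrichment parameter, so no functional form is imposed on the distance effect.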

We ran the complete DAP-1-embedded EM algorithm to perform the enrichment analysis. The full EM algorithm runs for 25 iterations to meet our convergence criteria, which require an increment ≤0.01 in the log-likelihood between two consecutive iterations (Figure S5). The complete EM run takes 21 min on a Linux box with a single 8-core Intel Xeon 2.13 GHz CPU. In comparison, the MCMC algorithm takes approximately 84 hr of computational time to fully process all 11,838 genes in a single E-step on the same computing system.

After a single iteration, the DAP-1-embedded EM algorithm yields point estimates for the TF binding annotations that are very similar to our previous results reported in Wen et al.¹³ (Table 3). As expected, the final estimates from the complete EM run have very high enrichment values: the binding SNPs have an estimated log odds ratio ${\hat{α}}_{1} = 0.94$ , or fold change of 2.56, with the 95% CI [0.84,1.05], whereas the footprint SNPs have a much lower enrichment estimate (log odds ratio ${\hat{α}}_{1} = 0.53$ or fold change of 1.70, with the 95% CI [0.40,0.67]). Note that the two confidence intervals are non-overlapping. In comparison, our previously reported estimates of the corresponding enrichment parameters are 0.40 (95% CI [0.32,0.49]) and 0.14 (95% CI [0.04,0.24]) for binding and footprint SNPs, respectively.
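
The fold changes quoted above are the exponentiated log odds ratios, which the short Python check below reproduces from the reported point estimates. The helper names are illustrative; the interval-overlap test is a simple descriptive check and not a formal comparison of the two parameters.

```python
import math


def fold_change(log_odds_ratio):
    """Convert an enrichment estimate on the log odds scale to a fold change."""
    return math.exp(log_odds_ratio)


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


binding = {"estimate": 0.94, "ci": (0.84, 1.05)}
footprint = {"estimate": 0.53, "ci": (0.40, 0.67)}

print(f"{fold_change(binding['estimate']):.2f}")    # 2.56
print(f"{fold_change(footprint['estimate']):.2f}")  # 1.70
print(intervals_overlap(binding["ci"], footprint["ci"]))  # False
```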

Table 3.

Comparison of Enrichment Estimates by EM-DAP1 and EM-MCMC after a Single Iteration in Analysis of GEUVADIS Data

Method	Footprint SNPs: α	Footprint SNPs: 95% C.I.	Binding Variants: α	Binding Variants: 95% C.I.
EM-MCMC	0.14	(0.04, 0.24)	0.39	(0.32, 0.49)
EM-DAP1	0.12	(−0.01, 0.25)	0.41	(0.30, 0.51)

The binding SNPs refer to the genetic variants that are computationally predicted to disrupt TF binding, and the footprint SNPs are those simply located in the DNaseI footprint region but not predicted to affect TF binding. The enrichment estimates from both methods are very similar. The MCMC algorithm accounts for multiple independent association signals and yields slightly tighter confidence intervals, as expected. However, the EM-DAP1 is much more computationally efficient: it runs almost one thousand times faster than the EM-MCMC algorithm.

Next, we repeat the multi-SNP fine-mapping analysis using the adaptive DAP algorithm and the new set of the empirical Bayes priors obtained from the enrichment analysis. For most genes, the results (i.e., the number of independent signals for each gene) are qualitatively unchanged compared to the previous MCMC results. Nevertheless, we find that fine-mapping with the adaptive DAP is much more efficient, and the annotated SNPs, especially the binding SNPs, are further prioritized in the new fine-mapping results (Figure S6).

Analysis of the GTEx Data

We analyze the cis-eQTL data from the GTEx project (Web Resources). A distinctive advantage of the GTEx data is that they enable the study of the commonality and specificity of the eQTLs in multiple tissues. Taking advantage of the high computational efficiency of the EM-DAP1 algorithm, we perform the enrichment analysis of the TF binding annotations, derived from the ENCODE data and the CENTIPEDE model, in eQTLs across 44 human tissues while controlling for the SNP distance to TSS. More specifically, for each gene, we consider a 2 Mb cis region centered at the transcription start site. For each tissue, we perform the enrichment analysis using two sets of TF binding annotations, one derived from the ENCODE LCL cell line and the other from the ENCODE liver-related HepG2 cell line²⁷ (Web Resources). This exercise aims to assess the impact of the cell-type-specific annotations on the proposed integrative analysis.

Our results indicate that the binding variants are significantly enriched in eQTLs in all tissues regardless of the origin of the annotations. Furthermore, the point estimates of enrichment levels for binding variants are consistently higher than those for footprint SNPs, except on one occasion (small intestine tissue with LCL-derived annotations) where the two estimates are indistinguishable. Importantly, we find that the enrichment estimates in specific tissues are quantitatively correlated with the origins of the annotations. Figure 4 shows the results of the enrichment level estimates $({\hat{α}}_{1})$ of the binding variants in each tissue using the LCL- and HepG2-derived TF binding annotations. Most interestingly, the LCL-derived annotations yield the highest enrichment estimates in LCLs and whole blood from the GTEx datasets, whereas the liver-related HepG2-derived annotations obtain the highest enrichment estimate in the GTEx liver tissue. Overall, our results suggest that TF binding annotations derived from different tissues must have substantial overlaps; nevertheless, the annotations from the relevant tissues may provide better functional interpretations for expression-altering causal SNPs in a specific tissue.

Enrichment Estimates for Binding Variants in GTEx Tissues

The estimates in (A) are based on the annotations derived from the DNaseI data of the ENCODE LCLs, whereas the estimates in (B) are based on annotations derived from the ENCODE liver-related HepG2 DNaseI data. In each panel, we plot the point estimate of the enrichment parameter and its 95% confidence interval in each tissue. The tissues are ranked in descending order according to the magnitude of the point estimates. All estimates are obtained controlling for the SNP distance from TSS. All estimates are significantly far from 0 (at the 5% level). Interestingly, when the tissue and origin of the annotations match, the point estimates for enrichment are the highest.

We then proceed to identify genes that harbor QTNs (i.e., eGenes) using a Bayesian FDR control procedure that we recently developed.¹⁴ Subsequently, we perform multi-SNP fine-mapping analysis for the identified eGenes incorporating the enrichment estimates using the adaptive DAP algorithm. We present the analysis results for the liver (sample size 97), lung (sample size 278), and whole blood (sample size 338). There are 2,788, 8,605, and 7,937 eGenes that are identified from the liver, lung, and whole blood, respectively. We suspect that the differences in the numbers of eGenes discovered are largely attributable to the sample sizes but are also correlated with the levels of experimental noise in measuring the gene expression in each tissue. For each fine-mapped eGene l in each tissue, we compute the posterior expected number of independent signals using $\sum_{i = 1}^{p} Pr (γ_{l_{i}} | {\vec{y}}_{l}, G_{l}, \hat{\vec{α}})$ and plot the histogram for each tissue in Figure 5. In all three tissues, we identify single eQTL signals for the vast majority of eGenes. Nonetheless, for a non-trivial number of genes, we are able to confidently identify multiple independent signals. Comparing the fine-mapping results among the three tissues, we find that the ability to identify additional independent signals is also seemingly correlated with the sample sizes.
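
The posterior expected number of independent signals for an eGene is simply the sum of its SNP-level PIPs. A minimal Python sketch, using made-up PIP values purely for illustration:

```python
def expected_num_signals(pips):
    """Posterior expected number of association signals: the sum of SNP-level PIPs."""
    return sum(pips)


# Hypothetical PIPs for the candidate SNPs of one eGene (not taken from the GTEx results).
hypothetical_pips = [0.98, 0.01, 0.55, 0.03, 0.02]
print(round(expected_num_signals(hypothetical_pips), 2))
```

A value close to 1 indicates a single signal, whereas values well above 1 indicate that the posterior supports multiple signals in the cis region.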

Posterior Expected Number of *cis*-eQTL Signals per eGene in GTEx Liver, Lung, and Whole Blood Tissues

The top, middle, and bottom panels display the histogram of the posterior expected number of *cis*-eQTLs from all the eGenes in the liver, lung, and blood tissues, respectively. For most genes, we can identify only a single association signal. However, for a non-trivial number of eGenes, multiple independent association signals can be confidently identified by the adaptive DAP algorithm. The sample size is seemingly an important factor related to the ability to identify multiple independent signals in a *cis* region.

We further examine some known individual genes to validate our integrative analysis results. In particular, we examine SORT1 (MIM: 602458), whose function is related to plasma low-density lipoprotein cholesterol (LDL-C [MIM: 613589]) metabolism through modulation of hepatic VLDL secretion. Through GWAS meta-analysis and extensive functional analysis,²⁸ a single SNP, rs12740374, is identified to cause variations in LDL-C. More specifically, the major allele disrupts the binding site of C/EBP transcription factors in human hepatocytes. Our integrative fine-mapping analysis using the GTEx liver data yields a Bayesian 95% credible set, narrowed down to only two potential causal eQTNs for SORT1: rs12740374 (PIP = 0.473) ranks a close second to SNP rs7528419 (PIP = 0.526). Moreover, the direction of the genetic effect for rs7528419 fits the description provided in Musunuru et al.²⁸ The two SNPs in the credible set are in high LD (r² > 0.95), except that the genotypes of rs12740374 in the GTEx samples are not directly genotyped but imputed. Upon further investigation, we find that the binding site reported by Musunuru et al.²⁸ is not captured by the ENCODE DNaseI experiments in HepG2, and hence, rs12740374 is not correctly annotated. We then include the annotation of rs12740374 as a binding SNP based on the functional study of Musunuru et al.²⁸ and re-run the fine-mapping analysis using the adaptive DAP. We find that rs12740374 yields the highest PIP value (PIP = 0.752) among all the candidate SNPs (the PIP for rs7528419 drops to 0.247). The lesson learned here is that the completeness of the genomic annotations may have a profound impact on the integrative analysis, and efforts should be made to generate a more comprehensive set of genomic annotations by both accumulating new experimental data and integrating them with all the existing data.
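
One common way to form a credible set for a locus with a single signal is to rank SNPs by PIP and add them until the cumulative PIP reaches the target level. The Python sketch below applies this construction to the two SORT1 PIPs reported above; the construction itself and the small residual entry standing in for all remaining SNPs are assumptions for illustration, since the exact procedure is not spelled out here.

```python
def credible_set(pips, level=0.95):
    """Greedy credible set: smallest top-ranked SNP set whose PIPs sum to >= level.

    Assumes a single association signal, so that the PIPs sum to at most 1.
    Returns the selected SNP names and their cumulative PIP.
    """
    ranked = sorted(pips.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for snp, pip in ranked:
        chosen.append(snp)
        mass += pip
        if mass >= level:
            break
    return chosen, mass


# PIPs for the two SNPs are from the text; "other_snps" is a hypothetical placeholder.
before = {"rs7528419": 0.526, "rs12740374": 0.473, "other_snps": 0.001}
after = {"rs7528419": 0.247, "rs12740374": 0.752, "other_snps": 0.001}

print(credible_set(before)[0])
print(credible_set(after)[0])
```

Both runs return the same two SNPs; only their ranking changes once rs12740374 is annotated as a binding SNP.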

Discussion

The proposed EM-DAP1 algorithm provides an efficient and flexible framework to perform enrichment analysis with respect to genomic annotations using genetic association data: there is no restriction on the types of annotations (categorical or continuous) or the number of annotations that can be simultaneously investigated. Some of the commonly applied ad hoc enrichment analysis methods in the same context attempt to first classify the binary latent association status Γ for all candidate SNPs based on their single SNP testing results. However, it is worth noting that classification based on hypothesis testing typically imposes very stringent control over type I errors but little control over type II errors (in practice, it may be too tolerant of them), and type II errors are a major source of the overall mis-classification errors for Γ.¹³ As a consequence, most ad hoc procedures of this type provide poor quantification of enrichment levels. Recently, probabilistic model-based enrichment analysis approaches have been proposed based on the “one QTN per locus” assumption and applied to both molecular QTL mapping and GWASs.¹¹ A common feature of these approaches is that they treat each locus as the exchangeable/comparable unit in the analysis: in the simplest case, each locus has the common prior probability, π₁, of harboring causal QTNs. Although the DAP-1 algorithm implicitly also makes the same assumption and enjoys the benefit of fast and efficient computation using only summary statistics, it presents some significant differences/improvements compared to the aforementioned approaches. The DAP-1 algorithm, built on the proposed hierarchical model, considers each SNP as the unit of analysis. This modeling strategy leads to a straightforward EM algorithm for parameter estimation, where the target function in the M-step is convex with well-known optimization solutions.
In comparison, with the parameterization including π₁, the target function in the M-step is no longer guaranteed to be convex, which can cause convergence issues in EM estimation and prevent the simultaneous investigation of many annotations (see the details in Appendix C). Furthermore, the π₁ parameterization essentially assumes that genetic loci consisting of many SNPs are equally likely to harbor causal QTNs as loci consisting of only a few SNPs. From the empirical evidence produced by eQTL analysis, we find that this assumption is probably false: the genes with more cis candidate SNPs are more likely to harbor eQTNs.¹³ In summary, the proposed hierarchical model and the EM-DAP1 algorithm represent better alternatives.

The proposed Bayesian hierarchical model does not explicitly consider potential polygenic background. To evaluate the performance of the proposed enrichment analysis method under an explicit polygenic model, we modify the simulation settings for enrichment analysis by imposing a small yet non-zero genetic effect on every candidate SNP. Under such a setting, γ_i should be interpreted as an indicator of whether the genetic effect of SNP i is significantly larger than the polygenic background. The simulation results (Figure S7) indicate that the estimates of the enrichment parameters are biased toward 0 in the presence of polygenic background, although the bias is negligible when the polygenic effects are small. In future work, we plan to fully account for polygenic background by considering a more appropriate model, such as the Bayesian sparse linear mixed model (BSLMM).²⁹

Our analysis of multi-tissue eQTL data yields many interesting findings that are worthy of in-depth follow-up investigation. In particular, our results suggest that the cell type specificity and the completeness/accuracy of the genomic annotations might have profound impacts on the integrative association analysis in terms of different aspects as follows: the cell type specificity of the annotations affects the global enrichment estimates and the multi-SNP analysis results of every subsequently fine-mapped locus, whereas mis-annotations of certain variants probably impact functional interpretations of specific loci but are not likely to alter the global enrichment estimates as long as the annotations are accurate on average. These findings should motivate efforts to generate a more comprehensive and accurate catalog of genomic annotations to improve the overall quality of genetic association analysis. Furthermore, it should be noted that all the annotations could have additional levels of complexity (e.g., cis regulatory grammar) and still can be consistently analyzed within the same framework by extending our logistic prior model in a straightforward manner to allow interactions. To aid these efforts, our proposed genome-wide scale enrichment analysis has provided a principled way of assessing the tissue/cell type specificity of the genomic annotations.

Acknowledgments

We thank the GTEx consortium and the GEUVADIS RNA sequencing project for releasing valuable data in a timely fashion. This work is supported by NIH grants MH101825, HG007022, and GM109215.

Published: May 26, 2016

Footnotes

Supplemental Data include seven figures and one table and can be found with this article online at http://dx.doi.org/10.1016/j.ajhg.2016.03.029.

Appendix A: Selection of Priority SNPs in Adaptive DAP

We give a detailed account of the Bayesian conditional analysis procedure for selecting high-priority SNPs in the adaptive DAP algorithm. For a given locus l, the procedure starts with model size partition s = 1. Let ${\vec{γ}}^{*}$ denote the model with the highest posterior probability in the size partition s − 1 in locus l, i.e.,

{\vec{γ}}^{*} = \underset{‖ \vec{γ} ‖ = s - 1}{\arg\max} \; Pr ({\vec{γ}}_{l} = \vec{γ}) BF (\vec{γ}) .

For each SNP i that is not included in the current best model, we compute a Bayes factor for the expanded model, ${\vec{γ}}_{i}^{†} = {\vec{γ}}^{*} \cup \{γ_{l_{i}} = 1\}$ . Assuming that there is exactly one additional QTL and that each candidate SNP i is equally likely to be the additional causal association a priori, the corresponding conditional posterior probability for SNP i can be computed by

{PIP}_{i}^{*} = \frac{BF ({\vec{γ}}_{i}^{†}) / BF ({\vec{γ}}^{*})}{\sum_{j} BF ({\vec{γ}}_{j}^{†}) / BF ({\vec{γ}}^{*})} = \frac{BF ({\vec{γ}}_{i}^{†})}{\sum_{j} BF ({\vec{γ}}_{j}^{†})} .

(Equation A1)

The resulting quantity is a well-defined posterior probability and is solely determined by the relative likelihood values of the expanded models. In particular, it should be noted that Equation A1 fully accounts for LD between SNPs: e.g., if two SNPs are in perfect LD, they would possess identical values that correctly reflect the uncertainty (i.e., they are indistinguishable). The procedure requires p − s evaluations of Bayes factors that are computationally trivial for small s values. Given the pre-defined threshold λ, we add the SNP i into the existing set of high-priority SNPs if it is not already in the set and ${PIP}_{i}^{*} \geq λ$ . For s ≥ 2, we then enumerate all s-combinations from the resulting set of priority SNPs to compute $C_{s}^{*}$ . During this enumeration, we also record the new ${\vec{γ}}^{*}$ for the increased model size.
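
The selection step built on Equation A1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the Bayes factors of the expanded models are taken as given inputs with made-up values, the function names are hypothetical, and a practical implementation would work on the log scale to avoid overflow.

```python
def conditional_pips(bf_expanded):
    """Equation A1: normalize the Bayes factors of the expanded models.

    bf_expanded -- dict mapping each candidate SNP (not in the current best
                   model) to the Bayes factor of the best model plus that SNP.
    """
    total = sum(bf_expanded.values())
    return {snp: bf / total for snp, bf in bf_expanded.items()}


def update_priority_set(priority, bf_expanded, lam=0.01):
    """Add every SNP whose conditional posterior probability is >= lam."""
    pips = conditional_pips(bf_expanded)
    return set(priority) | {snp for snp, value in pips.items() if value >= lam}


# Hypothetical Bayes factors; snp1 and snp2 mimic two SNPs in perfect LD.
bf = {"snp1": 900.0, "snp2": 900.0, "snp3": 150.0, "snp4": 1.0}
pips = conditional_pips(bf)
print({snp: round(value, 4) for snp, value in pips.items()})
print(sorted(update_priority_set(set(), bf, lam=0.01)))
```

The two SNPs in perfect LD receive identical conditional probabilities, and only the SNP with negligible support falls below the threshold.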

Intuitively, the threshold parameter λ is related to the precision of the approximate PIPs. The selection procedure roughly estimates the probability, $Pr (γ_{l_{i}} = 1 | \vec{y}, G_{l}, \vec{α}, ‖ {\vec{γ}}_{l} ‖ = s)$ , for SNP i. Note the relationship

Pr (γ_{l_{i}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}) = \sum_{s = 1}^{p} \frac{C_{s}}{C} \cdot Pr (γ_{l_{i}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}, ‖ {\vec{γ}}_{l} ‖ = s) .

The following can be concluded:

1.
If $Pr (γ_{l_{i}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}, ‖ {\vec{γ}}_{l} ‖ = s) < λ$ for a given SNP at all s values, then it must be the case that the overall PIP < λ.
2.
The loss of precision of the PIP of SNP i due to the selection screening for a particular size partition must be <λ.
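
The first conclusion follows in one line from the relationship above, under the assumption (not stated explicitly in the text) that the size-partition contributions $C_{s}$ are non-negative and sum to at most the normalizing constant $C$:

```latex
\Pr(\gamma_{l_i} = 1 \mid \vec{y}_l, G_l, \vec{\alpha})
  = \sum_{s=1}^{p} \frac{C_s}{C}\,
    \Pr(\gamma_{l_i} = 1 \mid \vec{y}_l, G_l, \vec{\alpha}, \|\vec{\gamma}_l\| = s)
  < \lambda \sum_{s=1}^{p} \frac{C_s}{C}
  \le \lambda .
```

In words, the overall PIP is a weighted average of the size-specific probabilities, so it cannot exceed λ if every one of them is below λ.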

Appendix B: Stopping Rule and Estimation of the Approximation Error in Adaptive DAP

When a non-associated SNP is added to an existing association model, the marginal likelihood of the model is typically non-increasing. In fact, the marginal likelihood measured by the corresponding Bayes factor usually decreases slightly due to the effect of Occam’s razor built into the Bayes factor computation.³⁰ We utilize this property to reduce the computation of DAP by eliminating unnecessary explicit explorations of the model partitions once the sizes of the models are considered saturated. To achieve this goal, the DAP starts the exploration at model size partition s = 1 and proceeds through increasing s values until a stopping rule is met. The contribution of the unexplored size partitions (i.e., the approximation error) is then estimated by an analytic combinatorial approximation.

To explain the stopping rule and the combinatorial approximation, we assume that there are K detectable true QTNs. In each model size partition where s > K, we can classify all models into (K + 1) mutually exclusive categories according to the number of true QTNs (0 to K) included in each association model. In the category including exactly m true QTNs, each member association model also includes (s − m) non-associated SNPs, and the total number of the association models in the category is given by $\binom{p - K}{s - m} \binom{K}{m}$ . We estimate the contribution to $\sum_{\vec{γ}} Pr ({\vec{γ}}_{l} = \vec{γ}; ‖ {\vec{γ}}_{l} ‖ = s) BF (\vec{γ})$ from this particular category by the equation

\binom{p - K}{s - m} \binom{K}{m} \tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s) {\bar{BF}}_{m},

where $\tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s)$ represents the average prior value within the category and ${\bar{BF}}_{{m}}$ is the average Bayes factor across models including m out of K detectable QTNs. The use of ${\bar{BF}}_{{m}}$ is mainly based on the assumption that including non-associated SNPs in an association model does not, on average, increase the marginal likelihood/Bayes factor. Hence, we obtain

C_{s} \approx \sum_{m = 0}^{K} \binom{p - K}{s - m} \binom{K}{m} \tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s) {\bar{BF}}_{m} .

To relate C_s+1 to C_s, we note that

\begin{aligned} C_{s + 1} &\approx \sum_{m = 0}^{K} \binom{p - K}{s + 1 - m} \binom{K}{m} \tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s + 1) {\bar{BF}}_{m} \\ &= \sum_{m = 0}^{K} \frac{p - K + m - s}{s + 1 - m} \binom{p - K}{s - m} \binom{K}{m} \tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s + 1) {\bar{BF}}_{m} \\ &\leq \frac{p - s}{s + 1 - K} \sum_{m = 0}^{K} \left[ \binom{p - K}{s - m} \binom{K}{m} \tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s) {\bar{BF}}_{m} \right] \frac{\tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s + 1)}{\tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s)} \\ &\approx \frac{p - s}{s - K + 1} ω C_{s} . \end{aligned}

(Equation B1)

In the last step, we approximate the quantities $\tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s + 1) / \tilde{Pr} ({\vec{γ}}_{l}; ‖ {\vec{γ}}_{l} ‖ = s)$ in all K + 1 categories by the average prior odds $ω = (1 / p) \sum_{i = 1}^{p} exp (α_{0} + \sum_{l = 1}^{q} α_{l} d_{i l})$ . Similarly, we can derive an approximate lower bound for C_s+1

\frac{p - s - K}{s + 1} ω C_{s} .

(Equation B2)

Thus, we have shown

\frac{p - s}{s - K + 1} ω C_{s} ≳ C_{s + 1} ≳ \frac{p - s - K}{s + 1} ω C_{s} .

(Equation B3)

Because K is unknown, we estimate C_s+1 from C_s by the following approximation

C_{s + 1} \approx \frac{p - s}{s + 1} ω C_{s},

(Equation B4)

which does not depend on K and lies in the interval

(\frac{p - s - K}{s + 1} ω C_{s}, \frac{p - s}{s - K + 1} ω C_{s}) .

Our numerical experiment shows that this approximation is surprisingly accurate (Figure S3).
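
To make the bounds concrete, the Python sketch below computes the average prior odds ω and, for a given C_s, the lower bound (Equation B2), the K-free approximation (Equation B4), and the upper bound (Equation B1). The function names and all numerical inputs are hypothetical and chosen only to show that the approximation falls inside the interval.

```python
import math


def average_prior_odds(alpha0, alpha, d):
    """omega = (1/p) * sum_i exp(alpha0 + sum_l alpha_l * d_il)."""
    p = len(d)
    return sum(
        math.exp(alpha0 + sum(a * x for a, x in zip(alpha, row))) for row in d
    ) / p


def next_partition_bounds(c_s, p, s, K, omega):
    """Return (lower, approx, upper) for C_{s+1} given C_s; requires s >= K."""
    if s < K:
        raise ValueError("the bounds are derived for saturated partitions (s >= K)")
    lower = (p - s - K) / (s + 1) * omega * c_s    # Equation B2
    approx = (p - s) / (s + 1) * omega * c_s       # Equation B4
    upper = (p - s) / (s - K + 1) * omega * c_s    # Equation B1
    return lower, approx, upper


# Hypothetical locus: p = 15 SNPs, one binary annotation, K = 2 detectable QTNs.
d = [[1]] * 3 + [[0]] * 12
omega = average_prior_odds(alpha0=-4.0, alpha=[0.94], d=d)
lower, approx, upper = next_partition_bounds(c_s=1.0, p=15, s=3, K=2, omega=omega)
print(lower <= approx <= upper)
```

The approximation drops K from both the numerator of the lower bound and the denominator of the upper bound, which is why it can be evaluated without knowing the number of true QTNs.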

Our stopping rule is built upon the upper bound specified by the inequality (Equation B3). Specifically, the adaptive DAP stops explicit exploration at partition size s = t if

C_{t}^{*} \leq (p - t + 1) ω C_{t - 1}^{*} .

(Equation B5)

The inequality essentially tests K ≤ t − 1, i.e., whether the models of size t − 1 are already saturated. In addition to utilizing the combinatorial approximation, the DAP further monitors the increment of the partial sum $S_{k} = \sum_{i}^{k} C_{i}^{*}$ . To ensure a high accuracy of the approximation, we also add an optional criterion to the stopping rule on top of Equation B5, i.e.,

{log}_{10} [\frac{S_{t}}{S_{t - 1}}] < κ, κ > 0,

or, equivalently,

\frac{C_{t}^{*}}{\sum_{i}^{t - 1} C_{i}^{*}} < 10^{κ} - 1 .

By default, we set $κ = 0.01$ , which further ensures that the subsequent model size partitions make no substantial contributions to the normalizing constant. This additional criterion provides practical flexibility for running the DAP: as $κ \to 0$ , it forces the DAP to explore all the model size partitions, whereas when κ is large, only the stopping rule (Equation B5) is effective.
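
The combined stopping rule can be sketched in Python as below. The sketch assumes that the explored contributions are stored in a list whose first element corresponds to size partition s = 1 and that the partial sum runs over the explored partitions only; these conventions, the function name, and the example values are illustrative assumptions.

```python
import math


def should_stop(c_star, p, omega, kappa=0.01):
    """Check Equation B5 and the optional partial-sum criterion at size t = len(c_star).

    c_star -- explored contributions [C*_1, ..., C*_t] (assumed layout)
    """
    t = len(c_star)
    if t < 2:
        return False                       # need C*_{t-1} to apply the rule
    c_t, c_prev = c_star[-1], c_star[-2]
    saturated = c_t <= (p - t + 1) * omega * c_prev          # Equation B5
    s_prev = sum(c_star[:-1])
    small_increment = math.log10((s_prev + c_t) / s_prev) < kappa
    return saturated and small_increment


p, omega = 15, 0.02
print(should_stop([100.0, 400.0], p, omega))        # size-2 models still add a lot
print(should_stop([100.0, 400.0, 50.0], p, omega))  # B5 holds, increment too large
print(should_stop([100.0, 400.0, 5.0], p, omega))   # both criteria met
```

With κ = 0.01, the second criterion requires the newest partition to contribute less than 10^0.01 − 1, or about 2.3%, of the accumulated sum.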

Once the stopping rule is invoked, we estimate $\epsilon$ by

\epsilon = \sum_{s = t + 1}^{p} R_{s}^{*},

where we define $R_{t}^{*} = C_{t}^{*}$ and

R_{s + 1}^{*} = \frac{p - s}{s + 1} ω R_{s}^{*}, \text{ for } s = t, \dots, p .
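
The estimate of the approximation error (the contribution of the unexplored size partitions) amounts to running the recursion forward from the last explored partition and summing the terms. A minimal Python sketch, with a hypothetical function name and toy inputs:

```python
def tail_error(c_t, t, p, omega):
    """Sum R*_s for s = t+1, ..., p, where R*_t = C*_t and
    R*_{s+1} = (p - s) / (s + 1) * omega * R*_s."""
    eps = 0.0
    r = c_t                       # R*_t
    for s in range(t, p):         # produces R*_{t+1}, ..., R*_p
        r *= (p - s) / (s + 1) * omega
        eps += r
    return eps


# Toy check: p = 5, stop at t = 3 with C*_3 = 1 and omega = 0.1.
# R*_4 = (2/4) * 0.1 = 0.05 and R*_5 = (1/5) * 0.1 * 0.05 = 0.001.
print(round(tail_error(1.0, 3, 5, 0.1), 6))  # 0.051
```

The factor (p − s) vanishes at s = p, so no term beyond R*_p contributes to the sum.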

Appendix C: Derivation of the DAP-1 Algorithm

In this section, we provide a detailed derivation for the DAP-1 algorithm. It should be noted that the derivation can be generalized to the DAP-K algorithm with K > 1.

The key assumption of the DAP-1 is that posterior probabilities of single-QTL association models dominate the posterior probability space of ${\vec{γ}}$ , i.e.,

C - \sum_{‖ \vec{γ} ‖ \leq 1} Pr ({\vec{γ}}_{l} = \vec{γ}) BF (\vec{γ}) \to 0 .

(Equation C1)

Consequently, it follows that

Pr ({\vec{γ}}_{l} = \vec{γ} | {\vec{y}}_{l}, G_{l}, \vec{α}) \approx \begin{cases} \frac{Pr ({\vec{γ}}_{l} = \vec{γ} | \vec{α}) BF (\vec{γ})}{\sum_{‖ {\vec{γ}}^{'} ‖ \leq 1} Pr ({\vec{γ}}_{l} = {\vec{γ}}^{'}) BF ({\vec{γ}}^{'})} & \text{if } ‖ \vec{γ} ‖ \leq 1 \\ 0 & \text{otherwise} . \end{cases}

The model space $\{\vec{γ} : ‖ \vec{γ} ‖ \leq 1\}$ contains only the null model, $\vec{γ} = 0$ , and all single-SNP association models. For the null model, it is clear that $BF (\vec{γ} = 0) = 1$ , and we denote

π_{0} := Pr (\vec{γ} = 0 | \vec{α}) = \prod_{i = 1}^{p} {(1 + exp ({\vec{α}}^{'} {\vec{d}}_{i}))}^{- 1} .

We use ${\vec{γ}}_{j}^{\circ}$ to denote the single-SNP association model where the j^th SNP is the assumed QTN. Clearly,

Pr ({\vec{γ}}_{j}^{\circ} | \vec{α}) = exp ({\vec{α}}^{'} {\vec{d}}_{j}) \prod_{i = 1}^{p} {(1 + exp ({\vec{α}}^{'} {\vec{d}}_{i}))}^{- 1} = π_{0} \cdot exp ({\vec{α}}^{'} {\vec{d}}_{j}),

and

BF ({\vec{γ}}_{j}^{\circ}) = {BF}_{j} .

We recall that BF_j denotes the Bayes factor based on the single-SNP analysis of SNP j. The computation of BF_j has been detailed by many authors.¹⁷, ³¹, ³² It typically requires only summary-level statistics, e.g., the estimated genetic effect of the target SNP and its standard error,³¹, ³² and it is computationally trivial.

Finally, we note that given the restricted model space, the PIP of SNP j, $Pr (γ_{j} = 1 | \vec{y}, G, \vec{α})$ , coincides with the posterior probability of the corresponding single-SNP model, $Pr ({\vec{γ}}_{j}^{\circ} | \vec{y}, G, \vec{α})$ . Given all of the above, it follows from simple algebra that

\begin{aligned} Pr (γ_{i} = 1 | \vec{y}, G, \vec{α}) &= \frac{\sum_{k = 1}^{p} e^{α_{0} + \sum_{l = 1}^{q} α_{l} d_{k l}} {BF}_{k}}{1 + \sum_{k = 1}^{p} e^{α_{0} + \sum_{l = 1}^{q} α_{l} d_{k l}} {BF}_{k}} \cdot \frac{e^{\sum_{l = 1}^{q} α_{l} d_{i l}} {BF}_{i}}{\sum_{k = 1}^{p} e^{\sum_{l = 1}^{q} α_{l} d_{k l}} {BF}_{k}} \\ &= [1 - Pr ({\vec{γ}}_{l} = 0 | \vec{y}, G, \vec{α})] \cdot \frac{e^{\sum_{l = 1}^{q} α_{l} d_{i l}} {BF}_{i}}{\sum_{k = 1}^{p} e^{\sum_{l = 1}^{q} α_{l} d_{k l}} {BF}_{k}}, \end{aligned}

(Equation C2)

where the first term is the probability that the p-SNP locus contains a QTL and the second term is the conditional probability that the i^th SNP is the sole QTL. Equation C2 closely resembles the expressions in previously proposed Bayesian approaches,⁹, ¹¹, ²³ which also impose the “single QTL per locus” assumption. However, all of those approaches formulate it as a prior assumption, which results in a very different parametrization. More specifically, they use a locus-level quantity, π₀, to denote the prior probability that a locus does not contain a QTL. Conditional on the locus containing a QTL, the prior probability that SNP i is the causal SNP is assigned as

Pr (γ_{i} = 1 | {\vec{γ}}_{l} \neq 0, \vec{δ}) = \frac{e^{\sum_{l = 1}^{q} δ_{l} d_{i l}}}{\sum_{k = 1}^{p} e^{\sum_{l = 1}^{q} δ_{l} d_{k l}}},

(Equation C3)

where the parameter $\vec{δ}$ is similar to our enrichment parameter. As a result, this parametrization yields a similar expression for the PIP of SNP i,

Pr (γ_{i} = 1 | \vec{y}, G_{l}, π_{0}, \vec{δ}) = [1 - Pr ({\vec{γ}}_{l} = 0 | \vec{y}, G_{l}, π_{0})] \cdot \frac{e^{\sum_{l = 1}^{q} δ_{l} d_{i l}} {BF}_{i}}{\sum_{k = 1}^{p} e^{\sum_{l = 1}^{q} δ_{l} d_{k l}} {BF}_{k}} .

(Equation C4)

Despite the algebraic similarity, the parameters (π₀ and $\vec{δ}$ ) in Equation C4 cannot be directly interpreted as $\vec{α}$ in our logistic priors, partly because of the conditional nature of the prior specification (Equation C3). Furthermore, in enrichment analysis, the M-step of the EM algorithm becomes much more involved because the objective function must be optimized jointly with respect to $(π_{0}, \vec{δ})$ . In comparison, we have shown that under the parametrization of DAP-1, the maximization in the M-step is equivalent to fitting a logistic regression model, for which the solutions are well known.
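The DAP-1 computation derived in this appendix can be sketched in a few lines. The function name `dap1_pips` and the array layout are assumptions made for illustration: `d` is taken to carry a leading column of ones so that `alpha[0]` plays the role of α₀. The sketch works on the raw scale and would need a log-scale implementation for very large Bayes factors.

```python
import numpy as np

def dap1_pips(bf, d, alpha):
    """PIPs under the single-QTN restriction (Equation C2).

    bf    : (p,) single-SNP Bayes factors BF_k
    d     : (p, q + 1) annotations, first column all ones (assumed layout)
    alpha : (q + 1,) enrichment parameters, alpha[0] is the intercept
    """
    prior_odds = np.exp(d @ alpha)      # exp(alpha' d_k) for each SNP
    weights = prior_odds * bf           # weight of each single-SNP model, relative to the null
    return weights / (1.0 + weights.sum())

# Hypothetical four-SNP locus with one binary annotation.
bf = np.array([1.0, 50.0, 2.0, 400.0])
d = np.array([[1, 0], [1, 1], [1, 0], [1, 1]], dtype=float)
alpha = np.array([-4.0, 1.0])
pips = dap1_pips(bf, d, alpha)
```

Because the null model has relative weight 1 once π₀ is factored out, the PIPs sum to the posterior probability that the locus contains a QTN, which is the first factor of Equation C2.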

Appendix D: Factorization of the Posterior Probability by LD Blocks

For integrative association analysis of loci spanning very large genomic regions, especially in GWAS settings, we recommend an additional approximate factorization, $Pr (\vec{γ} | \vec{y}, G, \vec{α}) \approx \prod_{k = 1}^{L} Pr ({\vec{γ}}_{[k]} | \vec{y}, G, \vec{α})$ , before applying the DAP to each genomic region independently. We provide the necessary mathematical justification for this factorization.

It is sufficient to show that

Pr (\vec{γ} | \vec{α}) BF (\vec{γ}) \approx \prod_{k = 1}^{L} Pr ({\vec{γ}}_{[k]} | \vec{α}) \cdot \prod_{k = 1}^{L} BF ({\vec{γ}}_{[k]}) .

Recall that $\{{\vec{γ}}_{[k]} : k = 1, \dots, L\}$ are non-overlapping segments of the vector $\vec{γ}$ . Because the prior probabilities are assumed to be independent across SNPs, it follows trivially that $Pr (\vec{γ} | \vec{α}) = \prod_{k = 1}^{L} Pr ({\vec{γ}}_{[k]} | \vec{α})$ .

To show that $BF (\vec{γ}) \approx \prod_{k = 1}^{L} BF ({\vec{γ}}_{[k]}),$ we note the previous result on the Bayes factors,¹⁸

BF (\vec{γ}) = \int P (\vec{β} | \vec{γ}) BF (\vec{β}) d \vec{β},

where the probability $P (\vec{β} | \vec{γ})$ defines the prior effect size given association status $\vec{γ}$ . Furthermore, note the independent relationship of the prior effect sizes across SNPs,

P (\vec{β} | \vec{γ}) = \prod_{i = 1}^{p} P (β_{i} | γ_{i}) .

If γ_i = 1, β_i is assigned a normal prior, whereas if γ_i = 0, β_i = 0 with probability 1 (equivalently, β_i follows a degenerate normal distribution, $β_{i} \sim N (0,0)$ ). We therefore write

\vec{β} | \vec{γ} \sim N (0, W),

where W is a diagonal prior variance-covariance matrix, and for $\vec{γ} \neq 1$ , W is singular.

Without loss of generality, we assume that both the phenotype vector $\vec{y}$ and the genotype vectors ${\vec{g}}_{1}, \dots, {\vec{g}}_{p}$ are centered, so that the intercept term in the association model is exactly 0. Furthermore, we also assume that the residual error variance parameter τ is known. It then follows from the result of Wen¹⁸ that

BF (\vec{β}; W) = {| I + τ G^{'} GW |}^{- \frac{1}{2}} \cdot exp (\frac{1}{2} {\vec{y}}^{'} G [W {(I + τ G' GW)}^{- 1}] G^{'} \vec{y}) .

(Equation D1)

This expression provides the theoretical basis for the factorization. In particular, the p × p sample covariance matrix $(1 / n) G^{'} G$ is a well-known estimate of Var(G). In other words, $G^{'} G$ can be viewed as a noisy observation of nVar(G). Using population genetic theory, Wen and Stephens²⁴ show that Var(G) is extremely banded. Based on this result, Berisa and Pickrell²⁵ recently provided an algorithm to segment the genome into L non-overlapping loci utilizing the population parameter of the recombination rate, i.e.,

G = (G_{[1]}, \dots, G_{[L]}),

and we approximate $G^{'} G$ by a block diagonal matrix

\hat{G^{'} G} = G_{[1]}^{'} G_{[1]} \oplus \dots \oplus G_{[L]}^{'} G_{[L]},

(Equation D2)

where “ $\oplus$ ” denotes the direct sum of the matrices. It is important to note that Equation D2 should be viewed as a de-noised version of $G^{'} G$ with non-zero entries outside the LD blocks shrunk to exactly 0. By plugging Equation D2 into Equation D1, it follows that

BF (\vec{β}; W) = \prod_{k = 1}^{L} {BF}_{[k]},

(Equation D3)

where

{BF}_{[k]} = {| I + τ G_{[k]}^{'} G_{[k]} W_{[k]} |}^{- \frac{1}{2}} \cdot exp (\frac{1}{2} {\vec{y}}^{'} G_{[k]} [W_{[k]} {(I + τ G_{[k]}^{'} G_{[k]} W_{[k]})}^{- 1}] G_{[k]}^{'} \vec{y}) .

(Equation D4)

In particular, $(W_{[1]}, \dots, W_{[L]})$ is a decomposition of the diagonal matrix W compatible with the decomposition of G.

Finally, we integrate out the residual error variance parameter τ for each BF_[k] by applying the Laplace approximation.¹⁸ This step amounts to plugging a point estimate of τ (e.g., based on $\vec{y}$ and G_[k] for each block k) into Equation D4. Taken together, we have shown that

BF (\vec{γ}) \approx \prod_{k = 1}^{L} \int P ({\vec{β}}_{[k]} | {\vec{γ}}_{[k]}) {BF}_{[k]} d {\vec{β}}_{[k]},

and consequently,

Pr (\vec{γ} | \vec{y}, G, \vec{α}) \approx \prod_{k = 1}^{L} Pr ({\vec{γ}}_{[k]} | \vec{y}, G, \vec{α}) .
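The step from Equation D1 to Equations D3 and D4 can be checked numerically. The sketch below uses simulated genotypes, a hypothetical two-block partition, and arbitrary values of W and τ; it evaluates the logarithm of Equation D1 as written, once with the block-diagonal matrix of Equation D2 and once block by block. The helper name `log_bf` and all numerical settings are assumptions made for illustration.

```python
import numpy as np

def log_bf(Gty, A, w, tau):
    """log of Equation D1 given G'y, a matrix A standing in for G'G,
    the diagonal w of W, and the known parameter tau."""
    M = np.eye(len(w)) + tau * A * w          # I + tau * A * W (W is diagonal)
    _, logdet = np.linalg.slogdet(M)
    return -0.5 * logdet + 0.5 * Gty @ (w * np.linalg.solve(M, Gty))

rng = np.random.default_rng(0)
n = 100
blocks = [slice(0, 3), slice(3, 5)]           # hypothetical LD blocks
G = rng.standard_normal((n, 5))
G -= G.mean(axis=0)                           # centered genotypes
y = rng.standard_normal(n)
y -= y.mean()                                 # centered phenotype
w = np.array([0.5, 0.0, 0.5, 0.5, 0.0])       # zero entries correspond to gamma_i = 0
tau = 1.0

# Equation D2: keep only the within-block entries of G'G.
A_hat = np.zeros((5, 5))
for b in blocks:
    A_hat[b, b] = G[:, b].T @ G[:, b]

Gty = G.T @ y
log_bf_joint = log_bf(Gty, A_hat, w, tau)
log_bf_blocks = sum(log_bf(Gty[b], A_hat[b, b], w[b], tau) for b in blocks)
```

With the block-diagonal substitution the two quantities agree up to floating-point error; the approximation error of the factorization comes entirely from replacing $G^{'} G$ by Equation D2.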

Appendix E: Average Accuracy of PIP Estimates using DAP-1

In this section, we provide some mathematical arguments to justify that the DAP-1 algorithm (or the adaptive DAP with less stringent threshold values) provides PIP estimates that are accurate on average. Specifically, we write the expression for the exact calculation of the PIP for SNP k at locus l as

Pr (γ_{l_{k}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}) = \sum_{s = 1}^{p} \frac{C_{s}}{C} \cdot Pr (γ_{l_{k}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}, ‖ {\vec{γ}}_{l} ‖ = s) .

(Equation E1)

In the case of DAP-1, we essentially use the following expression to approximate the PIP,

Pr (γ_{l_{k}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}) \approx \frac{C_{1}}{C_{0} + C_{1}} \cdot Pr (γ_{l_{k}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}, ‖ {\vec{γ}}_{l} ‖ = 1) .

(Equation E2)

Note that in genetic association analysis, the vast majority of SNPs have overall PIPs $\to 0$ within any given locus; hence, it must be the case that for such a SNP k,

Pr (γ_{l_{k}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}, ‖ {\vec{γ}}_{l} ‖ = s) \to 0, for all s .

Therefore, even if C₁ + C₀ approximates C poorly, Equation E2 still provides an adequately accurate PIP estimate for the majority of SNPs that are not QTNs. The same argument also applies to candidate QTNs with very strong evidence for association, especially when the “primary” association signals are orders of magnitude stronger than those of the remaining candidate SNPs within a locus (e.g., $Pr (γ_{l_{k}} = 1 | {\vec{y}}_{l}, G_{l}, \vec{α}, ‖ {\vec{γ}}_{l} ‖ = s) \to 1$ for all s). Therefore, the only SNPs whose PIPs are poorly approximated by DAP-1 are the secondary QTL signals (if there are any), and in most practical cases such SNPs are few in number.
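The argument above can be illustrated on a locus small enough to enumerate every model. The sketch below simulates a hypothetical six-SNP locus with a single strong QTN, computes exact PIPs by summing over all 2^p models, recomposes them through Equation E1, and compares them with the DAP-1 approximation of Equation E2. The Bayes factor is the expression of Equation D1 as written with τ = 1, and the prior odds and effect-size variance are arbitrary assumed values; none of this reproduces the simulations reported in the paper.

```python
import itertools
import numpy as np

def log_bf(G, y, w, tau=1.0):
    """log of Equation D1; w is the diagonal of W."""
    Gty = G.T @ y
    M = np.eye(G.shape[1]) + tau * (G.T @ G) * w
    _, logdet = np.linalg.slogdet(M)
    return -0.5 * logdet + 0.5 * Gty @ (w * np.linalg.solve(M, Gty))

rng = np.random.default_rng(1)
n, p = 200, 6
G = rng.standard_normal((n, p))
G -= G.mean(axis=0)
y = 1.0 * G[:, 1] + rng.standard_normal(n)   # assumed truth: SNP index 1 is the only QTN
y -= y.mean()

prior_odds = np.exp(-3.0)   # exp(alpha_0), no annotations (assumed value)
phi2 = 0.25                 # prior effect-size variance (assumed value)

C = np.zeros(p + 1)         # C_s: total weight of models with s SNPs
C_k = np.zeros((p + 1, p))  # weight of size-s models that include SNP k
for gamma in itertools.product([0, 1], repeat=p):
    gamma = np.array(gamma)
    s = gamma.sum()
    # prior is pi_0 * prior_odds^s; pi_0 cancels in every ratio below
    weight = prior_odds ** s * np.exp(log_bf(G, y, phi2 * gamma))
    C[s] += weight
    C_k[s] += weight * gamma

exact_pip = C_k.sum(axis=0) / C.sum()
cond = C_k[1:] / C[1:, None]                         # Pr(gamma_k = 1 | model size s), s = 1..p
e1_pip = (C[1:, None] / C.sum() * cond).sum(axis=0)  # Equation E1
dap1_pip = C[1] / (C[0] + C[1]) * cond[0]            # Equation E2
```

Printing `exact_pip` next to `dap1_pip` shows how the two compare for the simulated QTN and for the remaining SNPs in this particular setting.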

Web Resources

DAP software and tutorial, http://github.com/xqwen/dap/
Geuvadis Project, http://www.geuvadis.org/web/geuvadis/rnaseq-project
GTEx Portal, http://www.gtexportal.org/home/
Re-analyzed multi-SNP fine-mapping results of the GEUVADIS data, http://www-personal.umich.edu/~xwen/geuvadis/new_fm_rst/
Transcription factor binding site annotations by the extended CENTIPEDE model, http://genome.grid.wayne.edu/centisnps/

Supplemental Data

Document S1. Figures S1–S7 and Table S1

mmc1.pdf (154.2 KB, pdf)

Document S2. Article plus Supplemental Data

mmc2.pdf (1.1 MB, pdf)

References

1. Ardlie K.G., Deluca D.S., Segrè A.V., Sullivan T.J., Young T.R., Gelfand E.T., Trowbridge C.A., Maller J.B., Tukiainen T., Lek M., GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–660. doi: 10.1126/science.1262110.
2. Lappalainen T., Sammeth M., Friedländer M.R., ’t Hoen P.A., Monlong J., Rivas M.A., Gonzàlez-Porta M., Kurbatova N., Griebel T., Ferreira P.G., Geuvadis Consortium. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–511. doi: 10.1038/nature12531.
3. Degner J.F., Pai A.A., Pique-Regi R., Veyrieras J.-B., Gaffney D.J., Pickrell J.K., De Leon S., Michelini K., Lewellen N., Crawford G.E. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature. 2012;482:390–394. doi: 10.1038/nature10808.
4. McVicker G., van de Geijn B., Degner J.F., Cain C.E., Banovich N.E., Raj A., Lewellen N., Myrthil M., Gilad Y., Pritchard J.K. Identification of genetic variants that affect histone modifications in human cells. Science. 2013;342:747–749. doi: 10.1126/science.1242429.
5. Banovich N.E., Lan X., McVicker G., van de Geijn B., Degner J.F., Blischak J.D., Roux J., Pritchard J.K., Gilad Y. Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels. PLoS Genet. 2014;10:e1004663. doi: 10.1371/journal.pgen.1004663.
6. Albert F.W., Kruglyak L. The role of regulatory variation in complex traits and disease. Nat. Rev. Genet. 2015;16:197–212. doi: 10.1038/nrg3891.
7. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
8. Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M.J., Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
9. Veyrieras J.-B., Kudaravalli S., Kim S.Y., Dermitzakis E.T., Gilad Y., Stephens M., Pritchard J.K. High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 2008;4:e1000214. doi: 10.1371/journal.pgen.1000214.
10. Gaffney D.J., Veyrieras J.-B., Degner J.F., Pique-Regi R., Pai A.A., Crawford G.E., Stephens M., Gilad Y., Pritchard J.K. Dissecting the regulatory architecture of gene expression QTLs. Genome Biol. 2012;13:R7. doi: 10.1186/gb-2012-13-1-r7.
11. Pickrell J.K. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am. J. Hum. Genet. 2014;94:559–573. doi: 10.1016/j.ajhg.2014.03.004.
12. Kichaev G., Yang W.-Y., Lindstrom S., Hormozdiari F., Eskin E., Price A.L., Kraft P., Pasaniuc B. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet. 2014;10:e1004722. doi: 10.1371/journal.pgen.1004722.
13. Wen X., Luca F., Pique-Regi R. Cross-population joint analysis of eQTLs: fine mapping and functional annotation. PLoS Genet. 2015;11:e1005176. doi: 10.1371/journal.pgen.1005176.
14. Wen X. Effective qtl discovery incorporating genomic annotations. bioRxiv. 2015.
15. Maller J.B., McVean G., Byrnes J., Vukcevic D., Palin K., Su Z., Howson J.M., Auton A., Myers S., Morris A., Wellcome Trust Case Control Consortium. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat. Genet. 2012;44:1294–1301. doi: 10.1038/ng.2435.
16. Guan Y., Stephens M. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat. 2011;5:1780–1815.
17. Servin B., Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007;3:e114. doi: 10.1371/journal.pgen.0030114.
18. Wen X. Bayesian model selection in complex linear systems, as illustrated in genetic association studies. Biometrics. 2014;70:73–83. doi: 10.1111/biom.12112.
19. Wilson M.A., Iversen E.S., Clyde M.A., Schmidler S.C., Schildkraut J.M. Bayesian model search and multilevel inference for snp association studies. Ann. Appl. Stat. 2010;4:1342–1364. doi: 10.1214/09-aoas322.
20. Patwardhan R.P., Lee C., Litvin O., Young D.L., Pe’er D., Shendure J. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat. Biotechnol. 2009;27:1173–1175. doi: 10.1038/nbt.1589.
21. Findlay G.M., Boyle E.A., Hause R.J., Klein J.C., Shendure J. Saturation editing of genomic regions by multiplex homology-directed repair. Nature. 2014;513:120–123. doi: 10.1038/nature13695.
22. Savic D., Park S.Y., Bailey K.A., Bell G.I., Nobrega M.A. In vitro scan for enhancers at the TCF7L2 locus. Diabetologia. 2013;56:121–125. doi: 10.1007/s00125-012-2730-y.
23. Flutre T., Wen X., Pritchard J., Stephens M. A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet. 2013;9:e1003486. doi: 10.1371/journal.pgen.1003486.
24. Wen X., Stephens M. Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann. Appl. Stat. 2010;4:1158–1182. doi: 10.1214/10-aoas338.
25. Berisa T., Pickrell J.K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32:283–285. doi: 10.1093/bioinformatics/btv546.
26. Pique-Regi R., Degner J.F., Pai A.A., Gaffney D.J., Gilad Y., Pritchard J.K. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 2011;21:447–455. doi: 10.1101/gr.112623.110.
27. Moyerbrailean G.A., Kalita C.A., Harvey C.T., Wen X., Luca F., Pique-Regi R. Which genetics variants in dnase-seq footprints are more likely to alter binding? PLoS Genet. 2016;12:e1005875. doi: 10.1371/journal.pgen.1005875.
28. Musunuru K., Strong A., Frank-Kamenetsky M., Lee N.E., Ahfeldt T., Sachs K.V., Li X., Li H., Kuperwasser N., Ruda V.M. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature. 2010;466:714–719. doi: 10.1038/nature09266.
29. Zhou X., Carbonetto P., Stephens M. Polygenic modeling with bayesian sparse linear mixed models. PLoS Genet. 2013;9:e1003264. doi: 10.1371/journal.pgen.1003264.
30. Berger J.O., Pericchi L.R. The intrinsic bayes factor for model selection and prediction. J. Am. Stat. Assoc. 1996;91:109–122.
31. Wakefield J. Bayes factors for genome-wide association studies: comparison with P-values. Genet. Epidemiol. 2009;33:79–86. doi: 10.1002/gepi.20359.
32. Wen X., Stephens M. Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene–environment interactions. Ann. Appl. Stat. 2014;8:176–203. doi: 10.1214/13-AOAS695.
