Biostatistics (Oxford, England). 2015 Mar 21;16(4):701–712. doi: 10.1093/biostatistics/kxv009

Bayesian model comparison in genetic association analysis: linear mixed modeling and SNP set testing

Xiaoquan Wen
PMCID: PMC4570575  PMID: 25796429

Abstract

We consider the problems of hypothesis testing and model comparison under a flexible Bayesian linear regression model whose formulation is closely connected with the linear mixed effect model and the parametric models for Single Nucleotide Polymorphism (SNP) set analysis in genetic association studies. We derive a class of analytic approximate Bayes factors and illustrate their connections with a variety of frequentist test statistics, including the Wald statistic and the variance component score statistic. Taking advantage of Bayesian model averaging and hierarchical modeling, we demonstrate some distinct advantages and flexibilities in the approaches utilizing the derived Bayes factors in the context of genetic association studies. We demonstrate our proposed methods using real or simulated numerical examples in applications of single SNP association testing, multi-locus fine-mapping and SNP set association testing.

Keywords: Bayes factor, Genetic association, Linear mixed model, Model comparison, SNP set analysis

1. Introduction

In the past decades, genetic association studies have taken a prominent position in uncovering the role of genetic variants in disease etiology. Most recently, two related statistical approaches have become especially important in the analysis of genetic association data: the use of linear mixed models (LMMs) to control for confounding factors and account for polygenic effects, and the application of Single Nucleotide Polymorphism (SNP) set analysis for regions of (rare) genetic variants. As demonstrated by many authors (Kang and others, 2010; Segura and others, 2012; Zhou and Stephens, 2012; Zhou and others, 2013), LMMs effectively thwart the identification of false positive associations caused by relatedness or population structures (e.g. cryptic relatedness) in the samples while at the same time increasing the power of detecting genuine genetic association signals. SNP set testing (Madsen and Browning, 2009; Wu and others, 2011; Lee and others, 2012) is emerging as a method of choice in detecting associations of rare genetic variants, which may be critical in explaining the phenomenon of “missing heritability”. Recent studies have also shown the necessity of jointly applying both approaches when analyzing the genetic association of rare variants to control for population stratification or using pedigree data.

Currently, the majority of the methodological work employing LMM and/or SNP set analysis in genetic association studies has focused on reporting p-values for hypothesis testing. In this paper, we discuss a Bayesian alternative to address both topics within the model comparison framework in which hypothesis testing is regarded as a special case. We first show that both problems can be naturally formulated by a unified Bayesian parametric model, and we then derive a class of analytic approximate Bayes factors for use as our primary statistical device for model comparison. We establish the connections between the approximate Bayes factors and various commonly applied frequentist test statistics in a similar fashion, as reported by Wakefield (2009), Wen (2014), and Wen and Stephens (2014).

Despite its similarities in performance to the frequentist approaches in traditional hypothesis testing settings, the Bayesian approach exhibits great convenience and flexibility in dealing with complicated practical settings within and beyond hypothesis testing. One of the most significant advantages of the Bayesian comparison method is its acceptance of explicitly modeling various alternative scenarios (which are not necessarily nested) and the fluidity with which it combines the evidence from the data via Bayesian model averaging. Beyond single unit (i.e. either an SNP or an SNP set) association testing, we show that the Bayesian model comparison approach can be straightforwardly extended to a joint analysis of multiple association signals, especially when dealing with linkage disequilibrium (LD) among SNPs commonly present in the genetic data. We illustrate a highly efficient multi-locus fine-mapping approach that is facilitated by our results based on approximate Bayes factors.

2. Model and notations

We consider a general form of the LMM,

[Equation (2.1)]

where Inline graphic is an Inline graphic-vector of quantitative response measurements, Inline graphic is an Inline graphic matrix of covariate variables to be controlled as fixed effects, and their coefficients are encoded in the Inline graphic-vector Inline graphic. Inline graphic is an Inline graphic matrix of covariates whose effect, represented by the Inline graphic-vector Inline graphic, is of primary interest for inference. Finally, the Inline graphic-vectors Inline graphic and Inline graphic represent the random effects and the i.i.d residual errors, respectively. In the general LMM inference framework, the random effects vector, Inline graphic, is assumed to be drawn from a multivariate normal (MVN) distribution, i.e.

[Equation (2.2)]

where the Inline graphic matrix Inline graphic is assumed known (while the variance component parameter Inline graphic is typically unknown). In typical genetic applications, Inline graphic represents the genotypes of Inline graphic candidate SNPs, Inline graphic includes an intercept term and factors such as age and sex that need to be controlled for, and Inline graphic usually represents the random effects due to cryptic genetic relatedness or population structure. The ultimate goal is to make inference on the genetic effect Inline graphic.

We now present a Bayesian counterpart of the LMM, the likelihood part of which is identical to (2.1). From the Bayesian perspective, it is natural to regard the “random effect” assumption (2.2) as a standard MVN prior on Inline graphic. For controlled “fixed” effect coefficient Inline graphic, we assume the MVN prior:

[Equation (2.3)]

where Inline graphic is a diagonal matrix. When performing inference, we take the limit Inline graphic, which essentially assigns independent flat priors to each fixed effect coefficient. A flat prior might be interpreted as an assumption that the a priori effects of Inline graphic are extremely large. This assumption intuitively leads to a conservative inference on Inline graphic. However, for variables that must be controlled for, such conservative assumptions are most likely welcome.

We also assign an MVN prior for the parameter of interest, Inline graphic, such that

[Equation (2.4)]

The variance–covariance matrix Inline graphic fully characterizes a distinct candidate model in our model comparison framework. The choice of Inline graphic is context-dependent and has critical implications on the inference results. In practice, we recommend modeling the effect size on the unit-free scale of signal-to-noise ratios (Wen, 2014; Wen and Stephens, 2014) by assigning an MVN prior on the standardized effect, i.e. Inline graphic, which induces a prior variance matrix on the original scale of Inline graphic as Inline graphic. (Note that the prior on the random effect Inline graphic is formulated on the same scale.)

Finally, we assume a general joint prior distribution, Inline graphic, for the variance component parameters. As we will show later, the actual functional form of Inline graphic has little impact on our asymptotic approximations of Bayes factors. To emphasize the connection with the frequentist linear mixed effect model, we will henceforth call the above Bayesian linear regression model the Bayesian linear mixed effect model (BLMM).

3. Model Comparison in the BLMM

We derive Bayes factors for the BLMM in order to perform Bayesian model comparisons. More specifically, we consider a space of candidate BLMMs that only differ in their specifications of Inline graphic. We denote Inline graphic as the trivial null model, in which Inline graphic (or equivalently Inline graphic), and we define a null-based Bayes factor for an alternative model characterized by its prior variance on Inline graphic as

[Equation (3.1)]

To present our results regarding the Bayes factors, we begin by introducing several necessary additional notations. We denote Inline graphic and Inline graphic as the MLEs of the full LMM model (2.1) by treating Inline graphic as a fixed effect parameter. In addition, we denote Inline graphic. Correspondingly, we use Inline graphic and Inline graphic to represent the MLEs of the null model, where Inline graphic is restricted to Inline graphic. Furthermore, provided that parameter Inline graphic is known, we note that Inline graphic can be analytically computed as a function of Inline graphic (Appendix A.1 of supplementary material available at Biostatistics online), which we denote by Inline graphic. Accordingly, we use Inline graphic to represent the corresponding variance of Inline graphic (specifically, when Inline graphic, Inline graphic and Inline graphic). Finally, we consider a class of general estimators of Inline graphic, denoted by Inline graphic, for which a tuning parameter Inline graphic is built-in. The statistical details of this class of estimators are explained in Appendix A.2 of supplementary material available at Biostatistics online. Most importantly, it follows that Inline graphic and Inline graphic for the two extreme Inline graphic values. Deriving from Inline graphic, we establish a corresponding estimator of Inline graphic, denoted by Inline graphic, which can be analytically expressed in terms of Inline graphic (see Appendix A.2 of supplementary material available at Biostatistics online) and also shares a similar property such that Inline graphic, and Inline graphic. Finally, we use the notations Inline graphic. In the case that Inline graphic is specified as a function of Inline graphic and/or Inline graphic, we denote Inline graphic. With these additional notations, we show that the desired Bayes factor can be approximated analytically. 
We summarize the main result in Proposition 1, whose formal proof is given in Appendix A.2 of supplementary material available at Biostatistics online.

Proposition 1 —

Under the BLMM, the Bayes factor can be approximated by

[Equation (3.2)]

It follows that

[Equation]

Remark 1 —

The approximate Bayes factors in the BLMM share the same functional form as the ABFs discussed in Wen (2014) and enjoy some of the computational properties discussed therein. In particular, the computation of the ABF is robust to the potential collinearity present in the data matrix Inline graphic. Furthermore, Inline graphic is allowed to be rank-deficient.

Remark 2 —

For single SNP analysis, i.e. Inline graphic, both Inline graphic and Inline graphic degenerate to scalars (which we denote by Inline graphic and Inline graphic, respectively). The expression of (3.2) is reduced to

[Equation (3.3)]

which has the same functional form as the ABF discussed in Wakefield (2009).
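As a rough illustration of the single-SNP case, the sketch below evaluates a Bayes factor of the Wakefield (2009) functional form on the log scale. The function name, the argument names, and the parameterization by an effect estimate, its standard error, and a prior variance on the original effect scale are assumptions made for this example. The sketch takes the estimate and its standard error as given and does not reproduce the tuning-parameter-dependent estimators defined above.

```python
import math


def log_abf_single_snp(beta_hat: float, se: float, prior_var: float) -> float:
    """Log approximate Bayes factor (alternative vs. null), Wakefield-type form.

    beta_hat  : estimated SNP effect (hypothetical input name)
    se        : standard error of beta_hat, must be positive
    prior_var : prior variance of the effect under the alternative (>= 0)
    """
    if se <= 0 or prior_var < 0:
        raise ValueError("se must be positive and prior_var non-negative")
    v = se ** 2                           # sampling variance of the estimate
    z2 = (beta_hat / se) ** 2             # squared z-statistic
    shrink = prior_var / (v + prior_var)  # weight of the prior relative to the data
    # log sqrt(v / (v + w)) + (z^2 / 2) * w / (v + w)
    return 0.5 * math.log(v / (v + prior_var)) + 0.5 * z2 * shrink


if __name__ == "__main__":
    # Toy numbers, not taken from the paper.
    print(log_abf_single_snp(beta_hat=0.5, se=0.1, prior_var=0.04))
```

Working on the log scale avoids overflow for strong signals; a prior variance of zero recovers the null model and a log Bayes factor of zero.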

Although all suitable Inline graphic values yield the same asymptotic error bound, they have practical implications on the approximation accuracy for finite samples. Our numerical experiments (Appendix B of supplementary material available at Biostatistics online) indicate that with sample size around hundreds, the Inline graphics with Inline graphic and Inline graphic both become quite accurate.

3.1. Connection with frequentist test statistics

3.1.1. Connection with fixed effect test statistics

Consider a specific class of prior, Inline graphic, for which the Inline graphic can be simplified to

[Equation]

Consequently, the Inline graphic becomes a monotonic transformation of the quadratic form Inline graphic. We note that, in the following two special cases, the quadratic form corresponds to some popular frequentist statistics to test Inline graphic as a fixed effect. Particularly, when Inline graphic, the quadratic form becomes the (multivariate) Wald statistic Inline graphic; as Inline graphic is set to 0, it coincides with Rao's score statistic (Appendix C.1 of supplementary material available at Biostatistics online).

The monotonic correspondence between the Inline graphic and these two popular frequentist test statistics indicates that, under the prior specified, the Inline graphic ranks candidate models (or SNP associations in single SNP analysis) exactly the same way as both the Wald statistic (for Inline graphic) and the score statistic (for Inline graphic). Furthermore, applying the strategy of Bayes/non-Bayes compromise (Good, 1992; Servin and Stephens, 2007) by treating the ABF as a regular test statistic, it becomes obvious that the ABF possesses a p-value identical to that of the corresponding Wald or score statistic, depending on the Inline graphic values. Wakefield (2009) first named the prior specification of the kind Inline graphic as the implicit p-value prior. In the special case of single SNP association testing and assuming Hardy–Weinberg equilibrium, it follows that Inline graphic (where Inline graphic represents the allele frequency of a target SNP). As a consequence, the implicit p-value prior essentially assumes a larger a priori effect for SNPs that are less informative (either due to a smaller sample size or minor allele frequency). Although, from the Bayesian point of view, there seems to be a lack of proper justification for such prior assumptions (Wakefield, 2009; Wen and Stephens, 2014), we often note that the overall effect of the implicit p-value prior on the final inference may be negligible in practice, especially when the sample size is large (see Section 5.1.1 for illustration).
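The monotone relationship can be checked by hand in the simplest case. The following worked equation is a sketch for a single SNP under a Wakefield-type form; the symbols (estimate $\hat\beta$ with sampling variance $v$, prior variance $w$, proportionality constant $\lambda$) are chosen for this illustration and are not necessarily the notation used above.

```latex
\[
\mathrm{ABF} \;=\; \sqrt{\frac{v}{v+w}}\;
  \exp\!\left(\frac{z^{2}}{2}\cdot\frac{w}{v+w}\right),
\qquad z = \frac{\hat\beta}{\sqrt{v}} .
\]
% Implicit p-value prior: prior variance proportional to the sampling variance.
\[
w = \lambda v
\;\Longrightarrow\;
\mathrm{ABF} \;=\; (1+\lambda)^{-1/2}\,
  \exp\!\left(\frac{z^{2}}{2}\cdot\frac{\lambda}{1+\lambda}\right).
\]
```

For a fixed $\lambda > 0$, the data enter only through $z^{2}$ and the expression is strictly increasing in $z^{2}$, so ranking SNPs by this Bayes factor is the same as ranking them by the squared test statistic.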

3.1.2. Connection with the variance component score statistic

In SNP set analysis, it has become common practice to construct a variance component score test for the genetic effect Inline graphic (Wu and others, 2011; Lee and others, 2012; Schifano and others, 2012). That is, for a set of Inline graphic SNPs, the genetic effects are assumed to be random and follow the distribution Inline graphic, where the matrix Inline graphic is pre-defined. To test Inline graphic vs. Inline graphic, the score statistic is given by Inline graphic where Inline graphic. In the special case that the random effect Inline graphic is ignored (i.e. Inline graphic), Inline graphic is reduced to the form of the original SKAT statistic (Wu and others, 2011). By re-parameterizing Inline graphic, we show that Inline graphic can be represented as a function of Inline graphic (Appendix C.2 of supplementary material available at Biostatistics online). In particular, as Inline graphic, it follows that

[Equation]

That is, Inline graphic becomes monotonic to the variance component score statistic. Interestingly, the condition Inline graphic represents a local alternative scenario (i.e. Inline graphic only slightly deviates from Inline graphic), for which score tests are known to be most powerful.

4. Genetic association analysis with Bayes factors

4.1. Bayesian hypothesis testing

Bayes factors present two major advantages in the hypothesis testing of genetic association signals: namely, the convenience of Bayesian model averaging and the flexibility of utilizing useful prior information. Before we delve into the details of the advantages of Bayesian models in hypothesis testing, it is worth noting that the practical usage of Bayesian model comparison in hypothesis testing is limited, mostly due to the difficulty involved in determining significance thresholds based on Bayes factors. Traditionally, this issue has been addressed by treating a Bayes factor as a regular test statistic and deriving its p-value accordingly (Good, 1992; Servin and Stephens, 2007). Because the null distribution of a Bayes factor is generally non-trivial, most practical implementations rely on permutation procedures. Recently, Wen (2013) proposed a robust Bayesian false discovery rate (FDR) control procedure that directly uses the Bayes factors as inputs. This procedure ensures FDR control, even under the mis-specification of alternative models, a property resembling the behavior of p-value based procedures under similar circumstances. Most importantly, this procedure is highly computationally efficient and generally does not require extensive permutations.

4.1.1. Model averaging

In hypothesis testing, there often exist multiple alternative scenarios, and a single parametric model (or its corresponding test statistic) can hardly accommodate all cases. For example, in SNP set testing of rare-variant genetic associations, there exist two primary types of competing approaches that target different alternative scenarios. The first type, represented by the burden tests (Madsen and Browning, 2009), collapses the genetic variants in a region to form a single characteristic genetic unit, with respect to which the association test is then performed. This approach is ideal for a particular alternative scenario in which most of the variants considered are either consistently deleterious or consistently protective. The second type of approach, represented by the C-alpha (Neale and others, 2011) and SKAT tests, targets a complementary scenario in which the variants included in the SNP set can have bi-directional effects on the phenotype of interest. In practice, because the true alternative model is never known a priori, it remains a challenge to reconcile or combine the results from the two distinct approaches within the frequentist testing paradigm. Bayesian model averaging provides a principled way to naturally address this issue. Suppose that there are Inline graphic possible alternative models in consideration, and for each model Inline graphic, a Bayes factor Inline graphic can be computed and a prior probability/weight, Inline graphic, is assigned. An overall Bayes factor can then be computed by Inline graphic, which summarizes the overall evidence from the data compared with the null model while accounting for the uncertainty of the true alternative scenario.
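The averaging step described above is a weighted sum of Bayes factors. A minimal sketch follows, assuming the per-model Bayes factors are supplied on the natural-log scale and the prior weights sum to one; the function name and inputs are hypothetical. The sum is evaluated with the log-sum-exp trick so that very large Bayes factors do not overflow.

```python
import math


def log_average_bf(log_bfs, weights):
    """Return log(sum_k weights[k] * exp(log_bfs[k])) in a numerically stable way.

    log_bfs : natural-log Bayes factors, one per alternative model
    weights : prior model probabilities (non-negative, summing to 1)
    """
    if len(log_bfs) != len(weights) or not log_bfs:
        raise ValueError("log_bfs and weights must be non-empty and equally long")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-8:
        raise ValueError("weights must be non-negative and sum to 1")
    # Models with zero prior weight contribute nothing to the sum.
    terms = [lb + math.log(w) for lb, w in zip(log_bfs, weights) if w > 0]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


if __name__ == "__main__":
    # Two competing alternatives with equal prior weight (toy values).
    print(log_average_bf([math.log(10.0), math.log(2.0)], [0.5, 0.5]))
```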

In the context of SNP set analysis, Lee and others (2012) showed that the alternative scenarios considered in the burden and SKAT tests can both be represented in the LMM framework with different specifications of the random effect Inline graphic matrix. In brief, let the column vector Inline graphic denote the marginal prior effect sizes for Inline graphic SNPs in a set. The burden test assumes Inline graphic, whereas the SKAT model assumes Inline graphic. Given these results and within the framework of BLMM, we can straightforwardly average the evidence over the two competing alternative models by computing an overall Bayes factor, Inline graphic where the probability Inline graphic denotes the relative prior frequency of the burden model. Without prior preference over the two alternatives, a natural “objective” choice is to set Inline graphic.

Lee and others (2012) provided an alternative interpretation by connecting the two models. They considered a class of Inline graphic matrices indexed by a non-negative correlation coefficient Inline graphic: namely,

[Equation (4.1)]

which we will refer to as the SKAT-O prior. It should be noted that the prior distribution for Inline graphic assumed by Bayesian model averaging is essentially a normal mixture, which itself is not necessarily normal and hence differs from the SKAT-O prior. Nevertheless, the SKAT-O prior can be viewed as a normal approximation of this mixture distribution (to the first two moments).
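The moment-matching claim can be checked numerically. The sketch below assumes the standard forms from Lee and others (2012): under the burden model all effects share one normal draw scaled by the weights, under the SKAT model the effects are independent with variances given by the squared weights, and the SKAT-O covariance is the convex combination of the two. These forms, the function names, and the toy weights are assumptions of this example. Draws from the two-component mixture are compared with the SKAT-O covariance whose correlation parameter equals the burden weight.

```python
import random


def skat_o_cov(w, rho):
    """Assumed SKAT-O covariance: rho * w w^T + (1 - rho) * diag(w_j^2)."""
    p = len(w)
    return [[rho * w[i] * w[j] + ((1.0 - rho) * w[i] ** 2 if i == j else 0.0)
             for j in range(p)] for i in range(p)]


def draw_mixture_effect(w, pi_burden, rng):
    """One draw of the effect vector from the burden/SKAT normal mixture."""
    if rng.random() < pi_burden:
        z = rng.gauss(0.0, 1.0)                    # burden: one shared direction
        return [wi * z for wi in w]
    return [wi * rng.gauss(0.0, 1.0) for wi in w]  # SKAT: independent effects


def empirical_second_moment(draws):
    """Average outer product; equals the covariance because the mean is zero."""
    n, p = len(draws), len(draws[0])
    return [[sum(d[i] * d[j] for d in draws) / n for j in range(p)]
            for i in range(p)]


if __name__ == "__main__":
    rng = random.Random(0)
    w, pi_burden = [1.0, 0.5, 2.0], 0.3
    draws = [draw_mixture_effect(w, pi_burden, rng) for _ in range(40000)]
    print(empirical_second_moment(draws))
    print(skat_o_cov(w, pi_burden))
```

The two matrices agree up to Monte Carlo error, although the mixture itself is not normal; only its first two moments match.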

4.1.2. Informative prior

The explicit specification of the prior distribution on Inline graphic for alternative models is seemingly a distinct feature of Bayesian hypothesis testing. However, as we have shown, even the most commonly applied frequentist test statistics can be viewed as resulting from some implicit Bayesian priors. Therefore, it is only natural to regard the prior specification of Inline graphic as an integral component of alternative modeling. This fact should encourage practitioners to explicitly formulate appropriate informative priors in Bayesian hypothesis testing: if the prior does capture some essence of reality, it improves the overall statistical power; even if the prior is mis-specified, testing with Bayes factors using procedures such as the Bayes/non-Bayes compromise or the robust Bayesian FDR control only results in a reduction in power but no inflation of type I error.

For SNP set analysis, it has become common practice to pre-define some “weight” for each individual participating SNP in both the burden and SKAT types of approaches (i.e. the aforementioned Inline graphic vector). Most commonly, these priors are set up to prioritize genetic variants with low allele frequencies. When performing genetic association analysis, it is now becoming increasingly popular to incorporate genomic annotation and/or pathway information. In all of these examples, BLMM provides a convenient way to formally integrate the prior information into the hypothesis testing.

Finally, we note that there exist practical settings, especially in studies of genome-wide scale, in which the information of the desired priors can be sufficiently “learned” from data facilitated by the Bayes factors. Take, for example, the problem of SNP set analysis with two competing alternatives, and consider inferring the weights of the burden and the SKAT models (Inline graphic) from the data. Hypothetically, if (i) many SNP sets are investigated (in a single or multiple studies) and (ii) a sufficient number of modest to strong signals are present in the data, it should be intuitive that Inline graphic can be accurately estimated by pooling the information across all SNP sets. More specifically, for each SNP set, we can augment a latent indicator to represent the true generative model of the observed data. Subsequently, a straightforward EM algorithm (where the complete data likelihood can be evaluated via Bayes factors) can be used to estimate Inline graphic.
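One way such an EM algorithm might look is sketched below. It assumes that each SNP set comes with one natural-log Bayes factor per candidate model (a column of zeros can stand for the null model, whose Bayes factor is 1) and that the latent indicator of each set is drawn independently from a common weight vector. The function name, input layout, and stopping rule are assumptions of this example rather than the procedure used in the paper.

```python
import math


def em_model_weights(log_bf_rows, n_iter=500, tol=1e-10):
    """Estimate mixture weights over candidate models from per-set log Bayes factors.

    log_bf_rows : list of rows, one per SNP set; row[k] is the log Bayes factor
                  of model k for that set (use 0.0 for the null model).
    """
    k = len(log_bf_rows[0])
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in log_bf_rows:
            # E-step: responsibility of model j is proportional to weight_j * BF_j.
            terms = [lb + math.log(w) if w > 0 else float("-inf")
                     for lb, w in zip(row, weights)]
            m = max(terms)
            resp = [math.exp(t - m) for t in terms]
            s = sum(resp)
            for j in range(k):
                totals[j] += resp[j] / s
        # M-step: new weight is the average responsibility across SNP sets.
        new_weights = [t / len(log_bf_rows) for t in totals]
        converged = max(abs(a - b) for a, b in zip(new_weights, weights)) < tol
        weights = new_weights
        if converged:
            break
    return weights


if __name__ == "__main__":
    # Toy input: three sets favor model 0, one favors model 1.
    print(em_model_weights([[10.0, 0.0], [10.0, 0.0], [10.0, 0.0], [0.0, 10.0]]))
```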

4.2. Bayesian variable selection in the BLMM

Beyond hypothesis testing, many practical problems in genetic association studies can be tackled using model comparison/selection techniques via Bayes factors. Here, we consider the problem of multi-locus fine-mapping analysis. In practice, the fine-mapping analysis usually focuses on relatively small genomic regions flagged by SNP association signals, with the aim of identifying multiple potential signals and narrowing down the candidate causal variants within a region while accounting for LD.

Consider a region of Inline graphic candidate variants whose genetic effects are jointly modeled by the Inline graphic-vector Inline graphic. Ultimately, we are interested in making an inference on the binary vector Inline graphic. Under the BLMM, we assume the following spike-and-slab prior for variable selection, namely,

[Equation (4.2)]

where the parameter Inline graphic denotes the prior inclusion probability of an SNP and the parameter Inline graphic represents the prior genetic effect size of each SNP. The posterior distribution of Inline graphic can be computed by

[Equation (4.3)]

where the Bayes factor can be further approximated by Inline graphic. It is then conceptually straightforward to design an MCMC algorithm to perform Bayesian variable selection. We note that, in the case of setting Inline graphic, there are substantial computational savings in the proposed MCMC computation. We give the detailed description and explanation of the MCMC algorithm in Appendix D of supplementary material available at Biostatistics online.
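For intuition, the posterior over inclusion vectors can be computed exactly when the number of candidate SNPs is very small, by enumerating every configuration and weighting its Bayes factor by its prior probability. The sketch below does this in place of the MCMC algorithm, which is what a region with hundreds of SNPs would require. It assumes independent inclusion indicators with a common prior inclusion probability and a user-supplied function returning the log Bayes factor of a configuration; all names are hypothetical.

```python
import itertools
import math


def posterior_over_configs(p, prior_incl, log_bf):
    """Exact posterior over all 2^p inclusion vectors (feasible only for small p).

    p          : number of candidate SNPs
    prior_incl : prior inclusion probability of each SNP, in (0, 1)
    log_bf     : callable mapping a 0/1 tuple to its log Bayes factor vs. the null
    Returns (dict of configuration -> posterior probability, list of PIPs).
    """
    configs = list(itertools.product((0, 1), repeat=p))
    log_post = []
    for g in configs:
        k = sum(g)
        log_prior = k * math.log(prior_incl) + (p - k) * math.log(1.0 - prior_incl)
        log_post.append(log_prior + log_bf(g))
    m = max(log_post)
    unnorm = [math.exp(v - m) for v in log_post]
    total = sum(unnorm)
    probs = [u / total for u in unnorm]
    # Posterior inclusion probability: total mass of configurations containing SNP j.
    pips = [sum(pr for g, pr in zip(configs, probs) if g[j]) for j in range(p)]
    return dict(zip(configs, probs)), pips


if __name__ == "__main__":
    # Toy additive log Bayes factor, standing in for an approximate Bayes factor.
    gains = [5.0, 0.0, -2.0]
    post, pips = posterior_over_configs(
        3, 0.2, lambda g: sum(a * gi for a, gi in zip(gains, g)))
    print(pips)
```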

5. Numerical illustration

5.1. Application of BLMM to an A. thaliana data set

In this example, we apply the BLMM to study the genetic associations between the genotypes of an inbred A. thaliana line and the quantitative phenotype of sodium concentration in the leaves using the data described in Baxter and others (2010). The data set consists of 336 inbred individuals, and each individual is genotyped at 214K SNP positions genome-wide. The data set was previously analyzed by Segura and others (2012) under the LMM setting. We conduct an additional quantile normalization step for the original phenotype measurements to prevent the influence of potential outliers.

5.1.1. Single SNP association analysis

We first perform single SNP association tests using the approximate Bayes factors of the BLMM and compare the results with the analyses based on p-values. To specify the alternative models in the BLMM, we consider a natural exchangeable prior on the standardized effect scale, i.e. Inline graphic. Unlike the implicit p-value prior, this prior does not assume a relationship between the genetic effect size and the features of a target SNP. Furthermore, instead of fixing a single Inline graphic value, we assume that Inline graphic is uniformly drawn from the set Inline graphic, where the various levels of Inline graphic values cover a range of small, modest to large potential effect sizes. The use of multiple Inline graphic values forms a mixture normal prior, which is helpful for describing a longer-tailed distribution of effect sizes (Servin and Stephens, 2007; Wen, 2014). The range of the Inline graphic values is selected following the suggestion of Stephens and Balding (2009). We use the software package GEMMA (Zhou and Stephens, 2012) to estimate the kinship matrix, Inline graphic, for the random effect, and obtain the MLEs, Inline graphic, along with their standard errors for all the SNPs. Applying (3.3), we then compute the approximate Bayes factors at Inline graphic and Inline graphic for each Inline graphic value. Finally, we compute an overall Bayes factor by averaging over all the prior effect size models, i.e. Inline graphic.

We first investigate the ranking of the association signals by the ABFs under the natural Bayesian prior and the p-values based on the score and Wald test statistics. To this end, we compute Spearman's rank correlation coefficient (Inline graphic) of the Inline graphic and Inline graphic(p-value). The overall rank correlation (from all 214K association tests) between Inline graphic(p-value) based on the score statistic and Inline graphic is 0.817. However, we note that the majority of the discordance in ranking comes from the unlikely association signals (see Figure 1), which are generally not of interest. Focusing on the subset of 10 913 SNPs with Inline graphic, the rank correlation becomes nearly perfect (Inline graphic). Similarly, the Inline graphic(p-value) based on the Wald statistic has an overall rank correlation of 0.821 with Inline graphic, and for the subset of 11 379 SNPs with corresponding Inline graphic, Inline graphic. The direct comparison between the approximate Bayes factors and corresponding p-values is shown in Figure 1.
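The rank comparison above relies on Spearman's coefficient, which is the Pearson correlation of the ranks. A small self-contained sketch is given below, with tied values receiving their average rank; the function names and toy inputs are illustrative and no values from the data set are used. Because the coefficient depends only on ranks, any monotone increasing rescaling of either input (for instance, a change of logarithm base) leaves it unchanged.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties replaced by the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation; undefined (division by zero) if an input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Toy evidence scores on two scales that agree except for one swap.
    print(spearman_rho([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]))
```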

Fig. 1.

Direct comparison of the approximate Bayes factors and p-values on the log scale. The plot shows that the rankings of the association signals based on the Bayes factor and the p-value are largely in agreement, especially for SNPs showing modest to strong signs of association.

As an illustration, we further apply the Bayesian and the frequentist FDR control procedures for the Bayes factors and p-values to determine the significance cut-offs, ignoring correlations among the tests. Ultimately, both the Benjamini–Hochberg and the Storey procedures using the score statistic p-values select 17 significant SNPs (denoted by set Inline graphic). In comparison, the standard Bonferroni procedure selects 12 SNPs (denoted by set Inline graphic). The Bayesian FDR control procedure (i.e. the EBF procedure, described in Wen, 2013) based on Inline graphic selects 14 significant SNPs (denoted by set Inline graphic). Importantly, we note that Inline graphic. The results from the Inline graphic and Wald statistic p-values are nearly identical.
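For readers less familiar with the frequentist procedures named above, the sketch below implements the textbook Benjamini–Hochberg step-up rule and the Bonferroni rule on a list of p-values. It illustrates only these two standard procedures; the Storey procedure and the Bayesian EBF procedure of Wen (2013) are not reproduced, and the function names, significance level, and toy p-values are assumptions of this example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its step-up threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def bonferroni(p_values, alpha=0.05):
    """Indices rejected by the Bonferroni correction at level alpha."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


if __name__ == "__main__":
    # Toy p-values, not taken from the paper.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals), bonferroni(pvals))
```

Every p-value that passes the Bonferroni threshold also passes the first Benjamini–Hochberg threshold, so the Bonferroni selection is always contained in the Benjamini–Hochberg selection.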

Based on this result, we conclude that, under this particular GWAS setting with a very modest sample size, there is no obvious practical difference in applying the Bayes factors and the p-values in single SNP hypothesis testing. We view this result as a numerical validation of our theoretical results discussed in Section 3.1.

5.1.2. Fine-mapping analysis

Following Segura and others (2012), we further perform a multi-locus fine-mapping analysis of a 200 kb genomic region centered around the top single SNP association signal at chr4:6392280, where 508 SNPs are included. Using the MCMC algorithm described in Section 4.2, we assign the prior inclusion probability Inline graphic for each candidate SNP, which conservatively sets the prior expected number of signals in the region to 1. Conditional on a SNP having a non-zero effect (i.e. Inline graphic), we use the same normal mixture prior for the effect size Inline graphic described in the single SNP association analysis. We obtain the posterior samples from 300 000 MCMC repeats after 150 000 burn-in steps, and the convergence of the MCMC algorithm is diagnosed using the procedure described in Brooks and others (2003).

The analysis based on the posterior samples clearly indicates that there are multiple independent association signals residing in this relatively small genomic region. There is zero probability mass on those posterior models containing fewer than 3 SNPs; the probabilities for the posterior models having 3, 4, 5, and 6 independent signals are 0.175, 0.452, 0.350, and 0.023, respectively. Inspecting individual SNPs, we summarize the top 5 associated SNPs according to their posterior inclusion probabilities in Table 1. The correlations among the top 5 SNPs are very modest. Thus far, our result has been largely consistent with what is reported in Segura and others (2012), in which a stepwise variable selection scheme with a Bayesian Information Criterion-like model selection criterion is employed. Nevertheless, we note a great deal of uncertainty within the individual models from our analysis. The details of the top 10 models ranked by their posterior probabilities are shown in Table 2. The maximum a posteriori model only has a probability of 0.05, and all of the top models have similar complexities and very comparable likelihoods. In addition, we find that 61% of the posterior models contain both of the top two SNPs, and 32% of the posterior models contain a combination of the top 3 SNPs. One may naturally suspect that the uncertainty in relatively large models (i.e. with more SNPs included) is partially due to the stringent Inline graphic prior. To this end, we modify the prior distribution to Inline graphic (the two end points correspond to Inline graphic equaling Inline graphic and Inline graphic, respectively), but the results do not qualitatively change. Biologically, it might be the case that the true causal variants are not directly genotyped and the observed signals are only partially correlated with them. It is then worth following up with dense genotyping experiments or genotype imputations. Statistically, it seems evident that, in this particular case, reporting a single “best” model from the variable selection procedure yields an over-simplified picture and can be misleading for the follow-up analysis.

Table 1.

Top 5 associated SNPs according to their marginal inclusion probabilities in the Bayesian fine-mapping analysis

SNP            Posterior inclusion prob.   Marginal Inline graphic
chr4:6414956   0.795                       4.98
chr4:6392280   0.741                       7.96
chr4:6420777   0.528                       6.03
chr4:6455695   0.451                       5.30
chr4:6391204   0.405                       7.92

The last column shows the values of Inline graphic from the single SNP association testing. Only SNP chr4:6392280 and SNP chr4:6391204 show very modest LD, whereas all other pairs of SNPs are in very weak LD.

Table 2.

Top 10 posterior models in the Bayesian fine-mapping analysis

| Model | Posterior prob. | Inline graphic |
|---|---|---|
| chr4:6392280 + chr4:6394774 + chr4:6414956 + chr4:6421034 | 0.052 | 19.55 |
| chr4:6392280 + chr4:6414956 + chr4:6420777 + chr4:6455695 | 0.039 | 18.89 |
| chr4:6391204 + chr4:6392280 + chr4:6414956 + chr4:6420777 | 0.032 | 18.47 |
| chr4:6380552 + chr4:6391204 + chr4:6414956 + chr4:6455695 | 0.028 | 18.67 |
| chr4:6391204 + chr4:6414956 + chr4:6420777 + chr4:6455695 | 0.026 | 18.72 |
| chr4:6392280 + chr4:6414956 + chr4:6418442 + chr4:6420777 | 0.024 | 18.68 |
| chr4:6391286 + chr4:6392280 + chr4:6414956 + chr4:6420777 | 0.022 | 18.35 |
| chr4:6392280 + chr4:6414956 + chr4:6418442 | 0.018 | 16.75 |
| chr4:6380552 + chr4:6391204 + chr4:6392280 + chr4:6420777 | 0.017 | 18.19 |
| chr4:6380552 + chr4:6392280 + chr4:6394774 + chr4:6414956 + chr4:6421034 | 0.016 | 21.25 |

The models are ranked according to their posterior probabilities (second column). The last column shows the values of Inline graphic of the corresponding models. Our prior specification encourages sparse models: complicated models with more predictors are penalized more severely by the prior inclusion probability. The most important feature of these results is that there is not a unique simple model that is clearly better than the others.

5.2. Simulation study of SNP set analysis

In this section, we perform simulation studies to illustrate the effectiveness of the proposed Bayesian model comparison approach in SNP set analysis. In each simulated data set, we generate 5000 phenotype-SNP set pairs that mimic the data structure from a genome-wide investigation of expression quantitative trait loci. We randomly select 3500 SNP sets and simulate their phenotypes from a null model. For the remaining SNP sets, we use two types of alternative models described in Lee and others (2012) to generate their phenotypes: one model assumes consistent directional effects of rare variants, whereas the other allows inconsistent directional effects. We use Inline graphic to denote the relative frequency of the sign-consistent models among all the alternative models, and vary this parameter across simulation sets. We give a detailed account of the simulation schemes in Appendix E.1 of supplementary material available at Biostatistics online.

We analyze the simulated data sets using the proposed Bayesian model comparison approach and the SKAT-O method implemented in the R package SKAT (version 0.95) to examine their FDR control and power. For both approaches, we again follow the previous work (Wu and others, 2011; Lee and others, 2012) and assign the marginal weight for each SNP as a function of its allele frequency. In the Bayesian analysis, we apply two strategies for choosing the prior weights for Bayes factor computation. The first strategy assumes an “objective” uniform prior setting Inline graphic, and the second strategy estimates Inline graphic and the distribution of genetic effect sizes from the data by pooling information across all phenotype-SNP set pairs using a hierarchical model. The details of the analysis procedure are provided in Appendix E.2 of supplementary material available at Biostatistics online.
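
To make the allele-frequency weighting concrete, the sketch below evaluates a Beta density at the minor allele frequency (MAF), which is the weighting scheme introduced in Wu and others (2011) and used as the default in the SKAT package with shape parameters (1, 25). Whether exactly these shape parameters were used in the simulations here is an assumption; the function name `beta_density_weight` is hypothetical.

```python
from math import exp, lgamma, log


def beta_density_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density evaluated at a minor allele frequency.

    The defaults a=1, b=25 follow the SKAT default weighting and are an
    assumption here, not a setting confirmed by this paper.
    """
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must lie strictly between 0 and 1")
    # Log of the Beta normalizing constant, computed stably via lgamma.
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1.0) * log(maf) + (b - 1.0) * log(1.0 - maf))


# Rare variants receive much larger weights than common ones.
rare_weight = beta_density_weight(0.01)
common_weight = beta_density_weight(0.20)
```

Because the Beta(1, 25) density decreases steeply in the MAF, this choice concentrates the prior effect-size variance on rare variants while leaving common variants with little influence on the set-level test.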

We summarize the simulation results in Table 3. The FDRs of all the methods are well controlled. The performance of the Bayesian procedure with the default uniform prior weights is very similar to that of SKAT-O, and the Bayesian procedure based on informative priors achieves the best power in all settings defined by different true Inline graphic values. These results are expected because the Bayesian method with estimated weights has the unique advantage of effectively borrowing information across genes through the use of Bayes factors and hierarchical modeling. In addition, we want to emphasize that all of the Bayesian models assumed in the analysis are indeed very “wrong” compared with the true data-generating model; nevertheless, the robust Bayesian FDR control procedure using Bayes factors ensures the targeted FDR level.

Table 3.

Realized FDR and power in simulation studies of SNP set analysis

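| Setting (Inline graphic) | FDR: SKAT-O | FDR: Bayesian-D | FDR: Bayesian-E | Power: SKAT-O | Power: Bayesian-D | Power: Bayesian-E |
|---|---|---|---|---|---|---|
| 0.20 | 0.024 | 0.027 | 0.023 | 0.768 | 0.741 | 0.821 |
| 0.40 | 0.046 | 0.030 | 0.028 | 0.791 | 0.773 | 0.828 |
| 0.50 | 0.051 | 0.045 | 0.041 | 0.836 | 0.825 | 0.869 |
| 0.60 | 0.049 | 0.046 | 0.045 | 0.909 | 0.908 | 0.919 |
| 0.80 | 0.050 | 0.049 | 0.048 | 0.933 | 0.943 | 0.948 |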

The first column (setting) indicates the percentage of the SNP sets with sign-consistent effects among all of the non-null SNP sets in the simulated data. For the SKAT-O procedure, the resulting p-values are further processed by the Storey procedure for FDR control. “Bayesian-D” indicates the Bayesian testing procedure with the default uniform weights. “Bayesian-E” indicates the Bayesian procedure that estimates Inline graphic from the data. The FDR control for the Bayes factors is performed using the EBF procedure described in Wen (2013).
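
As a rough illustration of how an FDR can be controlled from Bayes factors and how the realized FDR and power in Table 3 are tallied, the sketch below uses the standard posterior-probability rule: convert each Bayes factor into a posterior null probability, then reject the largest set of tests whose average posterior null probability stays at or below the target level. This is a generic substitute and is not the EBF procedure of Wen (2013), which is not described in this section. The null proportion 0.7 mirrors the 3500 of 5000 null SNP sets in the simulation design; the toy Bayes factors, truth labels, and function names are hypothetical.

```python
def posterior_null_prob(bf, pi0):
    """Pr(null | data) from a Bayes factor (alternative vs null) and prior null prob."""
    return pi0 / (pi0 + (1.0 - pi0) * bf)


def bayesian_fdr_reject(null_probs, alpha=0.05):
    """Indices rejected by the posterior-probability FDR rule (not the EBF procedure)."""
    order = sorted(range(len(null_probs)), key=lambda i: null_probs[i])
    rejected, running_sum = [], 0.0
    for rank, i in enumerate(order, start=1):
        running_sum += null_probs[i]
        # The running mean is non-decreasing, so stop at the first violation.
        if running_sum / rank > alpha:
            break
        rejected.append(i)
    return rejected


def realized_fdr_power(rejected, is_alt):
    """Realized FDR and power given the simulation truth."""
    true_hits = sum(1 for i in rejected if is_alt[i])
    fdr = (len(rejected) - true_hits) / max(len(rejected), 1)
    power = true_hits / max(sum(is_alt), 1)
    return fdr, power


# Toy example: six tests with invented Bayes factors and truth labels.
bfs = [1000.0, 200.0, 50.0, 0.5, 0.1, 2.0]
is_alt = [True, True, False, False, False, True]
null_probs = [posterior_null_prob(bf, pi0=0.7) for bf in bfs]
rejected = bayesian_fdr_reject(null_probs, alpha=0.05)
fdr, power = realized_fdr_power(rejected, is_alt)
```

In a simulation the realized FDR of a single small data set can exceed the target, as it does in this toy example; the control statement concerns the expectation over repeated data sets.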

Going beyond SNP set testing targeting rare-variant associations, in Appendix F of supplementary material available at Biostatistics online, we further demonstrate that our Bayesian model averaging framework can be conveniently extended to integrate models for detecting common variant associations into SNP set testing. We envision that this approach will have a profound impact in studies of expression quantitative trait loci at the genome-wide scale.

6. Discussion

In this paper, we have presented a unified Bayesian framework to perform model comparisons in the contexts of an LMM and SNP set analysis. Although our statistical results are presented exclusively for quantitative response variables, it is possible to extend them to the generalized LMM context to incorporate binary outcomes and count data using a quadratic approximation of the corresponding log-likelihood functions.
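
To indicate how a quadratic approximation of the log-likelihood leads to an analytic Bayes factor, the sketch below implements the well-known single-parameter case in the style of Wakefield (2009): the approximation reduces the data to an estimate and its standard error, and a zero-mean normal prior on the effect then gives a closed-form Bayes factor that depends on the data only through the Wald statistic. This is an illustrative special case under stated assumptions (one SNP, normal prior, alternative versus null orientation), not the multi-parameter Bayes factor derived in this paper; the function name `log10_abf` is hypothetical.

```python
import math


def log10_abf(beta_hat, se, prior_sd):
    """log10 approximate Bayes factor (alternative vs null) for one effect.

    Assumes beta_hat ~ N(beta, se^2) from a quadratic log-likelihood
    approximation, and beta ~ N(0, prior_sd^2) under the alternative.
    """
    v = se ** 2                      # sampling variance of the estimate
    w = prior_sd ** 2                # prior variance of the effect
    z_sq = (beta_hat / se) ** 2      # squared Wald statistic
    shrink = w / (v + w)
    # BF = sqrt(v / (v + w)) * exp(z^2 / 2 * w / (v + w))
    log_bf = 0.5 * math.log(1.0 - shrink) + 0.5 * z_sq * shrink
    return log_bf / math.log(10.0)


# Example: an estimate three standard errors from zero.
example = log10_abf(beta_hat=0.3, se=0.1, prior_sd=0.2)
```

The same estimate and standard error could come from a logistic or Poisson regression fit, which is why the quadratic approximation offers a route to binary and count outcomes.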

Primarily based on the results of the approximate Bayes factors, we have demonstrated an efficient Bayesian sparse variable selection algorithm to perform multi-locus association analysis using the BLMM. Recently, Zhou and others (2013) also proposed an elegant Bayesian solution for multiple SNP association analysis under the LMM on the genome-wide scale. Their method has a primary focus on estimating heritability, whereas our method is designed for fine-mapping analysis. In addition, by treating SNP sets as selection units, our approach can be straightforwardly extended to the identification of multiple associated genes/SNP sets, which may be attractive for biological pathway analysis. Previous studies (Guan and Stephens, 2011; Wen, 2014) have shown that Bayesian methods generally hold advantages over penalized regression approaches in variable selection problems with correlated covariates (e.g. SNPs in LD) and/or non-i.i.d. residual error structures. More importantly, as we have demonstrated, there can be great uncertainty regarding any single “best fitting” model. As a practical consequence, reporting a single “best” model while ignoring appropriate uncertainty assessments could hinder follow-up scientific investigation.

Finally, we want to note that Bayesian model comparison approaches have been successfully applied in other areas of genetic association studies, e.g. meta-analysis (Wen and Stephens, 2014), association mapping of multiple traits, and detecting gene–environment interactions (Flutre and others, 2013; Wen and Stephens, 2014). Our results can be conveniently integrated into those existing tools, and their usage can be naturally extended to incorporate LMM and SNP set analysis.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

This work was supported by NIH grants R01-MH101825 and R01-HG007022.

Acknowledgments

We thank Seunggeun Lee and Xiang Zhou for helpful discussions. Conflict of Interest: None declared.

References

  1. Baxter I., Brazelton J. N., Yu D., Huang Y. S., Lahner B., Yakubova E., Li Y., Bergelson J., Borevitz J. O., Nordborg M., Vitek O., Salt D. E. (2010). A coastal cline in sodium accumulation in Arabidopsis thaliana is driven by natural variation of the sodium transporter AtHKT1;1. PLoS Genetics 6(11), e1001193.
  2. Brooks S. P., Giudici P., Philippe A. (2003). Nonparametric convergence assessment for MCMC model selection. Journal of Computational and Graphical Statistics 12(1), 1–22.
  3. Flutre T., Wen X., Pritchard J., Stephens M. (2013). A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genetics 9(5), e1003486.
  4. Good I. J. (1992). The Bayes/non-Bayes compromise: a brief review. Journal of the American Statistical Association 87(419), 597–606.
  5. Guan Y., Stephens M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics 5(3), 1780–1815.
  6. Kang H. M., Sul J. H., Service S. K., Zaitlen N. A., Kong S. Y., Freimer N. B., Sabatti C., Eskin E. (2010). Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42(4), 348–354.
  7. Lee S., Wu M. C., Lin X. (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13(4), 762–775.
  8. Madsen B. E., Browning S. R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics 5(2), e1000384.
  9. Neale B. M., Rivas M. A., Voight B. F. (2011). Testing for an unusual distribution of rare variants. PLoS Genetics 7(3), e1001322.
  10. Schifano E. D., Epstein M. P., Bielak L. F., Jhun M. A., Kardia S. L., Peyser P. A., Lin X. (2012). SNP set association analysis for familial data. Genetic Epidemiology 36(8), 797–810.
  11. Segura V., Vilhjálmsson B. J., Platt A., Korte A., Seren Ü., Long Q., Nordborg M. (2012). An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nature Genetics 44(7), 825–830.
  12. Servin B., Stephens M. (2007). Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics 3(7), e114.
  13. Stephens M., Balding D. J. (2009). Bayesian statistical methods for genetic association studies. Nature Reviews Genetics 10(10), 681–690.
  14. Wakefield J. (2009). Bayes factors for genome-wide association studies: comparison with p-values. Genetic Epidemiology 33(1), 79–86.
  15. Wen X. (2013). Robust Bayesian FDR control with Bayes factors. arXiv preprint.
  16. Wen X. (2014). Bayesian model selection in complex linear systems, as illustrated in genetic association studies. Biometrics 70(1), 73–83.
  17. Wen X., Stephens M. (2014). Bayesian methods for genetic association analysis with heterogeneous subgroups: from meta-analyses to gene-environment interactions. Annals of Applied Statistics 8(1), 176–203.
  18. Wu M. C., Lee S., Cai T., Li Y., Boehnke M., Lin X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics 89(1), 82–93.
  19. Zhou X., Carbonetto P., Stephens M. (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9(2), e1003264.
  20. Zhou X., Stephens M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44(7), 821–824.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
