Abstract
While genome-wide association studies (GWASs) have been widely used to uncover associations between diseases and genetic variants, standard SNP-level GWASs often lack the power to identify SNPs that individually have a moderate effect size but jointly contribute to the disease. To overcome this problem, pathway-based GWAS methods have been developed as an alternative strategy that complements SNP-level approaches. We propose a Bayesian method that uses the generalized fused hierarchical structured variable selection prior to identify pathways associated with the disease using SNP-level summary statistics. Our prior has the flexibility to take in pathway structural information so that it can model the gene-level correlation based on prior biological knowledge, an important feature that makes it appealing compared to existing pathway-based methods. Using simulations, we show that our method outperforms competing methods in various scenarios, particularly when we have pathway structural information that involves complex gene-gene interactions. We apply our method to the Wellcome Trust Case Control Consortium Crohn’s disease GWAS data, demonstrating its practical application to real data.
Keywords: generalized fused lasso, group lasso, hierarchical variable selection, pathway-based GWAS, summary statistics
1 |. INTRODUCTION
Genome-wide association studies (GWASs) have been widely used to detect associations between complex diseases and common genetic variants, such as single-nucleotide polymorphisms (SNPs) with a minor allele frequency (MAF) > 5%. The standard approach to detecting associations between a disease and SNPs in case-control GWASs usually consists of genotyping hundreds of thousands of SNPs in thousands of participants with the disease (cases) or without the disease (controls), and analyzing the genotype data in statistical frameworks. Using this approach, GWASs have successfully identified SNPs that are associated with many human diseases such as breast cancer1,2 and type-2 diabetes.3,4
The statistical analyses in SNP-level GWASs usually start with a regression of the phenotype data onto each genotyped SNP so that we can obtain a p value for each SNP indicating its marginal strength of association with the disease phenotype. The SNPs with a p value less than a predefined threshold are then considered significantly associated with the disease and become candidates for further investigation. The threshold is usually determined using a multiple-comparison correction method such as the Bonferroni correction to control the overall type-I error rate and is often very stringent due to the large-scale multiple testing that results from the large number of SNPs. For example, if a GWAS has 1 million SNPs, the threshold for SNP-level analyses can be as low as 5 × 10⁻⁸, making it hard to detect SNPs that have a moderate effect but a true association with the disease. In addition, some of the few SNPs that pass the threshold can still be false positives, limiting the reproducibility of SNP-level GWASs.
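To make the arithmetic behind this threshold concrete, the following minimal sketch computes a Bonferroni per-test threshold. The family-wise error rate of 0.05 and the count of 300 pathways are illustrative assumptions, not values taken from this study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise type-I error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed family-wise error rate of 0.05 across 1 million SNP-level tests.
snp_threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical number of pathways, chosen only to show how grouping SNPs
# reduces the number of tests and relaxes the threshold.
pathway_threshold = bonferroni_threshold(0.05, 300)

print(f"SNP-level threshold:     {snp_threshold:.1e}")
print(f"Pathway-level threshold: {pathway_threshold:.1e}")
```

Testing a few hundred pathways instead of a million SNPs relaxes the per-test threshold by several orders of magnitude, which is the motivation for the group-level tests discussed next.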
On the other hand, the idea of testing groups of SNPs is appealing in that it reduces the scale of multiple testing as well as allowing us to borrow information from SNPs that are in linkage disequilibrium (LD) or from groups of SNPs, eg, genes in this case, that have epistatic interactions. As a result, we expect to have a less stringent threshold for genome-wide significance and to improve the power to identify SNPs that individually have a moderate effect but are collectively significant. By grouping SNPs into meaningful, higher-order genomic structures such as genes and pathways, we also expect to provide results with improved interpretability and reproducibility. In this regard, both gene- and pathway-based GWASs provide natural solutions to the aforementioned problems of SNP-level GWASs.
In this paper, we focus on pathway-based GWASs and refer to biological pathways annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database.5 Recognizing that genes do not work in isolation and are often functionally related through biological pathways and molecular networks,6 pathway-based analysis methods for GWASs aim to improve power for identifying pathways of genes that individually have a moderate effect but jointly make a significant contribution to the disease. Examples of such pathways include the IL-12-IL-23 pathway7 for Crohn’s disease and the PI3K/RAS pathway8 for ovarian cancer tumors, both of which are likely to contribute to disease susceptibility despite the fact that only a few or even no genes in them may reach genome-wide significance. This well demonstrates the possibility that genes tend to work together in their respective pathways, and we may therefore gain more power to detect them at the pathway level rather than the gene or SNP level.
Recognizing the importance of pathway-based GWASs, many pathway-based association tests have been proposed, mostly in the frequentist framework. Recently proposed methods include, for example, aSPUpath,8 HYST,9 and Gates-Simes.10 One criterion to classify these methods is whether they require the input of the individual-level phenotype and SNP genotype data.11 For example, aSPUpath requires the phenotype and genotype data to calculate the score test statistics as well as to implement permutations to obtain the null distribution, whereas Gates-Simes and HYST require the SNP genotype data to calculate the LD correlation matrix among SNPs when no prior LD information is available. However, the fact that individual-level genotype data are not always available poses a major challenge to the practicality of these methods in real applications. Only a limited number of pathway-based methods, such as aSPUsPath,12 can be applied using summary statistics alone. Moreover, although the covariance matrix of SNP-level test statistics and the LD matrix take into account the correlation among SNPs, most of these methods do not consider the gene-level correlation, limiting their ability to detect pathways consisting of genes that may have a moderate marginal effect but work collectively to confer disease susceptibility.
In this study, we propose a Bayesian method to implement pathway-based GWASs using pathway structure, SNP-gene-pathway hierarchical structure, and SNP-level summary statistics, which are widely available in genomic databases from various studies. A biological pathway can be considered as a series of interactions among molecules that are coded by genes, and thus, a pathway structure is often represented as a graph that contains genes as nodes and interactions between genes as edges. These interactions between genes may induce correlation between gene-level effects on the phenotype. For example, “activation,” a common type of interaction between genes, refers to a process where an activator molecule converts an inactive molecule into an activated state so that the latter can perform its biological functions. Thus, if the gene that codes the activated molecule has a positive effect on the phenotype, the gene that codes the activator molecule is also likely to have a positive effect, ie, the gene-level effects of two genes that have an “activation” interaction tend to be correlated. Therefore, incorporating within-pathway gene-level interaction information and modeling corresponding correlations among genes borrow strength across genes within a pathway and could potentially boost power in pathway-based GWASs, which motivates our Bayesian method.
Most current Bayesian methods for GWASs use Bayes factors as a measurement of the association between diseases and SNPs.13–15 Our Bayesian method is based on a hierarchical structured variable selection (HSVS) prior proposed in the work of Zhang et al16 for group variable selection in linear regression when grouping structures are present among predictors. Yang et al17 adopted the HSVS prior for a gene-based GWAS analysis of a trio dataset from osteosarcoma patients. Building upon this work, we propose a generalized fused HSVS prior for pathway-based GWASs, which incorporates the SNP-gene-pathway hierarchical structure and the complex correlations at different levels.
The proposed generalized fused HSVS prior is a discrete mixture prior composed of a point mass at the vector of zeros and a multivariate scale-mixing normal distribution, with a binary indicator indicating which distributional component the random vector follows. We develop the prior for pathway-based GWASs in which the SNP-level summary statistics corresponding to a pathway are treated as a group. The posterior inference on the binary indicator can naturally serve as a hypothesis test to determine whether any of the SNP- or gene-level effects of the pathway is nonzero, and the lasso-type scale-mixing normal distribution, which shrinks within-pathway elements, is robust to high levels of noise and therefore controls type-I errors well. The Bayesian method generates posterior samples of the binary indicator that result in an estimate of the posterior probability of a null pathway, which can be interpreted as a Bayesian version of a p value to evaluate the significance of a pathway and has a close connection with the Bayes factor, as will be shown later. At the same time, this method also generates posterior estimates of the gene-level effects, enabling inference on the relative importance of the genes within a pathway using SNP-level data. More importantly, our proposed hierarchical prior is flexible in that it incorporates the SNP-gene-pathway hierarchical structure as well as the complex correlation structures. In particular, it accounts for LD correlations at the SNP level using autoregressive models and uses the generalized fused lasso method to induce gene-level correlations given the pathway structural information obtained from prior biological knowledge. The incorporation of LD correlations and prior knowledge of pathway structure allows us to borrow information across the SNP set within a pathway and consequently boosts power in pathway selection.
Although pathway information has been used in previous Bayesian GWASs to obtain the number of neighboring genes in the Markov random field prior,18 we use pathway information to obtain the pairs of interacting genes and model gene-level correlations in our proposed prior. In addition, our proposed prior leads to closed-form full conditional distributions of most of the model parameters, rendering a much faster computation of the Markov chain Monte Carlo (MCMC) posterior samples.
The rest of this paper is organized as follows. We propose the generalized fused HSVS prior in Section 2. We demonstrate the performance of the proposed Bayesian method using simulations in Section 3. We then use the method to identify susceptibility pathways in the Wellcome Trust Case Control Consortium (WTCCC) Crohn’s disease GWAS data19 in Section 4. Finally, we conclude with a brief discussion in Section 5.
2 |. METHODS
2.1 |. Generalized fused hierarchical structured variable selection model
Let denote the vector of z-scores from the Wald tests in logistic regressions for the association of the dichotomous disease status with each SNP in the mth pathway that has g genes with the ith gene having ni SNPs. The order of SNPs in a gene reflects the relative relationship of their genomic locations. The z-scores can also be converted from other statistics such as t-scores or χ2 statistics.17 We assume that
| (1) |
where θm = (θ1, … , θg)T with θi being the common mean for , ie, the SNP-level summary statistics of the ith gene, 1m = diag(11, … , 1g) with 1i being an ni vector of 1s for the ith gene, and with being a first-order autoregressive (AR-1) correlation matrix for the ith gene. Thus, we assume that the SNPs within a gene have a common mean effect with their correlations exhibiting an AR-1 pattern, whereas the SNPs of different genes are conditionally independent given the gene-level mean. Here, we assume an AR-1 correlation structure for all genes with a common ρ. However, the AR-1 correlation matrix can be replaced with other correlation structures, eg, the empirical LD matrix, given different assumptions of the SNP-level correlation.
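As an illustration of the structure just described, the sketch below builds the two ingredients of the SNP-level model for a toy pathway: the mean vector obtained by repeating each gene-level mean over that gene's SNPs, and the block-diagonal correlation matrix with one AR-1 block per gene and a common ρ. The function names and the toy values are hypothetical, and the scaling by σ² is omitted.

```python
def ar1_corr(n_snps, rho):
    """AR-1 correlation matrix: entry (i, j) equals rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n_snps)] for i in range(n_snps)]


def pathway_mean(theta, snps_per_gene):
    """Repeat each gene-level mean theta_i over the n_i SNPs of gene i."""
    mean = []
    for theta_i, n_i in zip(theta, snps_per_gene):
        mean.extend([theta_i] * n_i)
    return mean


def pathway_corr(snps_per_gene, rho):
    """Block-diagonal correlation: AR-1 within a gene, zero across genes."""
    total = sum(snps_per_gene)
    corr = [[0.0] * total for _ in range(total)]
    start = 0
    for n_i in snps_per_gene:
        block = ar1_corr(n_i, rho)
        for i in range(n_i):
            for j in range(n_i):
                corr[start + i][start + j] = block[i][j]
        start += n_i
    return corr


# Toy pathway (hypothetical values): two genes with 3 and 2 SNPs.
snps_per_gene = [3, 2]
mean = pathway_mean([0.5, -0.2], snps_per_gene)
corr = pathway_corr(snps_per_gene, rho=0.4)
```

Replacing `ar1_corr` with a function that returns an empirical LD block would correspond to the alternative correlation structure mentioned above.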
Our interest is to test the null hypothesis H0 ∶ θm = 0, that is, there is no association between any of the genes in the mth pathway and the disease of interest under H0. We test this hypothesis in a Bayesian framework by using a hierarchically structured prior, the generalized fused HSVS prior, for θm. Specifically, the generalized fused HSVS prior is a discrete mixture prior for group selection with a Bayesian generalized fused lasso hierarchy modeling the gene-level effects, which accounts for the gene-gene interactions within the pathway. The prior is structured as follows:
| (2) |
The prior uses a binary indicator, γm, on θm for the pathway-level selection so that, when γm = 0, we have θm = 0 supporting the null hypothesis for the mth pathway. On the other hand, γm = 1 indicates that the null hypothesis is rejected and the mth pathway is associated with the disease. Using such a discrete mixture prior in the Bayesian framework generates posterior samples of the binary indicator γm, which can be used to obtain the posterior probability , ie, the posterior probability of the null hypothesis. We then take this probability as a Bayesian version of a p value to evaluate the significance of the pathway and to conduct posterior inference using multiple-comparison correction methods.
Note that there is a close connection between our inference and the Bayes factor (BF), given that
| (3) |
when P(H0) = P(H1), ie, the null and the alternative hypotheses are equally likely in the prior. Our inference is based on , which is on the right-hand side of Equation (3). We used the Gibbs sampler to approximate the marginal distribution of γm and, thus, obtained the marginal likelihood .
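The link between the sampled indicator and the Bayes factor can be sketched as follows. The posterior probability of the null is estimated as the fraction of posterior draws with γm = 0, and the identity "posterior odds = Bayes factor × prior odds" converts it to a Bayes factor. The direction of the Bayes factor (H1 versus H0) and the toy draws are assumptions made for illustration only.

```python
def posterior_null_probability(gamma_samples):
    """Monte Carlo estimate of P(gamma_m = 0 | data) from posterior draws."""
    if not gamma_samples:
        raise ValueError("need at least one posterior draw")
    return sum(1 for g in gamma_samples if g == 0) / len(gamma_samples)


def bayes_factor_h1_vs_h0(p_null, prior_null=0.5):
    """Bayes factor for H1 against H0 (assumed direction).

    Uses posterior odds = Bayes factor * prior odds. With prior_null = 0.5
    the prior odds equal 1, so the Bayes factor equals the posterior odds.
    """
    if p_null == 0:
        return float("inf")
    posterior_odds = (1.0 - p_null) / p_null
    prior_odds = (1.0 - prior_null) / prior_null
    return posterior_odds / prior_odds


# Toy posterior draws of the pathway indicator (hypothetical, not real output).
gamma_samples = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
p_null = posterior_null_probability(gamma_samples)
bf = bayes_factor_h1_vs_h0(p_null)
```

In practice the draws would come from the Gibbs sampler after burn-in, and a small estimated null probability corresponds to a large Bayes factor in favor of an associated pathway.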
Under the alternative hypothesis, we assume that θm follows a multivariate normal distribution and we incorporate prior knowledge of gene-gene interaction within the pathway in modeling the inverse covariance matrix. Specifically, we specify the (i, j)th element of as follows:
| (4) |
where ki is the number of nonzero elements in the ith row or column of , ie, ki − 1 is the number of genes that interact with gene i within the pathway. Note that gives the general form of the Bayesian generalized fused lasso formulation20 by allowing negative and no associations between two off-diagonal elements in addition to positive associations. Thus, we model the off-diagonal elements of to introduce correlations between interacting genes given prior biological information. If two genes are positively associated with each other, they should have similar effects on the disease and a positive correlation is introduced. If two genes are negatively associated with each other, they should have opposite effects on the disease and, thus, a negative correlation is specified. By adapting to different pathway structures, our proposed prior has the flexibility to take in any pathway structure information so that it can borrow strength from interacting genes and boost power in pathway selection based on prior biological knowledge.
We further specify the hyperpriors for and in as
| (5) |
The exponential hyperprior for leads to an exponential-scale mixture normal prior under the alternative hypothesis, which is equivalent to a Bayesian lasso formulation21 that is robust to high levels of noise. The gamma hyperprior for , on the other hand, completes the gamma-scale mixture normal prior, which leads to a Bayesian group lasso formulation22 on the differences between θi and θj for positively interacting genes i and j, and on the sums of θi and θj for negatively interacting gene pairs.
To further illustrate our method, we collapse our Bayesian generalized fused HSVS prior, which leads to the marginal prior
| (6) |
where E+ indicates the set of gene pairs that positively regulate each other, and E− indicates the set of negatively interacting gene pairs. This marginal prior indicates that, under the alternative hypothesis, our hierarchical prior has introduced two shrinkage effects: one independently shrinks the individual gene-level effect estimates, and the other simultaneously shrinks the differences between positively associated gene pairs and the sums of negatively associated gene pairs. The first shrinkage effect, the L1 norm regularization with a penalty parameter λ1m, is realized through the Bayesian lasso formulation. Specifically, a scale mixture of normals with exponential hyperpriors leads to independent marginal Laplace priors for θi, i = 1, … , g. The second shrinkage effect, the L2 norm regularization with a penalty parameter λ2m, is realized through the Bayesian generalized fused group lasso formulation. Compared to the typical Bayesian fused lasso formulation in which an L1-type penalty is used only for pairs of neighboring coefficients with different strengths of correlations, which results in a tridiagonal inverse variance matrix in its normal prior, we introduce a common latent parameter for a uniform correlation strength for all pairs of interacting genes within a pathway. Using a gamma hyperprior on leads to a penalty on the L2 norm of the difference/sum of all positively/negatively associated gene pairs, which we call the Bayesian generalized fused group lasso prior.
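The two shrinkage effects can be illustrated numerically. The sketch below evaluates a penalty of the form described in the text: an L1 term on the gene-level effects plus a single L2 norm taken jointly over the differences of positively interacting pairs and the sums of negatively interacting pairs. It is inferred from the verbal description only, so constants and any scaling (for example by σ) may differ from Equation (6); the function name and toy values are hypothetical.

```python
import math


def fused_group_penalty(theta, pos_pairs, neg_pairs, lam1, lam2):
    """L1 penalty on effects plus a joint L2 penalty on fused pair contrasts.

    pos_pairs: index pairs (i, j) of positively interacting genes (set E+).
    neg_pairs: index pairs (i, j) of negatively interacting genes (set E-).
    Assumed form, not a verbatim transcription of the paper's marginal prior.
    """
    l1_term = lam1 * sum(abs(t) for t in theta)
    squared = sum((theta[i] - theta[j]) ** 2 for i, j in pos_pairs)
    squared += sum((theta[i] + theta[j]) ** 2 for i, j in neg_pairs)
    return l1_term + lam2 * math.sqrt(squared)


# Genes 0 and 1 interact positively; genes 1 and 2 interact negatively.
pos_pairs, neg_pairs = [(0, 1)], [(1, 2)]

# Effects that agree with the pathway structure incur no fused penalty.
consistent = fused_group_penalty([1.0, 1.0, -1.0], pos_pairs, neg_pairs, 1.0, 2.0)

# Same magnitudes but signs that contradict the structure are penalized more.
inconsistent = fused_group_penalty([1.0, -1.0, -1.0], pos_pairs, neg_pairs, 1.0, 2.0)
```

Because all pair contrasts sit under one square root, they are shrunk together as a group rather than one pair at a time, which is the distinction from the usual fused lasso drawn above.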
We call the hierarchical prior in Equations (2), (4), and (5) the generalized fused HSVS prior. Compared to the fused HSVS prior,16 it has the following novel properties. (1) It generalizes the fused HSVS prior by allowing shrinkage of the difference between, or the sum of, any two correlated coefficients instead of neighboring coefficients only. In practice, this allows us to apply this prior to pathway-based GWASs because any pair of genes in a pathway could be correlated. (2) By integrating the Bayesian generalized fused lasso with the Bayesian group lasso, it shrinks the difference between or the sum of two coefficients for all pairs of correlated coefficients simultaneously rather than independently. This will produce smoother coefficient estimates for correlated coefficients, which will, in turn, lead to stronger correlations among the correlated coefficient estimates. (3) It is more versatile in that any off-diagonal element in the inverse variance matrix can be set to zero to indicate independence between the two corresponding coefficients. This feature is especially useful in pathway-based GWASs because not all genes in a pathway are correlated and it provides a way to integrate the prior gene-gene correlation information of a pathway into our model to increase power as well as control type-I errors.
For the remaining priors and hyperpriors, we specify the prior for γm as Bernoulli(p), and the hyperprior for p as Beta(a, b) with constant parameters a and b. We also specify the hyperprior for as Gamma(r1, δ1) and the hyperprior for as Gamma(r2, δ2) with constant shape parameters r1 and r2 and constant rate parameters δ1 and δ2 . For σ2, we use the improper prior . These choices of priors and hyperpriors lead to closed-form full conditional posterior distributions, as shown in Appendix, rendering an efficient Gibbs sampler. Lastly, we specify the prior for ρ as Beta(a0, b0) with constant parameters a0 and b0. As ρ does not have a closed-form full conditional posterior distribution, we use the within-Gibbs Metropolis-Hastings (MH) algorithm to obtain its posterior samples. Alternatively, we can use an empirical estimate of ρ in place of the MH sampling, which will result in a more efficient computation while having a similar performance to the MH sampling, as shown in the simulation studies in Section 3. The full generalized fused HSVS model is therefore formulated as follows:
2.2 |. Choice of hyperparameters
For the beta hyperprior on p, we set (a, b) = (0.1, 0.1) to introduce a vague hyperprior for γm. Different parameterizations of (a, b) can be used for more informative choices; for example, a sparse hyperprior parameterization can be used with an empirical Bayes estimate of b.17 For the gamma hyperprior on , we set (r1, δ1) = (0.1, 0.1) to again introduce a vague hyperprior for , whereas, for the gamma hyperprior on , we set with , where is the number of pairs of interacting genes in the mth pathway, to introduce an informative hyperprior for . By empirically estimating r2 using the pathway structure information, we have which leads to the nonzero off-diagonal element in having a mean absolute value greater than 1, encouraging appropriate correlations among interacting genes.
3 |. SIMULATION STUDIES
In this section, we conduct simulation studies to examine the performance of our proposed Bayesian method and compare it to that of commonly used frequentist pathway-based GWAS methods. Note that we generate genotype data in the simulations because most competing frequentist methods that we include require the genotype data for permutation or calculating the LD matrices. However, our method works with summary statistics that are more often available in publicly accessible genomic databases.
We start with a simulation where we compare the methods in regular conditions as a benchmark against which other scenarios can be compared. In this scenario, the benchmark pathway, as shown in Figure 1, has 20 genes and each gene has 10 SNPs. Among the 20 genes, 10 genes are causal, and all their SNPs are causal too. The mean effect size, ie, the log odds ratio (LOR), of SNPs within a gene is the same, and we use it to represent the gene-level effect size. We set the gene-level effect size of interacting genes close to each other to reflect the gene-level correlation. The ρ of the AR-1 structure for SNPs within a gene is 0.4. In the following simulations, we compare the methods in different scenarios by varying the benchmark pathway in terms of its effect size, percent of causal genes, number of SNPs per gene, percent of causal SNPs, percent of known pathway structural information, and ρ of the AR-1 structure for SNPs.
FIGURE 1. Structure and gene-level effect size of the benchmark pathway.

A circle represents a gene that has 10 single-nucleotide polymorphisms (SNPs). An edge between two circles indicates that two genes are interacting. The number in a circle represents the gene-level effect size, ie, log odds ratio, for SNPs in the gene. The ρ of the AR-1 structure for SNPs in a gene is 0.4
To simulate the genotypes, we use the simPathAR1Snp function from the R package aSPU with slight modifications to accommodate the pathway structure. We first generate a latent vector from a multivariate normal distribution with mean 0 and a block diagonal covariance matrix with each block matrix being an AR-1 structure for each gene. The latent vector is then dichotomized to a haplotype of 1 or 0 using cutoffs that correspond to MAFs between 5% and 40%. We generate two such haplotypes and add them up to obtain the number of minor alleles Xi for the ith participant. Using the inverse logit function , where βj is the predefined LOR that represents the effect size for the jth SNP and β0 is the LOR of background disease rate, we obtain the probability of the ith participant having the disease and use it to sample the binary disease status. We iterate these steps until we have 500 cases and 500 controls.
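The simulation steps above can be sketched in a few lines. This is not the simPathAR1Snp implementation from the aSPU package; it is a standard-library illustration that assumes the usual logistic form, P(disease) = 1 / (1 + exp(−(β0 + Σj βj Xij))), draws the AR-1 latent vector by recursion within each gene, and samples MAFs uniformly between 5% and 40%. All function names, the 5% background disease rate, and the toy sizes are hypothetical.

```python
import math
import random
from statistics import NormalDist


def ar1_latent(n_snps, rho, rng):
    """Standard normal vector (n_snps >= 1) with corr(z_i, z_j) = rho ** |i - j|."""
    latent = [rng.gauss(0.0, 1.0)]
    for _ in range(n_snps - 1):
        noise = math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        latent.append(rho * latent[-1] + noise)
    return latent


def draw_haplotype(snps_per_gene, rho, cutoffs, rng):
    """Dichotomize independent per-gene AR-1 latent vectors into 0/1 alleles."""
    haplotype = []
    for n_snps in snps_per_gene:
        for z in ar1_latent(n_snps, rho, rng):
            cutoff = cutoffs[len(haplotype)]
            haplotype.append(1 if z < cutoff else 0)
    return haplotype


def simulate_case_control(snps_per_gene, rho, betas, beta0, n_cases, n_controls, seed=0):
    """Sample genotypes until the requested numbers of cases and controls are met."""
    rng = random.Random(seed)
    n_total = sum(snps_per_gene)
    mafs = [rng.uniform(0.05, 0.40) for _ in range(n_total)]
    # P(z < cutoff) = MAF for a standard normal z.
    cutoffs = [NormalDist().inv_cdf(maf) for maf in mafs]
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        hap1 = draw_haplotype(snps_per_gene, rho, cutoffs, rng)
        hap2 = draw_haplotype(snps_per_gene, rho, cutoffs, rng)
        genotype = [a + b for a, b in zip(hap1, hap2)]  # minor-allele counts
        eta = beta0 + sum(b * x for b, x in zip(betas, genotype))
        p_disease = 1.0 / (1.0 + math.exp(-eta))  # assumed inverse logit
        if rng.random() < p_disease:
            if len(cases) < n_cases:
                cases.append(genotype)
        elif len(controls) < n_controls:
            controls.append(genotype)
    return cases, controls


# Toy run: 2 genes with 3 SNPs each; the first gene is causal (LOR 0.2).
cases, controls = simulate_case_control(
    snps_per_gene=[3, 3], rho=0.4,
    betas=[0.2, 0.2, 0.2, 0.0, 0.0, 0.0],
    beta0=math.log(0.05 / 0.95),  # hypothetical 5% background disease rate
    n_cases=20, n_controls=20, seed=1,
)
```

Setting `n_cases=500` and `n_controls=500` with 20 genes of 10 SNPs each would mirror the benchmark dimensions, at the cost of many more rejected draws because cases are rare under a low background rate.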
One of the novelties of our method is that we use the pathway structural information to borrow strength across interacting genes in pathway-based tests in addition to modeling the correlation among SNPs. We consider comparing our method with Gates-Simes, aSPUpath, and HYST because these methods take into account the SNP-level correlations but not the gene-level correlations. We implement the three competing methods using R package aSPU. We also include the minimum p value method, where we use the smallest p value of all SNPs to indicate the significance of a pathway. Since there are multiple approaches for estimation of ρ in our method, we include three of them in the simulations: (1) the empirical method, where we use the summary statistics of neighboring SNPs to estimate the ρ and use it as a constant throughout the MCMC posterior sampling; (2) the noninformative prior method, where we set ρ ∼ Beta(a0 = 1, b0 = 1); and (3) the informative prior method, where we adapt a0 and b0 so that Beta(a0, b0) has a mean of the ρ in the simulations. For (2) and (3), we use the MH algorithm to obtain the posterior samples of ρ because the full conditional of ρ is not available in closed form.
We simulate both the causal and null pathways and use the receiver operating characteristic (ROC) curve as the performance measurement that displays the trade-off between true and false positive rates for each method in these simulations. We check the true and false positive rate by the following steps: (1) in each simulation, generate randomly 500 causal and 500 null pathways; (2) apply each method to these pathways so that, for each method, we obtain 500 p values for causal pathways and 500 p values for null pathways; and (3) for a given threshold, the true or false positive rate is equal to the percentage of the corresponding 500 p values that are smaller than the given threshold. We then vary the threshold from 0 to 1 with an increment of 0.001 to calculate the corresponding true and false positive rates so that we can generate the ROC curves, which give a fair comparison of true positive rates for all methods at any given false positive rates.
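The threshold sweep in steps (1) to (3) can be written compactly. The sketch below takes the p values (or posterior null probabilities) for causal and null pathways, computes the true and false positive rates on a grid from 0 to 1 in steps of 0.001, and reads off the true positive rate at a target false positive rate. The function names and the toy p values are hypothetical.

```python
def roc_points(p_causal, p_null, step=0.001):
    """(FPR, TPR) pairs for thresholds 0, step, 2 * step, ..., 1."""
    n_steps = round(1.0 / step)
    points = []
    for k in range(n_steps + 1):
        threshold = k * step
        # A pathway is called significant when its p value is below the threshold.
        tpr = sum(p < threshold for p in p_causal) / len(p_causal)
        fpr = sum(p < threshold for p in p_null) / len(p_null)
        points.append((fpr, tpr))
    return points


def tpr_at_fpr(points, target_fpr=0.05):
    """Largest TPR among thresholds whose FPR does not exceed target_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= target_fpr)


# Toy p values (hypothetical); the simulations use 500 per group.
p_causal = [0.0005, 0.002, 0.3, 0.01]
p_null = [0.2, 0.5, 0.04, 0.9]
points = roc_points(p_causal, p_null)
power_at_5_percent = tpr_at_fpr(points, target_fpr=0.05)
```

Comparing methods at a matched false positive rate, rather than at a shared nominal threshold, keeps the comparison fair when a method's p values are not perfectly calibrated.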
We first investigate the effect of pathway effect size on the performance. We adjust the pathway effect size by increasing or decreasing all gene-level effect sizes in the pathway simultaneously. As shown in Figure 2, we observe a consistent pattern of performance across different scenarios of varying effect sizes with our method outperforming the others. We also note that our method has a greater advantage over aSPUpath in the low effect size scenarios, as shown in Figures 2A and 2B. We consider this a consequence of our method borrowing strength across the genes even when the gene-level effect size is very small.
FIGURE 2. Comparison of receiver operating characteristic (ROC) curves among methods in four scenarios with various effect sizes.

“Empirical”, “Non-info”, and “Info” represent empirical, noninformative prior, and informative prior, the three approaches for the estimation of ρ in the generalized fused hierarchical structured variable selection (HSVS) method, respectively. LOR, log odds ratio
We then adopt a different approach to adjusting the pathway effect size by either increasing or decreasing the number of causal genes in the pathway. We randomly choose 2 and 6 causal genes from the 10 causal genes in the benchmark pathway and set the remaining genes as noncausal (ie, LOR = 0) to produce the pathways that have 10% and 30% causal genes, respectively. The benchmark pathway contains 50% causal genes. To construct the pathway that has 100% causal genes, we set the 10 noncausal genes in the benchmark pathway as causal by giving them an effect size of 0.01 LOR. As shown in Figure 3, we again observe a consistent pattern in which our method outperforms others with greater advantage over aSPUpath in scenarios where we have a lower percentage of causal genes, as in Figure 3A and 3B.
FIGURE 3. Comparison of receiver operating characteristic (ROC) curves among methods in four scenarios with various percentages of causal genes.

“Empirical”, “Non-info”, and “Info” represent empirical, noninformative prior, and informative prior, the three approaches for the estimation of ρ in the generalized fused hierarchical structured variable selection (HSVS) method, respectively
We also investigate the effect of the number of SNPs per gene on the method performance by either increasing or decreasing the number of SNPs per gene in the pathway. The benchmark pathway contains a total of 200 SNPs with 10 SNPs per gene. We then simulate the same pathway using 7, 13, and 16 SNPs per gene, respectively. As shown in Figure 4, the advantage of our method increases as the number of SNPs per gene increases. This is expected because the estimation of gene-level effects benefits from the increased number of SNPs per gene, facilitating more accurate estimation of the gene-level correlation. In addition, although the general performance of our method is closely followed by that of aSPUpath in all four scenarios, we note that, in the elbow areas where the false positive rate is close to 0.05, the gain in the true positive rate of our method compared to aSPUpath becomes greater in scenarios where we have more SNPs per gene, as shown in Figure 4C and 4D.
FIGURE 4. Comparison of receiver operating characteristic (ROC) curves among methods in four scenarios with various numbers of single-nucleotide polymorphisms (SNPs) per gene.

“Empirical”, “Non-info”, and “Info” represent empirical, noninformative prior, and informative prior, the three approaches for the estimation of ρ in the generalized fused hierarchical structured variable selection (HSVS) method, respectively. LOR, log odds ratio
In the previous simulation set-ups, we have mostly assumed that all SNPs within a causal gene are also causal with the same effect size. We are then interested in reducing the percent of causal SNPs per causal gene and comparing the method’s performance. The benchmark pathway contains 100% causal SNPs per causal gene because all 10 SNPs in a causal gene are causal. To reduce the percent of causal SNPs, we set 3, 5, and 7 causal SNPs in a causal gene to noncausal (ie, LOR = 0), producing pathways that contain 70%, 50%, and 30% causal SNPs per causal gene, respectively. As shown in Figure S1 of the Supplemental Material, the performance gain of our method becomes smaller as the percent of causal SNPs per causal gene decreases. Our method slightly underperforms aSPUpath in both the 50% and 30% causal SNPs per causal gene scenarios, as shown in Figure S1C and S1D. The reduced performance is expected because the partial causal SNPs violate our model’s assumption and result in less accurate estimation of the gene-level effects. That being said, our method is still powerful when the model assumption is not exactly satisfied given that our method outperforms others except for aSPUpath in both the 70% and 50% causal SNPs per causal gene scenarios.
Another assumption we have made in the previous simulations is that the pathway structure information we pass to our model is complete. However, incomplete pathway information can occur in real data.23 We are therefore interested in the effect of incomplete pathway information on the performance of our method. The benchmark pathway has a total of 19 pairs of interacting genes. To make the pathway information incomplete, we randomly choose 5, 10, and 15 pairs to be excluded from the list of pairs we pass to our model so that it takes in approximately 75%, 50%, and 25% of the information on pathway structure, respectively. As shown in Figure S2 of the Supplemental Material, the performance gain of our method becomes smaller as the percentage of pathway information decreases. Our method underperforms aSPUpath in both the 50% and 25% pathway information scenarios though it still outperforms other methods. However, despite the reduced performance, the discrepancy between our method and aSPUpath in the true positive rate at a false positive rate of 0.05 is generally small as long as we have a decent amount of pathway information, as shown in Figures S2A, S2B, and S2C. The reduced performance is expected because the partial pathway information violates our model’s assumption and results in less accurate estimation of the gene-level correlation. As a consequence, our method may not be able to fully exploit the gene-level correlation and borrow strength from interacting genes due to the lack of pathway structural information.
We further investigate the effect of ρ of the AR-1 structure for SNPs in a gene on the method performance. In addition to the benchmark pathway that has ρ=0.4, we simulate the same pathway using ρ of 0.3, 0.5, and 0.6, respectively. As shown in Figure S3 of the Supplemental Material, the performance of each method becomes better and the discrepancy in performance becomes smaller as ρ increases.
In addition to using the ROC curves as the performance measurement, we compare the true positive rates for all methods evaluated at a false positive rate of 0.05, which corresponds to a significance level that is commonly used for flagging significance. Table 1 shows the true positive rates for all methods evaluated at a false positive rate of 0.05 in all simulated scenarios corresponding to Figures 2 to 4 and Figures S1 to S3 of the Supplemental Material. Our method has noticeably higher true positive rates than Gates-Simes, HYST, and the minimum p value method in all scenarios except for the one of 30% causal SNPs. Our method has comparable true positive rates with aSPUpath in many scenarios and has more gain in true positive rates when the number of SNPs per gene becomes larger. These results are consistent with our previous findings.
TABLE 1.
True positive rates evaluated at a false positive rate of 0.05 in all simulated scenarios for generalized fused hierarchical structured variable selection (HSVS), Gates-Simes, aSPUpath, HYST, and Minimum P value method
| Scenario | Emp^a | Noninfo^b | Info^c | Gates | aSPUpath | HYST | MinP |
|---|---|---|---|---|---|---|---|
| Effect Size | | | | | | | |
| 0.5BM^d | 0.194 | 0.192 | 0.192 | 0.046 | 0.206 | 0.078 | 0.052 |
| 0.75BM | 0.502 | 0.490 | 0.520 | 0.082 | 0.458 | 0.126 | 0.094 |
| BM | 0.786 | 0.774 | 0.818 | 0.134 | 0.748 | 0.204 | 0.142 |
| 1.25BM | 0.940 | 0.928 | 0.946 | 0.188 | 0.908 | 0.334 | 0.192 |
| Percentage of Causal Genes | | | | | | | |
| 10% | 0.222 | 0.260 | 0.272 | 0.080 | 0.276 | 0.116 | 0.090 |
| 30% | 0.670 | 0.676 | 0.714 | 0.106 | 0.594 | 0.188 | 0.118 |
| 50% | 0.786 | 0.774 | 0.818 | 0.134 | 0.748 | 0.204 | 0.142 |
| 100% | 0.836 | 0.830 | 0.846 | 0.142 | 0.834 | 0.224 | 0.138 |
| Number of SNPs per Gene | | | | | | | |
| 7 | 0.566 | 0.508 | 0.542 | 0.066 | 0.538 | 0.182 | 0.070 |
| 10 | 0.786 | 0.774 | 0.818 | 0.134 | 0.748 | 0.204 | 0.142 |
| 13 | 0.914 | 0.882 | 0.910 | 0.168 | 0.814 | 0.254 | 0.204 |
| 16 | 0.958 | 0.956 | 0.948 | 0.098 | 0.860 | 0.258 | 0.124 |
| Percentage of Causal SNPs | | | | | | | |
| 100% | 0.786 | 0.774 | 0.818 | 0.134 | 0.748 | 0.204 | 0.142 |
| 70% | 0.372 | 0.364 | 0.400 | 0.088 | 0.436 | 0.124 | 0.090 |
| 50% | 0.162 | 0.156 | 0.176 | 0.068 | 0.238 | 0.106 | 0.070 |
| 30% | 0.072 | 0.074 | 0.076 | 0.074 | 0.126 | 0.072 | 0.076 |
| Percentage of Pathway Information | | | | | | | |
| 100% | 0.786 | 0.774 | 0.818 | 0.134 | 0.748 | 0.204 | 0.142 |
| 75% | 0.764 | 0.726 | 0.762 | 0.134 | 0.716 | 0.204 | 0.142 |
| 50% | 0.660 | 0.690 | 0.678 | 0.134 | 0.726 | 0.204 | 0.142 |
| 25% | 0.634 | 0.620 | 0.604 | 0.134 | 0.726 | 0.204 | 0.142 |
| ρ of SNP AR-1 | | | | | | | |
| 0.3 | 0.728 | 0.696 | 0.700 | 0.126 | 0.692 | 0.218 | 0.126 |
| 0.4 | 0.786 | 0.774 | 0.818 | 0.134 | 0.748 | 0.204 | 0.142 |
| 0.5 | 0.824 | 0.746 | 0.864 | 0.104 | 0.748 | 0.274 | 0.152 |
| 0.6 | 0.802 | 0.726 | 0.862 | 0.234 | 0.860 | 0.394 | 0.224 |

^a Generalized fused HSVS using the empirical method for estimation of ρ
^b Generalized fused HSVS using the noninformative prior method for estimation of ρ
^c Generalized fused HSVS using the informative prior method for estimation of ρ
^d BM denotes the effect size of the benchmark pathway
Abbreviation: SNP, single-nucleotide polymorphism
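The quantity reported in Table 1, a true positive rate at a fixed false positive rate, can be sketched as below. This is an illustrative helper under stated assumptions rather than the authors’ code: the function name `tpr_at_fpr` is hypothetical, and each pathway is assumed to be summarized by one p-value-like score for which smaller values indicate stronger evidence of association.

```python
import numpy as np

def tpr_at_fpr(null_scores, causal_scores, fpr=0.05):
    """True positive rate when the cutoff flags a fraction `fpr` of null pathways.

    Scores are assumed to behave like p values: smaller means more significant.
    """
    null_scores = np.asarray(null_scores, dtype=float)
    causal_scores = np.asarray(causal_scores, dtype=float)
    # Cutoff calibrated on the null pathways only.
    cutoff = np.quantile(null_scores, fpr)
    # Share of causal pathways that fall at or below the calibrated cutoff.
    return float(np.mean(causal_scores <= cutoff))

# Toy illustration with made-up scores (not data from the study).
rng = np.random.default_rng(0)
null_scores = rng.uniform(size=500)            # null pathways: uniform scores
causal_scores = rng.beta(0.2, 1.0, size=500)   # causal pathways: skewed toward 0
print(round(tpr_at_fpr(null_scores, causal_scores), 3))
```

Calibrating the cutoff on simulated null pathways, rather than using a nominal 0.05, keeps the comparison fair across methods whose scores are on different scales.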
To examine the performance of our method in scenarios that mimic real data, we also perform simulations based on the WTCCC Crohn’s disease GWAS data. Specifically, we randomly choose four pathways from the WTCCC data, and for each pathway, we use bootstrapping to resample individuals’ genotypes with replacement from the entire sample population, generate new case/control phenotypes, and continue until 100 different sets of 500 cases and 500 controls are generated. We generate the binary phenotype for the ith patient using a logistic model in which the logit of the probability of having the disease is a linear function of that patient’s SNP genotypes with coefficients βj. We equate βj to the scaled LOR estimates from the regular logistic regression analysis of the real WTCCC data and set βj = 0 for a subset of SNPs to generate four pathways in which about 35% of SNPs per causal gene are causal. In addition, we set further βj to 0 to generate another four pathways in which the average percentage of causal SNPs per causal gene varies approximately from 5% to 20%. The real-data simulation complements our previous simulations in that (1) not all SNPs in a gene have equal effect sizes, (2) not all SNPs in a causal gene are causal, and (3) correlations among SNPs in a gene retain the original LD of the real genomic data. Furthermore, we generate noncausal pathways with all βj = 0, which are used to calibrate the significance threshold so that we control the false positive rate at 0.05.
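The resampling scheme in this paragraph can be sketched as follows. This is a rough sketch under several assumptions that are not taken from the study: the function name `simulate_case_control`, the intercept value, the toy genotype matrix, and the toy coefficients are all hypothetical, and the linear predictor is written in the usual logistic-regression form (intercept plus genotype-weighted sum of βj).

```python
import numpy as np

def simulate_case_control(genotypes, beta, intercept, n_cases, n_controls, rng):
    """Bootstrap individuals and draw disease status from a logistic model
    until `n_cases` cases and `n_controls` controls have been collected."""
    n = genotypes.shape[0]
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        idx = rng.integers(0, n, size=n)              # resample with replacement
        x = genotypes[idx]
        prob = 1.0 / (1.0 + np.exp(-(intercept + x @ beta)))
        y = rng.random(n) < prob                      # True = case
        cases.extend(x[y][: n_cases - len(cases)])
        controls.extend(x[~y][: n_controls - len(controls)])
    return np.array(cases), np.array(controls)

# Toy stand-in for real genotypes: 2000 individuals, 30 SNPs coded 0/1/2.
rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.3, size=(2000, 30))
beta = np.zeros(30)
beta[:5] = 0.2                                        # hypothetical causal SNPs
cases, controls = simulate_case_control(genotypes, beta, -1.0, 500, 500, rng)
print(cases.shape, controls.shape)
```

Because whole genotype rows are resampled, the correlation among SNPs in each simulated data set follows the LD of the source data. Setting every entry of `beta` to zero yields a noncausal pathway.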
Table 2 shows the true positive rates for the generalized fused HSVS, aSPUpath, Gates-Simes, and HYST evaluated at a false positive rate of 0.05 in simulations using the WTCCC data. The effect sizes are scaled so that the maximum true positive rate for each pathway is around 85%. For the generalized fused HSVS, we use only the empirical method to estimate ρ because of the similar performance among the three ρ estimation approaches as we have observed in previous simulations. As shown in Table 2, our method outperforms other methods in 3 out of the 4 pathways when about 35% of SNPs per causal gene are causal and in 2 out of the 4 pathways when fewer than 30% of SNPs per causal gene are causal, demonstrating the power of our method in more realistic scenarios. We note that our method generally has higher power when the average number of SNPs per causal gene is larger. This is in line with our previous finding that our method tends to have a higher power for pathways with more SNPs per gene because a larger number of SNPs per gene may lead to more accurate estimation of gene-level effects.
TABLE 2.
True positive rates evaluated at a false positive rate of 0.05 in simulations using the Wellcome Trust Case Control Consortium (WTCCC) data for generalized fused hierarchical structured variable selection (HSVS), aSPUpath, Gates-Simes, and HYST
| Pathway | % of Causal SNPs/Pathway | Avg. % of Causal SNPs/Causal Gene | Avg. # of SNPs/Causal Gene | GFHSVS | aSPUpath | Gates | HYST |
|---|---|---|---|---|---|---|---|
| 1 | 2.7 | 6.1 | 19.1 | 0.54 | 0.32 | 0.82 | 0.44 |
| 2 | 6.7 | 12.3 | 13.6 | 0.56 | 0.26 | 0.84 | 0.36 |
| 3 | 5.3 | 17.6 | 25.0 | 0.85 | 0.76 | 0.71 | 0.54 |
| 4 | 11.1 | 21.7 | 21.7 | 0.87 | 0.83 | 0.79 | 0.71 |
| 5 | 23.1 | 34.7 | 13.2 | 0.64 | 0.21 | 0.82 | 0.22 |
| 6 | 21.6 | 35.2 | 22.3 | 0.85 | 0.82 | 0.78 | 0.21 |
| 7 | 25.0 | 36.3 | 20.9 | 0.88 | 0.84 | 0.83 | 0.27 |
| 8 | 21.8 | 38.8 | 21.7 | 0.88 | 0.75 | 0.67 | 0.19 |
To summarize, our method outperforms other methods in most simulation scenarios, particularly in pathways that contain more SNPs per gene and have more complete pathway structural information, which allows our method to borrow strength from complex gene-gene interactions. The advantage diminishes when the pathway structural information is very incomplete or when the percentage of causal SNPs per causal gene decreases to less than 30% in pathways that contain fewer SNPs per causal gene.
4 |. APPLICATION TO WTCCC CROHN’S DISEASE GWAS DATA
We applied the generalized fused HSVS method, together with aSPUpath, Gates-Simes, and HYST, to the WTCCC Crohn’s disease GWAS data. For our method, we used the empirical estimates of ρ because the simulation studies showed that its performance is robust to the choice of estimation method for ρ. The dataset contains the genotype information of 500 568 SNPs for 5009 participants, comprising 2005 cases and 3004 controls. In the preprocessing stage, we followed the WTCCC’s quality control guideline, resulting in 1748 cases, 2938 controls, and 469 612 SNPs. Using the SNP information from Ensembl, a BioMart database,24,25 we translated the genotype data to the number of minor alleles and obtained the MAFs and SNP-gene mapping information. Using the KEGG REST server,26 we obtained the gene-pathway mapping and the pathway structural information, which we further used to obtain the pairs of interacting genes and, thus, specify the inverse covariance matrix for each pathway as in Equation (4). In line with other studies,7,8,27 we filtered out SNPs that have a MAF less than 1% and pathways that have fewer than 5 genes or more than 500 genes to facilitate the interpretation of results. As a result, our final dataset contains 323 pathways harboring 75 909 unique SNPs that are mapped to 5396 unique genes.
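The two filters in this paragraph (MAF of at least 1%, and 5 to 500 genes per pathway) can be sketched as below. The helper names, the toy genotype matrix, and the toy pathway dictionary are hypothetical; only the thresholds come from the text. Genotypes are assumed to be coded as allele counts of 0, 1, or 2.

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """MAF per SNP from an (individuals x SNPs) matrix of 0/1/2 allele counts."""
    freq = genotypes.mean(axis=0) / 2.0       # frequency of the counted allele
    return np.minimum(freq, 1.0 - freq)       # fold to the minor allele

def filter_pathways(pathway_genes, min_genes=5, max_genes=500):
    """Keep pathways whose number of genes lies in [min_genes, max_genes]."""
    return {pid: genes for pid, genes in pathway_genes.items()
            if min_genes <= len(genes) <= max_genes}

# Toy data: 4 individuals, 3 SNPs; the last SNP is monomorphic (MAF = 0).
genotypes = np.array([[0, 2, 0],
                      [1, 2, 0],
                      [0, 2, 0],
                      [0, 1, 0]])
maf = minor_allele_frequency(genotypes)
keep_snp = maf >= 0.01
print(maf, keep_snp)

# Toy pathways with made-up identifiers and gene labels.
pathways = {"pw_small": ["g1", "g2"],
            "pw_ok": ["g1", "g2", "g3", "g4", "g5"]}
print(list(filter_pathways(pathways)))
```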
The proposed method is fully Bayesian. However, in real-data analysis, we need to determine a threshold to flag significance of a pathway based on the posterior probabilities of γm. We conducted permutations to generate “null” pathways that were then used to calibrate the threshold so as to control the type-I error rate at 0.05 in the WTCCC data analysis. Specifically, we randomly chose 30 KEGG pathways from the WTCCC data and recalculated the p values after permuting the disease status for all participants. For each method, we obtained 1500 such p values by implementing 50 permutations for each of the 30 pathways and treated them as the p values of the null pathways. The four methods (generalized fused HSVS, aSPUpath, Gates-Simes, and HYST) achieved a false positive rate of 0.05 at calibrated thresholds of 0.085, 0.038, 0.044, and 0.034, respectively. We then applied the four methods to the WTCCC data for pathway-based association tests, with significance thresholds of 0.00026, 0.00012, 0.00014, and 0.00011, respectively, to control the family-wise type-I error rate at 0.05 after the Bonferroni correction for 323 pathways. Table 3 shows the 29 pathways that our method identified, 5 of which have been confirmed in other meta-analysis studies of inflammatory bowel disease and Crohn’s disease.28,29 In comparison, aSPUpath identified 31 pathways, 11 of which were also identified by our method. Gates-Simes identified 18 pathways and HYST identified 22.
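The Bonferroni-corrected thresholds quoted above follow from dividing each calibrated threshold by the number of pathways tested. The short check below uses only the numbers stated in this paragraph; the dictionary and variable names are illustrative.

```python
# Calibrated thresholds at a false positive rate of 0.05, as stated in the text.
calibrated = {
    "GFHSVS": 0.085,
    "aSPUpath": 0.038,
    "Gates-Simes": 0.044,
    "HYST": 0.034,
}
n_pathways = 323  # number of KEGG pathways tested

# Bonferroni correction: per-pathway threshold = calibrated threshold / tests.
bonferroni = {method: t / n_pathways for method, t in calibrated.items()}

for method, threshold in bonferroni.items():
    print(f"{method}: {threshold:.5f}")
```

Rounded to five decimal places, these reproduce the thresholds of 0.00026, 0.00012, 0.00014, and 0.00011.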
TABLE 3.
KEGG pathways identified by generalized fused hierarchical structured variable selection (HSVS) using Wellcome Trust Case Control Consortium (WTCCC) Crohn’s disease GWAS data
| KEGG ID | Pathway Name | SNPs/Gene^a | p Value, GFHSVS^b | p Value, aSPUpath | p Value, Gates | p Value, HYST |
|---|---|---|---|---|---|---|
| hsa04060* | Cytokine-cytokine receptor interaction | 8.4 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
| hsa04066 | HIF-1 signaling pathway | 15.4 | <0.0001 | <0.0001 | .0064 | .0078 |
| hsa04217 | Necroptosis | 8.1 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
| hsa04310 | Wnt signaling pathway | 16.9 | <0.0001 | <0.0001 | .0286 | .0028 |
| hsa04380* | Osteoclast differentiation | 12.6 | <0.0001 | <0.0001 | <0.0001 | .0013 |
| hsa04622 | RIG-I-like receptor signaling pathway | 8.3 | <0.0001 | <0.0001 | <0.0001 | .0006 |
| hsa04630* | JAK-STAT signaling pathway | 9.6 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
| hsa04668* | TNF signaling pathway | 11.0 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
| hsa05145 | Toxoplasmosis | 12.0 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
| hsa05166 | Human T-cell leukemia virus 1 infection | 10.0 | <0.0001 | <0.0001 | .0031 | <0.0001 |
| hsa05416 | Viral myocarditis | 20.0 | <0.0001 | <0.0001 | .0008 | <0.0001 |
| hsa00860 | Porphyrin and chlorophyll metabolism | 11.0 | <0.0001 | .6940 | .4738 | .2496 |
| hsa04068 | FoxO signaling pathway | 12.3 | <0.0001 | .0150 | .0076 | .0251 |
| hsa04120 | Ubiquitin mediated proteolysis | 11.5 | <0.0001 | .0060 | .0480 | .0399 |
| hsa04510 | Focal adhesion | 20.9 | <0.0001 | .0950 | .1861 | .2138 |
| hsa04540 | Gap junction | 28.7 | <0.0001 | .0052 | .0737 | .0392 |
| hsa04730 | Long-term depression | 36.5 | <0.0001 | .0530 | .0563 | .0336 |
| hsa04912* | GnRH signaling pathway | 24.6 | <0.0001 | .0025 | .0800 | .0265 |
| hsa04931 | Insulin resistance | 17.1 | <0.0001 | .0080 | .0065 | .0245 |
| hsa04934 | Cushing syndrome | 20.2 | <0.0001 | .0011 | .0821 | .0077 |
| hsa05010 | Alzheimer disease | 19.0 | <0.0001 | .0680 | .1050 | .2382 |
| hsa05034 | Alcoholism | 16.9 | <0.0001 | .1220 | .0430 | .4048 |
| hsa05160 | Hepatitis C | 11.0 | <0.0001 | .0470 | .0065 | .2614 |
| hsa05205 | Proteoglycans in cancer | 18.3 | <0.0001 | .0056 | .0121 | .0097 |
| hsa05211 | Renal cell carcinoma | 13.5 | <0.0001 | .0190 | .0182 | .0612 |
| hsa05231 | Choline metabolism in cancer | 17.8 | <0.0001 | .0160 | .0001 | .0010 |
| hsa05410 | Hypertrophic cardiomyopathy (HCM) | 33.6 | <0.0001 | .0005 | .0054 | .0046 |
| hsa05414 | Dilated cardiomyopathy (DCM) | 34.4 | <0.0001 | .0110 | .0057 | .0204 |
| hsa05418 | Fluid shear stress and atherosclerosis | 10.7 | .0001 | .2100 | .1334 | .1630 |
^a Average number of single-nucleotide polymorphisms (SNPs) per gene
^b Generalized fused HSVS
* Pathways that have been confirmed in other meta-analysis studies of inflammatory bowel disease and Crohn’s disease
We further investigated the features of the pathways on which aSPUpath and the generalized fused HSVS disagreed, that is, the pathways identified by only one of the two methods. The most distinctive feature is that the number of SNPs per gene is significantly lower in pathways flagged by aSPUpath but not by our method than in pathways identified by our method but not by aSPUpath (mean: 11.5 vs 19.9, t-test p value: 0.001). This is consistent with the simulation results shown in Section 3, where our method tends to have a higher true positive rate for pathways with more SNPs per gene because a larger number of SNPs per gene may lead to more accurate estimation of the gene-level effect sizes and correlations.
5 |. DISCUSSION
In this study, we propose a Bayesian hierarchical prior, the generalized fused HSVS prior, to implement pathway-based GWASs using SNP-level summary statistics given the SNP-gene-pathway structure. Specifically, we use a discrete mixture prior for the vector of SNP-level summary statistics of a pathway, which is composed of a point mass at zero, corresponding to the null hypothesis, and a multivariate scale-mixing normal distribution, corresponding to the alternative hypothesis. The hierarchical prior uses lasso-type regularizations so that it is robust to noise, controls type-I errors, and flexibly accounts for correlations at both the SNP and gene levels, which leads to increased power in pathway-based association tests. Because it uses summary statistics, our method is also more widely applicable, since individual-level genotype data are not always available.
The KEGG database is a well-established database that contains rich information on biological pathways including gene-gene interaction detected by biochemical or genetic methods.30 It has been popularly used in pathway-based GWASs for grouping SNPs and genes that belong to the same pathway. However, the gene-gene interaction information has not been utilized via gene-level correlations by most existing methods. Because correlations may exist among genes when they interact with each other in a pathway through biological mechanisms such as activation and phosphorylation and causal SNPs within interacting genes are also likely to have correlated effects on the phenotype, it is desirable to borrow information from correlated genes by appropriately modeling the gene-level correlations, an important and distinctive feature of our method.
We illustrate our method through extensive simulations and demonstrate its advantage in a wide range of scenarios. In particular, simulation results suggest that our method outperforms competing methods overall in pathways that contain more SNPs per gene and have more complete pathway structural information, which allows our method to borrow strength from complex gene-gene interactions. However, its advantage tends to diminish when we have only limited pathway structural information, or a low percentage of causal SNPs per causal gene in pathways that contain fewer SNPs per causal gene. This is not surprising because either condition leads to less accurate estimates of gene-level correlations and effects, leaving little strength to borrow from correlated genes.
We apply our method to the WTCCC Crohn’s disease GWAS data. For the SNP-level correlation in a gene, we have assumed an AR-1 structure and estimate the between-SNP correlation ρ either empirically or using the MH algorithm given a Beta prior on ρ. In practice, an alternative approach worthy of consideration is to obtain the empirical LD matrix for SNPs in a gene from established databases such as the 1000 Genomes Project,31 and use it to replace the AR-1 correlation matrix in the likelihood model. Other prior knowledge we may also exploit is the nature of gene-level biological mechanisms in a pathway because a few mechanisms, such as inhibition, may indicate a negative correlation between two genes. By incorporating additional prior knowledge, we would expect a more realistic representation of SNP- and gene-level correlations, which may, in turn, further improve the power of our method when applied to real data.
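For reference, an AR-1 correlation matrix has entries ρ^|i−j| for SNPs i and j, so correlation decays geometrically with the distance between SNP indices. The sketch below builds such a matrix and, as one possible alternative, computes an empirical correlation matrix from a reference genotype panel. The function names and the toy panel are hypothetical, and Pearson correlation of allele counts is used here only as a simple stand-in for an LD matrix.

```python
import numpy as np

def ar1_correlation(n_snps, rho):
    """AR-1 correlation matrix with entry (i, j) equal to rho ** |i - j|."""
    idx = np.arange(n_snps)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def empirical_ld(genotypes):
    """Pearson correlation between SNP columns of an (individuals x SNPs) panel.

    Monomorphic SNPs have zero variance and would produce NaN entries,
    so they should be removed beforehand.
    """
    return np.corrcoef(genotypes, rowvar=False)

R = ar1_correlation(4, 0.4)   # rho = 0.4 as in the benchmark simulation
print(np.round(R, 3))

# Toy reference panel: 200 individuals, 4 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
panel = rng.binomial(2, 0.3, size=(200, 4))
print(np.round(empirical_ld(panel), 2))
```

Either matrix has unit diagonal and is symmetric, so one can in principle be swapped for the other wherever the SNP-level correlation enters the likelihood.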
The runtime of our method to complete its computation of 5000 MCMC posterior samples for a simulated pathway of 20 genes and 200 SNPs is 18.10 seconds, whereas the runtime is 3.53, 0.08, and 0.08 seconds for aSPUpath, Gates-Simes, and HYST, respectively. The fact that our method has a longer runtime than other methods did not cause an intolerable time issue in the real data analysis because the real-data analysis requires each method to run only once for each pathway. In addition, we have the option to implement parallel computing for our method, which, as shown in our osteosarcoma trio data analyses in the work of Yang et al,17 can improve the computational efficiency by 25.7% for our method.
Our method is a self-contained test that tests whether a pathway is associated with a phenotype. It can be extended to a competitive test that tests whether a pathway is more associated with a phenotype than a pathway that contains other genes by incorporating in the likelihood model the summary statistics for the pathway that contains other genes, as suggested by de Leeuw et al,32 for example,
where Zm are the SNP-level summary statistics for the mth pathway of interest and Zc are the SNP-level summary statistics for the pathway that contains other genes, and test H0 ∶ θm = 0. We note that, although self-contained tests have been reported to be more powerful than competitive tests,33 the former may not be sufficient for biological inferences when heritability for strongly polygenic phenotypes is spread across a large number of genes, so that any sufficiently large gene set is likely to achieve significance in self-contained tests.34
We model the gene-level correlations using equal magnitude. The advantages of using equal magnitude are mainly twofold. First, it allows us to use a single parameter λ2m to control the general smoothness and correlations in the effect estimates for all correlated genes. Second, it produces a more computationally efficient Gibbs sampler than using different magnitudes. On the other hand, it might be more realistic to model the gene-level correlations with different magnitudes. One approach is to replace the common magnitude with a weighted or pair-specific magnitude for each pair of interacting genes so that the gene-level correlations can have different magnitudes and the inference of gene-level epistatic interactions for any given pair of genes becomes possible. The weights can be empirically determined, for example, by the effect estimates or the sum of squares from gene-level regression analyses.32 Similarly, SNP-level weights can also be applied to relax our model assumption that SNPs within a gene share the same gene-level effect size.
Our method is currently applicable to common variants. It can be extended to handle rare variants with a modified likelihood model that takes in the genotype data. To deal with the decreased power in the presence of a low percentage of causal SNPs per gene as seen in some simulations, another extension of our method is to use a likelihood model that may better account for the maximum effect size rather than the mean effect size; examples include the distributions of maxima related to extreme value theory, such as the Gumbel distribution. We leave these potential extensions to our future studies.
ACKNOWLEDGEMENTS
The authors are grateful to the editor and the reviewers for their suggestions and comments. This study is supported in part by grant to S.B. from the National Institutes of Health/National Institute on Drug Abuse R01DA033958 and R21DA046188. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the WTCCC data is available at www.wtccc.org.uk. Funding for the WTCCC project was provided by the Wellcome Trust under award 076113.
Funding information
National Institute of Health/National Institute on Drug Abuse, Grant/Award Number: R01DA033958 and R21DA046188
APPENDIX. POSTERIOR INFERENCE VIA MCMC
In this section, we present the full conditional distributions for the generalized fused HSVS prior. In the following, GIG denotes the generalized inverse Gaussian distribution, and IG denotes the inverse gamma distribution.
where and ;
where
Footnotes
DATA AVAILABILITY STATEMENT
Individual-level genotype data and summary genotype statistics for WTCCC1 collections are held within the European Genotype Archive, http://www.ebi.ac.uk/ega. Access is available by application to the Wellcome Trust Case Control Consortium Data Access Committee.
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.
REFERENCES
- 1. Michailidou K, Beesley J, Lindstrom S, et al. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer. Nat Genet. 2015;47(4):373–80.
- 2. Michailidou K, Lindström S, Dennis J, et al. Association analysis identifies 65 new breast cancer risk loci. Nature. 2017;551(7678):92–94.
- 3. Morris AP, Voight BF, Teslovich TM, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet. 2012;44(9):981–90.
- 4. Mahajan A, Go MJ, Zhang W, et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat Genet. 2014;46(3):234–44.
- 5. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45(D1):D353–D361.
- 6. Barabasi A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56–68.
- 7. Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet. 2010;11(12):843–854.
- 8. Pan W, Kwak IY, Wei P. A powerful pathway-based adaptive test for genetic association with common or rare variants. Am J Hum Genet. 2015;97(1):86–98.
- 9. Li MX, Kwan JSH, Sham PC. Hyst: a hybrid set-based test for genome-wide association studies, with application to protein-protein interaction-based association analysis. Am J Hum Genet. 2012;91(3):478–488.
- 10. Gui H, Li M, Sham PC, Cherny SS. Comparisons of seven algorithms for pathway analysis using the WTCCC Crohn’s disease dataset. BMC Res Notes. 2011;4:386.
- 11. Kao PYP, Leung KH, Chan LWC, Yip SP, Yap MKH. Pathway analysis of complex diseases for GWAS, extending to consider rare variants, multi-omics and interactions. Biochim Biophys Acta Gen Subj. 2017;1861(2):335–353.
- 12. Kwak IY, Pan W. Adaptive gene- and pathway-trait association testing with GWAS summary statistics. Bioinformatics. 2016;32(8):1178–1184.
- 13. Wakefield J. Bayes factors for genome-wide association studies: comparison with p-values. Genet Epidemiol. 2009;33(1):79–86.
- 14. Maller JB, McVean G, Byrnes J, et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet. 2012;44(12):1294–1301.
- 15. He X, Fuller CK, Song Y, et al. Sherlock: detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am J Hum Genet. 2013;92(5):667–680.
- 16. Zhang L, Baladandayuthapani V, Mallick BK, et al. Bayesian hierarchical structured variable selection methods with application to molecular inversion probe studies in breast cancer. J Royal Stat Soc Ser C Appl Stat. 2014;63(4):595–620.
- 17. Yang Y, Basu S, Mirabello L, Spector L, Zhang L. A Bayesian gene-based genome-wide association study analysis of osteosarcoma trio data using a hierarchically structured prior. Cancer Inform. 2018;17.
- 18. Stingo FC, Chen YA, Tadesse MG, et al. Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Ann Appl Stat. 2011;5(3):1978–2002.
- 19. Burton PR, Clayton DG, Cardon LR, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678.
- 20. Shimamura K, Ueki M, Kawano S, Konishi S. Bayesian generalized fused lasso modeling via NEG distribution. Commun Stat-Theory Methods. 2018;48:4132–4153.
- 21. Park T, Casella G. The Bayesian lasso. J Am Stat Assoc. 2008;103(482):681–686.
- 22. Kyung M, Gill J, Ghosh M, Casella G. Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 2010;5(2):369–411.
- 23. Demir E, Babur O, Dogrusoz U, et al. Patika: an integrated visual environment for collaborative construction and analysis of cellular pathways. Bioinformatics. 2002;18(7):996–1003.
- 24. Durinck S, Moreau Y, Kasprzyk A, et al. Biomart and bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21(16):3439–3440.
- 25. Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc. 2009;4(8):1184–1191.
- 26. Tenenbaum D. KEGGREST: client-side REST access to KEGG. R package version 1.22.0. 2018.
- 27. Chen LS, Hutter CM, Potter JD, et al. Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data. Am J Hum Genet. 2010;86(6):860–871.
- 28. Franke A, McGovern DP, Barrett JC, et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat Genet. 2010;42(12):1118–1125.
- 29. Jostins L, Ripke S, Weersma RK, et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature. 2012;491(7422):119–124.
- 30. Walhout AJ, Vidal M. Protein interaction maps for model organisms. Nat Rev Mol Cell Biol. 2001;2(1):55–62.
- 31. Altshuler DM, Durbin RM, Abecasis GR, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65.
- 32. de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLOS Comput Biol. 2015;11(4):e1004219.
- 33. Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23(8):980–987.
- 34. de Leeuw CA, Neale BM, Heskes T, Posthuma D. The statistical properties of gene-set analysis. Nat Rev Genet. 2016;17(6):353–364.
