Abstract
While genome-wide association studies (GWAS) often collect data on multiple correlated traits for complex diseases, conventional gene-based analysis is usually univariate and therefore treats traits as uncorrelated. Multivariate analysis of multiple correlated traits can potentially increase the power to detect genes that affect some or all of these traits. In this study, we propose the multivariate hierarchically structured variable selection (HSVS-M) model, a flexible Bayesian model that tests the association of a gene with multiple correlated traits. With only summary statistics, HSVS-M can account for the correlations among genetic variants and among traits simultaneously and can also estimate the various directions and magnitudes of associations between a gene and multiple traits. Simulation studies show that HSVS-M substantially outperforms competing methods in various scenarios, particularly when variants in a gene are associated with a trait in similar directions and magnitudes. We applied HSVS-M to the summary statistics of a meta-analysis GWAS on four lipid traits from the Global Lipids Genetics Consortium and identified 15 genes that have also been confirmed as risk factors in previous studies.
Keywords: Gene-based GWAS, Hierarchical variable selection, Multiple traits, Multivariate GWAS, Summary statistics
1. Introduction
Genome-wide association studies (GWASs) have been widely used to detect associations between genetic variants and complex diseases. One of the popular paradigms of analysis in GWASs is the gene-based association test. Compared to the conventional approach that tests single-nucleotide polymorphisms (SNPs) individually, gene-based tests are often more powerful due to the reduced burden of multiple testing and the information borrowed from SNPs in linkage disequilibrium (LD). However, current gene-based tests are mostly limited to univariate analysis (Pan et al., 2014, 2015; Wu et al., 2011; Yang et al., 2018, 2020, 2021) that tests the association of a gene with only one trait at a time. Multiple correlated traits are often measured and studied together in the context of complex diseases. For example, total cholesterol, high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, and triglycerides jointly contribute to the risk factors for coronary artery disease (Teslovich et al., 2010). There can be genes that affect some or all of these traits, and as a result, analyzing multiple correlated traits simultaneously can potentially increase the power to detect associations of genes with a disease. Such joint analysis also has the potential to reveal pleiotropic genes involved in the biological mechanism of the disease.
Recognizing the increasing availability of multiple traits in GWASs, gene-based multivariate association methods have been proposed to jointly analyze multiple traits. For example, MSATS uses an adaptive procedure to combine the variance component test and the burden test based on a multivariate normal regression model for summary statistics (Guo and Wu, 2019). MTaSPUsSet uses an adaptive combination procedure based on a weighted sum of powered score statistics for each individual variant and trait (Kwak and Pan, 2017). USAT is a unified score-based method that adaptively combines the sum of squared score test (Pan, 2009) and the multivariate analysis of variance to adjust for different patterns of association between genetic variants and traits (Ray et al., 2016). MGAS combines top association signals from conventional univariate SNP-based tests into a multivariate gene-based p-value based on an extended Simes procedure (Van der Sluis et al., 2015). Some of these methods, such as USAT, require individual-level genotype data that are not always available. Moreover, few of these methods can effectively estimate the directions and magnitudes of associations between a gene and different traits, which are often of primary interest in multivariate GWASs (Stephens, 2013).
In this study, we propose the multivariate hierarchically structured variable selection (HSVS-M) model for testing the association of a gene with multiple traits. The HSVS framework was previously considered for univariate gene- and pathway-based GWASs using summary statistics (Yang et al., 2018, 2020) and genotype data (Yang et al., 2021). We extend the HSVS framework in this study to multivariate gene-based GWASs that require only summary statistics, which are typically meant for common (minor allele frequency (MAF)≥5%) and low-frequency (1%≤MAF<5%) variants. Specifically, we propose in HSVS-M a discrete mixture Bayesian prior composed of a point mass at zero that supports the null hypothesis and a multivariate scale-mixing normal distribution that models the effects of SNPs in a gene under the alternative hypothesis while accounting for the linkage disequilibrium among the SNPs and the correlation among the traits. The posterior inference gives the posterior probability of the null hypothesis, which is closely related to Bayes factors and can be interpreted as a Bayesian analog to a p-value for hypothesis testing (Yang et al., 2020). We also introduce in HSVS-M a novel trait-adaptive weight that flexibly models the various associations between a gene and multiple traits. Because a gene can be associated with each trait differently, ignoring such differences and treating traits equally can weaken the power. The trait-adaptive weight allows HSVS-M to effectively estimate the directions and magnitudes of associations between a gene and each trait and thus increases the power of HSVS-M. Therefore, HSVS-M tackles not only whether but also how a gene is associated with multiple traits, which has not been effectively addressed by most competing methods.
The rest of the paper is organized as follows. We propose the HSVS-M prior in Section 2. We demonstrate through simulations that HSVS-M with its trait-adaptive weights is often more powerful than the competing methods in Section 3.1. We apply HSVS-M to a meta-analysis GWAS for lipid traits and use HSVS-M to identify genes that are associated with lipid traits in Section 3.2. Finally, Section 4 gives a brief discussion.
2. Methods
2.1. HSVS-M for gene-based multivariate GWAS
Consider a gene of p SNPs and k traits for each patient. Let Zi· = (zi1, …, zik)′ be the ith SNP’s summary statistics for the k traits, θi be the effect of the ith SNP’s strongest association with the k traits, w = (w1, …,wk)′ be the gene-specific trait-adaptive weight, R be the LD matrix for p SNPs, and Σ be the correlation matrix for k traits. We consider the model
Z ∼ N(Wθ, ΣZ),
in which Z = (Z1·′, …, Zp·′)′, θ = (θ1, …, θp)′, W = Ip ⊗ w, and ΣZ = R ⊗ Σ; θ and w are parameters to be inferred whereas R and Σ can be empirically estimated. Here, w is a vector with each element defined in [−1, 1] measuring the direction and magnitude of the gene’s association with each trait relative to the trait that has the strongest association with the gene. Because different traits can have different associations with the gene and treating traits equally can weaken the signal and power, we use w to account for the difference in traits and thus boost the power by adaptively weighing the association between the gene and each trait. We use the same w for all p SNPs, assuming that their relative strength and direction of association with the k traits are the same across SNPs in the same gene, which also allows HSVS-M to conduct posterior inference on the association between the gene and each of the k traits. Σ can be empirically estimated by calculating the covariance matrix for (Z·1, …, Z·k), in which Z·j = (z1j, …, zpj)′ is the jth trait’s summary statistics for p SNPs. R can be obtained from the LD information in established databases such as the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015). Therefore, the model accounts for both the LD among SNPs and the correlation among traits in the covariance.
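As a minimal sketch of how these quantities fit together, the NumPy snippet below builds W = Ip ⊗ w and ΣZ = R ⊗ Σ, forms the product Wθ, and estimates Σ from a p × k matrix of summary statistics. The function names (`estimate_trait_cov`, `build_design_and_cov`) and all toy inputs are assumptions made for illustration; they are not taken from the HSVS-M R package.

```python
import numpy as np

def estimate_trait_cov(z_matrix):
    """Empirical k x k covariance of the trait columns (Z.1, ..., Z.k).

    z_matrix has one row per SNP and one column per trait.
    """
    return np.cov(z_matrix, rowvar=False)

def build_design_and_cov(w, ld_matrix, trait_cov):
    """Return W = I_p (x) w and Sigma_Z = R (x) Sigma."""
    p = ld_matrix.shape[0]
    w_col = np.asarray(w, dtype=float).reshape(-1, 1)  # k x 1 column
    W = np.kron(np.eye(p), w_col)                      # (p*k) x p
    sigma_z = np.kron(ld_matrix, trait_cov)            # (p*k) x (p*k)
    return W, sigma_z

# Toy example with p = 3 SNPs and k = 2 traits (all values hypothetical).
theta = np.array([2.0, 1.0, 0.0])
w = np.array([1.0, -0.5])
R = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.5],
              [0.25, 0.5, 1.0]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

W, Sigma_Z = build_design_and_cov(w, R, Sigma)
mean_z = W @ theta  # equals theta (x) w, ordered SNP by SNP
```

Because W is block diagonal with w in each block, Wθ stacks θi·w for each SNP, which is why a single gene-level w scales every SNP's effect across the k traits in the same way.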
Our interest is to test the null hypothesis H0 : θ = 0; that is, no SNPs in the gene are associated with the k traits. We propose to test the hypothesis using HSVS-M, a discrete mixture prior for the gene selection with a Bayesian fused lasso hierarchy modeling the effect of individual SNPs in the gene. The HSVS-M prior is structured as
θ | γ ∼ (1 − γ) δ0 + γ N(0, σ2Σθ),
in which δ0 is a point mass at zero and γ is a binary indicator on θ that we use for the hypothesis testing. The value of γ indicates whether we reject the null hypothesis, and P(γ = 0|·) is considered a Bayesian analog to a p value and is closely related to the Bayes factor (Yang et al., 2020).
Under the alternative hypothesis, we assume that θ follows a multivariate normal distribution N(0, σ2Σθ). The (i, j)th element of Σθ⁻¹, the precision matrix, gives the conditional dependence between SNPs i and j. This motivates specifying Σθ through the elements of its inverse.
Here, we use the Bayesian fused lasso hierarchy (Casella et al., 2010) on θ to induce correlation in the effect of neighboring SNPs and thus to account for their LD, which allows HSVS-M to borrow information from SNPs in LD to boost the power. This is achieved by further specifying exponential hyperpriors for the two sets of scale parameters in Σθ.
These exponential hyperpriors lead to exponential-scale mixture normal priors, which are equivalent to a Bayesian lasso (Park and Casella, 2008) and a Bayesian fused lasso, respectively. To see this, we collapse the hierarchical priors and obtain the marginal prior of θ, which consists of two L1-norm penalty terms.
The marginal prior indicates that under the alternative hypothesis, HSVS-M induces two shrinkage effects through the L1-norm regularization: one independently shrinks the |θi|’s, the individual SNP effects, to control the type I error rates, whereas the other independently shrinks the |θi+1 − θi|’s, the differences between neighboring SNP effects, to borrow strength from neighboring SNPs in LD.
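For reference, the standard Bayesian fused lasso hierarchy described by Casella et al. (2010) can be written as below. The symbols τi², ωi², λ1, and λ2 are notation assumed here for illustration and may differ from the notation and exact parameterization used in HSVS-M; the sketch only shows the general form of the hierarchy under the alternative hypothesis and the marginal prior obtained after integrating out the scale parameters.

```latex
\begin{aligned}
\theta \mid \sigma^2, \tau^2, \omega^2 &\sim N\!\left(0, \sigma^2 \Sigma_\theta\right), \\
\left(\Sigma_\theta^{-1}\right)_{ii} &= \frac{1}{\tau_i^2} + \frac{1}{\omega_{i-1}^2} + \frac{1}{\omega_i^2},
\qquad
\left(\Sigma_\theta^{-1}\right)_{i,i+1} = \left(\Sigma_\theta^{-1}\right)_{i+1,i} = -\frac{1}{\omega_i^2},
\qquad \frac{1}{\omega_0^2} = \frac{1}{\omega_p^2} = 0, \\
\tau_i^2 &\sim \mathrm{Exp}\!\left(\lambda_1^2 / 2\right), \quad i = 1, \dots, p,
\qquad
\omega_i^2 \sim \mathrm{Exp}\!\left(\lambda_2^2 / 2\right), \quad i = 1, \dots, p - 1, \\
\pi\!\left(\theta \mid \sigma^2\right) &\propto
\exp\!\left( -\frac{\lambda_1}{\sigma} \sum_{i=1}^{p} \lvert \theta_i \rvert
 - \frac{\lambda_2}{\sigma} \sum_{i=1}^{p-1} \lvert \theta_{i+1} - \theta_i \rvert \right).
\end{aligned}
```

In this form, all other entries of the precision matrix are zero, so the quadratic form θ′Σθ⁻¹θ equals Σi θi²/τi² + Σi (θi+1 − θi)²/ωi². The first sum corresponds to the lasso penalty on individual SNP effects and the second to the fusion penalty on neighboring differences.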
To further boost the power, HSVS-M incorporates a trait-adaptive weight w to account for the various directions and magnitudes of the associations between a gene and k traits. We consider a flat prior for w
wj ∼ Uniform(bj1, bj2), j = 1, …, k, (1)
in which t > 0 is a decision threshold that determines the values of bj1 and bj2, the lower and upper bounds for wj, respectively. We will discuss the choice of t in the following section. We assume in this study that wj is the same across the p SNPs, i.e., the relative strength and direction of association with the k traits are the same across SNPs in the same gene. We constrain the range of |wj| to [0, 1] so that θi measures the maximum effect of the ith SNP on the k traits. To avoid identifiability issues, we use a data-adaptive procedure to empirically determine the direction of wj from an empirical statistic of the jth trait, except when that statistic is close to zero, in which case both directions remain possible, as shown in Equation 1. The flat prior leads to a truncated normal posterior distribution for wj with a weighted mean of Z·j based on the SNP-level effect θ, which allows HSVS-M to borrow information from all SNPs while adaptively updating wj. In addition, HSVS-M can also use the posterior samples for wj to conduct posterior inference on the direction and magnitude of the association between the gene and the jth trait.
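The thresholding idea behind the prior bounds can be sketched as follows. Both the direction statistic (here, the mean of the jth trait's summary statistics) and the exact piecewise form of the bounds are assumptions made for illustration; the sketch only reflects the description above, in which the sign of wj is fixed when the evidence exceeds t and left free, as Uniform(−1, 1), otherwise. The function names are hypothetical.

```python
import numpy as np

def weight_prior_bounds(direction_stat, t=0.5):
    """Return (lower, upper) bounds of the uniform prior for one weight w_j.

    direction_stat is an assumed empirical statistic for trait j.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    if direction_stat > t:
        return (0.0, 1.0)    # clear positive direction
    if direction_stat < -t:
        return (-1.0, 0.0)   # clear negative direction
    return (-1.0, 1.0)       # statistic close to zero: direction left free

def draw_prior_weight(z_trait, t=0.5, rng=None):
    """Draw w_j from its uniform prior, using the mean of z_trait (assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    lower, upper = weight_prior_bounds(float(np.mean(z_trait)), t)
    return rng.uniform(lower, upper)
```

Under this sketch, a larger t sends more traits to the unconstrained Uniform(−1, 1) branch, whereas a smaller t fixes the sign of wj more often.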
For the remaining priors and hyperpriors, we specify the prior for γ as Bernoulli(pγ), and the hyperprior for pγ as Beta(a, b) with constant parameters a and b. We also specify exponential priors for the two sets of scale parameters in Σθ, and hyperpriors of Gamma(r1, δ1) and Gamma(r2, δ2) for their respective regularization parameters, with constant shape parameters r1 and r2 and constant rate parameters δ1 and δ2. For σ2, we use the improper prior π(σ2) ∝ 1/σ2. These choices of priors and hyperpriors lead to closed-form full conditional posterior distributions, as shown in Appendix A, allowing an efficient Gibbs sampler.
The full HSVS-M model is formulated as follows:
2.2. Choice of hyperparameters
For the beta hyperprior on pγ, we set (a, b) = (0.1, 0.1) to induce a non-informative prior for γ. Different parameterizations of (a, b) can be used for more informative choices; for example, a sparse hyperprior parameterization can be used with an empirical Bayes estimate of b (Yang et al., 2018). For the two gamma hyperpriors, we set (r1, δ1) = (r2, δ2) = (0.1, 0.1) to again induce non-informative priors. For the uniform prior on wj, we set t = 0.5 to achieve a balance between the power and the accuracy of weight estimates; overall, HSVS-M is robust to the choice of t, and the effects of different t’s on HSVS-M’s power and weight estimation are evaluated in Section 3.1.1.
3. Results
3.1. Simulation studies
3.1.1. Power of HSVS-M and competing methods
We conducted simulation studies to compare the power of HSVS-M with MSATS and MTaSPUsSet. Both methods are among the latest state-of-the-art multi-SNP multi-trait methods based on summary statistics and are accessible through R packages. We additionally included MTaSPUs (Kim et al., 2015), a single-SNP multi-trait method, to compare with the multi-SNP multi-trait methods. Because MTaSPUs is a SNP-based method, we used the aggregated Cauchy association test (Liu et al., 2019) to combine SNP-based p-values into a gene-based p-value. We also included HSVS-M with different t’s as a prior sensitivity test because t, as a decision threshold, is more subject to arbitrary choices compared to other (hyper)parameters, for which we used non-informative priors. To mimic the real data from the meta-analysis GWAS for four lipid traits (Teslovich et al., 2010) in the simulations, we considered four traits and set Σ, the trait-level correlation, as the empirical correlation matrix for the four lipid traits in the real data. Particularly, total cholesterol has a high correlation with low-density lipoprotein cholesterol (i.e., Σ13 = 0.8787) and a moderate correlation with triglycerides (i.e., Σ14 = 0.3842). We considered 20 SNPs per gene and set R, the SNP-level correlation, as the LD matrix of a random gene that has 20 SNPs in the real data. In line with the simulation method in MSATS (Guo and Wu, 2019), we simulated GWAS summary statistics Z from N(S ⊗ P, R ⊗ Σ), in which S is the SNP-level effect vector of length 20 and P is the trait-level weight vector of length four that indicates the directions and magnitudes of the associations between the gene and the four traits. In all simulations, we let S = c × (2, 1.5, 1, 0.5, 0, …, 0)′ with four causal SNPs and c being the effect size scale parameter, and let the jth element of P, Pj ∈ {−1, 0, 1}, which indicates negative association, no association, and positive association, respectively, between the gene and the jth trait.
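The Cauchy combination step used to turn the SNP-based p-values of MTaSPUs into a gene-based p-value can be sketched as below, following the general form of the aggregated Cauchy association test of Liu et al. (2019). The function name `acat_pvalue` is hypothetical, and the equal-weight default is an assumption, since the weights used in the simulations are not stated.

```python
import numpy as np

def acat_pvalue(pvalues, weights=None):
    """Combine p-values with the Cauchy combination (ACAT) statistic."""
    p = np.asarray(pvalues, dtype=float)
    if np.any((p <= 0.0) | (p >= 1.0)):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        w = np.full(p.shape, 1.0 / p.size)   # equal weights (assumed default)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                      # normalize weights to sum to one
    # Each p-value is mapped to a standard Cauchy variate under the null.
    stat = np.sum(w * np.tan((0.5 - p) * np.pi))
    # Upper-tail probability of a standard Cauchy distribution at stat.
    return 0.5 - np.arctan(stat) / np.pi
```

A single p-value is returned unchanged by this transformation, and one very small p-value dominates the combined result. For extremely small inputs, the arctangent form loses numerical precision, so a tail approximation may be preferable in practice.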
We evaluated a method’s power in scenarios with various types of associations between a gene and four traits using different P’s, as shown in Table 1.
Table 1.
Simulation scenarios with various types of associations between a gene and four traits using different P’s
| P | # of + traits¹ | # of − traits² | # of noise traits |
|---|---|---|---|
| (1 1 1 1) | 4 | 0 | 0 |
| (0 1 1 1) | 3 | 0 | 1 |
| (0 1 1 0) | 2 | 0 | 2 |
| (0 1 0 0) | 1 | 0 | 3 |
| (−1 1 −1 1) | 2 | 2 | 0 |
| (0 1 1 −1) | 2 | 1 | 1 |
| (0 −1 1 0) | 1 | 1 | 2 |
| (0 −1 0 0) | 0 | 1 | 3 |
¹ Traits that are positively associated with a gene
² Traits that are negatively associated with a gene
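The data-generating step for the scenarios in Table 1, Z ∼ N(S ⊗ P, R ⊗ Σ), can be sketched as follows. The AR(1) LD matrix and the trait correlation matrix are placeholders: only Σ13 = 0.8787 and Σ14 = 0.3842 come from the text, and the remaining off-diagonal entries are set to zero purely for illustration. The function names and the random seed are likewise hypothetical.

```python
import numpy as np

def ar1_corr(p, rho):
    """Placeholder LD matrix with entries rho^|i - j| (not the real-data LD)."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_summary_stats(S, P, R, Sigma, rng):
    """Draw one Z ~ N(S (x) P, R (x) Sigma), ordered SNP by SNP."""
    mean = np.outer(S, P)                    # p x k matrix of means
    A = np.linalg.cholesky(R)                # A @ A.T == R
    B = np.linalg.cholesky(Sigma)            # B @ B.T == Sigma
    noise = rng.standard_normal(mean.shape)  # independent N(0, 1) entries
    # Row-major flattening of A @ noise @ B.T has covariance R (x) Sigma.
    return (mean + A @ noise @ B.T).reshape(-1)

c = 1.0                                          # effect size scale parameter
S = c * np.array([2.0, 1.5, 1.0, 0.5] + [0.0] * 16)
P = np.array([0.0, 1.0, 1.0, -1.0])              # one scenario from Table 1
R = ar1_corr(20, 0.5)
Sigma = np.array([[1.0,    0.0, 0.8787, 0.3842],
                  [0.0,    1.0, 0.0,    0.0],
                  [0.8787, 0.0, 1.0,    0.0],
                  [0.3842, 0.0, 0.0,    1.0]])
rng = np.random.default_rng(2024)
Z = simulate_summary_stats(S, P, R, Sigma, rng)  # vector of length 20 * 4
```

Sampling through the two Cholesky factors avoids forming the full 80 × 80 covariance matrix, although drawing directly from a multivariate normal with covariance R ⊗ Σ would be equivalent at this size.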
To demonstrate the value of estimating the trait-adaptive weight w within HSVS-M, we also included HSVS-MC, a competing method in which empirical estimates of the weights are plugged into HSVS-M. We varied the effect size scale parameter c to generate a power curve for each method. When c = 0, we simulated 1,000 null replicated datasets and obtained a calibrated significance threshold α that preserves the type I error rate at 0.05 for each method. When c ≠ 0, a method’s power was estimated over 1,000 replicates as the proportion of replicates with P(γ = 0|·) < α for HSVS-M and HSVS-MC, using 500 burn-in and 1,000 sampling Markov chain Monte Carlo (MCMC) iterations, or with p values < α for MSATS, MTaSPUsSet, and MTaSPUs.
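The calibration and power calculation described above can be sketched as follows. Taking α as the empirical 5% quantile of the null statistics is an assumption about how the calibrated threshold is obtained, and the function names are hypothetical. The per-replicate statistic is P(γ = 0|·) for HSVS-M and HSVS-MC, estimated as the fraction of retained MCMC draws with γ = 0, or a p value for the other methods.

```python
import numpy as np

def posterior_null_probability(gamma_draws, burn_in=500):
    """Estimate P(gamma = 0 | data) from MCMC draws after discarding burn-in."""
    kept = np.asarray(gamma_draws)[burn_in:]
    return float(np.mean(kept == 0))

def calibrate_threshold(null_stats, level=0.05):
    """Threshold alpha such that about `level` of null statistics fall below it."""
    return float(np.quantile(null_stats, level))

def estimate_power(alt_stats, alpha):
    """Proportion of replicates whose statistic is below the threshold."""
    return float(np.mean(np.asarray(alt_stats) < alpha))
```

With 1,000 retained draws, P(γ = 0|·) can only take values on a grid of width 0.001, so ties among null replicates are likely and the achieved type I error rate may differ slightly from the nominal 0.05.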
As shown in Figure 1, HSVS-M outperformed the competing methods in most scenarios and was followed by MSATS, HSVS-MC, MTaSPUs, and MTaSPUsSet. HSVS-M was substantially more powerful than HSVS-MC, its constant-weight counterpart, which demonstrates the benefit of the trait-adaptive weight. HSVS-M was also robust to the presence of traits that have a negative or no association with the gene. We note that the power of HSVS-M, HSVS-MC, and MSATS was higher when P = (0, 1, 1, 1)′ or P = (0, 1, 1, 0)′ than when P = (1, 1, 1, 1)′, which seems counterintuitive. A probable explanation is that the first trait has a high correlation with the third trait and a moderate correlation with the fourth trait (i.e., Σ13 = 0.8787 and Σ14 = 0.3842). Collinearity is therefore likely to occur and reduce the power when the first trait is associated with the gene in the same manner as the third and the fourth traits, as in P = (1, 1, 1, 1)′. HSVS-M’s power is not sensitive to the choice of t in most scenarios: the difference in power among different t’s is negligibly small. In a few scenarios, HSVS-M’s power increases very slightly as t increases.
Figure 1. Power comparison of HSVS-M, HSVS-MC, MSATS, MTaSPUsSet, and MTaSPUs.
In each panel, the horizontal axis indicates the effect size scale parameter c in the SNP-level effect vector S = c × (2, 1.5, 1, 0.5, 0, …, 0)′. P is the trait-level weight vector that indicates the directions and magnitudes of the associations between the gene and the four traits. Power was estimated as the proportion of P(γ = 0|·) < α for HSVS-M and HSVS-MC or p values < α for MSATS, MTaSPUsSet, and MTaSPUs among 1,000 replicated datasets, with α being the calibrated significance threshold that preserves the type I error rate at 0.05. HSVS-M with different t’s are included with t being the decision threshold in the prior for the gene-specific trait-adaptive weight w.
The power of HSVS-M arises from the trait-adaptive weights w that HSVS-M uses to borrow strength from multiple correlated traits. To see this, we show in Figure 2 HSVS-M’s estimated trait-adaptive weights for the various associations between genes and traits corresponding to each simulation scenario. The estimated weights were calculated as the median of the posterior median of w among all replicated datasets. As the effect size increases, HSVS-M is more likely to produce accurate weight estimates for non-null traits and distinguish traits that are in different associations with the gene. This not only increases the power of HSVS-M but also allows it to identify the direction and magnitude of the association between a gene and a trait, which is a desirable feature that effectively addresses the question of how a gene is associated with each trait in multivariate GWASs. The accuracy of HSVS-M’s weight estimates for non-null traits increases slightly when t decreases in scenarios of small or medium effect sizes. This is because a larger t would lead more wj’s to have a prior distribution of Uniform(−1, 1), whereas a smaller t is more likely to capture the direction of non-null traits when the effect size is not large.
Figure 2. HSVS-M’s estimated trait-adaptive weights for various types of associations between a gene and four traits.
In each panel, the horizontal axis indicates the effect size scale parameter c in the SNP-level effect vector S = c×(2, 1.5, 1, 0.5, 0, …, 0)′. P is the trait-level weight vector that indicates the directions and magnitudes of the associations between the gene and the four traits. Estimated weights were calculated as the median of the posterior median of w among 1,000 replicated datasets. HSVS-M with different t’s are included with t being the decision threshold in the prior for the gene-specific trait-adaptive weight w.
Although we could not include every possible scenario, we have strived to cover a broad range of scenarios by using various combinations of positive, negative, and null associations in P’s. For skipped scenarios such as P = (−1, −1, −1, −1), the power results were almost identical to those of P = (1, 1, 1, 1), and HSVS-M’s weight estimates were those of P = (1, 1, 1, 1) with a negative sign, analogous to the relationship between P = (0, 1, 0, 0) and P = (0, −1, 0, 0).
We further implemented additional simulations, in which we assume different P’s for different SNPs, to investigate the robustness of HSVS-M when the model assumptions are violated. Specifically, instead of assuming the same P for all SNPs, we sampled Pik from N(μk, 1) for the ith SNP and kth trait. As shown in Figure S1 in the Supplementary Material, HSVS-M was more powerful than MTaSPUsSet and MTaSPUs, though less powerful than MSATS when the effect size is large. This is not surprising because MSATS is based on the variance-component test, which gains more power when the directions and magnitudes of associations between SNPs and traits vary to a greater extent.
3.1.2. Type I error rates of HSVS-M and competing methods
HSVS-M uses a calibrated significance threshold to preserve the type I error rate. We randomly chose 1,000 genes from the real data and simulated 1,000 replicated datasets of null Z from N(0,R⊗Σ) with respective LD matrices. We then obtained the calibrated significance threshold α that gives a false positive rate of 0.05 using the null Z. To evaluate the preservation of the type I error rates for each method, we applied α to the 1,000 replicated datasets of null Z that we generated in the simulations based on a randomly selected gene and calculated the type I error rate for each method.
In Table 2, we show the type I error rates of HSVS-M and the competing methods using each method’s calibrated significance threshold. The results indicate that all methods control the type I error rate at around 0.05.
Table 2.
Type I error rates of HSVS-M, HSVS-MC, MSATS, MTaSPUsSet, and MTaSPUs using calibrated significance thresholds
| HSVS-M | HSVS-MC | MSATS | MTaSPUsSet | MTaSPUs |
|---|---|---|---|---|
| 0.051 | 0.054 | 0.049 | 0.050 | 0.046 |
Type I error rates were calculated as the proportion of P(γ = 0|·) < α for HSVS-M and HSVS-MC or p values < α for MSATS, MTaSPUsSet, and MTaSPUs under the null hypothesis, with α being each method’s calibrated significance threshold.
3.2. Application to GWAS summary statistics for lipid traits
We applied HSVS-M to the meta-analysis summary statistics of GWASs for total cholesterol (TC), high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), and triglycerides (TG) (Teslovich et al., 2010) from the Global Lipids Genetics Consortium with the aim of identifying genes that are associated with lipid traits. The dataset contains the Z-statistics of 2,647,705 SNPs for the four traits from a meta-analysis of 46 lipid GWASs with a total of more than 100,000 participants. We also obtained the coding region of genes and the chromosomal position and the minor allele frequency (MAF) of SNPs from Ensembl, a BioMart database (Durinck et al., 2005, 2009). In line with other studies (Guo and Wu, 2019), we filtered out SNPs that have a p value < 5 × 10⁻⁸ for any trait so that we demonstrate the power of HSVS-M as a multivariate gene-based approach in spite of the lack of univariate genome-wide significant individual SNPs. We further filtered out SNPs that have a MAF < 0.05 and grouped SNPs within 20 kb of a gene coding region into a gene. Our final dataset contains 20,541 genes that harbor 990,860 unique SNPs.
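The SNP-level filters and the gene assignment rule described above can be sketched as follows. The function names and the input layout are hypothetical, and converting Z-statistics to two-sided normal p values is an assumption about how the p value filter is applied.

```python
import math

def two_sided_p(z):
    """Two-sided p value of a Z-statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def keep_snp(z_by_trait, maf, p_cut=5e-8, maf_cut=0.05):
    """Keep a SNP if it is common and not genome-wide significant for any trait."""
    if maf < maf_cut:
        return False
    return all(two_sided_p(z) >= p_cut for z in z_by_trait)

def in_gene_window(position, gene_start, gene_end, flank=20_000):
    """True if a SNP lies in the coding region or within `flank` bp of it."""
    return gene_start - flank <= position <= gene_end + flank
```

For example, a SNP with Z = 6 for any single trait has a two-sided p value of about 2 × 10⁻⁹ and is removed, regardless of its statistics for the other three traits.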
HSVS-M identified 249 significant genes, among which 15 genes have been confirmed as risk factors in previous studies of lipid traits. Specifically, fatty acid desaturase 3 (FADS3) has been implicated for HDL-C and TG in a meta-analysis of GWASs for lipids in Caucasians (Kathiresan et al., 2009) and increased risk for familial combined hyperlipidemia that is characterized by familial segregation of elevated fasting plasma TG and TC (Plaisier et al., 2009). Diacylglycerol O-acyltransferase 2 (DGAT2) is the major enzyme of TG biosynthesis in eukaryotes (Stone et al., 2009) and has been found to facilitate the expansion of lipid droplets (Xu et al., 2012) that play a central role in most cardiovascular diseases. Monoacylglycerol acyltransferase-2 (MGAT2) has been found to interact with DGAT2 to promote TG biosynthesis (Jin et al., 2014). Solute carrier family 10 member 1 (SLC10A1) has been demonstrated to effectively transport eprotirome, a thyroid hormone analog that lowers cholesterol in humans (Kersseboom et al., 2017). Acyl-CoA thioesterase 4 (ACOT4) acts as an auxiliary enzyme in human peroxisomes, which have an established role in lipid metabolism (Hunt et al., 2012, 2014). Integrin subunit beta 3 (ITGB3) has been found to be associated with TG in the non-Hispanic black population (Chang et al., 2010). Cytochrome P450 1B1 (CYP1B1) has been shown to significantly lower HDL-C in incinerator workers (Hu et al., 2008). Forkhead box a2 (FOXA2) has been found to increase HDL-C levels in mice by regulating apolipoprotein M (Wolfrum et al., 2008). Non-coding RNA activated by DNA damage (NORAD) has been confirmed as an upstream mechanism to regulate transthyretin, while patients with a higher level of transthyretin tend to have higher cholesterol (Zuo et al., 2020).
Studies have shown that ADP ribosylation factor related protein 1 (ARFRP1) is essential for lipid droplet growth (Hommel et al., 2010) and the deletion of ARFRP1 results in impaired lipidation of very low density lipoproteins, which in turn leads to reduced plasma TG levels in the fasted state (Hesse et al., 2014). A GWAS meta-analysis of 188,000 individuals has also shown that ubiquitin conjugating enzyme e2 l3 (UBE2L3) is in close proximity with miR-130b and miR-301b, the two miRNAs that control the expression of key proteins involved in the LDL receptor, and plectin (PLEC) is in close proximity with miR-661, a miRNA in genome-wide significant association with TC and LDL-C (Wagschal et al., 2015). MicroRNA 128–2 (MIR128–2) has been shown to be a regulator of cholesterol homeostasis in that cholesterol efflux is attenuated by overexpression of MIR128–2 (Adlakha et al., 2013). Growth differentiation factor 9 (GDF9) has been found to facilitate cholesterol biosynthesis by promoting cholesterol biosynthetic enzymes (Su et al., 2008). R-spondin 3 (RSPO3) has been identified as harboring a SNP in genome-wide significant association with HDL-C (Ligthart et al., 2016). Table 3 shows the P(γ = 0|·) for HSVS-M and the p values for MSATS and MTaSPUsSet for these 15 risk factor genes.
Table 3.
Analysis of 15 genes that have been identified as risk factors in previous studies of lipid traits and confirmed by HSVS-M
| Gene | HSVS-M | MSATS | MTaSPUsSet |
|---|---|---|---|
| FADS3 | <1.00e-06*† | 6.21e-14* | <1.00e-06* |
| UBE2L3 | <1.00e-06* | 1.10e-07* | 2.10e-04 |
| DGAT2 | <1.00e-06* | 2.27e-06* | 5.99e-03 |
| RSPO3 | <1.00e-06* | 3.06e-06 | 5.00e-06 |
| ITGB3 | <1.00e-06* | 2.63e-05 | 1.20e-03 |
| PLEC | <1.00e-06* | 3.23e-05 | 4.00e-05 |
| NORAD | <1.00e-06* | 8.60e-05 | 2.00e-04 |
| ARFRP1 | <1.00e-06* | 4.37e-03 | 1.70e-02 |
| MIR128-2 | <1.00e-06* | 5.95e-03 | 1.42e-01 |
| ACOT4 | <1.00e-06* | 1.66e-02 | 5.26e-01 |
| CYP1B1 | <1.00e-06* | 1.79e-02 | 8.63e-01 |
| FOXA2 | <1.00e-06* | 2.65e-02 | 1.50e-01 |
| GDF9 | <1.00e-06* | 4.20e-02 | 1.57e-01 |
| MGAT2 | <1.00e-06* | 1.23e-01 | 3.09e-01 |
| SLC10A1 | <1.00e-06* | 6.78e-01 | 5.36e-01 |
*: Identified by the corresponding methods after the Bonferroni corrections
†: P(γ = 0|·) < 1.00e-06 indicates P(γ = 0|·) = 0 with 10⁶ MCMC samples
In comparison, MSATS and MTaSPUsSet identified 249 and 163 significant genes, respectively, after controlling each method’s family-wise type I error rate at 0.05 with the Bonferroni correction. However, MSATS identified only 3 of these 15 risk factor genes and MTaSPUsSet identified one of them. We note that although both HSVS-M and MSATS identified 249 genes, there were only 12 genes identified by both methods. We consider this as a result of the difference in the power of methods. For HSVS-M, because we use a gene-specific weight under the model assumption that SNPs in a gene are associated with a trait in similar directions and magnitudes, HSVS-M would gain more power when this assumption holds. On the other hand, MSATS is based on the variance-component test, which gains more power when the directions and magnitudes of associations between SNPs and traits vary to a greater extent.
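The Bonferroni threshold implied by this comparison can be worked out as in the sketch below, assuming the family-wise level of 0.05 is divided by the 20,541 genes in the final dataset. The helper names are hypothetical. The second helper gives the smallest nonzero posterior probability that an average over a given number of MCMC samples can take, which indicates whether a run is long enough to resolve the threshold.

```python
def bonferroni_threshold(n_tests, family_level=0.05):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return family_level / n_tests

def mcmc_resolution(n_samples):
    """Smallest nonzero probability an average of n_samples indicators can take."""
    return 1.0 / n_samples

threshold = bonferroni_threshold(20541)        # roughly 2.4e-6
fine_enough = mcmc_resolution(10**6) < threshold
too_coarse = mcmc_resolution(1000) > threshold
```

Under this assumption, 10⁶ MCMC samples are enough to distinguish posterior probabilities below the threshold, whereas 1,000 samples are not, which is consistent with using short runs only for initial screening.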
To demonstrate HSVS-M’s trait-adaptive weights, we also show as an example in Table 4 the estimated weights of the associations between the traits and PLEC, a gene identified by HSVS-M but not the competing methods. The estimated weights were calculated as the posterior median of w. PLEC has strong negative associations with TC and LDL-C, a weak negative association with HDL-C, and a weak positive association with TG, which are in line with the finding that PLEC is in close proximity with a miRNA significantly associated with TC and LDL-C (Wagschal et al., 2015).
Table 4.
Estimated weights of the associations between PLEC and the lipid traits using HSVS-M
| Trait | Weight¹ [95% CI²] |
|---|---|
| TC | −0.93 [−0.99, −0.77] |
| HDL-C | −0.20 [−0.33, −0.07] |
| LDL-C | −0.98 [−0.99, −0.84] |
| TG | 0.16 [0.05, 0.31] |
¹ Estimated weights were calculated as the posterior median of w with a range of [−1, 1]
² CI: credible interval
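Summaries of the kind reported in Table 4 can be computed from posterior draws of a weight as in the sketch below. The function name is hypothetical, and the use of an equal-tailed interval is an assumption, since the type of credible interval is not stated.

```python
import numpy as np

def summarize_weight(draws, level=0.95):
    """Posterior median and equal-tailed credible interval of one weight w_j."""
    draws = np.asarray(draws, dtype=float)
    tail = 100.0 * (1.0 - level) / 2.0   # e.g. 2.5 for a 95% interval
    lower, upper = np.percentile(draws, [tail, 100.0 - tail])
    return float(np.median(draws)), (float(lower), float(upper))
```

An interval that excludes zero, such as the one for TC in Table 4, suggests that the direction of the association between the gene and that trait is well determined by the data.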
4. Discussion
In this study, we propose HSVS-M, a gene-based multi-trait method that uses summary statistics and a gene-specific trait-adaptive weight both to test the association of variants in a gene with multiple traits and to estimate the strength of association of the gene with each trait. HSVS-M also provides an integrated framework both to test and to interpret associations, an important feature that addresses how traits are associated with a gene, which is often a question of primary interest in multi-trait GWASs (Stephens, 2013) but has not been tackled by most existing gene-based multivariate association tests.
As an integral part of HSVS-M, the gene-specific trait-adaptive weight w allows HSVS-M to produce accurate weight estimates and identify traits that are in different associations with the gene. In this study, we use a uniform distribution as the prior for wj. Other more informative choices of priors can be considered given different assumptions about the association between the gene and the traits. For example, a normal prior can be used if one has prior biological evidence that the weight for the average strength of association between the gene and the jth trait would be around a particular value.
We demonstrate the power of HSVS-M in various simulated scenarios and applications to the real data. Specifically, HSVS-M is powerful when multiple traits are correlated and SNPs in a gene are associated with a trait in similar directions and magnitudes. HSVS-M’s superior power is mainly due to the gene-specific trait-adaptive weight w that HSVS-M uses to borrow strength from the traits. A SNP-specific weight wi = (wi1, …, wik)′ can also be a viable choice of weight, which would presumably give HSVS-M more latitude in modeling the association between SNPs and traits in case the gene-level weight assumption is violated. However, a SNP-specific weight does not directly yield a gene-level weight estimate, which would make it less straightforward to draw inference about the strength of association between a gene and a trait.
The runtime of HSVS-M to complete 1,000 MCMC posterior samples for a simulated replicate of 20 variants and 4 traits is 0.30 seconds on a 2.3 GHz Intel Core i5 processor, whereas the runtime is 0.02 and 0.95 seconds for MSATS and MTaSPUsSet, respectively. In large-scale analyses where more MCMC samples might be desired due to the Bonferroni correction, we suggest a multi-phase analysis in which researchers first apply HSVS-M to each gene with 1,000 MCMC samples to evaluate its significance and keep only promising genes in the analysis. Researchers can then apply HSVS-M to these few promising genes with more MCMC samples. To further achieve scalability, we have the option to implement parallel computing for HSVS-M, which, as shown in our osteosarcoma trio data analyses in the work of Yang et al. (2018), can potentially improve the computational efficiency by 25.7% for our method.
Supplementary Material
Acknowledgments
This study is supported in part by grants to S.B. from the National Institutes of Health/National Institute on Drug Abuse 5R01DA033958-02 and 1R21DA046188-01A1. This study makes use of data generated by the Global Lipids Genetics Consortium.
Appendix A
Posterior inference for HSVS-M via MCMC
We present the full conditional distributions for the HSVS-M prior as follows. GIG is the generalized inverse Gaussian distribution and IG is the inverse gamma distribution.
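[The full conditional distributions and the definitions of their parameters were displayed as equations here and are not recoverable from this text.]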
Data Availability Statement
The summary statistics for the four lipid traits from the Global Lipids Genetics Consortium are publicly available at http://csg.sph.umich.edu/willer/public/lipids2010/. The R package for HSVS-M is publicly available at https://yiyangphd.github.io/software/.
References
- 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, and Abecasis GR (2015). A global reference for human genetic variation. Nature, 526(7571):68–74.
- Adlakha YK, Khanna S, Singh R, Singh VP, Agrawal A, and Saini N (2013). Pro-apoptotic mirna-128–2 modulates abca1, abcg1 and rxrα expression and cholesterol homeostasis. Cell Death Dis, 4:e780.
- Casella G, Ghosh M, Gill J, and Kyung M (2010). Penalized regression, standard errors, and bayesian lassos. Bayesian analysis, 5(2):369–411.
- Chang M. h., Yesupriya A, Ned RM, Mueller PW, and Dowling NF (2010). Genetic variants associated with fasting blood lipids in the u.s. population: Third national health and nutrition examination survey. BMC Med Genet, 11:62.
- Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, and Huber W (2005). Biomart and bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, 21(16):3439–3440.
- Durinck S, Spellman PT, Birney E, and Huber W (2009). Mapping identifiers for the integration of genomic datasets with the r/bioconductor package biomart. Nat Protoc, 4(8):1184–91.
- Guo B and Wu B (2019). Powerful and efficient snp-set association tests across multiple phenotypes using gwas summary data. Bioinformatics, 35(8):1366–1372.
- Hesse D, Radloff K, Jaschke A, Lagerpusch M, Chung B, Tailleux A, Staels B, and Schürmann A (2014). Hepatic trans-golgi action coordinated by the gtpase arfrp1 is crucial for lipoprotein lipidation and assembly. J Lipid Res, 55(1):41–52.
- Hommel A, Hesse D, Völker W, Jaschke A, Moser M, Engel T, Blüher M, Zahn C, Chadt A, Ruschke K, Vogel H, Kluge R, Robenek H, Joost H-G, and Schürmann A (2010). The arf-like gtpase arfrp1 is essential for lipid droplet growth and is involved in the regulation of lipolysis. Mol Cell Biol, 30(5):1231–42.
- Hu S-W, Lin P, and Chen C-C (2008). Association of cytochrome p450 1b1 gene expression in peripheral leukocytes with blood lipid levels in waste incinerator workers. Ann Epidemiol, 18(10):784–91.
- Hunt MC, Siponen MI, and Alexson SEH (2012). The emerging role of acyl-coa thioesterases and acyltransferases in regulating peroxisomal lipid metabolism. Biochim Biophys Acta, 1822(9):1397–410.
- Hunt MC, Tillander V, and Alexson SEH (2014). Regulation of peroxisomal lipid metabolism: the role of acyl-coa and coenzyme a metabolizing enzymes. Biochimie, 98:45–55.
- Jin Y, McFie PJ, Banman SL, Brandt C, and Stone SJ (2014). Diacylglycerol acyltransferase-2 (dgat2) and monoacylglycerol acyltransferase-2 (mgat2) interact to promote triacylglycerol synthesis. J Biol Chem, 289(41):28237–48.
- Kathiresan S, Willer CJ, Peloso GM, Demissie S, Musunuru K, Schadt EE, Kaplan L, Bennett D, Li Y, Tanaka T, Voight BF, Bonnycastle LL, Jackson AU, Crawford G, Surti A, Guiducci C, Burtt NP, Parish S, Clarke R, Zelenika D, Kubalanza KA, Morken MA, Scott LJ, Stringham HM, Galan P, Swift AJ, Kuusisto J, Bergman RN, Sundvall J, Laakso M, Ferrucci L, Scheet P, Sanna S, Uda M, Yang Q, Lunetta KL, Dupuis J, de Bakker PIW, O’Donnell CJ, Chambers JC, Kooner JS, Hercberg S, Meneton P, Lakatta EG, Scuteri A, Schlessinger D, Tuomilehto J, Collins FS, Groop L, Altshuler D, Collins R, Lathrop GM, Melander O, Salomaa V, Peltonen L, Orho-Melander M, Ordovas JM, Boehnke M, Abecasis GR, Mohlke KL, and Cupples LA (2009). Common variants at 30 loci contribute to polygenic dyslipidemia. Nat Genet, 41(1):56–65.
- Kersseboom S, van Gucht ALM, van Mullem A, Brigante G, Farina S, Carlsson B, Donkers JM, van de Graaf SFJ, Peeters RP, and Visser TJ (2017). Role of the bile acid transporter slc10a1 in liver targeting of the lipid-lowering thyroid hormone analog eprotirome. Endocrinology, 158(10):3307–3318.
- Kim J, Bai Y, and Pan W (2015). An adaptive association test for multiple phenotypes with gwas summary statistics. Genet Epidemiol, 39(8):651–63.
- Kwak I-Y and Pan W (2017). Gene- and pathway-based association tests for multiple traits with gwas summary statistics. Bioinformatics, 33(1):64–71.
- Ligthart S, Vaez A, Hsu Y-H, Inflammation Working Group of the CHARGE Consortium, PMI-WG-XCP, LifeLines Cohort Study, Stolk R, Uitterlinden AG, Hofman A, Alizadeh BZ, Franco OH, and Dehghan A (2016). Bivariate genome-wide association study identifies novel pleiotropic loci for lipids and inflammation. BMC Genomics, 17:443.
- Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, and Lin X (2019). Acat: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics, 104(3):410–421.
- Pan W (2009). Asymptotic tests of association with multiple snps in linkage disequilibrium. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society, 33(6):497–507.
- Pan W, Kim J, Zhang Y, Shen X, and Wei P (2014). A powerful and adaptive association test for rare variants. Genetics, 197(4):1081–95.
- Pan W, Kwak I-Y, and Wei P (2015). A powerful pathway-based adaptive test for genetic association with common or rare variants. Am J Hum Genet, 97(1):86–98.
- Park T and Casella G (2008). The bayesian lasso. J Am Stat Assoc, 103(482):681–686.
- Plaisier CL, Horvath S, Huertas-Vazquez A, Cruz-Bautista I, Herrera MF, Tusie-Luna T, Aguilar-Salinas C, and Pajukanta P (2009). A systems genetics approach implicates usf1, fads3, and other causal candidate genes for familial combined hyperlipidemia. PLoS Genet, 5(9):e1000642.
- Ray D, Pankow JS, and Basu S (2016). Usat: A unified score-based association test for multiple phenotype-genotype analysis. Genetic epidemiology, 40(1):20–34.
- Stephens M (2013). A unified framework for association analysis with multiple related phenotypes. PLoS One, 8(7):e65245.
- Stone SJ, Levin MC, Zhou P, Han J, Walther TC, and Farese RV (2009). The endoplasmic reticulum enzyme dgat2 is found in mitochondria-associated membranes and has a mitochondrial targeting signal that promotes its association with mitochondria. Journal of Biological Chemistry, 284(8):5352–5361.
- Su Y-Q, Sugiura K, Wigglesworth K, O’Brien MJ, Affourtit JP, Pangas SA, Matzuk MM, and Eppig JJ (2008). Oocyte regulation of metabolic cooperativity between mouse cumulus cells and oocytes: Bmp15 and gdf9 control cholesterol biosynthesis in cumulus cells. Development, 135(1):111–21.
- Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, Koseki M, Pirruccello JP, Ripatti S, Chasman DI, Willer CJ, Johansen CT, Fouchier SW, Isaacs A, Peloso GM, Barbalic M, Ricketts SL, Bis JC, Aulchenko YS, Thorleifsson G, Feitosa MF, Chambers J, Orho-Melander M, Melander O, Johnson T, Li X, Guo X, Li M, Shin Cho Y, Jin Go M, Jin Kim Y, Lee J-Y, Park T, Kim K, Sim X, Twee-Hee Ong R, Croteau-Chonka DC, Lange LA, Smith JD, Song K, Hua Zhao J, Yuan X, Luan J, Lamina C, Ziegler A, Zhang W, Zee RYL, Wright AF, Witteman JCM, Wilson JF, Willemsen G, Wichmann H-E, Whitfield JB, Waterworth DM, Wareham NJ, Waeber G, Vollenweider P, Voight BF, Vitart V, Uitterlinden AG, Uda M, Tuomilehto J, Thompson JR, Tanaka T, Surakka I, Stringham HM, Spector TD, Soranzo N, Smit JH, Sinisalo J, Silander K, Sijbrands EJG, Scuteri A, Scott J, Schlessinger D, Sanna S, Salomaa V, Saharinen J, Sabatti C, Ruokonen A, Rudan I, Rose LM, Roberts R, Rieder M, Psaty BM, Pramstaller PP, Pichler I, Perola M, Penninx BWJH, Pedersen NL, Pattaro C, Parker AN, Pare G, Oostra BA, O’Donnell CJ, Nieminen MS, Nickerson DA, Montgomery GW, Meitinger T, McPherson R, McCarthy MI, McArdle W, Masson D, Martin NG, Marroni F, Mangino M, Magnusson PKE, Lucas G, Luben R, Loos RJF, Lokki M-L, Lettre G, Langenberg C, Launer LJ, Lakatta EG, Laaksonen R, Kyvik KO, Kronenberg F, König IR, Khaw K-T, Kaprio J, Kaplan LM, Johansson A, Jarvelin M-R, Janssens ACJW, Ingelsson E, Igl W, Kees Hovingh G, Hottenga J-J, Hofman A, Hicks AA, Hengstenberg C, Heid IM, Hayward C, Havulinna AS, Hastie ND, Harris TB, Haritunians T, Hall AS, Gyllensten U, Guiducci C, Groop LC, Gonzalez E, Gieger C, Freimer NB, Ferrucci L, Erdmann J, Elliott P, Ejebe KG, Döring A, Dominiczak AF, Demissie S, Deloukas P, de Geus EJC, de Faire U, Crawford G, Collins FS, Chen Y.-d. I., Caulfield MJ, Campbell H, Burtt NP, Bonnycastle LL, Boomsma DI, Boekholdt SM, Bergman RN, Barroso I, Bandinelli S, Ballantyne CM, Assimes TL, Quertermous T, Altshuler D, Seielstad M, Wong TY, Tai E-S, Feranil AB, Kuzawa CW, Adair LS, Taylor HA Jr, Borecki IB, Gabriel SB, Wilson JG, Holm H, Thorsteinsdottir U, Gudnason V, Krauss RM, Mohlke KL, Ordovas JM, Munroe PB, Kooner JS, Tall AR, Hegele RA, Kastelein JJP, Schadt EE, Rotter JI, Boerwinkle E, Strachan DP, Mooser V, Stefansson K, Reilly MP, Samani NJ, Schunkert H, Cupples LA, Sandhu MS, Ridker PM, Rader DJ, van Duijn CM, Peltonen L, Abecasis GR, Boehnke M, and Kathiresan S (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307):707–13.
- Van der Sluis S, Dolan CV, Li J, Song Y, Sham P, Posthuma D, and Li M-X (2015). Mgas: a powerful tool for multivariate gene-based genome-wide association analysis. Bioinformatics, 31(7):1007–1015.
- Wagschal A, Najafi-Shoushtari SH, Wang L, Goedeke L, Sinha S, deLemos AS, Black JC, Ramírez CM, Li Y, Tewhey R, Hatoum I, Shah N, Lu Y, Kristo F, Psychogios N, Vrbanac V, Lu Y-C, Hla T, de Cabo R, Tsang JS, Schadt E, Sabeti PC, Kathiresan S, Cohen DE, Whetstine J, Chung RT, Fernández-Hernando C, Kaplan LM, Bernards A, Gerszten RE, and Näär AM (2015). Genome-wide identification of micrornas regulating cholesterol and triglyceride homeostasis. Nat Med, 21(11):1290–7.
- Wolfrum C, Howell JJ, Ndungo E, and Stoffel M (2008). Foxa2 activity increases plasma high density lipoprotein levels by regulating apolipoprotein m. J Biol Chem, 283(24):16940–9.
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, and Lin X (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 89(1):82–93.
- Xu N, Zhang SO, Cole RA, McKinney SA, Guo F, Haas JT, Bobba S, Farese RV Jr, and Mak HY (2012). The fatp1-dgat2 complex facilitates lipid droplet expansion at the er-lipid droplet interface. J Cell Biol, 198(5):895–911.
- Yang Y, Basu S, Mirabello L, Spector L, and Zhang L (2018). A bayesian gene-based genome-wide association study analysis of osteosarcoma trio data using a hierarchically structured prior. Cancer Inform, 17:1176935118775103.
- Yang Y, Basu S, and Zhang L (2020). A bayesian hierarchical variable selection prior for pathway-based gwas using summary statistics. Stat Med, 39(6):724–739.
- Yang Y, Basu S, and Zhang L (2021). A bayesian hierarchically structured prior for rare-variant association testing. Genetic epidemiology.
- Zuo H, Su X, Jin Y, Zhang C, Wang L, and Yang L (2020). Transthyretin regulated by linc00657/mir-205–5p promoted cholesterol metabolism by inducing srebp2-hmgcr and inhibiting lxrα-cyp7a1. Arch Med Res.


