American Journal of Human Genetics. 2021 Mar 16;108(4):669–681. doi: 10.1016/j.ajhg.2021.02.016

A powerful subset-based method identifies gene set associations and improves interpretation in UK Biobank

Diptavo Dutta 1,2,3, Peter VandeHaar 1,2, Lars G Fritsche 1,2, Sebastian Zöllner 1,2, Michael Boehnke 1,2, Laura J Scott 1,2, Seunggeun Lee 1,2,4
PMCID: PMC8059336  PMID: 33730541

Summary

Tests of association between a phenotype and a set of genes in a biological pathway can provide insights into the genetic architecture of complex phenotypes beyond those obtained from single-variant or single-gene association analysis. However, most existing gene set tests have limited power to detect gene set-phenotype association when a small fraction of the genes are associated with the phenotype and cannot identify the potentially “active” genes that might drive a gene set-based association. To address these issues, we have developed Gene set analysis Association Using Sparse Signals (GAUSS), a method for gene set association analysis that requires only GWAS summary statistics. For each significantly associated gene set, GAUSS identifies the subset of genes that have the maximal evidence of association and can best account for the gene set association. Using pre-computed correlation structure among test statistics from a reference panel, our p value calculation is substantially faster than other permutation- or simulation-based approaches. In simulations with varying proportions of causal genes, we find that GAUSS effectively controls type 1 error rate and has greater power than several existing methods, particularly when a small proportion of genes account for the gene set signal. Using GAUSS, we analyzed UK Biobank GWAS summary statistics for 10,679 gene sets and 1,403 binary phenotypes. We found that GAUSS is scalable and identified 13,466 phenotype and gene set association pairs. Within these gene sets, we identify an average of 17.2 (max = 405) genes that underlie these gene set associations.

Keywords: pathway association, summary statistics, core subset, UK Biobank, phenome-wide associations

Introduction

Over the last fifteen years, genome-wide association studies (GWASs) have identified thousands of genetic variants associated with hundreds of complex diseases and phenotypes.1 However, the variants identified to date, individually or collectively, typically account for a small proportion of phenotype heritability.2 A possible explanation is that because of the large number of genetic polymorphisms examined in GWASs and the massive number of tests conducted, many weak associations are missed after multiple comparison adjustments.3

Gene set analysis (GSA) can identify sets of associated genes that may not be identified with single-variant and single-gene analysis, especially for rare variants or for variants or genes with weak to moderate effects.4 In GSA, individual genes are aggregated into groups sharing certain biological or functional characteristics. This approach considerably reduces the number of tests performed because the number of gene sets analyzed is much smaller than the number of genes or genetic variants tested.5,6 Additionally, most complex phenotypes are manifested through the combined activity of multiple genes or variants, so GSA can provide insight into the involvement of specific biological pathways or cellular mechanisms in the phenotype.7

GSA aims to find evidence regarding one of two types of null hypotheses:6 (1) the competitive null hypothesis, in which genes in a gene set of interest are no more associated with the phenotype than any other genes outside of it, or (2) the self-contained null hypothesis, in which none of the genes in a gene set of interest is associated with the phenotype. Several statistical methods to perform GSA for the self-contained null hypothesis have been developed and have successfully identified gene sets associated with complex diseases.8, 9, 10, 11, 12, 13, 14, 15 For example, de Leeuw et al.13 developed MAGMA, a method that transforms p values of the genes in the gene set to Z values by using an inverse normal transformation and employs linear regression to test the association. Pan et al.12 developed aSPUpath, which uses an adaptive test statistic based on the sum of powered scores and calculates a permutation-based p value.

However, there are several concerns regarding the power, type I error control, and computational scalability of these methods. Existing GSA methods often have relatively low power,9 especially in situations where only a few genes within the gene set have moderate to weak associations with the phenotype.14 Additionally, in the presence of correlation between variants or genes due to linkage disequilibrium (LD), many existing methods cannot appropriately control the type I error.16 Resampling-based strategies can be used for p value calculation,17 but in current implementations, these approaches are computationally very expensive, which limits their applicability, especially for large datasets. Although identifying the specific genes that possibly drive the association signal within the gene set is important in further downstream analysis, most of the existing GSA methods fail to identify such genes.

Here, we describe a computationally efficient subset-based gene set association method, Gene set analysis Association Using Sparse Signals (GAUSS), which aims to increase power over existing methods while maintaining proper type I error control and facilitate interpretation by extracting a subset of genes that drive the association. GAUSS focuses on the self-contained null hypothesis, as our main goal is to identify phenotype-associated genes or loci. GAUSS identifies a subset of genes (called the core subset) within the gene set, which produce the maximum signal of association. The gene set p value is calculated through a combination of fast copula-based simulation and statistical approximation approaches using the generalized Pareto distribution.18,19 GAUSS is constructed with the gene-based test p values for the genes in the gene set. The gene-based p values can be directly computed from the individual level genotype data if available or approximated with GWAS summary statistics (effect sizes, standard errors, and minor allele frequency). Using pre-computed genetic correlation matrices makes GAUSS computationally fast and applicable to large biobank-scale datasets.

Through computer simulation, we show that GAUSS can be more powerful than existing methods while maintaining the correct type I error. We applied GAUSS to UK Biobank GWAS summary statistics for 1,403 phenotypes20 with 10,679 gene sets derived from the molecular signature database (MsigDB v.6.2),21 demonstrating that GAUSS is feasible for large-scale data and can provide new insights into the genetic architecture of the phenotypes. We have made the association analysis results publicly available through a visual browser.

Material and methods

To conduct GAUSS, we need p values for the regions or genes in the gene set. Popular gene-based tests, including genetic association tests such as SKAT22 and SKAT-Common-Rare23 or genetic expression tests such as prediXcan24 and TWAS-FUSION,25 can be used to obtain the p values when individual-level data are available. If only GWAS summary statistics (effect size, standard error, p value, minor allele frequency for each variant) are available, we can approximate the gene-based tests and obtain their p values by using LD information from a suitable reference panel (see Appendix A).26 The GAUSS test for a given gene set can be calculated in the following two steps.

Step 1: GAUSS test statistic

To construct the GAUSS test statistic, we start with the gene-based p values for $m$ genes in the gene set $H$. Suppose $\mathrm{Pvalue}_i$ is the p value of the $i$th gene ($i = 1, \ldots, m$) in the gene set $H$. We first convert $\mathrm{Pvalue}_i$ to a Z score as $z_i = \Phi^{-1}(1 - \mathrm{Pvalue}_i)$, where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function. Here, we have used SKAT-Common-Rare to obtain $\mathrm{Pvalue}_i$ from GWAS summary statistics, but other tests such as prediXcan can be used to obtain gene-based p values.

For any non-empty subset $B \subseteq H$, we define $S(B)$, the association score for the subset $B$, as $S(B) = \sum_{i \in B} z_i / \sqrt{|B|}$, where $|B|$ is the number of genes in $B$. We define the GAUSS statistic for the gene set $H$ as the maximum score of a subset of $H$:

$$\mathrm{GAUSS}(H) = \max_{B \subseteq H} \frac{\sum_{i \in B} z_i}{\sqrt{|B|}}.$$

Although the maximum is taken over all $2^m - 1$ possible non-empty subsets of $H$, the computational complexity can be greatly reduced by rewriting the formula as

$$\mathrm{GAUSS}(H) = \max_{k \in \{1, \ldots, m\}} \; \max_{B_k \subseteq H} \frac{\sum_{i \in B_k} z_i}{\sqrt{|B_k|}},$$

where $B_k$ denotes a non-empty subset of $H$ with $k$ elements. It is easy to show that

$$\max_{B_k \subseteq H} \frac{\sum_{i \in B_k} z_i}{\sqrt{|B_k|}} = \frac{z_{(1)} + z_{(2)} + \cdots + z_{(k)}}{\sqrt{k}}, \qquad \text{(Equation 1)}$$

where $z_{(1)}, z_{(2)}, \ldots, z_{(m)}$ are the Z statistics in decreasing order and $z_{(1)}$ is the maximum. Equation 1 holds for any subset of $k$ values from $H$ regardless of their joint distribution (for detailed proof see supplemental methods, section A). We implement the following algorithm to obtain the GAUSS statistic:

  1. order the Z statistics for the $m$ genes as $z_{(1)}, z_{(2)}, \ldots, z_{(m)}$;

  2. compute $S_k = (z_{(1)} + z_{(2)} + \cdots + z_{(k)}) / \sqrt{k}$ for all $k = 1, 2, \ldots, m$;

  3. calculate the GAUSS test statistic as $\max_{k \in \{1, \ldots, m\}} S_k$.

Using this approach, computational cost is reduced from $O(2^m)$ to $O(m \log m)$. We term the subset of genes $B$ for which the maximum is attained the core subset (CS) of the gene set $H$.
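The three steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes the subset score normalizes the sum of Z scores by the square root of the subset size, and the function names (`pvalues_to_z`, `gauss_statistic`) and the toy gene names are hypothetical.

```python
import math
from statistics import NormalDist


def pvalues_to_z(pvalues):
    """Convert gene-based p values to Z scores, z = Phi^{-1}(1 - p).

    Uses the identity Phi^{-1}(1 - p) = -Phi^{-1}(p), which avoids
    rounding 1 - p to exactly 1 for very small p. Requires 0 < p < 1.
    """
    std_normal = NormalDist()
    return [-std_normal.inv_cdf(p) for p in pvalues]


def gauss_statistic(z_scores, gene_names):
    """Return (statistic, core_subset) by scanning top-k partial sums."""
    # Step 1: order genes by decreasing Z score (the O(m log m) part).
    order = sorted(range(len(z_scores)), key=lambda i: z_scores[i], reverse=True)
    best_score, best_k, running_sum = -math.inf, 0, 0.0
    # Steps 2-3: S_k = (z_(1) + ... + z_(k)) / sqrt(k); keep the maximum.
    for k, idx in enumerate(order, start=1):
        running_sum += z_scores[idx]
        score = running_sum / math.sqrt(k)
        if score > best_score:
            best_score, best_k = score, k
    # The core subset is the set of top-k genes attaining the maximum.
    core_subset = [gene_names[i] for i in order[:best_k]]
    return best_score, core_subset


if __name__ == "__main__":
    toy_p = [0.001, 0.02, 0.5, 0.9, 0.004]      # made-up gene-based p values
    toy_genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
    stat, core = gauss_statistic(pvalues_to_z(toy_p), toy_genes)
    print(f"GAUSS statistic = {stat:.3f}, core subset = {core}")
```

Because only the sorted prefix sums are examined, the scan never enumerates subsets explicitly; the sort dominates the cost.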

Step 2: p value calculation

Because of LD between variants in genes in the same genomic region, Z statistics in step 1 may be dependent. Thus, it is challenging to derive the null distribution of the GAUSS statistic analytically. Instead, we employ a fast simulation approach. We first estimate the correlation structure ($\hat{V}_H$) among the Z statistics ($z_1, z_2, \ldots, z_m$) under the null hypothesis, which can be estimated by using the sample itself or an ancestry-matched genotype reference panel. Here, we use genotype data from the European individuals in publicly available 1000 Genomes data27 as the reference panel. We note that $\hat{V}_H$ needs to be estimated only once for a given dataset and can be reused for all iterations. With $\hat{V}_H$, we approximate the joint distribution of Z statistics by using a multivariate normal distribution (see Appendix C and supplemental methods, section D). The null distribution of GAUSS test statistics can then be simulated by repeatedly generating Z statistics from the mean-zero multivariate normal distribution with covariance $\hat{V}_H$ and calculating GAUSS statistics from the simulated sets of Z statistics. The proportion of simulated null test statistics greater than the observed GAUSS test statistic is the p value estimate (see Appendix B for details). To further reduce computational cost, we use an adaptive resampling scheme (see Appendix B). To estimate very small p values (e.g., p value < 5 × 10⁻⁶), we use a generalized Pareto distribution (GPD)-based method18 (see supplemental methods, section B). We fit a GPD to the upper tail of the simulated GAUSS test statistics by using the right-tailed second-order Anderson-Darling statistic (GPD-AD2R) (see supplemental methods, section B and Figure S1) and estimate the p value by inverting the distribution function of the fitted GPD.
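The basic simulation step can be illustrated as follows. This is a minimal sketch under simplifying assumptions: it uses a fixed number of draws rather than the adaptive resampling scheme, it omits the GPD tail approximation, and it assumes the subset score divides by the square root of the subset size. The helper names (`cholesky`, `max_subset_score`, `simulated_p_value`) and the toy correlation matrix are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][t] * lower[j][t] for t in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def max_subset_score(z_scores):
    """Maximum over k of (sum of the k largest Z scores) / sqrt(k)."""
    best, running = -math.inf, 0.0
    for k, z in enumerate(sorted(z_scores, reverse=True), start=1):
        running += z
        best = max(best, running / math.sqrt(k))
    return best


def simulated_p_value(observed, corr, n_sim=10_000, seed=0):
    """Proportion of null statistics exceeding the observed statistic."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    m = len(corr)
    exceed = 0
    for _ in range(n_sim):
        # Correlated null Z scores: L @ e, with e ~ N(0, I).
        indep = [rng.gauss(0.0, 1.0) for _ in range(m)]
        null_z = [sum(lower[i][t] * indep[t] for t in range(i + 1)) for i in range(m)]
        if max_subset_score(null_z) > observed:
            exceed += 1
    return exceed / n_sim


if __name__ == "__main__":
    toy_corr = [[1.0, 0.3, 0.0],   # made-up correlation among three genes
                [0.3, 1.0, 0.2],
                [0.0, 0.2, 1.0]]
    print(simulated_p_value(observed=3.0, corr=toy_corr))
```

With a fixed number of draws the smallest non-zero estimate is 1/`n_sim`, which is why a tail approximation such as the GPD fit described above is needed for p values near the significance threshold.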

Simulation studies

We carried out simulation studies to evaluate the performance of GAUSS by using individual-level genotype data from the UK Biobank. For realistic LD patterns in the generative model, we used genotypes of 5,000 unrelated UK Biobank participants throughout our simulations. To understand the effect of the number of genes in the gene sets, we selected three gene sets of varying length from GO terms in MSigDB (v.6.2) for our simulations: regulation of blood volume by renin angiotensin (GO: 0002016; 11 genes), sterol metabolic process (GO: 0016125; 123 genes), and immune response process (GO: 0006955; 1,100 genes).

We define a gene within a gene set as “active” if at least one variant annotated to the gene has non-zero effect size. For a given gene set, we randomly set $g_a$ genes to be active and, within the $l$th active gene with $t_l$ variants, we set $v_{a;l}$ to be the proportion of variants with non-zero effects. Using genotypes of $N$ randomly selected unrelated individuals from the UK Biobank, we generate the phenotypes for individual $i$ ($i = 1, \ldots, N$) according to the model

$$Y_i = \sum_{k=1}^{T} \beta_k G_{ik} + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, 1)$, $G_{ik}$ is the genotype of the $i$th individual at the $k$th variant, and $T = \sum_{l=1}^{g_a} t_l v_{a;l}$ is the total number of variants with non-zero effects. Throughout our simulations, we used $N$ = 5,000. The effect size of the $k$th active variant with minor allele frequency $\mathrm{MAF}_k$ is generated as $\beta_k = c\,|\log_{10}(\mathrm{MAF}_k)|$, where $c$ is the magnitude of the association between a variant and phenotypes. For type I error simulations, we set $c = 0$, while for power, we set $c > 0$. We determined the value of $c$ by fixing the average heritability explained by the gene set ($h_{gs}^2$). We used several values for the average heritability explained by the gene set between $h_{gs}^2$ = 1% and 10%. With 20%–30% of variants having non-zero effect sizes, the corresponding values of $c$ varied approximately between 0.10 and 0.25.
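The generative model can be sketched as below. This is a toy illustration, not the simulation pipeline used in the study: genotypes are drawn independently per variant (so there is no LD, unlike the real UK Biobank genotypes), and the function names and example values are hypothetical.

```python
import math
import random


def toy_genotypes(n_individuals, mafs, seed=0):
    """Independent additive genotypes coded 0/1/2; real data would carry LD."""
    rng = random.Random(seed)
    return [
        [(rng.random() < maf) + (rng.random() < maf) for maf in mafs]
        for _ in range(n_individuals)
    ]


def simulate_phenotypes(genotypes, mafs, causal, c, seed=0):
    """Y_i = sum_k beta_k * G_ik + eps_i, with beta_k = c * |log10(MAF_k)|.

    `causal` lists the column indices of variants with non-zero effect;
    c = 0 gives the null model used for type I error simulations.
    """
    rng = random.Random(seed)
    betas = {k: c * abs(math.log10(mafs[k])) for k in causal}
    phenotypes = []
    for row in genotypes:
        genetic = sum(beta * row[k] for k, beta in betas.items())
        phenotypes.append(genetic + rng.gauss(0.0, 1.0))  # eps_i ~ N(0, 1)
    return phenotypes, betas


if __name__ == "__main__":
    mafs = [0.01, 0.05, 0.20, 0.40]            # made-up allele frequencies
    geno = toy_genotypes(n_individuals=10, mafs=mafs, seed=1)
    y, betas = simulate_phenotypes(geno, mafs, causal=[0, 2], c=0.1, seed=2)
    print(betas)
```

Under this effect-size rule, rarer variants receive larger effects: a variant with MAF 0.01 gets twice the effect of a variant with MAF 0.1 for the same `c`.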

Given the summary statistics generated as above, we estimated the SKAT-Common-Rare p values by using publicly available 1000 Genomes unrelated Europeans (sample size = 498) as LD reference panel and applied GAUSS to estimate the p value and extract the core genes.

UK Biobank data analysis

We applied GAUSS to the publicly available UK Biobank GWAS summary results for 1,403 binary phenotypes that were generated by SAIGE28 (see PheWeb entry in web resources). The summary statistic files included results for markers directly genotyped or imputed by the Haplotype Reference Consortium (HRC), which produced approximately 28 million markers with MAC ≥ 20 and an imputation info score ≥ 0.3. We used EPACTS29 (see web resources) with the RefSeq gene database for the variant annotation. For each gene, we included non-synonymous variants and variants within 1 kb of the first and last variants in each exon to test for the effect of possibly functional and regulatory variants. We extracted LD information and constructed an LD matrix from the 1000 Genomes European reference panel by using emeraLD. For each of the 1,403 phenotypes and 18,334 genes, we constructed SKAT-Common-Rare test statistics by using the estimates from SAIGE of effect size (β), standard error (SE), and minor allele frequency (MAF). We transformed the gene-based p values into Z statistics and performed gene set analysis for each phenotype.

Results

Simulation results

Type I error rates and power to identify associated gene sets

To estimate the type I error, we generated a normally distributed phenotype for these same individuals (see simulation studies), independent of genotypes. We then calculated the gene-based p values by using the SKAT-Common-Rare test for each gene in the gene sets and subsequently applied the GAUSS test. Type I errors of GAUSS remained well calibrated at α = 1 × 10⁻⁴, 1 × 10⁻⁵, and 5 × 10⁻⁶ (Table 1) for all three gene sets.

Table 1.

Estimated type I error of GAUSS for gene sets GO: 0016125, GO: 0006955, and GO: 0002016

| α | GO: 0016125 (123 genes) | GO: 0006955 (1,100 genes) | GO: 0002016 (11 genes) |
| --- | --- | --- | --- |
| 1 × 10⁻⁴ | 9.8 × 10⁻⁵ | 9.8 × 10⁻⁵ | 9.7 × 10⁻⁵ |
| 1 × 10⁻⁵ | 9.9 × 10⁻⁶ | 9.3 × 10⁻⁶ | 9.6 × 10⁻⁶ |
| 5 × 10⁻⁶ | 4.6 × 10⁻⁶ | 4.8 × 10⁻⁶ | 4.2 × 10⁻⁶ |

Next, we compared the power to detect a gene set-phenotype association, under a spectrum of association models, for GAUSS and three existing methods: SKAT for all the variants in the gene set (SKAT-Pathway), MAGMA, and aSPUpath. With gene set GO: 0016125, we first considered a scenario in which 20 of the 123 genes (16.2%) were active and, within each active gene, we set 30% of the variants to be causal. We varied the gene set heritability ($h_{gs}^2$) from 1% to 6%. The empirical power of each method increased with increasing $h_{gs}^2$. GAUSS and MAGMA had similar power (Figure 1; left panel) for all scenarios, and SKAT-Pathway had the lowest power. aSPUpath had slightly lower power than GAUSS when $h_{gs}^2$ = 1%–3% and had similar power for $h_{gs}^2$ = 4%–6%.

Figure 1.


Empirical power for GAUSS

Estimated power of GAUSS with the GO: 0016125 gene set (123 genes), compared with that of aSPUpath, SKAT-Pathway, and MAGMA under different average heritability explained ($h_{gs}^2$) and different numbers of active genes ($g_a$).

(A) Power of GAUSS when 20 genes are active ($g_a$ = 20), for different values of the average heritability ($h_{gs}^2$) explained by the gene set.

(B) Power of GAUSS with different numbers of active genes (2, 4, 5, and 6) when the gene set has an average heritability $h_{gs}^2$ of 3%. The proportion of causal variants in an active gene (see simulation studies) was set to be 30%.

Next, we considered a scenario where the signals were sparser (Figure 1; right panel), i.e., two (1.6%) to six (5.0%) genes among the 123 genes in the gene set were active. We fixed the gene set heritability ($h_{gs}^2$) at ~3%. In all the simulation settings, GAUSS was the most powerful method. The power gap between GAUSS and the other methods was particularly large when only two genes were active. Among the other methods, aSPUpath had the second highest power and MAGMA had the lowest power when 10 to 20 genes were active. The overall trend remained similar when we used a much larger gene set, such as GO: 0006955, or a much smaller gene set, such as GO: 0002016 (Figures S2 and S3).

Identification of active genes

We investigated the sensitivity and specificity of GAUSS in identifying active genes through the core subset (CS) genes. Sensitivity is defined as the proportion of active genes correctly identified by GAUSS as CS genes, and specificity is defined as the proportion of inactive genes correctly identified by GAUSS as not CS genes. Because no current methods attempt to identify the active genes within the gene set, we compared the performance of GAUSS to the heuristic approach of defining the significant genes (p value < 2.5 × 10⁻⁶) as active. For GAUSS, both sensitivity and specificity remained higher (>75%) than the significant genes approach at different values of $h_{gs}^2$ and for varying numbers of active genes (Figure 2). We also evaluated power to identify the exact set of active genes, which is a more stringent criterion than sensitivity and specificity. Under different magnitudes of gene effect sizes defined by different values of heritability, the empirical probability to identify the exact set of active genes via GAUSS had a slight decreasing trend with increasing number of active genes (Figure 2). The overall patterns remained similar when we varied the length of the gene set by using GO: 0006955 (1,100 genes) and GO: 0002016 (11 genes) for our simulations (see Figures S4 and S5).
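The three evaluation measures defined above can be computed for a single simulated replicate as in the following sketch. The function name and the toy gene labels are hypothetical, and the sketch assumes the gene set contains at least one active and one inactive gene.

```python
def core_subset_accuracy(all_genes, active_genes, core_subset):
    """Return (sensitivity, specificity, exact_match) for one replicate.

    sensitivity: share of active genes that were selected as CS genes
    specificity: share of inactive genes that were not selected
    exact_match: True only if the CS equals the active set exactly
    """
    active = set(active_genes)
    selected = set(core_subset)
    inactive = set(all_genes) - active
    sensitivity = len(active & selected) / len(active)
    specificity = len(inactive - selected) / len(inactive)
    exact_match = selected == active
    return sensitivity, specificity, exact_match


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
    truth = ["g1", "g2", "g3"]          # simulated active genes
    core = ["g1", "g2", "g5"]           # genes selected as the core subset
    print(core_subset_accuracy(genes, truth, core))
```

Averaging `exact_match` over replicates gives the empirical probability of recovering the exact active set, which is the stricter measure because a single missed or extra gene counts as a failure.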

Figure 2.


Sensitivity, specificity, and probability of identifying the exact non-null subset for GAUSS

The results are shown across different numbers of active genes in the gene set (horizontal axis) and different average heritability explained by the gene set ($h_{gs}^2$) for GO: 0016125 (123 genes).

(A–C) Sensitivity (A), specificity (B), and probability (C) of identifying an exact non-null subset with GAUSS (solid line) compared with the method of using the set of significant genes as the active gene set (dashed line). The proportion of causal variants in an active gene (see simulation studies) was set to be 30%.

Simulation results highlight the utility of GAUSS compared with the existing methods. GAUSS has greater power to identify gene set associations, especially when only a few genes in the gene set are weakly associated with the phenotype. Furthermore, by extracting CS genes, GAUSS can identify the set of active genes with high probability and provides a direct way to interpret findings.

Association analysis in UK Biobank

We applied GAUSS to the UK Biobank GWAS summary results for 1,403 binary phenotypes28 to identify disease-related gene sets and the corresponding core genes (see material and methods). We used 10,679 gene sets from two MsigDB (v.6.2) collections: (1) the curated gene sets (C2) from the KEGG, BioCarta, and Reactome databases and gene sets representing expression signatures of genetic and chemical perturbations and (2) gene sets that contain genes annotated by GO term (C5). For each phenotype, we estimated the gene-based (SKAT-Common-Rare) p value for 18,334 genes by using SAIGE summary statistics and LD information from a reference panel consisting of unrelated Europeans in the 1000 Genomes Project (see material and methods). For each pair of phenotype and gene set, we computed the GAUSS test statistic, the corresponding p value, and the CS of genes (if the gene set was found to be significant). We used the Bonferroni-corrected gene set p value threshold for each phenotype: 0.05/10,679 ≈ 5 × 10⁻⁶.

Overview of UK Biobank results

The 10,679 gene sets had a median size of 36 genes per gene set (average: 93.2). 94.2% (17,284 of 18,334) of genes belonged to at least one gene set. In our analysis, we identified 13,466 significant phenotype-gene set associations at a p value cut-off of 5 × 10⁻⁶. Note that the expected number of p values < 5 × 10⁻⁶ under no association across all the phenotypes is approximately 75, so the false discovery rate is 0.004. Among the 1,403 phenotypes, 199 (14.1%) had at least one significantly associated gene set, while among the 10,679 gene sets, 34.1% (3,638) had at least one significantly associated phenotype. There was no significant enrichment in the proportion of association by category of gene sets, i.e., the GO (C5) gene sets or curated (C2) gene sets (p value = 0.13). For the significant associations, the average number of the extracted CS genes was 17.2, and a large proportion of the associations (53.6%; 7,237) was due to effects of a single gene within the gene set. However, 24.6% of the associations were driven by a set of five or more CS genes. Approximately 32.7% of the significant associations were with gene sets that do not have any genes significant at the gene-based cutoff of 2.5 × 10⁻⁶. This underlines that GAUSS can effectively aggregate weaker associations to detect gene sets significantly associated with a phenotype. Among the different categories of phenotypes, “endocrine/metabolic” diseases had the highest number of associations (5,015; 37.2%), followed by “circulatory system” diseases (2,312; 17.2%) and “digestive” diseases (1,985; 14.7%) (see Figures S6–S9).
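The multiple-testing arithmetic behind the threshold and the expected number of null hits can be checked directly. The snippet below is a worked calculation using only the counts quoted in the text; the variable names are illustrative.

```python
n_gene_sets = 10_679      # gene sets tested per phenotype
n_phenotypes = 1_403      # binary phenotypes analyzed
family_alpha = 0.05       # per-phenotype family-wise error rate
threshold = 5e-6          # rounded cut-off used in the analysis

# Bonferroni correction within a phenotype: 0.05 / 10,679, about 4.68e-06,
# which is rounded to 5e-06 in the text.
bonferroni_threshold = family_alpha / n_gene_sets

# Total number of phenotype and gene set pairs tested.
n_tests = n_gene_sets * n_phenotypes

# Expected number of p values below the cut-off if every null were true
# (uniform p values): roughly 75.
expected_null_hits = n_tests * threshold

if __name__ == "__main__":
    print(f"Bonferroni threshold: {bonferroni_threshold:.3g}")
    print(f"Tests performed:      {n_tests:,}")
    print(f"Expected null hits:   {expected_null_hits:.1f}")
```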

Gene set association analysis for two exemplary phenotypes

To demonstrate the utility of GAUSS in detecting weak associations and improving interpretation, we show association results for two example phenotypes: E. coli infection (EC; PheCode: 041.4) and gastritis and duodenitis (GD; PheCode: 535). Single-variant GWAS results using SAIGE for these phenotypes can be visualized on UK Biobank PheWeb (see web resources) and do not show any evidence of substantial inflation ($\lambda_{GC}$ varies from 0.91 to 1.09). In the single-variant analysis, EC has no genome-wide significant locus and GD has five genome-wide significant loci. When we estimated the gene-based (SKAT-Common-Rare) p values for EC and GD, the QQ plots were well calibrated without any indication of inflation ($\lambda_{GC}$ varies from 0.98 to 1.01; Figure S10). At a gene-based cut-off of 2.5 × 10⁻⁶, EC does not have any significantly associated genes; GD has three genes that are significantly associated: HLA-DQA1 (p value = 9.8 × 10⁻¹¹), HLA-DQB1 (p value = 1.4 × 10⁻⁸), and PBX2 (p value = 2.1 × 10⁻⁶).

Next, we performed gene set association analysis by using GAUSS (Figure 3). We found that EC is associated with two gene sets (Figure 3; left panel): fatty acid catabolic process (GO: 0009062; p value < 1 × 10⁻⁶) and fatty acid beta oxidation (GO: 0006635; p value = 2 × 10⁻⁶). Although a thorough gene set association analysis of EC has not been done before to our knowledge, the antibacterial role of fatty acids has been well reported.30, 31, 32 A set of 25 distinct genes (Table 2) is selected by GAUSS as the CS genes that are responsible for the association, although none of them are marginally associated with EC (minimum p value = 2.2 × 10⁻⁴), demonstrating that GAUSS can effectively aggregate weaker signals within a gene set, which would otherwise not have been detected.

Figure 3.


p values for EC and GD across GO and curated gene sets

p values for association of E. coli infection (EC; PheCode 041.4) and gastritis and duodenitis (GD; PheCode 535) with the GO pathways (C5; upper panel) and curated pathways (C2; lower panel). p values < 1 × 10⁻⁶ were estimated with GPD (see material and methods and supplemental methods, section B). The horizontal solid black line denotes the significance threshold of 5 × 10⁻⁶.

Table 2.

Significant gene sets associated with E. coli infection (EC) and gastritis and duodenitis (GD), corresponding p values, and the CS genes selected by GAUSS

| Phenotype | Gene set | Genes | p value | Core subset (CS) selected by GAUSS |
| --- | --- | --- | --- | --- |
| EC | GO: Fatty acid catabolic process | 73 | 9.9 × 10⁻⁸ | SLC27A2, CRAT, CPT1B, ACOX2, LPIN1, CPT1C, ETFB, SLC27A4, EHHADH, ACAA1, LEP, ABCD2, GCDH, HADH, MUT, BDH2, PLA2G15, PEX2, IVD, ACAAS, PEX13, ACAD8, ACADL, ECI1, ADIPOQ |
| EC | GO: Fatty acid beta oxidation | 51 | 1.8 × 10⁻⁶ | SLC27A2, CRAT, CPT1B, ACOX2, CPT1C, ETFB, EHHADH, ACAA1, LEP, ABCD2, GCDH, HADH, BDH2, PEX2, IVD, ACAAS, ACAD8, ACADL, ECI1, ADIPOQ |
| GD | Reactome: P53 independent G1/S DNA damage checkpoint | 51 | 9.7 × 10⁻⁸ | PSMB2, PSMB9, PSMC5, CHEK1, PSMB8, PSMD9, PSMD2, RPS27A, PSMA6, PSMB7 |
| GD | Reactome: CDK mediated phosphorylation and removal of CDC6 | 48 | 2.8 × 10⁻⁶ | PSMB2, PSMB9, PSMC5, PSMB8, PSMD9, PSMD2, RPS27A, PSMA6, PSMB7 |
| GD | Reactome: cyclin E-associated events during G1/S transition | 65 | 1.9 × 10⁻⁶ | PSMB2, PSMB9, PKMYT1, PSMC5, PSMB8, PSMD9, PSMD2, RPS27A, PSMA6, PSMB7 |
| GD | Reactome: P53-dependent G1 DNA damage response | 57 | 1.6 × 10⁻⁶ | PSMB2, PSMB9, PSMC5, MDM2, PSMB8, PSMD9, PSMD2, RPS27A, PSMA6, PSMB7 |

p values < 1 × 10⁻⁶ were estimated with GPD (see material and methods and supplemental methods, section B).

In gene set association analysis of GD (Figure 3; right panel), we found four gene sets to be associated (Table 2). Although the gene sets and the corresponding functions are biologically related, their role in GD is not easily identifiable. GAUSS selects a set of ten genes to be the CS genes for the gene sets, the majority of which are from the different proteasome endopeptidase complex (PSM) subunits. Different proteasome subunit genes have been found to be associated with several inflammatory responses and intestinal diseases.33,34 In particular, the role of PSMB8 in gastric cancer has been extensively reported in the literature.35 Also, PSMB9 and PSMB8 have been found to be associated with several gastrointestinal disorders, such as celiac disease and inflammatory bowel disease.36, 37, 38 Although none of these genes are individually significantly associated with GD (minimum p value = 2.8 × 10⁻⁴), they jointly drive the strong association signal. This highlights that the selected CS genes can help in finding biological targets for downstream investigation.

To validate the results, we tested the gene sets identified by GAUSS for EC and GD (Table 2) in an independent dataset. For this, we applied GAUSS to summary data from the Michigan Genomics Initiative (MGI) of about 38,000 European samples39 (see supplemental methods, section C). Our results show that five out of the six gene sets that were significant in UK Biobank for either EC or GD had nominal evidence of significance (p value < 0.05; Table S1). Given that the sample size of MGI is about 10 times smaller than that of UK Biobank, our findings indicate that the detected associations are potentially true.

Phenome-wide association analysis for single gene set

We further analyzed the association of a gene set across the binary phenome. Figure 4 shows association results across the 1,403 phenotypes for one example gene set: ATP-binding cassette (ABC) transporters from KEGG (ABC transporters; web resources). ABC transporters are involved in tumor resistance, cystic fibrosis, and a spectrum of other heritable phenotypes along with the development of resistance to several drugs.40 We found 18 phenotypes significantly associated (p value < 5 × 10⁻⁶) with ABC transporters (Table 3), mainly from the “digestive” disease and “endocrine/metabolic” disease categories. Among the CS genes selected for different associated phenotypes, TAP2 is the most frequent. TAP2 has been reported to be associated with several phenotypes, including diastolic blood pressure,41 type 1 diabetes, and autoimmune thyroid diseases.42 Our results suggest that the significant association of ABC transporters with disorders such as psoriasis, celiac disease, and type 1 diabetes is mainly driven by the single-gene effect of TAP2. However, the association of ABC transporters with gout, lipoid metabolism, and gallstones is driven mainly by ABCG5 and ABCG2. Thus, although the ABC transporters gene set is significantly associated with 18 phenotypes, the CS genes that drive the associations differ, which can be indicative of different mechanisms underlying the phenotypes.

Figure 4.


Phenome-wide p values for ABC transporter pathway in KEGG

p values for association of 1,403 phenotypes with the ABC transporters pathway (KEGG). The horizontal solid black line denotes the significance threshold of 5 × 10⁻⁶. p values < 1 × 10⁻⁶ were estimated through the GPD method (see material and methods and supplemental methods, section B).

Table 3.

Phenotypes associated with ABC transporters gene set, corresponding p values, and the CS genes selected by GAUSS

| Phenotype | Category | PheCode | p value | Core subset (CS) selected by GAUSS |
| --- | --- | --- | --- | --- |
| Psoriasis | dermatologic | 696.4 | 9.1 × 10⁻¹¹ | TAP2 |
| Psoriasis and related disorders | dermatologic | 696 | 3.5 × 10⁻¹¹ | TAP2 |
| Celiac disease | digestive | 557.1 | 1.8 × 10⁻³² | TAP2 |
| Intestinal malabsorptions (non-celiac) | digestive | 557 | 3.4 × 10⁻³¹ | TAP2 |
| Cholelithiasis with other cholecystitis | digestive | 574.12 | 1.8 × 10⁻¹⁹ | ABCG5 |
| Cholelithiasis | digestive | 574.1 | 1.2 × 10⁻³¹ | ABCG5 |
| Calculus of bile duct | digestive | 574.2 | 1.2 × 10⁻¹¹ | ABCG5 |
| Cholelithiasis without cholecystitis | digestive | 574.3 | 4.1 × 10⁻⁶ | ABCG5, ABCC12, ABCA8, ABCB4 |
| Cholelithiasis and cholecystitis | digestive | 574 | 9.6 × 10⁻³⁴ | ABCG5 |
| Other biliary tract disease | digestive | 575 | 6.4 × 10⁻⁸ | ABCG5 |
| Hypothyroidism NOS | endocrine/metabolic | 244.4 | 1.5 × 10⁻¹² | TAP2 |
| Type 1 diabetes | endocrine/metabolic | 250.1 | 2.9 × 10⁻⁸ | TAP2 |
| Hypercholesterolemia | endocrine/metabolic | 272.11 | 2.8 × 10⁻⁸ | ABCG5, TAP2, ABCC10, ABCA2, ABCA5, ABCA1, ABCA6, ABCC12, ABCC1, ABCA8, ABCB9 |
| Hyperlipidemia | endocrine/metabolic | 272.1 | 2.1 × 10⁻⁷ | TAP2, ABCG5, ABCC10, ABCA6, ABCA2, ABCA5, ABCA1, ABCC1, ABCA8 |
| Disorders of lipoid metabolism | endocrine/metabolic | 272 | 5.7 × 10⁻²⁷ | ABCG2 |
| Gout | endocrine/metabolic | 274.1 | 7.1 × 10⁻¹⁰ | ABCG2 |
| Gout and other crystal arthropathies | endocrine/metabolic | 274 | 3.8 × 10⁻⁶ | ABCG2 |
| Asthma | respiratory | 495 | 9.1 × 10⁻¹¹ | TAP2 |

p values < 1 × 10⁻⁶ were estimated with GPD (see material and methods and supplemental methods, section B).

Computation time comparison

GAUSS uses an adaptive resampling scheme to estimate the p value (see material and methods). Hence, the computation time of GAUSS can vary across different phenotypes depending on the number of associated gene sets. To evaluate the computation time of GAUSS in phenotypes with small and large numbers of associated gene sets, we chose two phenotypes in the UK Biobank data: pernicious anemia (PA; PheCode: 281.11), which had only one associated gene set, and type 2 diabetes (T2D; PheCode: 250.2), which had 227 associated gene sets. Figure S11 shows the total runtime (in CPU hours [CPU h]) of GAUSS, MAGMA, and aSPUpath for UK Biobank. Total runtimes were calculated as the net time taken starting from the input of summary statistics until the p values for 10,679 gene sets were generated. In terms of total runtime, MAGMA (8.1 and 8.3 CPU h for PA and T2D, respectively) performed slightly better than GAUSS (10.3 and 12.8 CPU h, respectively). aSPUpath (93 and 98 CPU h, respectively) was substantially slower than all other methods. For the full UK Biobank analysis, GAUSS had an average runtime of 11.2 CPU h per phenotype.

To obtain GAUSS p values, we need to estimate the null correlation structure $\hat{V}_H$. $\hat{V}_H$ needs to be calculated only once for a given dataset and can then be reused for the analysis of all phenotypes. We calculated $\hat{V}_H$ by using a reference panel consisting of the unrelated individuals of European ancestry in the 1000 Genomes dataset. This required 723 CPU h. However, because the calculation is parallelizable, the actual clock time was around 12 h with 60 CPU cores.

Discussion

Here, we present GAUSS, which uses a subset-based statistic to test the association between a gene set and a phenotype. Similar to several existing approaches such as MAGMA and aSPUpath, GAUSS aims to aggregate weak to moderate association signals across a set of genes. Additionally, GAUSS identifies the core subset (CS) of genes, which maximize the association signals.

The identification of CS genes within a gene set is a key feature of GAUSS. Most existing approaches suggest using the genes with the lowest p values in the gene set. In contrast, GAUSS selects CS genes that have the maximum association score. The selected CS genes can highlight possible underlying mechanisms and can be used for downstream analysis. Furthermore, the association results for a given gene set across many phenotypes can highlight the underlying biological similarities or differences between phenotypes, especially through the CS genes.

GAUSS can use p values generated from any gene-based test. Although here we have used SKAT-Common-Rare, other popular gene-based tests such as TWAS-FUSION or prediXcan can also be used (Appendix D; supplemental methods, section E; and Figure S14). Currently, we have made available the estimated VH of TWAS-FUSION models for genes expressed in 48 different tissues (GTEx v.7;43 see web resources). In the future, we will continue to update the implementation of GAUSS with VH estimates from other tissues and transcriptomic studies.

GAUSS uses several approximations and adaptive approaches to reduce computation cost. It summarizes the gene-based association signals as Z scores and uses a Gaussian copula to model the joint distribution among Z scores. This approach allows estimation of the distribution of GAUSS statistics by generating multivariate normal random variables and is far more efficient than using standard permutation approaches. Gaussian copula assumptions have been previously used by several methods to approximate the joint null distribution of correlated variables, especially in the context of functional annotations,44 gene-environment interactions,45 and multiple phenotypes.46,47 Furthermore, such joint normality assumptions have also been employed for the imputation of association summary statistics in the presence of covariates, using adjusted LD estimates.48,49,50 Here, we have additionally demonstrated that the Gaussian copula provides a reasonable approximation to the joint distribution of the Z statistics for the genes in a gene set (supplemental methods, section D, and Figure S13). In addition, GAUSS uses adaptive resampling methods and GPD-based small p value estimation, which further reduce computation cost.
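As a minimal sketch of the first step described above, gene-based p values can be mapped to Z scores through the standard normal quantile function. The transform z = -Φ^{-1}(p) and the helper name `p_to_z` are assumptions made for illustration; the exact transform used by GAUSS is defined in the material and methods and may differ in detail.

```python
from statistics import NormalDist


def p_to_z(p_values, eps=1e-300):
    """Map gene-based p values to Z scores (illustrative helper, not the GAUSS API).

    Assumes z = -Phi^{-1}(p), so that small p values give large positive Z scores.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_scores = []
    for p in p_values:
        # Clamp into the open interval (0, 1), where the quantile function is defined.
        p = min(max(p, eps), 1.0 - 1e-12)
        z_scores.append(-std_normal.inv_cdf(p))
    return z_scores
```

Under this transform, a gene with p = 0.5 maps to a Z score of zero, and progressively smaller p values map to progressively larger Z scores.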

Our UK Biobank analysis shows that typically only a small percentage of genes in the pathway are selected as core genes (Figure S8). Simulations show that GAUSS has substantially greater power than the existing methods in detecting associations in such sparse scenarios. Existing methods, such as MAGMA, use test statistics that are averaged over all the variants or genes in the gene set. If the fraction of associated variants is relatively low and associated variants have weak effects, these tests might have low power. However, GAUSS uses a subset-based approach to choose the subset with maximum evidence of association and thus does not average over all associated and unassociated variants or genes in the gene set. Hence, even when the fraction of associated variants is low, GAUSS can have greater power. When many of the genes in the gene set are associated, the power of GAUSS was similar to MAGMA. Thus, in most of the practical scenarios, GAUSS has power greater than or equal to that of existing methods MAGMA and aSPUpath. Further, the type I error for GAUSS remains calibrated at the desired level.

A limitation of GAUSS is that it only allows testing for the self-contained null hypothesis. Although this allows us to detect association of a gene set with a phenotype, it does not provide information on enrichment of the associated gene set. Further, the GPD method of estimating very small p values needs additional research and exploration.

GAUSS relies on a reference dataset to calculate the LD between the variants and subsequently to calculate the variance-covariance matrix between the Z statistics for a given gene set. Currently, we have used 1000 Genomes data to calculate these matrices, which have been successfully used as LD reference in many applications including heritability estimation,51 association testing,25,52 and polygenic risk score prediction.53 However, this can be restrictive in terms of both the number of individuals in 1000 Genomes and the number of low-frequency and rare variants reported. As newer datasets are becoming publicly available, we will continue to update the current implementation of GAUSS to incorporate a broader spectrum of variants.

In analysis of simulated and UK Biobank data, we used European ancestry samples of 1000 Genomes data for reference data. To investigate whether the method is sensitive to the reference panel, we compared the performance of GAUSS using 1000 Genomes data to that using UK Biobank data as reference (Figure S12). The results show that the choice of reference panel did not substantially impact the results from GAUSS. However, it is important that the reference population and the study sample belong to the same ancestry to reflect similar LD patterns.

In the current implementation, we have focused on analyzing GWAS results from individuals of European ancestry. However, GAUSS can also perform analysis with samples from a single non-European ancestry by using the corresponding reference population in 1000 Genomes. In the presence of multi-ethnic samples, it is recommended to perform GAUSS analysis within each ancestry and then combine p values via Fisher’s method to test for overall significance. For admixed samples, there is currently no consensus on the most suitable reference panel because the proportion of admixture can vary across individuals. We leave this for future work.
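The combination step mentioned above can be sketched as follows. This is a generic implementation of Fisher's method, not code from the GAUSS package; the function name `fisher_combined_p` is hypothetical, and the method assumes that the ancestry-specific p values are independent.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p values with Fisher's method.

    X = -2 * sum(log p_i) follows a chi-square distribution with 2k degrees of
    freedom under the null. For even degrees of freedom the survival function
    has the closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p value")
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2; requires p > 0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half_x) * total
```

With a single input the function returns that p value unchanged, which is a quick sanity check on the closed form.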

In summary, we have shown that GAUSS can be more powerful than the existing methods to detect gene sets associated with phenotypes and facilitates interpretation of gene set analysis results through CS genes. The insights generated by GAUSS and its computational scalability make it an attractive choice to perform phenome-wide gene set analysis. Our UK Biobank analysis identified large numbers of gene sets by phenotype association pairs, and we have partially validated associations in EC and GD phenotypes through MGI data analysis. By providing powerful, scalable, and more interpretable gene set analysis results, our approach will contribute to identifying genetic components of complex phenotypes. We have made a GAUSS software package and UK Biobank analysis results publicly available (see web resources).

Declaration of interests

The authors declare no competing interests.

Acknowledgements

This research was supported by NIH grants R01-HG008773, R01-LM012535 (D.D. and S.L.), and R01-HG009976 (M.B.) and the Brain Pool Plus (BP+, Brain Pool+) Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (2020H1D3A2A03100666, S.L.). UK Biobank data were accessed under the accession number UKB: 45227. The authors acknowledge the Michigan Genomics Initiative participants, Precision Health at the University of Michigan, the University of Michigan Medical School Central Biorepository, and the University of Michigan Advanced Genomics Core for providing data and specimen storage, management, processing, and distribution services and the Center for Statistical Genetics in the Department of Biostatistics at the School of Public Health for genotype data curation, imputation, and management in support of the research reported in this publication.

Published: March 16, 2021

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2021.02.016.

Appendix A: Estimating gene-based p values from summary statistics

Let $y = (y_1, y_2, \ldots, y_n)^T$ be the vector of phenotypes for $n$ individuals; $X$ the matrix of $q$ non-genetic covariates including the intercept; $G_j = (G_{1j}, G_{2j}, \ldots, G_{nj})^T$ the vector of the minor allele counts (0, 1, or 2) for genetic variant $j$; and $G = (G_1, G_2, \ldots, G_m)$ the genotype matrix for $m$ genetic variants in a target gene or region. The regression model used to relate the phenotype to the $m$ genetic variants in the region is

$$f[E(y)] = X\alpha + G\beta, \quad \text{(Equation A1)}$$

where $f(\cdot)$ is a link function and can be set to be the identity function for continuous traits or the logit function for binary traits; $\alpha$ is the vector of regression coefficients of the $q$ non-genetic covariates; and $\beta = (\beta_1, \ldots, \beta_m)^T$ is the vector of regression coefficients of the $m$ genetic variants. To test $H_0: \beta = 0$ under the random effects assumption $\beta_i \sim N(0, \tau^2)$, the SKAT test statistic (Wu et al.22) is

$$Q = (y - \hat{\mu})^T G W W G^T (y - \hat{\mu}), \quad \text{(Equation A2)}$$

where $\hat{\mu}$ is the estimated expected value of $y$ under the null hypothesis of no association and $W = \mathrm{diag}(w_1, \ldots, w_m)$ is a diagonal weighting matrix. Wu et al.22 suggested using the Beta(MAF, 1, 25) density function as a weight to upweight rarer variants. Under the null hypothesis, $Q$ asymptotically follows a mixture of chi-squared distributions, and p values can be computed by inverting the characteristic function. The mixing parameters are the eigenvalues of $W G^T P_0 G W$, where $P_0 = I_n - X (X^T X)^{-1} X^T$ and $I_n$ is the identity matrix of order $n$.

Equation A1 uses individual-level data on the samples. However, the test of association can be effectively approximated via summary statistics on the $m$ variants in the region.26,54 Given the estimated GWAS summary statistics $(\mathrm{MAF}_j, \beta_j, \mathrm{SE}_j)$ for each variant $j$, the test statistic $Q$ in Equation A2 can be approximated as

$$Q_{\mathrm{summary}} = \sum_{j=1}^{m} 2\,\mathrm{MAF}_j (1 - \mathrm{MAF}_j)\, w_j^2\, \frac{\beta_j^2}{\mathrm{SE}_j^2}. \quad \text{(Equation A3)}$$

Under the null hypothesis, $Q$ follows a mixture of chi-squares and the mixing parameters are the eigenvalues of the matrix $W G^T P_0 G W$. Replacing $P_0$ by $\Phi_0 = I - \mathbf{1}\mathbf{1}^T / n$, we can approximate the eigenvalues by those of the matrix $W G^T \Phi_0 G W$. The matrix $G^T \Phi_0 G$ is the LD matrix of the $m$ variants, which can be estimated via a publicly available reference panel.27
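The summary-statistic approximation can be illustrated with a short sketch. It reads Equation A3 as the sum over variants of 2 MAF_j (1 - MAF_j) w_j^2 β_j^2 / SE_j^2, with w_j taken as the Beta(a, b) density evaluated at MAF_j. The function names and the default a = 1, b = 25 (the rare-variant weights mentioned in the text) are illustrative choices, and the sketch does not compute the mixture-of-chi-squares p value.

```python
import math


def beta_density(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def q_summary(mafs, betas, ses, a=1.0, b=25.0):
    """Approximate SKAT statistic from per-variant (MAF, beta, SE) summary statistics."""
    total = 0.0
    for maf, beta, se in zip(mafs, betas, ses):
        w = beta_density(maf, a, b)  # variant weight from the Beta density
        total += 2 * maf * (1 - maf) * w ** 2 * beta ** 2 / se ** 2
    return total
```

For example, `q_summary([0.01, 0.2], [0.3, 0.05], [0.1, 0.02])` sums the weighted squared Z scores of two variants; passing `a=0.5, b=0.5` instead applies the common-variant weights.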

Ionita-Laza et al.23 developed SKAT-Common-Rare, which tests the combined effect of rare and common variants in a region. Given summary statistics as above, we construct the test statistic separately for common and rare variants as $Q_{\mathrm{summary;common}}$ and $Q_{\mathrm{summary;rare}}$ by using Equation A3. For common variants, we used the Beta(MAF, 0.5, 0.5) density function for weight calculation, and for rare variants, Beta(MAF, 1, 25).23 The SKAT-Common-Rare test statistic is then constructed as

$$Q_{\mathrm{common\text{-}rare}} = (1 - \lambda)\, Q_{\mathrm{summary;common}} + \lambda\, Q_{\mathrm{summary;rare}},$$

where $\lambda = \mathrm{SD}(Q_{\mathrm{summary;rare}}) / [\mathrm{SD}(Q_{\mathrm{summary;rare}}) + \mathrm{SD}(Q_{\mathrm{summary;common}})]$. The asymptotic null distribution of $Q_{\mathrm{common\text{-}rare}}$ is a mixture of chi-squares and can be approximated with the empirical LD matrices of common and rare variants.

Appendix B: Fast estimation of the p value of GAUSS

We employ a fast two-step approach that uses a normal copula to estimate p values for GAUSS. We first estimate the correlation structure ($\hat{V}_H$) among the Z statistics $z_1, z_2, \ldots, z_m$ under the null hypothesis of no association through a small number of simulations by using a reference LD panel (see Appendix C). Then we estimate the p value of the GAUSS test statistic as follows:

  1. starting from $r = 1$, in the $r$th step, generate a random $m$-vector $Z_r$ from the multivariate normal distribution $N(0, \hat{V}_H)$;
  2. calculate the GAUSS statistic with $Z_r$ as above, $\mathrm{GAUSS}(H)_r$;
  3. repeat steps 1 and 2 $R$ times, say $R = 10^6$;
  4. estimate the p value for the observed $\mathrm{GAUSS}(H)$ as $\sum_{r=1}^{R} I\{\mathrm{GAUSS}(H)_r > \mathrm{GAUSS}(H)\} / R$, where $I\{\cdot\}$ is the indicator function.
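The four steps above can be sketched in plain Python. Two points are assumptions rather than facts taken from this appendix: the statistic is taken to be the maximum, over subset sizes k, of the sum of the k largest Z scores divided by sqrt(k) (our reading of the subset-based statistic defined in the material and methods), and multivariate normal vectors are drawn through a hand-written Cholesky factor. All function names are hypothetical, and the pure-Python loop is far slower than an optimized implementation.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular factor L with L L^T = matrix (matrix must be positive definite)."""
    m = len(matrix)
    lower = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def subset_statistic(z):
    """Assumed statistic: max over k of (sum of the k largest z) / sqrt(k)."""
    best, running = -math.inf, 0.0
    for k, value in enumerate(sorted(z, reverse=True), start=1):
        running += value
        best = max(best, running / math.sqrt(k))
    return best


def monte_carlo_p(observed, v_hat, n_draws, rng):
    """Steps 1-4: share of null draws whose statistic exceeds the observed one."""
    lower = cholesky(v_hat)
    m = len(v_hat)
    exceed = 0
    for _ in range(n_draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]  # independent N(0, 1) draws
        # Multiply by L so that z has covariance v_hat.
        z = [sum(lower[i][k] * e[k] for k in range(i + 1)) for i in range(m)]
        if subset_statistic(z) > observed:
            exceed += 1
    return exceed / n_draws
```

A call such as `monte_carlo_p(2.5, [[1.0, 0.3], [0.3, 1.0]], 10_000, random.Random(0))` returns the estimated p value for an observed statistic of 2.5 in a two-gene set; passing a seeded `random.Random` keeps the estimate reproducible.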

Although it is a simulation-based method, the algorithm can be efficiently implemented because it only requires generating multivariate normal (MVN) random vectors. For example, generating 1 million MVN random vectors for a gene set with 100 genes (m = 100) requires 2 CPU seconds on an Intel Xeon 2.80 GHz computer.

We also implemented an adaptive resampling scheme that performs fewer iterations if the p value is large (say >0.005). For a given GAUSS test statistic, we first use 1,000 iterations to estimate the p value. If the estimated p value is ≤0.005, then we perform $10^6$ iterations to estimate the p value more accurately. Thus, if the true p value is large (>0.005), the above algorithm estimates it in less than 1 CPU second, and if the true p value is small, the algorithm takes 161 CPU seconds on average. If the true p value is very small, $10^6$ resamples cannot estimate it. For this, we use a GPD-based approximation approach (supplemental methods, section B).
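The two-stage logic can be sketched as a small wrapper around any null sampler. The function name, the `draw_null_statistic` callback, and the choice to run the second stage on fresh draws are assumptions made for illustration; the GPD-based fallback for very small p values is not included.

```python
import random


def adaptive_p_value(observed, draw_null_statistic, rng: random.Random,
                     pilot_draws=1_000, full_draws=1_000_000, threshold=0.005):
    """Pilot run first; do the long run only if the pilot p value is at or below threshold.

    draw_null_statistic(rng) must return one statistic simulated under the null.
    Returns the estimated p value and the number of draws behind it.
    """
    def estimate(n_draws):
        exceed = sum(draw_null_statistic(rng) > observed for _ in range(n_draws))
        return exceed / n_draws

    pilot = estimate(pilot_draws)
    if pilot > threshold:
        return pilot, pilot_draws  # large p value: the cheap estimate is enough
    return estimate(full_draws), full_draws
```

Most gene sets are expected to stop after the pilot stage, which is why the average cost per gene set stays low even though the second stage is a thousand times longer.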

Appendix C: Reference data and the estimation of correlation structure VH

Given the GWAS summary statistic for a phenotype, to obtain the GAUSS p value for a gene set, we have used the reference panel twice. First, we used the reference panel to extract LD across variants in a gene or region. This LD information is used to construct the null distribution and evaluate the gene-based p value. We use emeraLD55 (see web resources) for fast extraction of LD from variant call format (VCF) files. Second, we used the reference panel to estimate the null correlation matrix VH among the Z statistics. This is a pre-computed matrix that needs to be computed once from the reference data and can be reused for future applications. To estimate this matrix, we generated a null continuous phenotype from the standard normal distribution, computed the gene-based p values for the annotated genes by using SKAT-Common-Rare, and converted them to Z statistics. We repeated this procedure 1,000 times and calculated VH as the Pearson’s correlation between 1,000 null Z statistic values. This approach greatly reduces the computational burden of GAUSS because it does not need to estimate VH for every iteration or gene set.
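The final step of this procedure, turning replicated null Z statistics into a correlation matrix, can be sketched as below. The input layout (one row per null replicate, one column per gene) and the function name are assumptions, and the simulation of null phenotypes and the gene-based tests themselves are not shown.

```python
import math


def correlation_matrix(null_z):
    """Pearson correlation between genes; rows are null replicates, columns are genes."""
    n = len(null_z)
    m = len(null_z[0])
    means = [sum(row[j] for row in null_z) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in null_z]
    # Column norms of the centered data; a constant column would give a zero norm.
    norms = [math.sqrt(sum(row[j] ** 2 for row in centered)) for j in range(m)]
    corr = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(a, m):
            cross = sum(row[a] * row[b] for row in centered)
            corr[a][b] = corr[b][a] = cross / (norms[a] * norms[b])
    return corr
```

With 1,000 replicates, as in the text, each entry of the matrix is estimated from 1,000 pairs of null Z statistics.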

Appendix D: Gene-based tests

The choice of gene-based tests for constructing the p values is critical to the interpretation and utility of GAUSS analysis. Although we have used SKAT-Common-Rare throughout the analysis, GAUSS can accept p values from any gene-based test, including TWAS-FUSION, prediXcan, and others. The estimated correlation structure VH between the Z statistics can change substantially depending on the gene-based test used, so it needs to be re-estimated for each type of test. In the supplemental methods (section E), we demonstrate that GAUSS can use TWAS-FUSION p values, with E. coli infection (EC) as an example trait (Figure S14), which we previously used to demonstrate the performance of GAUSS (see results). Using the UK Biobank GWAS summary statistics for EC generated by SAIGE, we first estimated the gene-based p values for the genes expressed in whole blood in the Genotype-Tissue Expression study (GTEx v.7) and subsequently calculated the gene set associations for 5,917 GO gene sets (C5). Although we did not find any significant associations, the analysis demonstrates that GAUSS can also be used with gene-based test p values calculated from expression imputation tests such as TWAS-FUSION. In the GAUSS software package, we have made available estimated VH for Z statistics derived from the p values of SKAT-Common-Rare and TWAS-FUSION tests across 48 different tissues as reported in GTEx v.7.

Data and code availability

The GAUSS R package is publicly accessible and can be downloaded from GitHub (see web resources). The results and summary statistics from the UK Biobank data analysis are presented in an online visualization platform (PathWeb; see web resources) and can be downloaded.

Web resources

Supplemental information

Document S1. Figures S1–S14, Table S1, and supplemental methods
Document S2. Article plus supplemental information

References

  1. Buniello A., MacArthur J.A.L., Cerezo M., Harris L.W., Hayhurst J., Malangone C., McMahon A., Morales J., Mountjoy E., Sollis E. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47(D1):D1005–D1012. doi: 10.1093/nar/gky1120.
  2. Manolio T.A., Collins F.S., Cox N.J., Goldstein D.B., Hindorff L.A., Hunter D.J., McCarthy M.I., Ramos E.M., Cardon L.R., Chakravarti A. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
  3. Liu J.Z., McRae A.F., Nyholt D.R., Medland S.E., Wray N.R., Brown K.M., Hayward N.K., Montgomery G.W., Visscher P.M., Martin N.G., Macgregor S., AMFS Investigators. A versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet. 2010;87:139–145. doi: 10.1016/j.ajhg.2010.06.009.
  4. Cantor R.M., Lange K., Sinsheimer J.S. Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am. J. Hum. Genet. 2010;86:6–22. doi: 10.1016/j.ajhg.2009.11.017.
  5. Fridley B.L., Biernacka J.M. Gene set analysis of SNP data: benefits, challenges, and future directions. Eur. J. Hum. Genet. 2011;19:837–843. doi: 10.1038/ejhg.2011.57.
  6. Yu K., Li Q., Bergen A.W., Pfeiffer R.M., Rosenberg P.S., Caporaso N., Kraft P., Chatterjee N. Pathway analysis by adaptive combination of P-values. Genet. Epidemiol. 2009;33:700–709. doi: 10.1002/gepi.20422.
  7. Pers T.H. Gene set analysis for interpreting genetic studies. Hum. Mol. Genet. 2016;25(R2):R133–R140. doi: 10.1093/hmg/ddw249.
  8. Lee P.H., O’Dushlaine C., Thomas B., Purcell S.M. INRICH: interval-based enrichment analysis for genome-wide association studies. Bioinformatics. 2012;28:1797–1799. doi: 10.1093/bioinformatics/bts191.
  9. Jia P., Wang L., Meltzer H.Y., Zhao Z. Pathway-based analysis of GWAS datasets: effective but caution required. Int. J. Neuropsychopharmacol. 2011;14:567–572. doi: 10.1017/S1461145710001446.
  10. O’Dushlaine C., Kenny E., Heron E.A., Segurado R., Gill M., Morris D.W., Corvin A. The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics. 2009;25:2762–2763. doi: 10.1093/bioinformatics/btp448.
  11. Mooney M.A., Nigg J.T., McWeeney S.K., Wilmot B. Functional and genomic context in pathway analysis of GWAS data. Trends Genet. 2014;30:390–400. doi: 10.1016/j.tig.2014.07.004.
  12. Pan W., Kwak I.-Y., Wei P. A Powerful Pathway-Based Adaptive Test for Genetic Association with Common or Rare Variants. Am. J. Hum. Genet. 2015;97:86–98. doi: 10.1016/j.ajhg.2015.05.018.
  13. de Leeuw C.A., Mooij J.M., Heskes T., Posthuma D. MAGMA: Generalized Gene-Set Analysis of GWAS Data. PLoS Comput. Biol. 2015;11:e1004219. doi: 10.1371/journal.pcbi.1004219.
  14. Sun R., Hui S., Bader G.D., Lin X., Kraft P. Powerful gene set analysis in GWAS with the Generalized Berk-Jones statistic. PLoS Genet. 2019;15:e1007530. doi: 10.1371/journal.pgen.1007530.
  15. Zhang H., Wheeler W., Hyland P.L., Yang Y., Shi J., Chatterjee N., Yu K. A Powerful Procedure for Pathway-Based Meta-analysis Using Summary Statistics Identifies 43 Pathways Associated with Type II Diabetes in European Populations. PLoS Genet. 2016;12:e1006122. doi: 10.1371/journal.pgen.1006122.
  16. Moskvina V., Schmidt K.M., Vedernikov A., Owen M.J., Craddock N., Holmans P., O’Donovan M.C. Permutation-based approaches do not adequately allow for linkage disequilibrium in gene-wide multi-locus association analysis. Eur. J. Hum. Genet. 2012;20:890–896. doi: 10.1038/ejhg.2012.8.
  17. Holmans P., Green E.K., Pahwa J.S., Ferreira M.A., Purcell S.M., Sklar P., Owen M.J., O’Donovan M.C., Craddock N., Wellcome Trust Case-Control Consortium. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am. J. Hum. Genet. 2009;85:13–24. doi: 10.1016/j.ajhg.2009.05.011.
  18. Knijnenburg T.A., Wessels L.F.A., Reinders M.J.T., Shmulevich I. Fewer permutations, more accurate P-values. Bioinformatics. 2009;25:i161–i168. doi: 10.1093/bioinformatics/btp211.
  19. Pickands J. Statistical Inference Using Extreme Order Statistics. Ann. Stat. 1975;3:119–131.
  20. Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
  21. Liberzon A., Birger C., Thorvaldsdóttir H., Ghandi M., Mesirov J.P., Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1:417–425. doi: 10.1016/j.cels.2015.12.004.
  22. Wu M.C., Lee S., Cai T., Li Y., Boehnke M., Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
  23. Ionita-Laza I., Lee S., Makarov V., Buxbaum J.D., Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am. J. Hum. Genet. 2013;92:841–853. doi: 10.1016/j.ajhg.2013.04.015.
  24. Gamazon E.R., Wheeler H.E., Shah K.P., Mozaffari S.V., Aquino-Michaels K., Carroll R.J., Eyler A.E., Denny J.C., Nicolae D.L., Cox N.J., Im H.K., GTEx Consortium. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 2015;47:1091–1098. doi: 10.1038/ng.3367.
  25. Gusev A., Ko A., Shi H., Bhatia G., Chung W., Penninx B.W., Jansen R., de Geus E.J., Boomsma D.I., Wright F.A. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 2016;48:245–252. doi: 10.1038/ng.3506.
  26. Lumley T., Brody J., Peloso G., Morrison A., Rice K. FastSKAT: Sequence kernel association tests for very large sets of markers. Genet. Epidemiol. 2018;42:516–527. doi: 10.1002/gepi.22136.
  27. Abecasis G.R., Auton A., Brooks L.D., DePristo M.A., Durbin R.M., Handsaker R.E., Kang H.M., Marth G.T., McVean G.A., The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
  28. Zhou W., Nielsen J.B., Fritsche L.G., Dey R., Gabrielsen M.E., Wolford B.N., LeFaive J., VandeHaar P., Gagliano S.A., Gifford A. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 2018;50:1335–1341. doi: 10.1038/s41588-018-0184-y.
  29. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548.
  30. Masamoto Y., Arai S., Sato T., Takamoto I., Kubota N., Kadowaki T., Kurokawa M. Adipocyte-Derived Adiponectin Positively Regulates Exit from Quiescence of Hematopoietic Stem Cells By Potentiating mTORC1 Activation after Myelotoxic Injury. Blood. 2015;126. doi: 10.1182/blood.v126.23.777.777.
  31. Desruisseaux M.S., Nagajyothi, Trujillo M.E., Tanowitz H.B., Scherer P.E. Adipocyte, adipose tissue, and infectious disease. Infect. Immun. 2007;75:1066–1078. doi: 10.1128/IAI.01455-06.
  32. Yao J., Rock C.O. Exogenous fatty acid metabolism in bacteria. Biochimie. 2017;141:30–39. doi: 10.1016/j.biochi.2017.06.015.
  33. Fitzpatrick L.R., Small J.S., Poritz L.S., McKenna K.J., Koltun W.A. Enhanced intestinal expression of the proteasome subunit low molecular mass polypeptide 2 in patients with inflammatory bowel disease. Dis. Colon Rectum. 2007;50:337–348. doi: 10.1007/s10350-006-0796-7.
  34. Arlt A., Bauer I., Schafmayer C., Tepel J., Müerköster S.S., Brosch M., Röder C., Kalthoff H., Hampe J., Moyer M.P. Increased proteasome subunit protein expression and proteasome activity in colon cancer relate to an enhanced activation of nuclear factor E2-related factor 2 (Nrf2). Oncogene. 2009;28:3983–3996. doi: 10.1038/onc.2009.264.
  35. Kwon C.H., Park H.J., Choi Y.R., Kim A., Kim H.W., Choi J.H., Hwang C.S., Lee S.J., Choi C.I., Jeon T.Y. PSMB8 and PBK as potential gastric cancer subtype-specific biomarkers associated with prognosis. Oncotarget. 2016;7:21454–21468. doi: 10.18632/oncotarget.7411.
  36. Wu F., Dassopoulos T., Cope L., Maitra A., Brant S.R., Harris M.L., Bayless T.M., Parmigiani G., Chakravarti S. Genome-wide gene expression differences in Crohn’s disease and ulcerative colitis from endoscopic pinch biopsies: insights into distinctive pathogenesis. Inflamm. Bowel Dis. 2007;13:807–821. doi: 10.1002/ibd.20110.
  37. Goudey B., Abraham G., Kikianty E., Wang Q., Rawlinson D., Shi F., Haviv I., Stern L., Kowalczyk A., Inouye M. Interactions within the MHC contribute to the genetic architecture of celiac disease. PLoS ONE. 2017;12:e0172826. doi: 10.1371/journal.pone.0172826.
  38. Muraro D., Simmons A. An integrative analysis of gene expression and molecular interaction data to identify dys-regulated sub-networks in inflammatory bowel disease. BMC Bioinformatics. 2016;17:42. doi: 10.1186/s12859-016-0886-z.
  39. Fritsche L.G., Gruber S.B., Wu Z., Schmidt E.M., Zawistowski M., Moser S.E., Blanc V.M., Brummett C.M., Kheterpal S., Abecasis G.R., Mukherjee B. Association of Polygenic Risk Scores for Multiple Cancers in a Phenome-wide Study: Results from The Michigan Genomics Initiative. Am. J. Hum. Genet. 2018;102:1048–1061. doi: 10.1016/j.ajhg.2018.04.001.
  40. Chang G. Multidrug resistance ABC transporters. FEBS Lett. 2003;555:102–105. doi: 10.1016/s0014-5793(03)01085-8.
  41. Warren H.R., Evangelou E., Cabrera C.P., Gao H., Ren M., Mifsud B., Ntalla I., Surendran P., Liu C., Cook J.P. Genome-wide association analysis identifies novel blood pressure loci and offers biological insights into cardiovascular risk. Nat. Genet. 2017;49:403–415. doi: 10.1038/ng.3768.
  42. Tomer Y., Dolan L.M., Kahaly G., Divers J., D’Agostino R.B., Jr., Imperatore G., Dabelea D., Marcovina S., Black M.H., Pihoker C., SEARCH for Diabetes in Youth Study. Genome wide identification of new genes and pathways in patients with both autoimmune thyroiditis and type 1 diabetes. J. Autoimmun. 2015;60:32–39. doi: 10.1016/j.jaut.2015.03.006.
  43. Battle A., Brown C.D., Engelhardt B.E., Montgomery S.B., GTEx Consortium. Laboratory, Data Analysis & Coordinating Center (LDACC)—Analysis Working Group. Statistical Methods groups—Analysis Working Group. Enhancing GTEx (eGTEx) groups. NIH Common Fund. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. doi: 10.1038/nature24277.
  44. He Z., Xu B., Lee S., Ionita-Laza I. Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data. Am. J. Hum. Genet. 2017;101:340–352. doi: 10.1016/j.ajhg.2017.07.011.
  45. Yu Y., Xia L., Lee S., Zhou X., Stringham H.M., Boehnke M., Mukherjee B. Subset-Based Analysis Using Gene-Environment Interactions for Discovery of Genetic Associations across Multiple Studies or Phenotypes. Hum. Hered. 2018;83:283–314. doi: 10.1159/000496867.
  46. Dutta D., Scott L., Boehnke M., Lee S. Multi-SKAT: General framework to test for rare-variant association with multiple phenotypes. Genet. Epidemiol. 2019;43:4–23. doi: 10.1002/gepi.22156.
  47. Dutta D., Gagliano Taliun S.A., Weinstock J.S., Zawistowski M., Sidore C., Fritsche L.G., Cucca F., Schlessinger D., Abecasis G.R., Brummett C.M., Lee S. Meta-MultiSKAT: Multiple phenotype meta-analysis for region-based association test. Genet. Epidemiol. 2019;43:800–814. doi: 10.1002/gepi.22248.
  48. Lee D., Bigdeli T.B., Riley B.P., Fanous A.H., Bacanu S.-A. DIST: direct imputation of summary statistics for unmeasured SNPs. Bioinformatics. 2013;29:2925–2927. doi: 10.1093/bioinformatics/btt500.
  49. Xu Z., Duan Q., Yan S., Chen W., Li M., Lange E., Li Y. DISSCO: direct imputation of summary statistics allowing covariates. Bioinformatics. 2015;31:2434–2442. doi: 10.1093/bioinformatics/btv168.
  50. Pasaniuc B., Zaitlen N., Shi H., Bhatia G., Gusev A., Pickrell J., Hirschhorn J., Strachan D.P., Patterson N., Price A.L. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics. 2014;30:2906–2914. doi: 10.1093/bioinformatics/btu416.
  51. Bulik-Sullivan B.K., Loh P.R., Finucane H.K., Ripke S., Yang J., Patterson N., Daly M.J., Price A.L., Neale B.M., Schizophrenia Working Group of the Psychiatric Genomics Consortium. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211.
  52. Turley P., Walters R.K., Maghzian O., Okbay A., Lee J.J., Fontana M.A., Nguyen-Viet T.A., Wedow R., Zacher M., Furlotte N.A. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 2018;50:229–237. doi: 10.1038/s41588-017-0009-4.
  53. Lloyd-Jones L.R., Zeng J., Sidorenko J., Yengo L., Moser G., Kemper K.E., Wang H., Zheng Z., Magi R., Esko T. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 2019;10:5086. doi: 10.1038/s41467-019-12653-0.
  54. Lee S., Teslovich T.M., Boehnke M., Lin X. General framework for meta-analysis of rare variants in sequencing association studies. Am. J. Hum. Genet. 2013;93:42–53. doi: 10.1016/j.ajhg.2013.05.010.
  55. Quick C., Fuchsberger C., Taliun D., Abecasis G., Boehnke M., Kang H.M. emeraLD: rapid linkage disequilibrium estimation with massive datasets. Bioinformatics. 2019;35:164–166. doi: 10.1093/bioinformatics/bty547.


Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
