Author manuscript; available in PMC: 2013 Dec 6.
Published in final edited form as: Genomics. 2011 Apr 30;98(1):10.1016/j.ygeno.2011.04.006. doi: 10.1016/j.ygeno.2011.04.006

Gene set analysis of genome-wide association studies: methodological issues and perspectives

Lily Wang a,*, Peilin Jia b,c, Russell D Wolfinger d, Xi Chen e, Zhongming Zhao b,c,f,*
PMCID: PMC3852939  NIHMSID: NIHMS298464  PMID: 21565265

Abstract

Recent studies have demonstrated that gene set analysis, which tests disease association with genetic variants in a group of functionally related genes, is a promising approach for analyzing and interpreting genome-wide association studies (GWAS) data. These approaches aim to increase power by combining association signals from multiple genes in the same gene set. In addition, gene set analysis can also shed more light on the biological processes underlying complex diseases. However, current approaches for gene set analysis are still in an early stage of development in that analysis results are often prone to sources of bias, including gene set size and gene length, linkage disequilibrium patterns and the presence of overlapping genes. In this paper, we provide an in-depth review of the gene set analysis procedures, along with parameter choices and the particular methodology challenges at each stage. In addition to providing a survey of recently developed tools, we also classify the analysis methods into larger categories and discuss their strengths and limitations. In the last section, we outline several important areas for improving the analytical strategies in gene set analysis.

Keywords: Genome-wide association study, Gene set, Pathway, Gene-set enrichment analysis, Statistical significance, Complex disease

Introduction

Recently, genome-wide association studies (GWAS), which typically test disease associations with half a million to a few million single nucleotide polymorphisms (SNPs) across the human genome in hundreds to thousands of samples, have successfully identified many genetic variants contributing to susceptibility to complex diseases. However, the variants identified so far, individually or in combination, account for only a small proportion of the inherited component of disease risk [1]. A possible explanation is that, because of the large number of genetic polymorphisms examined in GWAS and the massive number of tests conducted, real but weak associations are likely to be missed after multiple comparison adjustment (e.g., correction for half a million tests in a typical GWAS).

To help prioritize association signals from GWAS and to better understand the biological themes underlying complex diseases, gene set analysis has become increasingly popular. Instead of conducting analysis for single SNPs or single genes, gene set analysis tests disease association with genetic variants in a group of functionally related genes, such as those belonging to the same biological pathway. One possible cause of complex diseases is a change in the activity of biological pathways: mutations in a number of different genes each contribute a modest amount to disease predisposition and work together to disrupt normal biological processes.

Current approaches for gene set analysis are still in an early stage of development. When different analysis methods are used, the resulting significant gene sets often vary substantially, even when the same dataset is used [2,3]. One possible reason might be the lack of statistical power in the tests, which are often borrowed from gene set analysis for microarray gene expression data. For many diseases, compared to the amount of differentiation in gene expression levels, effect sizes for SNPs that contribute to disease risk or are in linkage disequilibrium (LD) with the causal variants are typically much smaller. In a recent simulation study [4], we found for gene sets consisting of markers weakly associated with disease (nominal P-value < 0.05), all three gene set analysis methods examined – Gene Set Enrichment Analysis (GSEA) [5], Fisher’s exact test, and SNP Ratio Test [6] – lacked statistical power for detecting disease associated gene sets. Several recent studies also indicated that gene set analysis results are often prone to sources of bias including gene set size, LD patterns and overlapping genes [3,5,7,8]. Before gene set based approaches are used to draw significant conclusions, the limitations in these methods must be addressed first.

In this review, we discuss the detailed procedures for gene set analysis, along with parameter choices and the particular methodological challenges at each stage. In addition to providing a survey of recently developed tools, we also classify the analysis methods into larger categories and discuss their strengths and limitations. As many new methods are expected to be developed quickly due to the strong demand of initial and secondary (or advanced) analysis of numerous GWAS datasets, our goal is not to provide a comprehensive list of gene set analysis methods. Instead, we aim to provide readers with some of our insights so that they can assess and then use the most appropriate methods for their specific needs. In the last section, we outline several important areas for improving the analytical strategies in gene set analysis. Other recent reviews on gene set analysis of GWAS are Wang et al. (2010) [9] and Cantor et al. (2010) [7].

Methodological issues

Figure 1 outlines the critical steps for assessing statistical significance of disease associations with gene sets: 1) Preprocess data and define the gene sets to be tested, 2) formulate a hypothesis, 3) construct corresponding statistical tests, and 4) assess the statistical significance of the study results. We next discuss each of these steps in order.

Fig 1. Work flow for gene set analysis of GWAS datasets.

From SNPs to genes

When defining gene boundaries, different criteria (e.g., 500kb [5], 200kb [10], 20kb [11], and 5kb [12] both upstream and downstream of the gene coding regions) have been proposed in the literature. Considering LD and gene regulation patterns, investigators often define a gene region to include both the genic region (core part) and the boundary regions (upstream and downstream of the gene). More sophisticated approaches, such as including SNPs that are in LD with the gene, have also been developed [13,14]. These strategies aim to cover SNP markers that play regulatory roles in gene expression and/or link to causal variants within the same LD block. However, these approaches also include more irrelevant SNPs. Thus, they may not only dilute potential signal strength for a gene set but also increase computational burden dramatically, especially for gene sets with a large number of genes. One potentially promising strategy is to take advantage of the information from gene expression studies. Veyrieras et al. [15] estimated that the majority of genetic variants influencing gene expression are located within 20kb of the genes. Recently, to identify T2D associated pathways, Zhong et al. [16] assessed the impact of the SNPs on gene expression in liver and adipose tissues and summarized each gene by the SNP significantly associated with the gene’s transcript abundance. For general reference, Gamazon et al. [17] developed the SCAN database, which provides information on mapping genetic variants associated with gene expression based on the samples in the HapMap project [18,19]. More comprehensive databases will be developed in the future, for example, those for expression quantitative trait loci (eQTL, regions of the genome that impact gene expression) measured in disease relevant tissues. Thus, we expect that utilizing the information from gene expression studies will improve the power of the gene set analysis approach for GWAS.
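The boundary-based mapping discussed above can be illustrated with a minimal Python sketch. It assigns each SNP to every gene whose region, extended by a symmetric flanking window, contains it. The 20 kb default follows the estimate of Veyrieras et al. [15]; the function name, the tuple layouts, and the toy coordinates are illustrative assumptions and do not come from any published tool.

```python
from collections import defaultdict

def map_snps_to_genes(snps, genes, flank_bp=20_000):
    """Assign each SNP to every gene whose flanked region contains it.

    snps:  iterable of (snp_id, chrom, position)      -- hypothetical layout
    genes: iterable of (gene_id, chrom, start, end)   -- hypothetical layout
    """
    # Index flanked gene regions by chromosome.
    regions = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        regions[chrom].append((gene_id, start - flank_bp, end + flank_bp))

    mapping = defaultdict(list)
    for snp_id, chrom, pos in snps:
        for gene_id, lo, hi in regions.get(chrom, []):
            if lo <= pos <= hi:
                mapping[gene_id].append(snp_id)
    return dict(mapping)

# Toy data: two neighbouring genes on chromosome 1 (made-up coordinates).
genes = [("GENE_A", "1", 100_000, 150_000), ("GENE_B", "1", 160_000, 200_000)]
snps = [("rs1", "1", 85_000), ("rs2", "1", 155_000),
        ("rs3", "1", 400_000), ("rs4", "2", 120_000)]

mapping_20kb = map_snps_to_genes(snps, genes)               # wider window
mapping_5kb = map_snps_to_genes(snps, genes, flank_bp=5_000)  # narrower window
```

In this toy example the intergenic SNP `rs2` falls inside the flanked region of both genes, so it is mapped twice. Wider windows capture more potentially regulatory SNPs, but they also add irrelevant SNPs and increase the number of SNPs shared between neighbouring genes.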

From genes to gene sets

The Kyoto Encyclopedia of Genes and Genomes (KEGG) [20] and Gene Ontology (GO) [21] are frequently used gene set annotation databases. When GO terms are used, gene sets categorized into biological process categories have often been selected for gene set analysis, since the other two categories (molecular function and cellular components) are not similar to the typical biological pathways such as those from KEGG. The MSigDB database [22] includes comprehensive gene sets from both the KEGG and GO databases, as well as from other sources such as chromosome and cytogenetic band regions, gene sets collected from expert knowledge in literature, cis-regulatory motifs, and co-expressed cancer-associated genes. In addition, other sources such as the PANTHER Classification System [23] and REACTOME [24] also provide publicly available gene set information. Note that GO terms are organized in a hierarchical structure, and substantial overlap of component genes is expected between parent and child nodes. The MSigDB collection has partially solved this problem by removing the gene sets that have the same member genes as their parent nodes or their sibling nodes.

Redundancy among gene sets has often been observed because, by their nature, gene sets such as pathways are biological systems in which a gene may function in multiple ways and thus may appear multiple times in functional gene sets. Although at the systems biology level this reflects the crosstalk between gene sets and the complexity of biological systems, it causes an overlap of member genes and redundant information among gene sets, thus making the results of gene set analysis more difficult to interpret.

Another issue is that gene set annotation is still incomplete. So far, only about 5000 human genes have been annotated to the KEGG pathways, which are most frequently used in the literature. Thus, in gene set analysis of GWAS, all non-annotated genes will be automatically filtered out. A potential improvement is to use protein-protein interaction (PPI) data. As of March 4, 2010, there were approximately 11,000 proteins included in an integrated PPI network analysis platform, Protein Interaction Network Analysis (PINA), which collected and annotated six other public PPI databases (MINT, IntAct, DIP, BioGRID, HPRD, and MIPS/MPact) [25]. This provides much more annotation information about human proteins than does KEGG, and has been used for dense-module searching (DMS) of enriched association signals from one or multiple GWAS datasets [26]. Another advantage in the DMS approach is its flexibility in defining gene set size, which overcomes a potential limitation of the fixed size in KEGG or other biological pathways. However, DMS utilizes the information only from PPIs, rather than from gene regulation as in typical biological pathways. Even so, it highlights the degree of incompleteness of our current knowledge about the human genes and their regulation.

Formulating hypothesis

In the analysis of gene expression data, Tian et al. [27] formulated two statistical hypotheses for testing coordinated association between a group of genes with a phenotype of interest. In the context of GWAS analysis, they are

Competitive null hypothesis (Q1) - The genes in a gene-set show the same magnitude of associations with the disease phenotype compared with genes in the rest of the genome;

Self-contained null hypothesis (Q2) - The genes in a gene-set are not associated with the disease phenotype.

A third null hypothesis (Q3) - none of the gene sets considered is associated with the phenotype - has also been proposed recently [28,29]. In contrast to Q1 and Q2, which test for individual gene sets, Q3 tests the entire dataset. For tests of individual gene sets, Goeman and Buhlmann [30] classified tests corresponding to Q1 and Q2 as competitive and self-contained tests, respectively. While a competitive test compares disease association test statistics for genes in the gene set with those for genes in the rest of the genome, a self-contained test directly tests gene set association with disease and does not depend on genes outside the gene set. Table 1 lists some examples of competitive tests for gene set analysis of GWAS, including GSEA, over-representation analysis based on Fisher’s exact test (hypergeometric test) and their extensions such as ALIGATOR [31] and GSA-SNP [32]. Table 2 lists some examples of self-contained tests, including the SNP Ratio Test [6], GRASS [33] and the SPCA method [12]. When the “real” causal SNPs are fully contained in one particular gene set, testing Q1 and testing Q2 are approximately equivalent. However, when SNPs in multiple gene sets are associated with the disease or when causal genes are shared by multiple gene sets, using competitive tests that compare gene set association signals with the rest of the genome may result in loss of power [8,34]. For example, Tintle et al. [35] found the SUMSTAT statistics (based on the maxmean statistic [36]) performed better than GSEA and Fisher’s exact test.

Table 1.

Some examples of competitive tests, which compare disease associations for the genes in a gene-set with genes in the rest of the genome.

| Reference | Year | Software | Input data for the method | Condense SNP information within each gene | Gene set test statistic | Significance assessment |
| --- | --- | --- | --- | --- | --- | --- |
| Gene-based methods | | | | | | |
| Wang et al. [5] | 2007 | GSEA http://www.openbioinformatics.org/gengen | Genotype | Most significant SNP P-value; Simes’ combination test | Modified Kolmogorov-Smirnov (KS) statistic | Sample permutations |
| Askland et al. [77] | 2009 | EVA (Exploratory Visual Analysis) http://www.exploratoryvisualanalysis.org/ | SNP P-values | Most significant SNP P-value | Fisher’s exact test | Hypergeometric distribution |
| Guo et al. [51] | 2009 | | SNP P-values | Most significant SNP P-value | Modified Kolmogorov-Smirnov statistic | SNP permutations |
| Holmans et al. [31] | 2009 | ALIGATOR (Association List Go AnnoTatOR) http://x004.psycm.uwcm.ac.uk/~peter/ | SNP P-values | Most significant SNP P-value with correction for gene size | Modified Fisher’s Exact test | Gene re-samplings |
| Freudenberg et al. [47] | 2010 | | Genotype | Most significant SNP P-value | Odds ratio for the presence of SNP associations; number of loci in a category that have SNP associations | Sample permutations |
| Jia et al. [26] | 2010 | dmGWAS http://bioinfo.mc.vanderbilt.edu/dmGWAS.html | Genotype | Most significant SNP P-value | Z-score | Gene randomization and sample permutations |
| Luo et al. [70] | 2010 | | SNP P-values | Linear combination test; Quadratic test; Decorrelation test of SNP P-values | Linear combination test; Quadratic test; Decorrelation test of Gene P-values | Normal or Chi-square distribution |
| Nam et al. [32] | 2010 | GSA-SNP http://gsa.muldas.org | SNP P-values | Second best SNP P-value | Z-statistic, maxmean statistic [36], and modified KS statistic [5] | Gene re-samplings and sample permutations |
| Peng et al. [41] | 2010 | | SNP P-values | Fisher’s combined P-value; Sidak’s correction to the most significant SNP; Simes’ combination test; or FDR method | Fisher’s exact test | Hypergeometric distribution |
| Zhang et al. [87] | 2010 | i-GSEA4GWAS http://gsea4gwas.psych.ac.cn/ | SNP P-values | Most significant SNP P-value | Modified Kolmogorov-Smirnov (KS) statistic | SNP permutations |
| SNP-based methods | | | | | | |
| Holden et al. [88] | 2008 | GSEA-SNP http://nr.no/pages/samba/area_emr_smbi_gseasnp | Genotype | | Modified KS statistic | Sample permutations |
| Schwarz et al. [89] | 2008 | SNPtoGO http://webtools.imbs.uni-luebeck.de/snptogo | SNP P-values | | Fisher’s exact test | Hypergeometric distribution |
| Medina et al. [90] | 2009 | GESBAP (GEne Set Based Analysis of Polymorphisms) http://bioinfo.cipf.es/gesbap/www/index.jsp | SNP P-values | | Sequential applications of Fisher's exact test on different partitions of the gene list | Hypergeometric distribution corrected by FDR [91] |

Table 2.

Some examples of self-contained tests, which test for disease associations for genes in a gene-set directly.

| Reference | Year | Software | Input data for the method | Condense SNP information within each gene | Gene set test statistic | Significance assessment |
| --- | --- | --- | --- | --- | --- | --- |
| Gene-based methods | | | | | | |
| Yu et al. [37] | 2009 | | SNP P-values | Adaptive rank truncated product statistic (ARTP) method | Adaptive rank truncated product statistic (ARTP) | An efficient single-level permutation algorithm |
| Chen et al. [33] | 2010 | GRASS (Gene set Ridge regression in ASsociation Studies) http://linchen.fhcrc.org/grass.html | Genotype | Principal components | | Sample permutations |
| SNP-based methods | | | | | | |
| Dinu et al. [92] | 2007 | | Genotype | | U-statistic [93] | Sample permutations |
| Chai et al. [34] | 2009 | | Genotype | | Fisher’s combined P-value, corrected by Brown’s approximation | Chi-square distribution |
| O’Dushlaine et al. [6] | 2009 | SNP Ratio Test http://sourceforge.net/projects/snpratiotest/ | Genotype | | SNP ratio test | Sample permutations |
| De la Cruz et al. [94] | 2009 | | Genotype | | Fisher’s combined P-value, with rank truncation and weights | Sample permutations |
| Chen et al. [12] | 2010 | | Genotype | | Supervised Principal Components | Mixture distribution |
| Eleftherohorinou et al. [73] | 2010 | | Genotype | | Cumulative trend test statistics – sum of all single SNP P-values in the gene set | Fit skewed normal distribution to 1000 sample permutations |
| Ruano et al. [68] | 2010 | | Genotype | | Fisher’s combined P-value | Sample permutations |
| Wang et al. [55] | 2011 | | SNP P-values | | t-statistic in mixed model | Empirical null distribution |

Constructing test statistics

A test statistic can be constructed with units based on either gene or SNP association signals. We refer to them as gene-based and SNP-based methods, respectively. In the former, the P-values of the SNPs located within each gene are first summarized by gene-level association measures, and the gene-level P-values are then used to calculate gene set test scores. In the latter, the SNP-level association signals are used directly to calculate gene set test scores. The power of these methods mainly depends on the proportion of the genes (for gene-based methods) or SNPs (for SNP-based methods) with strong association signals in the gene set. In practice, several studies reported that gene-based methods may have more power [5,37]; this is because only a few SNPs, which are often located on different genes, may contribute to disease risk (or are in LD with causal variants).

However, in gene-based methods, a consensus has not been reached on the best strategy for SNP information reduction within each gene. A common and simple approach is to represent each gene using the most significant SNP. Since only one SNP P-value is used to represent each gene, the potential effects of multiple association signals for the gene may be missed. In addition, because longer genes are more likely to have significant P-values, this approach may inflate the association test statistic for gene sets that have many long genes. Multiple comparison procedures, such as Sidak’s correction [38], Simes’ correction [39], or False Discovery Rate (FDR) [40], can be used to adjust the most significant P-value (for the number of SNPs located on the gene), but representing genes with the corrected P-values may give overly conservative gene set testing results [5,41].
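To make the gene length effect concrete, the following Python sketch applies the standard Sidak and Simes formulas to the SNP P-values of a single gene. The function names and the toy P-values are illustrative assumptions. Both formulas treat the SNPs in a gene as independent tests, which LD between neighbouring SNPs violates, so this is one reason the corrected values can be overly conservative.

```python
def sidak_min_p(p_values):
    """Sidak-adjusted minimum P-value: 1 - (1 - p_min)^m for m SNPs."""
    m = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** m

def simes_p(p_values):
    """Simes combined P-value: minimum over ranks i of m * p_(i) / i."""
    m = len(p_values)
    ordered = sorted(p_values)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))

# Two hypothetical genes with the same best SNP P-value (0.01)
# but different numbers of SNPs.
short_gene = [0.01, 0.20]
long_gene = [0.01, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for name, pvals in (("short", short_gene), ("long", long_gene)):
    print(name, min(pvals), sidak_min_p(pvals), simes_p(pvals))
```

Without correction both genes would be represented by the same value, 0.01. After either correction the long gene receives a larger gene-level P-value than the short gene, reflecting the greater number of SNPs tested.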

Recently, Ballard et al. [42] compared seven multi-marker association tests, including single marker analysis using the best-scoring SNP, and found principal component regression [43] is the most powerful among them. In addition, several recent studies also proposed using a subset of SNPs with the lowest P-values. The selection of the SNP subset can be based on a fixed truncation point [44–46] or data adaptive thresholds [12]. It has been shown that the SNP selection process can improve power over other approaches that either include all SNPs or use only the most significant SNPs [12,37].

Potential sources of bias

When scoring gene sets, several sources of potential bias need to be considered:

  1. Linkage disequilibrium patterns. Because markers in high LD may originate from a single association signal, an effective strategy may involve down-weighting P-values from regions with high LD compared to regions with relatively independent association signals. To this end, strategies have been proposed to group markers in high LD as a “proxy cluster” [8] or use LD blocks from the HapMap database as units of analysis [47] and then assign a single P-value for each cluster or LD block.

  2. Overlapping genes. Another related and potentially serious problem may result from overlapping genes. When several functionally related genes in a gene set are clustered locally, careful attention should be paid to the SNPs mapped to overlapping genes. When selecting one or more of the most significant SNPs to represent each gene, gene set significance may be driven by only a few of these SNPs, because the significant SNPs mapped to multiple genes could be included multiple times. For example, in our analysis of the GAIN schizophrenia dataset [11], the "starch and sucrose metabolism" gene set (HSA00500) included several genes located closely on the chromosome (e.g., UGT1A1, UGT1A3, UGT1A4, UGT1A5, UGT1A6, UGT1A7, UGT1A8, UGT1A9, UGT1A10). When the most significant SNP was used to represent the association signal of each gene, most of the genes in the cluster were represented by the same SNP, which had a P-value of 6.502×10^-4. Therefore, when such a SNP has a small P-value, the gene set is likely to be identified as significant, while, in fact, the apparent significance of multiple genes in the gene set is driven by one highly significant SNP mapped to multiple genes.

  3. Gene set size and gene length. Finally, as mentioned above, in order to score gene sets in an unbiased manner, all selection processes (e.g., selecting the most significant SNPs to represent each gene and selecting the most significant genes to represent each gene set) need to be accounted for in the final gene set analysis. For example, when a gene is represented by the signal of a single SNP from the gene region, the potential effects of multiple association signals for the gene may be missed. Furthermore, because longer genes are more likely to have nominally significant P-values, choosing the most significant SNP to represent each gene may inflate the association test statistic for gene sets that have many long genes. Several recent studies assessed the impact of gene length on gene set analysis results and proposed new resampling based strategies for the correction of such bias [48,49].
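One simple way to detect the overlapping-gene problem described in item 2 is to group genes by the SNP that represents them, so that each association signal is counted once. The Python sketch below is only an illustration: the function name, the SNP identifiers, and the extra gene `GENE_X` are hypothetical, and only the P-value 6.502×10^-4 and the UGT1A gene names are taken from the example above.

```python
def collapse_shared_snps(gene_best_snp):
    """Group genes that are represented by the same most significant SNP.

    gene_best_snp: dict mapping gene -> (snp_id, p_value)
    returns:       dict mapping snp_id -> (p_value, sorted list of genes)
    """
    signals = {}
    for gene, (snp_id, p_value) in gene_best_snp.items():
        if snp_id not in signals:
            signals[snp_id] = (p_value, [])
        signals[snp_id][1].append(gene)
    return {snp: (p, sorted(genes)) for snp, (p, genes) in signals.items()}

# Hypothetical input: three clustered genes share one representative SNP.
gene_best_snp = {
    "UGT1A1": ("rs_shared", 6.502e-4),
    "UGT1A3": ("rs_shared", 6.502e-4),
    "UGT1A4": ("rs_shared", 6.502e-4),
    "GENE_X": ("rs_other", 0.03),
}

signals = collapse_shared_snps(gene_best_snp)
n_genes = len(gene_best_snp)   # number of genes entering the gene set score
n_signals = len(signals)       # number of distinct SNP signals behind them
```

Here four genes reduce to two distinct signals. A large gap between the number of genes and the number of distinct signals suggests that the gene set score should be interpreted with caution or recomputed on the collapsed signals.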

There are other biases that might affect gene set analysis results including annotation biases from different databases. Two examples are: 1) genes that have been well studied are more thoroughly annotated; and 2) there may be discrepancies of gene set definitions in different databases (e.g., KEGG [20] vs. BioCyc [50]). Furthermore, for enrichment based methods that test the competitive null hypothesis, the choice of the background genes for the enrichment test is a critical factor and a potential bias.

Assessing statistical significance

To preserve LD patterns, permutations of sample labels are typically employed to establish null distribution of gene set scores. However, several difficulties remain for the application of permutation tests to GWAS.

First, a typical GWAS measures a half million or more SNPs on hundreds or even thousands of samples. The recalculation of a gene set score for each permutation is extremely computationally intensive, especially for competitive tests based on markers from the entire genome. To reduce the amount of computation, several researchers explored assessing gene set significance by resampling genes [31] or SNPs [32,51]. It has been suggested that apart from genomic regions that exhibit long range LD (e.g., the Major Histocompatibility Complex (MHC) region), SNPs located on different genes may have little LD [31,33]. Another permutation scheme introduced recently is restandardization, which combines sample label permutation and gene re-sampling [36,52]. The idea of restandardization is that, while permuting sample labels preserves the correlation structure between genes, the null distribution based on sample permutation approximates the theoretical null distribution N(0,1) [53]. However, this distribution ignores the empirical mean and standard deviation of the gene set statistic, which can be approximated more closely by resampling genes. Therefore, for each sample permutation, the mean and standard deviation from gene resampling are used to restandardize the permutation value. Specifically, the restandardized permutation value is computed as $S^{**} = \mu^{+} + \frac{\sigma^{+}}{\sigma^{*}}(S^{*} - \mu^{*})$, where $S^{*}$ is the gene set score from a sample permutation, and $(\mu^{+}, \sigma^{+})$ and $(\mu^{*}, \sigma^{*})$ are the mean and standard deviation of gene set scores obtained from resampling sets of genes or permuting sample labels, respectively [36].

Second, it is not straightforward to model the hierarchical structure in gene sets (SNPs lie within genes, which lie within gene sets) using permutation tests. To this end, an efficient algorithm that uses single level permutation iterations to achieve the goal of the multiple-level permutation procedure has been recently proposed [37].

Third, to increase sample size, many GWAS were conducted at multiple study sites, often with different sampling designs. Permutation tests rely on exchangeability of the permuted units. To avoid misleading results, careful consideration is required to account for data structure in complex study designs [54].
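The restandardization step described under the first difficulty can be sketched in a few lines of Python. As we read the formula, each sample-permutation score is centered and scaled by the permutation mean and standard deviation, and then rescaled and recentered using the mean and standard deviation obtained from gene resampling. The function names and the toy scores below are illustrative assumptions; in practice both score lists would come from many permutations and resamplings of a real dataset.

```python
from statistics import mean, stdev

def restandardize(perm_scores, resample_scores):
    """Map sample-permutation scores onto the location and scale of
    gene-resampling scores: mu_plus + (sigma_plus / sigma_star) * (s - mu_star)."""
    mu_star, sigma_star = mean(perm_scores), stdev(perm_scores)
    mu_plus, sigma_plus = mean(resample_scores), stdev(resample_scores)
    return [mu_plus + (sigma_plus / sigma_star) * (s - mu_star)
            for s in perm_scores]

def restandardized_p_value(observed, perm_scores, resample_scores):
    """Fraction of restandardized null scores at least as large as the observed score."""
    null_scores = restandardize(perm_scores, resample_scores)
    return sum(s >= observed for s in null_scores) / len(null_scores)

# Toy scores (made up): permutation scores centered at 0,
# gene-resampling scores centered at 4 with a larger spread.
perm_scores = [-1.0, 0.0, 1.0]
resample_scores = [2.0, 4.0, 6.0]

null_scores = restandardize(perm_scores, resample_scores)
p_value = restandardized_p_value(5.0, perm_scores, resample_scores)
```

By construction, the restandardized scores keep the shape of the permutation distribution, which preserves the correlation structure between genes, while taking their mean and standard deviation from the gene-resampling distribution.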

An alternative strategy is to employ more flexible parametric models. For GWAS with case-control designs, we have explored modeling disease associations with gene sets using a class of statistical models called mixed effects models [55,56]. In addition to the fixed effects that model the mean structure (e.g., overall association for a group of genes), these models also include random effects that account for variance and covariance structures in the dataset. Future studies include assessing the feasibility of these models for GWAS with more complex designs.

Additionally, Bayesian methods have recently been proposed for genetic association studies [57–60]. These methods can be extended to combine association signals across SNPs and genes in the same pathway. For example, Stephens et al. [61] performed a SNP set analysis for the association between polymorphisms of the HNF1A gene and plasma C-reactive protein (CRP) concentration [62] using a Bayesian regression approach. This approach was implemented in the software BIMBAM [59].

Several areas for improving gene set analysis of GWAS

Although the underlying principle that many functionally related genes collectively contribute to overall disease susceptibility is simple and appealing, as we described earlier, the complexities in GWAS dataset structure raise many technical issues. Several areas of improvement for gene set analysis especially worth noting are as follows.

1) Improve statistical power for detecting disease associated gene sets. Nearly all current methods treat every gene equally when constructing gene set statistics. A more powerful strategy would involve weighting genes and SNPs within a gene set differentially by leveraging a priori biological information, such as that from expression quantitative trait loci (eQTL) studies [16] or network topology [26,63–66].

In addition, improving SNP coverage and, thus, the number of informative genes may also be beneficial, although it will also increase the computational burden. Holmans et al. [31] performed an imputation analysis for un-typed SNPs using genotype information from the HapMap samples. They demonstrated that the imputation analysis could improve the power for detecting bipolar disorder (BPD) associated gene sets using their ALIGATOR method. Better refined gene set definitions [67,68] that group genes according to well-defined biological information may also be beneficial. For example, Low et al. [67] divided the estrogen metabolic pathway into three sub-pathways involved in androgen synthesis, androgen-to-estrogen conversion and estrogen removal and then found only SNPs within the androgen-to-estrogen conversion pathway were significantly associated with breast and endometrial cancer susceptibilities.

2) Develop strategies for the assessment and comparison of gene set analysis methods. When assessing the performance of a method, it is important to ensure the proportion of false positive findings from the test is as expected. Null gene sets can be generated by randomly simulating disease outcomes without using any genotype data [55], or by randomly sampling genes from a GWAS dataset [3,4]. Next, one can plot a histogram of the estimated P-values for these “null” gene sets. These P-values are expected to roughly follow a uniform distribution. It is desirable to have a method whose type I error is equal to or less than the significance cutoff (e.g., 0.05).

Similarly, to compare the power of different methods, one can randomly sample disease associated genes (with different strengths of associations) from a GWAS dataset or generate disease outcome based on genetic models with various parameters indicating strengths of associations [12,55]. Benchmark GWAS datasets for diseases with well known biological basis, such as Crohn’s Disease (CD), would also be useful for evaluating and comparing gene set analysis methods. As an example, Ballard et al. [69] compared two gene set analysis methods based on their applications to three CD datasets.

Although most GWAS gene set analyses are discovery projects, careful attention still needs to be paid to guard against spurious findings so that resources can be efficiently allocated to subsequent genotyping, re-sequencing and functional studies. As mentioned above, biases may stem from gene length (the number of SNPs in a gene), gene set size (the number of genes in a gene set), overlapping genes, LD patterns, and population stratification. In addition, any selection process during data processing (e.g., selecting the most significant SNP to represent each gene) should be accounted for in the final tests. The impact of each of these potential sources of bias needs to be evaluated for gene set analysis methods. When two or more GWAS datasets are available for the same disease or phenotype, to minimize the bias, we suggest investigators use one dataset as the discovery dataset and the other(s) as validation dataset(s) [26].
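The type I error check described at the start of item 2 can be sketched as follows. Given P-values that a method produced for "null" gene sets, the sketch bins them into a histogram and computes the fraction at or below the significance cutoff. The function names are illustrative assumptions, and the uniform random numbers are only a stand-in for P-values that would come from running a real method on null gene sets.

```python
import random

def type_one_error_rate(null_p_values, alpha=0.05):
    """Fraction of null gene sets declared significant at cutoff alpha."""
    return sum(p <= alpha for p in null_p_values) / len(null_p_values)

def p_value_histogram(null_p_values, n_bins=10):
    """Counts of P-values in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in null_p_values:
        # A P-value of exactly 1.0 is placed in the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

# Stand-in for P-values of 2000 null gene sets from a well-calibrated method.
rng = random.Random(0)
simulated_null_p = [rng.random() for _ in range(2000)]

print(p_value_histogram(simulated_null_p))
print(type_one_error_rate(simulated_null_p, alpha=0.05))
```

For a well-calibrated method the bin counts should be roughly equal and the error rate close to the cutoff. An excess of counts in the first bin indicates downward biased P-values, whereas a deficit indicates a conservative test.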

3) Assess the stability of gene set testing results. In addition to power and type I error rate, another important aspect is the stability of the significance testing results. Different sets of samples would give different results due to sampling variations. When different sub-samples from a homogeneous population are taken, a method with small variance, and thus stable results across the sub-samples, would be desirable. One strategy is to take sub-samples from all the samples, conduct gene set testing for each sub-sample, and evaluate the stability of gene set P-values based on their changes in rank ordering in different sub-samples.

One possible cause for instability of the results in genetic association studies is genetic heterogeneity, in which different variants may account for disease status or trait level in different patients. To address this problem, several investigators have hypothesized that results from testing gene sets rather than from individual markers would be more stable across different samples in the population and, thus, easier to replicate [31,32,51,70]. More studies are needed to evaluate and test this hypothesis, which has already been validated in gene expression studies [71]. Note that replication and stability assessments are most meaningful when the type I error rate for a method is preserved, so applying a method with severely downward-biased P-values to two datasets would not constitute a valid replication [72].
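The sub-sampling strategy described in item 3 can be summarized by a single number, for example the average pairwise rank correlation of gene set P-values across sub-samples. The Python sketch below uses Spearman's rank correlation for this purpose, which is our choice of summary rather than one prescribed above. The function names and toy P-values are illustrative assumptions, and the simple rank formula assumes there are no tied P-values.

```python
from itertools import combinations

def ranks(values):
    """Ranks 1..n of the values, smallest first (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

def rank_stability(p_value_runs):
    """Mean pairwise Spearman correlation across sub-sample runs.

    p_value_runs: list of lists; each inner list holds the P-values of the
    same gene sets, in the same order, from one sub-sample.
    """
    pairs = list(combinations(p_value_runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

# Toy P-values for four gene sets in three sub-samples (made up).
runs = [
    [0.001, 0.020, 0.300, 0.700],
    [0.004, 0.010, 0.500, 0.600],
    [0.002, 0.400, 0.030, 0.900],
]
stability = rank_stability(runs)
```

Values near 1 indicate that the ordering of gene sets changes little between sub-samples, while values near 0 indicate that the ranking is largely driven by sampling variation.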

4) Develop threshold-free procedures. To improve stability of results, one strategy is to develop threshold-free procedures with few, if any, a priori selected parameters. For example, in the commonly used over-representation analysis, a significance threshold is first selected and used to classify whether or not genes are significantly associated with a particular disease, followed by comparing the proportion of disease associated genes in the gene set with the proportion in the rest of the genome by Fisher’s exact test. The identification of an optimal threshold is often a difficult task. Holmans et al. [31] suggested that investigators apply a range of cutoff values and then select the cutoff value that gives the most significant increase in over-represented gene sets. A more comprehensive approach, albeit computationally intensive, is to choose a threshold value that could make a reasonable compromise among power, type I error rate, and stability of gene set analysis results using a cross-validation scheme.
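The dependence of over-representation analysis on the chosen threshold can be illustrated with a short Python sketch. It computes the one-sided Fisher's exact test P-value (the upper tail of the hypergeometric distribution) for one gene set at several gene-level significance cutoffs. The function names, gene identifiers, and P-values are illustrative assumptions, and the background is taken to be all genes with a P-value, which is itself a choice that can bias the test.

```python
from math import comb

def hypergeom_upper_tail(k, n_set, n_sig, n_total):
    """P(X >= k) when n_set genes are drawn from n_total genes,
    n_sig of which are significant (one-sided Fisher's exact test)."""
    upper = min(n_set, n_sig)
    total = sum(comb(n_sig, x) * comb(n_total - n_sig, n_set - x)
                for x in range(k, upper + 1))
    return total / comb(n_total, n_set)

def over_representation_p(gene_p, gene_set, threshold):
    """Over-representation P-value of gene_set at a gene-level cutoff."""
    members = [g for g in gene_set if g in gene_p]
    n_sig = sum(p <= threshold for p in gene_p.values())
    k = sum(gene_p[g] <= threshold for g in members)
    return hypergeom_upper_tail(k, len(members), n_sig, len(gene_p))

# Ten hypothetical genes with made-up gene-level P-values.
gene_p = {f"g{i}": p for i, p in enumerate(
    [0.001, 0.004, 0.03, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])}
gene_set = ["g0", "g1", "g2"]

scan = {t: over_representation_p(gene_p, gene_set, t)
        for t in (0.005, 0.05, 0.5)}
```

In this toy example the gene set P-value changes by an order of magnitude across the three cutoffs. If the cutoff is chosen to give the most significant result, that selection step should itself be accounted for when assessing significance, for instance by repeating the whole scan within each permutation.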

Summary and perspectives

In summary, recent studies [11,73–84] have repeatedly demonstrated that gene set analysis is a promising approach for analyzing and interpreting GWAS datasets in order to better understand the genetic architecture underlying complex diseases. In this paper, we have provided an up-to-date review of the current progress, as well as the limitations, of gene set analysis methods for GWAS. The power and performance of these methods may be further improved by integrating additional biological and environmental information at the systems level. For example, network-based approaches that combine association signals in GWAS with local PPI information can help account for gene-gene interactions and identify genes that play central roles in protein networks by interconnecting many disease genes that are themselves only weakly associated with disease [26,63–66]. Similarly, analyses that model gene pathways together with environmental interactions will help investigators identify novel genes with weak marginal effects that act jointly with exposure factors [85]. As many more GWAS datasets are expected to be generated in the near future, meta-analyses, which integrate multiple independent GWAS datasets, can be incorporated into gene set analysis methods to increase sample size and power [86]. We hope this review and discussion of the methodological issues in gene set analysis of GWAS will help investigators find better solutions, understand potential biases, and make gene set analysis more practical and beneficial for understanding the genetic variants that confer disease risk.

Acknowledgments

We thank two anonymous reviewers for their helpful comments and Rebecca Hiller Posey for critically reading and improving an earlier draft of the manuscript. The work of LW was partially supported by NICHD grant 5P30 HD015052-25 and NIH grant 1P50 MH078028-01A1. The work of XC was partially supported by NCI grant 5P30CA068485-13. The work of ZZ was partially supported by NIH grants R21AA017437, P20AA017828 and R01MH083094, the Vanderbilt-Ingram Cancer Center Core grant P30CA68485, and a 2009 NARSAD Maltz Investigator Award.

References

  • 1.Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Elbers CC, van der Schouw YT, Wijmenga C, Onland-Moret NC. Comment on: Perry et al. (2009) Interrogating type 2 diabetes genome-wide association data using a biological pathway-based approach. Diabetes 58:1463–1467, doi: 10.2337/db08-1378. Diabetes. 2009;58:e9; author reply e10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Elbers CC, van Eijk KR, Franke L, Mulder F, van der Schouw YT, Wijmenga C, Onland-Moret NC. Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet Epidemiol. 2009;33:419–431. doi: 10.1002/gepi.20395. [DOI] [PubMed] [Google Scholar]
  • 4.Jia P, Wang L, Meltzer HY, Zhao Z. Pathway-based analysis of GWAS datasets: effective but caution required. Int J Neuropsychopharmacol. 2011 doi: 10.1017/S1461145710001446. Epub ahead of print December 16, 2010. [DOI] [PubMed] [Google Scholar]
  • 5.Wang K, Li M, Bucan M. Pathway-Based Approaches for Analysis of Genomewide Association Studies. Am. J. Hum. Genet. 2007;81:1278–1283. doi: 10.1086/522374. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.O'Dushlaine C, Kenny E, Heron EA, Segurado R, Gill M, Morris DW, Corvin A. The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics. 2009;25:2762–2763. doi: 10.1093/bioinformatics/btp448. [DOI] [PubMed] [Google Scholar]
  • 7.Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am. J. Hum. Genet. 2010;86:6–22. doi: 10.1016/j.ajhg.2009.11.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Hong MG, Pawitan Y, Magnusson PK, Prince JA. Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum. Genet. 2009;126:289–301. doi: 10.1007/s00439-009-0676-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet. 2010;11:843–854. doi: 10.1038/nrg2884. [DOI] [PubMed] [Google Scholar]
  • 10.Perry JR, McCarthy MI, Hattersley AT, Zeggini E, Weedon MN, Frayling TM. Interrogating type 2 diabetes genome-wide association data using a biological pathway-based approach. Diabetes. 2009;58:1463–1467. doi: 10.2337/db08-1378. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Jia P, Wang L, Meltzer HY, Zhao Z. Common variants conferring risk of schizophrenia: a pathway analysis of GWAS data. Schizophr Res. 2010;122:38–42. doi: 10.1016/j.schres.2010.07.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Chen X, Wang L, Hu B, Guo M, Barnard J, Zhu X. Pathway-based analysis for genome-wide association studies using supervised principal components. Genet Epidemiol. 2010;34:716–724. doi: 10.1002/gepi.20532. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Bush WS, Chen G, Torstenson ES, Ritchie MD. LD-spline: mapping SNPs on genotyping platforms to genomic regions using patterns of linkage disequilibrium. BioData Min. 2009;2:7. doi: 10.1186/1756-0381-2-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Hong MG, Pawitan Y, Magnusson PK, Prince JA. Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum Genet. 2009;126:289–301. doi: 10.1007/s00439-009-0676-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Veyrieras JB, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK. High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 2008;4:e1000214. doi: 10.1371/journal.pgen.1000214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Zhong H, Yang X, Kaplan LM, Molony C, Schadt EE. Integrating pathway analysis and genetics of gene expression for genome-wide association studies. Am J Hum Genet. 2010;86:581–591. doi: 10.1016/j.ajhg.2010.02.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Gamazon ER, Zhang W, Konkashbaev A, Duan S, Kistner EO, Nicolae DL, Dolan ME, Cox NJ. SCAN: SNP and copy number annotation. Bioinformatics. 2010;26:259–262. doi: 10.1093/bioinformatics/btp644. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de Bakker PI, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Chang K, Hawes A, Lewis LR, Ren Y, Wheeler D, Muzny DM, Barnes C, Darvishi K, Hurles M, Korn JM, Kristiansson K, Lee C, McCarrol SA, Nemesh J, Keinan A, Montgomery SB, Pollack S, Price AL, Soranzo N, Gonzaga-Jauregui C, Anttila V, Brodeur W, Daly MJ, Leslie S, McVean G, Moutsianas L, Nguyen H, Zhang Q, Ghori MJ, McGinnis R, McLaren W, Takeuchi F, Grossman SR, Shlyakhter I, Hostetter EB, Sabeti PC, Adebamowo CA, Foster MW, Gordon DR, Licinio J, Manca MC, Marshall PA, Matsuda I, Ngare D, Wang VO, Reddy D, Rotimi CN, Royal CD, Sharp RR, Zeng C, Brooks LD, McEwen JE. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–58. doi: 10.1038/nature09298. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao H, Zhou J, Gabriel SB, Barry R, Blumenstiel B, Camargo A, Defelice M, Faggart M, Goyette M, Gupta S, Moore J, Nguyen H, Onofrio RC, Parkin M, Roy J, Stahl E, Winchester E, Ziaugra L, Altshuler D, Shen Y, Yao Z, Huang W, Chu X, He Y, Jin L, Liu Y, Sun W, Wang H, Wang Y, Xiong X, Xu L, Waye MM, Tsui SK, Xue H, Wong JT, Galver LM, Fan JB, Gunderson K, Murray SS, Oliphant AR, Chee MS, Montpetit A, Chagnon F, Ferretti V, Leboeuf M, Olivier JF, Phillips MS, Roumy S, Sallee C, Verner A, Hudson TJ, Kwok PY, Cai D, Koboldt DC, Miller RD, Pawlikowska L, Taillon-Miller P, Xiao M, Tsui LC, Mak W, Song YQ, Tam PK, Nakamura Y, Kawaguchi T, Kitamoto T, Morizono T, Nagashima A, Ohnishi Y, Sekine A, Tanaka T, Tsunoda T, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008;36:D480–D484. doi: 10.1093/nar/gkm882. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD. PANTHER version 7: improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium. Nucleic Acids Res. 2010;38:D204–D210. doi: 10.1093/nar/gkp1019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, Wu G, Birney E, Stein L. Reactome: a knowledge base of biologic pathways and processes. Genome Biol. 2007;8:R39. doi: 10.1186/gb-2007-8-3-r39. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Wu J, Vallenius T, Ovaska K, Westermarck J, Makela TP, Hautaniemi S. Integrated network analysis platform for protein-protein interactions. Nat. Meth. 2009;6:75–77. doi: 10.1038/nmeth.1282. [DOI] [PubMed] [Google Scholar]
  • 26.Jia P, Zheng S, Long J, Zheng W, Zhao Z. dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics. 2011;27:95–102. doi: 10.1093/bioinformatics/btq615. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ. Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A. 2005;102:13544–13549. doi: 10.1073/pnas.0506577102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Gene-set analysis and reduction. Brief Bioinform. 2009;10:24–34. doi: 10.1093/bib/bbn042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform. 2008;9:189–197. doi: 10.1093/bib/bbn001. [DOI] [PubMed] [Google Scholar]
  • 30.Goeman JJ, Buhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23:980–987. doi: 10.1093/bioinformatics/btm051. [DOI] [PubMed] [Google Scholar]
  • 31.Holmans P, Green EK, Pahwa JS, Ferreira MA, Purcell SM, Sklar P, Owen MJ, O'Donovan MC, Craddock N. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am. J. Hum. Genet. 2009;85:13–24. doi: 10.1016/j.ajhg.2009.05.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Nam D, Kim J, Kim SY, Kim S. GSA-SNP: a general approach for gene set analysis of polymorphisms. Nucleic Acids Res. 2010;38(Suppl):W749–W754. doi: 10.1093/nar/gkq428. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Chen LS, Hutter CM, Potter JD, Liu Y, Prentice RL, Peters U, Hsu L. Insights into Colon Cancer Etiology via a Regularized Approach to Gene Set Analysis of GWAS Data. Am. J. Hum. Genet. 2010;86:860–871. doi: 10.1016/j.ajhg.2010.04.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Chai HS, Sicotte H, Bailey KR, Turner ST, Asmann YW, Kocher JP. GLOSSI: a method to assess the association of genetic loci-sets with complex diseases. BMC Bioinformatics. 2009;10:102. doi: 10.1186/1471-2105-10-102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Tintle NL, Borchers B, Brown M, Bekmetjev A. Comparing gene set analysis methods on single-nucleotide polymorphism data from Genetic Analysis Workshop 16. BMC Proc. 2009;3(Suppl 7):S96. doi: 10.1186/1753-6561-3-s7-s96. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Efron B, Tibshirani RJ. On testing the significance of sets of genes. Ann Appl Stat. 2007;1:107–129. [Google Scholar]
  • 37.Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N. Pathway analysis by adaptive combination of P-values. Genet. Epidemiol. 2009;33:700–709. doi: 10.1002/gepi.20422. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Sidak Z. Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc. 1967;62:626–633. [Google Scholar]
  • 39.Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986;73:751–754. [Google Scholar]
  • 40.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300. [Google Scholar]
  • 41.Peng G, Luo L, Siu H, Zhu Y, Hu P, Hong S, Zhao J, Zhou X, Reveille JD, Jin L, Amos CI, Xiong M. Gene and pathway-based second-wave analysis of genome-wide association studies. Eur J Hum Genet. 2010;18:111–117. doi: 10.1038/ejhg.2009.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Ballard DH, Cho J, Zhao H. Comparisons of multi-marker association methods to detect association between a candidate region and disease. Genet Epidemiol. 2010;34:201–212. doi: 10.1002/gepi.20448. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Wang K, Abbott D. A principal components regression approach to multilocus genetic association studies. Genet Epidemiol. 2008;32:108–118. doi: 10.1002/gepi.20266. [DOI] [PubMed] [Google Scholar]
  • 44.Hoh J, Wille A, Ott J. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 2001;11:2115–2119. doi: 10.1101/gr.204001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Dudbridge F, Koeleman BP. Rank truncated product of P-values, with application to genomewide association scans. Genet Epidemiol. 2003;25:360–366. doi: 10.1002/gepi.10264. [DOI] [PubMed] [Google Scholar]
  • 46.Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS. Truncated product method for combining P-values. Genet Epidemiol. 2002;22:170–185. doi: 10.1002/gepi.0042. [DOI] [PubMed] [Google Scholar]
  • 47.Freudenberg J, Lee AT, Siminovitch KA, Amos CI, Ballard D, Li W, Gregersen PK. Locus category based analysis of a large genome-wide association study of rheumatoid arthritis. Hum Mol Genet. 2010;19:3863–3872. doi: 10.1093/hmg/ddq304. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Bonifaci N, Gorski B, Masojc B, Wokolorczyk D, Jakubowska A, Debniak T, Berenguer A, Serra Musach J, Brunet J, Dopazo J, Narod SA, Lubinski J, Lazaro C, Cybulski C, Pujana MA. Exploring the link between germline and somatic genetic alterations in breast carcinogenesis. PLoS One. 2010;5:e14078. doi: 10.1371/journal.pone.0014078. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Jia P, Tian J, Zhao Z. Assessing gene length biases in gene set analysis of Genome-Wide Association Studies. Int J Comput Biol Drug Des. 2011;3:297–310. doi: 10.1504/IJCBDD.2010.038394. [DOI] [PubMed] [Google Scholar]
  • 50.Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D, Tsoka S, Darzentas N, Kunin V, Lopez-Bigas N. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res. 2005;33:6083–6089. doi: 10.1093/nar/gki892. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Guo YF, Li J, Chen Y, Zhang LS, Deng HW. A new permutation strategy of pathway-based approach for genome-wide association study. BMC Bioinformatics. 2009;10:429. doi: 10.1186/1471-2105-10-429. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Ackermann M, Strimmer K. A general modular framework for gene set enrichment analysis. BMC Bioinformatics. 2009;10:47. doi: 10.1186/1471-2105-10-47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Efron B. Microarrays, empirical Bayes, and the two-groups model. Statistical Science. 2008;23:1–47. [Google Scholar]
  • 54.Churchill GA, Doerge RW. Naive application of permutation testing leads to inflated type I error rates. Genetics. 2008;178:609–610. doi: 10.1534/genetics.107.074609. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Wang L, Jia P, Wolfinger RD, Chen X, Grayson BL, Aune TM, Zhao Z. An Efficient Hierarchical Generalized Linear Mixed Model for Testing Disease Association with Biological Pathways in Genome-wide Association Studies. Bioinformatics. 2011;27:686–692. doi: 10.1093/bioinformatics/btq728. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.McCulloch CE, Searle SR. Generalized, Linear and Mixed Models. John Wiley & Sons, Inc.; 2001. [Google Scholar]
  • 57.Lunn DJ, Whittaker JC, Best N. A Bayesian toolkit for genetic association studies. Genet Epidemiol. 2006;30:231–247. doi: 10.1002/gepi.20140. [DOI] [PubMed] [Google Scholar]
  • 58.Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906–913. doi: 10.1038/ng2088. [DOI] [PubMed] [Google Scholar]
  • 59.Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007;3:e114. doi: 10.1371/journal.pgen.0030114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet. 2007;81:208–227. doi: 10.1086/519024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Stephens M, Balding DJ. Bayesian statistical methods for genetic association studies. Nat Rev Genet. 2009;10:681–690. doi: 10.1038/nrg2615. [DOI] [PubMed] [Google Scholar]
  • 62.Reiner AP, Barber MJ, Guan Y, Ridker PM, Lange LA, Chasman DI, Walston JD, Cooper GM, Jenny NS, Rieder MJ, Durda JP, Smith JD, Novembre J, Tracy RP, Rotter JI, Stephens M, Nickerson DA, Krauss RM. Polymorphisms of the HNF1A gene encoding hepatocyte nuclear factor-1 alpha are associated with C-reactive protein. Am J Hum Genet. 2008;82:1193–1201. doi: 10.1016/j.ajhg.2008.03.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Baranzini SE, Galwey NW, Wang J, Khankhanian P, Lindberg R, Pelletier D, Wu W, Uitdehaag BM, Kappos L, GeneMSA Consortium, Polman CH, Matthews PM, Hauser SL, Gibson RA, Oksenberg JR, Barnes MR. Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum Mol Genet. 2009;18:2078–2090. doi: 10.1093/hmg/ddp120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Baurley JW, Conti DV, Gauderman WJ, Thomas DC. Discovery of complex pathways from observational data. Stat Med. 2010;29:1998–2011. doi: 10.1002/sim.3962. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Pan W. Network-based model weighting to detect multiple loci influencing complex diseases. Hum Genet. 2008;124:225–234. doi: 10.1007/s00439-008-0545-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Chen L, Zhang L, Zhao Y, Xu L, Shang Y, Wang Q, Li W, Wang H, Li X. Prioritizing risk pathways: a novel association approach to searching for disease pathways fusing SNPs and pathways. Bioinformatics. 2009;25:237–242. doi: 10.1093/bioinformatics/btn613. [DOI] [PubMed] [Google Scholar]
  • 67.Low YL, Li Y, Humphreys K, Thalamuthu A, Darabi H, Wedren S, Bonnard C, Czene K, Iles MM, Heikkinen T, Aittomaki K, Blomqvist C, Nevanlinna H, Hall P, Liu ET, Liu J. Multi-variant pathway association analysis reveals the importance of genetic determinants of estrogen metabolism in breast and endometrial cancer susceptibility. PLoS Genet. 2010;6:e1001012. doi: 10.1371/journal.pgen.1001012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Ruano D, Abecasis GR, Glaser B, Lips ES, Cornelisse LN, de Jong AP, Evans DM, Davey Smith G, Timpson NJ, Smit AB, Heutink P, Verhage M, Posthuma D. Functional gene group analysis reveals a role of synaptic heterotrimeric G proteins in cognitive ability. Am J Hum Genet. 2010;86:113–125. doi: 10.1016/j.ajhg.2009.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Ballard D, Abraham C, Cho J, Zhao H. Pathway analysis comparison using Crohn's disease genome wide association studies. BMC Med Genomics. 2010;3:25. doi: 10.1186/1755-8794-3-25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Luo L, Peng G, Zhu Y, Dong H, Amos CI, Xiong M. Genome-wide gene and pathway analysis. Eur J Hum Genet. 2010;18:1045–1053. doi: 10.1038/ejhg.2010.62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Manoli T, Gretz N, Grone HJ, Kenzelmann M, Eils R, Brors B. Group testing for pathway analysis improves comparability of different microarray datasets. Bioinformatics. 2006;22:2500–2506. doi: 10.1093/bioinformatics/btl424. [DOI] [PubMed] [Google Scholar]
  • 72.Kraft P, Raychaudhuri S. Complex diseases, complex genes: keeping pathways on the right track. Epidemiology. 2009;20:508–511. doi: 10.1097/EDE.0b013e3181a93b98. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Eleftherohorinou H, Wright V, Hoggart C, Hartikainen AL, Jarvelin MR, Balding D, Coin L, Levin M. Pathway analysis of GWAS provides new insights into genetic susceptibility to 3 inflammatory diseases. PLoS One. 2009;4:e8068. doi: 10.1371/journal.pone.0008068. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Lesnick TG, Papapetropoulos S, Mash DC, Ffrench-Mullen J, Shehadeh L, de Andrade M, Henley JR, Rocca WA, Ahlskog JE, Maraganore DM. A genomic pathway approach to a complex disease: axon guidance and Parkinson disease. PLoS Genet. 2007;3:e98. doi: 10.1371/journal.pgen.0030098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Perry JR, McCarthy MI, Hattersley AT, Zeggini E, Weedon MN, Frayling TM, Wellcome Trust Case Control Consortium. Interrogating type 2 diabetes genome-wide association data using a biological pathway-based approach. Diabetes. 2009;58:1463–1467. doi: 10.2337/db08-1378. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Torkamani A, Topol EJ, Schork NJ. Pathway analysis of seven common diseases assessed by genome-wide association. Genomics. 2008;92:265–272. doi: 10.1016/j.ygeno.2008.07.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Askland K, Read C, Moore J. Pathways-based analyses of whole-genome association study data in bipolar disorder reveal genes mediating ion channel activity and synaptic neurotransmission. Hum Genet. 2009;125:63–79. doi: 10.1007/s00439-008-0600-y. [DOI] [PubMed] [Google Scholar]
  • 78.Lambert JC, Grenier-Boley B, Chouraki V, Heath S, Zelenika D, Fievet N, Hannequin D, Pasquier F, Hanon O, Brice A, Epelbaum J, Berr C, Dartigues JF, Tzourio C, Campion D, Lathrop M, Amouyel P. Implication of the immune system in Alzheimer's disease: evidence from genome-wide pathway analysis. J Alzheimers Dis. 2010;20:1107–1118. doi: 10.3233/JAD-2010-100018. [DOI] [PubMed] [Google Scholar]
  • 79.Li J, Humphreys K, Heikkinen T, Aittomaki K, Blomqvist C, Pharoah PD, Dunning AM, Ahmed S, Hooning MJ, Martens JW, van den Ouweland AM, Alfredsson L, Palotie A, Peltonen-Palotie L, Irwanto A, Low HQ, Teoh GH, Thalamuthu A, Easton DF, Nevanlinna H, Liu J, Czene K, Hall P. A combined analysis of genome-wide association studies in breast cancer. Breast Cancer Res Treat. 2010 doi: 10.1007/s10549-010-1172-9. In press. [DOI] [PubMed] [Google Scholar]
  • 80.Menashe I, Maeder D, Garcia-Closas M, Figueroa JD, Bhattacharjee S, Rotunno M, Kraft P, Hunter DJ, Chanock SJ, Rosenberg PS, Chatterjee N. Pathway analysis of breast cancer genome-wide association study highlights three pathways and one canonical signaling cascade. Cancer Res. 2010;70:4453–4459. doi: 10.1158/0008-5472.CAN-09-4502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Wang K, Zhang H, Kugathasan S, Annese V, Bradfield JP, Russell RK, Sleiman PM, Imielinski M, Glessner J, Hou C, Wilson DC, Walters T, Kim C, Frackelton EC, Lionetti P, Barabino A, Van Limbergen J, Guthery S, Denson L, Piccoli D, Li M, Dubinsky M, Silverberg M, Griffiths A, Grant SF, Satsangi J, Baldassano R, Hakonarson H. Diverse genome-wide association studies associate the IL12/IL23 pathway with Crohn Disease. Am J Hum Genet. 2009;84:399–405. doi: 10.1016/j.ajhg.2009.01.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Chasman DI. On the utility of gene set methods in genomewide association studies of quantitative traits. Genet Epidemiol. 2008;32:658–668. doi: 10.1002/gepi.20334. [DOI] [PubMed] [Google Scholar]
  • 83.Jia P, Ewers JM, Zhao Z. Prioritization of epilepsy associated candidate genes by convergent analysis. PLoS ONE. 2011;6:e17162. doi: 10.1371/journal.pone.0017162. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.O'Dushlaine C, Kenny E, Heron E, Donohoe G, Gill M, Morris D, Corvin A. Molecular pathways involved in neuronal cell adhesion and membrane scaffolding contribute to schizophrenia and bipolar disorder susceptibility. Mol Psychiatry. 2011;16:286–292. doi: 10.1038/mp.2010.7. [DOI] [PubMed] [Google Scholar]
  • 85.Thomas D. Gene-environment-wide association studies: emerging approaches. Nat Rev Genet. 2010;11:259–272. doi: 10.1038/nrg2764. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Segre AV, Groop L, Mootha VK, Daly MJ, Altshuler D. Common inherited variation in mitochondrial genes is not enriched for associations with type 2 diabetes or related glycemic traits. PLoS Genet. 2010;6 doi: 10.1371/journal.pgen.1001058. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Zhang K, Cui S, Chang S, Zhang L, Wang J. i-GSEA4GWAS: a web server for identification of pathways/gene sets associated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Nucleic Acids Res. 2010;38:W90–W95. doi: 10.1093/nar/gkq324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Holden M, Deng S, Wojnowski L, Kulle B. GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics. 2008;24:2784–2785. doi: 10.1093/bioinformatics/btn516. [DOI] [PubMed] [Google Scholar]
  • 89.Schwarz DF, Hadicke O, Erdmann J, Ziegler A, Bayer D, Moller S. SNPtoGO: characterizing SNPs by enriched GO terms. Bioinformatics. 2008;24:146–148. doi: 10.1093/bioinformatics/btm551. [DOI] [PubMed] [Google Scholar]
  • 90.Medina I, Montaner D, Bonifaci N, Pujana MA, Carbonell J, Tarraga J, Al-Shahrour F, Dopazo J. Gene set-based analysis of polymorphisms: finding pathways or biological processes associated to traits in genome-wide association studies. Nucleic Acids Res. 2009;37:W340–W344. doi: 10.1093/nar/gkp481. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Al-Shahrour F, Arbiza L, Dopazo H, Huerta-Cepas J, Minguez P, Montaner D, Dopazo J. From genes to functional classes in the study of biological systems. BMC Bioinformatics. 2007;8:114. doi: 10.1186/1471-2105-8-114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Dinu V, Zhao H, Miller PL. Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis. J Biomed Inform. 2007;40:750–760. doi: 10.1016/j.jbi.2007.06.002. [DOI] [PubMed] [Google Scholar]
  • 93.Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN. Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet. 2005;76:780–793. doi: 10.1086/429838. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.De la Cruz O, Wen X, Ke B, Song M, Nicolae DL. Gene, region and pathway level analyses in whole-genome studies. Genet. Epidemiol. 2009;34:222–231. doi: 10.1002/gepi.20452. [DOI] [PMC free article] [PubMed] [Google Scholar]