Abstract
Pathway analysis, also known as gene-set enrichment analysis, is a multi-locus analytic strategy that integrates a-priori biological knowledge into the statistical analysis of high-throughput genetics data. Originally developed for the studies of gene expression data, it has become a powerful analytic procedure for in-depth mining of genome-wide genetic variation data. Astonishing discoveries were made in the past years, uncovering genes and biological mechanisms underlying common and complex disorders. However, as massive amounts of diverse functional genomics data accrue, there is a pressing need for newer generations of pathway analysis methods that can utilize multiple layers of high-throughput genomics data. In this review, we provide an intellectual foundation of this powerful analytic strategy as well as an update of the state-of-the-art in recent method developments. The goal of this review is threefold: (1) introduce the motivation and basic steps of pathway analysis for genome-wide genetic variation data; (2) review the merits and the shortcomings of classic and newly emerging integrative pathway analysis tools; (3) discuss remaining challenges and future directions for further method developments.
Keywords: Pathway analysis, Set-based association analysis, gene-set enrichment analysis, genome-wide association study, multi-locus association analysis
1. Introduction
Human genetics has encountered one of the most exciting periods in recent years, moving steps closer to deciphering the genetic basis of common and complex disorders. Since the first genome-wide association study (GWAS) was undertaken in 2005 (Klein et al., 2005), GWAS has become a standard tool for identifying common genetic risk variants underlying disease. Many studies now examine millions of genetic markers simultaneously, aiming to detect disease-gene association at the population level. Remarkable discoveries were made for several disorders, including cardiovascular (Shah et al., 2020), immune (Gonzalez-Serna et al., 2020; Saevarsdottir et al., 2020; Vuckovic et al., 2020), and psychiatric diseases (Pasman et al., 2018; Pardinas et al., 2018; Howard et al., 2019), and more exciting findings are expected to come from recent advances in high-dimensional analytic strategies.
In this review, we focus on pathway analysis, also known as gene-set enrichment analysis, which stands at the forefront of the latest GWAS discoveries. Classical GWAS analysis focuses on the detection of individual risk loci, where statistical tests aim to examine the association of the allele information at a specific locus with a case/control status. While informative, statistical power is a major issue in variant-based analysis. Genetics of complex traits involve hundreds to thousands of causal loci, individually carrying only a modest effect (Lango Allen et al., 2010). Along with high polygenicity, extensive levels of genetic and phenotypic heterogeneity are also the norm rather than the exception, suggesting different combinations of genetic variants or genes can set the trajectory to the same disease manifestation. To disentangle the highly polygenic and heterogeneous nature of common and complex traits, pathway analysis takes a multi-locus strategy that capitalizes on a priori biological knowledge, increasing the discovery power while facilitating the biological interpretation of statistical associations. Furthermore, recent advances in the field of functional genomics present new opportunities in pathway analysis for integrating the complexity of gene regulation and its implications in human health and disease.
This article aims to act as a practical guide in pathway analysis, providing an intellectual foundation of this powerful analytic strategy, a critical review of the latest methodologic developments, and a discussion of open challenges and wide opportunities remaining in the field. Specifically, we begin with an introduction of basic concepts and motivations. We follow this with an explanation of the basic steps of pathway analysis, paying special attention to possible sources of bias that may be introduced in the process. Next, we provide a review of currently available pathway analysis resources and tools. We focus on recent developments of integrative pathway analysis methods that incorporate distinct categories of functional genomics data. We conclude this review with a discussion of remaining challenges and new avenues for future research.
2. Brief Background on Pathway Analysis
Years of empirical GWAS findings have revealed that multiple disease-associated genetic variants impinge on a limited number of common pathways or interacting networks. Notable examples include synaptic biology in schizophrenia (Schrode et al., 2019), cytokine pathways in immune diseases (Xavier and Rioux, 2008), and complement pathways in age-related macular degeneration (Klein et al., 2005). To identify enriched association of specific biological mechanisms and pathways with a disorder, emphasis should be placed on capturing the aggregate effects of the constituent member genes rather than a single gene or a variant in isolation. Analytically, this set-based association can be tested using an alternative hypothesis that a set of biologically-related genes are associated with a target phenotype. Note that genes can be grouped by any type of biological attributes (not solely by pathway), thus pathway analysis is also referred to by more general terms in the context of GWAS data analysis, such as gene-set-based association analysis or gene-set-based enrichment analysis, depending on the tested null hypothesis. We elaborate the distinction of the two testing strategies in more detail in section 4.
Pathway analysis offers several advantages to the in-depth mining of GWAS data. Introducing biological knowledge into statistical analysis provides orthogonal evidence to functionally-agnostic measures of association. A set of modestly associated SNPs may merit further analysis if they map to functionally related genes. Conversely, the discovery of strongly-associated loci may provide insights into disorder-associated pathways, and, by extension, other disorder-associated genes. Pathway analysis can also be used for in-silico fine mapping, in which a certain gene may be selected among a set of genes in high linkage disequilibrium (LD) based on its relationship to other independent and strongly-associated intervals. The identity of enriched gene-sets may also provide insights into disease mechanisms; however, this depends on the accuracy of the tested gene-sets, as well as their relevance to the phenotype’s underlying biology. Biological insights are especially desirable, considering their usefulness in the understanding of disease etiology (Sullivan, 2012), the development of therapeutics (Breen et al., 2016), and the elucidation of new genetic mechanisms underlying human evolution (Li et al., 2014).
3. Selection of Testing Gene-Sets
The main goal of pathway analysis is to identify the statistically significant association of biologically-related genes with a target phenotype. Therefore, the manner in which genes are grouped will determine the conclusions that we can draw from the analysis. Here, we detail different gene-set references, grouped by three broad categories: (1) functional-annotation based; (2) disorder-based; and (3) high-throughput data-based. Table 1 summarizes the list of representative gene-sets under these three categories. We discuss major advantages and shortcomings of these resources.
Table 1. Summary of gene-set resources.
Gene-set resources were grouped into three categories: functional-annotation, disorder-oriented, and omics-based. Statistics of gene-sets are listed as of Oct/22/2020.
| Gene Set | Statistics | Resource | Category |
|---|---|---|---|
| Gene Ontology (GO) | 8,049,377 annotations (3,060,065 BP; 2,553,834 MF; 2,435,478 CC) | http://geneontology.org/stats.html | Functional-annotation |
| Kyoto Encyclopedia of Genes and Genomes (KEGG) | 741,465 pathway maps; 261,427 functional hierarchies | https://www.kegg.jp/kegg/docs/statistics.html | Functional-annotation |
| Protein Analysis Through Evolutionary Relationships (PANTHER) | 142 genomes; 2,625,353 total genes; 2,065,831 genes in PANTHER™ families with phylogenetic trees, multiple sequence alignments and HMMs | http://www.pantherdb.org/data/ | Functional-annotation |
| Synapse Gene Ontology (SynGO) | 2,922 annotations for 1,112 synaptic genes | https://www.syngoportal.org | Functional-annotation |
| Database of Olfactory receptors | 780 OR genes (5 S. cerevisiae; 66 D. melanogaster; 338 M. musculus; 371 H. sapiens) | https://senselab.med.yale.edu/ordb/ | Functional-annotation |
| DeCoN | 8,864 genes | http://decon.rc.fas.harvard.edu/ | Functional-annotation |
| Online Mendelian Inheritance in Man | 25,621 genes (4,339 with phenotype-causing mutation); 6,751 phenotypes for which molecular basis is known | https://omim.org/statistics/entry https://omim.org/statistics/geneMap | Disorder-oriented |
| NHGRI-EBI GWAS Catalog | 5,687 GWAS comprising 71,673 variant-trait associations from 3,567 publications | https://www.ebi.ac.uk/gwas/ | Disorder-oriented |
| Disease-Ontology | 10,529 Disease Ontology terms | https://disease-ontology.org/ | Disorder-oriented |
| SZDB | 7,377 genes | http://www.szdb.org/index.html | Disorder-oriented |
| SFARI Gene | 993 human genes; 347 animal model genes | https://gene.sfari.org/ | Disorder-oriented |
| MSigDB | 31,117 gene sets divided into 9 major collections | https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp | Omics-based |
3.1. Functional-annotation-based Gene-Sets
Functional-annotation-based gene-sets assemble genes based on their involvement in specific molecular functions, biological processes, and established annotations. Gene Ontology (GO) gene-sets are arguably the most widely used in pathway analysis (The Gene Ontology, 2019). GO terms are split into three major categories: molecular function, cellular component, and biological process. Molecular function annotations represent activities that occur at the molecular level (e.g., catalytic activity), while biological process annotations describe general biological functions in which genes participate (e.g., DNA repair). Cellular component annotations detail the cell structures in which gene products are mainly located (e.g., nucleus). While GO terms are most widely used in pathway analysis, we note that their hierarchical structure results in term overlaps and thus requires careful interpretation. For instance, association of a parent term may be driven entirely by a highly-enriched, underlying child term. Multiple testing correction also needs to take into account the lack of statistical independence between the overlapping GO gene-sets. To mitigate this issue, multiple analytic strategies have been proposed, such as REVIGO (Supek et al., 2011) or GO Trimming (Jantzen et al., 2011), which computationally reduce GO term redundancy through iterative selection of representative terms. More recent developments like GOMCL (Wang et al., 2020) apply a Markov Clustering algorithm to identify independent structures within GO networks. While GO annotations focus on defining a wide coverage of annotation terms, delineating their hierarchical relationships, and assigning member genes, another functional-annotation data source, the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000), focuses on providing detailed information about how groups of genes carry out specific biomolecular activities via networks of interactions and reactions. 
KEGG gene sets are manually curated and represented in a visualized format, referred to as pathway maps. KEGG data are thus more descriptive than GO, although the cost of this detailed elaboration is limited coverage. Another widely used annotation source is PANTHER (Protein Analysis Through Evolutionary Relationships) (Mi et al., 2013), which classifies genes according to how they are related through evolution. It utilizes the information from phylogenetic trees constructed across several species in the same taxonomic rank. The PANTHER system also provides gene-sets according to molecular functions, biological processes, and pathways, similar to GO and KEGG.
While the aforementioned gene-set databases aim to provide a comprehensive coverage of biological domains, there are cases in which experts with extensive domain-specific knowledge integrate disparate sources of information to classify and organize gene-sets as they relate to a specific biological theme. A notable example is the Synapse Gene Ontology (SynGO) (Koopmans et al., 2019) maintained by the SynGO Consortium. SynGO synthesizes information from published scientific literature, existing databases, and knowledge of synaptic gene and protein features to create an ontology specific to synaptic structure and function. Examples of other manually compiled gene sets include the Database of Olfactory Receptors (DOR) (Molyneaux et al., 2015) and The Developing Cortical Neuron Transcriptome Resource (DeCoN) (Nagarathnam et al., 2014), among others.
3.2. Disorder-oriented Gene-Sets
The second major category includes gene-set resources compiled based on their implication in a specific disorder. The most well-known is the Online Mendelian Inheritance in Man (OMIM) database (‘Online Mendelian Inheritance in Man, OMIM®’, 2020). OMIM assigns approximately 15,000 genes to phenotypes of known Mendelian disorders, providing insights into gene-phenotype information. OMIM’s extensive library is based upon peer-reviewed biomedical literature and updated on a daily basis. Another widely utilized disorder-based gene database is the NHGRI-EBI GWAS Catalog (Buniello et al., 2019). This database seeks to form a nexus of information for compiling genome-wide association studies (GWAS). Reported traits, SNP-trait associations, and sample metadata are extracted from published and unpublished literature and updated on a weekly basis. A third example of a disease-oriented gene set is Disease Ontology (DO) (Schriml et al., 2019). Asserting that the relationship between phenotypically similar pathologies can provide a more comprehensive understanding of individual disease mechanisms, DO integrates and categorizes disease-related biomedical data in the interest of cross-disease contextualization. The database of DO terms is cross-referenced with widely-used resources such as OMIM (‘Online Mendelian Inheritance in Man, OMIM®’, 2020), Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) (Shahpori and Doig, 2010), International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) (National Center for Health Statistics, 2021), and Medical Subject Headings (MeSH) (National Library of Medicine, 2020). Diseases are organized into a directed acyclic graph available in both Web Ontology Language (OWL) and Open Biological and Biomedical Ontologies (OBO) format. 
A major advantage of disorder-specific databases is that gene-disorder mapping has been compiled from a large breadth of data sources, including GWAS, copy-number variations, protein-protein interactions, and tissue-specific co-expressions. It is important to note that if gene-sets or pathways are defined using GWAS or sequencing studies of the same or related phenotypes, caution is needed to ensure independence between the testing data and the gene-sets (e.g., overlapping samples can induce artificial correlations between GWAS summary statistics) (Lin and Sullivan, 2009).
3.3. Omics-Data-based Gene-Sets
While the first two categories of gene-sets reflect established biological knowledge, high throughput data analysis can identify biologically-related de novo gene-sets based on the relationships of genes captured in diverse omics datasets including but not limited to genomic, proteomic, transcriptomic, pharmacogenomic, and phenomic data (Thompson et al., 2014). High-throughput experimentation generates biologically-related gene sets through pursuance of specific goals, such as identification of genes distinguishing two cohorts (e.g., cancer vs. normal), showing correlated changes in gene expression over time (e.g., higher prenatal than postnatal expression), or identification of dynamically changing gene relationships responding to a specific event (e.g., shock-response genes) (Braun, 2014). High throughput data-based gene-sets thus should be interpreted under the contexts of when and how the data were generated. The Molecular Signatures Database (MSigDB) (Subramanian et al., 2005) represents a good example of high-throughput-data-based gene-sets. MSigDB provides gene sets under 9 major collections: H: hallmark gene sets; C1: positional gene sets; C2: curated gene sets; C3: regulatory target gene sets; C4: computational gene sets; C5: ontology gene sets; C6: oncogenic signature gene sets; C7: immunologic signature gene sets; C8: cell type signature gene sets. The criteria for MSigDB collections are based on various data types including chromosomal location, curational origin, regulatory elements, ontology, and specific biological signatures. Of particular interest is MSigDB’s combination of omics and functional annotation data in constructing hallmark gene sets. These gene sets were computationally gathered based on other collections, for example, C1-C6, by applying omics-data analysis, clustering, and meta-analysis of the results. 
To reduce redundancy and heterogeneity, a manual review of computational output was also conducted prior to manual establishment of the gene collection.
4. Analytic Decision Factors in Pathway Analysis
Although each method takes distinct analytic strategies, pathway analysis in general involves the following four steps: (1) null hypothesis selection, (2) SNP-to-gene mapping, (3) enrichment/association testing, and (4) multiple testing correction. In this section, we provide detailed explanations of these steps, and elaborate how certain decisions may introduce potential bias into the analysis. Fig. 1 summarizes the major analytic procedures.
Fig. 1.
Important analytic procedures in pathway analysis. There are three major ways to define genes in the gene sets, including by functional annotations, disorder-specificity, and omics data. Examples of these databases are shown in the figure. Once the gene sets are defined, the four major components of pathway analysis include null hypothesis selection, mapping SNPs to genes, testing enrichment of the gene sets, and correcting for multiple testing.
4.1. Null Hypothesis Selection
For a given gene-set G, there are two major categories of null hypotheses: (1) self-contained and (2) competitive (Goeman and Bühlmann, 2007). In essence, the two null hypotheses can be formulated as follows:
$H_0^{\text{self-contained}}$: Genes in gene-set G are not associated with a target phenotype.
$H_0^{\text{competitive}}$: Genes in gene-set G are not more associated with a target phenotype than other genes.
In other words, self-contained tests examine only genes belonging to the target gene-set and test whether or not there is an association between the genes and a particular phenotype. Competitive tests, in contrast, compare association signals of a target gene-set to a set of background genes or its complementary gene-sets. Self-contained tests are thus more relevant to ‘gene-set association analysis’, while the competitive tests are often noted as ‘gene-set enrichment analysis’, although the mixed usage of the terms is common in the field.
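To make the distinction concrete, the following minimal sketch contrasts the two null hypotheses on gene-level Z-scores. It assumes one Z-score per gene that is standard normal under no association and independent across genes (i.e., no LD between genes). The self-contained null is tested with a one-sample Z-test and the competitive null with a two-sample comparison against all remaining genes. The function names and the toy data are illustrative and do not correspond to any of the tools reviewed here.

```python
import math


def normal_sf(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def self_contained_p(gene_z, gene_set):
    """One-sample Z-test: are the genes in the set associated at all?"""
    z_in = [gene_z[g] for g in gene_set]
    statistic = math.sqrt(len(z_in)) * mean(z_in)
    return normal_sf(statistic)


def competitive_p(gene_z, gene_set):
    """Two-sample test: are set genes more associated than the other genes?"""
    z_in = [z for g, z in gene_z.items() if g in gene_set]
    z_out = [z for g, z in gene_z.items() if g not in gene_set]
    standard_error = math.sqrt(
        sample_variance(z_in) / len(z_in) + sample_variance(z_out) / len(z_out)
    )
    return normal_sf((mean(z_in) - mean(z_out)) / standard_error)


# Toy polygenic trait: every gene carries a similar, modestly inflated signal.
gene_z = {f"gene{i}": 1.0 + 0.5 * ((i % 5) - 2) for i in range(200)}
gene_set = {f"gene{i}" for i in range(50)}  # hypothetical, arbitrary gene-set

p_self = self_contained_p(gene_z, gene_set)
p_comp = competitive_p(gene_z, gene_set)
print(f"self-contained P = {p_self:.2e}, competitive P = {p_comp:.2f}")
```

In this toy setting the arbitrary gene-set is strongly significant under the self-contained null yet unremarkable under the competitive null, because its genes are no more associated than the rest of the genome.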
When testing GWAS data, it is important to recognize that the self-contained test is sensitive to systematic inflation of association statistics, provided that the inflation observed in the gene-set approximates that of the whole genome (Holmans, 2010). That is, as the number of genes significantly associated with a phenotype increases (i.e., high polygenicity), so does the likelihood that any given gene-set will contain some number of phenotype-associated genes. If this is not taken into account, some self-contained tests may conclude that a gene-set is significantly associated with the phenotype when in fact it is no more related to the phenotype than any random gene-set, in which case the statistical association carries no biological meaning. Using extensive simulation studies, de Leeuw et al. have also demonstrated that self-contained pathway analysis tests, such as the single sample Z-test and the binomial test, have type-1 error rates that depend on the overall heritability of the disorder (de Leeuw et al., 2016). Thus, extra caution is needed when analyzing the GWAS data of highly polygenic traits using self-contained methods, including the SNP-ratio test (O’Dushlaine et al., 2009), the self-contained tests included in MAGMA (de Leeuw et al., 2015) and FORGE (Pedroso et al., 2015), as well as GSEA-SNP (Holden et al., 2008) and GeSBAP (Medina et al., 2009). More recent efforts in the development of self-contained methods, such as the Generalized Berk-Jones statistic (Sun et al., 2019), have focused on improving power, properly controlling type-1 error rates, and thus increasing the statistical rigor of gene-set analysis tests.
It is important to note that, while competitive analysis is robust to varying levels of polygenicity, it may still be vulnerable to other confounding factors. Methodological confounding can inflate type-1 error rates when certain genomic characteristics, such as gene size, SNP density and linkage disequilibrium (LD), are not properly controlled for. Even among competitive analyses that exhibit good control of methodological confounds, biological confounds can also introduce false positives if, for example, the physical clustering of functionally-related genes is not properly taken into account (Hong et al., 2009). In genomic regions with extensive LD (e.g., MHC) or any gene-rich regions, genetic variants in proximity are likely to have highly correlated allele information, thus causing dependencies in gene association as well. In a survey of the widely-used competitive pathway analysis methods, FORGE (Pedroso et al., 2012), JAG (Lips et al., 2015), MAGMA (de Leeuw et al., 2015), ALIGATOR (Holmans et al., 2009), INRICH (Lee et al., 2012), and MAGENTA (Segre et al., 2010), only INRICH and MAGMA demonstrated no detectable dependence of type-1 error rates on various confounding factors such as gene size, gene-set size and LD. Both methods explicitly control for these factors: INRICH measures gene-association by counting gene overlaps with LD-associated regions and employs a permutation scheme that preserves the structure of the original data (e.g. SNP density of associated regions), and MAGMA directly accounts for LD and gene properties in its regression model. However, in simulations of gene-sets consisting of genes in the major histocompatibility region, both MAGMA and INRICH detected enrichment of non-causal gene-sets that were in strong LD with causal gene-sets (de Leeuw et al., 2016), demonstrating the difficulty of completely controlling type-1 error, even in well-designed competitive analyses. 
We thus recommend excluding genomic regions with extensive LD, such as the MHC region, in pathway analysis for GWAS datasets.
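In practice, such an exclusion can be applied as a simple positional filter before SNP-to-gene mapping. The sketch below is illustrative only: the default interval (chromosome 6, 25 to 34 Mb) is an assumed, approximate boundary for the extended MHC that should be checked against the genome build and conventions of the study at hand, and the SNP identifiers are hypothetical.

```python
def exclude_region(snps, chrom="6", start=25_000_000, end=34_000_000):
    """Drop SNPs that fall inside [start, end] on the given chromosome.

    snps: iterable of (snp_id, chromosome, base_pair_position) tuples.
    The default bounds are an ASSUMED approximation of the extended MHC;
    they are not taken from the source text and depend on the genome build.
    """
    return [
        (snp_id, c, pos)
        for snp_id, c, pos in snps
        if not (c == chrom and start <= pos <= end)
    ]


# Hypothetical SNPs: two inside the assumed interval, two outside.
snps = [
    ("snp1", "6", 24_500_000),
    ("snp2", "6", 29_900_000),
    ("snp3", "6", 33_999_999),
    ("snp4", "1", 30_000_000),
]
kept = exclude_region(snps)
print([snp_id for snp_id, _, _ in kept])
```

The same function can be reused for any other long-range LD region by passing different coordinates.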
4.2. SNP-to-gene Mapping
To examine relationships between gene-sets and a target phenotype, GWAS data (in the form of SNPs) must be mapped to biologically-relevant regions in the genome (i.e., genes). Genic regions are defined by organism-specific reference gene databases; common human reference gene databases include GENCODE (Frankish et al., 2019), NCBI ENTREZ (Brown et al., 2015), EMBL ENSEMBL (Yates et al., 2020), and Sanger HAVANA (‘Homo sapiens - Vega Genome Browser 68’, 2017). SNPs are first mapped to a gene if they are located within a protein-coding gene boundary including introns. When multiple transcripts are mapped to a single gene, the consensus gene region is typically defined from the most upstream transcription start site to the most downstream transcription end site. Optionally, the gene boundaries can be extended to account for nearby regulatory regions; the extension varies anywhere from 5 kb to 500 kb upstream and downstream of the transcription sites. Shortcomings of this extension strategy are manifold. First, this approach is susceptible to noise added by unrelated, intergenic SNPs that may lie between transcription sites and regulatory sites. Multiple studies have also reported that many of the genes closest to nearby regulatory SNPs are not actual target genes of the SNP (Kirsten et al., 2015; Gamazon et al., 2018). Moreover, when an arbitrarily fixed extension is applied, pathway analysis results may change based on the parameter settings.
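The positional mapping described above, and its sensitivity to the extension parameter, can be sketched as follows. This is a minimal illustration under stated assumptions: gene boundaries are given as consensus start and end coordinates, the gene names, SNP identifiers, and coordinates are hypothetical, and the nested loop is written for clarity rather than for genome-scale efficiency.

```python
from collections import namedtuple

# Consensus gene region: most upstream start to most downstream end.
Gene = namedtuple("Gene", ["name", "chrom", "start", "end"])


def map_snps_to_genes(snps, genes, flank_bp=0):
    """Assign each SNP to every gene whose (extended) boundary contains it.

    snps: dict of snp_id -> (chromosome, base_pair_position)
    flank_bp: symmetric extension applied upstream and downstream.
    """
    mapping = {gene.name: [] for gene in genes}
    for snp_id, (chrom, pos) in snps.items():
        for gene in genes:
            if gene.chrom != chrom:
                continue
            if gene.start - flank_bp <= pos <= gene.end + flank_bp:
                mapping[gene.name].append(snp_id)
    return mapping


# Hypothetical annotation and SNPs.
genes = [
    Gene("GENE_A", "1", 100_000, 120_000),
    Gene("GENE_B", "1", 150_000, 160_000),
]
snps = {
    "snp1": ("1", 110_000),  # inside GENE_A
    "snp2": ("1", 130_000),  # intergenic
    "snp3": ("1", 155_000),  # inside GENE_B
}

strict = map_snps_to_genes(snps, genes, flank_bp=0)
extended = map_snps_to_genes(snps, genes, flank_bp=20_000)
print(strict)
print(extended)
```

With a 20 kb extension the intergenic SNP is assigned to both neighboring genes, which illustrates how a fixed window can change the input to all downstream gene-set tests.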
A more direct way to identify association signals of regulatory elements is to explicitly assign genetic variants associated with the regulatory elements to the target gene. Regulatory-element-to-gene pairing data have become increasingly available through large functional genomics data resources, such as ENCODE (Encode Project Consortium, 2012) and GTEx (GTEx Consortium, 2020). For example, the integration of expression quantitative trait loci (eQTL) into the analysis will, at least in part, account for the relationship between the regulatory region and tissue-specific expression of a gene without including irrelevant proximal SNPs. We discuss how some pathway analysis tools address the SNP mapping issue in Section 5.
4.3. Enrichment Testing
Once SNPs are mapped to genes and gene-sets, statistical tests are conducted to determine whether the gene-sets show statistically significant association with a target phenotype. The choice of a gene-set association test depends on three key factors: (1) SNP-based vs. gene-based, (2) filtering-based vs. all-inclusive, and (3) phenotype-permutation vs. genotype-permutation.
SNP association statistics are the direct products of GWAS, whereas most of the biological knowledge related to pathway analysis is at the gene-level. Typically, gene-set association statistics are generated either directly through SNP association statistics (e.g., using the most significant SNP P-value as a gene P-value), or by calculating gene-based association scores (e.g., using multivariate association tests). Both scoring strategies may introduce biases if gene sizes and LD properties are not properly taken into account. For instance, larger genes are likely to contain more SNPs than smaller genes, artificially increasing the likelihood that longer genes will appear significantly associated with any given phenotype. Multiple SNPs residing in strong and extensive LD regions (e.g., the MHC region on chromosome 6) may be counted as separate gene association signals when the physical clustering observed in functionally related genes is not properly considered (Hong et al., 2009). This problem may be aggravated when gene boundaries are extended to include large non-coding regions upstream and downstream from the gene.
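The gene-size bias of the most-significant-SNP score can be quantified under simplifying assumptions. The sketch below assumes k independent SNPs per gene (i.e., no LD), in which case the probability that the minimum P-value falls below a threshold under the null is 1 - (1 - alpha)^k. A Šidák-type adjustment is shown as one simple way to account for the number of SNPs; it is an illustration, not the scoring scheme of any specific tool, and with real data k would need to be replaced by an effective number of independent SNPs.

```python
def null_hit_probability(n_snps, alpha):
    """P(min SNP P-value <= alpha) for a null gene with independent SNPs."""
    return 1.0 - (1.0 - alpha) ** n_snps


def min_p_gene_score(snp_pvalues):
    """Uncorrected gene score: the most significant SNP P-value."""
    return min(snp_pvalues)


def sidak_gene_p(snp_pvalues, n_effective=None):
    """Min-P gene score adjusted for the number of (independent) SNPs."""
    k = len(snp_pvalues) if n_effective is None else n_effective
    return 1.0 - (1.0 - min(snp_pvalues)) ** k


# A small and a large null gene, each tested at alpha = 0.05.
small_gene_rate = null_hit_probability(5, 0.05)
large_gene_rate = null_hit_probability(100, 0.05)
print(f"5 SNPs: {small_gene_rate:.3f}, 100 SNPs: {large_gene_rate:.3f}")

# Hypothetical SNP P-values for one gene.
example_pvalues = [0.01, 0.4, 0.7, 0.2]
print(min_p_gene_score(example_pvalues), sidak_gene_p(example_pvalues))
```

Under these assumptions a null gene with 100 SNPs is almost guaranteed to reach nominal significance by its best SNP alone, which is why uncorrected minimum-P scores favor large genes.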
Another important factor in the determination of gene-set association scores is whether or not to filter SNPs before performing the analysis. Filtering-based methods restrict the analysis to a subset of SNPs that meet a specific level of association (e.g., P-value < $5 \times 10^{-5}$). Different thresholds produce different testing datasets, and are thus likely to change the pathway analysis results. Certain methods, such as INRICH, provide statistical measures for empirically evaluating the choice of given thresholds. However, a more rational approach is to first estimate the polygenicity (e.g., the number of causal variants) of complex traits using methods like MiXeR (Holland et al., 2020), and then use the estimates for selecting a P-value threshold such that the number of LD-independent SNPs included in the final analysis approximates the number of the estimated causal variants. All-inclusive analyses, on the other hand, consider every SNP associated with a gene-set and produce a single set of results, which is a major advantage.
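The polygenicity-guided threshold selection described above reduces to picking the P-value of the n-th most significant LD-independent SNP, where n is the estimated number of causal variants. The sketch below assumes that LD pruning or clumping has already been performed and that the polygenicity estimate is supplied externally (e.g., from a tool such as MiXeR); the function name, the P-values, and the estimate of 3 are all hypothetical.

```python
def threshold_for_target_count(independent_pvalues, n_causal_estimate):
    """Return the P-value cutoff that retains about n_causal_estimate SNPs.

    independent_pvalues: P-values of LD-independent SNPs (already pruned).
    n_causal_estimate: externally estimated number of causal variants.
    """
    ranked = sorted(independent_pvalues)
    n_keep = max(1, min(n_causal_estimate, len(ranked)))
    return ranked[n_keep - 1]


# Hypothetical LD-independent SNP P-values and polygenicity estimate.
pvalues = [0.2, 1e-8, 0.03, 4e-6, 0.5, 7e-4, 0.9, 2e-5]
n_causal_estimate = 3

threshold = threshold_for_target_count(pvalues, n_causal_estimate)
selected = [p for p in pvalues if p <= threshold]
print(threshold, len(selected))
```

Tied P-values at the cutoff would retain slightly more SNPs than requested, which is usually acceptable given the uncertainty of the polygenicity estimate itself.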
Lastly, the significance of gene-set enrichment is measured either analytically or empirically. Analytic methods utilize known distribution functions (e.g., chi-square distribution), while empirical methods rely on simulations for testing whether or not the observed statistic deviates enough to reject the null hypothesis (e.g., Monte Carlo simulation). In the latter case, there are two main strategies for generating the empirical null distribution. Phenotype-based permutation shuffles the phenotype labels of individuals in the GWAS, while retaining genes and gene-sets. Phenotype-based permutation thus preserves the genetic structure and gene-to-gene relationships of the original study, preventing possible confounding effects. However, raw genotype-level data are often not available, as is the case for meta-analyses, making phenotype-based permutation infeasible. The phenotype-based permutation strategy also becomes computationally demanding as the sample sizes of GWAS grow dramatically. Association data-based permutations create random data-sets that act as a basis of comparison for the tested gene-set. This strategy does not require sample-level GWAS data; however, the permuted gene-sets often do not preserve the structure of the original test dataset. It is thus important to understand how individual tools employ certain permutation procedures that specifically attempt to preserve important elements of the original genetic structure.
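As an illustration of the association data-based strategy, the sketch below computes an empirical P-value by comparing an observed gene-set score with scores of randomly drawn gene-sets of the same size. It is deliberately naive: random sets are sampled uniformly from all genes, so gene size, SNP density, and LD are not preserved, which is exactly the limitation noted above and not how structure-preserving tools operate. The summed gene score, the function names, and the toy data are assumptions made for the example.

```python
import random


def gene_set_score(gene_scores, gene_set):
    """Illustrative set statistic: sum of per-gene association scores."""
    return sum(gene_scores[g] for g in gene_set)


def empirical_p(gene_scores, gene_set, n_perm=999, seed=0):
    """Empirical P-value from random gene-sets of matching size."""
    rng = random.Random(seed)
    all_genes = sorted(gene_scores)
    observed = gene_set_score(gene_scores, gene_set)
    n_as_extreme = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(gene_set))
        if gene_set_score(gene_scores, random_set) >= observed:
            n_as_extreme += 1
    # Adding 1 to numerator and denominator avoids reporting P = 0.
    return (1 + n_as_extreme) / (1 + n_perm)


# Toy data: 100 hypothetical genes with increasing association scores.
gene_scores = {f"gene{i:03d}": float(i) for i in range(100)}
top_set = [f"gene{i:03d}" for i in range(95, 100)]

p_value = empirical_p(gene_scores, top_set)
print(p_value)
```

The smallest attainable P-value is 1 / (n_perm + 1), so the number of permutations bounds the resolution of the test and must be chosen with the multiple testing burden in mind.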
4.4. Multiple Testing Correction
Pathway analysis typically tests multiple gene-sets in a single run. Gene-set collections such as GO can number in the thousands, and thus it is critical to properly control the number of false positives for the experiment. There are a number of methods for achieving this, the most notable being control of the family-wise error rate (FWER). The FWER of a series of tests is the probability of generating at least one false positive across an entire set (i.e., family) of hypotheses. In the trivial case of a single hypothesis test, the FWER is simply equal to the alpha level of the test. The most well-known method of FWER control is the Bonferroni correction, which is a rather conservative procedure for testing hierarchical and dependent gene-sets. In contrast to FWER, which controls the probability of one or more false positives, strategies based on the false discovery rate (FDR) control the expected proportion of rejected null hypotheses that are in fact true (i.e., false discoveries among all discoveries). Popular methods for FDR control include the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), which assumes that all tests are either independent from one another or positively dependent on one another, and the Benjamini-Yekutieli procedure (Benjamini and Yekutieli, 2001), which allows for arbitrary dependence between tests. The q-value proposed by Storey (Storey, 2002) is another FDR-based correction strategy, representing the minimum FDR at which a given test is called significant and thus compensating for the non-monotonicity of its predecessor. FDR control places less stringent constraints on its tests than FWER control; however, FDR control fails when the null P-value distribution deviates from the uniform.
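The three analytic corrections named above can be written compactly as adjusted P-values. The sketch below implements the Bonferroni correction and the Benjamini-Hochberg and Benjamini-Yekutieli step-up procedures in plain Python; the function names and the example gene-set P-values are illustrative, and established statistical packages provide equivalent, better-tested routines.

```python
def bonferroni(pvalues):
    """FWER control: multiply each P-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def step_up_fdr(pvalues, dependence_factor=1.0):
    """Step-up FDR adjustment shared by the two procedures below."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [1.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        candidate = pvalues[i] * m * dependence_factor / rank
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted


def benjamini_hochberg(pvalues):
    """FDR control under independence or positive dependence."""
    return step_up_fdr(pvalues)


def benjamini_yekutieli(pvalues):
    """FDR control under arbitrary dependence (harmonic-sum penalty)."""
    m = len(pvalues)
    return step_up_fdr(pvalues, sum(1.0 / k for k in range(1, m + 1)))


def n_significant(adjusted, level=0.05):
    return sum(1 for p in adjusted if p <= level)


# Hypothetical P-values for five gene-sets.
set_pvalues = [0.001, 0.008, 0.02, 0.03, 0.6]
for name, method in [("Bonferroni", bonferroni),
                     ("BH", benjamini_hochberg),
                     ("BY", benjamini_yekutieli)]:
    print(name, n_significant(method(set_pvalues)))
```

On these example values the Benjamini-Hochberg procedure retains more gene-sets than either Bonferroni or Benjamini-Yekutieli, reflecting the trade-off between stringency and the dependence assumptions discussed above.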
While the aforementioned methods are simple and fast options for FWER and FDR control, there are situations in which empirical statistical models that capture actual dependencies in the data are preferred (e.g., analysis of dependent GO terms). Resampling-based methods do not rely on any particular joint or conditional distribution which makes them useful for cases in which the distribution of enrichment statistic is unknown and/or the p-value distribution deviates from the uniform. Under these conditions, using a bootstrapping method to control for FWER or FDR is preferable to using a method that makes inaccurate assumptions about the distribution of test statistics. Resampling-based methods are also convenient for use in permutation-based pathway analysis methods, as many test statistics are generated for each hypothesis, so the distribution against which the resampled test statistic is compared is already generated over the course of the testing. The major limiting factor is still that calculation is computationally demanding and time-consuming.
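One common resampling scheme that reuses permutation statistics in this way is the single-step max-T adjustment in the style of Westfall and Young, shown here as an example of the general idea rather than as the specific procedure of any reviewed tool. The sketch assumes that each permutation yields a statistic for every gene-set, so that the dependence between gene-sets is carried by the rows of the null matrix; the numbers are hand-made for illustration.

```python
def max_t_adjusted_p(observed, null_stats):
    """Single-step max-T adjusted P-values for FWER control.

    observed: list of observed statistics, one per gene-set.
    null_stats: list of permutations, each a list of statistics in the
                same gene-set order as `observed`.
    """
    null_maxima = [max(row) for row in null_stats]
    n_perm = len(null_maxima)
    adjusted = []
    for statistic in observed:
        n_as_extreme = sum(1 for m in null_maxima if m >= statistic)
        adjusted.append((1 + n_as_extreme) / (1 + n_perm))
    return adjusted


# Hypothetical statistics for three gene-sets and four permutations.
observed = [4.0, 2.0, 0.5]
null_stats = [
    [1.0, 0.5, 0.2],
    [2.5, 1.0, 0.3],
    [0.4, 3.0, 0.1],
    [0.9, 0.8, 1.2],
]

adjusted = max_t_adjusted_p(observed, null_stats)
print(adjusted)
```

Because every gene-set is compared with the maximum statistic of each permutation, correlated gene-sets (e.g., parent and child GO terms) are penalized less than under a Bonferroni correction that treats them as independent tests.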
5. Pathway Analysis Tools Focusing on the Latest Developments
In Table 2, we describe a list of pathway analysis methods widely used for the analysis of genetic variation data. This is by no means a comprehensive account of all available tools. Rather, we list these methods as examples for highlighting different methodological aspects of pathway analysis, which we elaborated on in the previous section. Below, we provide a brief review of some hallmark pathway analysis methods, including the latest developments.
Table 2. Summary of pathway analysis resources.
Distinct features of hallmark methods are listed in terms of the null hypothesis, gene scoring, SNP selection, and permutation strategies. Some methods provide their own multiple testing correction, although users can apply a different correction strategy independently.
| Method | Null Hypothesis | Gene Scoring | Filtering vs All-inclusive | Permutation Strategy | Multiple Testing Correction | Reference |
|---|---|---|---|---|---|---|
| GenGen | Self-contained or Competitive | Most significant genic SNP | All-inclusive | Phenotype | FWER and FDR control | 17966091 |
| GSEA-SNP | Self-contained or Competitive | Uses all genic SNP scores to calculate gene score | All-inclusive | Phenotype | FDR control | 18854360 |
| ALIGATOR | Competitive | All gene-set SNPs considered individually | Filtering-based | Genotype | FDR control | 19539887 |
| GeSBAP | Self-contained | Max of −log(P). Arbitrary threshold determines “significant” or “non-significant” | All-inclusive | N/A | FDR control | 19502494 |
| SRT | Self-contained | All gene-set SNPs considered individually | Filtering-based | Genotype | FDR control | 19620097 |
| MAGENTA | Competitive | Most significant genic SNP | Filtering-based | Genotype | FDR control | 20714348 |
| iGSEA4GWAS | Self-contained or Competitive | −log(P) of genic SNPs | All-inclusive | Phenotype | FDR control | 20435672 |
| GLMM | Self-contained | Genic SNPs included in GLM | All-inclusive | N/A | FDR control | 21266443 |
| PoDA | Self-contained | Most significant genic SNP | All-inclusive | Phenotype | FDR control | 21695280 |
| INRICH | Competitive | Number of overlapping intervals | Filtering-based | Genotype | Bootstrapping | 22513993 |
| FORGE | Self-contained or Competitive | Sidak’s corrected min p, Fisher’s method, fixed-effect z-score or random-effect z-score | Filtering-based | N/A | FDR control | 22502986 |
| MAGMA | Self-contained or Competitive | SNP data fit to multiple linear principal components model | All-inclusive | N/A | FDR control | 25885710 |
| RSS-E | Competitive | SNPs are scored directly without gene mapping | All-inclusive | N/A | N/A | 30341297 |
| Generalized Berk-Jones | Self-contained | SNPs are mapped to gene-sets rather than genes | All-inclusive | N/A | N/A | 30875371 |
| PHARAOH-GEE | Self-contained | Genes are the weighted sum of their rare variants | All-inclusive | Phenotype | Bootstrapping or FDR control | 31296220 |
| ActivePathways | Competitive | Gene-level data is taken directly as input | N/A | N/A | FWER control | 32024846 |
GenGen (Wang et al., 2007) is the first pathway analysis method designed specifically for the analysis of GWAS data. It was adapted from gene-set enrichment analysis (GSEA) (Mootha et al., 2003; Subramanian et al., 2005), which was originally developed for the analysis of gene expression data. In GenGen, gene-level association statistics are first calculated as the p-value of the most significant SNP within the gene boundaries. All genes are then ranked by the statistical significance, and enrichment of a gene-set is tested by measuring the deviance of the observed gene-set gene rankings on the extreme ends from what is expected under no association to the phenotype. To control for confounding effects of different gene and gene-set sizes, a phenotype-based permutation strategy is employed for normalization of the gene-set enrichment scores. Final results are reported with both FWER and FDR corrections to control for multiple testing.
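The two core steps described above, scoring each gene by its most significant SNP and computing a rank-based enrichment score, can be sketched as follows. This is a simplified illustration of a GSEA-style weighted running sum, not the GenGen implementation: the SNP-to-gene mapping, the use of -log10(p) as the gene statistic, and all names are assumptions, and the phenotype permutation step used for normalization is omitted.

```python
import math


def gene_scores_from_snps(snp_pvals, snp_to_gene):
    """Score each gene by its most significant SNP, as -log10(min p)."""
    best = {}
    for snp, p in snp_pvals.items():
        gene = snp_to_gene.get(snp)
        if gene is None:
            continue  # SNP not mapped to any gene under this toy mapping
        best[gene] = min(p, best.get(gene, 1.0))
    return {g: -math.log10(p) for g, p in best.items()}


def enrichment_score(scores, gene_set, weight=1.0):
    """Maximum positive deviation of a weighted running sum over ranked genes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = [g for g in ranked if g in gene_set]
    n_miss = len(ranked) - len(hits)
    hit_norm = sum(scores[g] ** weight for g in hits)
    if not hits or n_miss == 0 or hit_norm == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g in ranked:
        if g in gene_set:
            running += scores[g] ** weight / hit_norm  # step up at set members
        else:
            running -= 1.0 / n_miss  # step down at non-members
        best = max(best, running)
    return best


# Hypothetical SNP p-values and SNP-to-gene mapping.
snp_pvals = {"rs1": 1e-6, "rs2": 0.2, "rs3": 1e-4, "rs4": 0.5, "rs5": 0.03, "rs6": 0.8}
snp_to_gene = {"rs1": "A", "rs2": "A", "rs3": "B", "rs4": "C", "rs5": "D", "rs6": "E"}

scores = gene_scores_from_snps(snp_pvals, snp_to_gene)
es_top = enrichment_score(scores, {"A", "B"})
es_bottom = enrichment_score(scores, {"C", "E"})
```

In a full analysis, the same score would be recomputed on datasets with permuted phenotype labels, and the observed score would be normalized against that null distribution before multiple testing correction.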
ALIGATOR (Association List GO Annotator) (Holmans et al., 2009) is another early method that takes a distinct testing strategy from GenGen. It tests for the overrepresentation of significant SNPs in biological pathways. First, ALIGATOR maps each SNP to its overlapping genic region, forming a reference set R. Next, SNPs are sampled randomly to generate N null sets of the same size as R. Gene-set P-values are equal to the proportion of null sets that contain at least the same number of the gene-set genes as R. Multiple testing correction is achieved via bootstrapping. ALIGATOR accounts for differences in gene size and SNP density by sampling SNPs rather than the genes themselves.
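The SNP-sampling logic described above can be sketched as follows. This is a simplified illustration rather than the ALIGATOR implementation: it assumes a one-to-one SNP-to-gene mapping, builds each null gene list by drawing SNPs at random until it matches the size of the observed gene list, and omits the correction for multiple gene-sets. All names and the toy data are hypothetical.

```python
import random


def sample_null_gene_list(all_snps, snp_to_gene, target_size, rng):
    """Draw SNPs at random, collecting their genes until target_size is reached."""
    genes = set()
    pool = list(all_snps)
    rng.shuffle(pool)
    for snp in pool:
        if len(genes) >= target_size:
            break
        gene = snp_to_gene.get(snp)
        if gene is not None:
            genes.add(gene)
    return genes


def snp_sampling_pvalue(sig_snps, all_snps, snp_to_gene, gene_set, n_null=1000, seed=0):
    """Proportion of null gene lists with at least as many gene-set genes."""
    rng = random.Random(seed)
    observed_genes = {snp_to_gene[s] for s in sig_snps if s in snp_to_gene}
    observed_hits = len(observed_genes & gene_set)
    as_extreme = 0
    for _ in range(n_null):
        null_genes = sample_null_gene_list(all_snps, snp_to_gene, len(observed_genes), rng)
        if len(null_genes & gene_set) >= observed_hits:
            as_extreme += 1
    return as_extreme / n_null


# Toy genome: 200 genes carrying between 1 and 5 SNPs each.
snp_to_gene = {}
for g in range(200):
    for k in range(1 + g % 5):
        snp_to_gene[f"rs{g}_{k}"] = f"G{g}"
all_snps = list(snp_to_gene)

# Significant SNPs fall in 8 genes of the first set and 2 unrelated genes.
sig_snps = [f"rs{g}_0" for g in range(8)] + ["rs100_0", "rs150_0"]
enriched_set = {f"G{g}" for g in range(10)}
unrelated_set = {f"G{g}" for g in range(50, 60)}

p_enriched = snp_sampling_pvalue(sig_snps, all_snps, snp_to_gene, enriched_set)
p_unrelated = snp_sampling_pvalue(sig_snps, all_snps, snp_to_gene, unrelated_set)
```

Because SNPs rather than genes are sampled, a gene with many SNPs is more likely to enter a null list, which is how this design accounts for gene size and SNP density.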
Similar to ALIGATOR, INRICH (Interval Enrichment test) (Lee et al., 2012) takes an overrepresentation-based testing strategy. INRICH, however, differs from the predecessors in that it takes a list of phenotype-associated genomic intervals as testing units rather than genes. Here, association intervals are defined as genomic regions of independent SNP association determined by a genome-wide LD scan (e.g. PLINK LD clumping). INRICH first merges physically overlapping genes as well as intervals that overlap the same gene in order to remove any bias that could be introduced from the effects of physical proximity. It then generates association intervals randomly, while explicitly maintaining certain structural properties of the original testing intervals, specifically the number of overlapping genes and the SNP density. This first round of permutations generates empirical P-values for each gene-set, and a second round of bootstrapping permutations corrects the P-values for multiple testing.
In contrast to the aforementioned methods, FORGE (Pedroso et al., 2015) provides a range of gene and gene-set based analyses in a software package. The gene-level association statistics cover Sidak’s correction on minimum P, modified Fisher’s method, fixed-effect z-score, and random-effect z-score. Gene-set statistics can also be calculated with two different strategies. The “SNP to gene-sets” strategy simply treats a gene-set as one large testing unit consisting of all the constituent gene regions and applies an aforementioned gene association statistic to the entire set. The “Gene-set analysis” strategy combines gene Z-score statistics using a method that accounts for covariance. In order to account for LD between SNPs, FORGE implements a Monte Carlo simulation procedure, as described by Liu et al. (Liu et al., 2010), to calculate gene P-values. While this procedure controls for factors such as LD and gene size, it is also time consuming, performing up to 10^6 simulations on a single gene.
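Two of the gene-level statistics named above can be illustrated in their textbook forms. The sketch below is not FORGE code: it assumes independent SNP p-values, whereas FORGE's modified Fisher's method and its simulation procedure account for LD. The optional `n_tests` argument is a hypothetical hook for an effective number of independent tests.

```python
import math


def sidak_min_p(pvals, n_tests=None):
    """Sidak-corrected minimum p-value: 1 - (1 - p_min)^n."""
    n = len(pvals) if n_tests is None else n_tests
    return 1.0 - (1.0 - min(pvals)) ** n


def fisher_combined_p(pvals):
    """Fisher's method for independent p-values, each in (0, 1].

    X = -2 * sum(ln p) follows a chi-square distribution with 2k degrees of
    freedom under the null. For even degrees of freedom the survival function
    has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(pvals)
    half_x = -sum(math.log(p) for p in pvals)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical SNP p-values within one gene.
snp_pvals = [0.01, 0.5]
gene_p_sidak = sidak_min_p(snp_pvals)
gene_p_fisher = fisher_combined_p(snp_pvals)
```

The minimum-p statistic is most sensitive when a single SNP drives the association, while Fisher's method gains power when several SNPs each carry modest signal.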
Among these classic pathway analysis methods that are specifically developed for GWAS, MAGMA (Multi-marker Analysis of Genomic Annotation) (de Leeuw et al., 2015) has gained considerable popularity over the years. Unlike other tools like INRICH and ALIGATOR, which are based on permutation strategies, MAGMA uses a multiple regression framework to detect the effects of multiple genetic variants on the phenotype. Genes are scored by fitting the SNP data to a multiple linear principal components model, and then modeled as a linear combination of gene-set effects. This provides improved speed, versatility, and robustness to various genomic confounding factors. While some studies have noted the statistical issues related to the MAGMA testing framework (Yurko et al., 2020), MAGMA’s multiple regression framework is not restricted to conventional SNP-to-gene mapping, leading other groups to employ the MAGMA framework for incorporating omics-based functional genomics datasets, which we collectively refer to as “integrative pathway analysis methods”.
Unlike classic pathway analysis tools, integrative pathway analysis methods capitalize on rapidly accruing datasets from transcriptomics, proteomics, and metabolomics to further improve the discovery power of gene-set analysis (Fig. 2). Over the past decade, dramatic advances have been made in functional genomics to understand how genes and intergenic regulatory elements contribute to a particular phenotype. This immense dataset helps us to understand: (1) how certain non-genic DNA elements dynamically regulate gene expression in a specific context; (2) how gene expression changes are specific to certain tissues, cell-types, or developmental times; and (3) how multiple layers of omics datasets can be integrated to recapitulate complex and dynamic gene regulation networks.
Fig. 2.
Various omics datasets are incorporated in integrative pathway analysis, including genomics, transcriptomics, metabolomics, proteomics, and phenomics.
eMAGMA (Gerring et al., 2019) and H-MAGMA (Sey et al., 2020) are two recently developed pathway analysis strategies that employ an integrative approach. Rather than providing their own implementation, these methods propose strategies to supplement MAGMA’s SNP-to-gene mapping procedure. Specifically, eMAGMA proposes to map significant cis-eQTL SNPs to their target genes, in addition to assigning SNPs to a gene based on their overlap with protein coding regions. H-MAGMA further proposes to incorporate Hi-C data to include cis-eQTLs with distal regulatory effects on gene expression. A major advantage of this extended SNP-to-gene mapping strategy is that only non-coding SNPs with significant statistical evidence are mapped to each gene, irrespective of their physical proximity. Furthermore, as cis- and distal-regulatory variants are defined in a tissue-, cell-type-, and developmental-stage-dependent way, gene-set enrichment can be tested under specific hypothesis-driven contexts. Despite the conceptual improvement, practical utility is somewhat limited because no software is provided to facilitate the proposed SNP-to-gene mapping procedures.
RSS-E (Zhu and Stephens, 2018) is another recently developed integrative pathway analysis method that constructs tissue-based gene sets entirely de novo from gene-expression data. It then tests the association of these newly derived gene sets with a target phenotype. RSS-E utilizes two separate Bayesian models to fit GWAS data. The baseline model assumes that all SNPs are equally likely to be associated with a target phenotype, while the enrichment model assumes that gene-set SNPs are more likely to be associated with the phenotype than other SNPs. Gene-set association is measured as the ratio of the likelihood of the enrichment model to the baseline model. Gene association is measured by comparing the probability that at least one SNP in a locus is trait-associated in both models. The major strength of RSS-E is that it allows the researchers to test de novo gene-sets with distinct, context-specific emphasis.
ActivePathways (Paczkowska et al., 2020) is an integrative gene-set analysis method that takes a much more general strategy for incorporating functional genomics data than the aforementioned methods. ActivePathways uses a machine learning technique called ‘data fusion’. Here, a gene significance-data matrix represents genes’ statistical significance based on any type of investigation. For each gene-set, gene-scores can be consolidated into a single score using Brown’s approximation of Fisher’s method, which considers the dependencies between datasets. Enrichment test statistics are then calculated using a ranked hypergeometric test. ActivePathways features good generalizability and scalability, but as the method largely relies on the gene-significance data, care must be taken to assure that gene-association scores have been calculated with well calibrated methods that account for the confounds of GWAS data.
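The ranked hypergeometric test mentioned above can be sketched as follows. This is a generic illustration and not the ActivePathways implementation: it assumes that genes have already been ranked by their merged significance (the merging step with Brown's method is omitted), that the gene-set is contained in the background, and that every prefix of the ranked list is evaluated. Function and variable names are hypothetical.

```python
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are in the gene-set."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def ranked_hypergeometric(ranked_genes, gene_set, background_size):
    """Smallest hypergeometric p-value over all prefixes of the ranked list.

    Returns the p-value and the prefix length at which it is attained.
    """
    K = len(gene_set)
    best_p, best_n, hits = 1.0, 0, 0
    for n, gene in enumerate(ranked_genes, start=1):
        if gene in gene_set:
            hits += 1
            p = hypergeom_tail(hits, background_size, K, n)
            if p < best_p:
                best_p, best_n = p, n
    return best_p, best_n


# Hypothetical ranked gene list (most significant first) and gene-set.
ranked_genes = [f"G{i}" for i in range(10)]
gene_set = {"G0", "G1", "G2", "G50", "G51"}
best_p, best_n = ranked_hypergeometric(ranked_genes, gene_set, background_size=100)
```

Here the strongest enrichment is found in the top three genes, all of which belong to the gene-set. Because the minimum is taken over many prefixes, the resulting p-value is optimistic on its own and still requires multiple testing correction across gene-sets.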
6. Concluding Remarks
As new integrative pathway analysis methods are being developed, it is important to recognize remaining challenges and open questions. Here we discuss several key aspects of GWAS-specific pathway analysis methods that deserve further improvements.
6.1. Evolving Gene-Sets but Little Consensus on Reliability and Compatibility
Our investigation shows that close to 70% of the human genes lack functional annotations (Fig. 3). Many significant association signals in those genes may not be mapped to any gene-set related genes, negatively impacting the power of pathway analyses. Differences between functional-annotation databases also illustrate disagreement over which genes are functional as well as the specific function of the genes (Graur, 2017). Care must be taken to identify coverages, dependencies, and potential discrepancies among tested gene-sets in order to produce meaningful assessments. Network enrichment analysis, another type of multivariate analysis strategy widely used for GWAS data, can identify groups of biologically related genes de novo (Jia and Zhao, 2014), and thus provides a foundation for the latest developments of integrative pathway analysis, as illustrated for RSS-E (Zhu and Stephens, 2018) and ActivePathways (Paczkowska et al., 2020). Another critical aspect of genome-wide genetic variation data analysis is that patterns and prevalence of LD effects vary across the genome as well as across populations. This can invalidate assumptions of independence that underlie the validity of specific p-value calculations, most notably genotype-based permutation strategies. Certain tools such as INRICH and MAGMA, which explicitly account for the effects of LD, have been shown to maintain proper type-1 error rates across a variety of levels of LD. Determining how specific pathway analysis results can be generalized across different populations and ethnicities is also one of the key priorities in genetics for reducing health disparity.
Fig. 3.
Gene-set Gene Coverage. Proportion of genes covered by a) GO Biological Process, b) GO Molecular Function, c) KEGG, d) PANTHER Pathways and e) all datasets a-d. A gene is considered covered by a database if it is included in at least one of its constituent gene-sets or pathways. Homo sapiens Entrez genes are used as reference genes. Database coverage is the proportion of reference genes that are included in at least one gene-set in the database. Coverage for e is calculated as the proportion of genes covered by the union of all four databases (a-d). The list of reference genes was downloaded from the NCBI repository at (https://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz). GO and KEGG data were downloaded from MSigDB, version 7.2 (https://www.gsea-msigdb.org/gsea/downloads.jsp), and PANTHER data were downloaded from (http://www.pantherdb.org/pathway/) on 10/29/2020.
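The coverage calculation described in the caption amounts to a set union followed by a proportion. A minimal sketch is given below; the gene identifiers and the two toy databases are hypothetical and stand in for the Entrez reference list and the downloaded gene-set collections.

```python
def coverage(reference_genes, gene_sets):
    """Proportion of reference genes that appear in at least one gene-set."""
    covered = set().union(*gene_sets.values()) & reference_genes
    return len(covered) / len(reference_genes)


# Toy reference gene list and two toy databases (gene-set name -> member genes).
reference_genes = set(range(1, 11))
db_a = {"set1": {1, 2}, "set2": {2, 3}}
db_b = {"pathway1": {3, 4, 11}}  # gene 11 is not a reference gene

cov_a = coverage(reference_genes, db_a)
cov_b = coverage(reference_genes, db_b)
# Combined coverage uses the union of all gene-sets across databases.
cov_all = coverage(reference_genes, {**db_a, **db_b})
```

Genes outside the reference list are ignored, so combined coverage can be lower than the sum of the individual coverages whenever databases annotate the same genes.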
6.2. Difficulties in Translating Gene-Set Association to Mechanistic Insights
Once the lists of significantly enriched gene-sets are identified, experimental validation of the pathway association findings is not a straightforward procedure. Gene-sets often represent a pathway with many interacting functional parts, each of which may contribute only slight increases in the malfunctioning of the system. In extreme cases, a set may consist of hundreds of genes, each representing diverse functions. The GO envelope gene-set, for example, contains over one-thousand genes, including genes associated with hundreds of receptors and structural proteins, among others, as well as genes with less well-known functions (Khatri and Draghici, 2005; Yon Rhee et al., 2008). So, if the GO envelope gene-set was found to be enriched, the functional association between the gene-set and the disorder would not be immediately clear. This problem is further exacerbated as most analytic tools do not prioritize specific genes or implicated functional deficits that may be targeted for experimental analysis.
6.3. Weight or No Weight?
Many pathway methods consider the biological significance of constituent genes equally. In other words, two genes in the same gene-set contribute either equally to the enrichment signal (in the case of over-representation-based methods) or in proportion to their association significance (in the case of regression-based methods). However, if one gene plays a central biological role in the gene-set function and the other plays a peripheral role, it may be more informative to weigh the effects of gene-set association more heavily when disease-association occurs in genes with more central functions. For instance, alcohol dehydrogenase gene ADH1 plays a central role in ethanol metabolism, while another dehydrogenase family gene ADH7 has been shown to play a more marginal role (Hurley and Edenberg, 2012). It may thus provide improved power to place more weight on ADH1 when analyzing the enrichment of the GO ethanol oxidation gene-set. Some gene-expression-based pathway analysis methods have shown promising directions for addressing this issue. For example, SPIA (Tarca et al., 2009) introduces a perturbation factor, which weights the effects of topologically upstream genes in the pathways using the number of downstream genes. Another tool, PARADIGM (Vaske et al., 2010), extends SPIA using an array of omics datasets to calculate more refined weighting factors, while the Onto-Tools (Draghici et al., 2007) considers the magnitude of gene expression changes, types, and positions in given pathways. Just as the biological information about the role of single genes in their respective pathways may lead to more precise analysis, so may information about gene-gene or pathway-pathway interactions. Genes may have epistatic effects on other genes, especially within the same pathway (McKinney and Pajewski, 2011), and pathways interact with one another as well (Fang et al., 2019; Werner, 2008).
Indeed, it is rare for entire pathways to be up- or down-regulated; more often, a subset of genes in different pathways across related regulatory networks is affected (Sullivan, 2012). New strategies are being developed to detect interactions between genomic regions (Zhang et al., 2019), pathways (Hill et al., 2019), or conditional independencies between pathways (Byrne et al., 2020). We anticipate that the incorporation of inter- and intra-pathway information will be a major impetus for increasing the power of GWAS-based pathway analysis.
6.4. Beyond Functional Genomics - Integrating Phenomics Data
So far, pathway analyses have focused mainly on the study of single disorders. However, it has become clear that a single variant or gene can affect multiple traits, a phenomenon known as pleiotropy (Solovieff et al., 2013). In particular, many common and complex diseases are known to share extensive genetic influences (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2019; Tian et al., 2020), for example, between psychiatric disorders (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2019). Extensive pleiotropy has also been reported between mental and somatic diseases (Wang et al., 2015a; Kember et al., 2018), between diseases and anthropometric traits (Grassmann et al., 2017), and between distinct kinds of quantitative traits (Wang et al., 2015b). With new methodologies of genetic analysis, these complex genetic relationships can be further estimated beyond correlations of effect (Frei et al., 2019). Some recently developed methods, for example, SCOPA (Magi et al., 2017), PHARAOH-GEE (Lee et al., 2019), and MARV (Kaakinen et al., 2017), have targeted genetic data analysis of multiple phenotypes, and are extensible to pathway analysis. While informative, these methods require raw genotype data as input, which are often not readily available. We anticipate that newer pathway analysis methods that can utilize the massive amounts of publicly available GWAS summary statistics will play an essential role in disentangling the complex genetic architecture of human traits.
In conclusion, pathway analysis has become an indispensable resource for tackling extensive polygenicity and heterogeneity of common and complex traits. In this review, we have presented the intellectual foundation, gene set resources, hallmark methods, and open challenges for pathway analysis. There are a variety of powerful methods, even beyond those we have listed, each with its own advantages and disadvantages. Despite the steady progress, however, challenges still remain in various aspects. We expect further developments that can utilize cross-phenotype relationships, inter- and intra-pathway interactions, and functional significance of the member genes will improve the utility of pathway analysis even further in coming years.
Acknowledgements
We are deeply indebted to the investigators for their dedication to genetics research, method development, and open data science policy. This research was supported by National Institutes of Health R00 MH101367 and R01 MH119243 (to P.H. Lee). All authors declare no conflict of interest.
References
- Benjamini Y, Hochberg Y, 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B 57, 289–300.
- Benjamini Y, Yekutieli D, 2001. The Control of the False Discovery Rate in Multiple Testing under Dependency. Ann. Stat. 29, 1165–1188.
- Braun R, 2014. Systems analysis of high-throughput data. Adv. Exp. Med. Biol. 844, 153–187.
- Breen G, Li Q, Roth BL, O’Donnell P, Didriksen M, et al., 2016. Translating genome-wide association findings into new therapeutics for psychiatry. Nat. Neurosci. 19, 1392–1396.
- Brown GR, Hem V, Katz KS, Ovetsky M, Wallin C, et al., 2015. Gene: a gene-centered information resource at NCBI. Nucleic Acids Res. 43, D36–42.
- Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, et al., 2019. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012.
- Byrne EM, Zhu Z, Qi T, Skene NG, Bryois J, et al., 2020. Conditional GWAS analysis to identify disorder-specific SNPs for psychiatric disorders. Mol. Psychiatry 10.1038.
- Cross-Disorder Group of the Psychiatric Genomics Consortium, 2019. Genomic Relationships, Novel Loci, and Pleiotropic Mechanisms across Eight Psychiatric Disorders. Cell 179, 1469–1482.
- de Leeuw CA, Mooij JM, Heskes T, Posthuma D, 2015. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 11, e1004219.
- de Leeuw CA, Neale BM, Heskes T, Posthuma D, 2016. The statistical properties of gene-set analysis. Nat. Rev. Genet. 17, 353–364.
- Draghici S, Khatri P, Tarca AL, Amin K, Done A, et al., 2007. A systems biology approach for pathway level analysis. Genome Res. 17, 1537–1545.
- ENCODE Project Consortium, 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74.
- Fang G, Wang W, Paunic V, Heydari H, Costanzo M, et al., 2019. Discovering genetic interactions bridging pathways in genome-wide association studies. Nat. Commun. 10, 4274.
- Frankish A, Diekhans M, Ferreira AM, Johnson R, Jungreis I, et al., 2019. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773.
- Frei O, Holland D, Smeland OB, Shadrin AA, Fan CC, et al., 2019. Bivariate causal mixture model quantifies polygenic overlap between complex traits beyond genetic correlation. Nat. Commun. 10, 2417.
- Gamazon ER, Segre AV, van de Bunt M, Wen X, Xi HS, et al., 2018. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat. Genet. 50, 956–967.
- Gerring ZF, Gamazon ER, Derks EM, Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium, 2019. A gene co-expression network-based analysis of multiple brain tissues reveals novel genes and molecular pathways underlying major depression. PLoS Genet. 15, e1008245.
- Goeman JJ, Bühlmann P, 2007. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23, 980–987.
- Gonzalez-Serna D, Ochoa E, Lopez-Isac E, Julia A, Degenhardt F, et al., 2020. A cross-disease meta-GWAS identifies four new susceptibility loci shared between systemic sclerosis and Crohn’s disease. Sci. Rep. 10, 1862.
- Grassmann F, Kiel C, Zimmermann ME, Gorski M, Grassmann V, et al., 2017. Genetic pleiotropy between age-related macular degeneration and 16 complex diseases and traits. Genome Med. 9, 29.
- Graur D, 2017. An Upper Limit on the Functional Fraction of the Human Genome. Genome Biol. Evol. 9, 1880–1885.
- GTEx Consortium, 2020. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330.
- Hill WD, Davies NM, Ritchie SJ, Skene NG, Bryois J, et al., 2019. Genome-wide analysis identifies molecular systems and 149 genetic loci associated with income. Nat. Commun. 10, 5741.
- Holden M, Deng S, Wojnowski L, Kulle B, 2008. GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics 24, 2784–2785.
- Holland D, Frei O, Desikan R, Fan CC, Shadrin AA, et al., 2020. Beyond SNP heritability: Polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model. PLoS Genet. 16, e1008612.
- Holmans P, 2010. Statistical methods for pathway analysis of genome-wide data for association with complex genetic traits. Adv. Genet. 72, 141–179.
- Holmans P, Green EK, Pahwa JS, Ferreira MA, Purcell SM, et al., 2009. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am. J. Hum. Genet. 85, 13–24.
- Homo sapiens - Vega Genome Browser 68 [WWW Document], 2017.
- Hong MG, Pawitan Y, Magnusson PK, Prince JA, 2009. Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum. Genet. 126, 289–301.
- Howard DM, Adams MJ, Clarke TK, Hafferty JD, Gibson J, et al., 2019. Genome-wide meta-analysis of depression identifies 102 independent variants and highlights the importance of the prefrontal brain regions. Nat. Neurosci. 22, 343–352.
- Hurley TD, Edenberg HJ, 2012. Genes encoding enzymes involved in ethanol metabolism. Alcohol Res. 34, 339–344.
- Jantzen SG, Sutherland BJ, Minkley DR, Koop BF, 2011. GO Trimming: Systematically reducing redundancy in large Gene Ontology datasets. BMC Res. Notes 4, 267.
- Jia P, Zhao Z, 2014. Network-assisted analysis to prioritize GWAS results: principles, methods and perspectives. Hum. Genet. 133, 125–138.
- Kaakinen M, Magi R, Fischer K, Heikkinen J, Jarvelin MR, et al., 2017. MARV: a tool for genome-wide multi-phenotype analysis of rare variants. BMC Bioinformatics 18, 110.
- Kanehisa M, Goto S, 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.
- Kember RL, Hou L, Ji X, Andersen LH, Ghorai A, et al., 2018. Genetic pleiotropy between mood disorders, metabolic, and endocrine traits in a multigenerational pedigree. Transl. Psychiatry 8, 218.
- Khatri P, Draghici S, 2005. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21, 3587–3595.
- Kirsten H, Al-Hasani H, Holdt L, Gross A, Beutner F, et al., 2015. Dissecting the genetics of the human transcriptome identifies novel trait-related trans-eQTLs and corroborates the regulatory relevance of non-protein coding loci. Hum. Mol. Genet. 24, 4746–4763.
- Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al., 2005. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389.
- Koopmans F, van Nierop P, Andres-Alonso M, Byrnes A, Cijsouw T, et al., 2019. SynGO: An Evidence-Based, Expert-Curated Knowledge Base for the Synapse. Neuron 103, 217–234.e4.
- Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, et al., 2010. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838.
- Lee PH, O’Dushlaine C, Thomas B, Purcell SM, 2012. INRICH: interval-based enrichment analysis for genome-wide association studies. Bioinformatics 28, 1797–1799.
- Lee S, Kim S, Kim Y, Oh B, Hwang H, et al., 2019. Pathway analysis of rare variants for the clustered phenotypes by using hierarchical structured components analysis. BMC Med. Genomics 12, 100.
- Li Y, Calvo SE, Gutman R, Liu JS, Mootha VK, 2014. Expansion of biological pathways based on evolutionary inference. Cell 158, 213–225.
- Lin DY, Sullivan PF, 2009. Meta-analysis of genome-wide association studies with overlapping subjects. Am. J. Hum. Genet. 85, 862–872.
- Lips ES, Kooyman M, de Leeuw C, Posthuma D, 2015. JAG: A Computational Tool to Evaluate the Role of Gene-Sets in Complex Traits. Genes (Basel) 6, 238–251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, et al. , 2010. A versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet 87, 139–145. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Magi R, Suleimanov YV, Clarke GM, Kaakinen M, Fischer K, et al. , 2017. SCOPA and META-SCOPA: software for the analysis and aggregation of genome-wide association studies of multiple correlated phenotypes. BMC Bioinformatics 18, 25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McKinney BA, Pajewski NM, 2011. Six Degrees of Epistasis: Statistical Network Models for GWAS. Front. Genet 2, 109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Medina I, Montaner D, Bonifaci N, Pujana MA, Carbonell J, et al. , 2009. Gene set-based analysis of polymorphisms: finding pathways or biological processes associated to traits in genome-wide association studies. Nucleic Acids Res. 37, 340–344. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mi H, Muruganujan A, Thomas PD, 2013. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 41, 377–386. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Molyneaux BJ, Goff LA, Brettler AC, Chen HH, Hrvatin S, et al. , 2015. DeCoN: genome-wide analysis of in vivo transcriptional dynamics during pyramidal neuron fate selection in neocortex. Neuron 85, 275–288. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, et al. , 2003. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet 34, 267–273. [DOI] [PubMed] [Google Scholar]
- Nagarathnam B, Karpe SD, Harini K, Sankar K, Iftekhar M, et al. , 2014. DOR - a Database of Olfactory Receptors - Integrated Repository for Sequence and Secondary Structural Information of Olfactory Receptors in Selected Eukaryotic Genomes. Bioinform Biol. Insights 8, 147–158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- National Center for Health Statistics, 2021. International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) [WWW Document].
- National Library of Medicine, 2020. Medical Subject Headings [WWW Document].
- O’Dushlaine C, Kenny E, Heron EA, Segurado R, Gill M, et al., 2009. The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics 25, 2762–2763.
- Online Mendelian Inheritance in Man, OMIM® [WWW Document], 2020. McKusick-Nathans Inst. Genet. Med. Johns Hopkins Univ.
- Paczkowska M, Barenboim J, Sintupisut N, Fox NS, Zhu H, et al., 2020. Integrative pathway enrichment analysis of multivariate omics data. Nat. Commun 11, 735.
- Pardinas AF, Holmans P, Pocklington AJ, Escott-Price V, Ripke S, et al., 2018. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat. Genet 50, 381–389.
- Pasman JA, Verweij KJH, Gerring Z, Stringer S, Sanchez-Roige S, et al., 2018. GWAS of lifetime cannabis use reveals new risk loci, genetic overlap with psychiatric traits, and a causal influence of schizophrenia. Nat. Neurosci 21, 1161–1170.
- Pedroso I, Barnes MR, Lourdusamy A, Al-Chalabi A, Breen G, 2015. FORGE: multivariate calculation of gene-wide p-values from Genome-Wide Association Studies. bioRxiv 23648.
- Pedroso I, Lourdusamy A, Rietschel M, Nothen MM, Cichon S, et al., 2012. Common genetic variants and gene-expression changes associated with bipolar disorder are over-represented in brain signaling pathway genes. Biol. Psychiatry 72, 311–317.
- Saevarsdottir S, Olafsdottir TA, Ivarsdottir EV, Halldorsson GH, Gunnarsdottir K, et al., 2020. FLT3 stop mutation increases FLT3 ligand level and risk of autoimmune thyroid disease. Nature 584, 619–623.
- Schriml LM, Mitraka E, Munro J, Tauber B, Schor M, et al., 2019. Human Disease Ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Res. 47, D955–D962.
- Schrode N, Ho SM, Yamamuro K, Dobbyn A, Huckins L, et al., 2019. Synergistic effects of common schizophrenia risk variants. Nat. Genet 51, 1475–1485.
- Segre AV, DIAGRAM Consortium, MAGIC investigators, Groop L, Mootha VK, et al., 2010. Common inherited variation in mitochondrial genes is not enriched for associations with type 2 diabetes or related glycemic traits. PLoS Genet. 6, e1001058.
- Sey NYA, Hu B, Mah W, Fauni H, McAfee JC, et al., 2020. A computational tool (H-MAGMA) for improved prediction of brain-disorder risk genes by incorporating brain chromatin interaction profiles. Nat. Neurosci 23, 583–593.
- Shah S, Henry A, Roselli C, Lin H, Sveinbjornsson G, et al., 2020. Genome-wide association and Mendelian randomisation analysis provide insights into the pathogenesis of heart failure. Nat. Commun 11, 163.
- Shahpori R, Doig C, 2010. Systematized Nomenclature of Medicine-Clinical Terms direction and its implications on critical care. J. Crit. Care 25, 364.e1–9.
- Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW, 2013. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet 14, 483–495.
- Storey JD, 2002. A Direct Approach to False Discovery Rates. J. R. Stat. Soc. Ser. B (Statistical Methodol.) 64, 479–498.
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al., 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U S A 102, 15545–15550.
- Sullivan PF, 2012. Puzzling over schizophrenia: schizophrenia as a pathway disease. Nat. Med 18, 210–211.
- Sun R, Hui S, Bader GD, Lin X, Kraft P, 2019. Powerful gene set analysis in GWAS with the Generalized Berk-Jones statistic. PLoS Genet. 15, e1007530.
- Supek F, Bosnjak M, Skunca N, Smuc T, 2011. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One 6, e21800.
- Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, et al., 2009. A novel signaling pathway impact analysis. Bioinformatics 25, 75–82.
- The Gene Ontology Consortium, 2019. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D330–D338.
- Thompson R, Johnston L, Taruscio D, Monaco L, Beroud C, et al., 2014. RD-Connect: an integrated platform connecting databases, registries, biobanks and clinical bioinformatics for rare disease research. J. Gen. Intern Med 3, 780–787.
- Tian D, Wang P, Tang B, Teng X, Li C, et al., 2020. GWAS Atlas: a curated resource of genome-wide variant-trait associations in plants and animals. Nucleic Acids Res. 48, 927–932.
- Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, et al., 2010. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, 237–245.
- Vuckovic D, Bao EL, Akbari P, Lareau CA, Mousas A, et al., 2020. The Polygenic and Monogenic Basis of Blood Traits and Diseases. Cell 182, 1214–1231.
- Wang G, Oh DH, Dassanayake M, 2020. GOMCL: a toolkit to cluster, evaluate, and extract non-redundant associations of Gene Ontology-based functions. BMC Bioinformatics 21, 139.
- Wang K, Li M, Bucan M, 2007. Pathway-based approaches for analysis of genomewide association studies. Am. J. Hum. Genet 81, 1278–1283.
- Wang Q, Yang C, Gelernter J, Zhao H, 2015. Pervasive pleiotropy between psychiatric disorders and immune disorders revealed by integrative analysis of multiple GWAS. Hum. Genet 134, 1195–1209.
- Wang Y, Liu A, Mills JL, Boehnke M, Wilson AF, et al., 2015. Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genet. Epidemiol 39, 259–275.
- Werner T, 2008. Bioinformatics applications for pathway analysis of microarray data. Curr. Opin. Biotechnol 19, 50–54.
- Xavier RJ, Rioux JD, 2008. Genome-wide association studies: a new window into immune-mediated diseases. Nat. Rev. Immunol 8, 631–643.
- Yates AD, Achuthan P, Akanni W, Allen J, Allen J, et al., 2020. Ensembl 2020. Nucleic Acids Res. 48, 682–688.
- Yon Rhee S, Wood V, Dolinski K, Draghici S, 2008. Use and misuse of the gene ontology annotations. Nat. Rev. Genet 9, 509–515.
- Yurko R, Roeder K, Devlin B, G’Sell M, 2020. H-MAGMA, inheriting a shaky statistical foundation, yields excess false positives. bioRxiv 2020.08.20.260224.
- Zhang S, Jiang W, Ma RC, Yu W, 2019. Region-based interaction detection in genome-wide case-control studies. BMC Med. Genomics 12, 133.
- Zhu X, Stephens M, 2018. Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nat. Commun 9, 4361.



