Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2023 Dec 14;52(3):e17. doi: 10.1093/nar/gkad1187

Gene set correlation enrichment analysis for interpreting and annotating gene expression profiles

Lan-Yun Chang 1,2, Meng-Zhan Lee 2,2, Yujia Wu 3,2, Wen-Kai Lee 4, Chia-Liang Ma 5, Jun-Mao Chang 6, Ciao-Wen Chen 7, Tzu-Chun Huang 8, Chia-Hwa Lee 9,10,11,12, Jih-Chin Lee 13, Yu-Yao Tseng 14, Chun-Yu Lin 15,16,17,18,19,20,
PMCID: PMC10853793  PMID: 38096046

Abstract

Pathway analysis, including nontopology-based (non-TB) and topology-based (TB) methods, is widely used to interpret the biological phenomena underlying differences in expression data between two phenotypes. By considering dependencies and interactions between genes, TB methods usually perform better than non-TB methods in identifying pathways that include closely relevant or directly causative genes for a given phenotype. However, most TB methods may be limited by incomplete pathway data used as the reference network or by difficulties in selecting appropriate reference networks for different research topics. Here, we propose a gene set correlation enrichment analysis method, Gscore, based on an expression dataset-derived coexpression network to examine whether a differentially expressed gene (DEG) list (or each of its DEGs) is associated with a known gene set. Gscore is better able to identify target pathways in 89 human disease expression datasets than eight other state-of-the-art methods and offers insight into how disease-wide and pathway-wide associations reflect clinical outcomes. When applied to RNA-seq data from COVID-19-related cells and patient samples, Gscore provided a means for studying how DEGs are implicated in COVID-19-related pathways. In summary, Gscore offers a powerful analytical approach for annotating individual DEGs, DEG lists, and genome-wide expression profiles based on existing biological knowledge.

Graphical Abstract

Graphical Abstract.

Graphical Abstract

Introduction

Recent advances in genome-/proteome-wide profiling technologies for DNA, RNA, and protein have lowered the costs and improve the accuracy of sequencing, which in turn has increased the amount of available data. Accordingly, the most common need is to interpret and annotate these profiles with the context of prior knowledge in biology, biomedicine, and pharmacology. Pathway analysis has emerged as the primary approach for capturing insight into the biological mechanisms associated with differentially expressed genes (DEGs) and proteins (1). Over 70 pathway analysis methods have been proposed; these can be categorized into nontopology-based (non-TB, also called gene set analysis methods) or topology-based (TB) methods (2,3). The first-generation non-TB methods are known as overrepresentation analysis (ORA), which assess whether the DEGs in a given list are over- or underrepresented in gene sets (e.g. groups of genes in the pathways) using Fisher's exact test, hypergeometric test, or χ2 test (1,2,4). The early tools developed based on ORA were Onto-Express (5,6) and GeneMAPP (7), followed by EASE (8), GeneMerge (9), FuncAssociate (10), and so on. Some commonly used Gene Ontology (GO) analysis tools, such as FatiGO (11), GOstats (12), DAVID (13,14), and WebGestalt (15), are also based on ORA. However, the ORA methods often disregard the measured changes in gene expression (i.e. treating each gene equally) and exclude marginally or less significant genes, resulting in information loss (1,2). To overcome these limitations, functional class scoring (FCS) approaches, assuming that weak but coordinated changes in sets of functionally related genes (e.g. pathways) may also play an important role, have been developed as the second-generation of non-TB methods (1,2). Widely used FCS approaches include GSEA (16), GSA (17), PADOG (18), and so on. Although FCS approaches address some of the limitations of ORA, they are similar to ORA in that they analyze each pathway independently and fail to account for cross-talk and overlap between pathways, limiting their ability to describe and capture biological processes (1,2).

Thus, third-generation approaches, known as topology-based (TB, also called pathway topology-based) methods, have been proposed in order to utilize pathway/network knowledge bases that provide additional information, including the dependencies and interactions between genes (or proteins) (1,19,20). The earliest TB tools and methods were Pathway-Express (21,22) and SPIA (23), followed by more than 30 others, including CePaORA/CePaGSA (24,25), PathNet (26), ROntoTools (27,28), and NEA/NEArender/EviNet (29–31). Previous studies indicated that TB methods perform better than non-TB methods in identifying pathways that include genes that are closely relevant to or directly induce the phenotype, especially when researchers would like to consider how various genes (or proteins) interact, explain downstream effects, and further investigate fundamental mechanisms at the pathway level (1,2). Although the availability of pathway data has increased substantially, the current pathways from an individual database used as the reference network (i.e. global network) for some TB methods are still incomplete (1). Inconsistent pathway data formats among the different databases also have posed a challenge in pathway integration (1,19). Thus, the NEA method and related tools (29–31) offer a solution that allows users to choose and input a reference network. However, selecting appropriate reference networks for different research topics can be a challenging issue for users. Coexpression network analysis is an alternative solution that can be used to infer correlations between genes in a specific spatiotemporal state from genome-wide expression profiles (32,33). It can be effectively utilized to distinguish the correlations for examining which genes are active in the same biological processes simultaneously (34) and to further associate a group of genes of unknown function with previously annotated gene sets. Another unmet need is to prioritize DEGs based on their impacts on a specific pathway after associating a long list of DEGs with this pathway. The prioritization of DEGs can guide the planning of follow-up experiments and translation to clinical applications (35,36). Nevertheless, most of the previously proposed methods and tools were designed without consideration of the use of coexpression networks or the need for DEG prioritization.

To address these issues, inspired by the previous TB methods (e.g. Pathway-Express (21,22) and NEA (29)) and coexpression network analysis (32,34,35), we developed a gene set correlation enrichment analysis (Gscore) method that utilizes a coexpression network derived from the given gene expression data instead of a preexisting aggregate network (or pathway sets) as the reference. Based on the coexpression network, Gscore estimates the enrichment of coexpressed gene pairs between DEGs in a collection of gene sets and in a selected list (as well as each of its DEGs) using the hypergeometric distribution with Benjamini–Hochberg correction. Our experimental results demonstrated that Gscore provided a better ability to identify target pathways (i.e. the lowest median rank values) than the other five TB (i.e. CePaORA, CePaGSA, NEA, ROntoTools, and SPIA) and three non-TB (i.e. ORA, GSEA, and GSA) methods on 75 microarray datasets related to 15 diseases and 16 RNA sequencing (RNA-seq) datasets in 16 cancer types. In the analyses of DEGs derived from RNA-seq datasets for 16 cancers, Gscore displayed the most efficient ability to associate the DEGs with pancancer pathways, which are common hallmarks of cancer, defined by the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (37). Moreover, Gscore also achieved high accuracy, precision, and a low false positive rate (FPR) in detecting associations between individual DEGs and pancancer pathways. The use of Gscore in conjunction with overall survival data and a meta-analytical framework facilitates cancer-wide and pathway-wide investigations to quantify common signatures and prognostic correlations in individual DEGs associated with pancancer pathways. By applying pathway analysis methods to four RNA-seq datasets of samples derived from patients with coronavirus disease 2019 (COVID-19) or cells infected with SARS-CoV-2, we observed that Gscore not only facilitates better associations with the target and related pathways of COVID-19 but also offers insights for studying how individual DEGs are implicated in those pathways. Here, Gscore was implemented through a user-friendly, freely available online tool to provide statistically rigorous hypothesis testing, a clear illustration of the Gscore procedure, and intuitive and interactive visualizations of findings. In summary, Gscore offers a new and valuable framework to facilitate the annotation and characterization of individual DEGs, DEG lists, and genome-wide expression profiles based on prior biological knowledge.

Materials and methods

Gene expression datasets

To examine the performance of different pathway analysis methods, we first assembled 75 microarray expression datasets in 15 diseases from the Gene Expression Omnibus (GEO) database (38) (Supplementary Table S1), which were used as benchmark datasets to evaluate different methods in previous studies (2,18). To independently test the performance of these methods on RNA-seq data, RNA-seq datasets in 16 cancer types were further collected from The Cancer Genome Atlas (TCGA) database (39) (Supplementary Table S2).

For microarray data, the SOFT format file, the corresponding annotation file retrieved from the GEO, and the Bioconductor AnnotationData package (v. 3.14) were used to retrieve information about the array platform, including the probe ID, Entrez gene ID, and gene description. We downloaded raw CEL files of Affymetrix data and employed the threestep function from the affyPLM package (v. 1.70.0) for robust multiarray average (RMA) background adjustment, quantile normalization, and median polish summarization. The median value was used if more than one probe was mapped to the same gene. For the TCGA datasets, we downloaded level 3 RNASeqV2 data (upper quartile-normalized RNA-seq by expectation maximization (RSEM) count estimates) containing the expression profiles of 20531 genes with Entrez Gene IDs for 7351 samples, comprising 6649 tumor samples and 702 corresponding normal tissues. RNA-seq data were matched through the patient barcodes provided by TCGA. The values of both RNA-seq and microarray data were log2-transformed before being used for further analysis.

To identify DEGs between case and corresponding control samples for the microarray datasets in each disease, we followed the criteria suggested by previous work (2) to evaluate the P value for each gene based on the two-sample t test using the ttest_ind function from the SciPy package (v. 1.9.1). Among the genes with P values <0.05, the top 400 genes ranked by the absolute log2(fold change) values were considered DEGs (2). For the TCGA RNA-seq datasets, a modified t-statistic (limma package v. 3.24.15) was utilized to measure DEGs between tumors and corresponding normal samples in each cancer type. Based on the criteria suggested by previous studies (35,40), we used |log2(fold change)| ≥ 1 and adjusted P values ≤0.05 to identify DEGs.

Additionally, to further apply and validate our Gscore method and the other pathway analysis methods, we assembled four RNA-seq datasets relevant to coronavirus disease 2019 (COVID-19), including two patient-derived datasets (GEO ID: GSE157103 (41) and GSE150316 (42)) and two cell-derived datasets (GSE147507 (43,44) and GSE160435 (45); Supplementary Table S3). The patient-derived plasma and leukocyte (PPL) sample dataset (GSE157103) consists of 100 plasma and leukocyte samples from hospitalized patients with COVID-19 and 26 samples from patients without COVID-19. Another patient-derived autopsy (PA) sample dataset (GSE150316) comprises five control autopsy lung samples from five patients and 29 autopsy samples from eight patients with high SARS-CoV-2 loads. The normal human bronchial epithelial (NHBE) cell dataset (GSE147507) contains three mock-treated and three SARS-CoV-2-infected primary human lung epithelium samples. The alveolar type 2 progenitor (AT2) cell dataset (GSE160435) includes five mock-transfected and five SARS-CoV-2-infected organoids generated from primary lung AT2 cells. For the NHBE cell, AT2 cell, and PA sample datasets, we followed the DEG identification criteria from the corresponding works, i.e. P values < 0.05 (Wald test; DESeq2). For the PPL sample dataset, we defined the DEGs based on the same criteria used in the TCGA datasets. Note that either the gene expression data or the corresponding DEG list was used as input depending on the requirement of the pathway analysis method used; for instance, gene expression data were used for the PADOG, GSA, and GSEA methods. In the online Gscore tool, the DEG criteria can be set by the user.

KEGG pathway set

To assess the pathway analysis methods based on their ability to identify the target pathway depicting the mechanism corresponding to the condition studied, we first collected 347 human pathways containing 8146 proteins from the KEGG database (v. 102) (37). A collected expression dataset is relevant to a disease or condition for which the mechanisms are known and are described in a KEGG pathway, called the target pathway (Supplementary Table S4 and S5). The ability to rank the target pathway on top based on its statistical significance (e.g. P value or FDR q value) was used to evaluate the pathway analysis methods (2,3,18,46). For each KEGG pathway, the corresponding gene set includes all the genes (without interactions) involved in the pathway; in other words, the gene set is extracted from the original pathway without the structure or other additional information (Supplementary Figures S1A and S1B). Note that the pathway or corresponding gene set was used as input depending on the requirement of the pathway analysis method used.

KEGG pathways consist of manually curated pathway maps encompassing experts' understanding of molecular interactions, reactions, and relation networks for numerous categories (37). To evaluate whether various methods could detect the associations between pancancer pathways and the DEG lists (or their individual DEGs) derived from the TCGA RNA-seq datasets related to 16 cancers, we first collected all 11 pathways that belong to the category ‘6.1 Cancer: overview’. Then, we obtained the related pathways, as defined in the KEGG database (Supplementary Table S6). For example, ‘PD-L1 expression and PD-1 checkpoint pathway in cancer (hsa05235)’ is one of the pathways in the category of ‘6.1 Cancer: overview’, and ‘MAPK signaling pathway (hsa04010)’ is defined as a related pathway, recorded at the bottom of the pathway entry web page (https://www.genome.jp/entry/hsa05235, Supplementary Figure S1C). Finally, these 11 pathways and 58 nonredundant related pathways (69 in total) were deemed pancancer pathways. Using a parallel procedure, we also collected the ‘coronavirus disease - COVID-19 (hsa05171)’ pathway as the target pathway and identified its 16 related pathways. These 17 COVID-19-related pathways (Supplementary Table S7) were used to examine the ability to detect associations between these pathways and the DEG lists (or their individual DEGs) derived from four RNA-seq datasets relevant to COVID-19.

Prognostic genes in 16 cancers

For the prognostic gene sets in 16 cancers, we chose TCGA patient samples with both gene expression data and clinical outcome data to assess the correlation of each gene with survival outcomes. For each cancer, we estimated the correlation between each gene and the 10-year survival outcomes by Cox proportional hazards regression. The coxph function of the R survival package (v. 3.2-12) was used for Cox proportional hazards regression analysis. We utilized the median (50%) expression value of each gene in every malignancy to stratify the corresponding patients into high- and low-risk groups for Kaplan–Meier analysis to evaluate their correlation with 10-year survival. For each gene in each cancer, Cox proportional hazard regression analysis was used to obtain the Cox coefficients, P values (log-rank test), z scores, and hazard ratios (HRs) with 95% confidence intervals. A gene with a P value of 0.05 or less was considered to have a significant prognostic correlation in each cancer, and genes with HR > 1 (or z score > 0) and < 1 (or z score < 0) were considered adverse and favorable prognostic genes, respectively (Supplementary Table S8).

Evaluation of the association between the DEG list (or each of its DEGs) and a specific gene set

To measure the statistical significance of associations between a list of selected DEGs and a specific gene set (e.g. a group of genes in a KEGG pathway), the Gscore method evaluates the enrichment of coexpressed gene pairs between all the DEGs of the selected list and all the DEGs in the collection of gene sets (e.g. gene sets for 347 KEGG human pathways) based on the hypergeometric distribution with Benjamini–Hochberg correction (Figure 1). Moreover, for each DEG in the selected list, the goal of identifying its association with a certain gene set is to quantify how it contributes to capturing and describing the biological phenomenon based on the correlations and dependencies between it and the gene set. The process details are as follows:

Figure 1.

Figure 1.

Schematic overview of gene set correlation enrichment analysis (Gscore). First, given expression profiles with two phenotypic classes (i.e. control and case) and a collection of gene sets (e.g. gene sets of all KEGG human pathways), differentially expressed genes (DEGs) were identified by comparing the control samples and case samples. Next, we constructed the coexpression network using the expression profiles of case samples, of which two DEGs (blue or orange circle) with a Pearson correlation coefficient (|Pearson's r|) ≥ c (e.g. c is 0.3, 0.5, or 0.7) across case samples were considered a coexpressed gene pair (dotted lines). Finally, Gscore used the coexpressed gene pairs between the DEG list (or each of its DEGs) and each gene set in this collection to determine whether the association was statistically significant (FDR qvalue ≤ 0.05) based on the hypergeometric distribution with Benjamini–Hochberg correction.

  1. For each input gene expression dataset with samples belonging to two classes, we first identified the DEGs between control and case samples. Then, we constructed a coexpression network using the expression profiles of the case samples, in which two DEGs with a Pearson correlation coefficient (|Pearson's r|) ≥ c across case samples were considered a coexpressed gene pair. Here, c can be set by the user, for example, to 0.3 (low), 0.5 (moderate) or 0.7 (high). Note that genes with identical expression values across all ‘case’ samples are ignored when calculating the correlation.

  2. For each DEG in the query list (Supplementary Figure S2A), Gscore first uses the coexpressed gene pairs between that DEG and all the DEGs of a gene set in the selected collection to determine the association significance for this gene set based on the hypergeometric distribution (35,47,48) as follows:
    graphic file with name M0001.gif (1)
    where m and n are, respectively, the numbers of coexpressed gene pairs and all possible gene pairs between each DEG in the query DEG list and all the DEGs in a specific gene set; for instance, the values for m and n between DEG a and Gene Set A (containing 5 DEGs) in Supplementary Figures S2A are 3 (red dotted lines) and 5, respectively. M and N are, respectively, the total numbers of all the coexpressed gene pairs and all possible gene pairs between each DEG in the query DEG list and all the DEGs in the gene sets of the selected collection. The FDR q value for multiple hypothesis testing with the Benjamini–Hochberg method was used (49), and the false discovery rate was controlled at 5%. Here, the association between the DEG in the query DEG list and a certain gene set was considered statistically significant when its q value was ≤ 0.05.
  3. For the query DEG list, Gscore further measured the statistical significance of association for a specific gene set based on the coexpressed gene pairs between all of the involved DEGs and all the DEGs of this gene set in the selected collection (Supplementary Figure S2B). Then, we computed the P value of the hypergeometric distribution as follows:
    graphic file with name M0001a.gif (2)

where Inline graphic and Inline graphic are, respectively, the numbers of coexpressed gene pairs and all possible gene pairs between all the DEGs of the query list and all the DEGs in a specific gene set; for example, Inline graphic and Inline graphic between List i (including 7 involved DEGs) and Gene Set A (containing 5 DEGs) in Supplementary Figure S2B are 6 and 35, respectively. Inline graphic and Inline graphic are, respectively, the total numbers of all the coexpressed gene pairs and all possible gene pairs between all the DEGs of the query list and all the DEGs of gene sets in the selected collection. Here, the association between the query DEG list and a certain gene set was considered statistically significant when its FDR q value was ≤0.05 (Benjamini–Hochberg correction).

Brief introductions to and suggested thresholds for the NEA (29), ROntoTools (27,28), SPIA (23), CePaORA (24,25), CePaGSA (24,25), PADOG (18), GSA (17), GSEA (16), and ORA (hypergeometric test) (5,6,11,12) methods are presented in Supplementary Table S9. For the NEA method, the suggested aggregate network, merged6_and_wir1_HC2 (50), was used as input. It was constructed by merging the higher confidence network ‘FClim_HC2’ with the curated links available from several databases (e.g. CORUM (51), Phosphosite (52), KEGG (37), and MSigDB (53)) and the reverse-engineered network ‘wir1’. The FClim_HC2 and wir1 networks were derived from the FunCoup (54) database and the TCGA glioblastoma expression, methylation, and mutation datasets, respectively.

Pancancer analysis

To assess the association between each DEG list (or its individual DEGs) and the gene set of a certain pancancer pathway with statistical significance across 16 cancer types (i.e. TCGA RNA-seq datasets), the FDR q value (or nominal P value) for each significant association was transformed to a z score (35). The z scores in the 16 cancers were further summarized using Stouffer's unweighted Z-transform test (55) (i.e. meta-z score). Then, Kendall's τ coefficient (kendalltau function in scipy package) was used to measure the correlation between the associations with 69 pancancer pathways across 16 cancer types identified by Gscore and the other methods.

For an individual DEG, we wanted to observe pathway-wide and cancer-wide signatures associated with prognostic outcomes based on two issues: (i) which gene is involved in the most pancancer pathways, and what survival outcome is correlated with that gene; and (ii) which association between a gene and pancancer pathway is more likely to be shared by distinct tumor types, and what prognostic outcome is correlated with that gene. Therefore, the significance of the pathway-wide and cancer-wide enrichment of each DEG was evaluated by summarizing the z scores in 69 pancancer pathways and 16 distinct cancers using Stouffer's method (unweighted). To determine the pancancer prognostic significance of each gene, we further evaluated the meta-z score based on the z scores of 10-year survival outcomes in 16 cancer types by Cox proportional hazards regression analysis.

Results

Gene set correlation enrichment analysis (Gscore)

Figure 1 illustrates the process of using Gscore to annotate and interpret the DEGs in the RNA expression profiles between two phenotypic classes, i.e. ‘control’ and ‘case’. A DEG list is derived from the profiles of both control and case samples. The coexpression network constructed by using the profiles of case samples is meant to describe and capture the biological phenomenon related to a given disease or condition. Given the DEG list, the coexpression network, and a collection of gene sets (e.g. groups of genes that belong to the same pathway or share a common biological function), Gscore aims to determine whether the DEG list and its individual DEGs are significantly associated with a specific gene set. Based on the hypergeometric distribution and statistical correction, the significance of association between the DEG list (or each individual DEG) and the gene set can be determined (Figure 1 and Supplementary Figure S2; details in Materials and methods). A recent study provided a comprehensive comparison of the performances of 13 pathway analysis methods based on their ability to identify target pathways in 75 human microarray datasets (2). To evaluate and compare the accuracy, robustness, and applicability of our Gscore method with different existing methods, five TB methods (e.g. NEA (29–31), ROntoTools (27,28), SPIA (23), CePaORA, and CePaGSA (24,25)) and four non-TB methods (e.g. PADOG (18), GSEA (16), GSA (17), and ORA (hypergeometric test) (5,6,11,12)) were chosen as representative methods according to their performances in the previous study (2). These methods were implemented by the corresponding tools (e.g. R packages) using the parameters and thresholds suggested by the original authors (Supplementary Table S9). Note that the genome-wide gene expression data (or corresponding DEG lists) and the pathways (or corresponding gene sets) were used as inputs according to the requirements of the different methods.

Comparison of association detection for DEG lists by different pathway analysis methods

To infer the associations between the DEG lists and the KEGG pathways (or their gene sets) using distinct pathway analysis methods (i.e. six TB and four non-TB methods), we first collected the 75 microarray datasets related to 15 diseases from the GEO database (Supplementary Table S1). Next, we further assembled and curated RNA-seq datasets in 16 cancer types from TCGA (Supplementary Table S2) to independently benchmark these methods on RNA-seq datasets. Then, the DEG list derived from each dataset was obtained as the query list for the methods that require DEG lists as input. The gene sets derived from 347 human pathways were collected from the KEGG database. Based on 75 GEO microarray datasets and 16 TCGA RNA-seq datasets, the average percentages of significant associations with the KEGG pathways (or their gene sets) detected by Gscore (∼38%) were lower than those detected by the NEA (∼61%) and CePaGSA (∼48%) methods and greater than those detected by the other methods (<19%; Figure 2A). Note that the KEGG pathway version used for the CePaORA and CePaGSA methods are unknown because the pathway data were already embedded in the tools, but in both cases the pathway number is smaller than that in our version (v. 102). ROntoTools and SPIA methods are signaling pathway impact analysis methods; thus, some of the 347 pathways are not applicable (NA) for them.

Figure 2.

Figure 2.

Comparison of six topology-based (TB) and four nontopology-based (non-TB) methods for detecting associations with KEGG pathways or their gene sets. (A) Distributions of the percentages of significant and nonsignificant associations between the DEG lists (or expression data) and the 347 KEGG pathways (or their gene sets) identified by ten different methods. For some methods, the DEG lists were separately derived from the 75 microarray expression datasets in 15 disease types assembled from the GEO database (left) and the 16 TCGA RNA-seq datasets in 16 cancer types (right) as inputs. For example, the percentage of significant associations across 16 TCGA RNA-seq datasets is the cumulative number of significant associations (i.e. impacted pathways) divided by the cumulative number of all possible associations (Inline graphic). For each method, the significance level threshold suggested in the original work is used (details in Supplementary Table S9). Note that an association was considered not applicable (NA) for the method if its corresponding pathway (or gene set) was filtered out in the preprocessing step or unavailable in the evaluation step. (B) Violin plots showing the ranks of target pathways identified by the ten methods using 75 GEO microarray datasets (top) and 10 TCGA RNA-seq datasets (bottom). Among the collected datasets, those relevant to diseases that already have a corresponding KEGG pathway (i.e. target pathway) are used; method performance is assessed according to the ability to rank the target pathway. In each plot, the value represents the median of ranks across the 75 GEO (or 10 TCGA) datasets.

To benchmark these TB and non-TB methods, we first assessed their ability to identify the target pathways using 75 microarray datasets related to 15 diseases and 10 RNA-seq datasets in 10 cancer types. We can expect that a method with better performance would be able to rank the target pathway on top based on its statistical significance (i.e. a small P value). Of the TB and non-TB methods, Gscore and PADOG achieved the best performance (i.e. the lowest median rank values), respectively, on two different expression platforms (Figure 2B). This also demonstrates concordance with the results of Nguyen et al. (2) showing that PADOG performs better than the other non-TB methods. Additionally, the distributions of median P values of the target pathways for different methods are shown in Supplementary Figure S3. We observed that some methods exhibited inconsistent performance in terms of median rank values and P values. For example, the CePaGSA method exhibited the highest median rank values but the third-lowest median P values in 75 microarray datasets. This is reminiscent of previous reports that the estimate of a method based on the derived P values of the target pathways is not as reliable as that based on rank values due to its intrinsic bias of P values toward zero (2). Since the use of unknown and older KEGG versions may lead to relatively poor performance, the CePaORA and CePaGSA methods were excluded from the subsequent analyses. Taken together, these results suggest that our Gscore method can identify a moderate number of significant associations and has a better ability to detect target pathways than the other TB and most of the non-TB methods (except for the PADOG method), regardless of the expression platform.

Pancancer analysis of significant associations

The pathway analysis method is valuable in interpreting and annotating gene expression data, especially in cancer research (16,56). Cancer is a consequence of abnormalities in various biological pathways that interact in a complex network (35,56,57); therefore, cancer researchers usually try not only to identify the target pathway of a specific cancer type but also to find any other pathways that are relevant to the target pathway (58,59). For instance, alterations in several signaling pathways that regulate cell growth, cell cycle progression, and apoptosis are common hallmarks of different types of cancers (also called pancancer pathways) (60). Therefore, we assumed that the expression datasets (or the DEG lists derived from them) in cancers are likely to be significantly associated with pancancer pathways. To assess the ability of various pathway analysis methods to identify pancancer pathways on TCGA RNA-seq datasets in 16 cancers, we first collected a total of 69 pathways belonging to either the pathways in the ‘6.1 Cancer: overview’ category or pathways related to the pathways in this category as defined by the KEGG database (Supplementary Figure S1C and Supplementary Table S4; details in Materials and methods). To investigate the roles of significant associations in a cancer-wide landscape, the significant associations across 16 cancer types were summarized using Stouffer's unweighted Z-transform test. Among the 69 pancancer pathways (or their corresponding gene sets), Gscore displayed the most efficient ability (∼100%) to identify significantly impacted pancancer pathways, followed by NEA (95.6%) and ROntoTools (41%; Figure 3A). Moreover, the meta-z scores of associations with pancancer pathways for Gscore showed a positive correlation with those of each of the TB methods (SPIA (τ = 0.36, P = Inline graphic, Kendall's τ coefficient), NEA (τ = 0.31, P = 1.5Inline graphic) and ROntoTools (τ = 0.15, P = 0.077)) as well as those for the ORA (τ = 0.20, P = 0.024) method (Figure 3B), suggesting that they are consistent to some extent.

Figure 3.

Figure 3.

Pancancer analysis of the significant associations between DEG lists and pancancer pathways across 16 cancer types using eight different methods. (A) Percentages of significantly impacted pancancer pathways among 69 pancancer pathways detected by different methods. A pancancer pathway was considered significantly impacted across 16 cancers when its meta-z score was ≥ 1.64 (nominal one-sided P value < 0.05). (B) Meta-z scores of associations with pancancer pathways identified by Gscore versus by the other seven methods. The meta-z score indicates the statistical significance of an association between the DEG list and each pancancer pathway across 16 cancers. The dots in quadrants I and III represent the significant (meta-z ≥ 1.64) and nonsignificant (meta-z < 1.64) associations across 16 cancers, respectively, identified by Gscore and other methods. Kendall's τ coefficient is used to measure the correspondence in each plot (red: positive correlation; green: negative correlation). (C) Heatmap of meta-z scores for the top 10 shared pancancer pathways significantly impacted across 16 cancers using eight methods. These top 10 shared pancancer pathways are labeled in (B), in which the number symbols refer to the pathways listed in the bottom-right panel of (B). Circles are used to mark statistically significant associations (meta-z score ≥ 1.64).

Figure 3C shows the top 10 shared pancancer pathways significantly impacted across 16 cancers as identified by the eight methods. Some of these signaling pathways are implicated in the development and progression of multiple cancer types, including the MAPK signaling pathway (61), PPAR signaling pathway (62), calcium signaling pathway (63,64), and cAMP signaling pathway (65); other pathways generally relevant to sustaining proliferative signaling (e.g. cell cycle (66)), evading growth suppressors (e.g. p53 signaling (67,68)), and activation of invasion and metastasis (e.g. focal adhesion (69) and ECM-receptor interaction (70)) were identified as common pancancer signatures (68) (Figures 3B and 3C). Notably, the oxidative phosphorylation, basal transcription factors, and Notch signaling pathways play important roles in tumorigenesis (71–73) but were detected only by Gscore (Supplementary Figure S4). For example, previous studies have indicated that the oxidative phosphorylation system is upregulated in many tumors and is involved in tumor progression and aggressiveness at multiple stages of development (71,74). The Notch signaling pathway has been indicated to have a complex relationship with cancer, acting as either an oncogene (e.g. in breast cancer) or a tumor suppressor (e.g. in myeloid malignancy) depending on the specific tissue and cellular context (73). Tumor cells usually exhibit a higher demand for basal transcription than normal cells, and mutations that enhance transcription are frequently found in cancer cells (72). These results are reminiscent of the high cancer-wide concordance reported among genome-wide prognostic genes (75) and membrane protein-regulated pathways (35) and also imply that Gscore can overcome the limitations of the other methods to better identify associations with pancancer pathways.

Next, the joint average relative specificity similarity (joint-AvgRSS) scores (76,77) for GO biological process (BP) and cellular component (CC) terms were used to quantify whether two groups of genes (i.e. the DEG list and gene set of each pathway) are likely to be involved in similar biological processes and located in the same or adjacent cellular components (35,48,78) (Supplementary Note 1). Note that the joint-AvgRSS score could be evaluated only for those methods (i.e. Gscore, NEA, ROntoTools, SPIA, and ORA) that use a DEG list as input. Based on the RNA-seq datasets in 16 cancers and the 69 pancancer pathways, we observed that the significant associations identified by Gscore displayed the highest joint AvgRSS scores, followed by those of the NEA and SPIA methods (Supplementary Figure S5, left). Moreover, the NEA method exhibited the lowest joint-AvgRSS scores for the nonsignificant associations, followed by the Gscore and ROntoTools methods (Supplementary Figure S5, right). These results suggest that two groups of genes determined to have significant (or nonsignificant) associations by the Gscore and NEA methods tend to participate in similar (or distinct) processes and to be localized in the same or adjacent (or distant) cellular compartments.

To further validate the identified associations, we applied the different methods to 45 microarray datasets related to nine shared cancer types corresponding to the TCGA RNA-seq datasets. Using these datasets, the significant associations with the pancancer pathways (or their corresponding gene sets) were first identified using the different methods. We then calculated the overlap coefficient (OC) for significant associations identified using the corresponding TCGA RNA-seq and GEO microarray cancer datasets for each cancer type. The results showed that the median overlap coefficients for the Gscore and NEA methods were higher than 85%, outperforming those for the other methods (<60%; Supplementary Figure S6A). Consistent results were also found for the significant associations between all 347 KEGG pathways and DEG lists (Supplementary Figure S6B), reflecting that the Gscore and NEA methods generate highly reproducible expression profile annotation and interpretation results even when data from different gene expression platforms are used.

Comparison of association detection for individual DEGs using different pathway analysis methods

Among all the DEGs in a list that are significantly associated with a specific pathway, we hypothesized that a DEG is more likely to play a key role in impacting this pathway if it is coexpressed (or interacts) with most of the genes in this pathway but not those in other pathways. In other words, for a specific pathway, the statistical significance of associations for individual DEGs in the list offers an assessment for the prioritization of DEGs based on their impacts on that pathway. To examine whether the pathway analysis methods could distinguish associations for individual DEGs in a list, we applied the different methods to all of the individual DEGs derived from each TCGA RNA-seq dataset. Note that only the Gscore, NEA, ROntoTools, and SPIA methods, which allow querying of an individual gene, were used in this analysis. Similar to the results for the DEG list, the NEA method detected the highest percentages (5.6%) of significant associations among all possible pairs between 347 pathways and all individual DEGs from the 16 RNA-seq datasets, followed by Gscore (1.3%; Figure 4A and Supplementary Figure S7). We further found that the percentages of significant associations for individual DEGs identified by the ROntoTools and SPIA methods were very low (<0.3%; Supplementary Figure S7); thus, only the Gscore and NEA methods were considered in the subsequent analyses.

Figure 4.

Figure 4.

Quantification of Gscore and NEA methods for identifying associations between the individual DEGs and pathways using TCGA RNA-seq datasets in 16 cancers. (A) Distributions of the percentages of significant associations among all possible pairs between 347 pathways and all individual DEGs derived from 16 RNA-seq datasets. (B) Distributions of the accuracy, precision, recall, and false positive rate (FPR) for detecting the genes involved in 69 pancancer pathways in 16 cancers using the Gscore and NEA methods. Among a total of 3443 genes in 69 pancancer pathways, the DEGs identified from the RNA-seq data in each cancer type were used as query genes. For each pancancer pathway in a cancer type, the DEGs recorded in this pathway (i.e. involved genes) and the remaining DEGs (i.e. genes belonging to other pathways) among the query genes were considered the standard positive and negative cases, respectively. P values < 0.001 are indicated by a triple asterisk (Wilcoxon signed-rank test). (C) Deviation of the cumulative distribution from uniform (DCDU) for the scaled ranks of adverse (left) and favorable (right) prognostic genes derived from significant associations identified by the Gscore and NEA methods. The rank distribution of randomly ordered prognostic genes is plotted as a black solid line. The AUC is calculated as a measure of the level of deviation from uniformity. (D) Volcano plot of the meta-z scores of adverse (red: e.g. KIF18B gene-impacted cell cycle pathway) and favorable (green: e.g. NLRC3 gene-impacted chemokine signaling pathway) prognostic genes (x-axis) versus the meta-z scores of the FDR q values for associations between DEGs and pancancer pathways (y-axis) across 16 cancer types identified by the Gscore method. The gray dots represent nonsignificant associations (meta-z < 1.64 or nominal one-sided P > 0.05) and nonsignificant prognostic genes (|meta-z| < 1.96 or nominal two-sided P > 0.05).

We next asked whether the DEGs involved in the 69 pancancer pathways could be linked with their original pathways using the Gscore and NEA methods. For each pancancer pathway in a cancer type, the DEGs recorded in this pathway (i.e. involved DEGs) and the remaining DEGs (i.e. DEGs belonging to other pathways) among all the query pathways are considered positive and negative cases, respectively. Here, accuracy, precision, recall, and false positive rate (FPR) were evaluated by using equations S1-S4 in Supplementary Note 2. The results showed that Gscore achieved significantly higher accuracy (median of 0.863) and precision (median of 0.425) and significantly lower recall (median of 0.346) and FPR values (median of 0.051) than the NEA method (P values < 0.001, Wilcoxon signed-rank test; Figure 4B). These results suggest that Gscore could efficiently avoid identifying false positive associations but may also miss some positive associations.

To investigate whether DEGs and pancancer pathways exhibited prognostic correlations with human malignancies, we first assessed the correlation of each DEG with 10-year survival outcomes based on gene expression and clinical outcome data from TCGA. Next, we used the meta-z scores of associations for individual DEGs across the 69 pancancer pathways to evaluate and rank the significance of their pathway-wide enrichment. The ranks of adverse (left) and favorable (right) prognostic genes were further examined for the Gscore and NEA methods, and their cumulative distributions (40,79) were plotted (Figure 4C and Supplementary Note 3). Gscore displayed higher positive and negative area under the curve (AUC) values for benchmark gene sets with adverse (left) and favorable (right) prognostic outcomes, respectively, than gene sets identified by the NEA method and random gene sets. This finding implies that Gscore could reflect that genes are more likely to be correlated with adverse (or favorable) survival when involved in multiple (or specific) pancancer pathways.

To further examine the relationship between prognostic significance and association significance, we defined the genes involved in significant associations across 16 cancer types (i.e. meta-z ≥ 1.64) as adverse and favorable prognostic genes when their meta-z scores of prognostic significance were > 1.96 and < −1.96, respectively. Statistical analysis demonstrated that the proportion of adverse and favorable prognostic genes was higher among the genes involved in the significant associations with the pancancer pathways across 16 cancer types identified by Gscore than among the genes involved in the nonsignificant associations (P = Inline graphic, chi-square test) and superior to those identified by the NEA method (P = Inline graphic; Supplementary Figure S8). Next, we described some of the most significant associations correlated with adverse or favorable outcomes. Numerous genes involved in cell cycle regulation (e.g. PLK1, TPX2, and KIF20A) were correlated with adverse outcomes and had strong association significance (Figure 4D, right). For example, PLK1 is categorized as an oncogene and is highly overexpressed in various cancer cells, including gastric, breast, and liver cancers (80–82); TPX2 is also overexpressed in cancers and is considered a diagnostic and prognostic marker for malignancies (83); and KIF20A has been observed to be upregulated in and capable of regulating malignant behavior in numerous cancer types (84–86). The expression of EREG (encoding proepiregulin), which belongs to the ErbB family of ligands and is involved in the MAPK signaling pathway, was most frequently correlated with adverse outcomes. EREG is overexpressed in many human cancers and has been implicated in tumor progression (e.g. resistance to apoptosis) and metastasis (87). Interestingly, the expression of several genes associated with various immune-related pathways (e.g. chemokine signaling pathway, T cell receptor signaling pathway, and PD-L1 expression and PD-1 checkpoint pathway in cancer) exhibited favorable prognostic associations (Figure 4D, left). For instance, the expression of NLRC3, which encodes NLR family CARD domain-containing protein 3 and functions as a potential tumor suppressor through the inhibition of cellular proliferation and stem-cell-derived organoid formation (88) and the regulation of inflammation (89) in several pathways (e.g. chemokine signaling), was most frequently correlated with favorable outcomes. In view of the above results, our Gscore method could provide insights for future investigations of how pathway-wide and cancer-wide associations for individual genes reflect prognostic outcomes.

Annotation and interpretation of gene expression data from cells and patient samples with COVID-19 infection

As of 21 June 2023, the World Health Organization (WHO) reported more than 768 million infections and 6.9 million deaths related to COVID-19. It is necessary to understand the disease mechanisms to control the pandemic. To further apply and validate our Gscore and the other methods to this important problem, we first collected four RNA-seq datasets from samples derived from patients with or without COVID-19 (or cells with or without SARS-CoV-2 infection) from the GEO database, including patient-derived plasma and leukocyte (PPL) samples (GSE157103) (41), patient-derived autopsy (PA) samples (GSE150316) (42), human bronchial epithelial (NHBE) cells (GSE147507) (43,44), and alveolar type 2 progenitor (AT2) cells (GSE160435) (45). Additionally, we obtained 17 COVID-19-related pathways and the corresponding gene sets in humans, including the target pathway (i.e. Coronavirus disease—COVID-19, KEGG ID: hsa05171) and its 16 related pathways as defined in the KEGG database (Supplementary Table S7). By applying different pathway analysis methods to each RNA-seq dataset, we detected the associations with all the KEGG human pathways (or their gene sets) to examine whether the COVID-19-related pathways are significantly impacted in a given phenotype. Statistical analysis showed that the proportions of COVID-19-related pathways among the significantly impacted pathways detected by Gscore were higher than those among the nonsignificantly impacted pathways on the datasets of PPL samples (P value = 0.023, Fisher's exact test), PA samples (P value = 0.072), and NHBE cells (P value = Inline graphic), outperforming the other methods (Figures 5A and B and Supplementary Figures S9A and S9B). On the AT2 cell dataset, the proportions of COVID-19-related pathways achieved statistical significance (P value < 0.05 or Inline graphic, Fisher's exact test) only when using Gscore, ROntoTools, SPIA and PADOG. Notably, Gscore also achieved the best performance in terms of rank (i.e. the lowest median rank values) for the target pathway ‘coronavirus disease - COVID-19′ in these four datasets, followed by the SPIA and ORA methods (Supplementary Figure S9C).

Figure 5.

Figure 5.

Detection of associations between COVID-19-related pathways and the DEG lists derived from RNA-seq datasets of plasma and leukocyte (PPL) samples and autopsy (PA) samples from patients with or without COVID-19 infection using eight methods. (A) Bar chart (top panel) showing the statistical analysis of whether the proportion of COVID-19-related pathways was higher among the significantly impacted pathways than among the nonsignificantly impacted pathways for two different DEG lists. The dashed line indicates the statistical significance threshold of P value < 0.05 (Inline graphic, Fisher's exact test). (B) Dot plot showing the significant associations between 17 COVID-19-related pathways and the DEG lists derived from two RNA-seq datasets. The dot size is proportional to the rank of each COVID-19-related pathway identified by the corresponding method. The color bar indicates the nominal P value for the PADOG method and FDR q values for the other methods. (C) Association between the coronavirus disease - COVID-19 pathway (hsa05171) and the list of top 100 DEGs derived from the PPL sample dataset detected by Gscore. In this list, 26 of 100 DEGs were coexpressed with at least one DEG in the pathway; the top 3 ranked DEGs (RSAD2, EPSTI1, and IFI27) with the higher expression fold changes and their coexpressed gene pairs are shown. The squares with colored and bold letters represent the DEGs in the list and the pathway. Orange dotted lines indicate the coexpressed gene pairs (|Pearson's r| ≥ 0.7) between DEGs in the list and the pathway. In this pathway, the upregulated (red) and downregulated (green) DEGs coexpressed with DEGs in the list and their interacting neighbors are labeled with gene names.

A recent study demonstrated that patients with COVID-19 had severe symptoms related to hyperinflammation and suggested a total of 32 impacted pathways based on the PA sample and NHBE cell datasets (90). In the cell-derived datasets, the Gscore, NEA, ROnoTools, SPIA, and ORA methods detected ≥ 25 (90.6% for NHBE cells and 78.1% for AT2 cells), 32 (100% and 100%), ≥23 (87.5% and 71.9%), ≥26 (81.3% and 84.4%), and ≥ 25 (90.6% and 78.1%) of the 32 impacted pathways, respectively. In the patient-derived datasets, Gscore identified significant associations with ≥ 15 (46.9%) of the 32 impacted pathways, more than most of the other methods (Supplementary Figure S9D). In short, Gscore is better able to detect not only the target pathway of COVID-19 but also the related pathways in both patient- and cell-derived datasets.

Our Gscore method also revealed possible mechanisms for how DEGs are involved in the target pathway of COVID-19. Among the top 100 DEGs, 26 DEGs had a high correlation (|Pearson's r ≥ 0.7|) with those in the target pathway of COVID-19 (Figure 5C), and four of them were directly included in this pathway. Among these 26 DEGs, RSAD2 (encoding radical S-adenosyl methionine domain-containing protein 2) was strongly correlated with several upstream (e.g. STAT1/2) and downstream (e.g. OAS1/2/3 and MX1) genes in the interferon-stimulated JAK-STAT signaling pathway (Figure 5C). Previous studies have suggested that the interferon-driven antiviral response in the nasopharynx is triggered by SARS-CoV-2 infection and results in the upregulation of RSAD2 (an interferon-stimulated gene), which exerts antiviral functions (91,92). Moreover, another two genes, IFI27 (encoding interferon alpha-inducible protein 27, mitochondrial) and EPSTI1 (encoding epithelial-stromal interaction protein 1), have numerous strong correlations with the genes in the target pathway of COVID-19. IFI27 is considered one of the canonical inflammatory signature genes in patients with COVID-19 (93), and transcriptome data revealed the upregulation of EPSTI1 in bronchoalveolar lavage cells and peripheral blood mononuclear cells from patients with severe COVID-19 (94).

Additionally, we found that the proportions of COVID-19-related genes among the DEGs associated with ≥1 COVID-19-related pathway were significantly higher than those among the DEGs not associated with any COVID-19-related pathway by the Gscore method, irrespective of clinical sample source (Figure 6A), and Gscore also outperformed the NEA method. Compared to the DEGs with weak associations in specific pathways, the DEGs with strong associations across multiple pathways had higher counts among 722 gene sets collected from the COVID-19 Drug and Gene Set Library (95); furthermore, they were more likely to be included among the COVID-19-related genes assembled from the UniProtKB (96) and DisGeNET (97) databases (Figure 6B). In summary, Gscore using coexpression networks compensates for the limitations of the other methods and provides routes of access for gene expression profile annotation and interpretation of a DEG list of interest and its individual members.

Figure 6.

Figure 6.

Quantification of Gscore and NEA methods for identifying associations between individual DEGs and COVID-19-related pathways using two RNA-seq datasets (left: PPL samples; right: PA samples) relevant to COVID-19. (A) Bar charts showing statistical analyses of whether the proportions of COVID-19-related genes are higher among the DEGs significantly associated with ≥1 COVID-19-related pathway than among the DEGs not significantly associated with any COVID-19-related pathways. From a total of 26162 genes in 722 gene sets in the COVID-19 Drug and Gene Set Library, we derived 923 and 173 COVID-19-related genes with counts within the top 5% and 1% (i.e. most common genes), respectively. Moreover, a merged set consisting of 896 COVID-19-related genes was assembled from the UniProtKB and DisGeNET databases. The dashed line represents the statistical significance threshold of P value < 0.05 (Inline graphic, Fisher's exact test). (B) Heatmaps of significant associations between DEGs and COVID-19-related pathways detected by the Gscore method. The FDR q values were transformed into z scores to represent the association significance. The COVID-19-related genes recorded in UniProtKB (blue), DisGeNET (orange), or both databases and the COVID-19 Drug and Gene Set Library (the bar chart showing the counts) are marked in the right panels. The number symbols refer to the pathways listed in the corresponding bottom panels.

Implementation of the Gscore method

To offer Gscore as an online analytical tool, we constructed a server (https://gscore.ibsb.nycu.edu.tw/) through which to interpret and characterize input gene expression data by assessing the statistical significance of the association between the DEG list and a collection of gene sets. The Gscore online tool provides a user-friendly interface with a clearly defined procedure, statistically rigorous hypothesis testing, and intuitive visualization of the findings (Figure 7). Users can first upload their own expression dataset and then either upload a collection of gene sets in several organisms (i.e. Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae or Danio rerio) or select from curated gene sets collected from public databases (37,51,53,77,98–103). Similar to many tools that employ pathway analysis methods, Gscore produces heatmaps, dot plots, network graphs, and tables as output for the visualization of significant associations and coexpressed gene pairs. These output data are available to download for further analysis. An Apache HTTP server running on an Ubuntu system was built for the Gscore online tool, and the webpage was developed using PHP and JavaScript. The Gscore online tool is compatible with the most common web browsers, including Windows Edge, Google Chrome, Mozilla Firefox, and Safari, with no additional plugin installation required. Cytoscape.js (104) is employed for network visualization and also runs without any extra plugins.

Figure 7.

Figure 7.

Overview of the Gscore web server (https://gscore.ibsb.nycu.edu.tw/) for annotating and interpreting gene expression data. Users can either upload their own expression profile and gene set data or select curated data collected from several public databases. Gscore offers a user-friendly interface providing a clearly defined procedure, statistically rigorous hypothesis testing, and intuitive visualization of the findings (e.g. interactive tables, dot plots, heatmaps, and network graphs).

Discussion

Gaining biological insight from genome-wide expression profiles remains a major challenge, and our Gscore method represents a powerful analytical approach for complementing current pathway analysis methods, delineating the relationships between DEG lists (or their individual DEGs) and gene sets defined by prior biological knowledge and aiding in the interpretation of gene expression data.

Gscore has unique advantages over related methods and tools (5,6,11,12,16–18,23–25,27–29). First, using 75 benchmark microarray datasets related to 15 diseases and 10 TCGA RNA-seq datasets in 10 cancer types, Gscore performed better than the other TB methods and most non-TB methods, except PADOG, in identifying the target pathways as significantly impacted (or associated) and ranking them at the top of the impacted pathway list (Figure 2B). Moreover, Gscore achieved the best performance in ranking the target pathway (i.e. coronavirus disease - COVID-19) in four COVID-19 RNA-seq datasets, followed by another TB method, SPIA (Supplementary Figure S9C). These results are consistent with previous findings (1,2,105) that TB methods considering the dependencies and interactions between genes (or proteins) are better able than non-TB methods to detect significantly impacted pathways for further annotating and interpreting gene expression profiles.

Second, Gscore showed good detection ability and reliable results because it does not require users to input a reference network and is not restricted to pathway data formatted in the manner specified by certain databases, which are limitations of the other TB methods. For example, the commonly used human protein–protein interaction (PPI) network from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database (106) is one of the suggested reference networks in EviNet (30), the web tool for the NEA method. However, previous work suggested that this network is inappropriate for cancer data analysis since it includes data from homologous genes in prokaryotic organisms (50). It can be expected that the selection of appropriate reference networks for different research fields will become a challenging issue for users. Additionally, most pathway knowledge bases and PPI networks are constructed by curating the results of experiments conducted in distinct cell types at different time points under diverse conditions; thus, they are considered aggregate (1,40). In general, aggregate networks require frequent updates and integration to be applicable to diverse research topics; specifically, they may fail to promptly include interactions on specific conditions and cells when there is a lack of corresponding experimental data (1), such as emerging infectious diseases like COVID-19. Due to the condition-specific and cell-specific nature of gene expression profiles, the true pathway/network topology depends on the condition being studied and the cell type (1,33). The use of a coexpression network and query DEG list derived from the same gene expression data could offer a way to resolve the above issues (32,34,35,107,108). Thus, we further compared the performances of Gscore when using the dataset-derived coexpression network and the aggregate network, merged6_and_wir1_HC2, based on the ability to identify target pathways in 50 microarray datasets related to 10 cancer types, 25 microarray datasets relevant to 5 other diseases, and 4 COVID-19 RNA-seq datasets. Notably, the merged6_and_wir1_HC2 network includes the wir1 network, which has been suggested as beneficial for cancer data analysis (50). Compared to using the dataset-derived coexpression network, we found that, as expected, Gscore achieved a better performance in cancer-related datasets but worse in datasets related to 5 other diseases and COVID-19 when using the merged6_and_wir1_HC2 network as the reference network (Supplementary Figure S10). Next, to examine whether Gscore performs better when using the dataset-derived coexpression network than when using the aggregate coexpression network, we collected an aggregate coexpression network built by using the microarray and RNA-seq data of 245698 samples from the COXPRESdb database (ver 8.1) (109). By evaluating performance based on the ability to identify target pathways, we observed that Gscore performed better on both the microarray and RNA-seq datasets when using the dataset-derived coexpression network than when using the aggregate coexpression network (Supplementary Figure S11). These results suggest that the Gscore using the dataset-derived coexpression network offers an alternative and promising solution to address the challenges of reference network selection and data integration before obtaining the complete biological network (i.e. interactome).

Third, although prior knowledge about gene characteristics is available through public online databases (e.g. GO (77) and GeneCards (110)), and many pathway analysis methods have been developed (Supplementary Table S9), interpreting a long list of DEGs and the individual DEGs within the list is still a cumbersome and challenging task (111). Network topology in the context of gene or protein interactions allows researchers to examine how individual elements impact larger functional units, enabling the identification of key genes (or proteins) that may play critical roles in biological processes or even diseases (57). Therefore, one of our aims in developing the Gscore method was to enable the detection of specific gene sets significantly associated with the DEG list and each of its DEGs. For example, we observed that compared with the NEA method, Gscore had higher accuracy and precision and lower false positive rate and lower recall in distinguishing the significant associations between 69 pancancer pathways and their involved DEGs (Figure 4B). In contrast, the NEA method is suggested for use for researchers who would like to find as many true associations as possible (i.e. high recall). In addition, when applied to assess the RNA-seq data of patients with COVID-19 (or cells with SARS-CoV-2 infection), we also found that Gscore could comprehensively associate DEG lists with COVID-19-related pathways (Figures 5A and B and Supplementary Figures S9A and S9B) and further determine which of these DEGs are more likely to be COVID-19-related genes (Figure 6).

Fourth, multiple RNA-seq and microarray datasets for human cancers were included to examine the performance and validate the reproducibility of the Gscore method. The overlap analysis for the significant associations identified using the TCGA RNA-seq and GEO microarray datasets in nine cancer types showed highly robust detection ability for the Gscore and NEA methods (Supplementary Figure S6). We further found that the NEA method inferred the highest percentages of significant associations from both RNA-seq and GEO microarray datasets among the methods tested (Figure 2A); this may explain why NEA exhibited the highest overlap coefficients. Among all the methods, NEA achieved the lowest median P values (<Inline graphic), followed by CePaGSA (median P values = 0.001; Supplementary Figure S3). These results are reminiscent of those of a previous study that reported that a method is likely to generate low P values even when the pathway is not significantly impacted if the null distribution of P values for this pathway yielded by a method is biased toward zero (2,112).

Finally, integrating Gscore with a meta-z score approach across numerous malignancies not only facilitates pancancer analysis to reveal which DEGs and associated pathways are common tumor hallmarks (Figure 4D) but also uncovers important pancancer pathways that the other methods may miss (Figure 3B and Supplementary Figure S4). Our Gscore method further offers guidance for observing how pathway-wide and cancer-wide associations for individual genes reflect clinical outcomes, such as alterations in prognostic correlations (Figures 4C and D). For example, genes involved in multiple and specific pancancer pathways are frequently correlated with adverse and favorable survival outcomes, respectively. The identification of cell cycle associated with many DEGs, including PLK1, ANLN and KIF18B/20A, across human cancers demonstrates that the most essential characteristic of cancer cells is sustained proliferation, and that this may lead to adverse survival outcomes (68). Conversely, several tumor suppressors (e.g. NLRC3 (88)) associated with immune-relevant pathways (e.g. chemokine signaling and T-cell receptor signaling) were found to be the most frequently correlated with favorable outcomes in certain cancers, reflecting the roles of the immune system in antagonizing tumor development and progression (68,113).

Gscore has several limitations, challenges, and perspectives. First, to evaluate the coexpressed gene pairs across samples of the case class and determine the DEGs between two classes via statistically rigorous hypothesis testing, the sample sizes of the control and case classes should be at least three and five, respectively. To examine the influence of sample size of gene expression data on the Gscore method, we utilized 16 TCGA RNA-seq datasets and 75 GEO microarray datasets to execute simulation analysis to detect significant associations based on the gene expression profiles across all tumor samples and across randomly selected samples of different sizes. The results suggested that Gscore performed significantly better when using at least five samples (versus only three samples) in the case class (P value Inline graphic, Mann–Whitney U test; Supplementary Figure S12). We believe that the continuous progress in single-cell sequencing and next-generation sequencing technologies (114,115) will offer expression data from large numbers of samples to allow this limitation to be overcome and concomitantly increase the need for data interpretation methods. A second potential constraint of Gscore is that our method may neglect groups of genes that have causal relationships but are not strongly coexpressed because coexpression is less evident in cascade signal transduction, which is usually hierarchical in nature (35). Third, Gscore does not consider the node degree in the gene coexpression network. Indeed, node degrees in the network allow the identification of hub genes that are highly connected with other nodes and potentially represent key regulators or drivers in a specific pathway or even phenotype (116). Nevertheless, a major challenge remains in considering node degree, namely, that this approach can be sensitive to false positive edges if the underlying coexpression network contains spurious or incorrect connections, which may be driven by confounding variables (117); thus, the inclusion of node degree can be expected to affect the reliability of the results. However, Gscore can be further improved in the future by considering the node degrees after reducing false edge discovery in the coexpression network, for example, by using partial correlation or conditional mutual information instead of Pearson correlation coefficient. Finally, another challenge of the Gscore online tool is that it is time-consuming since Gscore must construct a coexpression network for every input expression dataset and simultaneously identify the impacted gene set for each DEG and the entire DEG list. Even so, by following the well-defined procedure provided with the Gscore online tool, users can obtain analysis outputs with intuitive visualizations within several minutes on average.

In conclusion, our results illuminate a strategy for extracting insights from gene expression data and provide numerous clues for further investigation and clinical translation for application in cancer and COVID-19 research. Our Gscore approach also addresses some of the limitations of other widely used methods to further improve the annotation and characterization of high-throughput expression data. Moreover, our Gscore online tool provides an easy-to-use framework to facilitate the annotation and characterization of individual DEGs, DEG lists, and genome-wide expression profiles based on existing biological knowledge.

Supplementary Material

gkad1187_supplemental_file

Acknowledgements

We thank Miss Yan-Hua Chen for assisting with developing part of the functions of the Gscore online tool.

Contributor Information

Lan-Yun Chang, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan.

Meng-Zhan Lee, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan.

Yujia Wu, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan.

Wen-Kai Lee, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan.

Chia-Liang Ma, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan.

Jun-Mao Chang, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan.

Ciao-Wen Chen, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan.

Tzu-Chun Huang, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan.

Chia-Hwa Lee, School of Medical Laboratory Science and Biotechnology, College of Medical Science and Technology, Taipei Medical University, New Taipei City 235, Taiwan; Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; TMU Research Center of Cancer Translational Medicine, Taipei Medical University, Taipei 110, Taiwan; Ph.D. Program in Medical Biotechnology, College of Medical Science and Technology, Taipei Medical University, New Taipei City 235, Taiwan.

Jih-Chin Lee, Department of Otolaryngology-Head and Neck Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei 110, Taiwan.

Yu-Yao Tseng, Department of Food Science, Nutrition, and Nutraceutical Biotechnology, Shih Chien University, Taipei 104, Taiwan.

Chun-Yu Lin, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; Cancer and Immunology Research Center, National Yang Ming Chiao Tung University, Taipei 112, Taiwan; Institute of Data Science and Engineering, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; School of Dentistry, Kaohsiung Medical University, Kaohsiung 807, Taiwan.

Data availability

The Gscore online tool is available at https://gscore.ibsb.nycu.edu.tw/. This online tool is free and open to all users and there is no login requirement. Gscore and the code used for analysis are implemented in Python and available at https://github.com/SysMednet/Gscore (permanent DOI: https://doi.org/10.5281/zenodo.8260953). The accession numbers of the gene expression datasets assembled from TCGA and GEO databases are listed in Supplementary Tables S1, S2 and S3.

Supplementary data

Supplementary Data are available at NAR Online.

Funding

Young Scholar Fellowship Program by the National Science and Technology Council (NSTC) in Taiwan [NSTC 112-2636-B-A49-007]; ‘Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B)’ and ‘Cancer and Immunology Research Center' [112W31101] from The Featured Areas Research Center Program within the framework of the Higher Education SPROUT Project by the Ministry of Education (MOE) in Taiwan; Ministry of Science and Technology (MOST) in Taiwan [MOST-111-2628-B-038-002, in part] for Dr Lee; NSTC [NSTC-112-2320-B-038-013] for Dr Lee; Shih Chien University and Genesys Logic [USC-109-03-02017, USC-111-05-02001] for Dr Tseng. Funding for open access charge: National Science and Technology Council (NSTC) in Taiwan [NSTC 112-2636-B-A49-007].

Conflict of interest statement. None declared.

References

  • 1. Khatri P., Sirota M., Butte A.J.. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 2012; 8:e1002375. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Nguyen T.M., Shafi A., Nguyen T., Draghici S.. Identifying significantly impacted pathways: a comprehensive review and assessment. Genome Biol. 2019; 20:203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Tarca A.L., Bhatti G., Romero R.. A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One. 2013; 8:e79217. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Khatri P., Draghici S.. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005; 21:3587–3595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Khatri P., Draghici S., Ostermeier G.C., Krawetz S.A.. Profiling gene expression using onto-express. Genomics. 2002; 79:266–270. [DOI] [PubMed] [Google Scholar]
  • 6. Draghici S., Khatri P., Martins R.P., Ostermeier G.C., Krawetz S.A.. Global functional profiling of gene expression. Genomics. 2003; 81:98–104. [DOI] [PubMed] [Google Scholar]
  • 7. Dahlquist K.D., Salomonis N., Vranizan K., Lawlor S.C., Conklin B.R.. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet. 2002; 31:19–20. [DOI] [PubMed] [Google Scholar]
  • 8. Hosack D.A., Dennis G. Jr, Sherman B.T., Lane H.C., Lempicki R.A. Identifying biological themes within lists of genes with EASE. Genome Biol. 2003; 4:R70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Castillo-Davis C.I., Hartl D.L.. GeneMerge–post-genomic analysis, data mining, and hypothesis testing. Bioinformatics. 2003; 19:891–892. [DOI] [PubMed] [Google Scholar]
  • 10. Berriz G.F., King O.D., Bryant B., Sander C., Roth F.P.. Characterizing gene sets with FuncAssociate. Bioinformatics. 2003; 19:2502–2504. [DOI] [PubMed] [Google Scholar]
  • 11. Al-Shahrour F., Diaz-Uriarte R., Dopazo J.. FatiGO: A web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics. 2004; 20:578–580. [DOI] [PubMed] [Google Scholar]
  • 12. Beissbarth T., Speed T.P.. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 2004; 20:1464–1465. [DOI] [PubMed] [Google Scholar]
  • 13. Huang da W., Sherman B.T., Lempicki R.A.. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009; 4:44–57. [DOI] [PubMed] [Google Scholar]
  • 14. Sherman B.T., Hao M., Qiu J., Jiao X., Baseler M.W., Lane H.C., Imamichi T., Chang W.. DAVID: A web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022; 50:W216–W221. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Liao Y., Wang J., Jaehnig E.J., Shi Z., Zhang B.. WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 2019; 47:W199–W205. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S.et al.. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005; 102:15545–15550. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Efron B., Tibshirani R.. On testing the significance of sets of genes. Ann. Appl. Stat. 2007; 1:107–129. [Google Scholar]
  • 18. Tarca A.L., Draghici S., Bhatti G., Romero R.. Down-weighting overlapping genes improves gene set analysis. BMC Bioinf. 2012; 13:136. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Mitrea C., Taghavi Z., Bokanizad B., Hanoudi S., Tagett R., Donato M., Voichita C., Draghici S.. Methods and approaches in the topology-based analysis of biological pathways. Front. Physiol. 2013; 4:278. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Nguyen T., Mitrea C., Draghici S.. Network-Based Approaches for Pathway Level Analysis. Curr. Protoc. Bioinformatics. 2018; 61:8.25.1–8.25.24. [DOI] [PubMed] [Google Scholar]
  • 21. Draghici S., Khatri P., Tarca A.L., Amin K., Done A., Voichita C., Georgescu C., Romero R.. A systems biology approach for pathway level analysis. Genome Res. 2007; 17:1537–1545. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Khatri P., Draghici S., Tarca A.L., Hassan S.S., Romero R.. A System Biology Approach for the Steady-State Analysis of Gene Signaling Networks. Progress in Pattern Recognition, Image Analysis and Applications: 12th Iberoamericann Congress on Pattern Recognition, CIARP 2007, Valparaiso, Chile, November 13-16, 2007. Proceedings. 2007; 12:32–41. [Google Scholar]
  • 23. Tarca A.L., Draghici S., Khatri P., Hassan S.S., Mittal P., Kim J.S., Kim C.J., Kusanovic J.P., Romero R.. A novel signaling pathway impact analysis. Bioinformatics. 2009; 25:75–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Gu Z., Liu J., Cao K., Zhang J., Wang J.. Centrality-based pathway enrichment: A systematic approach for finding significant pathways dominated by key genes. BMC Syst. Biol. 2012; 6:56. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Gu Z., Wang J.. CePa: an R package for finding significant pathways weighted by multiple network centralities. Bioinformatics. 2013; 29:658–660. [DOI] [PubMed] [Google Scholar]
  • 26. Dutta B., Wallqvist A., Reifman J.. PathNet: A tool for pathway analysis using topological information. Source Code Biol. Med. 2012; 7:10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Draghici S., Khatri P., Bhavsar P., Shah A., Krawetz S.A., Tainsky M.A.. Onto-Tools, the toolkit of the modern biologist: onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res. 2003; 31:3775–3781. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Voichita C., Donato M., Draghici S.. Incorporating Gene Significance in the Impact Analysis of Signaling Pathways. 2012 11th International Conference on Machine Learning and Applications. 2012; 1:126–131. [Google Scholar]
  • 29. Alexeyenko A., Lee W., Pernemalm M., Guegan J., Dessen P., Lazar V., Lehtio J., Pawitan Y.. Network enrichment analysis: extension of gene-set enrichment analysis to gene networks. BMC Bioinf. 2012; 13:226. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Jeggari A., Alekseenko Z., Petrov I., Dias J.M., Ericson J., Alexeyenko A.. EviNet: A web platform for network enrichment analysis with flexible definition of gene sets. Nucleic Acids Res. 2018; 46:W163–W170. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Jeggari A., Alexeyenko A.. NEArender: an R package for functional interpretation of ‘omics’ data via network enrichment analysis. BMC Bioinf. 2017; 18:118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. van Dam S., Vosa U., van der Graaf A., Franke L., de Magalhaes J.P.. Gene co-expression analysis for functional classification and gene-disease predictions. Brief Bioinform. 2018; 19:575–592. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Aittokallio T. Dealing with missing values in large-scale studies: microarray data imputation and beyond. Brief Bioinform. 2010; 11:253–264. [DOI] [PubMed] [Google Scholar]
  • 34. Zhang B., Horvath S.. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet Mol. Biol. 2005; 4:Article17. [DOI] [PubMed] [Google Scholar]
  • 35. Lin C.Y., Lee C.H., Chuang Y.H., Lee J.Y., Chiu Y.Y., Wu Lee Y.H., Jong Y.J., Hwang J.K., Huang S.H., Chen L.C.et al.. Membrane protein-regulated networks across human cancers. Nat. Commun. 2019; 10:3131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Chuang Y.H., Lee C.H., Lin C.Y., Liu C.L., Huang S.H., Lee J.Y., Chiu Y.Y., Lee J.C., Yang J.M.. An integrated genomic strategy to identify CHRNB4 as a diagnostic/prognostic biomarker for targeted therapy in head and neck cancer. Cancers (Basel). 2020; 12:1324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Kanehisa M., Goto S.. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28:27–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Barrett T., Wilhite S.E., Ledoux P., Evangelista C., Kim I.F., Tomashevsky M., Marshall K.A., Phillippy K.H., Sherman P.M., Holko M.et al.. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013; 41:D991–D995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Cancer Genome Atlas Research Network Weinstein J.N., Collisson E.A., Mills G.B., Shaw K.R., Ozenberger B.A., Ellrott K., Shmulevich I., Sander C., Stuart J.M. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 2013; 45:1113–1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Chen H.H., Hsueh C.W., Lee C.H., Hao T.Y., Tu T.Y., Chang L.Y., Lee J.C., Lin C.Y.. SWEET: a single-sample network inference method for deciphering individual features in disease. Brief Bioinform. 2023; 24:bbad032. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Overmyer K.A., Shishkova E., Miller I.J., Balnis J., Bernstein M.N., Peters-Clarke T.M., Meyer J.G., Quan Q., Muehlbauer L.K., Trujillo E.A.et al.. Large-Scale Multi-omic Analysis of COVID-19 Severity. Cell Syst. 2021; 12:23–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Desai N., Neyaz A., Szabolcs A., Shih A.R., Chen J.H., Thapar V., Nieman L.T., Solovyov A., Mehta A., Lieb D.J.et al.. Temporal and spatial heterogeneity of host response to SARS-CoV-2 pulmonary infection. Nat. Commun. 2020; 11:6319. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Blanco-Melo D., Nilsson-Payant B.E., Liu W.C., Uhl S., Hoagland D., Moller R., Jordan T.X., Oishi K., Panis M., Sachs D.et al.. Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19. Cell. 2020; 181:1036–1045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Daamen A.R., Bachali P., Owen K.A., Kingsmore K.M., Hubbard E.L., Labonte A.C., Robl R., Shrotri S., Grammer A.C., Lipsky P.E.. Comprehensive transcriptomic analysis of COVID-19 blood, lung, and airway. Sci. Rep. 2021; 11:7052. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Mulay A., Konda B., Garcia G. Jr, Yao C., Beil S., Villalba J.M., Koziol C., Sen C., Purkayastha A., Kolls J.Ket al.. SARS-CoV-2 infection of primary human lung epithelium for COVID-19 modeling and drug discovery. Cell Rep. 2021; 35:109055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Bokanizad B., Tagett R., Ansari S., Helmi B.H., Draghici S.. SPATIAL: A System-level PAThway Impact AnaLysis approach. Nucleic Acids Res. 2016; 44:5034–5044. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Bandyopadhyay S., Mehta M., Kuo D., Sung M.K., Chuang R., Jaehnig E.J., Bodenmiller B., Licon K., Copeland W., Shales M.et al.. Rewiring of genetic networks in response to DNA damage. Science. 2010; 330:1385–1389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Lin C.Y., Lin Y.W., Yu S.W., Lo Y.S., Yang J.M.. MoNetFamily: A web server to infer homologous modules and module-module interaction networks in vertebrates. Nucleic Acids Res. 2012; 40:W263–W270. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Benjamini Y., Hochberg Y.. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol. 1995; 57:289–300. [Google Scholar]
  • 50. Merid S.K., Goranskaya D., Alexeyenko A.. Distinguishing between driver and passenger mutations in individual cancer genomes by network enrichment analysis. BMC Bioinf. 2014; 15:308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Giurgiu M., Reinhard J., Brauner B., Dunger-Kaltenbach I., Fobo G., Frishman G., Montrone C., Ruepp A.. CORUM: the comprehensive resource of mammalian protein complexes-2019. Nucleic Acids Res. 2019; 47:D559–D563. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Hornbeck P.V., Kornhauser J.M., Tkachev S., Zhang B., Skrzypek E., Murray B., Latham V., Sullivan M.. PhosphoSitePlus: A comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2012; 40:D261–D270. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Liberzon A., Birger C., Thorvaldsdottir H., Ghandi M., Mesirov J.P., Tamayo P.. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015; 1:417–425. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Alexeyenko A., Sonnhammer E.L.. Global networks of functional coupling in eukaryotes from comprehensive data integration. Genome Res. 2009; 19:1107–1116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Whitlock M.C. Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 2005; 18:1368–1373. [DOI] [PubMed] [Google Scholar]
  • 56. Zuo S., Wei M., Wang S., Dong J., Wei J.. Pan-cancer analysis of immune cell infiltration identifies as prognostic immune-cell characteristic score (ICCS) in lung adenocarcinoma. Front. Immunol. 2020; 11:1218. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Barabasi A.L., Gulbahce N., Loscalzo J.. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 2011; 12:56–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Wang Y., Jin H.Y., Fang M.Z., Wang X.F., Chen H., Huang S.L., Kong D.S., Li M., Zhang X., Sun Y.et al.. Epigallocatechin gallate inhibits dimethylhydrazine-induced colorectal cancer in rats. World J. Gastroenterol. 2020; 26:2064–2081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Hu Z., Ning M., Qin S., Yu K.. Exploring the molecular mechanism of Lycium barbarum L. against breast cancer based on network pharmacology. J. Funct. Foods. 2023; 105:105545. [Google Scholar]
  • 60. Sanchez-Vega F., Mina M., Armenia J., Chatila W.K., Luna A., La K.C., Dimitriadoy S., Liu D.L., Kantheti H.S., Saghafinia S.et al.. Oncogenic signaling pathways in the cancer genome atlas. Cell. 2018; 173:321–337. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Santarpia L., Lippman S.M., El-Naggar A.K.. Targeting the MAPK-RAS-RAF signaling pathway in cancer therapy. Expert Opin. Ther. Targets. 2012; 16:103–119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Cheng H.S., Yip Y.S., Lim E.K.Y., Wahli W., Tan N.S.. PPARs and tumor microenvironment: the emerging roles of the metabolic master regulators in tumor stromal-epithelial crosstalk and carcinogenesis. Cancers (Basel). 2021; 13:2153. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Cui C., Merritt R., Fu L., Pan Z.. Targeting calcium signaling in cancer therapy. Acta Pharm Sin B. 2017; 7:3–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Monteith G.R., Davis F.M., Roberts-Thomson S.J.. Calcium channels and pumps in cancer: changes and consequences. J. Biol. Chem. 2012; 287:31666–31673. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Ahmed M.B., Alghamdi A.A.A., Islam S.U., Lee J.S., Lee Y.S.. cAMP signaling in cancer: a PKA-CREB and EPAC-centric approach. Cells. 2022; 11:2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Matthews H.K., Bertoli C., de Bruin R.A.M.. Cell cycle control in cancer. Nat. Rev. Mol. Cell Biol. 2022; 23:74–88. [DOI] [PubMed] [Google Scholar]
  • 67. Stegh A.H. Targeting the p53 signaling pathway in cancer therapy - the promises, challenges and perils. Expert Opin. Ther. Targets. 2012; 16:67–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Hanahan D., Weinberg R.A.. Hallmarks of cancer: the next generation. Cell. 2011; 144:646–674. [DOI] [PubMed] [Google Scholar]
  • 69. Provenzano P.P., Keely P.J.. The role of focal adhesion kinase in tumor initiation and progression. Cell Adh. Migr. 2009; 3:347–350. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Bao Y., Wang L., Shi L., Yun F., Liu X., Chen Y., Chen C., Ren Y., Jia Y.. Transcriptome profiling revealed multiple genes and ECM-receptor interaction pathways that may be associated with breast cancer. Cell. Mol. Biol. Lett. 2019; 24:38. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Faubert B., Solmonson A., DeBerardinis R.J.. Metabolic reprogramming and cancer progression. Science. 2020; 368:eaaw5473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Villicana C., Cruz G., Zurita M.. The basal transcription machinery as a target for cancer therapy. Cancer Cell Int. 2014; 14:18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Lobry C., Oh P., Mansour M.R., Look A.T., Aifantis I.. Notch signaling: switching an oncogene to a tumor suppressor. Blood. 2014; 123:2451–2459. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Bi G., Bian Y., Liang J., Yin J., Li R., Zhao M., Huang Y., Lu T., Zhan C., Fan H.et al.. Pan-cancer characterization of metabolism-related biomarkers identifies potential therapeutic targets. J. Transl. Med. 2021; 19:219. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Gentles A.J., Newman A.M., Liu C.L., Bratman S.V., Feng W., Kim D., Nair V.S., Xu Y., Khuong A., Hoang C.D.et al.. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat. Med. 2015; 21:938–945. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76. Wu X., Zhu L., Guo J., Zhang D.Y., Lin K.. Prediction of yeast protein-protein interaction network: insights from the Gene Ontology and annotations. Nucleic Acids Res. 2006; 34:2137–2150. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T.et al.. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000; 25:25–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Chen C.C., Lin C.Y., Lo Y.S., Yang J.M.. PPISearch: A web server for searching homologous protein-protein interactions across multiple species. Nucleic Acids Res. 2009; 37:W369–W375. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Clark N.R., Hu K.S., Feldmann A.S., Kou Y., Chen E.Y., Duan Q., Ma’ayan A.. The characteristic direction: a geometrical approach to identify differentially expressed genes. BMC Bioinf. 2014; 15:79. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Yamada S., Ohira M., Horie H., Ando K., Takayasu H., Suzuki Y., Sugano S., Hirata T., Goto T., Matsunaga T.et al.. Expression profiling and differential screening between hepatoblastomas and the corresponding normal livers: identification of high expression of the PLK1 oncogene as a poor-prognostic indicator of hepatoblastomas. Oncogene. 2004; 23:5901–5911. [DOI] [PubMed] [Google Scholar]
  • 81. Maire V., Nemati F., Richardson M., Vincent-Salomon A., Tesson B., Rigaill G., Gravier E., Marty-Prouvost B., De Koning L., Lang G.et al.. Polo-like kinase 1: A potential therapeutic option in combination with conventional chemotherapy for the management of patients with triple-negative breast cancer. Cancer Res. 2013; 73:813–823. [DOI] [PubMed] [Google Scholar]
  • 82. Jeong S.B., Im J.H., Yoon J.H., Bui Q.T., Lim S.C., Song J.M., Shim Y., Yun J., Hong J., Kang K.W.. Essential role of polo-like kinase 1 (Plk1) oncogene in tumor growth and metastasis of tamoxifen-resistant breast cancer. Mol. Cancer Ther. 2018; 17:825–837. [DOI] [PubMed] [Google Scholar]
  • 83. Neumayer G., Belzil C., Gruss O.J., Nguyen M.D.. TPX2: of spindle assembly, DNA damage response, and cancer. Cell. Mol. Life Sci. 2014; 71:3027–3047. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Imai K., Hirata S., Irie A., Senju S., Ikuta Y., Yokomine K., Harao M., Inoue M., Tomita Y., Tsunoda T.et al.. Identification of HLA-A2-restricted CTL epitopes of a novel tumour-associated antigen, KIF20A, overexpressed in pancreatic cancer. Br. J. Cancer. 2011; 104:300–307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Khongkow P., Gomes A.R., Gong C., Man E.P., Tsang J.W., Zhao F., Monteiro L.J., Coombes R.C., Medema R.H., Khoo U.S.et al.. Paclitaxel targets FOXM1 to regulate KIF20A in mitotic catastrophe and breast cancer paclitaxel resistance. Oncogene. 2016; 35:990–1002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86. Xiong M., Zhuang K., Luo Y., Lai Q., Luo X., Fang Y., Zhang Y., Li A., Liu S.. KIF20A promotes cellular malignant behavior and enhances resistance to chemotherapy in colorectal cancer through regulation of the JAK/STAT3 signaling pathway. Aging (Albany NY). 2019; 11:11905–11921. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87. Sunaga N., Kaira K.. Epiregulin as a therapeutic target in non-small-cell lung cancer. Lung Cancer (Auckl). 2015; 6:91–98. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88. Karki R., Man S.M., Malireddi R.K.S., Kesavardhana S., Zhu Q., Burton A.R., Sharma B.R., Qi X., Pelletier S., Vogel P.et al.. NLRC3 is an inhibitory sensor of PI3K-mTOR pathways in cancer. Nature. 2016; 540:583–587. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89. Ting J.P., Willingham S.B., Bergstralh D.T.. NLRs at the intersection of cell death and immunity. Nat. Rev. Immunol. 2008; 8:372–379. [DOI] [PubMed] [Google Scholar]
  • 90. Draghici S., Nguyen T.M., Sonna L.A., Ziraldo C., Vanciu R., Fadel R., Morrison A., Kenney R.M., Alangaden G., Ramesh M.et al.. COVID-19: disease pathways and gene expression changes predict methylprednisolone can improve outcome in severe cases. Bioinformatics. 2021; 37:2691–2698. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91. Zhou Z., Ren L., Zhang L., Zhong J., Xiao Y., Jia Z., Guo L., Yang J., Wang C., Jiang S.et al.. Heightened Innate Immune Responses in the Respiratory Tract of COVID-19 Patients. Cell Host Microbe. 2020; 27:883–890. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92. Lieberman N.A.P., Peddu V., Xie H., Shrestha L., Huang M.L., Mears M.C., Cajimat M.N., Bente D.A., Shi P.Y., Bovier F.et al.. In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age. PLoS Biol. 2020; 18:e3000849. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93. Wilk A.J., Rustagi A., Zhao N.Q., Roque J., Martinez-Colon G.J., McKechnie J.L., Ivison G.T., Ranganath T., Vergara R., Hollis T.et al.. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 2020; 26:1070–1076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94. Shaath H., Vishnubalaji R., Elkord E., Alajez N.M.. Single-cell transcriptome analysis highlights a role for neutrophils and inflammatory macrophages in the pathogenesis of severe COVID-19. Cells. 2020; 9:2374. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95. Kuleshov M.V., Stein D.J., Clarke D.J.B., Kropiwnicki E., Jagodnik K.M., Bartal A., Evangelista J.E., Hom J., Cheng M., Bailey A.et al.. The COVID-19 Drug and Gene Set Library. Patterns (N Y). 2020; 1:100090. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96. Boutet E., Lieberherr D., Tognolli M., Schneider M., Bairoch A.. UniProtKB/Swiss-Prot. Methods Mol. Biol. 2007; 406:89–112. [DOI] [PubMed] [Google Scholar]
  • 97. Pinero J., Bravo A., Queralt-Rosinach N., Gutierrez-Sacristan A., Deu-Pons J., Centeno E., Garcia-Garcia J., Sanz F., Furlong L.I.. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 2017; 45:D833–D839. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98. Croft D., Mundo A.F., Haw R., Milacic M., Weiser J., Wu G., Caudy M., Garapati P., Gillespie M., Kamdar M.R.et al.. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014; 42:D472–D477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99. Piñero J., Bravo À., Queralt-Rosinach N., Gutiérrez-Sacristán A., Deu-Pons J., Centeno E., García-García J., Sanz F., Furlong L.I.. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 2017; 45:D833–d839. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100. Wishart D.S., Knox C., Guo A.C., Shrivastava S., Hassanali M., Stothard P., Chang Z., Woolsey J.. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006; 34:D668–D672. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101. Papadatos G., Overington J.P.. The ChEMBL database: a taster for medicinal chemists. Future Med Chem. 2014; 6:361–364. [DOI] [PubMed] [Google Scholar]
  • 102. Chen X., Liu M., Gilson M.K.. BindingDB: a web-accessible molecular recognition database. Comb. Chem. High Throughput Screen. 2001; 4:719–725. [DOI] [PubMed] [Google Scholar]
  • 103. Meldal B.H.M., Bye A.J.H., Gajdos L., Hammerova Z., Horackova A., Melicher F., Perfetto L., Pokorny D., Lopez M.R., Turkova A.et al.. Complex Portal 2018: extended content and enhanced visualization tools for macromolecular complexes. Nucleic Acids Res. 2019; 47:D550–D558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104. Franz M., Lopes C.T., Huck G., Dong Y., Sumer O., Bader G.D.. Cytoscape.js: a graph theory library for visualisation and analysis. Bioinformatics. 2016; 32:309–311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105. Ma J., Shojaie A., Michailidis G.. A comparative study of topology-based pathway enrichment analysis methods. BMC Bioinf. 2019; 20:546. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106. Szklarczyk D., Gable A.L., Lyon D., Junge A., Wyder S., Huerta-Cepas J., Simonovic M., Doncheva N.T., Morris J.H., Bork P.et al.. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019; 47:D607–D613. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107. Chen S., Yang D., Lei C., Li Y., Sun X., Chen M., Wu X., Zheng Y.. Identification of crucial genes in abdominal aortic aneurysm by WGCNA. PeerJ. 2019; 7:e7873. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108. Chen Z., Liu G., Hossain A., Danilova I.G., Bolkov M.A., Liu G., Tuzankina I.A., Tan W.. A co-expression network for differentially expressed genes in bladder cancer and a risk score model for predicting survival. Hereditas. 2019; 156:24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109. Obayashi T., Kodate S., Hibara H., Kagaya Y., Kinoshita K.. COXPRESdb v8: an animal gene coexpression database navigating from a global view to detailed investigations. Nucleic Acids Res. 2023; 51:D80–D87. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110. Stelzer G., Rosen N., Plaschkes I., Zimmerman S., Twik M., Fishilevich S., Stein T.I., Nudel R., Lieder I., Mazor Y.et al.. The GeneCards Suite: from Gene Data Mining to Disease Genome Sequence Analyses. Curr. Protoc. Bioinformatics. 2016; 54:1.30.1–1.30.33. [DOI] [PubMed] [Google Scholar]
  • 111. Maleki F., Ovens K., Hogan D.J., Kusalik A.J.. Gene Set Analysis: challenges, Opportunities, and Future Research. Front. Genet. 2020; 11:654. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112. Donato M., Xu Z., Tomoiaga A., Granneman J.G., Mackenzie R.G., Bao R., Than N.G., Westfall P.H., Romero R., Draghici S.. Analysis and correction of crosstalk effects in pathway analysis. Genome Res. 2013; 23:1885–1893. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113. DeNardo D.G., Andreu P., Coussens L.M.. Interactions between lymphocytes and myeloid cells regulate pro- versus anti-tumor immunity. Cancer Metastasis Rev. 2010; 29:309–316. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114. Shapiro E., Biezuner T., Linnarsson S.. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat. Rev. Genet. 2013; 14:618–630. [DOI] [PubMed] [Google Scholar]
  • 115. Navin N., Kendall J., Troge J., Andrews P., Rodgers L., McIndoo J., Cook K., Stepansky A., Levy D., Esposito D.et al.. Tumour evolution inferred by single-cell sequencing. Nature. 2011; 472:90–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116. Barabasi A.L., Oltvai Z.N.. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 2004; 5:101–113. [DOI] [PubMed] [Google Scholar]
  • 117. Parsana P., Ruberman C., Jaffe A.E., Schatz M.C., Battle A., Leek J.T.. Addressing confounding artifacts in reconstruction of gene co-expression networks. Genome Biol. 2019; 20:94. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

gkad1187_supplemental_file

Data Availability Statement

The Gscore online tool is available at https://gscore.ibsb.nycu.edu.tw/. This online tool is free and open to all users and there is no login requirement. Gscore and the code used for analysis are implemented in Python and available at https://github.com/SysMednet/Gscore (permanent DOI: https://doi.org/10.5281/zenodo.8260953). The accession numbers of the gene expression datasets assembled from TCGA and GEO databases are listed in Supplementary Tables S1, S2 and S3.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES