Abstract
Cancer transcriptomic data are widely leveraged to evaluate the prognostic relevance of targeted genes. However, most basic and translational studies continue to rely on univariable survival analysis, which often fails to capture the full prognostic potential of genes or account for their biological context. Recognizing the complexity of revealing multifaceted prognostic effects, especially when incorporating covariates and variable thresholds, we present the Cancer Gene Prognosis Atlas (CGPA), an interactive tool specifically designed for basic and molecular cancer researchers. CGPA provides an intuitive, user-friendly interface that enables in-depth, customizable prognostic analysis across cancer types. Beyond single-gene analyses, it supports data-driven exploration of gene pairs and gene-hallmark relationships, providing insights into key mechanisms such as synthetic lethality and immunosuppression. CGPA further extends its capabilities to assess multi-gene panels using both public and user-provided data and includes a dedicated portal for cancer immunotherapy datasets. Collectively, CGPA’s comprehensive yet user-friendly toolkit empowers researchers to interrogate the prognostic landscape of genes with precision, tailor analyses to specific biological hypotheses, and accelerate biomarker discovery and validation through the integration of both mechanistic-informed and data-driven approaches.
INTRODUCTION
Tumor transcriptome databases have become indispensable resources for both preclinical research and clinical testing. Despite the rapidly growing amount of data in repositories such as GEO and SRA, the Cancer Genome Atlas (TCGA) remains the largest and most commonly used pan-cancer database, which contains curated transcriptome data from over 11,000 primary cancer samples across 33 major cancer types1,2. While the official TCGA portal (Genomic Data Commons, or GDC) provides easy access to these data, users still need considerable data-processing skills to effectively navigate and integrate the information. For example, Level-3 RNA-seq gene expression (e.g., RPKM or TPM) data are stored in multiple files per patient. To address this, cBioPortal3 has become a popular and user-friendly alternative for initial data exploration. It provides comprehensive gene and mutation spectrum analyses but offers limited functionality for evaluating associations between gene expression (GE) and clinical outcomes. Another online platform, GEPIA4, provides advanced tools for gene-based prognostic analysis using TCGA and GTEx data. However, access to this resource is restricted for some U.S. institutions due to its server location.
Analyzing continuous gene expression data presents distinct challenges compared to binary mutational data. The prognostic modelling of gene expression data is inherently sensitive to cutoff selection, yet most current tools rely on fixed cutoffs (e.g., median or quartiles) with limited flexibility. The threshold used to classify patients into risk groups can significantly impact study outcomes, potentially leading to different conclusions even with identical datasets. Many published studies, especially in basic science settings, have only shown the univariable association between survival and GE levels without adjusting for any clinical covariates, typically visualized by the separation of Kaplan-Meier (KM) curves from high- and low-GE patient groups. Such practices can yield inaccurate or even misleading conclusions, especially because the prognostic association can be confounded by many factors such as sex, known cancer subtypes, tumor grade, and stage.
A second motivation stems from the observed limitations of current tools for constructing multi-gene prognostic signatures. Many investigators are increasingly transitioning from genome-wide searches to hypothesis-driven approaches, focusing on specific gene pathways and gene panels that represent targeted biological mechanisms. However, the development and validation of such panels require sophisticated analytical tools that can cohesively evaluate the joint impact of all included genes. Finally, most existing cancer GE data are derived from bulk tumor tissue and are strongly influenced by cellular composition, including tumor purity and immune cell infiltration in the tumor microenvironment (TME). Accordingly, many top survival-associated genes, especially in immunogenic cancer types, are not ‘functional’ themselves but merely correlated with TME features. We reason that this issue can be alleviated by incorporating multiple genes and molecular scores simultaneously in the survival model. In line with this multi-gene perspective, there is growing interest in gene-pair analysis, which can reveal synergistic or antagonistic gene interactions that shape cancer progression and outcomes. One such example is synthetic lethality, where combined deficiencies in two genes lead to cell death and improved patient survival.
In response to these challenges, we introduce CGPA, an online tool designed to address the complex needs of gene-centric biomarker discovery and validation in real-world analyses. CGPA provides fast access and an intuitive query interface for assessing the prognostic significance of gene signatures at the pan-cancer level. The tool fills a gap in the community as the first interactive tool for generating publication-ready results from multivariable and multi-gene survival models. CGPA contains several innovative functions to analyze the correlation structure among targeted genes and to generate and test pooled GE signatures from customizable gene panels. Taken together, CGPA is a timely and unique resource that will complement existing databases in cancer genomics and will immediately facilitate gene validation, prognostic signature discovery, and therapeutic target exploration.
MATERIALS AND METHODS
Pan-cancer gene expression and survival data
We leveraged the Pan-Cancer normalized gene expression data from the PanCancer Atlas (RRID: SCR_014514; https://gdc.cancer.gov/about-data/publications/pancanatlas), which covers 20,501 genes across 33 cancer types and 10,074 patients. In these 33 cancer types, eight cancers (LUAD, BRCA, HNSC, KIRC, LGG, LUSC, THCA, UCEC) have over 500 samples, and ten other cancer types have more than 200 samples, providing a robust resource for pan-cancer prognostic analysis. In the analysis of gene expression data, we retained only samples labeled as “01” (primary tumor, preferred) and “06” (metastatic). If a patient had multiple primary tumor samples, we selected the one with the highest total gene expression to ensure consistency in downstream analyses. The survival outcomes used in this research were based on data from the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR). This resource offers a standardized dataset that incorporates curated survival endpoints (including OS, PFI, DFI, and DSS), along with endpoint usage recommendations tailored for each cancer type, which we followed in all analyses.
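The sample-selection rule described above can be sketched as follows. This is a minimal illustration, not the CGPA source code: the function name, the input layout (barcode mapped to a list of expression values), and the reuse of the highest-total-expression rule for patients who have only metastatic samples are assumptions made for the example.

```python
from collections import defaultdict

def select_samples(expr_by_barcode, keep_types=("01", "06")):
    """Pick one sample per patient (hypothetical helper).

    expr_by_barcode maps a TCGA barcode such as 'TCGA-AA-0001-01A'
    to a list of expression values for that sample.
    """
    by_patient = defaultdict(list)
    for barcode, values in expr_by_barcode.items():
        parts = barcode.split("-")
        patient = "-".join(parts[:3])   # first three fields identify the patient
        sample_type = parts[3][:2]      # "01" = primary tumor, "06" = metastatic
        if sample_type in keep_types:
            by_patient[patient].append((barcode, sample_type, sum(values)))

    chosen = {}
    for patient, samples in by_patient.items():
        # Primary tumor samples are preferred; metastatic samples are a fallback.
        primaries = [s for s in samples if s[1] == "01"]
        pool = primaries if primaries else samples
        # Among candidates, keep the sample with the highest total expression.
        # (Applying this rule to metastatic-only patients is an assumption.)
        chosen[patient] = max(pool, key=lambda s: s[2])[0]
    return chosen
```

Samples of any other type (for example, "11" for solid tissue normal) are dropped before the per-patient choice is made.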
Pan-cancer gene-based survival analysis
We conducted a pan-cancer gene-based survival analysis using univariable Cox modeling, focusing on overall survival (OS) and progression-free interval (PFI). The results were visually represented using forest plots, with cancer types ranked based on hazard ratios (HR). HR<1 indicated protective effects, while HR>1 suggested shorter survival times; significance was determined at α = 0.05. For cancer types that exhibited significant associations in the forest plots, we further explored the survival outcomes using KM plots based on the optimal cutoff of gene expression values. The optimal cutoff was determined using the “surv_cutpoint” function implemented in the ‘survminer’ package, which employs the maximally selected rank statistics to identify the cutoff that separates patients into groups with most significantly different survival outcomes. To investigate gene expression profiles across various cancer types and normal tissues, we used circular bar plots to rank cancer types based on normalized gene expression values. To gain insights into the biological and pathway context of the selected gene, a protein-protein interaction (PPI) network was constructed using the STRING (RRID: SCR_005223) API to contextualize gene function and pathways, allowing users to specify the number of interacting genes (default value: 10). These components on the CGPA prognosis snapshot page deliver an integrated summary of pan-cancer gene-based survival and expression patterns.
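To make the idea of a data-driven cutoff concrete, the sketch below scans candidate expression cutoffs and keeps the one with the largest two-group log-rank chi-square statistic. It is a simplified stand-in written for illustration and is not the surv_cutpoint implementation, which relies on maximally selected rank statistics; the function names and the minimum group proportion of 0.1 are assumptions of this example.

```python
def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp, variance = 0.0, 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_high = sum(1 for i in at_risk if in_high[i])
        d = sum(1 for i in at_risk if times[i] == t and events[i])
        d_high = sum(1 for i in at_risk
                     if times[i] == t and events[i] and in_high[i])
        obs_minus_exp += d_high - d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0

def scan_cutoffs(expr, times, events, min_prop=0.1):
    """Return (cutoff, statistic) maximizing the log-rank statistic.

    Patients with expr > cutoff form the 'high' group; each group must
    contain at least min_prop of the patients (and at least one).
    """
    n = len(expr)
    min_n = max(1, int(min_prop * n))
    best_cut, best_stat = None, -1.0
    for cut in sorted(set(expr))[:-1]:
        high = [x > cut for x in expr]
        n_high = sum(high)
        if n_high < min_n or n - n_high < min_n:
            continue
        stat = logrank_chi2(times, events, high)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat
```

Because the cutoff is chosen to maximize group separation, the statistic at the selected cutoff is optimistically biased, and a nominal p-value computed from it should not be read as a pre-specified test.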
Single-gene multivariable analysis
To facilitate a more comprehensive single-gene multivariable analysis of gene-based survival, we provide a dedicated interactive survival analysis page equipped with highly customizable functions. This user-friendly page returns high-resolution KM plots and, beneath each plot, hazard ratios, 95% confidence intervals, and p-values from Cox proportional hazards models. Users can select key covariates to include in the multivariable model, including age, sex, tumor purity, and cytotoxic lymphocyte (CTL) scores. To ensure consistent estimation of tumor purity, we downloaded a complete set of tumor purity data for all TCGA samples, which can be accessed at the following link: http://cistrome.org/TIMER/misc/AGPall.zip. Eight TCGA cancer types lacked purity estimates and were excluded from models incorporating this covariate. Additionally, for seven sex-specific cancer types (CESC, OV, PRAD, TGCT, UCEC, UCS, and BRCA), sex was not integrated into the analysis. Gene expression stratification for KM visualization is customizable: median, quartiles, data-driven optimal cutoff (via the surv_cutpoint function in the survminer R package), or a user-specified cutoff. All modeling is performed using the coxph function from the survival R package. In the multivariable analysis tab, we also compiled a pre-calculated list of the top prognostic genes per cancer type for mRNA and long non-coding RNA (lncRNA), analyzed separately due to distinct data preprocessing.
Gene hallmark analysis
In our gene-hallmark analysis, we considered age, sex, the top mutated genes, copy number variations (CNV) specific to each cancer type, tumor mutational burden (TMB), immune scores (including CYT and CTL), and the androgen receptor (AR). We explored two models to test the interaction between these hallmarks and the selected gene: a full model (Hazard ~ a.Gene + b.Hallmark + c.Gene × Hallmark) and a partial model (Hazard ~ a.Gene + c.Gene × Hallmark). Continuous hallmarks (e.g., TMB, CTL) were modeled as continuous predictors; categorical hallmarks (mutation/CNA) were encoded as binary variables. The significance of interaction effects was assessed using Wald tests, as implemented in the coxph function from the survival R package. All gene set signatures (e.g., CYT and CTL) were calculated based on mean values or single-sample gene set enrichment analysis (ssGSEA) using the GSVA R package. By default, hallmark scores such as CTL and CYT were calculated using predefined immune-related gene sets. Users may also upload pre-computed hallmark scores. All analyses offered the option to adjust for tumor purity. Results are visualized using circos plots, which highlight significant gene-hallmark interactions at P < 0.05.
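Written out in standard Cox proportional hazards notation, the two models and the interaction test take the form below. The symbols are introduced here for illustration only: G is the gene expression value, H the hallmark value, h_0(t) the unspecified baseline hazard, and a, b, c the regression coefficients; an optional tumor purity term would enter the exponent as one more additive covariate.

```latex
\begin{align}
\text{Full model:}\quad
  & h(t \mid G, H) = h_0(t)\,\exp\!\left(a\,G + b\,H + c\,G \times H\right) \\
\text{Partial model:}\quad
  & h(t \mid G, H) = h_0(t)\,\exp\!\left(a\,G + c\,G \times H\right) \\
\text{Wald test of } c = 0:\quad
  & z = \frac{\hat{c}}{\operatorname{SE}(\hat{c})}
\end{align}
```

Under the null hypothesis of no interaction, z is approximately standard normal, so exp(c) can be read as the multiplicative change in the gene's hazard ratio per unit increase in the hallmark.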
Alternative splicing prognostic analysis
CGPA includes ProgSplicing, a module for exploring the pan-cancer prognostic landscape of alternative splicing (AS) events. Splicing data were obtained from OncoSplicing.com5, which integrates two large-scale projects: the TCGA SpliceSeq (RRID:SCR_005267) project and the TCGA SplAdder project. For each AS event, CGPA fits a univariable Cox regression using percent spliced-in (PSI) values as continuous predictors of overall survival. Results are summarized in a pan-cancer dot plot (rows = AS events; columns = TCGA cancer types). Associations are color-coded: blue dots indicate a significant association (P < 0.05) with favorable prognosis (HR < 1), red dots indicate a significant association with poorer prognosis (HR > 1), and gray dots indicate non-significant associations.
CGPA two-gene analysis
The CGPA toolkit includes a module for gene-pair prognostic analysis, incorporating both KM plots and bivariable Cox regression analyses. Similar to the single-gene module, expression can be dichotomized by median, quartiles, optimal cutoffs, or user-defined cutoffs. When an interaction is present, the survival curves diverge within strata of the anchoring gene (see Figure 4). This module thus enables exploration of phenomena such as synthetic viability and lethality. Synthetic viability, as observed with the VPS37A-DCLRE1B gene pair, reflects scenarios where high expression of both genes correlates with improved survival. Synthetic lethality, exemplified by UBE2Z-RNF117A, occurs when low expression of both genes predicts a favorable prognosis, potentially informing targeted therapy development. The module also incorporates an innovative feature to examine the prognostic value of expression ratios between two genes, which can detect interaction relationships missed by additive models. A built-in correlation analysis between the two genes is also provided, allowing users to evaluate co-expression as contextual information before performing survival analysis.
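As an illustration of the expression-ratio feature, the sketch below assumes that both genes are already on a log2 scale, so that the log-ratio is simply the difference of the two values, and that patients are split at the median ratio. Both choices, along with the function name, are assumptions made for this example rather than a description of the CGPA implementation.

```python
from statistics import median

def log_ratio_groups(log_expr_a, log_expr_b):
    """Compute per-patient log-ratios (gene A minus gene B) and split at the median.

    Inputs are log2-scale expression values for the same patients, in the
    same order. Returns the ratios, the cutoff, and a 'high'/'low' label list.
    """
    ratios = [a - b for a, b in zip(log_expr_a, log_expr_b)]
    cutoff = median(ratios)
    groups = ["high" if r > cutoff else "low" for r in ratios]
    return ratios, cutoff, groups
```

The resulting labels can then be passed to any two-group survival comparison in place of a single-gene stratification.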
Figure 4.

Prognostic relevance of gene-pair interactions as analyzed by CGPA. (A) Kaplan-Meier plot illustrating the survival impact of LAG3 levels in patients with high TCF7 expression, revealing a positive correlation between high LAG3 and improved survival, thereby indicating the prognostic significance of the TCF7-LAG3 interaction. (B) Kaplan-Meier plot for the TCF7-low group showing the lack of survival distinction based on varying LAG3 levels, suggesting a significant interaction effect of TCF7 and LAG3 on patient outcomes. (C) Visualization of synthetic viability with the VPS37A and DCLRE1B gene pair, where high expression levels of both genes correspond to better patient survival. (D) Representation of synthetic lethality in the context of UBE2Z and RNF117A gene pair, where low expression levels of both genes are associated with improved survival, implying a potential therapeutic strategy of dual-gene targeting.
Multi-gene prognostic analysis
The CGPA introduces a module for multi-gene survival analysis, designed to dissect complex gene networks and gene sets with heterogeneous functions. To evaluate the prognostic significance of gene sets, we apply ssGSEA and a simple unweighted mean value of normalized gene expression values as methods to calculate combined scores. The framework allows for the utilization of both univariable Cox regression analysis and machine learning-based methods, specifically the gradient boosting method6, to identify top prognostic genes within the input set. Genes are ranked based on feature importance, enabling identification of the most influential predictors within the panel. The exploration of gene networks and subnetworks is conducted using EGAnet7, which identifies gene communities (or subnetworks) based on pairwise correlation matrices, using the Walktrap clustering algorithm by default. EGAnet classifies genes into subnetworks according to interaction patterns, thereby elucidating potential co-regulation or interaction mechanisms. To aid interpretation, CGPA supports pairwise inter-gene correlation visualization through the igraph R package, allowing users to explore network connectivity and co-expression within the input gene set. In addition, Principal Component Analysis (PCA) is used to examine the internal structure of the gene set. The PCA loading plot displays the contribution of each gene to the first two principal components, reflecting its correlation strength with the underlying variance. The angle between loading vectors indicates the degree of pairwise correlation (smaller angles suggest stronger positive correlations), offering an intuitive visual representation of gene-gene relationships. Together, these modules offer a flexible and integrated framework for both hypothesis-driven and data-driven discovery of gene set signatures in cancer.
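The unweighted mean score mentioned above is simple enough to sketch directly. The function name and the input layout (gene symbol mapped to per-sample normalized values) are hypothetical, and the choice to silently skip panel genes that are absent from the expression matrix is an assumption of this example.

```python
def mean_signature_score(expr, genes):
    """Per-sample unweighted mean of normalized expression over a gene panel.

    expr maps gene symbol -> list of per-sample values (same sample order
    for every gene). Panel genes missing from expr are skipped.
    """
    present = [g for g in genes if g in expr]
    if not present:
        raise ValueError("none of the panel genes are in the expression data")
    n_samples = len(expr[present[0]])
    return [
        sum(expr[g][i] for g in present) / len(present)
        for i in range(n_samples)
    ]
```

Unlike ssGSEA, which is rank-based within each sample, a plain mean weights every gene equally on the expression scale, so highly expressed genes can dominate unless values are standardized first.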
Data and code availability
CGPA is publicly available as a web-based interactive tool at https://cgpa.moffitt.org/. This platform provides researchers with easy access to a comprehensive suite of tools for prognostic gene discovery and analysis in immunotherapy studies. The source code and associated R packages used in the development of CGPA have been deposited in a GitHub repository: https://github.com/wanglab1/CGPA. We also provide a Docker image, available on Figshare: https://figshare.com/articles/dataset/CGPA/30048187?file=57645685, which allows users to download and deploy the entire application locally. Additionally, all data included in the CGPA web tool, including both PanCancer TCGA datasets and immunotherapy datasets, are available for download at https://github.com/wanglab1/CGPA/tree/main/PanCancer_data and https://github.com/wanglab1/CGPA/tree/main/IOtherapy_data. The TCGA CDR outcome data (recommended for survival analysis) were downloaded from the “TCGA-CDR-SupplementalTableS1.xlsx” file available on the PanCanAtlas website: https://gdc.cancer.gov/about-data/publications/pancanatlas. The alternative splicing datasets generated by SpliceSeq and SplAdder were acquired from oncosplicing.com, with supplementary data from all cancer types (not available on the website) provided by the OncoSplicing author team. The immune checkpoint inhibitor (ICI) datasets preprocessed and analyzed by this study are available from the Gene Expression Omnibus (GEO, RRID:SCR_005012) repository under the accession numbers: GSE111636, GSE248167, GSE173839, GSE194040, GSE241876, GSE179730, GSE162137, GSE165252, GSE159067, GSE195832, GSE67501, GSE135222, GSE221733, GSE126044, GSE100797, GSE115821, GSE131521, GSE78220, GSE91061, GSE96619, GSE202687.
We also included pre-analyzed gene expression data from large ICI projects, including IMvigor210 (based on the R package “IMvigor210CoreBiologies”), harmonized KIRC datasets from the supplementary files of publication PMID: 32472114, JAVELIN Renal 101 from the supplementary files of publication PMID: 32895571, harmonized SKCM datasets (available at https://github.com/ParkerICI/MORRISON-1-public), and Kallisto data from https://github.com/hammerlab/multi-omic-urothelial-anti-pdl1/tree/master.
RESULTS
CGPA is a proactive and interactive tool for gene-centric prognostic analysis
The development of the CGPA platform (https://cgpa.moffitt.org/) was driven by the observation that many published studies depend solely on univariable Kaplan-Meier plots for the exploration and validation of the prognostic relevance of specific genes. Within this context, GraphPad Prism remains the most common tool among clinicians and non-statisticians. One common misconception is that multivariable analysis, by introducing more covariates and thus more degrees of freedom, might reduce statistical power and render marginally significant findings less significant or non-significant. In contrast, CGPA’s proactive multivariable analysis, by adjusting for key covariates, reveals additional prognostic gene signals that may otherwise be overlooked. Building on this concept, CGPA incorporates a suite of innovative functions designed to advance gene-centric prognostic analysis. Key distinguishing features include: (1) CGPA provides a holistic overview of the pan-cancer prognostic landscape, enabling users to assess gene-level prognostic value across different cancers. (2) It offers customized gene-based survival analyses by including flexible cutoffs in Kaplan-Meier plots, and covariate-adjusted Kaplan-Meier and Cox models. (3) It provides a dedicated “ProgSplicing” tab for pan-cancer prognostic analysis of alternative splicing events. (4) The platform supports exploration of gene pairs and gene-hallmark interactions, revealing mechanisms like synthetic lethality and immunosuppression that are pivotal to cancer biology. (5) Additionally, CGPA includes comprehensive multi-gene panel analysis through a mechanism-to-machine approach. Users can upload their own preprocessed dataset for custom prognostic analysis. We recommend using log2(TPM+1) transformed expression values accompanied by a separate phenotype file containing survival outcomes and covariates, formatted as specified on the website.
As illustrated in Figure 1, the CGPA web application contains three main modules. At the forefront is single-gene prognostic discovery. Factors like tumor purity, patient heterogeneity and hidden confounding variables might distort the association results. However, incorporating additional factors can lead to a reduction in statistical power. Therefore, a balanced use of both univariable and multivariable approaches provides a more robust and reliable assessment of prognostic significance. The second module focuses on gene-pair interactions. When an investigator has a specific target gene in mind, analyzing its interaction with another gene might offer additional context. This two-gene model facilitates the study of mechanisms such as synthetic lethality and synthetic viability. In contrast, most existing tools provide limited support for two-gene interactions8. The third module is dedicated to multi-gene panel discovery, addressing challenges in heterogeneous pathways where aggregate scores can obscure gene-level effects. CGPA overcomes this limitation by segmenting large gene groups into smaller, biologically relevant subnetworks. We summarize the key results from each module below to illustrate their functionality and utility.
Figure 1.

Overview of main functions in CGPA to facilitate gene-centric prognostic biomarker discovery and validation.
Single-gene module and UMOP genes
When a gene is queried in the search bar, CGPA first renders a prognostic summary across various cancer types. The prognostic significance of the gene is visualized through a forest plot ranked by the estimated effect size (hazard ratio), providing a pan-cancer overview of how the gene expression impacts OS and PFI. Each line in the forest plot represents a cancer type, with the position and length of the line indicating the estimated HR and its confidence interval, respectively. A line crossing the vertical line of no effect (hazard ratio of 1) suggests no significant prognostic effect, whereas lines lying entirely to the left or right of it indicate favorable or adverse associations, respectively. As an ad-hoc exploration, cancer types with significant prognostic effects are further explored using Kaplan-Meier plots, stratified by optimal cutoff points. These cutoff points are identified through Maximally Selected Rank Statistics9,10, which selects the threshold maximizing the difference in survival between high and low expression groups. However, this approach carries a risk of overfitting, and researchers should be cautious when the stratified groups are imbalanced or show an effect direction inconsistent with the HR estimate. The prognostic summary panel also incorporates figures that display cancer-level gene expression profiles in tumor and normal tissues, alongside the protein-protein interaction network based on the STRING database. A more customized and detailed examination of gene-based analysis is available through the “multivariable analysis” tab, where users can explore the impact of a gene on patient survival while adjusting for clinical covariates and tumor purity. This tab also displays results from covariate-adjusted KM functions11 alongside multivariable Cox outputs.
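The reading of a forest plot line described above follows from how a Cox coefficient maps to a hazard ratio and its confidence interval. The sketch below is a generic illustration of that mapping, using the usual normal approximation; the function name and the label strings are hypothetical.

```python
import math

def hazard_ratio_summary(beta, se, z_crit=1.959964):
    """Convert a Cox log-hazard coefficient and its standard error
    into HR, a 95% confidence interval, and a forest-plot reading."""
    hr = math.exp(beta)
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    if lower > 1:
        label = "adverse"          # whole interval to the right of HR = 1
    elif upper < 1:
        label = "favorable"        # whole interval to the left of HR = 1
    else:
        label = "not significant"  # interval crosses HR = 1
    return hr, lower, upper, label
```

An interval that excludes 1 corresponds to a two-sided Wald p-value below 0.05 for the same coefficient.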
In our analysis of pan-cancer transcriptome data, we identified a set of genes with prognostic potential that were consistently missed by univariable Cox regression. These are referred to as univariable missed-opportunity prognostic (UMOP) genes. Below, we highlight representative findings from major immunogenic cancers, including bladder cancer (BLCA), head and neck squamous cell carcinoma (HNSC), lung adenocarcinoma (LUAD), skin cutaneous melanoma (SKCM), and uterine corpus endometrial carcinoma (UCEC). Figure 2 illustrates this contrast through bar plots comparing −log10 p-values from univariable Cox models and multivariable Cox models that adjust for age, sex, tumor purity, and CTL levels. In BLCA, UMOP biomarkers include genes involved in a variety of cellular or immune functions including extracellular matrix remodeling (MMP15, HPSE), metabolic alterations (PDXK, BCAT2), immune response (C1QB, C1QC), signal transduction (MAP3K7, CAMK4), and critical gene regulation (MYCL1, ZNF436). In HNSC, top UMOP genes missed by the univariable model include DOCK6, PSMB9, B2M, CPVL, and CCDC99. Notably, several well-known immune-related genes (HLA-C, HLA-H, HLA-B, and HLA-F) also showed prognostic significance in the multivariable analysis. This result highlights that specific immune gene signatures may convey independent prognostic value, setting them apart from the broader, often confounded correlations typically observed in immunogenic cancers. Immune genes such as IL15, IFNB1, and SERPINA1 were similarly identified as UMOP genes, providing additional insights into the immune landscape of HNSC. In LUAD, we uncovered a set of immune modulator genes (such as TAP1 and GBP1) and interferon signaling mediators (STAT1), pointing to a role of immune surveillance and response in disease progression.
In addition, we identified several genes integral to DNA replication and repair processes, such as PCNA and RFWD3, which may indicate genomic instability as a prognostic indicator in lung cancer. Furthermore, the identification of SOD2, a regulator of oxidative stress responses, and MT1X, which is involved in metal ion homeostasis, underscores the interplay between cellular defense mechanisms and metal regulation. In SKCM, our analysis highlights HMOX1 and FTL, which are also involved in the cellular response to oxidative stress and iron homeostasis, respectively. Additionally, genes such as SIRPA, SIRPB1, VIPR1, ITLN1, and SELENBP1 are known to contribute to the interactions between cancer cells and the host’s immune system. In UCEC, the UMOP analysis identified genes that are related to immune surveillance (e.g., MICB, CLEC2D, VCAM1, TNFAIP6, and TNFSF4) and genes that are related to cellular stress response (e.g., PPP2CA, GSTA2, HDAC8, and DNAJB6).
Figure 2.

Identification of UMOP Genes across five immunogenic cancers (BLCA, HNSC, LUAD, SKCM and UCEC). Bar plots display the −log10 p-values from both univariable (gray) and multivariable (light green) Cox regression analyses, highlighting genes that exhibit significant prognostic potential in the multivariable context but do not reach statistical significance in the univariable analysis. Each panel corresponds to one cancer type, with genes on the x-axis and –log10 p-values on the y-axis.
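The UMOP criterion stated in this caption (significant in the multivariable model but not in the univariable model) can be expressed as a simple filter over two sets of p-values. The sketch below is illustrative only: the function name is hypothetical, and the use of a single threshold of 0.05 for both models is an assumption based on the significance level used elsewhere in the text.

```python
def find_umop_genes(p_univariable, p_multivariable, alpha=0.05):
    """Return genes significant in the multivariable model only.

    Both arguments map gene symbol -> p-value for the gene term in the
    corresponding Cox model. Results are ordered by multivariable p-value.
    """
    hits = [
        gene
        for gene, p_multi in p_multivariable.items()
        if gene in p_univariable
        and p_multi < alpha
        and p_univariable[gene] >= alpha
    ]
    return sorted(hits, key=lambda g: p_multivariable[g])
```

Genes that are significant in both models, or in neither, are excluded by construction.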
To evaluate the biological relevance of the UMOP genes, we queried their functional significance based on the DepMap database, focusing on HNSC as an example. As shown in Supplementary Fig S1, LCE2A and MITD1 exhibited selective dependency in head and neck cancer cell lines, suggesting potential tumor-specific dependency. In contrast, CCDC99/SPDL1, IMMT, and THG1L demonstrated consistent dependency across multiple lineages, classifying them as common essential genes with broader functional roles. As a comparison, we assessed the prognostic value of these genes using three commonly used databases: the Human Protein Atlas, PRECOG, and GEPIA2. As expected, all five genes tended to be associated with worse survival outcomes in many cancer types, with higher gene expression often linked to poorer prognosis. LCE2A and MITD1 showed prognostic associations in a limited number of cancer types, whereas CCDC99, IMMT, and THG1L were associated with overall survival across a broader range of cancers. Among the three resources, the Human Protein Atlas reported the fewest significant associations, due to its use of a fixed and stringent significance threshold (P<0.001). Importantly, none of these genes was annotated as prognostic in HNSC by any of the three databases, further demonstrating the added value of UMOP analysis. Finally, as summarized in Supplementary Table S1, several UMOP genes from SKCM (e.g., CNDP2, IER5, CBFA2T3, and STAT1) and HNSC (e.g., STAT1 and IL15) exhibited significant associations with overall survival in published immunotherapy (ICI) trial datasets.
Gene-hallmark interaction
CGPA offers an innovative feature for exploring gene-hallmark interactions through a dedicated “Gene-Hallmark Interaction” tab. This functionality allows users to investigate the dynamic interplay between targeted genes and cancer hallmark signatures. The investigation of gene-cancer hallmark interactions is rooted in the evolving understanding of cancer as not merely a collection of random genetic aberrations, but as a complex biological system characterized by specific, definable traits, i.e., hallmarks of cancer12,13. The hallmark covariates integrate a wide array of factors including frequently mutated genes, CNV, TMB, immune scores (such as CYT14,15 and CTL8 levels), and the androgen receptor (AR) status. For each cancer type, we focused exclusively on the most prevalent mutations and copy number alterations (CNA) specific to that type; for instance, TP53 mutations and CDKN2A CNA in HNSC. Only alterations with a prevalence greater than 20% were included in each cancer type. The CYT, CTL, and other common immune scores were calculated based on predefined gene sets using ssGSEA16. The CYT score was calculated based on the expression of GZMA and PRF1, whereas the CTL score was computed from five genes: CD8A, CD8B, GZMA, GZMB, and PRF1. We implemented two distinct statistical models in this function tab: the Gene-Hallmark Interaction (GHI) full model and the GHI partial model. The full model evaluates the combined effect of genes and hallmarks, as well as their interactions, on patient survival (Hazard ~ a.Gene + b.Hallmark + c.Gene × Hallmark), while the partial model includes only the main effect of the gene and the interaction term (Hazard ~ a.Gene + c.Gene × Hallmark). This dual-model design allows users to explore the potential interplay of genes and hallmarks under different assumptions. All interaction analyses allow for tumor purity adjustment. The interactions are visualized through circos plots.
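The 20% prevalence rule for selecting mutation and CNA hallmarks can be sketched as below. The input layout (alteration name mapped to a per-patient 0/1 status list within one cancer type) and the function name are assumptions made for this illustration.

```python
def prevalent_alterations(status_by_alteration, min_prevalence=0.20):
    """Keep alterations whose prevalence strictly exceeds the threshold.

    status_by_alteration maps an alteration label to a list of 0/1 flags,
    one per patient in a given cancer type. Returns label -> prevalence.
    """
    kept = {}
    for name, status in status_by_alteration.items():
        prevalence = sum(status) / len(status)
        if prevalence > min_prevalence:  # "greater than 20%", so 20% itself is excluded
            kept[name] = prevalence
    return kept
```

Each retained alteration can then enter the interaction model as a binary hallmark variable.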
When examining immune CTL scores as a hallmark, the model aligns with methodologies used in TIDE8 and ENLIGHT17, providing insights into immune evasion and response mechanisms. Figure 3 illustrates an example of the pan-cancer gene-hallmark interactome for gene IGSF8, a novel innate immune checkpoint and a potential immunotherapy target.
Figure 3.

Pan-cancer gene-hallmark interactome of gene IGSF8. In the circos plot, significant interactions (at a significance level of 0.05) are represented as links. Positive interaction effects are indicated in red, while negative effects are shown in blue.
Because the CGPA platform is designed for user-defined queries, multiple testing correction is not applied by default, and users are encouraged to apply appropriate methods (e.g., FDR using p.adjust in R) based on their analysis scope.
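For readers working outside R, the Benjamini-Hochberg adjustment that p.adjust performs with method = "BH" can be reproduced in a few lines. The sketch below is an independent re-implementation written for illustration; the function name is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top          # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The number of tests m should reflect the full set of queries that were actually examined (for example, all hallmarks and cancer types screened for a gene), not only the results selected for reporting.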
Gene-pair interaction module
The CGPA gene-pair module enables researchers to concurrently assess the survival impact of two genes, a powerful component not offered by most existing transcriptomic databases. Users can click “Two gene search” on the main page, which opens a dedicated analysis portal tailored for in-depth prognostic analysis of the selected gene pair. As noted previously, this module supports one-gene-anchored or conditional survival modelling, facilitates interaction analysis critical for identifying synthetic lethality and synthetic viability, and supports multivariable Cox models that explicitly account for interaction18. Results are visualized through KM plots and bivariable Cox regression analyses, with the latter incorporating interaction terms. The combined use of KM and Cox models is essential in real-world analysis. While KM plots are highly interpretable, they can be sensitive to the choice of expression cutoff. As demonstrated in Figure 4, the conventional approach to investigating gene interaction effects involves comparing patient survival across the four distinct quadrants created by dichotomizing gene expression levels into high and low categories. Figures 4A and 4B illustrate the interaction between TCF7 and LAG3. TCF7, a transcription factor important for T-cell development and differentiation, and LAG3, an immune checkpoint that co-regulates T-cell activation, together highlight a complex regulatory mechanism that could be pivotal for immune evasion by tumors. Targeting the TCF7-LAG3 axis could improve existing immune checkpoint inhibitors by promoting a more robust and effective anti-tumor immune response19. In Figure 4A, patients with high TCF7 expression show a clear survival differentiation based on LAG3 levels, with high LAG3 levels being associated with better survival. In contrast, Figure 4B, focusing on the TCF7-low group, shows no such distinction, indicating an interaction effect between TCF7 and LAG3.
Figures 4C and 4D exemplify the concepts of synthetic lethality and synthetic viability. Synthetic lethality arises when the simultaneous low expression or inactivation of two genes leads to cell death, whereas the low expression or inactivation of either gene alone does not have this effect. Figure 4C presents a scenario of synthetic lethality with the UBE2Z and RNF117A gene pair, where patients with low expression of both genes exhibit improved survival compared to the other groups. This observation suggests that dual targeting of both genes may represent a potential therapeutic strategy. Figure 4D illustrates synthetic viability using the VPS37A and DCLRE1B gene pair. In this case, patients with high expression levels of both genes exhibit superior survival compared to other stratified groups.
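The four-quadrant comparison described above can be sketched as follows. This is an illustrative outline only, assuming a median cutoff for both genes; the function names, the "A"/"B" labels, and the toy data are hypothetical and do not reproduce CGPA's implementation, which also offers other cutoffs.

```python
from statistics import median


def quadrant_groups(expr_a, expr_b, cut_a=None, cut_b=None):
    """Label each sample by the high/low status of two genes (A and B)."""
    if len(expr_a) != len(expr_b):
        raise ValueError("expression vectors must have the same length")
    # Assumption: default to a median split when no cutoff is supplied.
    cut_a = median(expr_a) if cut_a is None else cut_a
    cut_b = median(expr_b) if cut_b is None else cut_b
    labels = []
    for a, b in zip(expr_a, expr_b):
        status_a = "high" if a > cut_a else "low"
        status_b = "high" if b > cut_b else "low"
        labels.append(f"A-{status_a}/B-{status_b}")
    return labels


def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    events: 1 = event observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving
    return curve


def km_by_quadrant(expr_a, expr_b, times, events):
    """Estimate one Kaplan-Meier curve per expression quadrant."""
    labels = quadrant_groups(expr_a, expr_b)
    curves = {}
    for label in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        curves[label] = kaplan_meier([times[i] for i in idx],
                                     [events[i] for i in idx])
    return curves


# Toy example with four samples (values are made up).
print(quadrant_groups([1.0, 5.0, 2.0, 8.0], [3.0, 1.0, 9.0, 7.0]))
# ['A-low/B-low', 'A-high/B-low', 'A-low/B-high', 'A-high/B-high']
```

Because the quadrant labels depend entirely on the chosen cutoffs, a curve-based comparison of this kind is best read alongside a Cox model with an interaction term.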
Gene network and subnetwork analysis
In this section, we further discuss the utility of the gene network and subnetwork analysis implemented in CGPA. This functionality is particularly valuable when investigating large gene networks or gene sets with heterogeneous biological functions20. Through gene set analysis, we can examine groups of genes that collectively influence patient outcomes. Gene network analysis extends this approach by considering the interactions and regulatory relationships between genes within a network. It allows us to identify central hub genes that play critical roles in tumorigenesis or immune-related functions, and to understand how disruptions in these networks contribute to disease progression. Existing tools, such as GEPIA221 and SmulTCan22, lack the capability to resolve subnetworks and assess their contribution to a prognostic model. To exemplify its utility, we employ a composite gene signature, denoted ISG.HY, derived from the tumor IFN-stimulated gene signature (ISG.RS) and genes associated with hypoxia. Specifically, ISG.HY combines 38 genes from ISG.RS and 15 genes selected from a hypoxia signature. Figure 5A illustrates the application of Exploratory Graph Analysis (EGA) to dissect the ISG.HY gene set in the TCGA HNSC dataset. EGA is an efficient subnetwork analysis tool that segregates the ISG.HY gene set into three distinct subnetworks: ISG.HY1, ISG.HY2, and ISG.HY3. Remarkably, all hypoxia-related genes (such as VEGFA, ENO1 and MIF) were grouped exclusively in ISG.HY1. The other ISG.RS genes allocated to this group might be co-regulated with, or might interact in, the hypoxic response. For example, HLA-G is implicated in immune tolerance and MCL1 regulates apoptosis, both relevant to cell adaptation under low-oxygen conditions. Genes associated with the interferon response and immune regulation were further stratified into ISG.HY2 and ISG.HY3.
When the overall ISG.HY signature was assessed using ssGSEA, no significant association with patient survival was observed. However, as shown in Figure 5B, the ISG.HY1 subnetwork emerged as a significant signature (HR=1.88, p-value <0.001), whereas the other two subnetworks did not exhibit prognostic significance. This finding underscores the importance of subnetwork-level resolution in uncovering biologically and clinically meaningful signals that may be masked at the whole gene set level.
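To illustrate how a whole-set score can mask a subnetwork signal, the sketch below scores each sample by the mean z-score of the genes in a module. This is a deliberately simplified stand-in and is neither ssGSEA nor EGA; the gene names, module names, and expression values are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one gene's expression across samples."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def module_scores(expression, modules):
    """Per-sample mean z-score for each gene module.

    expression: dict mapping gene -> list of values (one per sample)
    modules:    dict mapping module name -> list of genes
    """
    z = {gene: zscore(vals) for gene, vals in expression.items()}
    n_samples = len(next(iter(expression.values())))
    scores = {}
    for name, genes in modules.items():
        present = [g for g in genes if g in z]
        if not present:
            continue  # skip modules with no measured genes
        scores[name] = [mean(z[g][i] for g in present)
                        for i in range(n_samples)]
    return scores


# Two toy genes that vary in opposite directions across three samples.
expression = {"GENE1": [1.0, 2.0, 3.0], "GENE2": [3.0, 2.0, 1.0]}
modules = {
    "whole_set": ["GENE1", "GENE2"],
    "subnetwork_1": ["GENE1"],
    "subnetwork_2": ["GENE2"],
}
scores = module_scores(expression, modules)
# The whole-set score is flat across samples, while each
# subnetwork score still separates the samples.
```

In this toy case the opposing genes cancel in the whole-set score, which is one way a subnetwork-level signal could go undetected when only the full signature is tested.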
Figure 5. Dissection of the ISG.HY gene signature using CGPA’s Exploratory Graph Analysis (EGA) function. (A) The CGPA-EGA tool is applied to stratify the ISG.HY composite gene signature (38 genes from the IFN-stimulated gene signature and 15 genes associated with hypoxia) into three functionally distinct subnetworks: ISG.HY1, ISG.HY2, and ISG.HY3, as demonstrated in the TCGA HNSC dataset. (B) The forest plot contrasts the prognostic value of each subnetwork, revealing ISG.HY1 as a significant prognostic indicator with a hazard ratio of 1.88 (p-value <0.001). This analysis underscores the capability of CGPA’s network and subnetwork analysis to elucidate complex gene interactions and refine prognostic gene signatures.
Prognostic gene discovery in immunotherapy studies
CGPA includes a dedicated module for cancer immunotherapy datasets. We curated 33 published gene expression datasets comprising 2,854 tumor samples from patients treated with immune checkpoint inhibitors (ICIs), each with corresponding survival outcomes. This module integrates the full suite of CGPA functions. Users can perform single-gene discovery to evaluate the prognostic significance of individual genes, conduct multivariable survival analysis to evaluate the combined impact of multiple genes, identify gene pairs whose combined expression levels are predictive of survival, and interrogate gene networks and subnetworks to uncover key regulatory modules and pathways associated with ICI response.
DISCUSSION
We introduced CGPA as an innovative online tool and interactive analysis portal aimed at addressing critical challenges in cancer genomics research. In comparison to existing tools such as GEPIA, GEPIA221, and PRECOG23, CGPA offers a more comprehensive suite of advanced features, including UMOP gene identification, gene-pair analysis, and gene-set analysis. These capabilities are essential for uncovering novel prognostic biomarkers and understanding complex molecular mechanisms in cancer progression. A key contribution of CGPA is its flexibility in customizing survival models, enabling researchers to conduct Kaplan-Meier analysis with different cutoffs and Cox regression analysis with various clinical covariates and cancer hallmark signatures. This flexibility improves the accuracy and reliability of prognostic assessments and supports personalized treatment strategies.
In the current basic science literature, univariable Kaplan-Meier curves and log-rank tests remain the dominant methods for demonstrating the prognostic value of a gene. While visually straightforward, this approach can lead to biased results due to the arbitrary selection of gene expression cutoffs. A common misconception is that multivariable models, because they include additional covariates, are less powerful than univariable analyses. Our analysis explicitly identified such missed-opportunity genes, underscoring how significant insights can be overlooked in univariable analysis for several reasons. First, adjusting for confounding factors such as tumor purity can reveal associations between gene expression and survival outcomes that would otherwise be diluted or distorted. It is well known that gene expression from bulk tissue can be strongly influenced by tumor purity, which in some cases is itself a prognostic factor. Second, multivariable analysis often implicitly handles the interaction effects between genes and covariates, such as CTL levels in our analysis. In the five immune-oncogenic cancer types we studied, it is well recognized that tumors classified as immune hot exhibit better prognosis than those classified as immune cold. However, survival differences highlighted by well-known immune genes like CD8A do not necessarily pinpoint the most crucial genes for biomarker discovery. This scenario is analogous to differential expression (DE) analysis comparing tumor and normal tissues, where the most significantly differentially expressed genes are often keratins, yet these genes may not be the primary focus of interest. In our study across the five cancer types, the UMOP genes we discovered cover a diverse array of roles, including immune-related genes (HLA-C, HLA-H, HLA-B, and HLA-F), immune response genes (C1QB, C1QC, IL15, IFNB1, and SERPINA1), and immune modulators (TAP1 and GBP1). This diversity underscores the importance of multivariable analysis in uncovering critical genes that are missed by univariable analysis, thereby providing a more accurate and comprehensive understanding of the contribution of a targeted gene to cancer prognosis.
The TIDE8 method employs a similar Cox regression model focusing on the interaction between CTL and specific genes, using z-scores from interaction terms to assess the effect of gene-CTL interactions on T cell dysfunction. Our gene-hallmark interaction analysis tool significantly expands upon TIDE by incorporating a wider array of clinical and molecular hallmarks, including top mutations, copy number alterations, androgen receptors, hypoxia, and various gene expression-based signatures. Essentially, TIDE can be considered a special case of our broader model framework in which the CTL hallmark is the only hallmark considered. Our holistic gene-hallmark analysis is crucial for understanding the complex biology of cancer and developing targeted therapies. By analyzing gene-cancer hallmark interactions, our method not only identifies potential therapeutic targets but also aids in understanding how certain genes may influence or be influenced by these hallmarks. Moreover, even in the absence of significant interaction effects, multivariable analysis adjusting for cancer hallmarks can help identify prognostic genes that are most complementary to existing cancer hallmark signatures. To account for confounding from tumor and cellular composition, CGPA incorporates pre-calculated tumor purity estimates and a CTL score (an overall immune activity score). We intentionally did not hard-code immune cell composition estimates because different deconvolution methods yield inconsistent results; this leaves users free to apply their own preferred approaches.
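As a sketch of the model form behind this kind of gene-hallmark interaction analysis, a Cox model with an interaction term can be written as below. The notation is ours and is an assumption for illustration: G is the expression of the gene of interest, H is the hallmark score (for example CTL), Z collects any additional covariates, and the interaction is summarized by a Wald z-score.

```latex
h(t \mid G, H, \mathbf{Z}) = h_0(t)\,
  \exp\!\left(\beta_G\, G + \beta_H\, H + \beta_{GH}\,(G \times H)
  + \boldsymbol{\gamma}^{\top}\mathbf{Z}\right),
\qquad
z_{GH} = \frac{\hat{\beta}_{GH}}{\widehat{\mathrm{SE}}\!\left(\hat{\beta}_{GH}\right)}
```

Setting H to the CTL score gives the gene-CTL special case discussed above. A nonzero interaction coefficient means that the hazard ratio associated with G changes with the level of H.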
CGPA’s gene-pair analysis module offers valuable insights into the interplay between genes and their impact on patient survival. By integrating Kaplan-Meier plots and bivariable Cox regression analyses, CGPA enables researchers to identify synergistic or antagonistic gene interactions, shedding light on complex mechanisms underlying cancer progression. This functionality is particularly useful for uncovering synthetic lethal interactions, which may inform the development of targeted therapies and improve treatment outcomes for cancer patients. Additionally, CGPA’s gene-set analysis feature allows for the comprehensive evaluation of gene networks and pathways involved in cancer biology. Through ssGSEA and machine learning-based methods, CGPA facilitates the identification of gene subsets with significant prognostic value, enhancing our understanding of the molecular mechanisms driving cancer progression and guiding the development of novel therapeutic interventions. Overall, gene-pair analysis is most beneficial when investigating specific hypotheses about interactions between two genes, particularly in contexts such as synthetic lethality or functional compensation. Beyond these contexts, bivariable analysis offers the advantage of controlling for mutual background effects, enhancing interpretability for modestly correlated genes (e.g., GZMA and PRF1 in the CYT score), while also serving as a sensitivity analysis to reveal combinatorial effects. In contrast, multi-gene (gene-set) analysis is more appropriate when the goal is to assess the aggregate prognostic contribution of a pathway or a gene-set signature, especially when individual gene effects are modest but collectively informative. By supporting both approaches, CGPA enables users to select the most biologically appropriate strategy for targeted analyses at both the gene and pathway levels.
Finally, CGPA provides two additional features that set it apart from existing databases: (1) a designated “ProgSplicing” tab for exploring the pan-cancer prognostic landscape of alternative splicing events, helping users to investigate genes with adverse prognostic effects across cancer types; and (2) a dedicated portal for exploring prognostic gene modules using meticulously curated immunotherapy datasets. Altogether, CGPA acts as a streamlined, interactive, gene-centric platform that greatly simplifies prognostic biomarker research in oncology for clinicians and basic scientists. It is the first of its kind to provide multi-context insights into cancer gene prognosis, thereby bridging a critical gap in translational cancer research.
Implications:
CGPA is a streamlined, interactive platform for multi-context gene-centric prognostic analysis, simplifying biomarker discovery and validation for molecular and basic cancer scientists, and bridging a critical gap in translational cancer research.
Acknowledgments
This work was supported in part by the State of Florida Bankhead-Coley Cancer Research Program, infrastructure research grant 23B16, and Biostatistics and Bioinformatics Shared Resources at the H. Lee Moffitt Cancer Center & Research Institute, an NCI-designated Comprehensive Cancer Center (P30-CA076292).
Footnotes
Conflict of interest: The authors declare no potential conflicts of interest.
Authors’ Disclosures
No disclosures were reported.
REFERENCES
1. Hoadley KA, et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell 173, 291–304.e296 (2018).
2. Sanchez-Vega F, et al. Oncogenic signaling pathways in the cancer genome atlas. Cell 173, 321–337.e310 (2018).
3. Cerami E, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discovery 2, 401–404 (2012).
4. Tang Z, et al. GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses. Nucleic Acids Research 45, W98–W102 (2017).
5. Zhang Y, et al. OncoSplicing: an updated database for clinically relevant alternative splicing in 33 human cancers. Nucleic Acids Research 50, D1340–D1347 (2022).
6. Ridgeway G. Generalized Boosted Models: A guide to the gbm package. Update 1, 2007 (2007).
7. Csardi G & Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems 1695, 1–9 (2006).
8. Jiang P, et al. Signatures of T cell dysfunction and exclusion predict cancer immunotherapy response. Nature Medicine 24, 1550–1558 (2018).
9. Hothorn T & Lausen B. Maximally selected rank statistics in R. R News 2, 3–5 (2002).
10. Hothorn T & Lausen B. On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis 43, 121–137 (2003).
11. Cao B & Kim J. R Package AdjKM. Cif and Shiny Application for Creating the Covariate-Adjusted Kaplan-Meier and Cumulative Incidence Functions.
12. Hanahan D & Weinberg RA. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011).
13. Hanahan D. Hallmarks of cancer: new dimensions. Cancer Discovery 12, 31–46 (2022).
14. Rooney MS, Shukla SA, Wu CJ, Getz G & Hacohen N. Molecular and genetic properties of tumors associated with local immune cytolytic activity. Cell 160, 48–61 (2015).
15. Jain MD, et al. Tumor interferon signaling and suppressive myeloid cells are associated with CAR T-cell failure in large B-cell lymphoma. Blood, The Journal of the American Society of Hematology 137, 2621–2633 (2021).
16. Hänzelmann S, Castelo R & Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 14, 1–15 (2013).
17. Dinstag G, et al. Clinically oriented prediction of patient response to targeted and immunotherapies from the tumor transcriptome. Med 4, 15–30.e18 (2023).
18. Wang X, Elston RC & Zhu X. The meaning of interaction. Human Heredity 70, 269–277 (2011).
19. Yu X, et al. Tumor expression quantitative trait methylation screening reveals distinct CpG panels for deconvolving cancer immune signatures. Cancer Research 82, 1724–1735 (2022).
20. Cao B, et al. A subnetwork-based framework for prioritizing and evaluating prognostic gene modules from cancer transcriptome data. iScience 26 (2023).
21. Tang Z, Kang B, Li C, Chen T & Zhang Z. GEPIA2: an enhanced web server for large-scale expression profiling and interactive analysis. Nucleic Acids Research 47, W556–W560 (2019).
22. Ozhan A, Tombaz M & Konu O. SmulTCan: A Shiny application for multivariable survival analysis of TCGA data with gene sets. Computers in Biology and Medicine 137, 104793 (2021).
23. Gentles AJ, et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nature Medicine 21, 938–945 (2015).
Data Availability Statement
CGPA is publicly available as a web-based interactive tool at https://cgpa.moffitt.org/. This platform provides researchers with easy access to a comprehensive suite of tools for prognostic gene discovery and analysis in immunotherapy studies. The source code and associated R packages used in the development of CGPA have been deposited in a GitHub repository: https://github.com/wanglab1/CGPA. We also provide a Docker image, available on Figshare: https://figshare.com/articles/dataset/CGPA/30048187?file=57645685, which allows users to download and deploy the entire application locally. Additionally, all data included in the CGPA web tool, including both PanCancer TCGA datasets and immunotherapy datasets, are available for download at https://github.com/wanglab1/CGPA/tree/main/PanCancer_data and https://github.com/wanglab1/CGPA/tree/main/IOtherapy_data. The TCGA CDR outcome data (recommended for survival analysis) were downloaded from the “TCGA-CDR-SupplementalTableS1.xlsx” file available on the PanCanAtlas website: https://gdc.cancer.gov/about-data/publications/pancanatlas. The alternative splicing datasets generated by SpliceSeq and SplAdder were acquired from oncosplicing.com, with supplementary data from all cancer types (not available on the website) provided by the OncoSplicing author team. The ICI datasets preprocessed and analyzed in this study are available from the Gene Expression Omnibus (GEO, RRID:SCR_005012) repository under the accession numbers: GSE111636, GSE248167, GSE173839, GSE194040, GSE241876, GSE179730, GSE162137, GSE165252, GSE159067, GSE195832, GSE67501, GSE135222, GSE221733, GSE126044, GSE100797, GSE115821, GSE131521, GSE78220, GSE91061, GSE96619, GSE202687.
We also included pre-analyzed gene expression data from large ICI projects, including IMvigor210 (based on the R package “IMvigor210CoreBiologies”), harmonized KIRC datasets from the supplementary files of publication PMID: 32472114, JAVELIN Renal 101 from the supplementary files of publication PMID: 32895571, harmonized SKCM datasets (available at https://github.com/ParkerICI/MORRISON-1-public), and Kallisto data from https://github.com/hammerlab/multi-omic-urothelial-anti-pdl1/tree/master.
