Abstract
Identification of cancer patient subgroups using high throughput genomic data is of critical importance to clinicians and scientists because it can offer opportunities for more personalized treatment and overlapping treatments of cancers. In spite of tremendous efforts, this problem remains challenging because of low reproducibility and instability of identified cancer subgroups and molecular features. In order to address this challenge, we developed InGRiD (Integrative Genomics Robust iDentification of cancer subgroups), a statistical approach that integrates information from biological pathway databases with high-throughput genomic data to improve the robustness of identification and interpretation of molecularly-defined subgroups of cancer patients. We applied InGRiD to the gene expression data of high-grade serous ovarian cancer from The Cancer Genome Atlas and the Australian Ovarian Cancer Study. The results indicate clear benefits of the pathway-level approaches over the gene-level approaches. In addition, using the proposed InGRiD framework, we also investigate and address the issue of gene sharing among pathways, which often occurs in practice, to further facilitate biological interpretation of key molecular features associated with cancer progression. The R package ‘INGRID’ implementing the proposed approach is currently available on our research group's GitHub page (https://dongjunchung.github.io/INGRID/).
Keywords: Clustering, variable selection, biological pathway, integrative analysis, cancer genomics, gene set
1. Introduction
During the last decade, The Cancer Genome Atlas (TCGA) Consortium, supported by the U.S. National Cancer Institute (NCI) and the U.S. National Human Genome Research Institute (NHGRI), contributed to significant advances in cancer genomics research to improve our understanding of the molecular basis of cancer subgroups and their associations with prognosis 1–6. TCGA employs an integrative approach based on multiple genomic platforms, including somatic mutation, copy number alteration (CNA), DNA methylation, and gene expression, to identify shared molecular alterations between subsets of tumors. While this integrative approach provides unprecedented opportunities to investigate a large number of cancer patients from diverse molecular perspectives, the high dimensionality, complexity, and heterogeneity of the data pose challenges for the effective and robust analysis of these datasets. For example, each genomic platform profiles thousands to tens of thousands of genes and generates various data types, including continuous (gene expression and DNA methylation), categorical (CNA), and binary (somatic mutation) measurements. Moreover, driver molecular features are often specific to only certain cancer subgroups and data types 7–11.
The development of iCluster+ for the analysis of such cancer genomics datasets overcame many of these challenges by simultaneously identifying cancer subgroups and important genes from multiple genomic platforms within a unified framework 7. In spite of this, the following challenges remain. First, the identification of key genes does not directly promote understanding of biological networks, and additional downstream analyses are needed to characterize these genes in the context of biological networks. Second, gene-level findings have often been reported to reproduce poorly between different studies 12. Third, genes with weak effect sizes might not be identified in gene-level analyses 8. These limitations suggest a need for a statistical framework that improves the robustness of biological findings and the interpretation of molecularly-defined subgroups of cancer patients.
One approach to overcome these challenges is to investigate molecular characteristics of cancers at the pathway level, a strategy that has been reported to be more robust and reproducible across studies 12, 13. Summarizing molecular measurements at the pathway level can potentially also lead to improved statistical power because the dimension of molecular data is vastly reduced and moderate signals from multiple genes can be aggregated 8, 14, 15. The developers of PARADIGM 9 made important progress in this direction by summarizing molecular activities in various genomic platforms as pathway-level activation scores. However, in the PARADIGM framework, advantages of pathway-level analyses were still not fully investigated in the context of cancer subgroup identification 16. The pathway index model tried to utilize the pathway information for the purpose of cancer subgroup identification 17. Specifically, this approach characterizes the prognostic risk of a pathway in two steps. In the first step (gene selection), genes are selected separately for each pathway by LASSO Cox regression and determined as either “cancer susceptible” or “cancer resistant” based on the signs of their LASSO coefficients. In the second step (pathway selection), a pathway index is calculated by comparing the mean expressions of cancer susceptible genes and cancer resistant genes. The number of pathways with positive indices can then be used to classify patients into low, moderate and high risk groups. However, the pathway index model can only tell if a pathway is “in” or “out” and it is not capable of determining the joint effects of pathways or the relative importance of each selected pathway.
In addition to these limitations in pathway-level approaches for cancer subtype identification, the pathway information itself further poses its own challenges. Specifically, despite various attempts to investigate molecular features at the pathway level, the issue of gene overlap between pathways is often ignored. For example, in the widely-used KEGG pathway database 18, 42% of genes in non-metabolic pathways are members of multiple pathways. Such pathway overlap generates inter-correlation among pathways, which can lead to reduced stability in pathway selection and parameter estimation. Moreover, if common genes are strongly associated with an outcome of interest, multiple pathways containing these genes can be artificially identified as important pathways. In this case, the results obtained from pathway-level analyses are difficult to interpret and it is not clear which pathway is truly involved in disease genesis and progression. One approach to deal with this issue is to divide pathways into unique gene sets and their intersections 19 but this simple approach cannot be applied to more than two pathways. Another approach is to focus on unique genes by down-weighting genes belonging to multiple pathways 20. However, this approach is biologically questionable because common genes might represent upstream regulators or important cross talks among different biological processes.
In order to address these challenges, we propose InGRiD (Integrative Genomics Robust iDentification of cancer subgroups), a statistical approach that improves the identification of molecularly-defined subgroups of cancer patients in the following aspects. First, we address the pathway overlap issue by redefining the gene set membership of common genes. This approach creates unique membership for each gene, thereby improving the stability and interpretability of the pathway analysis results. Second, we simultaneously represent the gene expression profiles of pathways and select key genes from each pathway by constructing pathway-level latent components using sparse partial least squares (SPLS) Cox regression. Finally, InGRiD allows simultaneous inference in multiple biological layers (pathways and genes) within a unified statistical framework without any additional laborious downstream analysis.
2. Data Description
We illustrate the usefulness of the proposed InGRiD approach using a cohort of high-grade serous ovarian cancer (HGSOC) patients from The Cancer Genome Atlas (TCGA) project 1. This cohort consists of 485 patients from whom survival times from diagnosis and gene expression measurements were obtained. The details on collecting and processing TCGA ovarian cancer data can be found elsewhere 1. None of the TCGA patients had grade 1 disease. The Australian Ovarian Cancer Study (AOCS) 21 was used as an independent dataset to evaluate reproducibility of the TCGA findings and includes gene expression measurements and survival times from 238 patients with HGSOC. In AOCS, we excluded patients with borderline tumors, endometrioid or other histology, missing vital status and grade 1 disease.
The TCGA data was downloaded from the cBio Portal (http://www.cbioportal.org/) using the R package cgdsr and we used z-scores for the mRNA expression data. The AOCS data was downloaded from the GEO database (Accession number: GSE9891) and we applied RMA (using the Bioconductor package affy), which provides background correction, normalization, and summarization for gene expression measurements 22. All the mRNA expression measures were centered and scaled to have unit variance according to standard practice. KEGG pathway annotations 18 were downloaded from the MSigDB database (http://software.broadinstitute.org/gsea/msigdb). In this paper, we considered only the genes that are profiled in the KEGG database, belong to the previously reported 15 core signaling pathways in cancer 23, and are included in both the TCGA and AOCS datasets, which leaves a total of 1045 genes. These 15 core signaling pathways include known driver genes that appear in most cancer types and are proven regulators of core cellular processes including cell fate, cell survival, and genome maintenance 24. The involvement of these pathways in ovarian cancer is well documented 25–27.
3. Methods
InGRiD provides pathway-guided identification of patient subgroups based on gene expression measurements from patient tumors while utilizing patient survival information as the outcome variable. Let (Ti, δi) be the survival information (survival time and censoring indicator) of patient i, for i = 1, …, n, where Ti is observed if δi = 1 and Ti is censored if δi = 0. Denote xi = (xi1,…, xip)′ as the corresponding gene expression vector of patient i, where n < p. Let m represent the number of pathways to which the p genes belong. We consider the following proportional hazards model:
$$h_i(t) = h_0(t)\exp\left(\sum_{r=1}^{m}\phi_{ir}\right),$$
where h0(t) is the baseline hazard function and ϕir relates the expression profile of pathway r to the survival risk of patient i. The term ϕir is given by
| $\phi_{ir} = x_i' U W_r \beta_r,$ | (eq1) |
where U is the p × p diagonal matrix of known gene-level weights, Wr is the p × Kr matrix summarizing the expression data of pathway r as Kr latent variables, and βr is the corresponding coefficient vector for the Kr latent variables. By further defining the combined gene-level summarization matrix W = (W1,⋯, Wm) and the combined pathway-level coefficient vector β = (β1′,⋯,βm′)′, this model can be further expressed as
| $h_i(t) = h_0(t)\exp\left(x_i' U W \beta\right).$ | (eq2) |
The m pathways in (eq2) are not mutually exclusive because each of the p genes can belong to more than one pathway. We use (eq2) as our base model to which further modifications can be made. We describe approaches to deal with the pathway overlapping issue in Section 3.1 while we estimate parameters in W and β sequentially, as described in Sections 3.3 and 3.4, respectively. We refer to our proposed approach as InGRiD hereafter and illustrate the entire workflow in Figure 1.
Figure 1.
Workflow of the proposed InGRiD approach.
3.1. Accounting for Overlapping Pathways
In this subsection, we consider two different approaches for dealing with the pathway overlapping issue, namely the gene-cluster approach and the down-weight approach. Consider the p unique genes that have been mapped to the m pathways based on a pathway annotation database such as KEGG. We denote the pathway membership of the jth gene by the vector zj = (zj1,…, zjm)′, where zjr = 1{jth gene belongs to rth pathway} for r = 1,⋯,m. Finally, we denote the set of genes for the rth gene set as Gr.
The gene-cluster approach assigns the p genes into m+ gene sets (m+ > m) according to the following rules: First, the jth gene remains as a member of Gr if it is a core member of the rth pathway. The jth gene is defined as a “core member” of the rth pathway if zjr = 1 and zjr′ = 0 for all r′ ≠ r, i.e., it belongs to only the rth pathway.
Second, if a pathway has no core members, it will be dropped from further consideration. Let m′ be the number of pathways with at least one core member; it follows that m′ ≤ m. Third, if the jth gene is not a core member of any of the m pathways (i.e., it maps to more than one pathway), then this gene is re-assigned to one of the gene sets based on the Partitioning Around Medoids (PAM) algorithm 28. Let d(j, j′) be the binary distance between genes j and j′ located in cluster c. The PAM algorithm groups genes into k clusters by minimizing the total within-cluster distance over all clusters, where C(j) indicates the cluster membership of gene j. Hence, by re-clustering the p genes into the m+ gene sets, the gene-cluster approach creates a unique membership for each gene.
Unlike the k-means approach which minimizes within-cluster distances with respect to cluster means, the PAM approach assigns cluster memberships based on the closeness of genes to a representative gene of each cluster. In order to determine the optimal number of clusters k, we utilize the silhouette width (sw), which measures the quality of the assignment of an individual element to its assigned cluster. A gene is considered to be well clustered if sw ≈ 1, whereas a gene is considered to be assigned to a wrong cluster if sw < 0. The optimal choice of k depends on both the genomic data at hand and the pathway annotations considered. In our data analysis, we chose k=4 by comparing the mean sw across k=1,…,10. We implemented PAM based on the build and swap algorithms using the R package cluster.
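The clustering of shared genes and the choice of k can be illustrated with a minimal Python sketch. It is not the R `cluster` implementation used in the analysis: the medoids are found by exhaustive search instead of the build and swap steps, the "binary distance" is assumed to be the share of mismatching positions among positions where either gene is a member, and the membership matrix `z` is a made-up toy example.

```python
from itertools import combinations

def binary_distance(a, b):
    """Share of mismatching positions among those where either vector is 1."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    differ = sum(1 for x, y in zip(a, b) if x != y)
    return differ / either if either else 0.0

def pam_exhaustive(dist, k):
    """Pick the k medoids with the smallest total distance (toy-sized inputs only)."""
    n = len(dist)
    best_cost, best_medoids = None, None
    for medoids in combinations(range(n), k):
        cost = sum(min(dist[j][m] for m in medoids) for j in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Assign each gene to its closest medoid (the first medoid wins ties).
    labels = [min(range(k), key=lambda c: dist[j][best_medoids[c]]) for j in range(n)]
    return best_medoids, labels

def mean_silhouette(dist, labels):
    """Average silhouette width; singleton clusters contribute 0 by convention."""
    n = len(dist)
    if len(set(labels)) < 2:
        return 0.0
    widths = []
    for j in range(n):
        own = [i for i in range(n) if labels[i] == labels[j] and i != j]
        if not own:
            widths.append(0.0)
            continue
        a = sum(dist[j][i] for i in own) / len(own)
        b = min(
            sum(dist[j][i] for i in range(n) if labels[i] == c)
            / sum(1 for i in range(n) if labels[i] == c)
            for c in set(labels) if c != labels[j]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / n

# Hypothetical membership vectors z_j of six shared genes over four pathways.
z = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0],
     [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
dist = [[binary_distance(a, b) for b in z] for a in z]

# Compare the mean silhouette width across candidate numbers of clusters.
scores = {k: mean_silhouette(dist, pam_exhaustive(dist, k)[1]) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
medoids, labels = pam_exhaustive(dist, best_k)
```

Exhaustive search is only feasible for a handful of genes; it is used here because it makes the objective explicit, whereas a real analysis would rely on a build and swap implementation.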
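As an alternative to the gene-cluster approach, we also consider down-weighting overlapping genes. Let fj be the number of occurrences of gene j across all the pathways, $f_j = \sum_{r=1}^{m} z_{jr}$, and let max(f) and min(f) denote the largest and smallest values of fj among the p genes. The down-weighting approach 20 determines the weight of gene j as
| $u_j = 1 + \sqrt{\dfrac{\max(f) - f_j}{\max(f) - \min(f)}}$ | (eq3) |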
Note that the weight uj is a monotonically decreasing function of fj. Specifically, uj = 1 if a gene appears in all the m pathways whereas uj = 2 if it belongs to only a single pathway.
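As a small numerical illustration of the down-weighting idea, the sketch below computes occurrence counts and weights for a toy membership table. The square-root form of the weight is an assumption chosen to match the boundary values stated above (weight 1 for the most widely shared gene, weight 2 for a gene in a single pathway); the gene names and memberships are hypothetical.

```python
import math

def gene_weights(membership):
    """Map each gene to a weight in [1, 2] that decreases with pathway sharing.

    membership: dict gene -> list of 0/1 indicators z_j over the m pathways.
    Assumed form: u_j = 1 + sqrt((max(f) - f_j) / (max(f) - min(f))).
    """
    freq = {gene: sum(z) for gene, z in membership.items()}
    f_max, f_min = max(freq.values()), min(freq.values())
    if f_max == f_min:
        raise ValueError("all genes occur equally often; weights are undefined")
    return {gene: 1.0 + math.sqrt((f_max - f) / (f_max - f_min))
            for gene, f in freq.items()}

# Toy example with m = 3 pathways.
membership = {
    "geneA": [1, 1, 1],  # shared by all pathways
    "geneB": [0, 1, 0],  # unique to one pathway
    "geneC": [1, 0, 1],  # shared by two pathways
}
weights = gene_weights(membership)
```

In this toy case the gene shared by all three pathways receives weight 1, the unique gene receives weight 2, and the gene shared by two pathways falls in between.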
Using the notations of (eq1) and (eq2), the approaches described above can be summarized as follows. In the gene-cluster approach, the size of matrix W in (eq2) now becomes $p \times \sum_{r=1}^{m^+} K_r$ (which corresponds to p × 19 in our real data analysis) instead of $p \times \sum_{r=1}^{m} K_r$, while U in (eq2) is set to be an identity matrix. In the down-weighting approach, the size of matrix W is $p \times \sum_{r=1}^{m} K_r$ as in (eq2) and the jth diagonal element of U is set to uj given in (eq3). As a baseline, we also consider a version of InGRiD without accounting for the pathway overlap issue. Here, U is again set to be an identity matrix and the size of matrix W still remains as $p \times \sum_{r=1}^{m} K_r$. In the remainder of this paper, we refer to InGRiD based on the gene-cluster approach, the down-weighting approach, and without accounting for the pathway overlap issue as gcInGRiD, dwInGRiD and naïve InGRiD, respectively.
3.2. Gene Pre-filtering
To improve the signal-to-noise ratio and estimation stability, we first eliminate the most unlikely gene predictors by conducting a supervised pre-filtering using a Cox model applied to each gene separately and including only the genes with p-value < 0.5 for further analysis. The performance of InGRiD is not sensitive to the choice of p-value cutoff point as long as we are not overly aggressive in eliminating genes in the pre-filtering step (e.g., p-value<0.2). Specifically, different p-value cutoff points led to similar results in terms of molecular feature selection (Supplemental Tables 1 and 2) and survival-guided patient subgroup identification (Supplemental Figure 1).
3.3. Summarizing Gene Set Expression Profiles using a SPLS Cox Regression
In order to effectively construct the summarization matrices Wr in (eq1) by taking into account correlation among the genes in each gene set, we utilize a SPLS Cox regression model 29. The SPLS Cox model calculates deviance residuals from a Cox model without covariates and fits a SPLS regression model using deviance residuals as responses. SPLS regression was developed on the basis of partial least squares (PLS), which seeks direction vectors that not only capture a large proportion of predictor variance but also explain a large proportion of association with responses. However, SPLS differs from PLS in an important respect. Specifically, while PLS assumes that all the genes in Gr are associated with the survival outcome (i.e., all the elements in direction vectors are nonzero), SPLS sequentially constructs each direction vector only based on “important” genes by shrinking some of its elements to zero. Hence, SPLS provides simultaneous gene selection and dimension reduction. SPLS has been reported to have better predictive performance than PLS, especially in the context of genomic studies 30, 31.
We estimate each Wr in (eq1) by fitting a SPLS Cox regression model of patient survival on gene expression measurements separately for each gene set Gr. As a result, if gene j is not in Gr, all the elements in the jth row of Wr are zeros. Otherwise, the rows of Wr corresponding to the genes belonging to Gr are estimated using the SPLS Cox regression. This step significantly simplifies the pathway-level analysis described in detail in the next subsection because the length of $x_i' U W$ in (eq2) is usually much smaller than the length of the original gene expression data $x_i$. Note that in gcInGRiD, there is no overlap between gene sets and as a result, there are at most Kr nonzero elements for the row corresponding to a gene belonging to Gr in the estimated combined summarization matrix $\hat{W}$ in (eq2). However, this is not the case for dwInGRiD and naïve InGRiD because they allow overlap between gene sets. We utilized the R package plsRcox as an implementation of the SPLS Cox regression.
In our data analysis, we set Kr = 1 for all r and determine only the optimal shrinkage parameter of SPLS for each gene set Gr using 10-fold cross validation. Fixing Kr = 1 significantly reduces the complexity of cross validation. In addition, fixing Kr = 1 makes each Wr in (eq1) a length p column vector, wr1 = (wr11,…, wrp1)′, instead of a p × Kr matrix. In this case, (eq2) can be further simplified as
$$h_i(t) = h_0(t)\exp\left(\sum_{r=1}^{m} \beta_{r1}\, x_i' U w_{r1}\right),$$
where h0(t) denotes the baseline hazard function. Put another way, $\phi_{ir} = \sum_{j \in G_r} x_{ij}\, u_j\, w_{rj1}\, \beta_{r1}$. Hence, InGRiD essentially determines the contribution of each gene in Gr to the patient- and pathway-specific hazards ϕir based on its gene-level contribution to the latent variables summarizing gene set expression data (wrj1) adjusted by its corresponding weight (uj) and its corresponding pathway-level effect (βr1).
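The decomposition of a pathway-specific hazard into gene-level contributions can be sketched in a few lines of Python, assuming Kr = 1 so that each pathway has a single loading vector and a single coefficient. All numbers below are made up for illustration, and the function names are hypothetical.

```python
def gene_contributions(x, u, w_r, beta_r):
    """Per-gene terms x_ij * u_j * w_rj1 * beta_r1 for one patient and one pathway."""
    return [x_j * u_j * w_j * beta_r for x_j, u_j, w_j in zip(x, u, w_r)]

def pathway_hazard(x, u, w_r, beta_r):
    """Patient- and pathway-specific hazard: the sum of the per-gene terms."""
    return sum(gene_contributions(x, u, w_r, beta_r))

# Toy patient with p = 4 genes; genes 2 and 4 have zero loadings,
# either because they are outside the gene set or were shrunk to zero.
x = [0.5, -1.0, 2.0, 0.0]    # centered and scaled expression values
u = [1.0, 1.0, 1.0, 1.0]     # identity weights, as in the gene-cluster approach
w_r = [0.6, 0.0, -0.2, 0.0]  # sparse loading vector of pathway r
beta_r = 0.1                 # pathway-level coefficient

phi_ir = pathway_hazard(x, u, w_r, beta_r)
terms = gene_contributions(x, u, w_r, beta_r)
```

Genes with a zero loading contribute nothing to the hazard, which is how sparsity in the loading vector translates into gene selection within a pathway.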
Alternatively, both Kr and the shrinkage parameter can be determined using 10-fold cross validation. Supplemental Table 3 tabulates the top genes in each selected gene set by gcInGRiD with Kr not fixed and Supplemental Figure 2 shows the identified patient subgroups by gcInGRiD with Kr not fixed. These results indicate that fixing Kr = 1 has little impact on the molecular feature selection (Table 1) and patient subgroup identification of gcInGRiD (Figure 2).
Table 1. Top gene sets and genes selected by gcInGRiD based on the gene-cluster approach.
Gene sets are ranked based on their gene-set-level LASSO coefficient estimates. Gene sets consisting of common genes across pathways are written in bold face. The number in parentheses in the column ‘Genes Selected’ refers to the total number of genes in each gene set.
| Gene Sets Selected | Genes Selected | Gene Set Coefficient | Top Three Genes | | |
|---|---|---|---|---|---|
| HEDGEHOG_SIGNALING | 12(20) | 0.167 | CSNK1G3 | GAS1 | CSNK1D |
| **MAPK&APOPTOSIS** | 37(68) | 0.093 | PPP3CA | RPS6KA2 | TGFBR1 |
| NUCLEOTIDE_EXCISION_REPAIR | 2(19) | 0.072 | GTF2H4 | DDB2 | |
| CELL_ADHESION_MOLECULES_CAMS | 59(103) | 0.065 | CD6 | CTLA4 | CLDN6 |
| NOTCH_SIGNALING | 1(36) | 0.062 | NOTCH4 | ||
| PHOSPHATIDYLINOSITOL_SIGNALING | 30(58) | 0.06 | PLCG1 | PIP4K2B | IMPA2 |
| CELL_CYCLE | 2(84) | 0.059 | MCM3 | ANAPC11 | |
| MISMATCH_REPAIR | 1(7) | 0.045 | SSBP1 | ||
| JAK_STAT_SIGNALING | 67(115) | 0.039 | SOCS5 | IL21R | IFNA21 |
| WNT_SIGNALING | 39(67) | 0.016 | APC | CSNK2A2 | FZD1 |
| BASE_EXCISION_REPAIR | 10(20) | 0.011 | MUTYH | UNG | APEX1 |
| MAPK_SIGNALING | 94(189) | 0.01 | PLA2G2D | PDGFRA | FGF7 |
| **DNA-REPAIR** | 3(22) | 0.005 | POLD2 | FEN1 | POLD3 |
| **WNT&HEDGEHOG** | 0(38) | 0 | |||
| **TGF_BETA&CELL_CYCLE&WNT** | 0(34) | 0 | |||
| NON_HOMOLOGOUS_END_JOINING | 0(10) | 0 | |||
| MTOR_SIGNALING | 0(29) | 0 | |||
| APOPTOSIS | 0(39) | 0 | |||
| TGF_BETA_SIGNALING | 0(43) | 0 | |||
Figure 2. Kaplan-Meier curves for patient subgroups identified by the pathway-level analysis (gcInGRiD) and the gene-level analysis (gene index count model) using the TCGA high-grade serous ovarian cancer data.
Patient subgroups are color-coded according to survival probabilities: red (high-risk of mortality), green (intermediate-risk of mortality) and black (low-risk of mortality). Values in parentheses in the legend represent the number of activated risk pathways associated with each subgroup.
3.4. Pathway-level Analysis using a LASSO-Penalized Cox Regression
Given the estimated summarization matrix $\hat{W}$, we now identify a parsimonious set of pathways associated with patient survival. Specifically, we fit a LASSO-penalized Cox regression 32 on latent variables derived from all the gene sets by minimizing the objective function $-l(\beta) + \lambda \lVert \beta \rVert_1$ with respect to β, where l(β) is the log partial likelihood function and λ is a tuning parameter controlling the amount of shrinkage, which is again determined using 10-fold cross validation. We utilized the R package glmnet as an implementation of the LASSO-penalized Cox regression.
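To make the penalized objective concrete, the following Python sketch evaluates the negative log partial likelihood plus an L1 penalty for a given coefficient vector. It is only an evaluation of the criterion, not an optimizer and not the glmnet implementation; tied event times are not handled specially, the likelihood term is left unscaled (software packages may scale it differently), and the toy data are hypothetical.

```python
import math

def cox_lasso_objective(times, events, latent, beta, lam):
    """Return -l(beta) + lam * ||beta||_1 for a Cox model.

    times:  observed follow-up times
    events: 1 if the event was observed, 0 if censored
    latent: one row of pathway-level latent variables per patient
    """
    eta = [sum(b * v for b, v in zip(beta, row)) for row in latent]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:
            # Risk set: patients still under observation at time t_i.
            risk = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
            loglik += eta[i] - math.log(risk)
    return -loglik + lam * sum(abs(b) for b in beta)

# Toy data: four patients, two pathway-level latent variables.
times = [5.0, 3.0, 8.0, 1.0]
events = [1, 0, 1, 1]
latent = [[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1], [0.5, 0.0]]

null_objective = cox_lasso_objective(times, events, latent, [0.0, 0.0], lam=0.5)
fitted_objective = cox_lasso_objective(times, events, latent, [0.3, -0.2], lam=0.5)
```

With all coefficients at zero, the objective reduces to the sum of the log risk-set sizes at the observed event times, which is a convenient sanity check.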
3.5. Patient Subgroup Identification
We identify patient subgroups by clustering patients based on the degree of activation of their “risk pathways”. First, we calculate the patient- and pathway-specific hazards $\hat{\phi}_{ir}$ defined in (eq1). Then, we consider the gene set Gr as a “risk pathway” for patient i if $\hat{\phi}_{ir} > 0$. Finally, a patient is categorized into the “low-risk group” if the number of risk pathways for that patient, $\sum_{r} 1\{\hat{\phi}_{ir} > 0\}$, is less than the first quartile, into the “intermediate-risk group” if it is between the first and third quartiles, and into the “high-risk group” if it is higher than the third quartile.
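The counting-and-quartile rule can be sketched as follows. Two details are assumptions made for illustration: a gene set counts as a risk pathway for a patient when the corresponding hazard is positive, and quartiles are computed with the inclusive method of Python's `statistics.quantiles`. The hazard matrix is a made-up toy example.

```python
import statistics

def risk_pathway_counts(phi):
    """phi[i][r] is the hazard of pathway r for patient i; count the positive ones."""
    return [sum(1 for value in row if value > 0) for row in phi]

def assign_risk_groups(counts):
    """Split patients at the first and third quartiles of the risk pathway counts."""
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    groups = []
    for count in counts:
        if count < q1:
            groups.append("low")
        elif count > q3:
            groups.append("high")
        else:
            groups.append("intermediate")
    return groups

# Toy hazards: patient i has exactly i + 1 positive pathway hazards out of 9.
phi = [[1.0] * c + [-1.0] * (9 - c) for c in range(1, 10)]
counts = risk_pathway_counts(phi)
groups = assign_risk_groups(counts)
```

Because the counts are integers, many patients can tie at a quartile; whether ties fall into the intermediate group (as here) or into an outer group is a convention that should be fixed before comparing results.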
4. Results
We first summarize the pathways and genes identified by InGRiD in Section 4.1 while patient subgroups identified by InGRiD are presented in Section 4.2 based on the TCGA dataset 1. The predictive performance of InGRiD is evaluated using the TCGA dataset 1 as internal validation (Section 4.3) and the AOCS data 21 as external validation (Section 4.4). In Section 4.5, we compare the pathway-level analysis with the gene-level analysis in terms of robustness in the identification of pathways/genes and patient subgroups. In addition to InGRiD, we also consider the gene index count model that performs LASSO-penalized Cox regression at the gene level after pre-filtering and then categorizes prognostic risks of each individual patient based on the count of genes associated with increased hazards using a procedure similar to that described in Section 3.5.
4.1. Pathway and Gene Selection
We summarize pathway and gene selection results of gcInGRiD in Table 1, which shows the selected gene sets, the numbers of total and selected genes in each gene set, the gene-set-level LASSO coefficient estimates, and the top three genes in each gene set. Gene sets are ranked based on their LASSO coefficient estimates. The gene-cluster approach of gcInGRiD identified four additional gene sets, namely “MAPK&APOPTOSIS”, “DNA-REPAIR”, “WNT&HEDGEHOG”, and “TGF_BETA&CELL_CYCLE&WNT” (the list of genes belonging to each of these additional gene sets is provided in Supplemental Table 4). gcInGRiD identified 357 unique genes from 13 gene sets using the TCGA data, where these 13 gene sets include eleven KEGG pathways comprising only core genes and two gene sets generated from common genes shared across KEGG pathways (“MAPK&APOPTOSIS” and “DNA-REPAIR”).
First, the “MAPK&APOPTOSIS” cluster predominantly consists of genes from the MAPK (56 genes) and the APOPTOSIS pathways (41 genes), while other contributors to this cluster include PHOSPHATIDYLINOSITOL_SIGNALING (11 genes), CELL_CYCLE (6 genes), MTOR_SIGNALING (16 genes), WNT_SIGNALING (23 genes), HEDGEHOG_SIGNALING (3 genes), TGF_BETA_SIGNALING (5 genes), and JAK_STAT_SIGNALING (17 genes). Among the 37 cancer genes selected from “MAPK&APOPTOSIS”, 32 genes are from the MAPK pathway, 20 are from the APOPTOSIS pathway, 11 are from the WNT_SIGNALING pathway, and 8 are from the MTOR_SIGNALING pathway. Therefore, “MAPK&APOPTOSIS” mainly represents cross talk between MAPK and other cancer pathways.
Second, most genes in the “DNA-REPAIR” cluster are from pathways involved in DNA repair including BASE_EXCISION_REPAIR (12 genes), NUCLEOTIDE_EXCISION_REPAIR (20 genes) and MISMATCH_REPAIR (14 genes). A few genes in the “DNA-REPAIR” cluster are from the NON_HOMOLOGOUS_END_JOINING (2 genes) and CELL_CYCLE (3 genes) pathways. The three genes selected from the “DNA-REPAIR” cluster are contributed by the BASE_EXCISION_REPAIR (3 genes), MISMATCH_REPAIR (2 genes), NUCLEOTIDE_EXCISION_REPAIR (2 genes), and NON_HOMOLOGOUS_END_JOINING (1 gene) pathways.
Two other clusters of overlapping genes are not selected by gcInGRiD. The gene set “WNT&HEDGEHOG” mostly consists of the genes from the WNT_SIGNALING (31 genes) and the HEDGEHOG_SIGNALING (28 genes) pathways, with a few genes from the PHOSPHATIDYLINOSITOL_SIGNALING_SYSTEM (4 genes), CELL_CYCLE (1 gene), NOTCH_SIGNALING (6 genes) and TGF_BETA_SIGNALING (7 genes) pathways. Genes comprising the “TGF_BETA&CELL_CYCLE&WNT” cluster are mostly from the TGF_BETA_SIGNALING (28 genes), CELL_CYCLE (24 genes) and WNT_SIGNALING (19 genes) pathways.
Naïve InGRiD, which has no special considerations for overlapping pathways, selected twelve cancer pathways (Table 2), which are similar to those selected by gcInGRiD. However, not accounting for pathway overlap created significant difficulties in interpreting the findings from naïve InGRiD. First, it is not clear whether a pathway is selected by naïve InGRiD because of its independent association with survival or because it shares genes with other pathways that are actually associated with survival. In contrast, gcInGRiD facilitates a better understanding of the biological processes underlying cancer by separating out the independent effect of a pathway from the cross talk among pathways. Second, it is also hard to interpret top-ranking genes selected by naïve InGRiD because these genes sometimes appear in multiple selected pathways with different coefficient estimates in each pathway. In contrast, gcInGRiD does not have this issue because each gene can appear in only one gene set.
Table 2. Top pathways and genes selected by naïve InGRiD which makes no considerations for the pathway overlap issue.
Pathways are ranked based on their pathway-level LASSO coefficient estimates. The number in parentheses in the column ‘Genes Selected’ refers to the total number of genes in each gene set.
| Pathways Selected | Genes Selected | Pathway Coefficient | Top Three Genes | | |
|---|---|---|---|---|---|
| HEDGEHOG_SIGNALING | 29(56) | 0.117 | CSNK1G3 | GAS1 | BTRC |
| MTOR_SIGNALING | 1(52) | 0.089 | RPS6KA2 | ||
| NOTCH_SIGNALING | 1(47) | 0.08 | NOTCH4 | ||
| NUCLEOTIDE_EXCISION_REPAIR | 4(44) | 0.064 | POLD2 | GTF2H4 | DDB2 |
| CELL_CYCLE | 2(128) | 0.058 | MCM3 | ANAPC11 | |
| CELL_ADHESION_MOLECULES_CAMS | 59(134) | 0.057 | CD6 | CTLA4 | CLDN6 |
| JAK_STAT_SIGNALING | 80(155) | 0.052 | SOCS5 | IL21R | IFNA21 |
| PHOSPHATIDYLINOSITOL_SIGNALING | 34(76) | 0.051 | PLCG1 | PIP4K2B | IMPA2 |
| MAPK_SIGNALING | 127(267) | 0.04 | PLA2G2D | PPP3CA | RPS6KA2 |
| MISMATCH_REPAIR | 3(23) | 0.026 | POLD2 | SSBP1 | POLD3 |
| APOPTOSIS | 1(88) | 0.014 | PPP3CA | ||
| WNT_SIGNALING | 75(151) | 0.001 | PPP3CA | APC | CSNK2A2 |
| BASE_EXCISION_REPAIR | 0(35) | 0 | |||
| NON_HOMOLOGOUS_END_JOINING | 0(14) | 0 | |||
| TGF_BETA_SIGNALING | 0(86) | 0 | |||
Interestingly, dwInGRiD selected the same pathways and genes as the naïve approach (Table 3). This indicates that although down-weighting common genes reduces the similarities among pathways, it still does not fully resolve the overlap among pathways. Due to the increased interpretability of gcInGRiD compared to dwInGRiD and naïve InGRiD, we focus on gcInGRiD in the following subsections.
Table 3. Top pathways and genes selected by dwInGRiD based on the down-weighting approach.
Pathways are ranked based on their pathway-level LASSO coefficient estimates. The number in parentheses in the column ‘Genes Selected’ refers to the total number of genes in each gene set.
| Pathways Selected | Genes Selected | Pathway Coefficient | Top Three Genes | | |
|---|---|---|---|---|---|
| HEDGEHOG_SIGNALING | 29(56) | 0.059 | CSNK1G3 | GAS1 | BTRC |
| MTOR_SIGNALING | 1(52) | 0.047 | RPS6KA2 | ||
| NOTCH_SIGNALING | 1(47) | 0.038 | NOTCH4 | ||
| NUCLEOTIDE_EXCISION_REPAIR | 4(44) | 0.035 | GTF2H4 | DDB2 | POLD2 |
| PHOSPHATIDYLINOSITOL_SIGNALING | 34(76) | 0.028 | PLCG1 | PIP4K2B | IMPA2 |
| CELL_CYCLE | 2(128) | 0.028 | MCM3 | ANAPC11 | |
| CELL_ADHESION_MOLECULES_CAMS | 59(134) | 0.028 | CD6 | CTLA4 | CLDN6 |
| JAK_STAT_SIGNALING | 80(155) | 0.027 | SOCS5 | IL21R | IFNA21 |
| MAPK_SIGNALING | 127(267) | 0.017 | PLA2G2D | RPS6KA2 | PDGFRA |
| MISMATCH_REPAIR | 3(23) | 0.016 | SSBP1 | POLD2 | POLD3 |
| APOPTOSIS | 1(88) | 0.007 | PPP3CA | ||
| WNT_SIGNALING | 75(151) | 0.004 | APC | PPP3CA | CSNK2A2 |
| BASE_EXCISION_REPAIR | 0(35) | 0 | |||
| NON_HOMOLOGOUS_END_JOINING | 0(14) | 0 | |||
| TGF_BETA_SIGNALING | 0(86) | 0 | |||
4.2. Patient Subgroup Identification
After identifying a parsimonious set of cancer-associated pathways, gcInGRiD categorizes patients by counting “activated risk pathways” for each patient. Figure 2 shows the observed survival curves for patient subgroups identified by gcInGRiD and the gene index count approach. In gcInGRiD, the low-risk group had 4 or fewer activated risk pathways, whereas the high-risk group had a minimum of 9 activated risk pathways. These cut points corresponded to the 25th and 75th percentiles of the risk pathway counts. The log rank test revealed highly significant differences between groups: high versus intermediate risk groups (p = 3e-5) and intermediate versus low risk groups (p = 5e-5). The median survival times for the high, intermediate and low risk groups were 31.6 months (95% CI: 27.6 – 38.4), 41.5 months (95% CI: 38.0 – 47.7) and 70.8 months (95% CI: 51.9 – 90.1), respectively. dwInGRiD and naïve InGRiD performed similarly to gcInGRiD (data not shown). The gene index count model is also capable of identifying patient subgroups, where the median survival times of the high, intermediate and low risk groups were 32.0 months (95% CI: 29.0 – 35.4), 44.9 months (95% CI: 38.3 – 51.3) and 64.0 months (95% CI: 52.0 – 76.9), respectively.
4.3. Internal Validation by Partitioning the TCGA Dataset
To further compare the pathway-level (gcInGRiD) and the gene-level (gene index count model) approaches, we conducted internal validation by randomly partitioning the TCGA data into training (n=243) and test data (n=242) 100 times. At each iteration, we fit InGRiD and the gene index count model using the training data and evaluate their performances in predicting patient subgroups using the test data. Figure 3 shows that gcInGRiD has better predictive performance compared to the gene index count model. Specifically, the time-dependent AUC at 5 years is 0.59 (SD=0.06) and 0.49 (SD=0.06) for gcInGRiD and the gene index count model, respectively (p<0.001, t-test). Comparisons at other time points produce similar results. gcInGRiD is also more likely to separate patient subgroups based on their predicted prognostic risks. The –log10 transformed p-values of a global F-test were 1.65 (SD=1.08) and 0.41 (SD=0.44) for the pathway-level and the gene-level analyses, respectively (p<0.001, t-test). The patient subgroups identified by gcInGRiD are also more distinct from each other in terms of restricted mean death time, which is the time a patient is expected to live during the 60 months of follow-up. When the “subgroup distance” is defined as the difference in restricted mean times between the high and the low risk groups, the subgroup distance is 6.0 (SD=4.1) months and −0.4 (SD=3.8) months for the pathway-level and the gene-level analyses, respectively (p<0.001, t-test).
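The subgroup distance can be illustrated with a short Python sketch that computes the restricted mean survival time as the area under a Kaplan-Meier curve up to 60 months. Several points are assumptions: the distance is taken as the low-risk restricted mean minus the high-risk one (so that a positive value means better separation), the curve is extended flat if the last observation before the horizon is censored, and the survival times are invented toy values.

```python
def restricted_mean(times, events, tau):
    """Area under the Kaplan-Meier survival curve on [0, tau]."""
    event_times = sorted({t for t, d in zip(times, events) if d == 1 and t <= tau})
    surv, area, prev = 1.0, 0.0, 0.0
    for t in event_times:
        area += surv * (t - prev)          # rectangle up to the next event time
        at_risk = sum(1 for s in times if s >= t)
        deaths = sum(1 for s, d in zip(times, events) if s == t and d == 1)
        surv *= 1.0 - deaths / at_risk     # Kaplan-Meier step
        prev = t
    return area + surv * (tau - prev)      # remaining area up to the horizon

def subgroup_distance(low, high, tau=60.0):
    """Restricted mean of the low-risk group minus that of the high-risk group."""
    return restricted_mean(*low, tau) - restricted_mean(*high, tau)

# Toy groups given as (times in months, event indicators).
low_risk = ([50.0, 60.0, 70.0], [1, 1, 1])
high_risk = ([10.0, 20.0, 30.0, 40.0], [1, 1, 1, 1])

rm_high = restricted_mean(*high_risk, 60.0)
distance = subgroup_distance(low_risk, high_risk)
```

Without censoring and with all events before the horizon, the restricted mean equals the ordinary sample mean, which is why the high-risk toy group yields 25 months.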
Figure 3. Comparison of the predictive performances between the pathway-level analysis (gcInGRiD) and the gene-level analysis (gene index count model) using the TCGA data.
The predictive performances were compared based on the time-dependent area under the curve (AUC) at 60 months (left), the –log10 transformed p-value from global F-test (middle), and the subgroup distance (right). Subgroup distance is defined as the distance in the restricted mean event times between the high and low risk groups.
4.4. External Validation in an Independent Dataset
The performance of the pathway-level (gcInGRiD) and the gene-level (gene index count model) approaches was further evaluated using the independent AOCS data. Specifically, for gcInGRiD, the patient- and pathway-specific hazards were calculated for new observations in the AOCS data using the InGRiD parameters learned from the TCGA data. For the gene index count model, the cancer-related genes identified using the TCGA data were used to predict the prognostic risks of patients in the AOCS data. Figure 4 shows the Kaplan-Meier curves of predicted patient subgroups using the AOCS data. gcInGRiD performed best at identifying patient subgroups (p=0.00061, global F-test) compared to dwInGRiD (p=0.0415, global F-test), naïve InGRiD (p=0.0478, global F-test), and the gene index count model (p=0.00841, global F-test).
Figure 4. Kaplan-Meier curves of patient subgroups predicted by the pathway-level analysis (gcInGRiD) and the gene-level analysis (gene index count model) using the AOCS high-grade serous ovarian cancer data.
Patient subgroups are color-coded according to survival probabilities: red (high-risk of mortality), green (intermediate-risk of mortality) and black (low-risk of mortality). Values in parentheses in the legend represent the number of pathways associated with each subgroup.
4.5. Stability Analysis
Finally, we compared the stability in identifying cancer-associated molecular features and patient subgroups between the pathway-level (gcInGRiD) and the gene-level (gene index count model) approaches using the TCGA data (Figure 5). To test the stability in feature selection, we randomly sampled 50% (n=242) of the TCGA patients 100 times and evaluated the percentage of the molecular features that could still be identified using these subsamples. The pathway-level analysis showed significantly higher reproducibility in feature selection than the gene-level analysis (p<0.001, t-test). Specifically, 64% (SD=15%) of the gene sets selected by gcInGRiD (Table 1) could be reproduced. In contrast, only 15% (SD=14%) of the top-ranking genes selected by the gene index count model could be reproduced. Note that the gene index count model failed to select any genes 30% of the time, which was not the case for the pathway-level analysis.
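The subsampling procedure for feature-selection stability can be sketched in Python as follows. This is a hedged illustration only: `selection_reproducibility` and `toy_selector` are hypothetical names, the selector passed in stands in for whichever method (pathway-level or gene-level) is being assessed, and counting an empty selection as 0% reproduced is an assumption that the text does not specify.

```python
import random
from statistics import mean, stdev

def selection_reproducibility(reference, patients, select_features,
                              n_rep=100, frac=0.5, seed=0):
    """Mean and SD of the fraction of reference features re-selected.

    reference       : features selected on the full data
    patients        : list of patient records (any type)
    select_features : callable mapping a patient subsample to selected features
    Assumption: a subsample with no selected features contributes 0.
    """
    rng = random.Random(seed)
    n_sub = int(len(patients) * frac)
    ref = set(reference)
    rates = []
    for _ in range(n_rep):
        subsample = rng.sample(patients, n_sub)   # sample without replacement
        reselected = set(select_features(subsample))
        rates.append(len(ref & reselected) / len(ref))
    return mean(rates), stdev(rates)

# Toy usage: a stand-in selector that always returns the same two features.
def toy_selector(subsample):
    return {"PATHWAY_A", "PATHWAY_B"}

avg, sd = selection_reproducibility(
    ["PATHWAY_A", "PATHWAY_B", "PATHWAY_C", "PATHWAY_D"],
    list(range(485)),   # 485 placeholder patients, so 50% gives n=242
    toy_selector,
)
print(avg, sd)
```

In a real analysis the selector would refit the full model on each subsample, so the per-replicate rates would vary and the SD would be nonzero.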
Figure 5. Comparison of the stability of the pathway-level analysis (gcInGRiD) and the gene-level analysis (gene index count model) in identifying molecular features and patient subgroups.
To test the stability in subgroup identification, we randomly sampled 80% of the genes from the TCGA data 100 times and examined the percentage of patients that could still be assigned to the same subgroups identified using the complete TCGA data. The pathway-level analysis was also significantly more reproducible in identifying patient subgroups (p<0.001, t-test): 78% (SD=4%) and 71% (SD=8%) of patients were assigned to the same risk groups by gcInGRiD and the gene index count model, respectively.
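The agreement measure used for subgroup stability amounts to the fraction of patients whose risk label is unchanged after perturbing the gene set. A small Python sketch, with hypothetical patient identifiers and labels and an assumed function name `assignment_agreement`, could look like this:

```python
from statistics import mean

def assignment_agreement(reference, perturbed):
    """Fraction of patients given the same risk group as in the reference run.

    Both arguments map patient id -> risk group label.
    """
    same = sum(1 for pid, group in reference.items()
               if perturbed.get(pid) == group)
    return same / len(reference)

# Labels from the complete data (made-up example, not results from the paper).
reference = {"p1": "low", "p2": "intermediate", "p3": "high", "p4": "high"}

# Labels from two hypothetical runs on 80% gene subsamples.
runs = [
    {"p1": "low", "p2": "intermediate", "p3": "high", "p4": "intermediate"},
    {"p1": "low", "p2": "low", "p3": "high", "p4": "intermediate"},
]
per_run = [assignment_agreement(reference, r) for r in runs]
print(per_run, mean(per_run))
```

Averaging the per-run agreement over the 100 gene subsamples gives a summary of the kind reported above for each method.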
5. Discussion
In this paper, we present InGRiD, a statistical framework for the simultaneous identification of cancer patient subgroups and key molecular features. InGRiD focuses on improving the robustness of biological findings and their interpretation by utilizing pathway information and accounting for the overlap of genes among pathways. The application of InGRiD to the TCGA HGSOC data showed that InGRiD can effectively classify patients according to differences in survival probabilities, and that this risk group classification could be reproduced in an external dataset (AOCS data). In addition, InGRiD with the gene-cluster approach (gcInGRiD) not only identified key cancer pathways (e.g., MAPK and HEDGEHOG_SIGNALLING), but could also distinguish the interaction among several pathways from the independent effect of an individual pathway. Since tumors can develop resistance to therapies targeting a single pathway, statistical approaches for identifying interactions among pathways can also provide insight into the development of anti-tumor agents that target multiple pathways involved in cancer. We further demonstrated the superior performance of the pathway-level analysis over the gene-level analysis in selecting key molecular features and identifying patient subgroups, based on internal and external validations. Moreover, the pathway-level analysis proved to be more robust than the gene-level analysis in the sense that 1) it is more likely to identify similar features across datasets consisting of different patients; and 2) it is more likely to assign a patient to the same subgroup upon perturbation of molecular features.
We are currently working to improve InGRiD in the following directions. First, in this paper, we considered only gene expression data. While it is often reported that gene expression data are sufficiently informative for predicting patient risk, we also plan to integrate other genomic data types, such as copy number alterations, DNA methylation, and somatic mutations. This can potentially improve statistical power in patient risk prediction and also facilitate understanding of the biological mechanisms associated with cancer progression. Second, the patient subgroups identified by InGRiD are currently based on differences in prognostic risks; we are also working on the classification of patients based on a weighted combination of selected pathways. Third, in this paper, we conservatively focused on the KEGG pathway annotation because the KEGG database is based on human curation and has been reported to be of high quality and stability 18. We plan to consider other pathway databases such as Reactome 33, WikiPathways 34, and BioPortal Pathway Ontology 35. This can exploit larger gene sets in the prediction model and also improve the quality of gene annotations. We expect that InGRiD will be a powerful approach for investigating cancer subtypes according to their molecular features, irrespective of the tumor anatomical site, by identifying shared pathogeneses, which will offer opportunities for overlapping treatments across various cancer subtypes.
Supplementary Material
Acknowledgments
Funding
DC, LEK, GH, AL and ZS were partially supported by the NIH/NCI grant (R21 CA209848). DC, GH and AL were also partially supported by the NIH/NIGMS grant (R01 GM122078).
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
- 1. Cancer Genome Atlas Network. Integrated genomic analyses of ovarian carcinoma. Nature. 2011; 474: 609–15.
- 2. Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012; 487: 330–7.
- 3. Cancer Genome Atlas Research Network. Comprehensive molecular characterization of gastric adenocarcinoma. Nature. 2014; 513: 202–9.
- 4. Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014; 158: 929–44.
- 5. Kandoth C, Schultz N, Cherniack AD, et al. Integrated genomic characterization of endometrial carcinoma. Nature. 2013; 497: 67–73.
- 6. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012; 490: 61–70.
- 7. Mo Q, Wang S, Seshan VE, et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci U S A. 2013; 110: 4245–50.
- 8. Tyekucheva S, Marchionni L, Karchin R and Parmigiani G. Integrating diverse genomic data using gene sets. Genome Biol. 2011; 12: R105.
- 9. Vaske CJ, Benz SC, Sanborn JZ, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010; 26: i237–45.
- 10. Shen R, Olshen AB and Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009; 25: 2906–12.
- 11. Shen R, Wang S and Mo Q. Sparse integrative clustering of multiple omics data sets. Ann Appl Stat. 2013; 7: 269–94.
- 12. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform. 2015.
- 13. Kim S, Kon M and DeLisi C. Pathway-based classification of cancer subtypes. Biol Direct. 2012; 7: 21.
- 14. Chang YH, Chen CM, Chen HY and Yang PC. Pathway-based gene signatures predicting clinical outcome of lung adenocarcinoma. Sci Rep. 2015; 5: 10979.
- 15. Allison DB, Cui X, Page GP and Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006; 7: 55–65.
- 16. Hofree M, Shen JP, Carter H, Gross A and Ideker T. Network-based stratification of tumor mutations. Nat Methods. 2013; 10: 1108–15.
- 17. Eng KH, Wang S, Bradley WH, Rader JS and Kendziorski C. Pathway index models for construction of patient-specific risk profiles. Stat Med. 2013; 32: 1524–35.
- 18. Kanehisa M and Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28: 27–30.
- 19. Jiang Z and Gentleman R. Extensions to gene set enrichment. Bioinformatics. 2007; 23: 306–13.
- 20. Tarca AL, Draghici S, Bhatti G and Romero R. Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics. 2012; 13: 136.
- 21. Tothill RW, Tinker AV, George J, et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008; 14: 5198–208.
- 22. Bolstad BM, Irizarry RA, Astrand M and Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003; 19: 185–93.
- 23. Jones S, Zhang X, Parsons DW, et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science. 2008; 321: 1801–6.
- 24. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr and Kinzler KW. Cancer genome landscapes. Science. 2013; 339: 1546–58.
- 25. Winterhoff BJ, Maile M, Mitra AK, et al. Single cell sequencing reveals heterogeneity within ovarian cancer epithelium and cancer associated stromal cells. Gynecol Oncol. 2017.
- 26. Scarbrough PM, Weber RP, Iversen ES, et al. A cross-cancer genetic association analysis of the DNA repair and DNA damage signaling pathways for lung, ovary, prostate, breast, and colorectal cancer. Cancer Epidemiol Biomarkers Prev. 2016; 25: 193–200.
- 27. Verhaak RG, Tamayo P, Yang JY, et al. Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest. 2013; 123: 517–25.
- 28. Reynolds AP, Richards G, de la Iglesia B and Rayward-Smith VJ. Clustering rules: a comparison of partitioning and hierarchical clustering algorithms. J Math Model Algorithm. 2006; 5: 475–504.
- 29. Bastien P, Bertrand F, Meyer N and Maumy-Bertrand M. Deviance residuals-based sparse PLS and sparse kernel PLS regression for censored data. Bioinformatics. 2015; 31: 397–404.
- 30. Chun H and Keles S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J R Stat Soc Series B Stat Methodol. 2010; 72: 3–25.
- 31. Chung D and Keles S. Sparse partial least squares classification for high dimensional data. Stat Appl Genet Mol Biol. 2010; 9: Article 17.
- 32. Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997; 16: 385–95.
- 33. Croft D, O’Kelly G, Wu G, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011; 39: D691–7.
- 34. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res. 2016; 44: D488–94.
- 35. Whetzel PL, Noy NF, Shah NH, et al. BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res. 2011; 39: W541–5.