Cell Reports Methods. 2024 May 17;4(6):100781. doi: 10.1016/j.crmeth.2024.100781

Subtype-WGME enables whole-genome-wide multi-omics cancer subtyping

Hai Yang 1, Liang Zhao 1, Dongdong Li 1, Congcong An 1, Xiaoyang Fang 2, Yiwen Chen 3, Jingping Liu 1, Ting Xiao 1, Zhe Wang 1,4
PMCID: PMC11228280  PMID: 38761803

Summary

We present an innovative strategy for integrating whole-genome-wide multi-omics data, which facilitates adaptive amalgamation by leveraging hidden layer features derived from high-dimensional omics data through a multi-task encoder. Empirical evaluations on eight benchmark cancer datasets substantiated that our proposed framework outstripped the comparative algorithms in cancer subtyping, delivering superior subtyping outcomes. Building upon these subtyping results, we establish a robust pipeline for identifying whole-genome-wide biomarkers, unearthing 195 significant biomarkers. Furthermore, we conduct an exhaustive analysis to assess the importance of each omic and non-coding region features at the whole-genome-wide level during cancer subtyping. Our investigation shows that both omics and non-coding region features substantially impact cancer development and survival prognosis. This study emphasizes the potential and practical implications of integrating genome-wide data in cancer research, demonstrating the potency of comprehensive genomic characterization. Additionally, our findings offer insightful perspectives for multi-omics analysis employing deep learning methodologies.

Keywords: deep learning methods, molecular subtyping, whole-genome, multi-omics data integration, cancer biomarkers

Highlights

  • Subtype-WGME utilizes whole-genome omics data for precise cancer subtyping

  • Subtype-WGME integrates the MLP-Mixer to handle high-dimensional multi-omics data

  • Subtype-WGME explores the role of various omics and non-coding regions

  • Based on Subtype-WGME, we develop a robust biomarker discovery pipeline

Motivation

Many investigations have aimed to harness multi-omics data to elucidate the molecular subtypes and pathogenesis of cancer. Several cancer subtyping models based on exome sequencing have emerged that strive to integrate diverse omics information. However, protein-coding genes represent approximately 2% of the entire human genome, and growing evidence highlights the pivotal role of non-coding regions in the context of complex diseases. We therefore propose an innovative multi-omics cancer subtyping deep network to overcome the limitations of traditional models that predominantly rely on exome data. This deep network is specifically designed to comprehensively characterize the full spectrum of genomic data at the whole-genome level, facilitating a more integrated understanding of cancer biology.


Yang et al. propose a cancer subtyping method using whole-genome multi-omics data, which extracts comprehensive features that capture the molecular subtypes of cancer. They also quantify the subtyping contribution of different omics and identify prognostic-related biomarkers. This approach expands the scope of cancer subtyping studies to the genome-wide level.

Introduction

Cancer, a spectrum of complex genomic diseases, constitutes a formidable menace to human life and health.1 It is frequently characterized by gene mutations and concurrent molecular perturbations at the cellular level,2 such as alterations in gene and microRNA (miRNA) expression, as well as copy number variations. In the contemporary landscape of precision medicine, the application of targeted genomic therapies is indispensable for effective cancer treatment. However, a pronounced heterogeneity pervades both among patients with tumors and across different tumor types, and these variations can profoundly influence the clinical trajectories of patients. Cancer subtyping research endeavors to categorize cancers exhibiting similar phenotypes into distinct molecular subtypes, predicated on the molecular characteristics of tumor cells.3 These subtypes display identical biological properties and respond similarly to therapeutic interventions. The accurate delineation of molecular subtypes is of paramount importance in cancer diagnosis, prognosis, and the selection of appropriate treatments.4 Nevertheless, cancer subtyping investigations continue to present substantial challenges owing to the diversity, complexity, and specificity inherent in cancer genomics.

Cancer subtyping methods predicated on single-omics data neglect the shared attributes and disparities among patients’ multi-omics molecular profiles, thereby constraining the accuracy of the obtained results. In contrast, integrated subtyping approaches that incorporate multiple omics datasets capitalize on the complementary information derived from diverse molecular datasets to delineate cancer patients.5 In recent years, the rapid advancements in next-generation high-throughput sequencing technologies have propelled the field of cancer molecular subtyping. International cancer research endeavors, such as the International Cancer Genome Consortium (ICGC)6 and The Cancer Genome Atlas (TCGA),7 have played a pivotal role in advancing this field by providing extensive omics data and clinical information across various cancer types. These large-scale collaborative efforts have fostered the growth of cancer molecular subtyping research. Furthermore, recent studies have emphasized the importance of non-coding region data, which harbors valuable biological insights.8,9,10 Notably, The Pan-Cancer Analysis of Whole Genomes (PCAWG) project aggregated whole-genome sequencing data from 2,658 cancer cases spanning 38 tumor types. It constituted a collaborative endeavor between the ICGC and TCGA. The availability of genome-wide multi-omics data has expanded opportunities for cancer subtyping investigations.

Several compelling subtyping methodologies have emerged, which can be classified into three distinct groups (Figure 1) based on the timing of integrating multi-omics data11: early integration (Early-Integration), medium integration (Med-Integration) and late integration (Late-Integration). Early-Integration commonly involves the elementary concatenation and amalgamation of multi-omics data, fusing the data into a unified input prior to modeling, followed by the application of clustering techniques such as K-means12 for classifying the integrated data. For instance, LRAcluster13 connects the heterogeneous multi-omics data for each sample by probabilistically modeling the distribution of numerical, count, and discrete features. However, Early-Integration tends to increase data dimensionality and overlooks the variability of different omics distributions.

Figure 1. Three strategies for integrating multi-omics data in cancer subtyping

Med-Integration approaches involve the internal fusion of multiple omics data types, aiming to establish a shared subspace across different omics data. MCCA14 and nonnegative matrix factorization (NMF)15 employ dimensionality reduction algorithms to condense multi-omics data into a shared low-dimensional subspace, to maximize the correlation between features, and subsequently to perform clustering. Based on deep neural networks, Subtype-GAN16 utilizes adversarial training to independently reduce the dimensionality of each omics datum and subsequently concatenate the low-dimensional representations for clustering. Subtype-WESLR17 integrates clustering knowledge from disparate methods using a weighted ensemble strategy, preserving the local structure of the original sample feature space. It ensures consistency with the weighted ensemble while mapping sample features from each omics to a common latent subspace. DLSF18 integrates multi-omics data by learning a coherent sample manifold space through a deep cycle autoencoder (CAE) framework with self-expression layers. These Med-Integration methods facilitate the exploration of shared patterns among different omics data types, enhancing the robustness and interpretability of cancer subtyping.

Late-Integration methods involve the independent clustering of different omics data types, followed by merging the results to generate a unified outcome. PINS,19 for instance, employs perturbation clustering to compress each omics data type and constructs a connectivity matrix for fusion. SNF20 computes and merges samples from each omic level based on similarity and conducts cluster analysis. NEMO21 introduces a neighborhood multi-omics clustering algorithm based on a similarity network, constructing a similarity matrix for each omic datum and computing the average similarity matrix across all data types. SUMO22 is a unique method that leverages NMF to cluster different omics data types. It addresses the challenge of missing omics data by employing multiple quality and stability metrics to determine consensus clustering labels. By integrating these metrics, SUMO provides an effective solution for accurately assigning cluster labels to samples, even in the absence of omics data. These Late-Integration approaches enable the integration of independent clustering results from different omics data, facilitating a comprehensive understanding of the complex interplay between various molecular attributes and their impact on cancer subtyping.

In current cancer subtyping research, the focus has primarily been on utilizing coding region data. In contrast, the potential role of non-coding region multi-omics data in cancer molecular subtyping remains largely unexplored. The analysis of whole-genome data poses challenges due to limited sample sizes and the high dimensionality of the data, making existing subtyping methods less suitable for such complex inputs. In this study, we aim to address these challenges and investigate the molecular subtyping of cancer using whole-genome multi-omics data, with a specific focus on understanding the contributions of non-coding data to cancer subtyping. To tackle the intricacies of whole-genome multi-omics data, we have developed an innovative deep learning model termed subtyping with the whole genome multi-omics encoder (Subtype-WGME) (STAR Methods; Figure 2). Subtype-WGME combines the MLP-Mixer23 network and the adversarial variational autoencoder24 structure to perform unsupervised dimensionality reduction of whole-genome multi-omics data. It learns a low-dimensional latent space that is consistent across different data modalities, enabling the exploration and interpretation of subtypes. Specifically, Subtype-WGME adeptly integrates high-dimensional multi-omics data using Early-Integration and Med-Integration strategies. It leverages the MLP-Mixer network, known for its capacity to handle complex multi-omics data representations in the latent space. We assembled a large sample size by collecting data from eight cancer types from the PCAWG dataset, creating eight benchmark datasets. The results demonstrate that Subtype-WGME surpasses the current state-of-the-art methods in cancer subtyping tasks. Furthermore, based on the subtyping outcomes, we developed a biomarker discovery pipeline leveraging the random forest algorithm. 
This pipeline allowed us to assess the importance of genome-wide data in the subtyping task and to identify biomarkers derived from whole-genome data. Our findings highlight the clinical relevance of non-coding region data and provide valuable insights for further research in the field of cancer subtyping.
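To make the role of the MLP-Mixer more concrete, the following is a minimal NumPy sketch of a single Mixer block applied to one sample whose concatenated omics features have been reshaped into a (tokens × channels) matrix. It is an illustration only, not the authors' implementation: the patch layout (20 × 50), the hidden sizes, the untrained random weights, and the omission of learnable layer-norm parameters are all assumptions made here.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalize over the channel axis (no learnable scale/shift in this sketch)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    # two-layer perceptron acting on the last axis of x
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    # small random weights; a real model would learn these
    return (rng.normal(0.0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.02, (hidden, dim)), np.zeros(dim))

def mixer_block(x, token_params, channel_params):
    """x has shape (tokens, channels)."""
    # token mixing: the MLP runs across tokens, so transpose in and out
    x = x + mlp(layer_norm(x).T, *token_params).T
    # channel mixing: the MLP runs across channels within each token
    return x + mlp(layer_norm(x), *channel_params)

rng = np.random.default_rng(0)
features = rng.normal(size=1000)        # toy concatenated omics vector (assumed size)
tokens, channels = 20, 50               # hypothetical patch layout
x = features.reshape(tokens, channels)  # reshaping step before the Mixer
out = mixer_block(x, init_mlp(rng, tokens, 32), init_mlp(rng, channels, 64))
print(out.shape)  # (20, 50)
```

The block keeps the input shape, so several such blocks can be stacked before a final projection to the latent space.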

Figure 2. Summary of Subtype-WGME

(A) Subtype-WGME is an unsupervised learning framework. The encoder comprises an MLP-Mixer and a multilayer perceptron (MLP). After training, clustering is performed using a Gaussian mixture model in the hidden space.

(B) MLP-Mixer model structure.

Results

Performance on eight benchmark PCAWG cancer datasets

We analyzed eight tumor datasets obtained from PCAWG: ovarian cancer (OV), pancreatic endocrine neoplasms (PAEN), renal cell cancer (RECA), chronic lymphocytic leukemia (CLLE), esophageal adenocarcinoma (ESAD), malignant lymphoma (MALY), pancreatic cancer (PACA), and breast cancer (BRCA), with the corresponding project codes given in parentheses. These datasets included four types of omics data: RNA sequencing data (RNA), single nucleotide mutation (Mut), copy number alteration (CNA), and miRNA expression (miRNA). Gene expression levels obtained from RNA sequencing data reveal variations in gene transcript levels across samples, providing valuable insights for tumor subtype identification. We refer to RNA sequencing data as gene expression throughout the following sections. Mutation data help identify genetic mutations associated with tumor development and enable the exploration of genetic differences between tumor subtypes. CNA data assist in revealing distinct patterns of gene copy number variations in tumors. Additionally, miRNA data allow for the identification of miRNA expression patterns associated with tumor subtypes, shedding light on the regulatory role of miRNA in tumorigenesis and tumor progression. We compared the performance of Subtype-WGME, our proposed method utilizing whole-genome multi-omics data, with four commonly used methods that rely on coding region data in multi-omics subtyping analyses. The comparative algorithms consisted of the Med-Integration methods Subtype-GAN and MCCA and the Late-Integration methods NEMO and SNF. To ensure comparability across methodologies, we set an identical number of subtypes for each cancer type based on prior research, as detailed in STAR Methods. For further details on the dataset processing procedures, please refer to the STAR Methods and Tables S1 and S2.

First, we assessed the clustering results of the different methods using the −log10 p value (Figure 3A) and the number of cancer datasets with significant subtyping results (p value < 0.05) (Figure 3B). We conducted survival analysis on the subtyping results using the log-rank test, a nonparametric test for comparing the survival curves of multiple groups. Under the null hypothesis, the test statistic follows a chi-squared distribution, and the p value is computed accordingly. Subtype-WGME exhibited superior performance compared to all benchmark methods, achieving a higher median −log10 p value of 2.367 and an average of 7.06. It produced significant survival analysis results on all eight datasets: CLLE (3.01E−03), ESAD (6.24E−12), MALY (4.07E−02), OV (6.30E−11), PACA (4.55E−02), PAEN (2.27E−02), RECA (1.10E−02), and BRCA (2.19E−19). Subtype-GAN ranked second among all compared algorithms, with a median −log10 p value of 1.11 and an average of 3.49, and it displayed significant differences in survival analysis on half of the cancer datasets. NEMO ranked third, with a median −log10 p value of 0.811 and an average of 3.78, achieving significance in three datasets. SNF and MCCA followed. Overall, Subtype-WGME outperformed the other methods in survival analysis performance. Additionally, we recorded the runtime of all methods. Due to the generally small sample sizes in the datasets, deep learning methods were slower than traditional models, consistent with previous findings in coding differentiation tasks.25 Subtype-WGME had an average total running time of 25.36 s across the eight cancer datasets, which was faster than our previously proposed deep subtyping model, Subtype-GAN. When considering only the inference time, the average time further decreases to 2.16 s. Detailed survival analysis and runtime results are provided in Tables S3 and S4.
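As an illustration of the evaluation metric, the sketch below implements the log-rank test for the simplest case of two groups using only the Python standard library, and converts the resulting p value to the −log10 scale used in Figure 3A. It is a simplified stand-in: the paper compares all subtypes of a dataset at once, which requires the multi-group form of the test, and the survival times below are invented toy values.

```python
import math

def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. events are 1 (death observed) or 0 (censored).
    Returns (chi-squared statistic with 1 degree of freedom, p value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk in B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # hypergeometric variance of the number of deaths in group A
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # survival function of a chi-squared variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Toy example: group A dies early, group B dies late (all events observed).
chi2, p = logrank_two_groups([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
score = -math.log10(p)      # the quantity plotted on the x axis of Figure 3A
significant = p < 0.05      # the criterion used in Figure 3B
```

In practice a maintained survival analysis library would normally be preferred over a hand-written test; the point here is only to show where the chi-squared statistic and the −log10 transformation come from.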

Figure 3. Performance results on the benchmark datasets

(A) The comparative boxplots of survival analysis for the five methods. The x axis represents the −log10 p value, and the y axis represents the different methods. In this plot, WGME means Subtype-WGME, and GAN denotes Subtype-GAN. p value was determined via the log-rank test.

(B) Datasets on which each method obtained significant subtyping results. The y axis is the same as in (A). Gray indicates that the dataset did not achieve significant survival analysis results, while the colored regions signify significant results.

(C) Performance comparison radar plots between Subtype-WGME and two ablation analyses. Log10 scaling is applied uniformly to p values for better visualization.

(D) Performance comparison radar plots between using a single omic and using all four omics. Log10 scaling is applied uniformly to the p value for better visualization.

To assess the efficacy of Subtype-WGME in handling genome-wide high-dimensional data, we conducted two comparative analyses; the detailed results are given in Table S5. In the first analysis, we removed the early fusion pipeline of Subtype-WGME, that is, the encoding and decoding process using the MLP-Mixer, and instead encoded and decoded each single omic individually using a fully connected network. With only this single-stream network, the median −log10 p value of the subtyping outcomes across the eight datasets was 1.43, a marked decrease from the 2.367 obtained by Subtype-WGME. This underscores the substantial contribution of the dual-stream network, which incorporates both early and intermediate fusion, to subtyping performance. As illustrated in Figure 3C, comparison analysis 1 yielded significant survival differences on only four datasets (ESAD, MALY, OV, and BRCA) and performed worse than Subtype-WGME on seven datasets. In the second comparison analysis, we omitted the reshaping step for the concatenated multi-omics data and directly employed a simple fully connected network for dimensionality reduction in place of the MLP-Mixer network structure, allowing us to assess its significance within the model. The median −log10 p value of the subtyping outcomes using the fully connected network on the eight datasets was 2.035, a noticeable performance decline that supports the use of the MLP-Mixer network structure in our model. In particular, this analysis yielded significant survival differences on five datasets (ESAD, OV, PACA, PAEN, and BRCA), and Subtype-WGME achieved superior results on seven datasets, the exception being PAEN.

Finally, to explore subtyping with individual omics, we applied Subtype-WGME to each single omic and conducted exploratory analyses using only coding or only non-coding region data (Figure 3D; Table S5). The results suggest that, among single omics, the best cancer subtyping performance is achieved using RNA data alone, followed by Mut data and CNA data. Using all omics data does not consistently surpass using a single omic on every dataset, as multi-omics data introduce additional noise. Nonetheless, Subtype-WGME harnesses the complementary information across multiple omics, leading to more robust and stable results. Specifically, the median −log10 p values of the subtyping results for RNA, CNA, and Mut omics on the eight datasets were 1.725, 0.724, and 1.229, respectively. For the CLLE and MALY datasets, the best subtyping results are attained using mutation data alone. For the RECA dataset, the best subtyping results are achieved with RNA data alone, and for the PAEN dataset, with CNA data alone. Moreover, when considering only coding region data for subtyping, the result was 1.717, while subtyping with non-coding region data alone yielded a result of 0.839. Although multi-omics coding region data significantly outperformed non-coding region data, both showed an apparent decrease compared to using whole-genome data, underscoring the feasibility and necessity of whole-genome subtyping.

Construction of the genome-wide cancer biomarker discovery pipeline

Expanding upon the insights derived from Subtype-WGME, we conducted further interpretative research on the subtype results using the random forest (RF) algorithm as described in STAR Methods (Figure S2). We also investigated XGBoost and LightGBM as alternatives to the RF model (STAR Methods). We developed a pipeline for the automated exploration of whole-genome-wide biomarkers across various cancers. The process begins by splitting each dataset into a training set and a validation set at a ratio of 6:4. We use the four omics features from the training set as input for the RF model, with the subtype labels obtained from Subtype-WGME as the training output, and train the RF model on these data. Subsequently, we rank the omics features by Gini importance score and select the 50 features with the highest scores as an initial set of candidate biomarkers. For each candidate biomarker in this preliminary set, we divide the validation set samples into high- and low-expression groups based on the median expression level of the biomarker in the validation set. We then perform survival analysis on the difference between these two groups and retain the features with significant results. These selected features constitute the final set of biomarkers. Finally, a comprehensive literature review is conducted on the final biomarkers to further validate the credibility and rationale of the study.
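The selection steps that follow model training can be sketched as below. This is a hedged outline rather than the published pipeline: the importance scores are assumed to come from an already trained random forest, `survival_pvalue` is a hypothetical callable (for example, a log-rank test on the two groups of sample indices), and assigning samples exactly at the median to the low group is a choice made here, not one stated in the text.

```python
from statistics import median

def top_k_features(importances, k=50):
    """Names of the k features with the highest importance scores."""
    return sorted(importances, key=importances.get, reverse=True)[:k]

def median_split(values):
    """Indices of samples above the median (high) and at or below it (low)."""
    cut = median(values)
    high = [i for i, v in enumerate(values) if v > cut]
    low = [i for i, v in enumerate(values) if v <= cut]
    return high, low

def select_biomarkers(importances, validation_values, survival_pvalue, k=50, alpha=0.05):
    """Keep candidate features whose high/low groups differ in survival.

    importances:       {feature name: Gini importance from the trained model}
    validation_values: {feature name: list of values in the validation set}
    survival_pvalue:   hypothetical callable(high_idx, low_idx) -> p value
    """
    selected = []
    for name in top_k_features(importances, k):
        high, low = median_split(validation_values[name])
        if not high or not low:
            continue  # constant feature: no two groups to compare
        p = survival_pvalue(high, low)
        if p < alpha:
            selected.append((name, importances[name], p))
    return selected

# Toy usage with a stub in place of a real survival test.
toy_importances = {"g1": 0.5, "g2": 0.3, "g3": 0.2}
toy_values = {"g1": [1, 2, 3, 4], "g2": [5, 5, 5, 5], "g3": [1, 9, 2, 8]}
stub_test = lambda high, low: 0.01
biomarkers = select_biomarkers(toy_importances, toy_values, stub_test)
```

Keeping the survival test as a parameter makes it easy to swap in whichever test implementation is available, and it keeps the ranking and grouping logic independent of the model that produced the importance scores.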

Our proposed pipeline identified a total of 195 biomarkers across all cancer datasets, and the statistics of biomarkers and the biomarkers found in each cancer are presented in Tables S9A and S9B-I. Among them, 157 biomarkers were located in coding regions, while 38 were located in non-coding regions. When categorized based on omics features, 144 biomarkers were associated with gene expression, 3 with copy number variation, and 45 with mutations. Notably, non-coding biomarkers were predominantly associated with long non-coding RNA (lncRNA), such as CTD-2192J16.26, RP11-134G8.8, SNORD116-20, and RP11-68I3.11. It has been observed that lncRNA plays a crucial role in regulating gene expression in coding regions26,27 and is also implicated in cancer subtyping.28 Additionally, we identified antisense genes with significant correlations to patients’ survival outcomes, such as RP5-894A10.2, RP11-380L11.4, MAP3K14-AS1, and others. Notably, RBFOX1 emerged as a biomarker applicable to both PAEN and MALY cancers. Furthermore, we categorized Mut and CNA omics features into more specific regions, including coding DNA sequence (CDS), promoter core region (PromCore), 5′ untranslated region (5′ UTR), 3′ untranslated region (3′ UTR), enhancers, splice site (SS), and non-coding RNA (ncRNA) (STAR Methods). Our analysis revealed that mutations in both coding and non-coding regions can serve as effective biomarkers for cancer subtyping. For instance, mutations in the PromCore and CDS of CNTNAP2, as well as mutations in the 5′ UTR and CDS of RBFOX1, can function as biomarkers for PAEN. Similarly, mutations in the 5′ UTR and CDS region of GRID2 and mutations in the 3′ UTR and CDS of USH2A can serve as biomarkers for MALY.

In a final phase, we validated the potential biomarkers identified by our pipeline through an extensive literature review, confirming the credibility of 20 among them (Table 1; Table S9J). For esophageal cancer patients, the downregulation of ABI3BP29 was observed to inhibit cancer cell proliferation, activity, migration, and invasion, classifying it as an oncogene pivotal in the progression of growth and metastasis. In the context of MALY, PTPRD,30,31 a recognized tumor suppressor gene, exerted regulatory control over cell growth, with its mutations being proposed as promising diagnostic biomarkers for MALY. In ovarian cancer, potential biomarkers included ZNF217,32 implicated in cancer cell proliferation, invasion, and metastasis, and RBM25, an upregulated gene acting as a regulator of variable pre-mRNA splicing, thereby influencing apoptosis through the modulation of apoptotic factors. For PACA patients, the identified biomarkers comprised MMP28,33 contributing significantly to the tumor microenvironment, CYP3A5,34 serving as a predictor of treatment response, TP53BP1,35 which exhibited inhibitory effects on pancreatic tumor growth, and ROBO2,36 a regulator of TGF-β in the pancreas. Additionally, CNTNAP2,37 with recurrent mutations, emerged as a potential prognostic biomarker, while PDE4D38 was identified as an independent prognostic factor, and GPC539 correlated significantly with favorable survival outcomes. Renal cancer patients featured biomarkers such as RPL36A40 and TACC3,41 with associations with immune-related pathways and T cell depletion. For BRCA patients, GNPNAT142 surfaced as an upregulated biomarker associated with proliferation and invasiveness, CD2243 exhibited heightened expression, correlating with tumor size, DHCR2444 overexpression was linked to tumor growth promotion, and BUB345 was upregulated and negatively correlated with various immune cell types. Furthermore, low PHF246 expression was identified as significantly associated with poor prognosis, while FAM83H-AS147 and lncRNA-ATB were observed to be overexpressed in sera, demonstrating correlations with tumor metastasis, size, and lymph node metastasis and emphasizing their prognostic rather than diagnostic value.

Table 1. Biomarkers after literature validation

Biomarker     Gini score    p value     Comment
ABI3BP        1.52E−02      2.50E−03    ESAD, CNA, SS
PTPRD         8.74E−03      2.16E−02    MALY, Mut, 5′ UTR
ZNF217        7.16E−03      1.18E−03    OV, RNA
RBM25         7.07E−03      8.99E−05    OV, RNA
MMP28         5.68E−03      2.74E−02    PACA, RNA
CYP3A5        5.09E−03      2.74E−02    PACA, RNA
TP53BP1       4.81E−03      2.74E−02    PACA, RNA
ROBO2         1.33E−02      4.33E−02    PAEN, Mut, SS
ROBO2         1.10E−02      4.33E−02    PAEN, Mut, CDS
CNTNAP2       5.66E−03      1.73E−02    PAEN, Mut, 5′ UTR
PDE4D         5.16E−03      3.33E−03    PAEN, Mut, SS
GPC5          3.54E−03      1.72E−02    PAEN, Mut, SS
CNTNAP2       4.77E−03      3.72E−02    PAEN, Mut, CDS
RPL36A        1.02E−02      2.51E−02    RECA, RNA
TACC3         8.15E−03      2.67E−02    RECA, RNA
GNPNAT1       1.34E−02      1.09E−02    BRCA, RNA
CD22          7.87E−03      1.49E−02    BRCA, RNA
DHCR24        7.70E−03      3.63E−03    BRCA, RNA
BUB3          6.96E−03      7.82E−03    BRCA, RNA
PHF2          6.96E−03      1.09E−02    BRCA, RNA
FAM83H-AS1    6.56E−03      7.82E−03    BRCA, RNA

Importance analysis of the four omics

To investigate the influence of different omics data on tumor subtyping results at the genome-wide level, we summarized and quantified the Gini importance scores of each omics data type, thereby determining their contributions to the final subtyping outcomes. Because miRNA data were absent from six datasets and contributed little to the subtyping of individual cancers (with the highest percentage not exceeding 2% and an overall percentage not exceeding 1%), our study primarily focused on the contributions of RNA, Mut, and CNA omics data to the subtyping of different cancers.
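One plausible way to turn per-feature Gini importance scores into per-omics contributions is to sum the scores within each omics type and normalize to percentages. The sketch below assumes this reading; the exact aggregation used in the paper is described in its STAR Methods, and the feature names and scores here are invented for illustration.

```python
from collections import defaultdict

def omics_contributions(importances, feature_omics):
    """Percentage of total importance attributed to each omics type.

    importances:   {feature name: importance score}
    feature_omics: {feature name: omics label such as "RNA", "Mut", "CNA"}
    """
    totals = defaultdict(float)
    for feature, score in importances.items():
        totals[feature_omics[feature]] += score
    grand_total = sum(totals.values())
    return {omic: 100.0 * value / grand_total for omic, value in totals.items()}

# Toy scores (hypothetical, not taken from the paper).
toy_scores = {"geneA_expr": 0.4, "geneB_expr": 0.2, "geneC_mut": 0.3, "geneD_cna": 0.1}
toy_labels = {"geneA_expr": "RNA", "geneB_expr": "RNA", "geneC_mut": "Mut", "geneD_cna": "CNA"}
shares = omics_contributions(toy_scores, toy_labels)
# shares is approximately {"RNA": 60, "Mut": 30, "CNA": 10}
```

The same function can be reused for region-level breakdowns by passing genomic region labels (for example, CDS or 5′ UTR) instead of omics labels.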

Our analysis revealed that the relative contributions of genome-wide multi-omics data varied with cancer type (Figure 4A). Across all cancers (Figure 4B), gene expression data accounted for 58% of the total and made the largest contribution in six datasets: BRCA (84%), OV (92%), RECA (98%), PACA (78%), MALY (39%), and CLLE (78%). This result suggests that gene expression levels dominate tumor subtyping, consistent with previous findings regarding coding regions.48,49 Notably, RNA omics contributed less than 1% in both the ESAD and PAEN datasets, primarily because of the prevalence of missing values in the original RNA data. After preprocessing, all available PAEN samples had missing RNA data, and only a handful of ESAD samples retained RNA data. Attempts to subtype the ESAD dataset using RNA omics alone failed to achieve statistical significance. Despite substantial gaps in the original dataset, our model extracts complementary information from multi-omics sources for classification, which underscores the ability of Subtype-WGME to cope with missing data. Furthermore, mutation data contributed 33% overall in genome-wide cancer subtyping, whereas CNAs contributed 8%, indicating that mutation data were more important than CNAs (Table S6).

Figure 4. Analysis of omics importance in cancer subtyping

(A) Importance of RNA, Mut, and CNA in the eight cancers.

(B) The overall contribution of RNA, Mut, and CNA in all eight cancers.

(C) Importance of mutation in the pan-cancer datasets.

(D) Importance of CNA omic in pan-cancer datasets.

Moreover, the contribution of different genomic regions to the subtyping results was analyzed to assess the importance of non-coding region features that were previously overlooked. Specifically, we examined the contribution of mutations and CNAs in the above genomic regions (Table S7). Our findings revealed consistent patterns of contribution across different cancer datasets. Considering all datasets together, the CNAs in the CDS region were crucial, accounting for 36% of the contribution (Figure 4D). In three individual datasets (PACA, ESAD, and PAEN), the CNAs in the CDS region were relatively more important than in other loci. In terms of mutation data, mutations in the CDS region were also pivotal (Figure 4C), contributing to 67% among all the datasets, and in all eight datasets, mutations in the CDS region were relatively more important compared to other loci. Additionally, we observed that variations in the gene promoter and 5′ UTR also played significant roles in subtyping. In mutation data, the contribution of the promoter interval was 14%, ranking second, while the 5′ UTR contributed 10% and ranked third. The subsequent rankings were 3′ UTR (6%) and ncRNA (3%). Similarly, in copy number omics, the promoter interval accounted for 17% of the contribution, ranking second, followed by 5′ UTR (17%) in the third place. The subsequent rankings were enhancers (17%), 3′ UTR (12%), and ncRNA (4%). Promoter activity holds paramount significance in cancer subtyping as it governs the timing and expression levels of genes by interacting with transcription factors, thus dictating gene activity. Promoters have been demonstrated to play pivotal pathological roles in diverse cancers, such as BRCA50 and bladder cancer.51 Additionally, the 5′ UTR plays a pivotal role in modulating gene expression levels, and previous studies indicate that the DNA sequence of the 5′ UTR encompasses numerous cis-regulatory elements that engage with transcription factors. 
The spectrum of mutations in the 5′ UTR collectively influences multiple mechanisms of gene expression, holding significant functional implications in cancer.52

Case study on OV

We conducted a detailed analysis of the Subtype-WGME subtyping results on ovarian cancer to demonstrate the rationality of the classification. First, we plotted a heatmap of the latent feature representations against the subtype labels to observe the degree of aggregation within subtypes (Figure 5A). There were distinct boundaries between the low-dimensional features of patients from different subtypes, indicating that the subtypes were well reflected in the latent space. To further visualize the latent features of the three subtypes, we used the PCA algorithm to project the latent vectors from 256 dimensions to 2 dimensions and plotted the 2-dimensional features together with the sample subtype labels (Figure 5B). Samples belonging to the same subtype clustered together, whereas samples from different subtypes were widely separated, so the visual representation of the latent features provided an intuitive distinction between subtypes.
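The projection from 256 latent dimensions to 2 can be sketched with a PCA computed via singular value decomposition. The latent matrix below is random toy data standing in for the encoder output (85 rows, matching the number of OV patients reported in this section), so the resulting coordinates carry no biological meaning.

```python
import numpy as np

def pca_project(latent, n_components=2):
    """Project rows of `latent` onto the top principal components.

    Returns the projected coordinates and the fraction of variance
    explained by each retained component."""
    centered = latent - latent.mean(axis=0)
    # rows of vt are principal axes, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
latent = rng.normal(size=(85, 256))   # toy stand-in for 256-dimensional latent vectors
coords, ratio = pca_project(latent)
print(coords.shape)  # (85, 2)
```

Plotting `coords` colored by subtype label would give a scatter plot of the kind shown in Figure 5B; the explained-variance ratio indicates how much structure the two axes retain.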

Figure 5.

Interpretable analysis on OV cancer

(A) Heatmap depicting the differences in latent features and subtype labels.

(B) Using the PCA algorithm to visualize the hidden layer features after reducing dimensionality.

(C) Survival curves of the three subtypes of OV cancer, plotted using the Kaplan-Meier estimator.

(D) Violin plots showing the distribution of selected biomarkers’ features in the three subtypes.

Moreover, we performed Kaplan-Meier survival analysis on the OV cancer dataset to demonstrate significant differences in survival times among samples corresponding to different subtypes (Figure 5C). The three subtypes exhibited distinct survival patterns, with a substantial difference (p value = 9.6E−07). Specifically, Subtype-WGME classified OV tumors into three subtypes, denoted as subtypes C1, C2, and C3, and sample labels are listed in Table S8. Subtype-C1 comprised 59 patients, of which 54 had died, with an average survival time of 804 days; subtype-C2 has 17 patients, with only deaths and an average survival time of 1,731 days; subtype-C3 included nine patients, 3 of whom have died, with an average survival time of 871 days. These results indicate that Subtype-WGME can effectively distinguish OV tumor subtypes without prior knowledge (such as the number of subtypes and patient age). Through the above visualizations, we provided compelling evidence that the latent features extracted by Subtype-WGME contain biologically relevant information correlating with subtypes. The identified subtypes are meaningful and interpretable.
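For readers unfamiliar with the Kaplan-Meier estimate used for Figure 5C, the following is a minimal pure-Python sketch of the estimator. The function name `kaplan_meier` and the toy cohort are hypothetical; the values are not taken from the OV dataset, and the paper does not state which software produced its curves.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  -- follow-up time per patient (e.g., days)
    events -- 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve

# Toy cohort (hypothetical values).
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Computing one such curve per subtype and comparing them (for example with a log-rank test) corresponds to the kind of analysis summarized by the reported p value.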

Additionally, we utilized the Gini score to identify the top 10 features in each omics dataset and illustrated the distribution of the three subtypes based on their original expression (Figure 5D). The results showed significant differences in the distribution of these genomic features among the three subtypes, suggesting a correlation between genomic features and subtypes. Moreover, different genomic features exhibited distinct distribution patterns among the three subtypes, complementing each other and collectively contributing to the final subtyping results.

Based on the subtyping results, we applied the proposed biomarker pipeline to the OV dataset (Figure 6A). Initially, we selected a preliminary set of 50 biomarkers based on their Gini importance ranking on the training set (Figure 6B). Subsequently, we divided the samples into two subgroups based on the expression levels of these biomarkers and performed survival analysis to validate their relevance on the validation set. Among the 50 selected biomarkers, 31 exhibited significant differences in survival analysis (Figure 6C), indicating their potential as prognostic indicators. Further analysis of these 31 biomarkers revealed eight potential non-coding biomarkers with clinically relevant survival curves (Figure 6D). To elaborate, three LncRNAs, OSER1-AS1 (1.85E−03), RP11-134G8.8 (9.2E−05), and CTD-2192J16.26 (4.95E−02), as well as four antisense genes, RP11-85F14.5 (6.53E−03), RP11-3D4.2 (2.96E−02), RP11-77H9.2 (4.49E−02), and RP5-894A10.2 (1.28E−02), and one pseudogene, RP11-91I11.1 (2.12E−02), have been identified with significant relevance in the context of ovarian cancer subtyping. These findings underscore the intricate involvement of non-coding elements, such as LncRNAs, antisense genes, and pseudogenes, in the molecular landscape of ovarian cancer, highlighting their potential as crucial biomarkers in discerning distinct subtypes.

Figure 6.

Biomarker analysis on OV cancer

(A) The pipeline for genome-wide biomarker discovery in the OV dataset.

(B) The top important features obtained after training the random forest algorithm. The brackets indicate the occurrence of mutations or copy number variants in the corresponding interval.

(C) The significance of survival differences between high and low expression groups based on feature values. Aqua green color indicates significance, while the orange color indicates insignificance.

(D) The biomarker discovery pipeline identified four non-coding region biomarkers for OV cancer.

Discussion

One of the paramount objectives in cancer genomics research is the integration of multi-omics data for the analysis of cancer subtypes and the identification of cancer biomarkers. With the rapid advancements in high-throughput sequencing technologies, we can now acquire comprehensive genome-wide multi-omics data, including RNA sequencing data, CNAs, and somatic mutations. Integrating these diverse yet complementary datasets allows for a more profound understanding of cancer progression and metastasis mechanisms. This study introduces an innovative whole-genome cancer subtyping method named Subtype-WGME, which is based on genome-wide data. This method combines multi-omics data to extract essential biological information from the vast high-dimensional multi-omics data and to capture higher-level complex associations within low-level features. To the best of our knowledge, Subtype-WGME is the first method that utilizes genome-wide multi-omics data for cancer molecular subtyping. By employing this approach, we have achieved optimal performance on eight cancer datasets from the PCAWG project. Moreover, our method demonstrated significance across all eight datasets. We also successfully identified 90 biomarkers that significantly impact patient survival time. These findings highlight the ability of Subtype-WGME to accurately characterize large-scale genome-wide multi-omics data and its enormous potential in discovering potential cancer targets on a genome-wide scale.

Subtype-WGME introduces an innovative deep learning architecture capable of integrating multiple inputs. In contrast to previous methods, it leverages both Early-Integration and Med-Integration data as inputs, employing different encoders tailored to each input type. This unique design allows the model to harness the strengths of both input modes, leading to improved performance. Particularly for high-dimensional multi-omics data, Subtype-WGME innovatively transforms the data into matrix form and applies MLP-Mixer for feature modeling. This approach reduces complexity and facilitates the modeling of omics information across multiple omics layers. Furthermore, Subtype-WGME incorporates regularization techniques into its output. Employing a discriminator with variational capabilities maps the latent features from diverse omics data, characterized by different distributions, into a unified space. The model robustly extracts multi-scale biological information from genome-wide multi-omics data throughout the training process. The resulting low-dimensional features adhere to a Gaussian distribution, enabling a unified feature representation that dramatically enhances the performance of subsequent clustering tasks.

We propose a biomarker discovery pipeline based on the RF algorithm in response to the subtyping results. This pipeline integrates genome-wide feature analysis and deep learning models to systematically explore features at various levels in the cancer genome, intending to identify potential cancer biomarkers. By considering both coding and non-coding region features, we gain a more comprehensive understanding of the complexity and diversity of cancer, thereby revealing previously undiscovered biological mechanisms. The pipeline enables the high-throughput discovery of potential cancer biomarkers. In addition to coding region features, we have identified a substantial number of cancer biomarkers in the non-coding regions, including antisense genes and lncRNA. Besides, we found that the coding and non-coding region features of CNTNAP2 and RBFOX1 can be used as biomarkers simultaneously. This finding challenges the conventional notion of exclusively focusing on protein-coding regions in biomarker studies. It offers a more profound understanding of the mechanisms and functional implications of gene regulation in cancer development. Unveiling these refined biomarkers constitutes a groundbreaking discovery that has yet to be explored in existing research, promising to propel the field of cancer precision medicine forward. These findings demonstrate the feasibility of employing deep learning models in genome-wide biomarker discovery. Further validation of these biomarkers will contribute to the realization of gene-specific therapies for cancer patients and propel the development of precision medicine, opening up exciting prospects for improved patient outcomes.

Furthermore, we quantitatively assessed the contributions of the four types of omics data involved in the analysis to cancer subtyping. The task of subtype delineation was primarily influenced by RNA sequencing data, followed by single nucleotide mutation data and CNA. However, the contribution of miRNA features was relatively low, which could be due to the large number of missing miRNA features in the data published by PCAWG. Gene expression data provide crucial information about gene activity within cancer cells and play a vital role in unraveling tumors’ molecular characteristics and dysregulations. Even at the genome-wide level, RNA sequencing data maintain their decisive role in the process of cancer subtyping. Although copy number data exhibit relative abundance, they are susceptible to substantial noise. On the other hand, gene mutations possess higher specificity, rendering them more influential than copy number variations. Additionally, a more granular analysis of somatic mutations and copy number variations revealed that changes occurring in the coding regions were paramount in determining the subtyping results. Such coding region variations directly impact gene mutations and structural alterations, influencing gene function and regulatory mechanisms. Simultaneously, the role of promoters in subtyping results proved to be of utmost significance. Promoters constitute crucial sequence regions that regulate gene transcription, controlling the initiation timing and extent of gene expression through interactions with transcription factors. It is important to note that while the significance of non-coding regions received preliminary validation in our study, further study is required to delve deeper into their specific mechanisms and functions.

Limitations of the study

Although Subtype-WGME has achieved significant accomplishments in genome-wide data analysis, it also has certain limitations. Firstly, the presence of missing data can restrict the performance of the subtyping method, even to the extent of impeding its proper functioning. Currently, Subtype-WGME addresses missing data through a simple imputation method without considering the interrelationships among multiple omics data. The presence of missing data can potentially impact the accuracy of the model. This limitation will be addressed by leveraging the rapidly evolving AI-generated content model, which aims to complement missing data at the multi-omics level. Secondly, due to the nature of deep learning models, Subtype-WGME requires adequate training data to achieve optimal cancer subtyping performance. Insufficient training samples may adversely affect the model’s final performance. However, the availability of comprehensive genome-wide multi-omics data currently needs to be improved. Therefore, future study efforts will focus on enhancing the performance of Subtype-WGME through data augmentation techniques that leverage multi-omics integration. These considerations highlight areas for improvement and further exploration to overcome the limitations of Subtype-WGME and to enhance its effectiveness in cancer subtyping.

STAR★Methods

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

Gene expression, somatic mutation, miRNA expression, copy number alteration ICGC Data Portal https://dcc.icgc.org/releases/PCAWG/

Software and algorithms

R v4.2.3 The R Foundation https://www.r-project.org/
Subtype-WGME This paper https://zenodo.org/doi/10.5281/zenodo.11044332
NEMO Rappoport et al.21 https://github.com/Shamir-Lab/NEMO
Subtype-GAN Yang et al.16 https://github.com/haiyang1986/Subtype-GAN
MCCA from PMA package v1.2.1 Lee et al.14 https://cran.r-project.org/web/packages/PMA
SNF v2.3.1 Wang et al.20 https://cran.r-project.org/src/contrib/Archive/SNFtool/

Other

Code to reproduce comparison of various methods This Paper https://github.com/zhaol233/Subtype-WGME/tree/master/other
Capsule website for reproducing Subtype-WGME Code ocean https://codeocean.com/capsule/4686472/tree/

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Zhe Wang (wangzhe@ecust.edu.cn).

Materials availability

This study did not generate new unique reagents.

Data and code availability

  • This paper analyzes existing, publicly available datasets processed and hosted on ICGC data portal https://dcc.icgc.org/releases/PCAWG/. Information is also listed in the key resources table.

  • All original code has been deposited at GitHub and is publicly available as of the date of publication. Identifiers are listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Method details

Method overview

Subtype-WGME leverages comprehensive genome-wide multi-omics data, including gene expression (RNA), miRNA, copy number (CNA), and mutation data (Mut) obtained from PCAWG cancer samples. To enhance the modeling capacity of sample features in the context of genome-wide multi-omics data, we have developed a multi-task auto-encoder framework with MLP-Mixer serving as the core encoder. This framework facilitates the fusion of reduced-dimensional holistic and single-omics features within a unified hidden Gaussian space, effectively capturing both the global associations and local variations of the samples. Consequently, the downstream tasks exhibit superior performance when dealing with high-dimensional and heterogeneous biological data. The resulting low-dimensional representations serve as feature representations for the sample’s multi-omics data, which are subsequently clustered using a Gaussian mixture model. A graphical representation of the Subtype-WGME model can be found in Figure 2.

Architecture of Subtype-WGME

The Subtype-WGME framework comprises three essential processes: encoding (E), discriminating (D), and decoding (D). The encoding process aims to reduce the dimensionality of the input features while retaining biological information. It comprises a pipeline for the integrated multi-omics input (MLP-Mixer) and a pipeline for each individual omics data type (Each-omics). In the MLP-Mixer pipeline, the input is first reshaped into a two-dimensional matrix and then encoded by the MLP-Mixer Module, which adapts better to high-dimensional data. In the Each-omics pipeline, features of different dimensions are aligned using an omics-specific fully connected layer that reduces them to the corresponding latent space. The encoding process also includes a fusion layer that merges the outputs of both pipelines: the two outputs are concatenated, an MLP block reduces the dimensionality, and Batch Normalization ensures a consistent distribution of the hidden layer features. The GELU activation function (σ) is also used to enhance the fusion layer’s nonlinear representation capability. Let Xij denote the j th omics data of the i th sample, X:j the features of the j th omics, Xi the features of the i th sample, and Xa the overall input obtained after combining the multi-omics data. The encoding process can then be expressed as follows:

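$$\mathrm{bn}(x_i) = \frac{x_i - \mathrm{E}[x_i]}{\sqrt{\mathrm{Var}[x_i] + \varepsilon}} \cdot \gamma + \beta$$
$$\sigma(x) = 0.5\,x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right]\right)$$
$$Z = \sigma\left(\mathrm{bn}\left(w_f\,\mathrm{concat}\left(E_j(X_j) + E_a(X_a)\right) + b_f\right)\right)$$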

Where wf and bf are the parameters of the fusion layer, Ej and Ea denote the encoders for each omics and for the multi-omics features, respectively, Z denotes the extracted hidden layer features, and concat denotes stitching together the latent representations of each omics.
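The fusion step can be illustrated with a small pure-Python sketch. It assumes the standard tanh approximation of GELU, batch normalization with γ = 1 and β = 0, a single output unit, and simple concatenation of the two pipeline outputs; the function names and toy inputs are hypothetical and do not come from the released code.

```python
import math

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def batch_norm(column, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the batch, then scale and shift."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return [(v - mean) / math.sqrt(var + eps) * gamma + beta for v in column]

def fuse(each_omics, multi_omics, w_f, b_f):
    """Concatenate per-sample features, apply one linear unit, batch norm, GELU."""
    pre = []
    for e, a in zip(each_omics, multi_omics):
        joined = e + a  # list concatenation of the two pipeline outputs
        pre.append(sum(w * v for w, v in zip(w_f, joined)) + b_f)
    return [gelu(v) for v in batch_norm(pre)]

each_omics = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8]]  # hypothetical Each-omics outputs
multi_omics = [[0.4], [-0.6], [0.1]]                  # hypothetical MLP-Mixer outputs
z = fuse(each_omics, multi_omics, w_f=[0.5, -0.2, 0.3], b_f=0.1)
```

In a real model the linear unit would be a full weight matrix producing the 256-dimensional latent vector; one unit is used here only to keep the arithmetic visible.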

In the discriminating process, we apply variational processing to the hidden layer features to ensure that the encoder’s output follows a specific probability distribution in the low-dimensional latent space. We select the standard normal distribution N(0,1) as the target distribution, since a wide range of data distributions can be approximated by a combination of normal distributions. The discriminator judges whether the hidden features were drawn from the target normal distribution. Instead of using the Kullback-Leibler (KL) divergence, we employ adversarial training with binary cross-entropy (BCE) as the loss function.

In the decoding process, we reconstruct the hidden layer features to restore the original input. Similar to the encoding process, the decoding stage comprises two pipelines. These pipelines individually reconstruct the hidden layer features into the original omics data and the reshaped two-dimensional matrix resulting from the combined omics. Through this multi-tasking approach, our goal is to enhance the model’s learning capabilities. The overall loss function of the model consists of three components, as described as follows:

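$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{m} w_i\,(y_i - \hat{y}_i)^2$$
$$\mathrm{BCE} = -(1 - y)\log(1 - \hat{y}) - y\log\hat{y}$$
$$l_1 = \mathrm{MSE}(X_a, D_a(Z))$$
$$l_2 = \mathrm{KL}\left(p_\theta(Z \mid X)\,\|\,\mathcal{N}(0,1)\right) \approx \mathrm{BCE}(\mathrm{Desc}(Z), 0)$$
$$l_3 = \sum_{j=1}^{n} \mathrm{MSE}(D_j(Z), \hat{X}_j)$$
$$L = \lambda_1 l_1 + \lambda_2 l_2 + \lambda_3 l_3$$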

Where Dj denotes the decoder for the j th omics, Da denotes the multi-omics decoder, l1 denotes the reconstruction loss of the MLP-Mixer pipeline, l2 denotes the loss of the discriminator, l3 denotes the reconstruction loss of the Each-omics pipeline, and λ1, λ2, and λ3 are hyperparameters that weight each loss term.
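As a worked illustration of how the three terms combine, the sketch below evaluates the weighted sum on toy numbers. It assumes uniform MSE weights (wi = 1), a discriminator target label of 0 as in the l2 term, and that the weights 1, 1, and 0.01 reported under Experimental environment map to λ1, λ2, and λ3 in that order; this mapping and all function names are assumptions, not confirmed by the source.

```python
import math

def mse(target, pred):
    """Mean squared error with uniform weights."""
    return sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one label/probability pair."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(1 - y) * math.log(1 - y_hat) - y * math.log(y_hat)

def total_loss(x_a, x_a_rec, disc_out, omics, omics_rec, lambdas=(1.0, 1.0, 0.01)):
    l1 = mse(x_a, x_a_rec)                                   # MLP-Mixer reconstruction
    l2 = bce(0, disc_out)                                    # discriminator term
    l3 = sum(mse(x, r) for x, r in zip(omics, omics_rec))    # per-omics reconstruction
    lam1, lam2, lam3 = lambdas
    return lam1 * l1 + lam2 * l2 + lam3 * l3

loss = total_loss(
    x_a=[1.0, 2.0], x_a_rec=[1.0, 1.0],      # l1 = 0.5
    disc_out=0.5,                             # l2 = log 2
    omics=[[1.0, 0.0]], omics_rec=[[0.0, 0.0]],  # l3 = 0.5
)
```

With these toy inputs the total is 0.5 + log 2 + 0.01 × 0.5, which shows how a small λ3 down-weights the per-omics reconstruction term.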

MLP-Mixer Module

To address the challenges of high-dimensional multi-omics input, where fully connected layers alone may not effectively extract the underlying nonlinear biological associations and incur high computational complexity, we reshape the multi-omics data into a two-dimensional matrix, which can be treated as an image, for input representation. To reshape the data into an image of regular size (H, W), we pad the overall data with 1 to achieve the desired size. Subsequently, we partition the image into equal-sized patches of side length p and reshape them into a matrix $X_a \in \mathbb{R}^{B \times N \times p^2}$, where B represents the batch size. We employ a convolutional network to embed the matrix during the encoding process. MLP-Mixer takes the resulting sequence of uniformly embedded patches as input. Each sample yields a two-dimensional representation:

$$H = W = \left\lceil \frac{\sqrt{D}}{p} \right\rceil \cdot p$$
$$N = \frac{HW}{p^{2}}$$

Where D represents the sum of the dimensions of all omics, N indicates the number of patches per sample.
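A short numeric sketch of the image sizing may help. It assumes the side length is obtained by rounding √D / p up to the next integer and multiplying by p, so that the square image holds all D features and divides evenly into p × p patches; the helper name `patch_grid` and the example values are hypothetical.

```python
import math

def patch_grid(total_dim, patch):
    """Return (side length H = W, number of patches N, number of padded cells)."""
    side = math.ceil(math.sqrt(total_dim) / patch) * patch
    n_patches = (side * side) // (patch * patch)
    padding = side * side - total_dim
    return side, n_patches, padding

side, n_patches, padding = patch_grid(total_dim=1000, patch=8)
```

For 1,000 features and 8 × 8 patches this gives a 32 × 32 image with 16 patches and 24 padded cells.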

The MLP-Mixer architecture is structured with several mixer layers of equal size, each composed of two pivotal components. The token mixer, where token denotes the smaller matrix in the image, strategically operates on the columns of the input tensor Xa. Its primary objective is to facilitate inter-token communication by adeptly fusing spatial information. Notably, the parameters of this process are shared across all columns to ensure model parameter efficiency. Concurrently, the channel mixer operates on the rows of the input tensor Xa, fostering information exchange between different channels, with shared parameters across all rows. Each mixer process comprises two fully connected layers. Following the initial fully connected layer, LayerNorm is applied to normalize features, enhancing model stability. Subsequently, the second fully connected layer undergoes processing, incorporating the nonlinear activation function GELU to introduce the model’s nonlinear transformation capability. This meticulous design permits seamless information flow and diverse nonlinear transformations within MLP-Mixer. Collectively, the architecture of MLP-Mixer is meticulously crafted to efficiently model and learn from input data. The encoding process of the MLP-Mixer can be summarized as follows:

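$$U_{*,i} = X_{*,i} + W_2\,\sigma\left(W_1\,\mathrm{LayerNorm}(X)_{*,i}\right), \quad i = 1 \ldots C,$$
$$Y_{j,*} = U_{j,*} + W_4\,\sigma\left(W_3\,\mathrm{LayerNorm}(U)_{j,*}\right), \quad j = 1 \ldots N.$$
$$Z_a = \mathrm{mean}(Y_{*,i}), \quad \text{for } i = 1 \ldots C.$$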

Where W1 and W2 denote the parameters of the two fully connected layers of the MLP block in the token mixer process, and W3 and W4 denote the parameters of the two fully connected layers of the MLP block in the channel mixer process.
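One mixer layer can be sketched in NumPy as below. This is an illustrative reading of the token-mixing and channel-mixing steps with residual connections on the layer input, not the released implementation (which relies on PyTorch and Timm); the shapes, random weights, omitted biases, and LayerNorm without learned scale or shift are all assumptions.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU, applied elementwise."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-5):
    """Normalize over the last (channel) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mixer_layer(x, w1, w2, w3, w4):
    """x has shape (N patches, C channels)."""
    # Token mixing: the same MLP acts on every column, mixing across patches.
    u = x + w2 @ gelu(w1 @ layer_norm(x))
    # Channel mixing: the same MLP acts on every row, mixing across channels.
    return u + gelu(layer_norm(u) @ w3) @ w4

rng = np.random.default_rng(0)
N, C, hidden_tok, hidden_ch = 16, 32, 8, 64
x = rng.normal(size=(N, C))
w1 = rng.normal(size=(hidden_tok, N)) * 0.1
w2 = rng.normal(size=(N, hidden_tok)) * 0.1
w3 = rng.normal(size=(C, hidden_ch)) * 0.1
w4 = rng.normal(size=(hidden_ch, C)) * 0.1
y = mixer_layer(x, w1, w2, w3, w4)
z_a = y.mean(axis=0)  # average over patches: one C-dimensional vector per sample
```

Because both sub-blocks are residual, setting all weights to zero returns the input unchanged, which is a convenient sanity check.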

Gaussian mixture model

Subtype-WGME utilizes the extracted feature representations to analyze complex omics data. In this study, the Gaussian mixture model (GMM) is employed for clustering. A GMM is a linear combination of multiple Gaussian distribution functions whose parameters (i.e., mixture weights, means, and variances) are estimated by maximizing the likelihood function. It can fit various types of distributions and is commonly used when data within the same dataset follow different distributions or the same distribution with different parameters. Let X be a random variable; the GMM can be represented by the following equation:

$$p(x) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Where each component is denoted by N(x|μk,Σk), and πk is the mixture coefficient.

During the clustering process, GMM assumes that the spatial probability distribution of input features conforms to a mixed Gaussian distribution. The K components in the GMM correspond to the K clusters, making it a soft clustering method. The number of clusters needs to be specified. Parameter estimation uses the Expectation Maximization (EM) algorithm to determine the values of πk, μk, and Σk. Subsequently, the GMM assigns to each sample a probability of belonging to each cluster. After training, each sample receives the subtype label of the cluster for which its probability is highest.
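The final assignment step can be illustrated with a one-dimensional sketch. The study itself used the Gaussian mixture implementation in scikit-learn on multivariate latent features; the code below only shows how posterior probabilities are computed from already-fitted parameters and turned into hard labels. All parameter values and function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, sigmas):
    """Posterior probability of each component given x (the E-step of EM)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]

def assign(x, weights, means, sigmas):
    """Index of the component with the highest posterior probability."""
    post = responsibilities(x, weights, means, sigmas)
    return max(range(len(post)), key=post.__getitem__)

weights, means, sigmas = [0.5, 0.3, 0.2], [-2.0, 0.0, 3.0], [1.0, 1.0, 1.0]
labels = [assign(x, weights, means, sigmas) for x in (-1.8, 0.1, 2.5)]
```

The soft probabilities from `responsibilities` are what make GMM a soft clustering method; taking the arg max converts them into subtype labels.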

Benchmark method

The scripts for comparing various methods on this benchmark are available at https://github.com/zhaol233/Subtype-WGME. Subtype-GAN was implemented directly from our previous work. NEMO, SNF, and MCCA were run in R (version 4.2.3). For NEMO, we downloaded the source code from https://github.com/Shamir-Lab/NEMO and ran the nemo.clustering function. For SNF, we used the SNFtool package (version 2.3.1), and for MCCA, we used the PMA package (version 1.2.1) and its PMA::MultiCCA function. The parameter settings for the comparison algorithms followed the recommended default configurations.

Random forest

A random forest is a collection of decision trees, where each decision tree is built using a subset of features randomly selected from the training set. Each tree is trained independently and makes predictions based on a different subset of features. The training process of a random forest is depicted in Figure S2. The random forest algorithm has three primary hyperparameters: node size, number of trees, and the number of features sampled. This study implemented the random forest algorithm using the scikit-learn library, with n_estimators set to 100 and max_depth set to 10. In a decision tree, a node represents a condition on the dataset: using one feature and a corresponding threshold, each node partitions the data into two subsets. This process repeats, creating child nodes, until a predefined stopping criterion is met. A leaf is a terminal node that is not split further and yields a classification or regression outcome. To gauge the importance of each feature, we calculate the mean decrease in node impurity across all decision trees. Impurity reflects how mixed the classes are within a node and is commonly measured using the Gini index or information gain. The importance of a feature is therefore the cumulative reduction in node impurity produced by splits on that feature across the trees, which measures its contribution to the classification of subtypes. For a node m, the Gini index is calculated using the following formula:

$$GI_m = 1 - \sum_{k=1}^{|K|} p_{mk}^{2}$$

Where |K| represents the number of classification categories, and pmk represents the proportion of category k in node m.
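The Gini index and the impurity decrease of a single split can be computed directly, as in this small sketch. The helper names and the toy labels are hypothetical; scikit-learn aggregates such weighted decreases over all splits and trees to produce its impurity-based feature importances.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Reduction in Gini impurity achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["C1"] * 4 + ["C2"] * 4
decrease = impurity_decrease(parent, ["C1"] * 4, ["C2"] * 4)
```

A perfectly mixed two-class node has a Gini index of 0.5, and a split that separates the classes completely removes all of that impurity.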

Determining the number of cancer subtypes

In determining the number of cancer subtypes, we adhered to a standardized criterion by drawing insights from prior studies. Specifically, Subtype-WELSR53 utilized the silhouette index to ascertain the optimal number of ovarian cancer subtypes as 3 through clustering. Bloehdorn54 employed ensemble clustering on CLL 89 GEP data, achieving optimal subtype differentiation at k = 6. Liu,55 through proteomic analysis, classified esophageal cancer into two molecular subtypes: S1 and S2. Notably, the S2 subtype, characterized by the upregulation of spliceosomal and ribosomal proteins, exhibits a more aggressive nature. Loeffler-Wirth,56 referencing the 5th edition of the WHO classification of haemato-lymphoid tumors in 2022, categorized lymphoma into three main subtypes: BL, DLBCL, and FL. Zhao57 analyzed the whole transcriptome data of over 1,200 pancreatic cancer patients, identifying six subtypes through the nonnegative matrix factorization (NMF) clustering method to explore molecular heterogeneity. DLSF18 utilized spectral clustering on hidden layer features and eigengap analysis to delineate four subtypes of kidney cancer. Lehmann58 employed the TNBCtype tool, categorizing TCGA breast cancer data into four categories based on centroid correlation.

Biomarker discovery method analysis

The choice of the random forest algorithm is grounded in our specific context—dealing with high-dimensional data and a limited sample size. Random forests demonstrate excellence in handling smaller datasets with a large number of features. The ensemble nature of the algorithm mitigates the risk of overfitting, especially in scenarios with limited samples, by aggregating predictions from multiple subtrees and avoiding undue reliance on individual instances. This algorithm is widely applied in biomedical data analysis,59,60 aiding in the evaluation of feature importance. While continuing to explore alternative algorithms, such as XGBoost61 and LightGBM,62 for biomarker discovery, we conducted a specific comparison of the preselected biomarkers identified by these three algorithms. This comparison involved assessing the number of top 50 features, scored by each algorithm, that exhibited significant survival analysis results on the validation set (Table S10). Across 8 datasets, the random forest algorithm identified a total of 195 significant features, XGBoost 117, LightGBM 139. Notably, the random forest algorithm outperformed the other two algorithms in terms of the number of significant features found in 6 datasets. This highlights the robust capabilities of the random forest algorithm in our specific scenario.

Quantification and statistical analysis

Data preprocessing

The dataset used in this study was obtained from PCAWG (https://dcc.icgc.org/pcawg). However, due to missing data, some cancer samples lacked clinical information, rendering it impossible to validate the model’s classification accuracy. The sample screening process is depicted in Figure S1A. After eliminating samples with missing clinical data, the study focused on eight cancer datasets with larger sample sizes: 91 BRCA tumors, 85 OV tumors, 83 PAEN tumors, 89 RECA tumors, 94 CLLE tumors, 98 ESAD tumors, 101 MALY tumors, and 235 PACA tumors. For the complete number of available samples for PCAWG cancers, please refer to Table S1. The dimension of genome-wide omics data is immense compared to the sample size, with 121,311 dimensions for Mutations (Mut), 1,864 dimensions for miRNA, 134,074 dimensions for Copy Number Alterations (CNA), and 112,171 dimensions for RNA data (Figure S1D). Moreover, the data distribution is highly imbalanced (Figure S1C). During the feature selection process (Figure S1B), we initially applied a logarithmic transformation to the RNA and miRNA data. Since the corresponding omics data were partially missing in most samples, we filled the missing data with zeros. Subsequently, we performed variance filtering (with a threshold of 0.2) to select the omics features for the eight cancer types of interest. This filtering step effectively reduces the input’s dimensionality (Table S2). Lastly, we applied Z score normalization to standardize the features by subtracting their mean and scaling them to unit variance.
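The preprocessing steps can be sketched in plain Python as follows. Several details are assumptions because the text does not specify them: the logarithm is taken as log2(x + 1), missing values are represented by `None`, and the population variance is used for filtering. The function name `preprocess` and the toy matrix are hypothetical.

```python
import math
import statistics

def preprocess(matrix, log_transform=True, var_threshold=0.2):
    """matrix: list of samples, each a list of feature values (None = missing).

    Returns the indices of retained features and the z-scored matrix.
    """
    # Steps 1 and 2: log-transform observed values, fill missing entries with zero.
    filled = [
        [0.0 if v is None else (math.log2(v + 1.0) if log_transform else float(v))
         for v in row]
        for row in matrix
    ]
    columns = list(zip(*filled))
    # Step 3: keep features whose variance exceeds the threshold.
    kept = [j for j, col in enumerate(columns)
            if statistics.pvariance(col) > var_threshold]
    # Step 4: z-score each retained feature (zero mean, unit variance).
    scaled_cols = []
    for j in kept:
        col = columns[j]
        mean, std = statistics.fmean(col), statistics.pstdev(col)
        scaled_cols.append([(v - mean) / std for v in col])
    return kept, [list(row) for row in zip(*scaled_cols)]

toy = [[0.0, 3.0, 5.0],
       [1.0, 3.0, None],
       [7.0, 3.0, 15.0]]
kept, scaled = preprocess(toy)
```

In the toy matrix the constant second feature is removed by the variance filter, while the other two features are kept and standardized.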

Explanation of CNA and Mut regions

While constructing the biomarker identification pipeline, we used a dataset consistent with the subtyping task. For copy number alterations (CNA) and somatic mutation features, the PCAWG dataset provided locus information for each feature (Figure S3). The features included Coding DNA Sequence (CDS), Promoter Core Region (PromCore), 5′ untranslated region (5′UTR), 3′ untranslated region (3′UTR), Enhancers, non-coding RNA (ncRNA), and splice site (SS).

Gene structure primarily comprises two main regions: the coding and non-coding regions. The general DNA sequence from the transcription start site to the transcription termination site is called the protein-coding region sequence (CDS). In eukaryotes, the coding region is discontinuous and consists of exons and introns. Exons represent the DNA segments retained after preRNA undergoes splicing or modifications, and they ultimately appear in the mature RNA gene sequence. The non-coding region is vital in gene expression regulation and encompasses various functional elements such as promoters, enhancers, and UTR. Promoters are specific DNA regions that initiate transcription and are typically located upstream of the gene’s transcription start site. During transcription, RNA polymerase and transcription factors recognize and bind to specific DNA sequences within the promoter region, thus initiating transcription. On the other hand, enhancers are DNA sequences typically located at the transcription start site or within a 1 Mbp range downstream of the gene. Transcriptional activators can bind them. By binding to enhancers, these transcription factors increase the probability of gene transcription. Enhancers are found extensively in the gene structures of both prokaryotes and eukaryotes. UTR is also part of the non-coding region and can be transcribed without being translated into proteins. UTR is present at the coding region’s 5′ and 3′ ends. The 5′UTR refers to the sequence located upstream of the coding region, while the 3′UTR refers to the downstream sequence. UTR plays important roles in post-transcriptional regulation, including RNA stability, localization, and translation efficiency. In predicting biomarkers, enhancer biomarkers were not considered in this study because the features in the Enhancer region could not be directly mapped to corresponding genes. 
Additionally, when calculating the contribution of different loci to subtyping, the Splice Site feature, located within the CDS region, was categorized within the CDS intervals for comparison.

Experimental environment

During the training process of Subtype-WGME, we employed the backpropagation algorithm to optimize the parameters of the entire model iteratively. The objective was to minimize the losses of the discriminator and decoder, with the loss function hyperparameters set to 1, 1, and 0.01, respectively. We utilized the Adam optimizer with a learning rate of 0.005 for optimization. The training was performed with a batch size of 64. To determine the optimal number of training epochs, we used early stopping. Upon completing the training process, we obtained each sample’s low-dimensional hidden feature representations, which are crucial for subsequent analysis.
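The early-stopping rule is not specified in the text, so the sketch below shows one common variant under stated assumptions: training stops once the monitored loss has failed to improve for a fixed number of consecutive epochs. The class name, the `patience` value, and the loss sequence are hypothetical.

```python
class EarlyStopping:
    """Stop when the loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
losses = [1.0, 0.8, 0.81, 0.79, 0.80, 0.82, 0.5]  # made-up per-epoch losses
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In practice the model weights from the best epoch would be saved and restored before extracting the latent features.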

In Subtype-WGME, the Gaussian mixture clustering method was implemented with scikit-learn version 0.24.2. The other packages were PyTorch 1.9.0, timm 0.5.4, pandas 1.2.4, and NumPy 1.24.2. The analysis was run on Google Colab under Ubuntu 20.04 Linux, with an Intel(R) Xeon(R) CPU @ 2.30GHz and an NVIDIA Tesla T4 GPU.
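
For readers who want to reproduce this environment, a small check like the following may help confirm that the installed packages match the reported versions. It is only a sketch: the distribution names in `REPORTED_VERSIONS` are assumed PyPI names (for example, `torch` for PyTorch), and `version_mismatches` is a hypothetical helper, not part of the published code.

```python
from importlib import metadata

# Versions reported in the text; the PyPI distribution names are assumptions.
REPORTED_VERSIONS = {
    "scikit-learn": "0.24.2",
    "torch": "1.9.0",
    "timm": "0.5.4",
    "pandas": "1.2.4",
    "numpy": "1.24.2",
}


def version_mismatches(expected=REPORTED_VERSIONS):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Compare only the part before "+" so local build tags such as "+cu111" pass.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


report = version_mismatches()
```

An empty `report` means every listed package is installed at the reported version; otherwise each entry pairs the expected version with the one found, or `None` if the package is missing.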

Acknowledgments

This work is supported by the National Key Research and Development Program of China under grant no. 2022YFB3203500, the Natural Science Foundation of China under grant nos. 61902126 and 62076094, the Shanghai Science and Technology Program’s “Distributed and Generative FewShot Algorithm and Theory Research” under grant no. 20511100600, and the Shanghai Science and Technology Program’s “Federated-Based Cross-Domain and Cross-Task Incremental Learning” under grant no. 21511100800.

Author contributions

H.Y. conceptualized the study. L.Z., X.F., and C.A. performed the methodology and formal analysis. L.Z. and H.Y. wrote the original draft. Y.C., T.X., and L.Z. performed the visualization. J.L. and H.Y. performed the review and editing. Z.W., H.Y., and D.L. acquired the funding. H.Y. and L.Z. were responsible for all data, figures, and text. Z.W. and D.L. supervised the study.

Declaration of interests

The authors declare no competing interests.

Published: May 17, 2024

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2024.100781.

Supplemental information

Document S1. Figures S1–S3 and Tables S1–S8 and S10
mmc1.pdf (276.4KB, pdf)
Table S9. Biomarker information found by Subtype-WGME for each cancer, related to Table 1
mmc2.xlsx (32.3KB, xlsx)
Document S2. Article plus supplemental information
mmc3.pdf (5.5MB, pdf)

Data Availability Statement

  • This paper analyzes existing, publicly available datasets that are processed and hosted on the ICGC Data Portal (https://dcc.icgc.org/releases/PCAWG/). This information is also listed in the key resources table.

  • All original code has been deposited at GitHub and is publicly available as of the date of publication. Identifiers are listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.


Articles from Cell Reports Methods are provided here courtesy of Elsevier
