Briefings in Bioinformatics. 2025 Feb 3;26(1):bbaf043. doi: 10.1093/bib/bbaf043

Multi-view multi-level contrastive graph convolutional network for cancer subtyping on multi-omics data

Bo Yang 1, Chenxi Cui 2, Meng Wang 3, Hong Ji 4, Feiyue Gao 5
PMCID: PMC11789786  PMID: 39899598

Abstract

Cancer is a highly diverse group of diseases, and each type of cancer can be further divided into various subtypes according to specific characteristics, cellular origins, and molecular markers. Subtyping helps tailor treatment and improve prognostic accuracy. However, existing studies focus on integrating different omics data to discover potential connections, while ignoring the relationship between consensus information and individual information within each omics level during the integration process. To this end, we propose a novel fusion-free method called multi-view multi-level contrastive graph convolutional network (M2CGCN) for cancer subtyping. M2CGCN learns multi-level features, i.e. high-level and low-level features. The low-level features from each view capture the intrinsic information in each omics by reconstructing node attributes and graph structures. The high-level features achieve cancer subtyping via contrastive learning. Comprehensive experiments were performed on 34 multi-omics cancer datasets. The findings indicate that M2CGCN achieves results comparable to or surpassing many state-of-the-art methods.

Keywords: contrastive learning, graph convolutional network, cancer subtype, multi-omics data

Introduction

Cancer is a heterogeneous disease [1–3]. For decades, pathologists have recognized cancer heterogeneity and classified tumors originating from the same organ into distinct histological subtypes [4–6]. Distinguishing cancer subtypes is critical to advancing cancer research since it can improve the quality of patient care and promote the development of personalized cancer treatments. Recent rapid advances in sequencing technology enable us to comprehensively analyze cancer genome profiles [7–9]. Various significant national and international projects, including The Cancer Genome Atlas (TCGA), have amassed extensive biological samples and analyzed multi-level molecular profiles [10, 11], revealing many new cancer genes, pathways, and mechanisms, which contribute to cancer diagnosis, treatment, and prognosis.

Labeling cancer data is laborious and time-consuming, since labels can only be obtained via clinical follow-up. Therefore, more and more multi-omics integrative clustering algorithms are employed to achieve cancer subtyping. Early conventional clustering methods, such as K-means [12] and spectral clustering [13], directly integrated the omics data in tandem. Integration-based methods have gradually become mainstream in recent years, starting with CC [14] and PINS [15], where clustering models are trained independently on each omics data and the clustering results are then fused for the final prediction. An alternative approach involves designing a comprehensive representation model to investigate the relationships among different omics data, such as MCCA [16], iCluster [17], MOFA [18], and others. Multi-omics data usually present complex non-linearity, hence some works use the kernel trick to handle, to some extent, the nonlinear structure in the data. CIMLR [19] constructs several Gaussian kernels for each omics data, and then combines these kernels to construct one fused similarity matrix. COPS [20] constructs kernel functions using pathways as prior knowledge and then fuses the different kernels to carry out spectral clustering. IntNMF [21] uses the shared factors obtained via non-negative matrix factorization to embed multi-omics data and then clusters the embedding. MCluster-VAEs [22] employs a unified attention-based network architecture to model multi-omics data and leverages a variational Bayes objective function to derive cluster-friendly representations and posterior estimates for clustering assignments. MOCSS [23] constructs two autoencoders to learn the shared and specific representations, respectively, and applies contrastive learning only to the shared representation for clustering. SNF [24] uses message-passing theory to fuse the sample neighborhood graphs constructed on each omics into a unified similarity network. SNFCC [25] combines SNF [24] and CC [14] for clustering. In NEMO [26], a similarity matrix is constructed for each omics dataset using a kernel based on the radial basis function, and the matrices are then averaged.

In recent years, deep learning algorithms have emerged as a highly promising approach for integrating multi-omics data [27]. These integrative representation models have already achieved satisfactory performance. In practice, however, multi-omics data contain consensus information shared across all omics levels as well as individual information specific to each omics. Integrating multi-omics data can reveal potential clustering patterns, but the integration process usually ignores the individual information, which may contain knowledge that other omics do not have. Therefore, constructing an appropriate fusion-free model is not only convenient for the training procedure but could also improve subtyping performance.

The primary purpose of clustering is to discover meaningful structures within the data, identify natural groupings, and gain insights into the underlying distribution of the data. Recently, graph neural networks have shown the ability to uncover patient similarity and complex gene expression patterns from the latent space [28, 29]. Moreover, contrastive learning [30, 31] is a type of self-supervised learning methodology. It is designed to generate data point representations by increasing the similarity among similar data points and decreasing it among dissimilar ones. In the realm of multi-view learning, some recent studies have demonstrated impressive results by employing contrastive learning techniques [32, 33]. For instance, Tian and colleagues [32] introduced a contrastive multi-view coding framework aimed at capturing the latent semantics of scenes. In a separate work [33], the authors devised a contrastive approach for multi-view representation learning, specifically tailored to graph classification tasks.

In this paper, we propose a novel fusion-free multi-omics representation model for cancer subtyping, i.e. multi-view multi-level contrastive graph convolutional network (M2CGCN). As shown in Fig. 1, first, graph convolutional networks (GCNs) are used to learn the low-level features contained in each omics data. Second, a feature multilayer perceptron (MLP) and a label MLP are stacked on the low-level features to obtain high-level features and subtype labels, respectively. Third, to improve the effectiveness of clustering, the information obtained by clustering on the high-level features is combined with the subtype labels. Finally, a GCN is used to reconstruct node attributes and graph structures, while the two consistency objectives are achieved by contrastive learning. To the best of our knowledge, this is the first attempt to construct a fusion-free graph convolutional contrastive learning model for multi-omics clustering. The proposed approach achieves results that are comparable to or surpass some state-of-the-art methods across 34 public multi-omics cancer datasets.

Figure 1.

The architecture of M2CGCN. First, GCNs are applied to extract low-level features from each omics dataset. Next, the feature MLP and label MLP are designed to extract high-level features and predict subtype labels, respectively. Clustering on the high-level features helps improve clustering performance via maximum matching. Finally, the GCN reconstructs node attributes and graph structures while ensuring consistency through contrastive learning.

Materials and method

Datasets

The experiments use all 33 datasets from TCGA, encompassing a variety of cancer types. For each type of cancer, we used three omics levels, namely mRNA expression, DNA methylation, and miRNA expression, to predict cancer subtypes. In addition, we performed experiments using mRNA expression and CNV data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) breast cancer dataset [34]. The size of the dataset for each cancer type is shown in Supplementary Table S5, and the details of the datasets are presented in Supplementary Text S1. All data are preprocessed in the same manner as Rappoport et al. [24, 35]; details are given in Supplementary Text S2.

Method

M2CGCN includes three modules, i.e. individual encoder, multi-omics contrastive learning, and clustering with high-level features. The details of each module are explained in the following sections.

Individual encoder

A multi-omics dataset Inline graphic includes Inline graphic samples across Inline graphic omics levels. Inline graphic denotes the Inline graphic-dimensional sample from the Inline graphic-th omics data. Let Inline graphic denote the corresponding graph structure matrix set. Graph structure Inline graphic can be defined as follows:

graphic file with name DmEquation1.gif (1)

where Inline graphic denotes the neighbor set of Inline graphic in Inline graphic-th omics measurements.
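To make the neighbor-based graph structure concrete, the following minimal sketch builds a binary adjacency matrix for one omics view from k-nearest neighbors. The Euclidean metric, the symmetrization step, the value of k, and the function name `knn_graph` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def knn_graph(X, k):
    """Binary adjacency: A[i, j] = 1 if j is among the k nearest neighbors of i (or vice versa)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples (rows of X).
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbor
    A = np.zeros((n, n))
    nearest = np.argsort(dist, axis=1)[:, :k]
    for i in range(n):
        A[i, nearest[i]] = 1.0
    return np.maximum(A, A.T)                 # symmetrize (assumption)

# Toy view: two tight pairs of patients that are far from each other.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
A = knn_graph(X, k=1)
```

In this toy example each patient is linked only to its closest partner, so the graph splits into two connected pairs.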

The original omics data Inline graphic and the corresponding graph structure matrix Inline graphic are inputs of our model. The encoder extracts the representation Inline graphic from Inline graphic and Inline graphic through convolution operation in two layers of GCN. The Inline graphic-th encoder layer’s outputs are expressed as follows:

graphic file with name DmEquation2.gif (2)

where Inline graphic denotes a tanh activation function. Inline graphic and Inline graphic is the identity matrix, Inline graphic. The parameter matrix Inline graphic is learned by Inline graphic-th layer, where Inline graphic and Inline graphic is the layer number of the encoder. When Inline graphic, set Inline graphic, and when Inline graphic, set Inline graphic, which is the latent embedded representation of Inline graphic-th omics data.
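The encoder layer described above can be sketched as follows. The sketch assumes the widely used symmetric normalization of the adjacency matrix with self-loops, followed by a tanh activation; the layer widths, the random weights, and the function name `gcn_layer` are hypothetical and serve only to show the data flow through a two-layer encoder.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step: tanh(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops via the identity matrix
    d = A_tilde.sum(axis=1)                       # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization (assumption)
    return np.tanh(A_hat @ H @ W)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 5))                       # 3 samples, 5 omics features
W1 = rng.normal(size=(5, 4)) * 0.1                # hypothetical layer widths
W2 = rng.normal(size=(4, 2)) * 0.1
Z = gcn_layer(A, gcn_layer(A, X, W1), W2)         # output of the two-layer encoder
```

Here the input to the first layer is the raw omics matrix and the output of the last layer plays the role of the latent embedded representation of that omics view.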

To conduct the reconstruction, the decoder is designed as the reverse of the encoder, thus the decoder also has a two-layer structure. The reconstruction of the omics data and the reconstruction of corresponding structural graph are computed as equation (3) and equation (5), respectively.

graphic file with name DmEquation3.gif (3)
graphic file with name DmEquation4.gif (4)
graphic file with name DmEquation5.gif (5)

where Inline graphic are the decoder model parameters and the parameters Inline graphic are defined by the training of the reconfiguration graph Inline graphic.

Then we define a loss of omics data matrix reconstruction and a loss of corresponding graph reconstruction as follows:

graphic file with name DmEquation6.gif (6)
graphic file with name DmEquation7.gif (7)

Thus, the reconstruction objective of all omics levels is defined as follows:

graphic file with name DmEquation8.gif (8)

Multi-omics contrastive learning

We take the latent embedding representation Inline graphic as a low-level feature and learn the high-level features Inline graphic via a one-layer feature MLP. The reconstruction objective and the consistency objective are pursued in different feature spaces. To avoid model collapse, the representation ability of Inline graphic is maintained in the low-level feature space by equation (8). To further learn the consensus information, contrastive learning is applied in the high-level feature space so that Inline graphic achieves the consistency objective.

Contrastive learning seeks to enhance the similarity between positive pairs (related samples) and reduce the similarity between negative pairs (unrelated samples). In particular, each high-level feature Inline graphic has a total of Inline graphic feature pairs, i.e., Inline graphic. For the same Inline graphic-th sample, Inline graphic are constructed as Inline graphic positive feature pairs, i.e. different omics data of the same patient, and the remaining Inline graphic are all negative pairs. We use the cosine similarity, as in NT-Xent [28], to measure the similarity between two features:

graphic file with name DmEquation9.gif (9)

The loss of feature contrast between Inline graphic and Inline graphic is defined as follows:

graphic file with name DmEquation10.gif (10)

where Inline graphic represents the temperature parameter. Then, the cumulative feature contrast loss at all omics levels is defined as:

graphic file with name DmEquation11.gif (11)
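The feature-level contrast described above can be illustrated with a minimal two-view sketch. It assumes an NT-Xent-style formulation in which, for each patient, the feature of the same patient in the other omics view is the positive and all other features in both views are negatives. The function names and the temperature value of 0.5 are hypothetical; the cumulative loss would sum such terms over pairs of omics views.

```python
import numpy as np

def cosine_sim(U, V):
    """Pairwise cosine similarity between rows of U and rows of V."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ V.T

def pair_contrastive_loss(Hm, Hn, tau=0.5):
    """NT-Xent-style loss from view m to view n (tau is a hypothetical temperature)."""
    cross = np.exp(cosine_sim(Hm, Hn) / tau)      # view m against view n
    within = np.exp(cosine_sim(Hm, Hm) / tau)     # view m against itself
    pos = np.diag(cross)                          # same patient, different omics
    # Denominator: every pair except the trivial self-pair (i, i) within view m.
    denom = cross.sum(axis=1) + within.sum(axis=1) - np.diag(within)
    return float(np.mean(-np.log(pos / denom)))

H1 = np.eye(3)                                    # toy high-level features, 3 patients
aligned = pair_contrastive_loss(H1, H1.copy())    # views agree patient by patient
shuffled = pair_contrastive_loss(H1, np.roll(H1, 1, axis=0))  # views disagree
```

The loss is smaller when the two views place the same patient at the same location in the high-level feature space, which is the consistency the objective rewards.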

Next, we illustrate how to cluster in a fusion-free model. In detail, a label MLP is superimposed on the low-level features to get a clustering assignment for all the omics data Inline graphic. Define Inline graphic as the probability that the Inline graphic-th sample is the Inline graphic-th category in the Inline graphic-th omics data. Inline graphic represents the vector consisting of the probabilities that each sample under Inline graphic-th omics measure belongs to Inline graphic-th category. We set the last layer of the label MLP to softmax to output the probabilities. To obtain robustness of clustering, similar to the objective of obtaining consistency of high-level features, we utilize contrastive learning to achieve clustering consistency. For the Inline graphic-th omics level, the cluster labels have a total of Inline graphic label pairs, i.e., Inline graphic, which Inline graphic are defined as Inline graphic positive feature pairs and the remaining Inline graphic are all negative pairs.

graphic file with name DmEquation12.gif (12)

where Inline graphic represents the temperature parameter. The clustering-oriented loss is then defined as follows:

graphic file with name DmEquation13.gif (13)

where Inline graphic. The first part of equation (13) is used to obtain clustering consistency, while the second part is a regularization term [36] for avoiding overfitting.
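The label-level contrast works on cluster columns rather than on samples. The sketch below is an illustration under assumptions: it contrasts column j of one view's soft assignment matrix with column j of another view's matrix, using the same NT-Xent-style form as for features. The temperature value and the function name are hypothetical, and the regularization term of equation (13) is not reproduced here.

```python
import numpy as np

def label_contrastive_loss(Qm, Qn, tau=1.0):
    """Contrast cluster columns: column j of Qm should match column j of Qn."""
    Cm, Cn = Qm.T, Qn.T                           # rows are now clusters
    Cm = Cm / np.linalg.norm(Cm, axis=1, keepdims=True)
    Cn = Cn / np.linalg.norm(Cn, axis=1, keepdims=True)
    cross = np.exp(Cm @ Cn.T / tau)
    within = np.exp(Cm @ Cm.T / tau)
    pos = np.diag(cross)                          # same cluster index in both views
    denom = cross.sum(axis=1) + within.sum(axis=1) - np.diag(within)
    return float(np.mean(-np.log(pos / denom)))

# Soft assignments of 4 samples to 2 clusters in two omics views.
Q1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
Q2 = np.array([[0.8, 0.2], [0.9, 0.1], [0.1, 0.9], [0.3, 0.7]])
consistent = label_contrastive_loss(Q1, Q2)
swapped = label_contrastive_loss(Q1, Q2[:, ::-1])  # cluster ids permuted in view 2
```

When the two views assign the same patients to the same cluster index, the loss is lower than when the cluster indices are permuted in one view.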

Clustering with high-level features

To improve the clustering performance, we use Inline graphic as an anchor point and match it with the clusters obtained by clustering on the high-level feature Inline graphic.

First, K-means [12] is applied to the high-level features to obtain the clustering information for each omics data. In the Inline graphic-th omics data, define Inline graphic as the centers of the Inline graphic clusters. The clustering label Inline graphic of all samples can be calculated as follows:

graphic file with name DmEquation14.gif (14)

Then, let Inline graphic denote the cluster labels output by the label MLP, where Inline graphic. It is important to note that the clusters denoted by Inline graphic and Inline graphic do not necessarily correspond. Thus, using Inline graphic as an anchor point, Inline graphic is modified by the maximum matching formula [37], which can be described as follows:

graphic file with name DmEquation15.gif (15)

where Inline graphic denotes the boolean matrix and Inline graphic is the cost matrix, which can be formulated as follows:

graphic file with name DmEquation16.gif (16)
graphic file with name DmEquation17.gif (17)

where Inline graphic is the indicator function. Define the modified cluster assignment Inline graphic of the Inline graphic-th sample as a one-hot vector. The value of the Inline graphic-th element of Inline graphic is 1 only if Inline graphic satisfies Inline graphic. We then optimize the model by cross-entropy loss:

graphic file with name DmEquation18.gif (18)

where Inline graphic. Next, we define the loss of consensus information as follows:

graphic file with name DmEquation19.gif (19)
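The label-alignment step of equations (15)–(17) can be sketched as a one-to-one matching between K-means cluster ids and the anchor cluster ids. The sketch below counts co-occurrences and searches all permutations exhaustively, which is only practical for a small number of clusters; for larger problems a Hungarian-algorithm solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice. The function name `align_labels` and the toy labels are hypothetical.

```python
from itertools import permutations
import numpy as np

def align_labels(p, l, k):
    """Relabel K-means clusters p so they agree as much as possible with anchor labels l."""
    # overlap[a, b] = number of samples with K-means label a and anchor label b.
    overlap = np.zeros((k, k), dtype=int)
    for a, b in zip(p, l):
        overlap[a, b] += 1
    # Exhaustive search over one-to-one mappings; fine for small k.
    best = max(permutations(range(k)),
               key=lambda perm: sum(overlap[a, perm[a]] for a in range(k)))
    return np.array([best[a] for a in p])

l = np.array([0, 0, 1, 1, 2, 2])      # anchor: argmax of the label-MLP output
p = np.array([2, 2, 0, 0, 1, 0])      # K-means labels on high-level features
p_aligned = align_labels(p, l, k=3)   # -> [0, 0, 1, 1, 2, 1]
```

After alignment, the relabeled K-means assignments can be turned into one-hot targets for the cross-entropy loss, since both label sets now use the same cluster numbering.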

In summary, the overall loss function for M2CGCN is expressed as follows:

graphic file with name DmEquation20.gif (20)

where Inline graphic and Inline graphic are trade-off parameters.

Finally, the subtype label of the Inline graphic-th sample is computed as follows:

graphic file with name DmEquation21.gif (21)
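One common way to obtain a single label per sample in a fusion-free setting is to average the soft assignments produced for each omics view and take the most probable cluster. The sketch below illustrates that idea; whether equation (21) uses exactly this averaging is an assumption, and the function name `predict_subtypes` is hypothetical.

```python
import numpy as np

def predict_subtypes(Q_views):
    """Average per-view soft assignments and take the most probable cluster."""
    Q_mean = np.mean(np.stack(Q_views, axis=0), axis=0)   # shape (N, K)
    return np.argmax(Q_mean, axis=1)

# Two omics views, two patients, two candidate subtypes.
Q1 = np.array([[0.6, 0.4], [0.3, 0.7]])
Q2 = np.array([[0.2, 0.8], [0.4, 0.6]])
labels = predict_subtypes([Q1, Q2])   # averaged rows: [0.4, 0.6] and [0.35, 0.65]
```

In this toy case the first patient is assigned to subtype 1 even though the first view alone would favor subtype 0, because the second view is more confident.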

Algorithm 1 summarizes the entire optimization process of M2CGCN in detail. A mini-batch gradient descent algorithm is employed throughout the training process to optimize the model, which integrates a GCN encoder and decoder along with feature and label MLPs. The graph for each omics dataset is constructed by equation (1). Next, the low-level features are learned through the GCN encoder, and the reconstruction of node attributes and graph structure is achieved through the GCN decoder. After contrastive learning of multi-omics through equations (11) and (13), the clustering labels derived from the high-level features are adjusted using the maximum matching formula in equation (15). Finally, the model is fine-tuned with the modified clustering labels by equation (20). Supplementary Tables S1 and S2 provide the node counts for each layer of M2CGCN. We also give a real example to show the overall data flow in Supplementary Table S3.

[Algorithm 1: optimization process of M2CGCN]

Result

On 34 cancer datasets, we compared M2CGCN with 14 methods: 2 direct clustering methods (K-means [12] and spectral clustering [13]) and 12 integrative methods (LRAcluster [38], CC [14], PINS [15], MCCA [16], iClusterBayes [17], MOFA [18], IntNMF [21], CIMLR [19], MOCSS [23], SNF [24], SNFCC [25], and NEMO [26]).

Result analysis

The effectiveness of cancer subtyping is evaluated by two metrics, i.e. survival analysis and enrichment analysis on clinical labels. The survival analysis employs the Cox proportional hazards model [39] along with the P-value to assess statistically significant differences in survival profiles across different cancer subtypes. For the enrichment analysis of clinical labels, a standardized set of patient clinical details is selected for all cancers: gender and age at diagnosis, together with four clinical pathological parameters, namely overall progression (pathologic stage), lymph node involvement (pathologic N), metastases (pathologic M), and tumor progression (pathologic T). The details of the evaluation metrics are described in Supplementary Text S3. To ensure greater adaptability of our model, the number of clusters is treated as an input parameter. This parameter can be determined automatically or set using prior medical knowledge when available. For automatic determination, we concatenate the low-level features from all omics levels, perform clustering, and calculate the silhouette coefficient for varying cluster numbers. The number of clusters with the highest silhouette score is selected. The silhouette-based selection process is illustrated in Supplementary Figs S1(c)–S34(c), which show line plots of silhouette scores for different numbers of clusters.
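The silhouette-based choice of the number of clusters can be sketched as follows. The sketch computes the mean silhouette coefficient with Euclidean distance for a set of candidate partitions and keeps the cluster number with the highest score. The candidate partitions are supplied by hand here; in practice they would come from running a clustering algorithm on the concatenated low-level features for each candidate cluster number. The toy data and function name are hypothetical.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:                    # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)            # mean intra-cluster distance
        b = min(D[i, labels == c].mean()                   # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy concatenated features: three well-separated groups of two samples each.
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]], dtype=float)
candidates = {
    2: np.array([0, 0, 0, 0, 1, 1]),
    3: np.array([0, 0, 1, 1, 2, 2]),
}
best_k = max(candidates, key=lambda k: silhouette(X, candidates[k]))
```

For this toy data the three-cluster partition scores higher than the two-cluster one, so three clusters would be selected.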

The comparison results on the 33 TCGA datasets and the METABRIC dataset are shown in Table 1 and Fig. 2. M2CGCN achieves the highest average number of enriched clinical parameters and the highest average -log10 logrank P-value, reaching 1.9 and 12.2, respectively. The next-best average count of enriched clinical parameters is 1.6, achieved by five methods, and the next-best average -log10 logrank P-value is 11.5, achieved by three methods. On the COAD dataset, none of the comparison methods identified clusters with significant differences in survival, whereas our method did, and our method yielded significant survival differences in 22 out of 34 datasets. These results indicate that M2CGCN outperformed the other 14 comparison methods on average across the 34 cancer datasets, supporting its effectiveness in cancer subtyping.

Table 1.

Clustering result comparison between M2CGCN and other approaches on TCGA and METABRIC datasets

Cancer/Alg. K-Means Spectral MOFA LRAcluster CC PINS MCCA iClusterBayes IntNMF CIMLR MOCSS SNF SNFCC NEMO M2CGCN
ACC 0/4.0(3) 0/4.2(4) 1/5.9(3) 0/5.6(5) 1/5.4(3) 0/6.5(2) 2/5.2(2) 0/4.4(4) 0/3.6(4) 1/6.2(2) 0/4.3(4) 2/4.3(4) 0/4.4(4) 1/5.1(3) 1/5.6(3)
AML 1/3.3(5) 1/3.2(6) 1/3.8(4) 1/1.3(7) 1/3.6(3) 1/1.5(4) 1/1.5(12) 1/3.3(5) 1/1.9(5) 0/1.5(3) 1/3.4(3) 1/3.1(6) 1/4.0(4) 1/1.8(5) 1/3.0(5)
BIC 1/4.6(4) 1/5.0(3) 1/4.4(4) 2/5.2(5) 1/2.1(5) 2/4.9(5) 2/6.3(5) 1/4.7(4) 1/4.3(4) 4/7.4(13) 1/5.0(5) 1/6.3(5) 1/6.5(5) 2/6.2(4) 2/6.7(4)
BLCA 5/2.0(5) 4/2.6(4) 5/2.8(5) 5/2.8(11) 5/2.6(5) 6/1.0(5) 5/1.6(10) 0/0.2(4) 6/1.2(5) 5/3.9(5) 5/1.4(4) 5/1.9(3) 5/3.9(8) 5/2.3(3) 4/3.0(3)
CHOL 0/0.5(4) 0/0.2(3) 0/0.2(3) 0/0.3(3) 0/0.1(2) 0/0.1(3) 0/0.4(5) 0/0.3(4) 0/0.3(4) 0/0.1(4) 0/0.2(2) 0/0.1(3) 0/0.1(3) 0/0.3(5) 0/0.3(2)
DLBC 0/0.1(3) 0/0.1(4) 0/0.4(2) 0/0.1(2) 0/0.3(5) 0/0.2(3) 0/0.1(5) 0/0.1(2) 0/0.1(2) 0/0.1(3) 0/0.2(4) 0/0.2(3) 0/0.5(4) 0/0.4(4) 0/0.1(2)
LUSC 0/1.5(2) 0/2.1(2) 0/1.7(2) 0/0.7(12) 1/1.1(4) 0/2.0(2) 2/1.8(12) 0/1.9(5) 0/0.9(3) 1/1.4(8) 1/0.9(5) 1/1.3(2) 1/1.7(2) 0/1.8(2) 2/2.5(12)
KIRC 1/0.8(2) 3/1.4(3) 1/0.1(2) 3/1.4(11) 4/2.7(4) 3/1.7(6) 3/6.4(15) 3/1.5(2) 3/0.2(2) 4/1.3(11) 2/0.7(2) 4/2.1(4) 4/2.0(2) 4/2.2(12) 4/1.5(2)
HNSC 4/2.8(5) 2/2.1(4) 4/2.3(5) 1/0.6(3) 3/2.0(4) 3/1.7(5) 3/1.7(3) 1/0.5(4) 1/0.3(4) 2/3.5(5) 4/2.0(5) 2/2.7(3) 2/2.7(3) 3/3.0(4) 3/3.0(3)
CESC 1/0.8(4) 2/0.8(4) 1/0.4(2) 1/0.5(3) 1/1.1(3) 1/0.3(4) 1/0.1(3) 1/0.2(4) 1/0.5(3) 1/0.4(3) 3/1.1(3) 2/0.8(4) 0/0.4(5) 2/0.5(5) 3/0.6(4)
KICH 1/0.5(3) 1/0.8(4) 3/0.4(2) 2/0.4(3) 0/0.3(5) 0/0.1(2) 1/0.4(5) 0/0.2(3) 2/0.7(3) 0/0.3(3) 2/0.1(3) 1/0.9(4) 2/0.7(5) 2/1.2(4) 1/1.2(4)
LUAD 1/0.2(2) 1/0.1(3) 1/3.0(6) 1/0.1(4) 1/0.4(2) 1/0.3(4) 1/0.2(3) 2/0.1(4) 1/1.5(4) 1/0.3(4) 2/0.7(4) 1/1.6(3) 1/2.1(3) 1/1.0(5) 5/2.5(3)
KIRP 5/7.7(5) 6/9.1(4) 6/3.6(4) 5/7.5(3) 6/3.5(3) 4/6.3(4) 5/5.1(4) 5/4.2(5) 4/3.8(5) 4/2.5(5) 3/2.4(2) 4/4.5(3) 6/3.2(4) 6/4.8(3) 3/12.1(5)
ESCA 2/0.1(5) 2/0.3(4) 3/0.1(2) 3/0.1(2) 4/0.3(3) 3/0.6(4) 3/0.1(2) 3/0.1(3) 2/0.1(3) 3/0.1(3) 3/0.1(3) 3/0.1(4) 3/0.1(5) 3/0.1(4) 4/0.2(4)
PAAD 3/2.5(4) 3/2.4(2) 3/3.7(4) 3/2.6(8) 3/3.5(3) 0/2.1(2) 3/3.0(6) 0/1.6(3) 3/3.0(3) 0/1.1(12) 0/3.2(2) 3/3.5(2) 3/3.4(4) 3/3.5(3) 4/4.2(5)
COAD 1/0.4(2) 1/0.9(12) 1/0.2(2) 1/0.1(10) 0/0.3(2) 2/0.1(4) 0/0.2(2) 1/0.1(2) 1/0.2(3) 2/0.1(11) 2/0.1(2) 0/0.6(3) 2/0.3(10) 0/0.1(3) 1/1.7(5)
LIHC 2/0.2(2) 2/0.4(2) 2/0.3(2) 2/2.9(12) 1/0.3(2) 2/0.8(5) 2/1.5(15) 2/2.2(6) 2/2.0(5) 3/2.6(8) 2/2.1(6) 2/4.2(5) 2/3.2(10) 3/4.2(5) 2/3.2(5)
OV 0/0.1(2) 2/0.8(4) 0/0.1(2) 1/0.1(4) 1/0.2(3) 1/0.1(2) 1/0.8(9) 1/0.4(6) 0/0.7(3) 1/0.1(2) 1/1.0(3) 1/0.6(3) 1/0.5(3) 1/0.4(3) 1/0.4(4)
UVM 0/4.7(4) 0/3.0(5) 0/5.5(3) 0/4.0(3) 0/5.0(3) 0/5.1(5) 0/5.0(3) 0/4.9(5) 1/2.1(5) 0/5.7(5) 0/5.7(2) 0/5.7(3) 0/4.7(3) 0/3.3(4) 0/7.6(3)
LGG 1/323(3) 1/323(3) 1/323(3) 1/323(3) 1/9.0(3) 1/323(3) 1/323(3) 1/14.0(3) 1/323(3) 1/323(3) 1/323(3) 1/323(3) 1/323(3) 1/323(3) 1/323(3)
MESO 0/3.3(4) 0/2.5(3) 0/0.6(2) 0/0.9(4) 0/2.6(3) 0/1.7(5) 0/1.6(3) 0/1.1(4) 0/3.9(4) 0/1.7(4) 0/2.8(3) 0/3.6(5) 0/3.9(3) 0/2.8(5) 1/4.4(5)
TGCT 3/0.6(3) 1/0.5(4) 3/0.4(2) 2/0.2(3) 2/0.3(2) 1/0.1(4) 2/0.2(3) 3/0.4(4) 1/0.5(4) 0/0.8(4) 3/0.7(3) 3/0.7(3) 2/0.3(2) 3/0.6(3) 2/0.7(3)
UCEC 3/2.5(4) 1/1.5(5) 3/2.1(4) 3/2.6(4) 3/3.9(3) 2/0.2(5) 3/3.0(3) 3/4.2(4) 3/3.7(4) 3/3.7(4) 2/3.3(3) 3/3.4(5) 3/3.7(5) 2/3.3(5) 1/0.3(3)
PCPG 0/0.2(4) 0/0.5(5) 0/0.2(4) 0/0.5(5) 0/0.1(3) 0/0.4(3) 0/0.2(3) 0/0.4(5) 0/0.1(5) 0/0.2(5) 0/0.1(5) 1/0.2(3) 0/0.1(4) 0/0.2(4) 1/0.3(5)
PRAD 3/0.2(5) 1/0.1(5) 1/0.3(2) 2/0.1(4) 3/0.2(4) 1/0.1(3) 2/0.4(5) 2/0.1(4) 1/0.4(4) 2/0.6(4) 1/0.1(3) 4/0.1(5) 4/0.2(5) 0/0.1(3) 4/0.2(5)
GBM 2/2.6(5) 2/2.5(5) 2/4.2(5) 2/1.6(12) 2/3.3(7) 0/0.7(2) 1/3.6(11) 2/2.7(2) 1/3.5(3) 1/2.9(8) 1/4.1(4) 0/4.5(2) 2/2.6(9) 1/3.8(4) 1/5.5(6)
READ 0/0.2(3) 0/0.6(4) 1/0.4(3) 0/0.2(4) 1/0.6(3) 1/0.3(5) 2/0.3(3) 0/0.1(4) 1/0.6(4) 0/0.6(4) 0/0.2(2) 0/0.2(5) 0/0.5(3) 0/0.1(3) 1/0.7(4)
SKCM 2/0.9(2) 3/1.5(6) 0/0.5(2) 2/1.5(15) 3/1.5(4) 2/1.0(15) 2/4.8(2) 2/0.6(2) 2/4.1(2) 3/3.3(4) 1/0.6(4) 1/1.6(3) 1/2.2(4) 3/4.0(5) 3/4.1(5)
THYM 1/1.8(3) 1/2.2(5) 1/2.9(3) 1/1.9(3) 0/2.2(2) 1/1.5(4) 2/1.1(4) 1/0.1(5) 1/0.2(4) 0/0.9(4) 1/1.7(2) 1/1.5(5) 1/1.3(4) 1/0.8(3) 1/3.5(4)
STAD 0/1.4(5) 1/1.6(5) 0/1.1(5) 1/1.7(3) 1/1.3(4) 0/1.4(3) 0/3.3(5) 0/3.0(4) 0/0.5(4) 0/2.3(4) 1/3.3(4) 2/1.8(5) 1/1.0(4) 0/2.1(5) 2/1.4(4)
SARC 2/1.3(2) 2/1.3(2) 2/1.1(2) 2/2.2(13) 2/1.0(2) 2/0.8(3) 2/1.0(15) 1/0.9(2) 2/1.8(3) 2/2.5(5) 2/2.4(4) 2/2.3(3) 2/2.0(3) 2/1.9(3) 2/3.0(13)
THCA 0/0.5(2) 0/0.6(3) 4/0.2(4) 0/0.4(3) 0/0.2(4) 0/0.7(3) 1/0.3(4) 0/0.4(3) 0/0.3(2) 0/0.6(3) 3/1.2(4) 3/1.1(2) 3/1.2(3) 0/1.4(2) 2/1.6(2)
UCS 0/0.1(2) 0/0.1(3) 1/0.1(2) 0/0.3(4) 0/0.3(5) 0/0.1(2) 0/0.2(2) 0/0.1(3) 0/1.2(3) 0/0.1(3) 1/0.1(2) 0/0.1(4) 0/0.1(4) 0/0.2(2) 0/0.1(3)
METABRIC 1/1.3(5) 2/1.8(7) 2/0.9(5) 1/0.9(7) 2/0.6(6) 1/1.8(7) 2/5.6(7) 1/5.7(7) 1/3.3(7) 1/1.8(7) 2/4.6(9) 2/2.9(7) 2/2.5(7) 1/3.5(9) 2/5.9(8)
Mean 1.4/11.1 1.4/11.2 1.6/11.1 1.4/11 1.6/1.8 1.2/10.9 1.6/11.5 1.1/1.9 1.3/11 1.3/11.3 1.5/11.3 1.6/11.5 1.6/11.4 1.5/11.5 1.9/12.2
Sig 22/15 24/18 25/14 24/15 24/15 21/14 26/18 20/14 24/15 20/17 26/17 26/20 25/19 22/20 30/22

Note. The results are presented as A/B(C), where A indicates significant clinical parameters identified, B denotes the -log10 P-value for survival, and C refers to the number of clusters. A significance threshold of 0.05 is applied, and significant results are shown in bold. Mean represents the average across all datasets, while Sig indicates the count of datasets that display significant outcomes.
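As a reading aid for the A/B(C) notation, the following sketch parses one table cell and converts the -log10 P-value back to a P-value. The function name `parse_cell` is hypothetical. The cutoff of 1.301 is simply -log10(0.05), so a survival result is significant at the 0.05 level when B exceeds roughly 1.3.

```python
import re

def parse_cell(cell):
    """Split 'A/B(C)' into enriched-parameter count, -log10 P, and cluster count."""
    a, b, c = re.fullmatch(r"(\d+)/([\d.]+)\((\d+)\)", cell).groups()
    neg_log_p = float(b)
    return {
        "enriched": int(a),
        "neg_log10_p": neg_log_p,
        "p_value": 10 ** (-neg_log_p),
        "clusters": int(c),
        "survival_significant": neg_log_p > 1.301,   # -log10(0.05) is about 1.301
    }

bic = parse_cell("2/6.7(4)")     # M2CGCN entry for BIC in Table 1
coad = parse_cell("1/1.7(5)")    # M2CGCN entry for COAD in Table 1
```

For example, the BIC entry 2/6.7(4) corresponds to two enriched clinical parameters, a logrank P-value of about 2e-7, and four clusters.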

Figure 2.

Average performance of the methods across the 34 cancer datasets. The x-axis shows the mean number of enriched clinical parameters, and the y-axis shows the mean -log10 P-value of the logrank test. The intersection of the red dashed lines marks the M2CGCN result.

To validate the subtyping results produced by M2CGCN and compare them with existing subtypes while showcasing the differential expression across distinct subtypes, the experiments are conducted as follows. Initially, the PAM50 classification [40] on the BIC dataset is utilized as a benchmark for comparison. Next, as the PAM50 involves 48 mRNA expression features related to 50 genes, these features are excluded from the original mRNA data of the BIC dataset to eliminate the direct influence of known oncogenes in the multi-omics data. Subsequently, the processed mRNA data, along with other omics data, are used as input for M2CGCN. Lastly, a heatmap is generated based on the expression of the 48 mRNA features, highlighting the relationship between oncogenes and the subtypes identified by M2CGCN, as well as the overlap between the subtypes identified by M2CGCN and PAM50. The heatmap results are shown in Fig. 3, in which patients are rearranged according to subtypes from M2CGCN. It is evident that various subtypes exhibit unique expression patterns, and certain subtypes identified by M2CGCN overlap with those from PAM50, such as LumA and our subtype 4, as well as Basal and our subtype 2.

Figure 3.

Heatmap of the BIC dataset. Columns denote patients and rows denote mRNAs associated with PAM50. The annotation bars at the top display the patients' PAM50 annotation alongside the subtyping results from M2CGCN.

The Kaplan–Meier survival curves of different cancer types using M2CGCN are shown in Fig. 4, and the survival curves of BIC using different methods are shown in Fig. 5. In the experiments, the BIC data is grouped into four clusters by the proposed M2CGCN. Survival curves for M2CGCN are presented using three, four, and five clusters, respectively, to enable direct comparisons with other methods. These figures show that M2CGCN obtains clearer survival curve separation than the other methods, which illustrates that M2CGCN is a powerful cancer subtyping method on multi-omics data.

Figure 4.

Kaplan–Meier survival curves for 34 cancer types using the M2CGCN method. The x-axis represents the number of days since the study began, and the y-axis shows the estimated survival rate.

Figure 5.

Kaplan–Meier survival curves comparing M2CGCN with other methods on the BIC dataset. The x-axis denotes the number of days since the study started, while the y-axis indicates the estimated survival rate.

In this paper, the comparison methods are run using their publicly available code. Optimization is performed with the Adam optimizer [41], and the M2CGCN model is implemented in PyTorch. More details about M2CGCN are available at https://github.com/chenxi-cui/M2CGCN.

Model analysis

Parameter sensitivity analysis

There are two trade-off hyperparameters in equation (20), and we investigated whether they are needed to balance the losses in this equation. Figure 6a and b shows the enriched clinical parameters and survival analysis values on the BIC dataset for different hyperparameter settings, respectively, demonstrating that our method is not sensitive to either hyperparameter. We attribute this to the multi-level feature learning framework, which reduces the influence between different levels. For simplicity, both trade-off parameters are set to 1.0. Results for the other datasets are shown in Supplementary Figs S1–S34. In addition, the selection of the two temperature parameters in multi-omics contrastive learning needs to be investigated, i.e. the temperature in the high-level feature contrast loss, equation (10), and the temperature in the subtype label contrast loss, equation (12). Figure 6c and d shows the enriched clinical parameter values and survival analysis values for different temperatures, respectively, indicating that our method is not sensitive to the choice of the two temperatures, which were empirically set to Inline graphic and Inline graphic. We also subsampled the cancer datasets at different ratios and evaluated the stability of the clustering results [42]. The relevant results are shown in Supplementary Table S4.

Figure 6.

Sensitivity analysis of M2CGCN. (a) Enriched clinical parameters for different values of the two trade-off parameters. (b) Survival analysis for different values of the two trade-off parameters. (c) Enriched clinical parameters for different values of the two temperature parameters. (d) Survival analysis for different values of the two temperature parameters.

Ablation studies

We performed ablation experiments on the losses in equation (20) to study the contribution of each component individually. Table 2 shows the different loss components and the corresponding experimental results on the BIC dataset. In (1), only Inline graphic is optimized to fulfill the primary objective of multi-omics clustering, which is to capture the cluster consistency. In (2), Inline graphic is refined to enable low-level features to reconstruct the node attributes and associated structural graphs of the multi-omics data. In (3), Inline graphic is refined to extract high-level features and generate clustering labels. (4) is the full loss of M2CGCN. The results of (2) and (4) are somewhat better than those of (1) and (3), respectively, which supports the importance of the reconstruction objective. The results of (3) and (4) are much better than those of (1) and (2), indicating that high-level features are the key contributors to enhancing clustering performance.

Table 2.

Ablation studies on loss components

Results on the BIC dataset

Setting   Enriched clinical parameters   -log10 log-rank test P-value
(1)       1                              1.914
(2)       1                              1.959
(3)       1                              4.594
(4)       2                              6.659
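
The survival column in Table 2 is reported on a -log10 scale, so larger values correspond to smaller log-rank P-values. The short sketch below converts the reported values back to P-values for orientation. The helper name `neg_log10_to_p` is hypothetical, and the 0.05 reference level is a conventional threshold rather than one stated in this section.

```python
import math


def neg_log10_to_p(score):
    """Convert a -log10(P) score back to the P-value it encodes."""
    return 10.0 ** (-score)


# Survival scores copied from Table 2 (BIC dataset).
table2_scores = {"(1)": 1.914, "(2)": 1.959, "(3)": 4.594, "(4)": 6.659}
p_values = {name: neg_log10_to_p(s) for name, s in table2_scores.items()}

# The conventional 0.05 level expressed on the same -log10 scale (about 1.301).
threshold = -math.log10(0.05)

for name, p in p_values.items():
    print(f"{name}: P = {p:.3g}, above 0.05 line: {table2_scores[name] > threshold}")
```

Read this way, the step from setting (3) to setting (4) is a change of roughly two orders of magnitude in the P-value, which is easy to miss when the values are compared only on the logarithmic scale.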

Conclusion

The integration of multi-omics data enables researchers and clinicians to capture a more holistic view of cancer biology, as different omics layers provide complementary information about the genetic, epigenetic, and functional changes that occur in cancer cells. By combining these multi-omics data, it becomes possible to identify unique patterns and molecular signatures associated with distinct cancer subtypes. In this paper, we propose M²CGCN, a graph convolutional network with contrastive learning for cancer subtyping. M²CGCN learns low-level and high-level features and, on the basis of these multi-level features, obtains cancer subtypes in a fusion-free manner. M²CGCN aims to reduce the effect of individual information and to better learn the consensus information among different omics data. The experimental findings on public multi-omics datasets, comprising 33 TCGA datasets and METABRIC, show that M²CGCN achieves performance comparable to or better than that of other relevant methods. Although our study was conducted at two or three omics levels, M²CGCN offers a flexible framework that can be readily adapted to scenarios involving additional omics data. We expect M²CGCN to help advance precision oncology and improve patient prognosis.

Key Points

  • A new unsupervised graph convolutional network model is proposed to simultaneously learn cross-omics high-level feature representations and cancer subtype labels.

  • Contrastive learning is used to maintain consistency between the clustering results obtained from the high-level representations and the predicted subtype labels.

  • Experiments on the TCGA and METABRIC datasets highlight the proposed method’s superior performance in cancer subtype identification.

Supplementary Material

Supplementary_Materials_bbaf043

Acknowledgments

The authors sincerely appreciate the anonymous reviewers for their insightful feedback and helpful suggestions.

Contributor Information

Bo Yang, School of Computer Science & The Shaanxi Key Laboratory of Clothing Intelligence, Xi'an Polytechnic University, Xi'an 710048, China.

Chenxi Cui, School of Computer Science & The Shaanxi Key Laboratory of Clothing Intelligence, Xi'an Polytechnic University, Xi'an 710048, China.

Meng Wang, School of Computer Science & The Shaanxi Key Laboratory of Clothing Intelligence, Xi'an Polytechnic University, Xi'an 710048, China.

Hong Ji, School of Computer Science & The Shaanxi Key Laboratory of Clothing Intelligence, Xi'an Polytechnic University, Xi'an 710048, China.

Feiyue Gao, School of Computer Science & The Shaanxi Key Laboratory of Clothing Intelligence, Xi'an Polytechnic University, Xi'an 710048, China.

Funding

This work was supported by the Natural Science Basic Research Program of Shaanxi (2024JCYBMS-473, 2023-JC-YB-558), Humanities and Social Science Foundation of Ministry of Education of China (24YJA880034), National Natural Science Foundation of China (61972312), and Scientific Research Program Funded by Shaanxi Provincial Education Department (22JS019, 23JS028).

References

  • 1. Dagogo-Jack I, Shaw AT. Tumour heterogeneity and resistance to cancer therapies. Nat Rev Clin Oncol 2017;15:81–94. 10.1038/nrclinonc.2017.166
  • 2. Bianchini G, Balko JM, Mayer IA. et al. Triple-negative breast cancer: challenges and opportunities of a heterogeneous disease. Nat Rev Clin Oncol 2016;13:674–90. 10.1038/nrclinonc.2016.66
  • 3. Chen Z, Fillmore CM, Hammerman PS. et al. Non-small-cell lung cancers: a heterogeneous set of diseases. Nat Rev Cancer 2014;14:535–46. 10.1038/nrc3775
  • 4. Zardavas D, Irrthum A, Swanton C. et al. Clinical management of breast cancer heterogeneity. Nat Rev Clin Oncol 2015;12:381–94. 10.1038/nrclinonc.2015.73
  • 5. Meacham CE, Morrison SJ. Tumour heterogeneity and cancer cell plasticity. Nature 2013;501:328–37. 10.1038/nature12624
  • 6. Kim J, DeBerardinis RJ. Mechanisms and implications of metabolic heterogeneity in cancer. Cell Metab 2019;30:434–46. 10.1016/j.cmet.2019.08.013
  • 7. Nakagawa H, Wardell CP, Furuta M. et al. Cancer whole-genome sequencing: present and future. Oncogene 2015;34:5943–50. 10.1038/onc.2015.90
  • 8. Tyanova S, Albrechtsen R, Kronqvist P. et al. Proteomic maps of breast cancer subtypes. Nat Commun 2016;7:10259. 10.1038/ncomms10259
  • 9. Eberlein TJ. Race, breast cancer subtypes, and survival in the carolina breast cancer study. Yearbook of Surgery 2007;2007:304–5. 10.1016/S0090-3671(08)70227-1
  • 10. Ho A, Edwards JS. Lessons from cancer genome sequencing. Systems Biology of Cancer 2015;3:7–19.
  • 11. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455:1061–8. 10.1038/nature07385
  • 12. Hartigan JA, Wong MA. Algorithm as 136: a k-means clustering algorithm. Appl Stat 1979;28:100. 10.2307/2346830
  • 13. von Luxburg U. A tutorial on spectral clustering. Stat Comput 2007;17:395–416. 10.1007/s11222-007-9033-z
  • 14. Monti S, Tamayo P, Mesirov J. et al. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 2003;52:91–118. 10.1023/A:1023949509487
  • 15. Nguyen T, Tagett R, Diaz D. et al. A novel approach for data integration and disease subtyping. Genome Res 2017;27:2025–39. 10.1101/gr.215129.116
  • 16. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:1–27. 10.2202/1544-6115.1470
  • 17. Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 2010;26:292–3. 10.1093/bioinformatics/btp659
  • 18. Argelaguet R, Velten B, Arnol D. et al. Multi-omics factor analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol 2018;14:e8124. 10.15252/msb.20178124
  • 19. Chalise P, Fridley BL. Integrative clustering of multi-level ‘omic data based on non-negative matrix factorization algorithm. PloS One 2017;12:e0176278. 10.1371/journal.pone.0176278
  • 20. Rintala TJ, Fortino V. Cops: a novel platform for multi-omic disease subtype discovery via robust multi-objective evaluation of clustering algorithms. PLoS Comput Biol 2024;20:e1012275. 10.1371/journal.pcbi.1012275
  • 21. Ramazzotti D, Lal A, Wang B. et al. Multi-omic tumor data reveal diversity of molecular mechanisms that correlate with survival. Nat Commun 2018;9:4453. 10.1038/s41467-018-06921-8
  • 22. Rong Z, Song J, Cao L. et al. MCluster-VAEs: an end-to-end variational deep learning-based clustering method for subtype discovery using multi-omics data. Comput Biol Med 2022;150:106085. 10.1016/j.compbiomed.2022.106085
  • 23. Chen Y, Wen Y, Xie C. et al. Multi-omics data clustering and cancer subtyping via shared and specific representation learning. iScience 2023;26:107378. 10.1016/j.isci.2023.107378
  • 24. Wang B, Mezlini AM, Demir F. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 2014;11:333–7. 10.1038/nmeth.2810
  • 25. Xu T, Le TD, Liu L. et al. CancerSubtypes: an R/bioconductor package for molecular cancer subtype identification, validation and visualization. Bioinformatics 2017;33:3131–3. 10.1093/bioinformatics/btx378
  • 26. Rappoport N, Shamir R. NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics 2019;35:3348–56. 10.1093/bioinformatics/btz058
  • 27. Kang M, Ko E, Mersha TB. A roadmap for multi-omics data integration using deep learning. Brief Bioinform 2021;23:bbab454. 10.1093/bib/bbab454
  • 28. Wang J, Ma A, Chang Y. et al. scGNN: a novel graph neural network framework for single-cell RNA-seq analyses. Nat Commun 2021;12:1882.
  • 29. Zeng Y, Zhou X, Rao J. et al. Accurately clustering single-cell RNA-seq data by capturing structural relations between cells through graph convolutional network. In: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). New York, USA: IEEE, 2020.
  • 30. Chen T, Kornblith S, Norouzi M. et al. A simple framework for contrastive learning of visual representations. Proceedings of Machine Learning Research. USA: ICML, 2020. 1597–607.
  • 31. Wang TZ, Isola P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. Proceedings of Machine Learning Research. USA: ICML, 2020. 9929–39.
  • 32. Hassani K, Khasahmadi AH. Contrastive multi-view representation learning on graphs. Proceedings of Machine Learning Research. USA: ICML, 2020. 4116–26.
  • 33. Tian YL, Krishnan D, Isola P. Contrastive multiview coding. Computer Vision. Germany: ECCV, 2020;776–94. 10.1007/978-3-030-58621-8_45
  • 34. Pereira B, Chin S-F, Rueda OM. et al. Erratum: the somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes. Nat Commun 2016;7:1–16.
  • 35. Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res 2018;46:10546–62. 10.1093/nar/gky889
  • 36. Van Gansbeke W, Vandenhende S, Georgoulis S. et al. Scan: learning to classify images without labels. Computer Vision. Germany: ECCV, 2020;268–85. 10.1007/978-3-030-58607-2_16
  • 37. Jonker R, Volgenant T. Improving the Hungarian assignment algorithm. Oper Res Lett 1986;5:171–5. 10.1016/0167-6377(86)90073-8
  • 38. Wu D, Wang D, Zhang MQ. et al. Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification. BMC Genomics 2015;16:1–10.
  • 39. Muche R. Applied survival analysis: regression modeling of time to event data. Int J Epidemiol 2001;30:408–9. 10.1093/ije/30.2.408
  • 40. Parker JS, Mullins M, Cheang MCU. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol 2009;27:1160–7. 10.1200/JCO.2008.18.1370
  • 41. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv 2014; arXiv:1412.6980.
  • 42. Xie M, Kuang Y, Song M. et al. Subtype-MGTP: a cancer subtype identification framework based on multi-omics translation. Bioinformatics 2024;40. 10.1093/bioinformatics/btae360

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
