Briefings in Bioinformatics. 2024 Nov 25;26(1):bbae609. doi: 10.1093/bib/bbae609

PartIES: a disease subtyping framework with Partition-level Integration using diffusion-Enhanced Similarities from multi-omics Data

Yuqi Miao 1, Huang Xu 2, Shuang Wang 3
PMCID: PMC11586768  PMID: 39584699

Abstract

Integrating multi-omics data helps identify disease subtypes. Many similarity-based methods have been developed for disease subtyping using multi-omics data, and many of them focus on extracting clustering structures that are common across omics data types without preserving data-type-specific clustering structures. Moreover, the clustering performance of similarity-based methods degrades when similarity measures are noisy. Here we propose PartIES, a Partition-level Integration framework using diffusion-Enhanced Similarities, to perform disease subtyping using multi-omics data. PartIES first uses diffusion to reduce noise in the individual similarity/kernel matrices from individual omics data types, then extracts partition information from the diffusion-enhanced similarity matrices and integrates the partition-level similarities iteratively through a weighted average. Simulation studies showed that (1) the diffusion step enhances clustering accuracy, and (2) PartIES outperforms competing methods, particularly when omics data types provide different clustering structures. Using mRNA, long noncoding RNA, and microRNA expression data, DNA methylation data, and somatic mutation data from The Cancer Genome Atlas project, PartIES identified subtypes in bladder urothelial carcinoma, liver hepatocellular carcinoma, and thyroid carcinoma that are the most significantly associated with patient survival among all methods compared. Further investigations suggested that among subtype-associated genes, many of those that interact highly with other genes are known important cancer genes. The identified cancer subtypes also have different activity levels for some known cancer-related pathways. The R code can be accessed at https://github.com/yuqimiao/PartIES.git

Keywords: multi-omics integration, disease subtyping, similarity-based methods, diffusion, partition-level similarity learning

Introduction

Disease subtyping using molecular profiles led to numerous discoveries [1–3]. Integrating multiple types of molecular profiles such as genomic, epigenomic, and transcriptomic profiles provides a better understanding of biological mechanisms and a more accurate subtyping [4]. Clustering methods like K-means, partition around medoids, and hierarchical clustering have been widely used. With high-dimensional omics data, dimension reduction methods, such as principal component analysis (PCA), non-negative matrix factorization (NMF) [5] and auto-encoder [6] have been applied to learn low-dimensional feature representations before applying aforementioned clustering methods. With multi-omics data, many early methods for disease subtyping are model-based such as Gaussian latent variable models, e.g. iCluster [7] and its extensions [8, 9], which assume a common cluster structure across data types. More recent methods tend to learn feature representations on multi-omics data first and integrate them followed by K-means. For example, iNMF [10] applies NMF and pattern fusion analysis [11] uses PCA to learn feature representations of individual omics data types first, which are then averaged with weights followed by K-means. Deep learning methods such as auto-encoder were also used for feature representations on individual omics data types, which are then concatenated for K-means clustering [12].

Another category of methods is similarity-based methods [13–18]. For these methods, pairwise similarities between subjects are calculated using features of each omics data type, and multiple similarity matrices/graphs are integrated into an overall similarity matrix, on which spectral clustering is conducted: eigenvectors of the graph Laplacian induced by the similarity graph are extracted, and K-means is applied to these graph representations. Many similarity-based subtyping methods with multi-omics data construct individual similarity matrices from individual omics data types through kernels and focus on ways to integrate similarity matrices/kernels under the assumption that individual omics data types provide similar clustering structures. For example, similarity network fusion (SNF) [13] iteratively updates each similarity matrix using the other similarity matrices as weights until convergence, after which the fused matrices are averaged. Another popular method, CIMLR (Cancer Integration via Multi-kernel LeaRning) [16], learns an overall similarity matrix from a weighted average of individual similarity matrices/kernels with a low-rank constraint, where larger weights are learned for kernels with structures closer to the overall similarity, aiming to extract a consensus clustering pattern across data types.

However, disease subtyping studies have suggested that different omics data types might provide distinct cluster structures [7, 19, 20]. A breast cancer subtyping study identified different subtypes with distinct survival patterns using mRNA expression and DNA methylation data separately [19]. An ovarian cancer subtyping study suggested that subtypes identified using gene expression data only partially agree with those found using DNA methylation and microRNA expression data, respectively [20]. These data-type-specific cluster structures might be overlooked by some current similarity-based integration methods or iCluster and its extensions that assume similar clustering structures across data types. Recently, methods were also developed that first extract clustering information from individual omics data types and then integrate them [21–23]. For example, PINS (Perturbation clustering for data INtegration and disease Subtyping) [21] obtains individual pairwise connectivity matrices using clusters identified from individual omics data types and averages them to have an overall connectivity matrix followed by a similarity-based clustering method. More recently, partition-level Multi-View Clustering [22] was developed for imaging data that first extracts partition information from individual similarity matrices and integrates similarities on the partition level. When different omics data types provide distinct clustering structures, we want methods that can preserve data-type-specific cluster structures before integration. Moreover, as clustering results of similarity-based methods are greatly affected when similarity measures are noisy, different network diffusion methods [24–26] were developed to reduce noise or variances of similarity measures using neighbor information.

Here we developed PartIES, a Partition-level Integration framework that uses diffusion-Enhanced Similarities. PartIES first conducts diffusion on individual similarity matrices from individual omics data types to reduce noise, then partitions the diffusion-enhanced similarity matrices to capture distinct data-type-specific cluster structures, and finally integrates the low-rank partition-information-induced similarity matrices iteratively through a weighted average. We conducted extensive simulation studies to evaluate the clustering performance of PartIES and the competing methods SNF and CIMLR, with/without diffusion. We applied PartIES and competing methods to identify subtypes of three cancers: bladder urothelial carcinoma (BLCA), liver hepatocellular carcinoma (LIHC), and thyroid carcinoma (THCA) using mRNA, long noncoding RNA (lncRNA), and microRNA (miRNA) expression data, DNA methylation data, and somatic mutation data from The Cancer Genome Atlas (TCGA) project. Subsequent survival analyses suggest that the BLCA, LIHC, and THCA subtypes identified by PartIES are the most significantly associated with patient survival among all methods compared. We further investigated the biological meaning of the identified subtypes and noticed that among subtype-associated genes, many of those that interact highly with other genes are known important cancer genes. The identified cancer subtypes also have different activity levels for some known cancer-related pathways.

Methods

Figure 1 displays the schematic flowchart of PartIES: (1) construct individual similarity/kernel matrices from individual feature matrices; (2) apply diffusion to individual similarity matrices; and (3) learn a partition-level integrative similarity iteratively.

Figure 1.


The schematic plot of PartIES: (1) construct individual kernel/similarity matrices from original feature matrices, (2) enhance individual similarity/kernel matrices through diffusion, and (3) extract partition information from individual diffusion-enhanced kernels and integrate individual partition-level similarity matrices with a weighted average iteratively by minimizing the loss function Inline graphic.

Diffusion-enhanced similarity matrices

Given a feature matrix Inline graphic, Inline graphic, where Inline graphic is the number of subjects and Inline graphic is the number of features of data type Inline graphic, we calculate the pairwise similarity measure between subjects Inline graphic and Inline graphic using the following kernel function [15]:

graphic file with name DmEquation1.gif (1)

where Inline graphic is subject Inline graphic’s feature vector of data type Inline graphic, Inline graphic is the L2-norm of a vector Inline graphic, Inline graphic is the Inline graphic nearest neighbors of subject Inline graphic using data type Inline graphic, Inline graphic is the average Euclidean distance between subject Inline graphic and their Inline graphic nearest neighbors using data type Inline graphic, and Inline graphic is the sum of Inline graphic and Inline graphic, which reflects similarity ranges based on neighbors of subjects Inline graphic and Inline graphic using data type Inline graphic. We further express kernel distances between subjects Inline graphic and Inline graphic as Inline graphic.
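The kernel described above can be illustrated with a minimal sketch. Equation (1) itself is not reproduced in this text, so the Gaussian functional form used below is an assumption; only the ingredients named in the description (pairwise L2 distances, the average distance to the nearest neighbors, and a bandwidth equal to the sum of the two subjects' neighborhood scales) are taken from the source. The function name `knn_scaled_kernel` and the argument `k` are hypothetical.

```python
import math

def knn_scaled_kernel(X, k):
    """Neighbor-scaled kernel between subjects (illustrative sketch only).

    X: list of feature vectors, one per subject.
    k: number of nearest neighbors used to set the local scale.
    ASSUMPTION: a Gaussian form exp(-d^2 / (2 * eps^2)) without a
    normalizing constant; the published Equation (1) may differ.
    """
    n = len(X)
    # Pairwise Euclidean (L2) distances between subjects.
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # mu[i]: average distance from subject i to its k nearest neighbors (self excluded).
    mu = []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        mu.append(sum(others[:k]) / k)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eps = mu[i] + mu[j]  # bandwidth: sum of the two neighborhood scales
            kernel[i][j] = math.exp(-dist[i][j] ** 2 / (2 * eps ** 2))
    return kernel
```

Because the bandwidth adapts to each pair's neighborhoods, subjects in dense regions are compared on a finer scale than subjects in sparse regions.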

For data type Inline graphic, to denoise Inline graphic using diffusion, we define a local similarity matrix Inline graphic using Inline graphic nearest neighbors: Inline graphic, and obtain the diffusion-enhanced similarity matrix through a one-hop diffusion: Inline graphic. To have symmetric enhanced similarity matrices, we further define Inline graphic, and the corresponding Inline graphic.
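A minimal sketch of the denoising idea follows. The exact definitions of the local similarity matrix and the diffusion update are not reproduced in this text, so the choices below are assumptions: the local matrix keeps each row's `k` largest off-diagonal kernel entries and row-normalizes them, one-hop diffusion multiplies this local matrix by the kernel, and the result is symmetrized by averaging with its transpose. The function name `one_hop_diffusion` is hypothetical.

```python
def one_hop_diffusion(K, k):
    """One-hop diffusion on a kernel matrix K (illustrative sketch only).

    ASSUMPTIONS: local similarity S = row-normalized k-nearest-neighbor
    entries of K; diffusion D = S @ K; symmetrization (D + D^T) / 2.
    """
    n = len(K)
    # Local similarity: keep each row's k largest off-diagonal entries, then row-normalize.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: K[i][j], reverse=True)[:k]
        total = sum(K[i][j] for j in nbrs)
        for j in nbrs:
            S[i][j] = K[i][j] / total
    # One-hop diffusion: each row becomes a neighbor-weighted average of kernel rows.
    D = [[sum(S[i][m] * K[m][j] for m in range(n)) for j in range(n)]
         for i in range(n)]
    # Symmetrize the diffused matrix.
    return [[(D[i][j] + D[j][i]) / 2 for j in range(n)] for i in range(n)]
```

Under these assumptions, a subject's similarities are replaced by a weighted average of its neighbors' similarities, which smooths entries that disagree with the local neighborhood.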

The proposed PartIES

With enhanced kernel distance matrices Inline graphic, we propose the following loss function to capture overall cluster structure integrating multiple omics data types while preserving clustering structures from individual data types:

graphic file with name DmEquation2.gif (2)

where Inline graphic and Inline graphic are the similarity matrix and corresponding partition information from data type Inline graphic with Inline graphic the number of clusters using data type Inline graphic, Inline graphic is the integrated partition information with Inline graphic the number of clusters using all Inline graphic data types. The choices of Inline graphic and Inline graphic are guided by the eigengap criterion with details included in the Supplementary Materials. Here Inline graphic and Inline graphic are non-negative hyper-parameters, which are set such that only Inline graphic nearest neighbors of sample Inline graphic have nonzero similarities in Inline graphic during optimization. That is, Inline graphic is a local similarity matrix. Details on tuning Inline graphic and Inline graphic are included in the Supplementary Materials. Weights Inline graphic are calculated as Inline graphic with the constraint of summing to 1 and the rationale that the more similar a data-type-specific cluster structure is to the common cluster structure, the more contribution this data type should have in integration.
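As an illustration of the eigengap criterion mentioned above, the sketch below implements the standard heuristic: sort the graph Laplacian eigenvalues in ascending order and choose the number of clusters at the largest gap between consecutive eigenvalues. This is the textbook version; the exact variant used by the authors is described in their Supplementary Materials and may differ. The function name and the `max_k` argument are hypothetical.

```python
def eigengap_num_clusters(eigenvalues, max_k=None):
    """Pick the number of clusters from Laplacian eigenvalues (standard heuristic).

    eigenvalues: eigenvalues of a graph Laplacian, in any order.
    max_k: optional upper bound on the number of clusters considered.
    """
    vals = sorted(eigenvalues)
    upper = len(vals) - 1 if max_k is None else min(max_k, len(vals) - 1)
    # gaps[k - 1] is the gap between the k-th and (k + 1)-th smallest eigenvalues.
    gaps = [vals[k] - vals[k - 1] for k in range(1, upper + 1)]
    return gaps.index(max(gaps)) + 1
```

For example, three near-zero eigenvalues followed by a jump suggest three well-separated clusters.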

The first term in the objective function is the Frobenius inner product between the enhanced kernel distance Inline graphic and the learned local similarity matrix Inline graphic for data type Inline graphic, which aims to learn small similarities when distances are large. The second term is a regularization term to avoid learned Inline graphic being identity matrices. If there are Inline graphic clusters using data type Inline graphic, molecular profiles of samples in the same cluster should have high similarity, and the effective rank of Inline graphic should be ideally Inline graphic. The third term along with the constraint on Inline graphic enforces the low-rank structure of Inline graphic to preserve the data-type-specific cluster structures [27]. Minimizing the third term is equivalent to performing spectral clustering [28] on Inline graphic, which extracts Inline graphic, the partition information of data type Inline graphic, i.e. the first Inline graphic eigenvectors from the graph Laplacian Inline graphic. We then have Inline graphic as the partition-level similarity matrix for data type Inline graphic. The last term learns the integrated partition information Inline graphic using all Inline graphic data types, assuming Inline graphic clusters, where minimizing this term is equivalent to performing spectral clustering on the weighted average Inline graphic.

Optimization

We optimize Inline graphic and Inline graphic for data type Inline graphic, and Inline graphic. Although the objective function is non-convex, optimizing each parameter while holding others constant is convex. We use the following optimization steps:

Step 0: initialize Inline graphic, Inline graphic, and Inline graphic. With enhanced kernel distance matrix Inline graphic, we initialize Inline graphic for Inline graphic. We initialize Inline graphic as the first Inline graphic eigenvectors of Inline graphic corresponding to the Inline graphic smallest eigenvalues for Inline graphic, which is equivalent to extracting partition information from individual diffusion-enhanced kernels using spectral clustering. We initialize Inline graphic as the first Inline graphic eigenvectors of Inline graphic with weights being initialized as Inline graphic for Inline graphic.

We then iteratively update each parameter as follows:

Step 1: update Inline graphic while fixing Inline graphic, for Inline graphic separately:

graphic file with name DmEquation3.gif

This step localizes Inline graphic so that each row has around Inline graphic non-zero elements; details are in the Supplementary Materials.

Step 2: update Inline graphic while fixing Inline graphic and Inline graphic, for Inline graphic separately:

graphic file with name DmEquation4.gif

where Inline graphic is calculated using values from the last iteration with normalization Inline graphic, and Inline graphic is the first Inline graphic eigenvectors of Inline graphic corresponding to the Inline graphic smallest eigenvalues.

Step 3: update Inline graphic while fixing Inline graphic:

graphic file with name DmEquation5.gif

where Inline graphic and Inline graphic is the first Inline graphic eigenvectors of Inline graphic corresponding to the Inline graphic smallest eigenvalues.

Steps 1–3 are repeated until convergence. The final cluster labels are obtained by performing K-means on Inline graphic.

Simulation studies

We conducted simulation studies to investigate (1) how diffusion on the similarity matrix affects clustering performance with one data type; and (2) the overall clustering performance of PartIES and competing methods SNF and CIMLR, with/without diffusion on individual similarity matrices before integration.

We set each data type to have 10,000 features composed of signal and noise features. We simulated signal features in data type Inline graphic for subjects in cluster Inline graphic from a normal distribution Inline graphic, and noise features from Inline graphic for all subjects. We assumed all signal features from one data type have the same effect size for simplicity. We set the number of neighbors Inline graphic and explored different choices of Inline graphic and included results in the Supplementary Materials. For SNF and CIMLR, we set the number of clusters Inline graphic as the truth. For PartIES, the choice of Inline graphic was guided by the eigengap criterion as detailed in the Supplementary Materials, while Inline graphic was also set as the true number of clusters. We conducted 1000 simulations for each simulation setting and evaluated clustering performance in simulation studies using normalized mutual information (NMI).
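For readers unfamiliar with NMI, a minimal sketch is given below. The text does not state which normalization was used, so normalizing the mutual information by the arithmetic mean of the two entropies is an assumption (other common choices divide by the geometric mean or the maximum).

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same subjects.

    ASSUMPTION: NMI = 2 * I(A; B) / (H(A) + H(B)).
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    # Mutual information from the joint and marginal label frequencies.
    mi = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both labelings put every subject in a single cluster
    return 2 * mi / (h_a + h_b)
```

NMI is invariant to relabeling of clusters, equals 1 when the two partitions coincide, and is 0 when they are statistically independent.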

Simulation settings

To investigate if diffusion helps clustering performance, we considered one data type with Inline graphic and three equal-sized clusters with 50 subjects each. We considered two simulation settings varying effect sizes: (1) varying the number of signal features Inline graphic out of 10,000 features, where all signal features have the same effect size and can separate all three clusters, i.e. signal features were generated from normal Inline graphic, Inline graphic, and Inline graphic for samples in the three clusters, respectively; (2) fixing the number of signal features as 180 and varying standard deviation (SD) Inline graphic, where signal features were generated from Inline graphic, Inline graphic and Inline graphic for samples in the three clusters, respectively. We conducted spectral clustering on similarity matrices with/without diffusion.
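The single-data-type design above can be sketched as follows. The cluster means and standard deviations used in the paper are not reproduced in this text, so every numeric default below, along with the function name, is a hypothetical placeholder rather than the authors' setting.

```python
import random

def simulate_one_data_type(cluster_means, n_per_cluster=50, n_features=10_000,
                           n_signal=100, sd=1.0, noise_sd=1.0, seed=0):
    """Simulate one data type with equal-sized clusters (illustrative sketch only).

    Signal features are drawn from Normal(cluster mean, sd) and noise features
    from Normal(0, noise_sd). All default values are hypothetical.
    Returns (X, labels), with one row of X per subject.
    """
    rng = random.Random(seed)
    X, labels = [], []
    for cluster, mean in enumerate(cluster_means):
        for _ in range(n_per_cluster):
            signal = [rng.gauss(mean, sd) for _ in range(n_signal)]
            noise = [rng.gauss(0.0, noise_sd) for _ in range(n_features - n_signal)]
            X.append(signal + noise)
            labels.append(cluster)
    return X, labels
```

Varying `n_signal` with `sd` fixed, or varying `sd` with `n_signal` fixed, corresponds to the two ways of changing effect sizes described in the text.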

To examine overall clustering performance of PartIES, SNF, and CIMLR, we considered three data types all being generated from normal distributions each with 10,000 features. We set Inline graphic with four equal-sized clusters each with 50 subjects. We set each data type to have the same number of signal features ranging from 10 to 200 with a grid of 10. We considered the following three simulation settings.

In simulation setting I, three data types provide the same clustering structures with signal features in one data type having the same effect size while signal features in different data types have similar effect sizes. Specifically, signal features in data type 1 have means for four clusters Inline graphic; signal features in data type 2 have means for four clusters Inline graphic; and signal features in data type 3 have means for four clusters Inline graphic.

In simulation setting II, three data types provide similar clustering structures but signal features in different data types have different effect sizes. Specifically, signal features in data type 1 have means for four clusters Inline graphic, i.e. data type 1 separates all four clusters while it separates clusters 1 and 4 the best. Signal features in data type 2 have means for four clusters Inline graphic, i.e. data type 2 separates clusters 1 and 2 the best. Signal features in data type 3 have means for four clusters Inline graphic, i.e. data type 3 separates clusters 2 and 3 the best. PartIES is expected to perform well in this setting.

In simulation setting III, three data types provide different cluster structures. Specifically, signal features in data type 1 mainly separate clusters 1 and 2 from clusters 3 and 4 with means in four clusters Inline graphic. Signal features in data type 2 mainly separate clusters 1 and 2 from clusters 3 and 4, and can further separate cluster 1 and cluster 2 with means in four clusters being Inline graphic. Signal features in data type 3 separate clusters 1 and 2 from clusters 3 and 4, and can further separate cluster 3 and cluster 4 with means in four clusters being Inline graphic. This is the setting PartIES is designed for.

We also conducted simulation studies in which the three data types mimic real data more closely. Specifically, features in data type 1 were generated from Bernoulli distributions to mimic mutation presence/absence data, and features in the other two data types were generated from normal distributions to mimic gene expression data. We considered two sample sizes, 200 subjects in four equal-sized clusters of 50 subjects each and 60 subjects in four equal-sized clusters of 15 subjects each, and repeated the three simulation settings above. Details of these additional simulation studies are included in the Supplementary Materials.

Simulation results

Clustering performance with diffusion. Figure 2 displays NMI means with/without diffusion for the two simulation settings varying (1) the number of signal features, and (2) Inline graphic of signal features. When effect sizes are small, i.e. when the number of signal features is smaller than 70 or Inline graphic of signal features is greater than 4, clustering performance with and without diffusion is very similar. As effect sizes increase, better clustering performance is observed with diffusion, as expected. This is because when effect sizes increase, Inline graphic nearest neighbors can be more accurately defined, so the diffusion step, which relies on neighbors, denoises more effectively.

Figure 2.


Effect of the proposed diffusion step with mean NMIs across 1000 simulations.

Clustering performance of PartIES and competing methods. Figure 3 displays NMI means of PartIES and competing methods for the three simulation settings considered. First, the diffusion step improves the clustering accuracy of all methods in the three simulation settings considered, especially in setting II, when three data types provide similar clustering structures but signal features in the three data types have different effect sizes, and in setting III, when three data types provide different clustering structures. Moreover, the diffusion step helps CIMLR more because the original CIMLR does not use local similarities from individual data types, while the original SNF uses local similarities when fusing multiple similarity matrices from multiple data types. Similarly, PartIES also imposes local similarities in the step of learning Inline graphic.

Figure 3.


Clustering performance of PartIES and competing methods with mean NMI across 1000 simulations for the three simulation settings considered.

For the overall clustering performance, under simulation setting I, when three data types provide the same cluster structures and signal features in different data types have similar effect sizes, PartIES and CIMLR perform similarly and much better than SNF. This is because the granular differences in clustering structures due to small differences in effect sizes in different data types are not captured by the SNF fusing step, which uses individual similarity matrices directly. Under setting II, when three data types provide similar cluster structures but signal features in different data types have different effect sizes, and under setting III, when three data types provide different cluster structures, PartIES, which integrates data-type-specific partition information, performs much better than the two competing methods, as expected.

For the additional simulation studies with three data types generated from different distributions and with a much smaller sample size, we observed results very similar to those reported above. Details are included in the Supplementary Materials and Figs S3–S4.

Real data applications

Data processing

We applied PartIES and competing methods with/without diffusion to identify BLCA, LIHC, and THCA subtypes using mRNA, lncRNA, miRNA expression, DNA methylation and somatic mutation data from TCGA. We downloaded omics data using the R package ‘TCGAbiolinks’ and used tumors with all five omics data types, leading to 401 BLCA tumors, 362 LIHC tumors, and 484 THCA tumors (Table 1). The original 450K DNA methylation data have measures on 485,577 CpGs. We excluded CpGs (1) on sex chromosomes, (2) overlapping with known single nucleotide polymorphisms, and (3) with > 30% missing rates. We ended up with 405,336 CpGs in BLCA data, 399,476 CpGs in LIHC data, and 415,378 CpGs in THCA data. We further corrected type I/II probe bias using the R package ‘wateRmelon’ and imputed missing values using k-nearest neighbors. There is no missingness in the three types of gene expression data or in the somatic mutation data. Note that the somatic mutation data summarize the numbers of non-synonymous mutations per gene per tumor sample. The numbers of features of each omics type are summarized in Table 1. After the data processing steps, we normalized each feature to have mean 0 and SD 1.
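Two of the processing steps above, the missing-rate filter and the final standardization, can be sketched in a few lines. This is an illustration only: the authors worked in R, the k-nearest-neighbor imputation and probe-bias correction are omitted, the use of the sample standard deviation is an assumption, and dropping constant features is an addition made here to avoid division by zero. The function name is hypothetical.

```python
import statistics

def filter_and_standardize(features, max_missing_rate=0.30):
    """Drop features with too many missing values, then z-score the rest.

    features: dict mapping a feature name to its values across tumors,
              with None marking a missing value.
    Missing values are left as None (imputation is not shown here).
    """
    kept = {}
    for name, values in features.items():
        observed = [v for v in values if v is not None]
        missing_rate = 1 - len(observed) / len(values)
        if missing_rate > max_missing_rate or len(observed) < 2:
            continue
        mean = statistics.fmean(observed)
        sd = statistics.stdev(observed)  # ASSUMPTION: sample SD (n - 1 denominator)
        if sd == 0:
            continue  # constant feature; cannot be scaled to SD 1
        kept[name] = [None if v is None else (v - mean) / sd for v in values]
    return kept
```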

Table 1.

Summary of TCGA omics data for BLCA, LIHC, and THCA tumors

Cancer type                             BLCA       LIHC       THCA
Number of tumor samples                 401        362        484
Number of deaths                        175        132        14
Median survival (days)                  1008       1694       NA
Number of omics features after QC
  mRNA expression                       19,962     19,962     19,962
  lncRNA expression                     16,901     16,901     16,901
  miRNA expression                      1,881      1,881      1,881
  DNA methylation                       405,336    399,476    415,378
  Somatic mutation                      16,299     12,761     3,193

Overall performances of PartIES

Table 2 displays cancer subtypes identified by PartIES and competing methods and the corresponding log-rank P-values for the association between subtypes and patient survival. Subtypes identified by PartIES are the most significantly associated with patient survival among all methods for all three cancers, BLCA, LIHC, and THCA. The six BLCA subtypes by PartIES are associated with patient survival with a P-value Inline graphic. The four LIHC subtypes by PartIES are associated with patient survival with a P-value Inline graphic. The four THCA subtypes by PartIES are associated with patient survival with a P-value Inline graphic. In general, PartIES and CIMLR with diffused kernels have better subtyping performance (in terms of association with patient survival) than their counterparts without diffusion, while the performances of SNF with/without diffusion are very similar, consistent with observations in the simulation results.
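To make the log-rank comparison concrete, the sketch below implements the standard two-group log-rank test. It is a simplification: the analyses in Table 2 compare up to six subtypes, which requires the multi-group version of the test, and the authors' own implementation is not shown in the text. The function name is hypothetical.

```python
import math

def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test (standard formulation, illustrative sketch).

    times_*: follow-up times; events_*: 1 if death was observed, 0 if censored.
    Returns (chi-square statistic with 1 degree of freedom, P-value).
    """
    data = ([(t, e, 0) for t, e in zip(times_a, events_a)]
            + [(t, e, 1) for t, e in zip(times_b, events_b)])
    event_times = sorted({t for t, e, _ in data if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, group B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A.
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if variance == 0:
        return 0.0, 1.0
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

A small P-value indicates that the observed deaths in one group depart from what would be expected if both groups shared the same survival curve.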

Table 2.

TCGA BLCA, LIHC, and THCA cancer subtyping with (i) numbers of subtypes, and (ii) log-rank survival P-values

Method                     PartIES                           CIMLR                             SNF
Diffusion                  Yes              No               Yes              No               Yes              No
BLCA number of clusters    6                6                4                4                4                4
BLCA survival P-value      Inline graphic   Inline graphic   Inline graphic   0.06             Inline graphic   Inline graphic
LIHC number of clusters    4                4                3                3                2                2
LIHC survival P-value      Inline graphic   Inline graphic   0.19             0.24             Inline graphic   Inline graphic
THCA number of clusters    4                4                3                3                3                3
THCA survival P-value      Inline graphic   0.11             0.67             0.58             0.93             0.56

Individual cancer studies

The Cancer Genome Atlas bladder urothelial carcinoma

Studies have identified BLCA subtypes using omics data of TCGA BLCA tumors. For example, using mRNA expression data of 412 TCGA BLCA tumors, five subtypes were identified that are associated with patient survival with a P-value Inline graphic [29]. Another study identified two major BLCA subtypes using mRNA expression, DNA methylation, DNA copy number, and somatic mutation data of 388 TCGA BLCA tumors with the iClusterBayes method. These two major subtypes are associated with patient survival with a P-value Inline graphic [30]. Here we used mRNA, lncRNA, and miRNA expression data, DNA methylation data, and somatic mutation data of 401 TCGA BLCA tumors and identified six subtypes using the proposed PartIES with a survival P-value Inline graphic. Figure 4a presents Kaplan–Meier curves of the six BLCA subtypes by PartIES. Subtype 5, with 106 tumors, has the worst survival, with a median survival time of 623 days. Subtype 1, with 37 tumors, has the best survival, with more than 50% of subjects alive at the end of follow-up.
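The median survival times quoted above come from Kaplan–Meier curves. A minimal sketch of the standard Kaplan–Meier estimator and its median is shown below, assuming the median is taken as the first event time at which the estimated survival probability drops to 0.5 or below. The function name is hypothetical, and exact fractions are used only to avoid floating-point ties at 0.5.

```python
from fractions import Fraction

def km_median_survival(times, events):
    """Median survival time from the Kaplan-Meier estimator (illustrative sketch).

    times: follow-up times; events: 1 if death was observed, 0 if censored.
    Returns None when the survival curve never reaches 0.5, i.e. more than
    half of the subjects are estimated to be alive at the end of follow-up.
    """
    survival = Fraction(1)
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= Fraction(at_risk - deaths, at_risk)
        if survival <= Fraction(1, 2):
            return t
    return None
```

The `None` case corresponds to a subtype such as subtype 1 here, or to the "NA" median survival reported for THCA in Table 1, where too few deaths were observed for the curve to reach 50%.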

Figure 4.


(a) Kaplan–Meier curves of the six TCGA BLCA subtypes by PartIES. (b)–(f) Heatmaps of top 500 features by KW test comparing feature measures across the six BLCA subtypes. (g) Mutation landscape of top 20 most frequently mutated genes across all BLCA tumors.

TCGA BLCA subtyping using a single omics data type vs. five omics data types

Figure 4(b)–(e) display heatmaps of the top 500 features of mRNA, lncRNA, and miRNA expression levels and DNA methylation levels, ranked by P-values from the Kruskal–Wallis (KW) test comparing feature levels across the six BLCA subtypes by PartIES. Figure 4f displays the heatmap of mutation profiles of the top 20 most frequently mutated genes across 401 BLCA tumors. Subtypes identified by single omics data types with diffused kernels are also indicated, where two subtypes were identified using mRNA expression data, four using lncRNA expression data, two using miRNA expression data, two using DNA methylation data, and three using mutation data. This information is further displayed in Fig. 5.

Figure 5.


TCGA BLCA subtypes identified by PartIES and competing methods using five types of omics data vs. that using one type of omics data. Samples are ordered by PartIES subtypes.

Overall, for the six BLCA subtypes by PartIES, mRNA and miRNA data provide very similar subtype structures and mainly separate PartIES subtypes 3 and 5. LncRNA data provide a distinct subtype structure that separates PartIES subtypes 1 and 5 from subtypes 3, 4, and 6. Mutation data further separate PartIES subtypes 1, 2, and 5. DNA methylation data, on the other hand, do not provide clear subtype information. Only with all types of omics data does PartIES identify the current six BLCA subtypes.

When comparing the six PartIES subtypes to the four lncRNA subtypes, the majority (85.5%) of tumors in PartIES subtype 4 are in lncRNA subtype 1, 84.9% of tumors in PartIES subtype 5 are in lncRNA subtype 2, 70.2% of tumors in PartIES subtype 3 are in lncRNA subtype 3, and all tumors in PartIES subtype 6 are in lncRNA subtype 4 (Fig. 4c). Using somatic mutation data, we further separate PartIES subtypes 1 and 2 from the others. Although mRNA and miRNA expression data and DNA methylation data provide subtyping information that overlaps with that of lncRNA expression data, only when we use all five types of omics data can we identify the current six BLCA subtypes. More details on single omics data subtyping results and important omics profiles that differentiate the six BLCA subtypes are included in the Supplementary Materials.

Investigate subtype-differentiated important genes using the PPI network

To investigate the biological meaning of the six BLCA subtypes by PartIES, we examined omics features that differentiate them at a Bonferroni-adjusted KW test P-value threshold of 0.01. From a collection of differentially expressed genes (mRNA, lncRNA, and miRNA genes), differentially methylated genes (CpGs were mapped to genes), and differentially mutated genes, we identified important genes that interact highly with others using the protein-protein interaction (PPI) network [17] from the STRING database [31], where we kept edges having interaction scores Inline graphic using Cytoscape (version 3.10.1) [32].

Details of mapping differential omics profiles to the PPI network are included in the Supplementary Materials. Briefly, we selected top 200 differential profiles from each omics data type and ended with a network with 455 genes and 1,054 edges after removing overlapping genes. Using the Largest Subnetwork app in Cytoscape, we partitioned this network into sub-networks such that nodes in the same sub-network are connected while those in different sub-networks are not, and worked with the largest sub-network with 328 genes.
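The partitioning step described above amounts to finding connected components and keeping the largest one. The authors used a Cytoscape app for this; the sketch below is only a plain breadth-first-search illustration of the same idea, with a hypothetical function name and an edge list as the assumed input format.

```python
from collections import deque

def largest_connected_subnetwork(edges):
    """Return the node set of the largest connected component of an undirected graph.

    edges: iterable of (gene_u, gene_v) pairs.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```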

We then used three metrics to measure interactions among the 328 genes: degree, stress, and betweenness centrality. The degree of a gene is the number of genes it connects to. The stress of a gene is the number of shortest paths between two other genes that pass through this gene. The betweenness centrality of a gene is the normalized ratio of the number of shortest paths between two other genes that pass through this gene over all the shortest paths between the two genes. Table 3 displays the top 10 genes ranked by each of the three metrics. Genes ranked at the top are potentially important BLCA genes. For instance, gene HNRNPH1 has the highest mRNA expression levels in tumors in PartIES subtype 3 (Fig. 4b), which had relatively good survival. Studies have suggested that low expression levels of HNRNPH1 are associated with poor survival in BLCA [33]. Gene CDC5L (mapped from CpG cg05318503) has relatively high methylation levels in PartIES subtype 6 tumors and low levels in subtype 4 tumors (Fig. 4e), and has been found to be related to the apoptosis and migration of bladder cancer cells [34]. Gene YBX1 (mapped from CpG cg07603685) has relatively high methylation levels in PartIES subtype 6 tumors and low levels in subtype 4 tumors (Fig. 4e) and has been found to be associated with bladder cancer progression [35].
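The three metrics defined above can be computed directly from their definitions on a small unweighted network. The sketch below is a brute-force illustration, not the Cytoscape implementation the authors used: it counts shortest paths by breadth-first search from every node and then checks every pair of endpoints. Normalizing betweenness by the number of node pairs excluding the gene itself, (n − 1)(n − 2)/2, is an assumption, and the function names are hypothetical.

```python
from collections import deque

def shortest_path_counts(adjacency, source):
    """BFS from `source`: distance to, and number of shortest paths to, each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def centralities(edges):
    """Degree, stress, and normalized betweenness for an undirected graph."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    nodes = sorted(adjacency)
    paths = {s: shortest_path_counts(adjacency, s) for s in nodes}
    degree = {v: len(adjacency[v]) for v in nodes}
    stress = {v: 0 for v in nodes}
    betweenness = {v: 0.0 for v in nodes}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = paths[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t lie in different components
            dist_t, sigma_t = paths[t]
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                # v lies on a shortest s-t path iff the distances add up.
                if dist_s[v] + dist_t[v] == dist_s[t]:
                    through = sigma_s[v] * sigma_t[v]
                    stress[v] += through
                    betweenness[v] += through / sigma_s[t]
    n = len(nodes)
    if n > 2:
        scale = (n - 1) * (n - 2) / 2  # ASSUMPTION: pairs of other nodes
        betweenness = {v: b / scale for v, b in betweenness.items()}
    return degree, stress, betweenness
```

On a three-node path a–b–c, the middle node has degree 2, stress 1, and betweenness 1, while both end nodes have stress and betweenness 0. The cubic cost in the number of genes is acceptable for a network of a few hundred nodes.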

Table 3.

Top 10 genes ranked by degree, stress, and betweenness centrality

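| Gene (by degree) | Degree | Gene (by stress) | Stress | Gene (by betweenness) | Betweenness centrality |
| --- | --- | --- | --- | --- | --- |
| RBM39 | 45 | EGFR | 129,808 | EGFR | 0.290 |
| RBM25 | 38 | DHX9 | 64,208 | DHX9 | 0.101 |
| HNRNPH1 | 36 | HNRNPH1 | 50,206 | YBX1 | 0.077 |
| SRSF2 | 36 | YBX1 | 47,262 | HNRNPH1 | 0.076 |
| EGFR | 35 | RBM39 | 41,542 | NF1 | 0.063 |
| PRPF8 | 35 | SRSF2 | 37,924 | PRPF8 | 0.060 |
| CDC5L | 35 | CDC5L | 36,762 | TNRC6A | 0.057 |
| DHX9 | 35 | PRPF8 | 36,374 | RBM39 | 0.056 |
| LUC7L3 | 33 | TNRC6A | 35,366 | SRSF2 | 0.056 |
| SRSF11 | 33 | SFPQ | 30,360 | RUNX1 | 0.055 |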

Pathway activities of the six BLCA subtypes by PartIES

We further compared the activities of 11 cancer-related pathways (EGFR, MAPK, PI3K, VEGF, JAK-STAT, TGFb, TNFa, NFkB, Hypoxia, p53-mediated DNA damage response, and Trail (apoptosis)) across the six BLCA subtypes, with activities calculated from mRNA expression levels using the R package ‘PROGENy’ [36]. PROGENy (Pathway RespOnsive GENes) assigns pathway scores to tumors using the fitted coefficient matrix and the tumors’ mRNA expression levels. We compared pathway activity scores across the six BLCA subtypes using the KW test; all 11 pathways are significant after Bonferroni correction. We examined the six most significant pathways (Fig. 6). Tumors in PartIES subtype 5, which have the worst survival, have the highest PI3K activity scores but the lowest p53 activity scores. Hyperactivation of the PI3K pathway has been linked to various aspects of malignant cancer progression such as tumor cell proliferation, metastasis, and drug resistance [37]. Accumulation of p53 protein is important for suppressing cancer development, and abnormal changes in p53 pathway activity, potentially due to TP53 gene mutations, are significantly associated with poor patient survival [38]. Moreover, tumors in PartIES subtypes 5 and 6, both with worse survival, have relatively high EGFR activity, which is an indicator of more aggressive cancer behavior and poor survival outcomes in bladder cancer [39].
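To make the scoring and multiple-testing steps concrete, the sketch below assumes that a pathway score is a plain weighted sum of a tumor's expression levels with the pathway's fitted coefficients, and that the Bonferroni correction multiplies each P-value by the number of pathways tested. This is an illustration in Python rather than the PROGENy R package, which may apply additional scaling; the gene names, coefficients, and P-values are hypothetical.

```python
def pathway_scores(expression, coefficients):
    """Score each tumor on each pathway as a weighted sum of expression.

    expression:   {tumor: {gene: expression level}}
    coefficients: {pathway: {gene: fitted coefficient}}
    Genes missing from a tumor's profile contribute zero (an assumption).
    """
    scores = {}
    for tumor, levels in expression.items():
        scores[tumor] = {
            pathway: sum(coef * levels.get(gene, 0.0)
                         for gene, coef in gene_coefs.items())
            for pathway, gene_coefs in coefficients.items()
        }
    return scores


def bonferroni_adjust(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return {name: min(1.0, p * n_tests) for name, p in p_values.items()}


# Hypothetical coefficients and expression levels.
toy_coefficients = {"PathwayA": {"geneX": 2.0, "geneY": -1.0}}
toy_expression = {"tumor1": {"geneX": 1.5, "geneY": 0.5}}
toy_scores = pathway_scores(toy_expression, toy_coefficients)

# Hypothetical raw P-values from per-pathway tests across subtypes.
toy_p_values = {"PathwayA": 0.001, "PathwayB": 0.03}
adjusted = bonferroni_adjust(toy_p_values)
```

With 11 pathways, as in the analysis above, each raw KW test P-value would be multiplied by 11 before being compared with the chosen significance level.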

Figure 6.

Boxplots of the activity scores of the top 6 cancer-related pathways, ranked by the KW test comparing scores across the six BLCA subtypes by PartIES. Also displayed are the P-values of the three pairs of subtypes with the most significant differences from the Wilcoxon rank sum test comparing pathway activities between each pair of subtypes.

The Cancer Genome Atlas liver hepatocellular carcinoma

Much research has been done to identify liver cancer subtypes. The TCGA research network identified three liver cancer subtypes using copy number variants, DNA methylation, mRNA expression, miRNA expression, and reverse-phase protein array data of 183 TCGA LIHC tumors with iCluster [40]; these three subtypes are not significantly associated with patient survival. We previously developed abSNF [41] to integrate mRNA expression, DNA methylation, and somatic mutation data of 161 TCGA LIHC tumors and identified five subtypes that are associated with patient survival with a P-value of 0.046. Here, PartIES used mRNA, lncRNA, and miRNA expression data, DNA methylation data, and somatic mutation data of 362 TCGA LIHC tumors and identified four subtypes that are associated with patient survival with a P-value Inline graphic. We similarly investigated the biological meaning of the four LIHC subtypes; details are in the Supplementary Materials (Table S1, Figs S5–7).

The Cancer Genome Atlas thyroid carcinoma

Using 496 TCGA THCA tumors, the TCGA research network identified two major subtypes based on driver mutations, without performing a patient survival analysis. Here, PartIES used mRNA, lncRNA, and miRNA expression data, DNA methylation data, and somatic mutation data of 484 TCGA THCA tumors and identified four subtypes that are significantly associated with patient survival with a P-value Inline graphic. We similarly investigated the biological meaning of the four THCA subtypes; details are in the Supplementary Materials (Table S2, Figs S8–10).

Sensitivity analysis

We conducted sensitivity analyses using subsets of the five omics data types. We examined PartIES and competing methods after removing data types with (1) similar clustering structures, and (2) distinct clustering structures. For example, Fig. 5 shows that for BLCA, lncRNA and somatic mutation data provide distinct clustering structures, while mRNA expression, miRNA expression, and DNA methylation data provide similar cluster information. Therefore, we repeated the analyses removing (i) mRNA data, (ii) mRNA and DNA methylation data, (iii) mutation data, and (iv) mutation and lncRNA data. Table 4 displays these BLCA subtyping results. When data types with similar structures are removed, all methods give clustering results similar to those obtained with all five omics data types, and PartIES performs the best in terms of patient survival (Table 4, Analyses 1–3). When data types with distinct cluster structures are removed, all methods give worse clustering results than with all five omics data types (Analyses 4–5), as expected. Similar patterns were observed for LIHC and THCA, with details in Supplementary Materials Tables S3–4.

Table 4.

Sensitivity analysis of TCGA BLCA subtyping with different omics data types

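| Cancer | Analysis | Omics data types | Metric | PartIES | CIMLR | SNF |
| --- | --- | --- | --- | --- | --- | --- |
| BLCA | 1 | LncRNA, mutation, miRNA, DNA methylation, mRNA | Number of clusters | 6 | 4 | 4 |
| BLCA | 1 | LncRNA, mutation, miRNA, DNA methylation, mRNA | Survival P-value | 2.30E-05 | 3.67E-04 | 7.39E-03 |
| BLCA | 2 | LncRNA, mutation, miRNA, DNA methylation | Number of clusters | 6 | 4 | 4 |
| BLCA | 2 | LncRNA, mutation, miRNA, DNA methylation | Survival P-value | 1.09E-05 | 2.61E-03 | 3.04E-03 |
| BLCA | 3 | LncRNA, mutation, miRNA | Number of clusters | 6 | 3 | 4 |
| BLCA | 3 | LncRNA, mutation, miRNA | Survival P-value | 3.19E-05 | 4.75E-04 | 3.09E-03 |
| BLCA | 4 | LncRNA, miRNA, DNA methylation, mRNA | Number of clusters | 5 | 3 | 3 |
| BLCA | 4 | LncRNA, miRNA, DNA methylation, mRNA | Survival P-value | 3.17E-03 | 0.030 | 0.038 |
| BLCA | 5 | miRNA, DNA methylation, mRNA | Number of clusters | 4 | 3 | 3 |
| BLCA | 5 | miRNA, DNA methylation, mRNA | Survival P-value | 0.18 | 0.04 | 0.11 |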

Discussion

In this paper, we developed PartIES, a Partition-level Integration framework that uses diffusion-Enhanced Similarities to preserve data-type-specific cluster structures for a more accurate clustering result using multi-omics data. PartIES denoises individual similarity matrices through diffusion and partitions denoised similarity matrices before integrating them iteratively.

Simulation studies suggested that the proposed diffusion step, used as a general strategy, greatly improves clustering accuracy, especially for the competing method CIMLR, which uses only global similarity in its original algorithm. SNF shows only a small improvement with diffusion because local similarities are already applied when SNF fuses multiple similarity matrices. For overall clustering performance, PartIES performs the best in settings where the three data types provide different subtype structures, the scenario PartIES was designed for. In settings where the three data types provide the same subtype structures, PartIES performs similarly to CIMLR, and both perform better than SNF.

We applied PartIES and competing methods to identify BLCA, LIHC, and THCA subtypes using mRNA, lncRNA, and miRNA expression data, DNA methylation data, and somatic mutation data from TCGA. The six BLCA subtypes, the four LIHC subtypes, and the four THCA subtypes identified by PartIES are the most significantly associated with patient survival among all methods. More importantly, some omics data types provide different subtype structures from others. By integrating data-type-specific partition-level information, we can preserve distinct subtype structures and subtype all tumors more accurately. Further investigation of the biological meaning of the identified subtypes suggests that, among subtype-associated genes, many of those that interact heavily with other genes are known important cancer genes. The identified cancer subtypes also have different activity levels for some known cancer-related pathways.

One limitation of PartIES is that we need to determine the number of subtypes for each individual omics data type. There are currently no universal rules for selecting the number of clusters. We used the eigengap criterion to guide our selections. Although this is a limitation, the eigengap can also help determine whether a data type is informative for clustering. We conducted additional simulation studies and showed that when a data type contains no clusters, eigengaps do not change with different numbers of clusters (see Supplementary Materials). A perturbation method was recently developed [21] to determine the number of clusters. In future work, we will explore using perturbations with spectral clustering to determine the number of clusters. Another limitation is that we use only counts of non-synonymous mutations per gene across the genome as a sample’s mutation profile, which ignores mutation types. With the current TCGA tumor sample sizes, only a few genes have more than one mutation. In future work, we will explore meta-analysis with a much larger sample size to allow finer analyses with mutation types. We also want to mention that PartIES is computationally efficient and requires computational time comparable to that of SNF and CIMLR. Details are in Supplementary Materials Table S6.
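As an illustration of the eigengap criterion mentioned above, the sketch below uses the standard spectral clustering heuristic [28]: given the eigenvalues of a graph Laplacian sorted in ascending order, choose the number of clusters k that maximizes the gap between the k-th and (k+1)-th eigenvalues. This is an assumed form of the criterion, not necessarily the exact rule used in PartIES; the function names and the eigenvalues are hypothetical, and computing the eigenvalues themselves is left out.

```python
def eigengaps(eigenvalues):
    """Gaps between consecutive eigenvalues, after sorting in ascending order."""
    ordered = sorted(eigenvalues)
    return [ordered[k] - ordered[k - 1] for k in range(1, len(ordered))]


def choose_number_of_clusters(eigenvalues, max_clusters=None):
    """Return the k (1-based) with the largest gap between eigenvalue k and k+1."""
    gaps = eigengaps(eigenvalues)
    if max_clusters is not None:
        gaps = gaps[:max_clusters]
    if not gaps:
        raise ValueError("need at least two eigenvalues")
    # gaps[0] is the gap after the first eigenvalue, i.e. k = 1.
    best_index = max(range(len(gaps)), key=lambda i: gaps[i])
    return best_index + 1


# Hypothetical Laplacian eigenvalues: three near zero, then a clear jump.
toy_eigenvalues = [0.0, 0.01, 0.02, 0.6, 0.7, 0.8]
k = choose_number_of_clusters(toy_eigenvalues)
```

Here the three near-zero eigenvalues followed by a jump suggest three clusters. When the gaps are all of similar size, no value of k stands out, which is consistent with the observation above that eigengaps do not change with the number of clusters when a data type contains no clusters.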

Key Points

  • The proposed PartIES integrates partition-level information iteratively from individual types of omics data to better preserve distinct data-type-specific cluster structures.

  • The proposed PartIES uses a diffusion step to denoise individual similarity matrices.

  • The proposed PartIES has the best clustering performance when different types of omics data provide distinct clustering structures, or provide similar structures with different effect sizes, and performs similarly to competing methods when different types of omics data provide similar clustering structures.

Supplementary Material

PartIES_Oxford_suppl_bbae609

Acknowledgments

The authors thank the anonymous reviewers for their valuable suggestions.

Contributor Information

Yuqi Miao, Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10027, United States.

Huang Xu, Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10027, United States.

Shuang Wang, Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10027, United States.

Conflict of interest: None declared.

Funding

This work is supported in part by funds from the National Institutes of Health (NIH; grant R01 LM013061-01).

References

  • 1. Sørlie T, Perou CM, Tibshirani R. et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci 2001;98:10869–74.
  • 2. Holm K, Hegardt C, Staaf J. et al. Molecular subtypes of breast cancer are associated with characteristic DNA methylation patterns. Breast Cancer Res 2010;12:R36. 10.1186/bcr2590.
  • 3. Kuijjer ML, Paulson JN, Salzman P. et al. Cancer subtype identification using somatic mutation data. Br J Cancer 2018;118:1492–501. 10.1038/s41416-018-0109-7.
  • 4. Duan R, Gao L, Gao Y. et al. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Comput Biol 2021;17:e1009224. 10.1371/journal.pcbi.1009224.
  • 5. Liu J, Wang C, Gao J. et al. Multi-view clustering via joint nonnegative matrix factorization. In: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 252–60. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2013.
  • 6. Ma T, Zhang A. Multi-View Factorization AutoEncoder with Network Constraints for Multi-Omic Integrative Analysis. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 702–7. Los Alamitos, CA, USA: IEEE, 2018.
  • 7. Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 2009;25:2906–12. 10.1093/bioinformatics/btp543.
  • 8. Mo Q, Wang S, Seshan VE. et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci 2013;110:4245–50.
  • 9. Mo Q, Shen R, Guo C. et al. A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics 2018;19:71–86. 10.1093/biostatistics/kxx017.
  • 10. Yang Z, Michailidis G. A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics 2016;32:1–8. 10.1093/bioinformatics/btv544.
  • 11. Shi Q, Zhang C, Peng M. et al. Pattern fusion analysis by adaptive alignment of multiple heterogeneous omics data. Bioinformatics 2017;33:2706–14. 10.1093/bioinformatics/btx176.
  • 12. Kang M, Ko E, Mersha TB. A roadmap for multi-omics data integration using deep learning. Brief Bioinform 2022;23:1–16. 10.1093/bib/bbab454.
  • 13. Wang B, Mezlini AM, Demir F. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 2014;11:333–7. 10.1038/nmeth.2810.
  • 14. Rappoport N, Shamir R. NEMO: Cancer subtyping by integration of partial multi-omic data. Bioinformatics 2019;35:3348–56. 10.1093/bioinformatics/btz058.
  • 15. Wang B, Zhu J, Pierson E. et al. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 2017;14:414–6. 10.1038/nmeth.4207.
  • 16. Ramazzotti D, Lal A, Wang B. et al. Multi-omic tumor data reveal diversity of molecular mechanisms that correlate with survival. Nat Commun 2018;9:4453. 10.1038/s41467-018-06921-8.
  • 17. Wei Y, Li L, Zhao X. et al. Cancer subtyping with heterogeneous multi-omics data via hierarchical multi-kernel learning. Brief Bioinform 2023;24:1–13. 10.1093/bib/bbac488.
  • 18. Duan X, Ding X, Zhao Z. Multi-omics integration with weighted affinity and self-diffusion applied for cancer subtypes identification. J Transl Med 2024;22:79. 10.1186/s12967-024-04864-x.
  • 19. Netanely D, Avraham A, Ben-Baruch A. et al. Expression and methylation patterns partition luminal-a breast tumors into distinct prognostic subgroups. Breast Cancer Res 2016;18:74. 10.1186/s13058-016-0724-2.
  • 20. Zhang Z, Huang K, Chenglei G. et al. Molecular subtyping of serous ovarian cancer based on multi-omics data. Sci Rep 2016;6:26001. 10.1038/srep26001.
  • 21. Nguyen T, Tagett R, Diaz D. et al. A novel approach for data integration and disease subtyping. Genome Res 2017;27:2025–39. 10.1101/gr.215129.116.
  • 22. Kang Z, Zhao X, Peng C. et al. Partition level multiview subspace clustering. Neural Netw 2020;122:279–88. 10.1016/j.neunet.2019.10.010.
  • 23. Chen Y, Wen Y, Xie C. et al. MOCSS: multi-omics data clustering and cancer subtyping via shared and specific representation learning. iScience 2023;26:107378. 10.1016/j.isci.2023.107378.
  • 24. Cowen L, Ideker T, Raphael BJ. et al. Network propagation: a universal amplifier of genetic associations. Nat Rev Genet 2017;18:551–62. 10.1038/nrg.2017.38.
  • 25. Wang B, Pourshafeie A, Zitnik M. et al. Network enhancement as a general method to denoise weighted biological networks. Nat Commun 2018;9:3108. 10.1038/s41467-018-05469-x.
  • 26. Gasteiger J, Weißenberger S, Günnemann S. Diffusion Improves Graph Learning. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Article 1197, pp. 13366–78. Red Hook, NY, USA: Curran Associates Inc., 2022.
  • 27. Nie F, Wang X, Jordan M. et al. The constrained Laplacian rank algorithm for graph-based clustering. Proc AAAI Conf Artif Intell 2016;30:1969–76. 10.1609/aaai.v30i1.10302.
  • 28. von Luxburg U. A tutorial on spectral clustering. Stat Comput 2007;17:395–416. 10.1007/s11222-007-9033-z.
  • 29. Robertson AG, Kim J, Al-Ahmadie H. et al. Comprehensive molecular characterization of muscle-invasive bladder cancer. Cell 2017;171:540–556.e25.
  • 30. Mo Q, Li R, Adeegbe DO. et al. Integrative multi-omics analysis of muscle-invasive bladder cancer identifies prognostic biomarkers for frontline chemotherapy and immunotherapy. Commun Biol 2020;3:1–14. 10.1038/s42003-020-01491-2.
  • 31. Doncheva NT, Morris JH, Gorodkin J. et al. Cytoscape StringApp: network analysis and visualization of proteomics data. J Proteome Res 2019;18:623–32. 10.1021/acs.jproteome.8b00702.
  • 32. Shannon P, Markiel A, Ozier O. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003;13:2498–504. 10.1101/gr.1239303.
  • 33. Chang Y, Shukun Y, Zhang M. et al. N6-Methyladenosine-related alternative splicing events play a role in bladder cancer. Open Life Sci 2022;17:1371–82. 10.1515/biol-2022-0479.
  • 34. Zhang Z, Mao W, Wang L. et al. Depletion of CDC5L inhibits bladder cancer tumorigenesis. J Cancer 2020;11:353–63. 10.7150/jca.32850.
  • 35. Liuyu X, Li H, Longchao W. et al. YBX1 promotes tumor growth by elevating glycolysis in human bladder cancer. Oncotarget 2017;8:65946–56. 10.18632/oncotarget.19583.
  • 36. Schubert M, Klinger B, Klünemann M. et al. Perturbation-response genes reveal signaling footprints in cancer gene expression. Nat Commun 2018;9:20. 10.1038/s41467-017-02391-6.
  • 37. Jiang N, Dai Q, Xiaorui S. et al. Role of PI3K/AKT pathway in cancer: the framework of malignant behavior. Mol Biol Rep 2020;47:4587–629. 10.1007/s11033-020-05435-1.
  • 38. Lu M-L, Wikman F, Orntoft TF. et al. Impact of alterations affecting the p53 pathway in bladder cancer on clinical outcome, assessed by conventional and Array-based methods. Clinical Cancer Research 2002;8:171–79.
  • 39. Mansour AM, Abdelrahim M, Laymon M. et al. Epidermal growth factor expression as a predictor of chemotherapeutic resistance in muscle-invasive bladder cancer. BMC Urol 2018;18:100. 10.1186/s12894-018-0413-9.
  • 40. Wheeler DA, Roberts LR. Comprehensive and integrative genomic characterization of hepatocellular carcinoma. Cell 2017;169:1327–1341.e23.
  • 41. Ruan P, Wang Y, Shen R. et al. Using association signal annotations to boost similarity network fusion. Bioinformatics 2019;35:3718–26. 10.1093/bioinformatics/btz124.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
