Abstract
Cancer, with its inherent heterogeneity, is commonly categorized into distinct subtypes based on unique traits, cellular origins, and molecular markers specific to each type. However, current studies primarily rely on complete multi-omics datasets for predicting cancer subtypes, often overlooking predictive performance in cases where some omics data may be missing and neglecting implicit relationships across multiple layers of omics data integration. This paper introduces Multi-Layer Matrix Factorization (MLMF), a novel approach for cancer subtyping that employs multi-omics data clustering. MLMF initially processes multi-omics feature matrices by performing multi-layer linear or nonlinear factorization, decomposing the original data into latent feature representations unique to each omics type. These latent representations are subsequently fused into a consensus form, on which spectral clustering is performed to determine subtypes. Additionally, MLMF incorporates a class indicator matrix to handle missing omics data, creating a unified framework that can manage both complete and incomplete multi-omics data. Extensive experiments conducted on 12 multi-omics cancer datasets, both complete and with missing values, demonstrate that MLMF achieves results that are comparable to or surpass the performance of several state-of-the-art approaches. MLMF is open source and available at (https://github.com/renyingxuan/MLMF.git).
Keywords: matrix factorization, cancer subtyping, missing data, multi-omics data
Introduction
Cancer is one of the main global health threats, with high rates of incidence and mortality that make it a focal point of current medical research and public health efforts. Its occurrence and development involve biological changes driven by complex mechanisms. Different subtypes of the same cancer can differ in histopathology and clinical features, but the heterogeneity of cancer is mainly due to its intrinsic molecular characteristics [1]. Therefore, making full use of these intrinsic molecular characteristics to identify cancer subtypes will help to achieve precision medicine for cancer. In precision medicine, the molecular profile of a patient contains multiple molecules that belong to different omics (such as genomics, proteomics, and metabolomics). These omics data reflect different biological processes, such as gene expression, protein function, and metabolic pathways. Early studies usually analyzed a single omics dataset [2]. However, single-omics data can only reflect the cancer characteristics at one level of biological process [3], and using different single-omics data to address the same question can produce different results. For a heterogeneous disease such as cancer, occurrence and development are affected by different gene combinations and various other factors, so single-omics data cannot fully describe a cancer [4]. Combining different omics data to describe a patient’s biological information yields what is called “multi-omics data” [5]. Currently, common multi-omics data include copy number variation (CNV), mRNA expression, miRNA expression, and DNA methylation [6].
Nowadays, cancer subtype identification based on multi-omics data is mainly achieved through the integrated analysis of cancer sample data [7]. Current methods can be roughly divided into three categories: early integration, mid-term integration, and late integration [8]. The main principle of early integration is to concatenate the input feature matrices of different omics into a single multi-omics feature matrix, and then apply traditional clustering algorithms such as K-means or spectral clustering to it [9]. After clustering, each cluster corresponds to a different cancer subtype. For example, LRAcluster [10] is an integrated probability model based on low-rank approximation. It finds the global optimal solution of the objective function through a simple gradient ascent algorithm, and then applies K-means to the latent representation matrix to obtain the cancer subtypes [11]. Because early integration fuses data by direct concatenation, the integration process cannot reflect the correlation between different omics. Moreover, because the operation is so simple, the concatenated data often contain redundant information and increase the dimension of the model input. The main principle of late integration is to cluster each omics separately with a single-omics clustering algorithm, and then integrate the clustering results obtained from all omics into the final identification result [12]. The PINS method [13] constructs a connectivity matrix from the clustering results of each omics and integrates these connectivity matrices into a similarity matrix for clustering. The CC algorithm [14] verifies the rationality of clustering by randomly extracting subsets from the original data, specifying the number of clusters, and clustering all data subsets separately.
Late integration does not increase the dimension of the model input, and it can apply omics-specific normalization and a model adapted to each omics data type, but it cannot establish inter-omics associations at the feature level. Mid-term integration is the most common approach; it integrates the data within a single global learning process. The MCCA algorithm [15] uses sparse canonical correlation analysis to find highly correlated omics data. iClusterBayes [16], based on iCluster, uses a full Bayesian latent variable model to select valuable latent variables and describe the intrinsic structure in multi-omics data. Xu et al. [17] proposed the MSNE algorithm, which integrates multi-omics information by embedding the similarity relationships of samples defined by random walks on multiple similarity networks.
Another problem with using multi-omics data to identify cancer subtypes is that the high cost of sequencing can lead to incomplete multi-omics data. Some patients may have only their mRNA expression data or only their DNA methylation data sequenced, so complete multi-omics data are not available for them. If a clustering algorithm designed for complete multi-omics data is applied to such incomplete samples, it will either fail or suffer degraded clustering performance. Some recently proposed methods have begun to address the problem of incomplete data. NEMO [18] allows samples to be missing in one or more datasets: as long as each pair of samples has measurements in at least one shared omics dataset, cancer subtypes can be identified. MSNE [17] captures the comprehensive similarity of samples by random walks on multiple similarity networks and is also applicable when data are missing. How to effectively use incomplete multi-omics data to better identify cancer subtypes has therefore become an important issue in this research field.
Semi-non-negative matrix factorization is a commonly used representation learning method, and some studies have applied it to bioinformatics problems such as pathway identification [19], drug–drug interaction prediction [20], and gene representation analysis [21]. Motivated by this, we propose a Multi-Layer Matrix Factorization method, called MLMF, for cancer subtyping via multi-omics data clustering. MLMF first takes the multi-omics feature matrices as input, performs multi-layer linear or nonlinear factorization on each matrix to decompose the original data into omics-specific latent feature representations, and then fuses these representations into a consensus representation. Finally, spectral clustering is performed on this consensus representation. In addition, an indicator matrix records which samples are missing in each omics, thereby unifying the processing of complete and incomplete multi-omics data in a common framework.
Method
MLMF first learns a consensus representation; cancer subtyping is then carried out on this representation with the spectral clustering algorithm [22].
Notation
Let
represents multi-omics dataset, where
is the number of omics.
is a collection of
data samples with dimension
in
omics measurements, where
. The consensus representation is
, where
is the ultimate dimension of consensus embedding space and
is the sample size of total data.
is the Frobenius norm.
Since data may be missing, the sample index matrix
on each omics data is constructed as follows:
![]() |
(1) |
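As an illustration of how a sample index matrix can record which patients were measured in one omics, the following minimal sketch builds a binary selection matrix from patient identifiers. The function name `sample_index_matrix`, the patient IDs, and the exact layout (one row per observed sample, one column per patient in the full cohort) are assumptions made for this example, not the definition given in Eq. (1).

```python
import numpy as np

def sample_index_matrix(observed_ids, all_ids):
    """Binary matrix W with W[i, j] = 1 when the i-th observed sample of an
    omics is the j-th patient of the full cohort, and 0 otherwise.
    (Assumed layout, chosen for illustration.)"""
    position = {pid: j for j, pid in enumerate(all_ids)}
    W = np.zeros((len(observed_ids), len(all_ids)))
    for i, pid in enumerate(observed_ids):
        W[i, position[pid]] = 1.0
    return W

# Hypothetical cohort of four patients; mRNA was not measured for "p2".
all_ids = ["p1", "p2", "p3", "p4"]
mrna_ids = ["p1", "p3", "p4"]
W_mrna = sample_index_matrix(mrna_ids, all_ids)
```

With this layout, the column sums of `W_mrna` show directly which patients are missing in the omics (a zero column), and a complete omics simply yields an identity matrix.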
The framework of MLMF
As shown in Fig. 1, MLMF mainly contains two modules. First, the deep semi-non-negative matrix factorization algorithm performs multi-layer factorization of each omics dataset to obtain a deep low-dimensional representation. Depending on the mapping used, two strategies can be formulated: linear mapping and nonlinear mapping. Then, in the consensus representation module, an indicator matrix records which samples are missing in each omics, and the per-omics representations are fused into a consensus representation. The consensus representation retains as much of the original information as possible by minimizing the reconstruction loss. Finally, cancer subtypes are identified on the consensus representation via spectral clustering. Because MLMF derives intermediate representations for each omics data type by matrix factorization and then solves an optimization problem to integrate them into a unified form, it belongs to the mid-term integration methods in multi-omics data analysis.
Figure 1.
The framework of MLMF.
Linear MLMF
The optimization problem based on the deep semi-nonnegative matrix factorization can be written as follows.
![]() |
(2) |
Among them,
is the
layer embedding representation of the
th omics data, and
is the
th layer basis matrix. The
module is used to control the sparsity of
, and the specific formula is as follows:
![]() |
(3) |
is a matrix with all elements equal to 1, and
represents the trace operation of the matrix. The final consistent representation should consider the missing status of some samples, thus it:
![]() |
(4) |
Among them,
is the index matrix that records the missing data. By minimizing the reconstruction error, the purpose of optimizing the consensus representation
and the deep feature matrix
of each omics data can be achieved. So the optimization goal of the reconstruction stage is defined as follows:
![]() |
(5) |
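To make the role of the index matrix in the fusion step concrete, here is a minimal sketch under a simplifying assumption: if the consensus is chosen to minimize the summed squared distance to every omics-specific representation over the samples observed in that omics, the minimizer is the per-sample average of the available representations. This assumed objective, the function name `consensus_by_averaging`, and the toy matrices are illustrative only and are not claimed to reproduce Eq. (4) or Eq. (5).

```python
import numpy as np

def consensus_by_averaging(H_list, W_list):
    """Fuse omics-specific representations into one consensus matrix.

    H_list[v]: (k, n_v) representation of the n_v samples observed in omics v.
    W_list[v]: (n_v, n) binary index matrix mapping those samples to the
               n patients of the full cohort.
    Returns a (k, n) matrix whose j-th column averages the representations
    of patient j over the omics in which the patient was observed.
    """
    total = sum(H @ W for H, W in zip(H_list, W_list))    # (k, n)
    counts = sum(W.sum(axis=0) for W in W_list)           # (n,)
    if np.any(counts == 0):
        raise ValueError("every patient must be observed in at least one omics")
    return total / counts

# Toy example: 3 patients, latent dimension 2.
H1 = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # omics 1: all patients
W1 = np.eye(3)
H2 = np.array([[3.0, 5.0], [6.0, 8.0]])             # omics 2: patients 0 and 2
W2 = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
H_consensus = consensus_by_averaging([H1, H2], [W1, W2])
```

Patient 1 is observed only in the first omics, so its consensus column is copied from `H1`, while the other two columns are averages over both omics.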
To sum up, the overall optimization objective of linear MLMF can be written as:
![]() |
(6) |
Among them,
and
are penalty trade-off coefficients.
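The ingredients named above, a multi-layer linear reconstruction and a trace-based sparsity term built from an all-ones matrix, can be sketched numerically as follows. The exact form of the penalty (the trace of the product of the transposed representation, the all-ones matrix, and the representation), the weight `alpha`, and all function names are assumptions for illustration; the source objective in Eq. (6) may combine the terms differently.

```python
import numpy as np

def reconstruct(Z_list, H):
    """Linear multi-layer reconstruction Z_1 @ Z_2 @ ... @ Z_m @ H."""
    out = H
    for Z in reversed(Z_list):
        out = Z @ out
    return out

def sparsity_penalty(H):
    """Assumed penalty Tr(H^T E H) with E an all-ones matrix.
    For a non-negative H this equals the sum of squared column L1 norms."""
    E = np.ones((H.shape[0], H.shape[0]))
    return float(np.trace(H.T @ E @ H))

def layered_loss(X, Z_list, H, alpha=1.0):
    """Squared Frobenius reconstruction error plus weighted sparsity penalty."""
    residual = X - reconstruct(Z_list, H)
    return float(np.linalg.norm(residual, "fro") ** 2) + alpha * sparsity_penalty(H)

rng = np.random.default_rng(0)
Z1 = rng.standard_normal((30, 10))   # first-layer basis (may be signed)
Z2 = rng.standard_normal((10, 4))    # second-layer basis
H = rng.random((4, 25))              # non-negative deepest representation
X = reconstruct([Z1, Z2], H)         # data that the factors fit exactly
```

Because only the deepest representation is constrained to be non-negative, the basis matrices can take either sign, which is what distinguishes semi-non-negative factorization from standard NMF.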
The problem is solved using the coordinate-descent iterative algorithm. The detailed solution process for each variable is shown in Supplementary Note 1.
For
, it is updated as follows:
![]() |
(7) |
where
and
.
For
, it is updated as follows:
![]() |
(8) |
where
, and
is the unit matrix.
,
.
For
, it is updated as follows:
![]() |
(9) |
where
.
For
, it is updated as follows:
![]() |
(10) |
Summarizing the above steps, the optimization process of the Linear MLMF is shown in Algorithm 1.
![]() |
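For readers unfamiliar with alternating updates in semi-non-negative factorization, the sketch below implements the classic single-layer semi-NMF scheme: a least-squares update of the basis followed by a multiplicative update that keeps the representation non-negative. It is a generic single-layer, single-omics illustration and is not the MLMF update rules of Eqs (7) to (10) or Algorithm 1; the function name `semi_nmf` and all parameter values are assumptions.

```python
import numpy as np

def _pos(A):
    return (np.abs(A) + A) / 2.0   # element-wise positive part

def _neg(A):
    return (np.abs(A) - A) / 2.0   # element-wise negative part (as a non-negative value)

def semi_nmf(X, k, n_iter=100, seed=0, eps=1e-9):
    """Approximate X (d x n) by Z @ H with H >= 0 and Z unconstrained."""
    rng = np.random.default_rng(seed)
    H = rng.random((k, X.shape[1])) + 0.1          # strictly positive start
    losses = []
    for _ in range(n_iter):
        # Basis update: exact least-squares solution for fixed H.
        Z = X @ H.T @ np.linalg.pinv(H @ H.T)
        # Representation update: multiplicative rule preserving H >= 0.
        ZtX = Z.T @ X
        ZtZ = Z.T @ Z
        numerator = _pos(ZtX) + _neg(ZtZ) @ H
        denominator = _neg(ZtX) + _pos(ZtZ) @ H + eps
        H = H * np.sqrt(numerator / denominator)
        losses.append(float(np.linalg.norm(X - Z @ H, "fro") ** 2))
    return Z, H, losses

X_demo = np.random.default_rng(1).standard_normal((15, 40))
Z_demo, H_demo, loss_history = semi_nmf(X_demo, k=3)
```

A multi-layer variant would apply the same idea repeatedly, factorizing the representation of one layer to obtain the next.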
Nonlinear MLMF
A purely linear decomposition of the data may fail to describe the nonlinear relationships among omics data. Hence, we introduce the Nonlinear MLMF.
First, construct the loss function. Compared with linear factorization, nonlinear factorization uses nonlinear mapping in all factorizations except the first layer. Nonlinear factorization decomposes the given data matrix
into
factors in a nonlinear way, as
.
is the m-level implicit representation of the data, which can be given by the following factorization:
![]() |
(11) |
The optimization goal of the deep matrix nonlinear factorization model is as follows:
![]() |
(12) |
The problem is solved using the gradient descent method. The detailed solution process for each variable is shown in the Supplementary Note 2.
For
, it is updated as follows:
![]() |
(13) |
where
and 
For
, it is updated as follows:
![]() |
(14) |
where
and 
For
, it is updated as follows:
![]() |
(15) |
where 
The optimization process of the deep matrix nonlinear factorization algorithm is shown in Algorithm 2.
![]() |
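The idea of a nonlinear factorization trained by gradient descent can be sketched with a two-layer model in which the first layer is linear and the second passes through a nonlinearity. The choice of `tanh`, the two-layer depth, the accept-or-shrink step-size rule, and the name `nonlinear_two_layer_fit` are all assumptions for illustration; the gradients of Eqs (13) to (15) and the steps of Algorithm 2 are not reproduced here.

```python
import numpy as np

def nonlinear_two_layer_fit(X, k1, k2, n_iter=200, lr=1e-3, seed=0):
    """Fit X ~ Z1 @ tanh(Z2 @ H2) with H2 >= 0 by projected gradient descent."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    Z1 = 0.1 * rng.standard_normal((d, k1))
    Z2 = 0.1 * rng.standard_normal((k1, k2))
    H2 = rng.random((k2, n))

    def loss(Z1, Z2, H2):
        return 0.5 * float(np.linalg.norm(X - Z1 @ np.tanh(Z2 @ H2), "fro") ** 2)

    history = [loss(Z1, Z2, H2)]
    for _ in range(n_iter):
        A = np.tanh(Z2 @ H2)                  # hidden representation
        R = Z1 @ A - X                        # reconstruction residual
        dP = (Z1.T @ R) * (1.0 - A ** 2)      # back-propagate through tanh
        grad_Z1, grad_Z2, grad_H2 = R @ A.T, dP @ H2.T, Z2.T @ dP
        candidate = (Z1 - lr * grad_Z1,
                     Z2 - lr * grad_Z2,
                     np.maximum(H2 - lr * grad_H2, 0.0))   # keep H2 non-negative
        new_loss = loss(*candidate)
        if new_loss < history[-1]:            # accept the step and grow the step size
            Z1, Z2, H2 = candidate
            history.append(new_loss)
            lr *= 1.1
        else:                                 # reject the step and shrink the step size
            lr *= 0.5
    return Z1, Z2, H2, history

X_demo = np.random.default_rng(2).standard_normal((20, 30))
Z1_fit, Z2_fit, H2_fit, history = nonlinear_two_layer_fit(X_demo, k1=8, k2=5)
```

Accepting only loss-reducing steps makes the recorded loss strictly decreasing, at the cost of one extra loss evaluation per iteration.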
Finally, the similarity matrix
is constructed as follows:
![]() |
(16) |
where
is a tuning parameter and
is the set of neighborhoods.
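A common way to build a similarity matrix with a scale parameter and a neighborhood set is a Gaussian kernel restricted to k-nearest neighbors, followed by normalized spectral clustering. The sketch below follows that generic recipe; the kernel form, the parameters `n_neighbors` and `sigma`, and the small built-in k-means are assumptions for illustration and may differ from Eq. (16) and from the authors' implementation.

```python
import numpy as np

def knn_gaussian_similarity(H, n_neighbors=10, sigma=1.0):
    """H: (n_samples, k). Gaussian similarity kept only between k-nearest neighbors."""
    sq = np.sum(H ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * H @ H.T, 0.0)
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    d2_noself = d2.copy()
    np.fill_diagonal(d2_noself, np.inf)               # exclude self from neighbors
    idx = np.argsort(d2_noself, axis=1)[:, :n_neighbors]
    mask = np.zeros(S.shape, dtype=bool)
    rows = np.repeat(np.arange(H.shape[0]), n_neighbors)
    mask[rows, idx.ravel()] = True
    mask |= mask.T                                    # symmetrize the neighbor graph
    return np.where(mask, S, 0.0)

def spectral_clustering(S, n_clusters, n_iter=50):
    """Normalized spectral clustering with a simple deterministic k-means."""
    deg = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(S.shape[0]) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                       # eigenvalues in ascending order
    U = vecs[:, :n_clusters]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Farthest-point initialization, then Lloyd iterations.
    centers = [U[0]]
    for _ in range(1, n_clusters):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(d, axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

In practice the rows of the input would be the columns of the consensus representation (one row per patient), and the resulting labels are the predicted subtypes.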
Results
Full multi-omics datasets
Several computational experiments evaluate the effectiveness of cancer subtyping with multi-omics data. We conducted experiments on 11 cancer datasets (AML, BIC, COAD, GBM, KIRC, LIHC, LUSC, OV, SKCM, SARC, and HNSC) from TCGA [23] and on the METABRIC dataset. The TCGA datasets include mRNA expression, DNA methylation, and miRNA expression data. The METABRIC dataset contains only mRNA expression and CNV data. After dimensionality reduction, the feature data are standardized using z-scores. All data are preprocessed in the same way as in Rappoport and Shamir [18]. Detailed information about the processed datasets is given in Supplementary Table 1.
We compare MLMF with 11 algorithms on complete multi-omics datasets: K-means and spectral clustering, as well as nine integration methods, namely LRAcluster [10], PINS [13], MCCA [15], iClusterBayes [24], SNF [25], CC [14], SNFCC [26], NEMO [18], and IntNMF [27]. The identified subtypes are evaluated by the number of enriched clinical parameters and the significance of survival analysis. The number of subtypes for each cancer type was determined by feature factorization. For simplicity, the penalty coefficients
and
are both set to 1, and the step size is adjusted in an adaptive manner. In the construction of the similarity matrix
, the adjustment parameter
is used to control the scale of the similarity calculation. The maximum number of iterations is set to 50, and the convergence tolerance
is
to balance the computational efficiency and result accuracy. Survival analysis with the Cox proportional hazards model and its p-value is used to assess whether the survival profiles of different cancer subtypes differ significantly [28]. To perform enrichment analysis of clinical signatures, we selected a unified set of patient clinical information for all cancers: sex and age at initial diagnosis, together with four discrete clinicopathological parameters that quantify tumor progression (pathology T), cancer in lymph nodes (pathology N), metastasis (pathology M), and overall progression (pathological stage). Following the recommendations of Rappoport and Shamir [18], the number of clusters for each comparison method was set to the value reported in the original paper.
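Since the survival results are reported as -log10 p-values of a logrank-type test, a minimal two-group log-rank test may help in reading those numbers. This sketch covers only the textbook two-group case with an asymptotic chi-square p-value; the function names and toy data are assumptions, and the evaluation protocol actually used in the experiments (for example, handling more than two subtypes) is the one described in the cited benchmark, not this code.

```python
import math
import numpy as np

def logrank_two_groups(time, event, group):
    """Two-group log-rank test.

    time:  follow-up times; event: 1 if death observed, 0 if censored;
    group: 0/1 subtype labels. Returns (chi-square statistic, p-value)."""
    time, event, group = np.asarray(time), np.asarray(event), np.asarray(group)
    observed = expected = variance = 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n = at_risk.sum()                              # patients still at risk
        n1 = (at_risk & (group == 1)).sum()            # of which in group 1
        deaths = (time == t) & (event == 1)
        d = deaths.sum()
        d1 = (deaths & (group == 1)).sum()
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))         # chi-square tail, 1 d.o.f.
    return chi2, p_value

def neg_log10_p(p_value):
    return -math.log10(p_value)

# Toy data: group 0 dies early (times 1..10), group 1 later (times 11..20).
times = np.arange(1, 21)
events = np.ones(20, dtype=int)
groups = np.array([0] * 10 + [1] * 10)
chi2_demo, p_demo = logrank_two_groups(times, events, groups)
```

With the 0.05 threshold used in Tables 1 and 2, a result counts as significant when `neg_log10_p` exceeds about 1.3.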
Table 1 and Fig. 2 show the cancer subtype prediction performance of different algorithms on the 12 complete datasets. The clusters discovered by MLMF_Linear and MLMF_Nonlinear had significant survival differences in 10 of the 12 cancer datasets. The average -log10 logrank p-value reaches 2.5 for MLMF_Nonlinear and 2.6 for MLMF_Linear; MCCA ranked third with 2.4. None of the methods found significant differences in survival for the COAD dataset. MLMF_Linear and MLMF_Nonlinear found at least one enriched clinical parameter in all datasets. The average number of enriched clinical parameters was 2.1 for MLMF_Nonlinear and 2.0 for MLMF_Linear. These results show that both the linear and the nonlinear factorization of MLMF can identify patient subtypes with consistent and clinically relevant differences.
Table 1.
The comparison of clustering results from different algorithms on 12 full datasets
| Alg./Cancer | AML | BIC | COAD | GBM | KIRC | LIHC | LUSC | OV | SKCM | SARC | HNSC | METABRIC | Mean | Sig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K-means | 1/2.4 | 2/3.5 | 1/0.4 | 2/2.6 | 1/0.8 | 2/0.2 | 0/1.5 | 2/0.3 | 2/0.9 | 2/1.3 | 2/1.7 | 2/1.4 | 1.6/1.4 | 11/7 |
| Spectral | 1/2.1 | 1/5.0 | 1/0.7 | 2/2.5 | 2/1.8 | 2/0.4 | 0/2.1 | 2/0.8 | 0/0.6 | 2/1.3 | 1/1.5 | 2/1.8 | 1.3/1.7 | 10/8 |
| LRAcluster | 1/1.8 | 2/4.0 | 1/0.1 | 2/1.1 | 2/1.0 | 2/2.4 | 1/1.0 | 2/0.2 | 3/2.9 | 2/2.5 | 2/1.6 | 2/2.0 | 1.8/1.7 | 12/7 |
| CC | 1/3.8 | 1/2.8 | 1/0.5 | 2/2.1 | 3/1.3 | 2/0.5 | 1/1.1 | 1/0.2 | 3/2.5 | 2/1.0 | 0/1.1 | 1/1.3 | 1.5/1.5 | 11/6 |
| PINS | 1/1.6 | 1/2.8 | 0/0.5 | 1/4.4 | 2/1.0 | 2/0.8 | 0/1.9 | 1/0.1 | 1/1.0 | 2/0.8 | 1/0.9 | 1/1.2 | 1.1/1.4 | 10/4 |
| MCCA | 1/1.2 | 1/8.0 | 0/0.2 | 1/2.9 | 2/1.8 | 2/1.1 | 2/2.3 | 0/0.6 | 2/4.7 | 2/1.5 | 2/2.1 | 2/2.4 | 1.4/2.4 | 10/8 |
| iClusterBayes | 1/1.5 | 0/1.3 | 2/0.1 | 1/3.1 | 4/7.3 | 2/2.2 | 0/1.5 | 2/0.9 | 2/0.6 | 2/3.7 | 2/1.5 | 2/2.3 | 1.7/2.2 | 10/9 |
| SNF | 1/3.0 | 2/6.0 | 1/0.2 | 2/2.6 | 3/1.7 | 2/0.3 | 1/1.2 | 2/0.2 | 1/1.1 | 2/1.9 | 1/1.2 | 1/1.3 | 1.6/1.7 | 12/6 |
| SNFCC | 1/3.8 | 3/7.2 | 2/0.6 | 2/2.3 | 2/1.1 | 1/1.2 | 1/1.0 | 1/0.2 | 2/0.6 | 2/1.1 | 2/1.4 | 2/1.6 | 1.8/1.8 | 12/5 |
| NEMO | 1/1.8 | 2/4.2 | 0/0.1 | 1/3.8 | 4/2.2 | 4/4.2 | 0/1.8 | 1/0.4 | 3/4.0 | 2/1.9 | 2/1.3 | 2/1.5 | 1.8/2.3 | 10/10 |
| IntNMF | 1/1.9 | 1/4.3 | 1/0.2 | 1/3.5 | 3/0.2 | 2/2.0 | 0/0.9 | 0/0.7 | 2/4.1 | 2/1.8 | 2/1.7 | 1/1.9 | 1.3/1.9 | 10/8 |
| MLMF_Linear | 1/3.4 | 3/5.9 | 1/0.4 | 2/4.1 | 2/1.4 | 2/3.2 | 1/1.8 | 2/1.9 | 3/2.9 | 2/1.0 | 2/2.2 | 3/2.6 | 2.0/2.6 | 12/10 |
| MLMF_Nonlinear | 1/3.1 | 4/5.5 | 2/0.3 | 1/4.5 | 3/1.5 | 3/3.1 | 1/1.6 | 1/2.7 | 3/4.3 | 2/0.8 | 2/1.3 | 2/1.5 | 2.1/2.5 | 12/10 |
Note: in each cell A/B, A is the number of significantly enriched clinical parameters and B is the -log10 P-value for survival. The significance threshold is 0.05, and bold indicates significant results. Mean is the average value for each algorithm. Sig is the number of datasets with significant results.
Figure 2.
Mean performance of the different algorithms on 12 cancer datasets.
To compare the subtypes obtained by MLMF_Linear with existing subtypes, and to show the differential expression between subtypes, we designed the following experiment. First, the PAM50 subtype labels on the BIC dataset were selected for comparison. Second, since 48 mRNA expression features were associated with the 50 genes of PAM50, we deleted these 48 features from the original mRNA data of the BIC dataset to eliminate the direct effect of known oncogenes in the multi-omics data, and then input the processed mRNA data into MLMF_Linear together with the other omics data. Finally, a heat map was drawn using the expression of the 48 mRNAs to show the correlation between oncogenes and the subtypes obtained from MLMF_Linear, as well as the overlap between the subtypes obtained by MLMF_Linear and PAM50. As shown in Fig. 3, different subtypes have different mRNA expression patterns, and there is a large overlap between MLMF_Linear and PAM50, for example between the LumA subtype of PAM50 and subtype 1 of MLMF_Linear, and between the Basal subtype of PAM50 and subtype 3 of MLMF_Linear.
Figure 3.
The heatmap for BIC dataset.
In order to verify the training effect of the MLMF algorithm, this paper records the changes in the loss function values of MLMF_Linear and MLMF_Nonlinear under 20 epochs, as shown in Fig. 4. It can be seen from the figure that the loss of MLMF_Linear and MLMF_Nonlinear both show a downward and convergent trend. MLMF_Linear has a great improvement in the early stage of training, and the loss drops rapidly. The convergence process of MLMF_Nonlinear is more stable, showing a gradual downward trend. Since the analytical solution at each iteration in coordinate descent of the linear MLMF could be obtained, the linear method had a faster convergence rate [29].
Figure 4.
The change of the loss function values of MLMF_Linear and MLMF_Nonlinear under 20 epochs.
To verify that the cancer subtypes identified by the MLMF_Linear algorithm are biologically meaningful and provide interpretability, we performed GO enrichment analysis on the experimental data results, as detailed in Supplementary Note 3. The analysis results showed that different cancer subtypes exhibited significant differences in key biological processes.
Partial multi-omics datasets
To evaluate the performance of the method on partial multi-omics datasets, we used the same 12 datasets analyzed above and simulated the loss of omics measurements for some patients. Specifically, for the TCGA datasets, the DNA methylation and miRNA expression data are kept complete, and the mRNA expression data are removed for a randomly selected subset of patients, with missing rates of 0.1, 0.3, 0.5, and 0.7. For the METABRIC dataset, the CNV data are kept complete and mRNA expression is removed in the same way. Enrichment analysis and survival analysis are again used to evaluate the performance of the method. Table 2 shows the comparison results of different algorithms on the 12 simulated missing datasets.
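The simulation of missing measurements described above can be sketched as drawing a random subset of patients and dropping their columns from one omics matrix. The function name `simulate_missing_omics`, the random seed, and the toy matrix are assumptions for illustration; the exact sampling procedure used in the experiments is not specified beyond the missing rates.

```python
import numpy as np

def simulate_missing_omics(n_patients, missing_rate, seed=0):
    """Return a boolean mask that is False for patients whose measurements
    in one omics (for example mRNA expression) are treated as missing."""
    rng = np.random.default_rng(seed)
    n_missing = int(round(missing_rate * n_patients))
    missing = rng.choice(n_patients, size=n_missing, replace=False)
    observed = np.ones(n_patients, dtype=bool)
    observed[missing] = False
    return observed

# Hypothetical mRNA matrix: 50 features (rows) by 100 patients (columns).
X_mrna = np.random.default_rng(1).standard_normal((50, 100))
masks = {rate: simulate_missing_omics(100, rate) for rate in (0.1, 0.3, 0.5, 0.7)}
X_mrna_partial = X_mrna[:, masks[0.3]]   # keep only patients with mRNA observed
```

The other omics matrices are left untouched, so every patient remains observed in at least one omics.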
Table 2.
Performance of different algorithms on 12 simulated missing datasets
| Alg./Cancer | AML | BIC | COAD | GBM | KIRC | LIHC | LUSC | OV | SKCM | SARC | HNSC | METABRIC | Mean | Sig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Missing rate = 0.1 | | | | | | | | | | | | | | |
| MCCA | 1/3.9 | 1/3.5 | 0/0.2 | 1/2.0 | 2/2.2 | 1/0.7 | 0/0.9 | 2/0.3 | 1/2.7 | 0/0.8 | 1/1.5 | 1/1.7 | 0.9/1.7 | 9/7 |
| NEMO | 1/3.1 | 2/4.3 | 1/0.1 | 1/2.8 | 3/1.5 | 3/2.8 | 1/2.2 | 1/0.1 | 1/0.5 | 2/0.9 | 2/1.1 | 2/1.6 | 1.7/1.8 | 12/7 |
| MLMF_Linear | 1/3.4 | 2/5.0 | 1/0.7 | 2/3.3 | 4/2.4 | 3/1.4 | 1/1.3 | 1/1.4 | 1/3.7 | 2/0.8 | 2/1.5 | 2/1.9 | 1.8/2.2 | 12/10 |
| MLMF_Nonlinear | 1/3.0 | 2/5.8 | 1/1.3 | 2/2.9 | 4/2.4 | 3/1.8 | 1/2.6 | 1/1.9 | 2/1.4 | 2/0.6 | 2/2.3 | 2/2.5 | 1.9/2.4 | 12/11 |
| Missing rate = 0.3 | | | | | | | | | | | | | | |
| MCCA | 1/2.0 | 2/3.6 | 0/0.2 | 0/0.4 | 0/1.5 | 1/1.1 | 0/0.5 | 2/0.1 | 0/1.7 | 0/0.9 | 0/0.8 | 0/1.2 | 0.5/1.2 | 4/4 |
| NEMO | 1/2.4 | 2/4.0 | 1/0.3 | 1/1.6 | 3/1.2 | 3/3.8 | 0/0.7 | 2/0.3 | 0/0.2 | 2/0.9 | 2/1.3 | 2/1.7 | 1.6/1.5 | 10/6 |
| MLMF_Linear | 1/2.4 | 2/5.3 | 1/0.4 | 2/3.7 | 2/1.9 | 1/2.0 | 1/0.9 | 2/1.4 | 1/1.4 | 2/1.1 | 2/1.4 | 2/1.7 | 1.6/2.0 | 12/9 |
| MLMF_Nonlinear | 1/2.5 | 2/5.6 | 1/0.6 | 2/3.5 | 3/1.9 | 1/2.2 | 1/1.2 | 2/2.7 | 2/2.2 | 2/0.4 | 2/2.5 | 2/2.3 | 1.8/2.3 | 12/9 |
| Missing rate = 0.5 | | | | | | | | | | | | | | |
| MCCA | 1/2.8 | 1/3.6 | 0/0.3 | 1/1.6 | 2/2.7 | 1/0.6 | 0/0.8 | 1/0.1 | 2/1.1 | 1/1.3 | 1/1.0 | 1/1.4 | 1.0/1.4 | 10/6 |
| NEMO | 1/3.1 | 2/4.7 | 1/0.1 | 1/2.7 | 1/1.2 | 2/1.9 | 1/1.1 | 1/0.1 | 0/0.3 | 2/2.1 | 1/1.2 | 2/2.1 | 1.3/1.7 | 11/6 |
| MLMF_Linear | 1/3.1 | 2/4.8 | 1/0.3 | 2/3.4 | 2/2.0 | 2/1.5 | 1/1.3 | 1/0.9 | 0/0.5 | 1/1.0 | 2/1.6 | 2/2.3 | 1.4/1.9 | 11/8 |
| MLMF_Nonlinear | 1/2.8 | 1/4.7 | 2/0.4 | 2/3.9 | 3/2.5 | 2/1.7 | 1/0.8 | 1/0.7 | 0/1.4 | 2/0.7 | 1/1.8 | 2/2.2 | 1.5/2.0 | 11/8 |
| Missing rate = 0.7 | | | | | | | | | | | | | | |
| MCCA | 1/2.1 | 1/3.8 | 0/0.3 | 1/2.5 | 2/2.6 | 1/1.3 | 0/1.3 | 1/0.1 | 2/2.4 | 0/0.1 | 0/0.8 | 1/2.1 | 0.8/1.6 | 8/8 |
| NEMO | 1/2.9 | 2/4.5 | 1/0.1 | 1/3.3 | 4/2.2 | 2/1.9 | 0/1.1 | 1/0.1 | 0/0.3 | 2/0.9 | 1/1.1 | 2/1.9 | 1.4/1.7 | 10/6 |
| MLMF_Linear | 1/2.6 | 2/4.8 | 1/0.3 | 2/3.6 | 3/1.7 | 1/1.7 | 2/0.3 | 1/0.9 | 0/1.4 | 1/1.9 | 1/1.6 | 2/1.8 | 1.4/1.9 | 11/9 |
| MLMF_Nonlinear | 1/2.9 | 2/4.4 | 1/0.3 | 2/3.1 | 3/2.7 | 2/2.1 | 1/1.1 | 2/0.5 | 0/1.4 | 1/1.9 | 1/1.8 | 2/2.4 | 1.5/2.1 | 11/9 |
Note: the value heading each block of rows is the fraction of missing data. In each cell A/B, A is the number of significantly enriched clinical parameters and B is the -log10 P-value for survival. The significance threshold is 0.05, and bold indicates significant results. Mean is the average value for each algorithm. Sig is the number of datasets with significant results.
Table 2 and Fig. 5 show the cancer subtype prediction performance of different algorithms on the 12 incomplete datasets. On average, MLMF_Linear and MLMF_Nonlinear matched or outperformed NEMO and MCCA in both survival and enrichment analysis at all missing rates. Under the same missing rate, the average performance of the nonlinear decomposition is better than that of the linear decomposition. These results indicate that MLMF can be applied when part of the omics data is missing. Overall, cancer subtyping by MLMF yielded statistically significant survival differences and significant clinical enrichment in most datasets.
Figure 5.
Mean performance of the different algorithms on 12 cancer datasets at each fraction of missing data.
To evaluate the efficiency of the MLMF algorithm, we compared the running time of MLMF_Linear and MLMF_Nonlinear on the BIC dataset with that of ten algorithms, namely K-means, spectral clustering, LRAcluster, PINS, MCCA, iClusterBayes, SNF, CC, SNFCC, and NEMO. As can be seen from Supplementary Fig. 2, the fastest algorithm is spectral clustering and the slowest is iClusterBayes. The running time of MLMF_Linear is at a medium level among the benchmark methods, and MLMF_Nonlinear is still faster than CC and iClusterBayes. At the same time, the cancer subtyping performance of MLMF is better than that of many state-of-the-art approaches.
Conclusion
Predicting cancer subtypes using multi-omics data enables researchers and clinicians to adopt a more comprehensive and precise approach to patient treatment. Data from various omics offer distinct insights into biological processes, and by integrating these multi-omics datasets, researchers can uncover unique patterns and molecular features associated with different cancer subtypes. In this paper, we introduce MLMF, a multi-layer matrix factorization method designed for cancer subtyping through the clustering of multi-omics data. For the first time, MLMF unifies the processing pipelines for complete and missing multi-omics data within a common framework. It performs multi-layer linear or nonlinear factorization on the multi-omics feature matrices, breaking down the original data representation into omics-specific latent feature representations. These representations are then fused to create a consensus representation. Cancer subtypes are identified through spectral clustering of this consensus representation. Experimental results on 12 multi-omics datasets demonstrate that MLMF matches or outperforms other related methods. While our study focused on two to three omics levels, MLMF provides a versatile framework that can be easily adapted to scenarios involving additional omics data. We believe that MLMF holds significant promise for advancing precision oncology and enhancing patient outcomes.
Key Points
A new multi-layer matrix factorization algorithm (MLMF) is proposed, which can simultaneously learn multi-omics feature representation and cancer subtype labels
The identification process of complete and missing multi-omics data is unified in a common framework
The experimental results on the TCGA and METABRIC datasets show that MLMF has advantages in the ability to identify cancer subtypes.
Contributor Information
Yingxuan Ren, National University of Singapore, 119077, Singapore.
Fengtao Ren, Department of Engineering, The Chinese University of Hong Kong, 999077, Hong Kong, China.
Bo Yang, School of Computer Science, Xi'an Polytechnic University, 710048, Xi'an, China.
Funding
This work was supported by Natural Science Basic Research Program of Shaanxi (2024JC-YBMS-473).
References
- 1. Reis-Filho JS, Pusztai L. Gene expression profiling in breast cancer: classification, prognostication, and prediction. Lancet 2011;378:1812–23. 10.1016/S0140-6736(11)61539-0
- 2. Sotiriou C, Neo SY, McShane LM. et al. Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci 2003;100:10393–8. 10.1073/pnas.1732912100
- 3. Etcheverry A, Aubry M, De Tayrac M. et al. DNA methylation in glioblastoma: impact on gene expression and clinical outcome. BMC Genomics 2010;11:1–11.
- 4. Cai Y, Wang S. Deeply integrating latent consistent representations in high-noise multi-omics data for cancer subtyping. Brief Bioinform 2024;25:bbae061.
- 5. Subramanian I, Verma S, Kumar S. et al. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insight 2020;14:1177932219899051. 10.1177/1177932219899051
- 6. Shahrajabian MH, Sun W. Survey on multi-omics, and multi-omics data analysis, integration and application. Curr Pharm Anal 2023;19:267–81.
- 7. Yang Y, Tian S, Qiu Y. et al. MDICC: novel method for multi-omics data integration and cancer subtype identification. Brief Bioinform 2022;23:bbac132.
- 8. Ma Y, Guan J. MOCSC: a multi-omics data based framework for cancer subtype classification. In: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2853–9. IEEE, 2022. 10.1007/s11606-022-07552-y.
- 9. Chen W, Wang H, Liang C. Deep multi-view contrastive learning for cancer subtype identification. Brief Bioinform 2023;24:bbad282.
- 10. Wu D, Wang D, Zhang MQ. et al. Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification. BMC Genom 2015;16:1–10.
- 11. Duan R, Gao L, Gao Y. et al. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Comput Biol 2021;17:e1009224. 10.1371/journal.pcbi.1009224
- 12. Yuanyuan Z, Ziqi W, Shudong W. et al. SSIG: single-sample information gain model for integrating multi-omics data to identify cancer subtypes. Chin J Electron 2021;30:303–12. 10.1049/cje.2021.01.011
- 13. Nguyen T, Tagett R, Diaz D. et al. A novel approach for data integration and disease subtyping. Genome Res 2017;27:2025–39. 10.1101/gr.215129.116
- 14. Monti S, Tamayo P, Mesirov J. et al. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 2003;52:91–118. 10.1023/A:1023949509487
- 15. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8. 10.2202/1544-6115.1470
- 16. Mo Q, Wang S, Seshan VE. et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci 2013;110:4245–50. 10.1073/pnas.1208949110
- 17. Xu H, Gao L, Huang M. et al. A network embedding based method for partial multi-omics integration in cancer subtyping. Methods 2021;192:67–76.
- 18. Rappoport N, Shamir R. NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics 2019;35:3348–56.
- 19. Park S, Kar N, Cheong JH. et al. Bayesian semi-nonnegative matrix tri-factorization to identify pathways associated with cancer phenotypes. Pacific Symp Biocomput 2020;2019:427–38.
- 20. Yu H, Mao KT, Shi JY. et al. Predicting and understanding comprehensive drug–drug interactions via semi-nonnegative matrix factorization. BMC Syst Biol 2018;12:101–10.
- 21. Jiang W, Ma T, Feng X. et al. Robust semi-nonnegative matrix factorization with adaptive graph regularization for gene representation. Chin J Electron 2020;29:122–31. 10.1049/cje.2019.11.001
- 22. Von Luxburg U. A tutorial on spectral clustering. Stat Comput 2007;17:395–416. 10.1007/s11222-007-9033-z
- 23. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455:1061–8.
- 24. Mo Q, Shen R, Guo C. et al. A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics 2018;19:71–86.
- 25. Wang B, Mezlini AM, Demir F. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 2014;11:333–7. 10.1038/nmeth.2810
- 26. Xu T, Le TD, Liu L. et al. CancerSubtypes: an R/Bioconductor package for molecular cancer subtype identification, validation and visualization. Bioinformatics 2017;33:3131–3.
- 27. Hosmer DW Jr, Lemeshow S, May S. Applied Survival Analysis: Regression Modeling of Time-to-Event Data, vol. 618. John Wiley & Sons, 2008.
- 28. Yang H, Sheng Y, Jiang Y. et al. Subtype-former: a deep learning approach for cancer subtype discovery with multi-omics data. arXiv preprint 2022;arXiv:2207.14639. 10.48550/arXiv.2207.14639
- 29. Arora S, Cohen N, Golowich N. et al. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint 2018;arXiv:1810.02281. 10.48550/arXiv.1810.02281