Abstract
Cancer is a complex disease of gene mutation that arises from the accumulation of mutations during somatic cell evolution. With the advent of high-throughput technology, a large amount of omics data has been generated, and finding cancer-related driver genes in these data remains a challenge. Early on, researchers developed many frequency-based methods for identifying driver genes, but these methods identify driver genes with low mutation rates poorly. Later, researchers developed network-based methods that fuse multi-omics data, but they rarely considered the connections among features. In this paper, after analyzing a large number of methods for integrating multi-omics data, a hierarchical weak consensus model for fusing multiple features is proposed according to the connections among features. By analyzing the connection between the PPI network and the co-mutation hypergraph network, this paper first proposes a new topological feature, called the co-mutation clustering coefficient (CMCC). Then, a hierarchical weak consensus model is used to integrate CMCC with the mRNA and miRNA differential expression scores, yielding a new driver gene identification method, HWC. HWC is compared with seven current state-of-the-art methods on three types of cancer. The comparison results show that HWC has the best identification performance in terms of the statistical evaluation indices, functional consistency and the partial area under the ROC curve.
Supplementary Information
The online version contains supplementary material available at 10.1007/s13755-024-00279-6.
Keywords: Cancer, Driver genes, Multi-omics data, Hierarchical weak consensus model, Hypergraph
Introduction
Cancer is a global problem that causes tens of millions of deaths every year and seriously threatens people’s life and health. Researchers have found that cancer is a disease of gene mutation, which results from the accumulation of mutations during the evolution of somatic cells [1]. In recent years, with the development of high-throughput sequencing technology, large-scale cancer sequencing projects such as The Cancer Genome Atlas (TCGA) [2] and the International Cancer Genome Consortium (ICGC) [3] have produced a large amount of cancer genomics data. Based on the multi-omics data generated by these projects, researchers can study the reproduction of cancer cells at the genetic level and understand the causes of cancer, which is important for the development of new therapies and antibiotics. Research has shown that cancer is caused by a large number of driver mutations (mutations in driver genes) [4]. Therefore, finding the driver mutations [5–7] that promote the reproduction of cancer cells among a large number of genetic mutations is a challenge. In recent years, using computational methods to identify driver genes has become a hot topic in cancer genomics. Generally, these methods are divided into frequency-based methods and network-based methods.
At the early stage, researchers developed many methods based on gene mutation frequency [8–10] to identify driver genes. They calculated the mutation frequency of each gene, and then compared it with the background mutation rate (BMR) [11, 12], so as to find the driver genes from a large number of mutation genes. However, due to the heterogeneity of the cancer and the large amount of noise contained in the mutation data, it is difficult to accurately calculate BMR. The researchers have therefore included some biological characteristics related to variation frequency to identify driver genes. The MuSiC [9] and MutSigCV [10] methods accounted for heterogeneity between cancer types and biological profiles to predict cancer driver genes. The PathScan method [13] took into account the length of genes and the mutation probability values of multiple samples, thereby improving the prediction rate of driver genes. The method [14] proposed by Youn et al. combined protein function effects, mutation frequency and genetic sequence redundancy to identify driver genes. Although these methods improve the recognition rate to a certain extent, the frequency-based methods could not well identify the driver genes with low mutation frequency.
To overcome this problem, researchers in recent years have begun to consider fusing multiple omics data in the protein–protein interaction (PPI) network to identify cancer driver genes [15, 16]. There is evidence that multiple driver genes work together to transform normal cells into tumor cells [17–20], and these genes are often enriched on the same biological networks. The DriverNet method [6] proposed by Bashashati et al. used mutation data, gene expression data and PPI network to construct a bipartite graph to prioritize genes. The same data is used in the DawnRank method [21]. The CRNMF method [22] combined PPI network, mutation data and gene expression data to construct a non-negative matrix of co-regularization to identify driver genes. The IntDriver method [23] used a matrix factorization framework to integrate gene interaction network, gene ontological similarity and mutation data, and then predicted driver genes in different patient samples. The innovation in the NetICS method [24] was graph diffusion, which prioritized genes according to mediation effects. The Subdyquency method [25] used mutation data and gene expression data to construct a bipartite graph, then combined with subcellular localization data to predict driver genes. The EntroRank method [26] was an entropy-based method, which predicted the driver gene by integrating subcellular location data, mutation data, gene interaction network, and specificity of the driver gene. The MinNetRank method [27] adopted a minimal strategy to integrate mutation data and gene expression data for each sample to prioritize genes. The DriverRWH method [28] proposed by Wang et al. was a random walk algorithm based on weighted mutation hypergraph, which combined somatic mutation data and PPI network.
The above methods improve the accuracy of driver gene identification to some extent. However, they do not consider the association among the features extracted from multi-omics data. After analyzing a large number of methods based on multi-feature fusion, a phenomenon among features is found: when most of the feature scores of gene A are higher than the corresponding feature scores of gene B, even if a small number of its feature scores are slightly lower than those of gene B, gene A is more likely to be the driver gene. In this paper, this relationship among features is called weak consensus, and a multi-feature weak consensus framework, the hierarchical weak consensus model, is proposed, which fuses features layer by layer to identify driver genes. It is also found that current methods often combine somatic mutation data and gene expression data, and rarely use miRNA data. It has been reported that an mRNA target can be controlled by multiple miRNAs, and miRNAs can also promote tumor invasion, angiogenesis, tumor growth, and immune invasion [29, 30]. Therefore, in this paper, miRNA and mRNA data are used for differential expression analysis of genes. Finally, a novel method called HWC is proposed, which uses the hierarchical weak consensus model to integrate mutation data, mRNA, miRNA and PPI data. The specific contributions are as follows:
Combining hypergraph theory with mutation data and PPI network, a co-mutation hypergraph network is constructed and then the co-mutation entropy is used to extract the network features.
A new topological feature, the co-mutation clustering coefficient (CMCC), is designed, which can well reflect the clustering of genes and the importance of mutation in the network. Firstly, a weighted edge clustering coefficient (ECC) network is constructed by calculating the ECC for each gene; then the characteristics of the ECC network and the co-mutation hypergraph network are fused into a new network feature CMCC.
A dynamic threshold function is designed to filter the noise in the gene expression data.
A hierarchical weak consensus model for fusing multiple features is proposed, which can perform hierarchical fusion of multiple features.
The method proposed in this paper is compared with seven methods, namely DawnRank [21], CRNMF [22], IntDriver [23], Subdyquency [25], EntroRank [26], MinNetRank [27] and DriverRWH [28], on data from three cancer types. The results show that HWC performs the best in terms of the statistical evaluation indices, functional consistency, and the partial area under the ROC curve (PAUC).
Method
The overall flow of the HWC method is shown in Fig. 1. (1) In the PPI network (PIN), the ECC value is calculated for each node and an ECC-based weighted network is constructed. (2) The somatic mutation matrix and the PIN are used to establish a co-mutation hypergraph network, and then the co-mutation entropy of each gene is calculated. (3) A new feature, CMCC, is obtained by fusing the ECC-based weighted network with the co-mutation hypergraph network into a new topology score. (4) The differential expression scores DE and Mi_DE of mRNA and miRNA are calculated, respectively, and the dynamic threshold function is used to filter the mRNA matrix. (5) The priority of each feature is calculated first, then the hierarchical weak consensus model is used to fuse these feature scores, and the final score FS is obtained.
Fig. 1.
The overview of the HWC method (each colored circular area in (2) is a hyperedge)
Edge clustering coefficient
Based on the node clustering coefficient theory, Radicchi et al. proposed feature ECC [31]. As an important network topology feature, ECC is usually used to describe the importance of edges between nodes in the network. Here, the improved ECC is used to represent the clustering among nodes in the PPI network [32]. Given a PPI network G = (V,E),V represents the set of nodes and E denotes the set of edges. The ECC value between nodes u and v in the PPI network is defined as follows:
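$$ECC(u,v) = \frac{Z_{u,v}}{\min(d_u - 1,\ d_v - 1)} \tag{1}$$
where $Z_{u,v}$ represents the number of triangles containing edge (u, v), and $d_u$ and $d_v$ indicate the degrees of nodes u and v, respectively. The ECC value of gene u is:
$$ECC(u) = \sum_{v \in N_u} ECC(u,v) \tag{2}$$
where $N_u$ is the neighborhood set of node u.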
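The ECC computation can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the commonly used form in which the ECC of an edge is the number of triangles through the edge divided by min(d_u − 1, d_v − 1), and it defines the value as 0 when an endpoint has degree 1. The helper names and the toy network are hypothetical.

```python
def build_adjacency(edges):
    """Turn an undirected edge list into a dict: node -> set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def edge_ecc(adj, u, v):
    """ECC of edge (u, v): triangles through the edge / min(d_u - 1, d_v - 1)."""
    triangles = len(adj[u] & adj[v])  # each common neighbour closes one triangle
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    # Assumption: an edge with a degree-1 endpoint gets ECC 0 (denominator is 0).
    return triangles / denom if denom > 0 else 0.0


def node_ecc(adj, u):
    """ECC of node u: sum of the ECC values of the edges incident to u."""
    return sum(edge_ecc(adj, u, v) for v in adj[u])


# Toy network (hypothetical node names): a triangle A-B-C plus a pendant node D.
adj = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
ecc_scores = {node: node_ecc(adj, node) for node in adj}
```

In the toy network the pendant node D receives a score of 0, while the three triangle members receive equal scores, which matches the intuition that ECC rewards membership in densely connected neighbourhoods.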
Calculate co-mutation entropy
The somatic mutation matrix SM|N|×|P| (N represents a set of genes, P represents the set of patients) is a 0,1 matrix, in which suv = 1(u = 1,2,…,|N|;v = 1,2…|P|) indicates that gene u is mutated in patient v, otherwise the value is 0. The variation frequency of gene u is defined as follows:
$$f(u) = \frac{|t(u)|}{|P|} \tag{3}$$
where t(u) represents the set of patients in whom gene u is mutated, and |P| is the number of patients.
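As a small illustration, the variation frequency can be computed directly from the binary mutation matrix as the fraction of patients in whom a gene is mutated. The dictionary layout and the toy values below are assumptions made for the example, not data from the study.

```python
def mutation_frequency(sm):
    """Fraction of patients in which each gene is mutated.

    sm maps a gene name to its 0/1 row of the somatic mutation matrix
    (one entry per patient).
    """
    return {gene: sum(row) / len(row) for gene, row in sm.items()}


# Toy 3-gene x 4-patient matrix (invented values, for illustration only).
sm = {
    "TP53": [1, 1, 0, 1],
    "BRCA1": [0, 1, 0, 0],
    "ABL1": [0, 0, 0, 0],
}
freq = mutation_frequency(sm)
```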
It has been shown that the cooperation of multiple mutant genes can transform normal cells into cancer cells [17–20]. Previous studies mainly considered mutation information for a single gene and ignored the connections among multiple genes. Since an edge of a hypergraph can connect multiple vertices, a hypergraph is suitable for representing higher-order relationships and can preserve the co-mutation information among mutated genes well [30]. Therefore, the PPI network and the SM matrix are used to build a hypergraph, with the cancer samples of the mutation matrix as the hyperedges and the mutant genes as the vertices. The hypergraph is defined as HG(VHG,EHG), where VHG is the set of vertices and EHG is the set of hyperedges. The hypergraph constructed in this paper is called the co-mutation hypergraph network.
In the constructed co-mutation hypergraph network, there is a subnetwork within each hyperedge. Then, the variant frequency of genes is taken as the input probability to calculate the co-mutation entropy of mutant genes in each hyperedge. The formula for calculating co-mutation entropy is as follows:
| 4 |
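where e is the hyperedge and u is the mutant gene. After the co-mutation entropy of the vertices in all hyperedges has been calculated, the co-mutation entropy of gene u is obtained by summing the co-mutation entropy of gene u over all hyperedges that contain it:
$$CME(u) = \sum_{e \in E_{HG},\ u \in e} CME_e(u) \tag{5}$$
Here $CME_e(u)$ denotes the co-mutation entropy of gene u in hyperedge e given by Eq. (4).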
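The hypergraph construction and the per-gene aggregation can be sketched as follows. This is a hedged sketch under stated assumptions: hyperedges are restricted to mutated genes that also occur in the PPI network, and the per-hyperedge score is passed in as a function because the exact entropy formula is not reproduced here. The function names and the toy data are hypothetical.

```python
def build_hyperedges(sm, patients, ppi_genes):
    """One hyperedge per patient: the mutated genes that also occur in the PPI network.

    sm maps gene -> 0/1 row; patients lists the column labels in the same order.
    """
    hyperedges = {}
    for j, patient in enumerate(patients):
        members = {g for g, row in sm.items() if row[j] == 1 and g in ppi_genes}
        if members:  # skip patients with no mutated gene in the network
            hyperedges[patient] = members
    return hyperedges


def sum_over_hyperedges(hyperedges, per_edge_score):
    """Add up a per-hyperedge score for every gene over the hyperedges containing it."""
    totals = {}
    for members in hyperedges.values():
        for gene in members:
            totals[gene] = totals.get(gene, 0.0) + per_edge_score(gene, members)
    return totals


# Toy data (invented): G4 is mutated in P3 but is absent from the PPI network.
sm = {"G1": [1, 1, 0], "G2": [1, 0, 0], "G3": [0, 1, 1], "G4": [0, 0, 1]}
hyperedges = build_hyperedges(sm, ["P1", "P2", "P3"], {"G1", "G2", "G3"})

# Placeholder score of 1.0 per hyperedge, so the total is just the vertex degree;
# the real method would plug in the co-mutation entropy of the gene in that hyperedge.
totals = sum_over_hyperedges(hyperedges, lambda gene, members: 1.0)
```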
Co-mutation clustering coefficient
In this paper, a weighted PPI network is established by the edge clustering coefficient, and then a co-mutation hypergraph network is established using mutation data. Then, the feature scores of the two networks are fused into a new topological feature, CMCC, which is defined as follows:
| 6 |
where the higher the value of CMCC, the more mutational information the gene contains, and the higher the aggregation.
Calculate the differential expression score
Given gene expression matrix , where G represents genes set, P and represent normal samples set and cancer samples set respectively; the miRNA expression matrix , Gi for the set of miRNAs, and Pi and for the normal and cancer samples set, respectively. In matrix E, auv(u = 1,2,…,|G|;v = 1,2…|P + |) is the expression of gene u on sample v. A bipartite graph BG = (VB,EB),VB = (VE ∪ Vmi) is defined. VE represents a set of genes and Vmi is a set of miRNAs. Each undirected edge(vbu,vbv) ∈ EB, where vbu ∈ VE, vbv ∈ Vmi, indicating the interaction between gene and miRNA.
Because the gene expression data generated by microarray technology contains a lot of noise, a dynamic threshold function [33] is used to filter the gene expression matrix. Equations (7–10) are the mean, variance, volatility function, and dynamic threshold function of gene expression data, respectively, and are defined as follows:
| 7 |
| 8 |
| 9 |
| 10 |
where n is the total number of samples of the matrix E; is the expression value of gene u on the sample t; is the standard deviation of gene u. Subsequently, the gene expression matrix E is filtered according to the calculated dynamic threshold function, as follows:
| 11 |
where is the filtered gene expression matrix, and the genes set and samples set represented by its row and column are consistent with those in E. For each gene in matrix (or each miRNA in matrix Mi), the Bhattacharyya distance [34] is used to calculate their differential expression scores:
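$$DE(u) = \frac{1}{4}\ln\left(\frac{1}{4}\left(\frac{\sigma_N^2}{\sigma_C^2} + \frac{\sigma_C^2}{\sigma_N^2} + 2\right)\right) + \frac{1}{4}\,\frac{(\mu_N - \mu_C)^2}{\sigma_N^2 + \sigma_C^2} \tag{12}$$
where DE(u) is the differential expression score; for the filtered expression matrix (or Mi), $\mu_N$ and $\sigma_N$ are the mean and standard deviation of the gene (or miRNA) in the normal samples, and $\mu_C$ and $\sigma_C$ are the mean and standard deviation of the gene (or miRNA) in the cancer samples.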
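The filtering threshold and the differential expression score can be illustrated with the short sketch below. Two points are assumptions rather than statements about the paper: the threshold uses the three-sigma form known from the dynamic PPI network literature (mean + 3·σ·(1 − 1/(1 + σ²))), which may differ from the exact volatility and threshold functions used here, and the sample standard deviation (n − 1 denominator) is used throughout. The Bhattacharyya distance is the standard closed form for two univariate normal distributions.

```python
import math
from statistics import mean, stdev


def dynamic_threshold(values):
    """Assumed three-sigma threshold: mu + 3 * sigma * (1 - 1 / (1 + sigma^2))."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1); an assumption
    volatility = 1.0 / (1.0 + sigma ** 2)
    return mu + 3.0 * sigma * (1.0 - volatility)


def bhattacharyya_distance(normal, cancer):
    """Bhattacharyya distance between two univariate normal fits.

    Both groups need non-zero variance, otherwise the ratio terms are undefined.
    """
    mu_n, mu_c = mean(normal), mean(cancer)
    var_n, var_c = stdev(normal) ** 2, stdev(cancer) ** 2
    spread = 0.25 * math.log(0.25 * (var_n / var_c + var_c / var_n + 2.0))
    shift = 0.25 * (mu_n - mu_c) ** 2 / (var_n + var_c)
    return spread + shift


# Toy expression values (invented) for one gene in normal and cancer samples.
normal_expr = [1.0, 2.0, 3.0]
cancer_expr = [4.0, 5.0, 6.0]
de_score = bhattacharyya_distance(normal_expr, cancer_expr)
threshold = dynamic_threshold(normal_expr)
```

With equal variances the first term of the distance vanishes, so the toy score is driven entirely by the shift in means between the two groups.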
Studies have shown that a gene can be regulated by multiple miRNAs. Therefore, for each gene in the filtered expression matrix, the bipartite graph BG is used to represent the mapping between genes and miRNAs, and Mi_DE records the sum of the differential expression scores of the miRNAs linked to the gene. The Mi_DE value of gene u is defined as follows:
$$Mi\_DE(u) = \sum_{v \in V_{mi},\ (u,v) \in E_B} DE(v) \tag{13}$$
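A minimal sketch of this aggregation, assuming the bipartite graph is stored as a mapping from each gene to the set of miRNAs linked to it. The identifiers and scores are invented for illustration, and miRNAs without a score are treated as contributing 0, which is an assumption.

```python
def mirna_de_score(gene_to_mirnas, mirna_de):
    """Mi_DE of a gene: sum of the DE scores of the miRNAs linked to it."""
    return {
        gene: sum((mirna_de.get(m, 0.0) for m in mirnas), 0.0)
        for gene, mirnas in gene_to_mirnas.items()
    }


# Toy bipartite graph and miRNA scores (hypothetical identifiers and values).
gene_to_mirnas = {"G1": {"miR-a", "miR-b"}, "G2": {"miR-b"}, "G3": set()}
mirna_de = {"miR-a": 0.5, "miR-b": 0.25}
mi_de = mirna_de_score(gene_to_mirnas, mirna_de)
```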
The hierarchical weak consensus model
Through experimental analysis, a phenomenon among features is found that most of the feature scores of gene u are higher than the corresponding feature scores of gene v, but it is not excluded that a small number of feature scores are slightly smaller than the corresponding feature scores of gene v, and gene u is more likely to be a driver gene. In this paper, it is said that the genes u and v meet the weak consensus, that is, u weakly dominates v, and u forms a weak dominant relationship with v, which is expressed as u v. The conditions for weak consensus are as follows:
where u and v represent genes, respectively; represents the score of the ith feature; λ is the fault tolerance factor, which is a smaller positive number.
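One possible reading of the weak consensus idea is sketched below. The exact conditions are not reproduced here, so the rule in this sketch is an assumption: u weakly dominates v when every feature score of u is at least the corresponding score of v minus the tolerance λ, and at least one feature score of u is strictly higher. The function name and the toy scores are hypothetical.

```python
def weakly_dominates(scores_u, scores_v, lam):
    """Assumed weak-dominance test between two genes.

    scores_u and scores_v are equal-length sequences of feature scores;
    lam is the fault tolerance factor (a small positive number).
    """
    pairs = list(zip(scores_u, scores_v))
    within_tolerance = all(a >= b - lam for a, b in pairs)  # no large shortfall
    strictly_higher = any(a > b for a, b in pairs)          # wins somewhere
    return within_tolerance and strictly_higher


# Toy feature scores (invented): u wins clearly on feature 1 and
# trails by 0.05 on feature 2, which is inside the tolerance of 0.1.
u_dominates = weakly_dominates((0.9, 0.45), (0.5, 0.5), lam=0.1)
```

The tolerance is what distinguishes this from ordinary Pareto dominance: a gene is not rejected merely because one of its scores is marginally lower.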
According to the weak dominant relationship among genes, a multi-feature weak consensus framework, the hierarchical weak consensus model, is proposed to integrate multiple features. The process of the model is as follows. First, the feature scores DE, Mi_DE and CMCC are extracted from the multi-omics data, and the mean is used to measure the central tendency of each feature score; the higher the mean, the greater the fusion priority, and in this paper CMCC has the lowest fusion priority. Second, the differential expression scores of mRNA and miRNA are fused to obtain a new feature score SN. Third, the feature score SN is combined with the feature score CMCC to obtain the final score FS, and the genes are then ranked. The larger the score of a gene, the higher its priority ranking. The specific process is as follows:
The feature scores DE, Mi_DE and CMCC are extracted from the multi-omics data in turn, and the mean of each feature is calculated; the higher the mean, the greater the priority. Because the mean of the CMCC feature score is the lowest for all three cancers in this paper, DE and Mi_DE are fused first.
Level 1. After determining the order of feature fusion, the differential expression scores of mRNA and miRNA are judged by weak consensus condition. If the following conditions hold, state u v:
where λ is the fault tolerance factor. As long as the weak dominant relationship between genes is determined, the score of gene u is calculated as:
| 14 |
where V represents the neighborhood of the node u. S(u) is then normalized by using:
| 15 |
where SN(u) is the result of normalization; max() is the maximum value function.
Level 2. If the feature scores SN(u) and CMCC(u) for the genes u and v satisfy the following conditions, then called u v:
where φ is the fault tolerance factor. The final score FS for gene u is then calculated using the following formula:
| 16 |
where α is an adjustable parameter in [0,1]; V represents the neighborhood of the node u. Here the parameters λ, φ, and α are set to 0.1, 0.001 and 0.8, respectively.
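The priority step that decides the fusion order can be sketched as follows: features are sorted by their mean score, and the two with the highest means are fused at Level 1 while the remaining one joins at Level 2. Only the ordering is shown; the scoring formulas of the two levels are not reproduced. The function name and the toy scores are assumptions made for the example.

```python
from statistics import mean


def fusion_order(features):
    """Return feature names sorted by mean score, highest priority first.

    features maps a feature name to a dict of gene -> score.
    """
    return sorted(
        features,
        key=lambda name: mean(list(features[name].values())),
        reverse=True,
    )


# Toy feature scores for two genes (invented values, for illustration only).
features = {
    "CMCC": {"G1": 0.1, "G2": 0.2},
    "DE": {"G1": 0.6, "G2": 0.8},
    "Mi_DE": {"G1": 0.5, "G2": 0.4},
}
order = fusion_order(features)
level1, level2 = order[:2], order[2:]
```

In this toy case CMCC has the lowest mean, so DE and Mi_DE are fused first, mirroring the order described for the three cancers.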
Results
In order to measure the performance of the HWC method, experiments are performed on breast cancer, lung cancer and prostate cancer data, and compared with seven methods, including DawnRank, IntDriver, CRNMF, Subdyquency, EntroRank, MinNetRank and DriverRWH.
Data sources
In this paper, somatic mutation data, gene expression data and miRNA data of breast cancer, lung adenocarcinoma and prostate cancer are downloaded from the TCGA database [35]. Among them, “Mutect2” workflow is used for somatic mutation data, “RNA-Seq” and “STAR-Counts” workflows are adopted for gene expression data, and “Isoform Expression Quantification” workflow is used for miRNA. The specific information is shown in Table 1, in which SM represents somatic mutation data, mRNA denotes gene expression data, miRNA represents miRNA data, N is normal sample data, and C represents cancer sample data. The PPI network is derived from HINT database 2012 [36] and contains 9859 genes and 40,705 edges. The interaction information of miRNA-gene is obtained from miRTarBase database [37], which has interactions between 1884 genes and 826 miRNAs. Two public databases are selected as the benchmark gene sets: COSMIC [38] and OMIM [39], which contain 723 and 1255 genes, respectively.
Table 1.
Data information for the three cancers
| Data | BRCA (breast): gene number | BRCA (breast): sample number | LUAD (lung): gene number | LUAD (lung): sample number | PRAD (prostate): gene number | PRAD (prostate): sample number |
|---|---|---|---|---|---|---|
| SM | 18,846 | 986 | 18,497 | 561 | 12,415 | 495 |
| mRNA(N) | 18,067 | 113 | 19,429 | 59 | 17,770 | 52 |
| mRNA(C) | 18,067 | 1,091 | 19,429 | 513 | 17,770 | 495 |
| miRNA(N) | 2,221 | 104 | 2,197 | 46 | 2,080 | 52 |
| miRNA(C) | 2,221 | 1,103 | 2,197 | 521 | 2,080 | 499 |
Evaluating indicator
Three widely used evaluation indicators are adopted to measure the performance of HWC and other methods.
Statistical evaluation index
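In the experiments, the top k genes are selected as candidate driver genes, and Precision, Recall and Fscore are then used to evaluate these methods. The formulas for Precision, Recall and Fscore are as follows, respectively:
$$Precision = \frac{TP}{TP + FP} \tag{17}$$
$$Recall = \frac{TP}{TP + FN} \tag{18}$$
$$Fscore = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{19}$$
where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively.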
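A short sketch of how these three indices can be computed for the top k genes of a ranking against a benchmark gene set. The function name and the toy ranking are hypothetical; benchmark genes that do not appear among the top k are counted as false negatives.

```python
def top_k_metrics(ranked_genes, benchmark, k):
    """Precision, recall and F-score of the top-k genes against a benchmark set."""
    top = set(ranked_genes[:k])
    tp = len(top & benchmark)   # benchmark genes recovered in the top k
    fp = len(top) - tp          # top-k genes outside the benchmark
    fn = len(benchmark) - tp    # benchmark genes missed by the top k
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


# Toy ranking and benchmark (invented gene labels).
ranked = ["G1", "G2", "G3", "G4", "G5"]
benchmark = {"G1", "G3", "G6", "G7"}
precision, recall, fscore = top_k_metrics(ranked, benchmark, k=4)
```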
Function consistency
In order to assess the biological function of the identified genes, functional consistency is used for analysis. Functional consistency FC is defined as follows:
| 20 |
where F(B) records the functions of the reference gene set B, and F(A) records the functions of the predicted gene set A. In this paper, Gene Ontology (GO) [40], the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [41] and the Reactome database [42] are used to obtain GO terms and pathway information of genes. The GO and KEGG information can be obtained using the clusterProfiler R package [43], and the pathway information of the Reactome database can be obtained using the ReactomePA R package [44].
The partial area under ROC curve
In this paper, the partial area under ROC curve (PAUC) is used to measure the gene identification performance of various methods under low false positive rate [24]. A positive gene is a known cancer gene, and a negative gene is a gene that is not in the positive gene set. It calculates a PAUC value from 1 to k by identifying the number of true positive genes with scores higher than the kth highest negative gene score, as follows:
$$PAUC_k = \frac{1}{kT}\sum_{i=1}^{k} T_i \tag{21}$$
where T is the number of known cancer genes and $T_i$ is the number of true positives whose score is higher than the score of the i-th highest negative gene [24].
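The PAUC computation can be sketched as below. The normalization by k·T follows the usual ROC-n style definition and is an assumption here, as is the requirement that at least k negative genes are scored. The function name and the toy scores are hypothetical.

```python
def partial_auc(scores, positives, k):
    """Partial AUC up to the k-th highest-scoring negative gene.

    For each of the k top negatives, count the positives scoring above it,
    then normalise the sum by k * T (T = number of known cancer genes).
    Assumes at least k negative genes are present in scores.
    """
    pos_scores = [s for g, s in scores.items() if g in positives]
    neg_scores = sorted(
        (s for g, s in scores.items() if g not in positives), reverse=True
    )
    total_pos = len(positives)
    hits = sum(
        sum(1 for p in pos_scores if p > n)  # T_i for the i-th highest negative
        for n in neg_scores[:k]
    )
    return hits / (k * total_pos)


# Toy scores (invented): G1, G3 and G5 are the known cancer genes.
scores = {"G1": 0.9, "G2": 0.8, "G3": 0.7, "G4": 0.6, "G5": 0.5, "G6": 0.4}
pauc = partial_auc(scores, {"G1", "G3", "G5"}, k=2)
```

Here one positive outranks the best negative and two outrank the second-best negative, giving 3 hits out of a possible 6.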
Parameters analysis
The parameters λ and φ are fault tolerance factors. Their values are taken from {0.1, 0.2, …, 1} and {0.001, 0.002, …, 0.01}, respectively. Parameter α is an adjustable parameter in [0,1]. Table 2 shows the area under the precision-recall curve (AUPRC) for the top 300 genes on breast cancer (the benchmark database is COSMIC). The larger the AUPRC value, the better the performance of the method. As can be seen from Table 2, HWC has the best identification results when α = 0.8. Therefore, the optimal value of α is set to 0.8. Table 3 shows the results for different values of λ and φ on breast cancer, to analyze their effects on the performance of HWC (the benchmark database is COSMIC). It can be seen that the AUPRC is highest when λ/φ is 0.1/0.001, so the optimal values of λ/φ are set to 0.1/0.001.
Table 2.
Impact of parameter α on the performance of HWC on breast(the benchmark database is COSMIC)
| α | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AUPRC | 0.3366 | 0.3495 | 0.3596 | 0.3679 | 0.3791 | 0.3882 | 0.395 | 0.4012 | **0.4038** | 0.3985 | 0.3862 |
Bold values indicate the optimal value or optimal parameters
Table 3.
Impact of parameter λ and φ on the performance of HWC on breast(the benchmark database is COSMIC)
| λ/φ | 0.1/0.001 | 0.2/0.002 | 0.3/0.003 | 0.4/0.004 | 0.5/0.005 | 0.6/0.006 | 0.7/0.007 | 0.8/0.008 | 0.9/0.009 | 1/0.01 |
|---|---|---|---|---|---|---|---|---|---|---|
| AUPRC | **0.5204** | 0.519 | 0.5185 | 0.5177 | 0.5173 | 0.517 | 0.5156 | 0.5153 | 0.515 | 0.515 |
Bold values indicate the optimal value or optimal parameters
Breast cancer results
Figure 2 shows the comparison results of the top k genes identified by the eight methods on Precision, Recall and Fscore. In this paper, parameter k is set to 50, 100, 150, 200, 250 and 300. As shown in Fig. 2a–c, when the benchmark database is COSMIC, the HWC method performs better overall than the other seven methods. For the top 50 to 300 candidate driver genes, the precision values of the HWC method are 0.7, 0.55, 0.46, 0.43, 0.39 and 0.36, respectively, while those of EntroRank are 0.56, 0.51, 0.47, 0.4, 0.36 and 0.33, respectively. HWC is only slightly inferior to the EntroRank method at k = 150, and has a higher precision than EntroRank for the other five k values. Fig. 2d–f shows the comparison results based on the OMIM database. For all k values, the HWC method has the best performance, with a large improvement. Moreover, some of the top 300 genes identified by HWC that are closely associated with breast cancer are ranked high by HWC (the top 300 genes identified by HWC are shown in Supplementary Data Table S1) but are ranked lower or ignored by other methods, such as BRCA1, ABL1 and TGFBR1. BRCA1 is an important tumor suppressor gene. If mutated, it is unable to regulate the cell cycle, repair damaged DNA, etc., thus leading to the generation of malignant tumor cells [45]. BRCA1 is ranked 7th by the HWC method and 567th, 268th and 15th by the MinNetRank, Subdyquency and DriverRWH methods, respectively. ABL1 can lead to the production of invasive breast cancer cells [46] and is ranked 16th by the HWC method. In the DawnRank, IntDriver, CRNMF, Subdyquency, EntroRank, MinNetRank and DriverRWH methods, its rankings are 777th, 486th, 496th, 2285th, 374th, 211th and 240th, respectively. TGFBR1 ranks 29th in HWC, and 1071st, 1976th, 145th, 245th, 5757th and 1832nd in EntroRank, Subdyquency, DriverRWH, DawnRank, IntDriver and CRNMF, respectively.
It has been reported that the TGFBR1 gene can affect the molecular mechanism of breast cancer, thus affecting the evaluation of breast cancer [47]. In order to further analyze the biological functions of BRCA1, ABL1 and TGFBR1 (the data for BRCA1, ABL1 and TGFBR1 are retrieved from the Gendoma web server (https://ai.citexs.com)), gene regulation analysis is conducted from multiple dimensions in this paper. Figure 3 shows the gene regulatory networks of BRCA1, ABL1 and TGFBR1 with proteins, miRNAs, transcription factors, chemicals and drugs, which provide an in-depth understanding of the interactions between the target genes and other regulatory factors, and provide data support for cancer drug target research. It is also noted that some of the top 300 genes identified by HWC are not in the COSMIC and OMIM databases but have been shown in the literature to be associated with breast cancer cells, such as SPI, CDC42 and SHCI.
Fig. 2.

Comparison results of each method with Precision, Recall, and Fscore based on BRCA data set. Where the benchmark dataset of a–c is COSMIC, and the benchmark dataset of d–f is OMIM
Fig. 3.

Gene regulatory network of BRCA1, ABL1 and TGFBR1
Figure 4 shows the results of the functional consistency comparison of these methods in the two databases, where the k values are set to 50, 100, 150, 200, 250, 300. From Fig. 4a–c, it can be seen that the HWC method has the highest FC score on GO, KEGG, and REACTOME on the COSMIC data set. As shown in Fig. 4d and f, the HWC method has the best performance on both GO and REACTOME based on the OMIM database. In Fig. 4e, when k = 50, 150, HWC’s FC score is slightly lower than DawnRank, but higher at other k values, especially when k = 300, FC score of HWC is 0.82, while FC score of DawnRank is 0.74. For the top 30 GO terms or pathways enriched by the top 300 genes identified by the HWC method, see Supplementary Data Tables S2 to S4.
Fig. 4.

Comparative functional consistency results for each method based on the BRCA dataset. Where the benchmark dataset of a–c is COSMIC, and the benchmark dataset of d–f is OMIM
Lung cancer results
The Precision, Recall, and Fscore comparison results of these methods on lung cancer data are shown in Fig. 5, where k ranges from 50 to 300. From Fig. 5a–f, it can be seen that, based on both the COSMIC and OMIM data sets, the HWC method has the best performance compared with the other 7 methods for all k values, with a large improvement. Taking k = 300 based on the COSMIC database as an example, the precision value of the HWC method is 0.36, while the precision values of EntroRank, MinNetRank, Subdyquency, DriverRWH, DawnRank, IntDriver and CRNMF are 0.28, 0.18, 0.21, 0.2, 0.19, 0.14 and 0.17, respectively. It is also found that among the top 300 genes identified by the HWC method (the top 300 genes identified by HWC are shown in Supplementary Data Table S5), some genes that are important for the formation and development of lung cancer are ranked lower or ignored by other methods, such as ABCB1, ESR1 and IRS1. Researchers have found that inhibition of the ABCB1 gene plays an important role in the treatment of lung cancer [48]. The expression of ESR1 affects many biological pathways and cell functions related to lung cancer in the human body, thus promoting the formation of lung cancer cells [49]. It has been pointed out that IRS1 plays an important role in the formation and development of lung cancer [50]. ABCB1, ESR1 and IRS1 are ranked 9, 21 and 57 in HWC; 899, 746 and 128 in EntroRank; 459, NA and NA in DawnRank; 356, NA and NA in DriverRWH; 165, 1773 and 994 in IntDriver; 2789, 137 and 2385 in MinNetRank; 211, 1042 and 815 in CRNMF; and NA, 614 and 326 in Subdyquency, respectively. NA denotes that the gene is not found by the method. As shown in Fig. 6, this paper conducts multidimensional gene regulation analysis on ABCB1, ESR1 and IRS1, demonstrating the interactions between regulators at the molecular level, which helps to understand the molecular mechanism of cancer generation and development.
In addition, it is noted that some of the top 300 genes identified by the HWC method are highly associated with lung cancer but are not included in COSMIC and OMIM, such as HDAC4, VIM and MEOX2.
Fig. 5.

Comparison results of each method with Precision, Recall, and Fscore based on LUAD data set. Where the benchmark dataset of a–c is COSMIC, and the benchmark dataset of d–f is OMIM
Fig. 6.

Gene regulatory network of ABCB1, ESR1 and IRS1
Figure 7 shows the results of the functional consistency comparisons across these methods. From Fig. 7a–c, it can be seen that on the COSMIC database, HWC has the best performance on GO and REACTOME compared with the other methods, with a large improvement. The HWC method is only slightly lower than the MinNetRank method in the KEGG pathway analysis at k = 50, and has the highest FC value at all the other k values. In Fig. 7d–f, based on the OMIM database, it can be seen that the HWC method has a clear advantage on GO and REACTOME; in the KEGG pathway analysis in Fig. 7e it is only slightly inferior to the DawnRank method, and is still significantly higher than the other 6 methods. Taking Fig. 7a as an example, when k = 300, the FC value of HWC is 0.58, while the FC scores of the EntroRank, MinNetRank, Subdyquency, DriverRWH, DawnRank, IntDriver and CRNMF methods are 0.40, 0.37, 0.31, 0.23, 0.41, 0.07 and 0.09, respectively. For the top 30 GO terms or pathways enriched by the top 300 genes identified by the HWC method, see Supplementary Data Tables S6 to S8.
Fig. 7.

Comparative functional consistency results for each method based on the LUAD dataset. Where the benchmark dataset of a–c is COSMIC, and the benchmark dataset of d–f is OMIM
Prostate cancer results
The accuracy comparison results of the eight methods on prostate cancer are shown in Fig. 8. It can be seen that, whatever k value is taken and for both the COSMIC and OMIM databases, HWC has the best performance compared with the other seven methods, with a large improvement. Consistent with the above analysis, in the PRAD data set, among the top 300 genes identified by the HWC method (the top 300 genes identified by HWC are shown in Supplementary Data Table S9), some genes related to prostate cancer are ranked lower or ignored by other methods, such as TRAF2, IGF1R and CREB1. It has been reported that TRAF2, as an anti-apoptotic signal, can inhibit the growth of prostate cancer cells, and that its mutation increases the incidence of prostate cancer [51]. TRAF2 is ranked 100th by the HWC method and 3754th, 1531st, 235th, 343rd, 459th, 5539th and 312th by EntroRank, MinNetRank, Subdyquency, DriverRWH, CRNMF, IntDriver and DawnRank, respectively. IGF1R is ranked 18th, 184th, 543rd, 393rd, 203rd and 397th by the HWC, EntroRank, Subdyquency, CRNMF, IntDriver and DawnRank methods, respectively. It has been reported that overexpression of IGF1R promotes cell proliferation and neoplastic transformation, which increases the risk of prostate cancer [52]. Some articles indicate that the combination of CREB1 and the transcription factor FoxA1 can predict prostate cancer recurrence [53]; CREB1 is ranked 48th by the HWC method and 2319th, 2278th, 3279th, 6054th and 739th by MinNetRank, DriverRWH, CRNMF, IntDriver and DawnRank, respectively, and is not identified by EntroRank and Subdyquency. As shown in Fig. 9, the gene regulatory networks of TRAF2, IGF1R and CREB1 can help to understand the regulatory mechanisms of gene expression during cell differentiation and development, and reveal the interactions among cancer-related genes.
And gene regulatory networks can also be used to predict the effects of gene mutations on cell function, providing theoretical support for gene therapy. In addition, this paper also notes that some of the top 300 genes identified by the HWC method have been confirmed to be associated with the prostate gland, but are not in the COSMIC and OMIM databases, such as SPRY2, SP1, CREB5, etc.
Fig. 8.

Comparison of Precision, Recall and Fscore for each method on the PRAD data set. Panels a–c use COSMIC as the benchmark data set, and panels d–f use OMIM
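As an illustration of the quantities plotted in Fig. 8, the following minimal Python sketch computes Precision, Recall and Fscore for the top k genes of a ranked list. It assumes the standard set-based definitions (the exact formulas used in this paper are given in its evaluation section and are not reproduced here). The function name, the gene labels and the toy benchmark set are hypothetical and are not taken from the HWC source code.

```python
def precision_recall_fscore(ranked_genes, benchmark, k):
    """Score the top-k genes of a ranking against a benchmark gene set.

    ranked_genes: list of gene labels, best-ranked first (assumed unique).
    benchmark:    set of known driver genes (e.g. from COSMIC or OMIM).
    """
    top_k = ranked_genes[:k]
    hits = sum(1 for gene in top_k if gene in benchmark)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(benchmark) if benchmark else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fscore


# Toy data with hypothetical gene labels (not real results).
toy_ranking = ["G1", "G2", "G3", "G4", "G5", "G6"]
toy_benchmark = {"G1", "G3", "G5", "G9"}

p, r, f = precision_recall_fscore(toy_ranking, toy_benchmark, k=4)
print(f"precision={p:.2f} recall={r:.2f} fscore={f:.2f}")
```

With this toy input, two of the top four genes are in the benchmark set, so all three scores equal 0.5.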
Fig. 9.

Gene regulatory network of TRAF2, IGF1R, CREB1
Figure 10 shows the functional consistency comparison on the PRAD data set. As can be seen from Fig. 10a–f, HWC performs best in the GO term analysis on both the COSMIC and OMIM databases. For the KEGG and REACTOME pathway analyses, HWC is slightly worse only when k = 50 and has the best FC score for all other k values. The top 30 GO terms or pathways enriched in the top 300 genes identified by HWC are listed in Supplementary Data Tables S10 to S12.
Fig. 10.

Comparison of functional consistency results for each method on the PRAD data set. Panels a–c use COSMIC as the benchmark data set, and panels d–f use OMIM
The PAUC comparison results of the three cancer types
Figure 11 shows the PAUC comparison of these methods across the three cancer types, where n is set to 50, 100, 150, 200, 250 and 300. As shown in Fig. 11a–f, HWC performs best for all values of n. Taking Fig. 11b as an example, when n = 300 the PAUC value of HWC is 0.11, whereas the PAUC values of EntroRank, MinNetRank, Subdyquency, DriverRWH, DawnRank, IntDriver and CRNMF are 0.09, 0.03, 0.06, 0.05, 0.05, 0.04 and 0.04, respectively. In conclusion, HWC performs better and identifies more driver genes than the other seven methods.
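To make the PAUC comparison more concrete, the sketch below shows one common way to compute a partial area under the ROC curve restricted to the top n ranked genes. It is an assumption-laden illustration, not the exact procedure of this paper: the precise PAUC definition and any normalization are given in the evaluation section and are not reproduced here, and the function name and toy gene labels are hypothetical.

```python
def partial_auc(ranked_genes, benchmark, n):
    """Un-normalized area under the ROC step curve traced by the top-n genes.

    Assumptions (not taken from the paper): every gene in ranked_genes is a
    candidate, benchmark genes are positives, all other genes are negatives,
    and gene labels are unique.
    """
    positives = sum(1 for gene in ranked_genes if gene in benchmark)
    negatives = len(ranked_genes) - positives
    if positives == 0 or negatives == 0:
        return 0.0

    true_pos = 0
    area = 0.0
    for gene in ranked_genes[:n]:
        if gene in benchmark:
            # A positive moves the curve up by 1 / positives.
            true_pos += 1
        else:
            # A negative moves the curve right by 1 / negatives;
            # add the rectangle under the current true-positive rate.
            area += (true_pos / positives) * (1.0 / negatives)
    return area


# Toy data with hypothetical gene labels (not real results).
toy_ranking = ["G1", "G2", "G3", "G4", "G5", "G6"]
toy_benchmark = {"G1", "G3"}

pauc_top4 = partial_auc(toy_ranking, toy_benchmark, n=4)
auc_full = partial_auc(toy_ranking, toy_benchmark, n=len(toy_ranking))
print(pauc_top4, auc_full)
```

When n covers the whole ranking, this quantity reduces to the ordinary AUC; for smaller n it only rewards benchmark genes placed near the top of the list, which is the behaviour the PAUC comparison is meant to capture.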
Fig. 11.

PAUC comparison results of the eight methods on the data of the three cancer types. Panels a–c use COSMIC as the benchmark data set, and panels d–f use OMIM
Analysis of the recognition ability of HWC
Table 4 shows the identification ability of these methods on the three cancer data sets when the benchmark database is OMIM. HWC finds more driver genes than every other method at all six top levels. For example, among the top 300 genes for BRCA, HWC identifies 147 benchmark genes, compared with 121 for the next-best method, DawnRank.
Table 4.
Identification ability analysis of these methods on the three cancers
| Cancer | Top level | CRNMF | IntDriver | DawnRank | DriverRWH | Subdyquency | MinNetRank | EntroRank | HWC |
|---|---|---|---|---|---|---|---|---|---|
| BRCA | TOP50 | 17 | 15 | 33 | 20 | 26 | 15 | 25 | **37** |
| BRCA | TOP100 | 26 | 19 | 52 | 34 | 42 | 33 | 40 | **74** |
| BRCA | TOP150 | 37 | 26 | 72 | 40 | 62 | 51 | 56 | **93** |
| BRCA | TOP200 | 43 | 33 | 87 | 48 | 73 | 63 | 68 | **110** |
| BRCA | TOP250 | 50 | 40 | 103 | 55 | 91 | 83 | 82 | **130** |
| BRCA | TOP300 | 59 | 43 | 121 | 63 | 101 | 95 | 95 | **147** |
| LUAD | TOP50 | 6 | 5 | 23 | 10 | 15 | 16 | 23 | **28** |
| LUAD | TOP100 | 8 | 6 | 44 | 24 | 27 | 32 | 44 | **52** |
| LUAD | TOP150 | 16 | 13 | 60 | 33 | 40 | 51 | 60 | **74** |
| LUAD | TOP200 | 23 | 17 | 77 | 44 | 49 | 66 | 72 | **93** |
| LUAD | TOP250 | 28 | 21 | 94 | 50 | 52 | 78 | 83 | **108** |
| LUAD | TOP300 | 35 | 29 | 110 | 60 | 63 | 90 | 92 | **133** |
| PRAD | TOP50 | 11 | 9 | 22 | 19 | 16 | 13 | 25 | **35** |
| PRAD | TOP100 | 16 | 15 | 38 | 29 | 25 | 29 | 36 | **65** |
| PRAD | TOP150 | 25 | 19 | 55 | 44 | 39 | 49 | 52 | **93** |
| PRAD | TOP200 | 32 | 22 | 80 | 57 | 55 | 62 | 66 | **108** |
| PRAD | TOP250 | 36 | 29 | 96 | 63 | 60 | 76 | 79 | **122** |
| PRAD | TOP300 | 44 | 37 | 114 | 72 | 72 | 86 | 86 | **142** |
Bold value indicates optimal value or optimal parameters
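The entries of Table 4 appear to be counts of benchmark genes recovered at each top level. A minimal Python sketch of such a count is given below, under the assumption that a method outputs a ranked list of unique gene labels and that the benchmark is a plain set. The function name and the synthetic data are hypothetical and do not come from the HWC source code.

```python
def hits_at_levels(ranked_genes, benchmark, levels=(50, 100, 150, 200, 250, 300)):
    """Count how many benchmark genes fall within the top-k genes for each k."""
    return {k: len(set(ranked_genes[:k]) & benchmark) for k in levels}


# Synthetic ranking of 300 hypothetical genes; every third one is treated
# as a benchmark gene purely for illustration.
toy_ranking = [f"G{i}" for i in range(300)]
toy_benchmark = {f"G{i}" for i in range(0, 300, 3)}

counts = hits_at_levels(toy_ranking, toy_benchmark)
for level, hits in counts.items():
    print(f"TOP{level}: {hits}")
```

Running the same function on the ranked output of each method, with the same benchmark set, would give one row of such a table per cancer type and top level.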
Conclusion and discussion
Fusing multiple features has improved the recognition rate of driver genes to some extent, but current research rarely considers the connection among features. After analyzing a large number of methods based on multi-omics data fusion, the following phenomenon is observed: when most of the feature scores of gene A are higher than the corresponding feature scores of gene B, gene A is more likely to be the driver gene, even if a small number of its feature scores are slightly lower than those of gene B. Based on this phenomenon, a multi-feature weak consensus framework, the hierarchical weak consensus model, is proposed. In this paper, the features of the co-mutation hypergraph network and the ECC network are fused into a new feature, CMCC. Experiments show that CMCC has good recognition ability. The hierarchical weak consensus model is then used to fuse CMCC with the mRNA and miRNA differential expression scores, which yields the HWC method. The main contributions of the HWC method are as follows:
- A. In order to preserve the mutation information among genes and their topological properties in the PPI network, a new network feature, CMCC, is designed.
- B. The dynamic threshold function is used to denoise the gene expression data.
- C. The hierarchical weak consensus model is used for feature fusion.
- D. Compared with the seven state-of-the-art methods, the HWC method has the best identification performance in statistical evaluation index, functional consistency, and PAUC.
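The weak consensus phenomenon described above (gene A scores higher than gene B on most features and is at most slightly lower on the rest) can be illustrated with a small pairwise check. This is only a toy sketch of the observation, not the hierarchical weak consensus model itself. The function name, the `min_wins` and `tolerance` parameters and all score values are hypothetical assumptions introduced for illustration.

```python
def weakly_dominates(scores_a, scores_b, min_wins=2, tolerance=0.05):
    """Return True if gene A weakly dominates gene B.

    A weakly dominates B when A is strictly higher on at least `min_wins`
    features and is never lower than B by more than `tolerance`.
    Both parameters are illustrative assumptions, not values from the paper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score vectors must have the same length")
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    # Largest amount by which A falls short of B on any single feature.
    worst_deficit = max((b - a for a, b in zip(scores_a, scores_b)), default=0.0)
    return wins >= min_wins and worst_deficit <= tolerance


# Hypothetical normalized scores in the order (CMCC, mRNA score, miRNA score).
gene_a = (0.80, 0.60, 0.41)
gene_b = (0.50, 0.40, 0.44)
gene_c = (0.90, 0.70, 0.10)

print(weakly_dominates(gene_a, gene_b))  # A wins on two features, small deficit
print(weakly_dominates(gene_b, gene_a))  # B wins on only one feature
print(weakly_dominates(gene_c, gene_b))  # C wins on two features, large deficit
```

In this toy setting gene A is preferred over gene B despite its slightly lower third score, whereas gene C is not, because its deficit on one feature is too large to count as "slightly smaller".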
Regarding the specificity of HWC, the genes that are ranked in the top 300 by HWC but not by the other methods were analyzed, and some of them are important for the survival and reproduction of cancer cells.
Of course, the method in this paper also has the following limitations: (1) the hierarchical weak consensus model requires preliminary experiments to determine its parameters before feature fusion; (2) the influence of noisy data in the PPI network is not considered in the construction of the hypergraph.
In future work, we will consider using the KNN algorithm to construct hypergraphs to improve the accuracy of hyperedge segmentation. Furthermore, we intend to use an evolutionary algorithm to determine the parameters of the hierarchical weak consensus model. The hierarchical weak consensus model can also be applied to other fields based on multi-feature fusion, such as essential protein identification.
Supplementary Information
Below is the link to the electronic supplementary material.
Funding
This research is supported by the National Natural Science Foundation of China (No. 61972185, No. 62141207, No. 62302107, No. 62366007), the Guangxi Natural Science Foundation (No. 2022GXNSFAA035625), the Natural Science Foundation of Yunnan Province of China (No. 2019FA024), the Research Fund of Guangxi Key Lab of Multi-source Information Mining & Security (No. 20-A-01-03, 19-A-03-01), the Guangxi Normal University Science Research Project (Natural Science) (No. 2021JC008), the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, and the Innovation Project of Guangxi Graduate Education (YCSW2023180).
Data availability
The source code can be obtained at https://github.com/Mrhuhappy/HWC.git.
Declarations
Conflict of interest
The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Jingli Wu, Email: wjlhappy@mailbox.gxnu.edu.cn.
Wei Peng, Email: weipeng1980@gmail.com.
References
- 1. Vandin F, Upfal E, Raphael BJ. De novo discovery of mutated driver pathways in cancer. Genome Res. 2011.
- 2. McLendon R, et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455(7216):1061–8.
- 3. Bobrow M, Zhao S. International network of cancer genome projects. Nature. 2010;464(7291):993–8.
- 4. Peng J, Xue H, Shao Y, Shang X, Wang Y, Chen J. A novel method to measure the semantic similarity of HPO terms. Int J Data Min Bioinform. 2017;17(2):173–88.
- 5. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719–24.
- 6. Bashashati A, et al. DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biol. 2012;13(12):1–14.
- 7. Shi K, Gao L, Wang B. Discovering potential cancer driver genes by an integrated network-based approach. Mol BioSyst. 2016;12(9):2921–31.
- 8. Tian R, Basu MK, Capriotti E. ContrastRank: a new method for ranking putative cancer driver genes and classification of tumor samples. Bioinformatics. 2014;30(17):i572–8.
- 9. Dees ND, et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res. 2012;22(8):1589–98.
- 10. Lawrence MS, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–8.
- 11. Ding L, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455(7216):1069–75.
- 12. Pon JR, Marra MA. Driver and passenger mutations in cancer. Annu Rev Pathol. 2015;10:25–50.
- 13. Wendl MC, et al. PathScan: a tool for discerning mutational significance in groups of putative cancer genes. Bioinformatics. 2011;27(12):1595–602.
- 14. Youn A, Simon R. Identifying cancer driver genes in tumor genome sequencing studies. Bioinformatics. 2011;27(2):175–81.
- 15. Gatza ML, Silva GO, Parker JS, Fan C, Perou CM. An integrated genomics approach identifies drivers of proliferation in luminal-subtype human breast cancer. Nat Genet. 2014;46(10):1051–9.
- 16. Dimitrakopoulos CM, Beerenwinkel N. Computational approaches for the identification of cancer genes and pathways. Wiley Interdiscip Rev. 2017;9(1):e1364.
- 17. Martincorena I, et al. Universal patterns of selection in cancer and somatic tissues. Cell. 2017;171(5):1029–41.
- 18. Torti D, Trusolino L. Oncogene addiction as a foundational rationale for targeted anti-cancer therapy: promises and perils. EMBO Mol Med. 2011;3(11):623–36.
- 19. Hahn WC, Weinberg RA. Modelling the molecular circuitry of cancer. Nat Rev Cancer. 2002;2(5):331–41.
- 20. Hahn WC, Counter CM, Lundberg AS, Beijersbergen RL, Brooks MW, Weinberg RA. Creation of human tumour cells with defined genetic elements. Nature. 1999;400(6743):464–8.
- 21. Hou P, Ma J. DawnRank: discovering personalized driver genes in cancer. Genome Med. 2014;6:1–16.
- 22. Xi J, Wang M, Li A. Discovering mutated driver genes through a robust and sparse co-regularized matrix factorization framework with prior information from mRNA expression patterns and interaction network. BMC Bioinform. 2018;19(1):1–14.
- 23. Xi J, Wang M, Li A. Discovering potential driver genes through an integrated model of somatic mutation profiles and gene functional information. Mol BioSyst. 2017;13(10):2135–44.
- 24. Dimitrakopoulos C, et al. Network-based integration of multi-omics data for prioritizing cancer genes. Bioinformatics. 2018;34(14):2441–8.
- 25. Song J, Peng W, Wang F. A random walk-based method to identify driver genes by integrating the subcellular localization and variation frequency into bipartite graph. BMC Bioinform. 2019;20(1):1–17.
- 26. Song J, Peng W, Wang F. An entropy-based method for identifying mutual exclusive driver genes in cancer. IEEE/ACM Trans Comput Biol Bioinform. 2019;17(3):758–68.
- 27. Wei T, Fa B, Luo C, Johnston L, Zhang Y, Yu Z. An efficient and easy-to-use network-based integrative method of multi-omics data for cancer genes discovery. Front Genet. 2021;11:613033.
- 28. Wang C, Shi J, Cai J, Zhang Y, Zheng X, Zhang N. DriverRWH: discovering cancer driver genes by random walk on a gene mutation hypergraph. BMC Bioinform. 2022;23(1):1–19.
- 29. Choudhury Y, et al. Attenuated adenosine-to-inosine editing of microRNA-376a* promotes invasiveness of glioblastoma cells. J Clin Investig. 2012;122(11):4059–76.
- 30. Stahlhut C, Slack FJ. MicroRNAs and the cancer phenotype: profiling, signatures and clinical implications. Genome Med. 2013;5:1–12.
- 31. Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D. Defining and identifying communities in networks. Proc Natl Acad Sci. 2004;101(9):2658–63.
- 32. Li M, Zhang H, Wang J-X, Pan Y. A new essential protein discovery method based on the integration of protein-protein interaction and gene expression data. BMC Syst Biol. 2012;6:1–9.
- 33. Xiao Q, Wang J, Peng X, Wu F-X. Detecting protein complexes from active protein interaction networks constructed with dynamic gene expression profiles. Proteome Sci. 2013;11(1):1–8.
- 34. Bhattacharyya A. On a measure of divergence between two statistical populations defined by their probability distribution. Bull Calcutta Math Soc. 1943;35:99–110.
- 35. Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol/Współczesna Onkologia. 2015;2015(1):68–77.
- 36. Patil A, Nakamura H. HINT: a database of annotated protein-protein interactions and their homologs. Biophysics. 2005;1:21–4.
- 37. Huang H-Y, et al. miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database. Nucleic Acids Res. 2020;48(D1):D148–54.
- 38. Tate JG, et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 2019;47(D1):D941–7.
- 39. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005;33(Suppl 1):D514–7.
- 40. Ashburner M, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–9.
- 41. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45(D1):D353–61.
- 42. Fabregat A, et al. Reactome graph database: efficient access to complex pathway data. PLoS Comput Biol. 2018;14(1):e1005968.
- 43. Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics. 2012;16(5):284–7.
- 44. Yu G, He Q-Y. ReactomePA: an R/bioconductor package for reactome pathway analysis and visualization. Mol BioSyst. 2016;12(2):477–9.
- 45. Kuchenbaecker KB, et al. Risks of breast, ovarian, and contralateral breast cancer for BRCA1 and BRCA2 mutation carriers. JAMA. 2017;317(23):2402–16.
- 46. Wang J, Rouse C, Jasper JS, Pendergast AM. ABL kinases promote breast cancer osteolytic metastasis by modulating tumor-bone interactions through TAZ and STAT5 signaling. Sci Signal. 2016;9(413):ra12.
- 47. Moore-Smith L, Pasche B. TGFBR1 signaling and breast cancer. J Mammary Gland Biol Neoplasia. 2011;16:89–95.
- 48. Sugano T, et al. Inhibition of ABCB1 overcomes cancer stem cell–like properties and acquired resistance to MET inhibitors in non-small cell lung cancer. Mol Cancer Ther. 2015;14(11):2433–40.
- 49. Gao X, et al. Estrogen receptors promote NSCLC progression by modulating the membrane receptor signaling network: a systems biology perspective. J Transl Med. 2019;17:1–15.
- 50. Gorgisen G, et al. Identification of novel mutations of Insulin Receptor Substrate 1 (IRS1) in tumor samples of non-small cell lung cancer (NSCLC): implications for aberrant insulin signaling in development of cancer. Genet Mol Biol. 2019;42:15–25.
- 51. Wei B, et al. TRAF2 is a valuable prognostic biomarker in patients with prostate cancer. Med Sci Monit. 2017;23:4192.
- 52. Rochester MA, Riedemann J, Hellawell GO, Brewster SF, Macaulay VM. Silencing of the IGF1R gene enhances sensitivity to DNA-damaging agents in both PTEN wild-type and mutant human prostate cancer. Cancer Gene Ther. 2005;12(1):90–100.
- 53. Sunkel B, et al. Integrative analysis identifies targetable CREB1/FoxA1 transcriptional co-regulation as a predictor of prostate cancer recurrence. Nucleic Acids Res. 2016;44(9):4105–22.

