Abstract
Motivation
Tumor stratification has a wide range of biomedical and clinical applications, including diagnosis, prognosis and personalized treatment. However, cancer is typically driven by combinations of mutated genes that are highly heterogeneous across patients, which makes it challenging to accurately subdivide tumors into subtypes.
Results
We developed a network-embedding based stratification (NES) methodology to identify clinically relevant patient subtypes from large-scale patients’ somatic mutation profiles. The central hypothesis of NES is that two tumors would be classified into the same subtype if their somatically mutated genes are located in similar network regions of the human interactome. We encoded the genes on the human protein–protein interactome with a network embedding approach and constructed the patients’ vectors by integrating the somatic mutation profiles of 7344 tumor exomes across 15 cancer types. We first adopted the lightGBM classification algorithm to train on the patients’ vectors. The AUC value is around 0.89 in the prediction of the patient’s cancer type and around 0.78 in the prediction of the tumor stage within a specific cancer type. The high classification accuracy suggests that network embedding-based patients’ features are reliable for dividing the patients. We therefore clustered patients with a specific cancer type into several subtypes by applying an unsupervised clustering algorithm to the patients’ vectors. Among the 15 cancer types, the new patient clusters (subtypes) identified by NES are significantly correlated with patient survival in 12 cancer types. In summary, this study offers a powerful network-based deep learning methodology for personalized cancer medicine.
Availability and implementation
Source code and data can be downloaded from https://github.com/ChengF-Lab/NES.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Cancer is an abnormal lesion formed by a cell in a local tissue at the gene level due to carcinogenic factors (Hanahan et al., 2011). Tumor tissues and normal tissues differ in cell morphology and tissue structure: benign tumors present relatively low atypia and are usually similar to the normal tissues from which they are derived, while malignant tumors show relatively high atypia (Bedard et al., 2013; Gerlinger et al., 2012; Meacham and Morrison, 2013). Generally, cancer is not only complex, driven by a combination of mutated genes, but also heterogeneous, in that the combination of genes can differ considerably across patients, presenting a formidable challenge to cancer research. In recent years, tumor data have accumulated rapidly thanks to the development of next-generation sequencing technologies and several multi-center cancer exome/genome projects, such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) (International Cancer Genome Consortium et al., 2010; Weinstein et al., 2013), which bring new opportunities in cancer research. Artificial intelligence and machine learning benefit from these big data, and are commonly used to accelerate cancer research (Camacho et al., 2018; Eraslan et al., 2019; Topol, 2019), including cancer diagnosis, driver gene identification (Bailey et al., 2018; Tokheim et al., 2016), drug development (Cheng et al., 2019) and precision oncology (Azuaje, 2019; Nussinov et al., 2019).
Tumor stratification aims to subdivide tumors into clinically and biologically meaningful subtypes that would benefit precision oncology. For example, Liu et al. (2017) proposed an Entropy-based Consensus Clustering (ECC) method for patient stratification, which employed an entropy-based utility function to fuse many basic partitions into a consensus model. Importantly, many data-driven approaches have been proposed to classify tumors based on various clinical data. Examples include classifying skin cancer from imaging data with a deep neural network (Esteva et al., 2017), prognostic multigene classifiers for breast cancer based on gene expression profiles (Reis-Filho and Pusztai, 2011), molecular pattern identification for brain tumors with supervised machine-learning methods on genome-wide methylation data (Wong and Yip, 2018), and neural network-based prediction of the survival rate for various breast cancer subtypes from clinical information such as the tumor size and axillary nodal status (Lundin et al., 1999).
Recently, biomolecular network data have been widely adopted in cancer research (Cheng et al., 2015; Horn et al., 2018; Leiserson et al., 2015; Liu et al., 2020a), with cancer considered a network disease (Hu et al., 2016). Cancer somatic mutation profiles are highly coupled with the biomolecular network (Cheng et al., 2014a). For example, somatic mutations of a cancer driver gene may drive cancer genome evolution by inducing mutations in other genes (Cheng et al., 2015). Therefore, we can characterize each patient by its somatic mutation profile and use the similarity between patients for tumor stratification. When network information is integrated, two tumors may be very similar if their mutated genes are located in similar network regions (Hofree et al., 2013). Based on this hypothesis, Hofree et al. proposed network-based stratification (NBS), which applied network propagation methods to find cancer subtypes by clustering patients with mutations in similar network regions (Hofree et al., 2013). Zhang et al. proposed network-based supervised stratification 2 (NBS2), where tumors are classified by the supervised network propagation of the somatic mutation profiles (Zhang et al., 2018).
However, the network data structure is extremely high-dimensional, and propagation alone might not suffice to mine the information well. Thus, attention has shifted to network analysis involving embedding approaches (Goyal and Ferrara, 2018; Nelson et al., 2019), which can extract features automatically and map each node to a multi-dimensional vector. The superior performance of network embedding approaches has made them widely used in network biology (Liu et al., 2020b; Nelson et al., 2019), including in the prediction of disease-associated genes (Peng et al., 2019), drug-target associations (Zeng et al., 2019; 2020; Zong et al., 2017) and protein function (Kulmanov et al., 2018). In this work, we aim to stratify tumors into subtypes for 15 cancer types from TCGA by integrating human protein–protein interactions (PPIs) and patient-specific somatic mutation profiles. To accomplish this, we develop Network Embedding-based Stratification (NES) of tumor mutations, which is a network-based deep learning methodology. We construct the patient vectors based on the network embedding of the PPI network, and patients that cluster together are identified as belonging to the same subtype. The subtypes identified by NES are strongly associated with patient survival.
2 Methods and data
2.1 Pipeline of the network-embedding based stratification
To identify clinically relevant patient subtypes, NES combines the human protein–protein interactome and the genome-scale somatic mutation profiles. Following the hypothesis that patients with mutations in similar network regions are more likely to belong to the same subtype, we cluster patients whose mutations fall in similar regions of the network-embedded space into a subtype. Figure 1 illustrates the pipeline of the NES approach, which contains three main components: (i) gene vectorization by applying a representation learning algorithm to the human protein–protein interactome, which is also referred to as network embedding (Fig. 1a), (ii) patient feature construction by integrating the patient’s somatic mutation profile and the gene vectors generated in step i (Fig. 1b) and (iii) patient subdivision through machine learning approaches (using the patient features obtained in step ii), complemented by survival curve validation (Fig. 1c). Each component is described in detail in the following subsections.
Fig. 1.

A diagram illustrating the network-embedding based stratification of tumor missense somatic mutations. (a) Network embedding: each gene is represented by a 128-dimensional vector obtained with the network embedding algorithm on the human protein–protein interactome. (b) Patient feature construction: the mutation distribution differs for each cancer type (left). The patient vectors are constructed from the mutation profiles and the gene vectors (right). (c) Patient stratification: machine learning algorithms are applied to divide the patients into different clusters (subgroups/subtypes).
2.2 Human protein–protein interactome
We constructed a high-quality human protein–protein interactome covering 346 330 PPIs connecting 17 662 unique proteins (nodes). We assembled human PPIs supported by five types of experimental evidence: (i) binary PPIs tested by high-throughput yeast-two-hybrid (Y2H) systems through combining two publicly available high-quality Y2H datasets (Luck et al., 2020; Rolland et al., 2014); (ii) protein complexes data identified by robust affinity purification-mass spectrometry methodologies collected from BioPlex V2.016 (Huttlin et al., 2015); (iii) kinase-substrate interactions by literature-derived low-throughput or high-throughput experimental assays from KinomeNetworkX (Cheng et al., 2014b), Human Protein Resource Database (HPRD) (Peri et al., 2004), DbPTM 3.0 (Lu et al., 2013), PhosphoNetworks (Hu et al., 2014), PhosphositePlus (Hornbeck et al., 2015) and Phospho.ELM (Dinkel et al., 2011); (iv) a signaling network by literature-derived low-throughput experiments from SignaLink2.0 (Fazekas et al., 2013); and (v) literature-curated PPIs identified by affinity purification followed by mass spectrometry (AP-MS), Y2H, literature-derived low-throughput experiments, or protein three-dimensional structures from BioGRID (Chatr-Aryamontri et al., 2015), PINA (Cowley et al., 2012), Instruct (Meyer et al., 2013), MINT (Licata et al., 2012), IntAct (Orchard et al., 2014) and InnateDB (Breuer et al., 2013). All genes were mapped to their Entrez IDs based on the NCBI database as well as to their official gene symbols based on GeneCards, and self-loop interactions and duplicated PPI pairs were excluded, as described previously (Cheng et al., 2019).
2.3 Somatic mutation profiles in primary tumors
We downloaded the somatic mutation data from the TCGA GDC Data Portal. We focused only on the missense (non-synonymous) somatic mutations in TCGA tumor-normal matched samples across 15 cancer types, which were retrieved using the R package TCGA-assembler (Zhu et al., 2014). In total, we collected somatic mutation profiles for 7344 tumors. The patient clinical data, including the tumor stage and survival profiles, were also downloaded from TCGA.
2.4 Network embedding
In the PPI network, a node represents a single gene, and an edge indicates that the proteins encoded by the two genes interact. In order to mine the network features well, we utilize a network embedding approach to learn a vector that represents each gene in the network. We chose the struc2vec model, a representation learning model for structural features that can be used to vectorize the PPI network nodes (Ribeiro et al., 2017). The struc2vec model builds on the DeepWalk model. It adopts a hierarchical structure to measure node similarity, constructs a multilayer graph to encode structural similarity, and generates the structural context for nodes. Therefore, gene pairs that are far apart in the PPI network but similar in structure are embedded close together, which is more conducive to the subsequent construction of similar genetic characteristics of patients. The main steps of the struc2vec model are as follows:
- Calculate the structural similarity. The structural similarity between each pair of nodes $u$ and $v$ can be expressed through the structural distance:

$$f_k(u,v) = f_{k-1}(u,v) + g\big(s(R_k(u)),\, s(R_k(v))\big), \quad k \ge 0 \ \text{and} \ |R_k(u)|, |R_k(v)| > 0 \tag{1}$$

where $R_k(u)$ is the set of nodes with distance $k$ ($k \ge 0$) from node $u$, $s(S)$ is the ordered degree sequence of node set $S$, $g(D_1, D_2) \ge 0$ is a measure of the distance between the ordered sequences $D_1$ and $D_2$, and $f_{-1} = 0$. Generally, $s(R_k(u))$ and $s(R_k(v))$ are of different sizes, and their elements are integers (node degrees). In order to compare two such degree sequences of different sizes, we use Dynamic Time Warping (DTW) (Rakthanmanon et al., 2013), which was originally proposed to compare the similarity between two time series of different lengths, with the following distance between elements $a$ and $b$ of the two sequences:

$$d(a,b) = \frac{\max(a,b)}{\min(a,b)} - 1 \tag{2}$$

- Construct a multilayer weighted graph (Fig. 1a). The network in layer $k$ is formed by a weighted undirected complete graph, and the edge weight between nodes $u$ and $v$ in the same layer is defined as:

$$w_k(u,v) = e^{-f_k(u,v)}, \quad k = 0, \ldots, k^* \tag{3}$$

where $k^*$ is the diameter of the original network. Directed edges connect the same nodes that belong to different layers. Each node $u$ in layer $k$ is connected to its corresponding node in layer $k+1$ and layer $k-1$. The weights of the edges between different layers are defined as:

$$w(u_k, u_{k+1}) = \log\big(\Gamma_k(u) + e\big), \quad k = 0, \ldots, k^*-1; \qquad w(u_k, u_{k-1}) = 1, \quad k = 1, \ldots, k^* \tag{4}$$

where $\Gamma_k(u)$ is the number of edges connected to $u$ in layer $k$ that have weight larger than the average edge weight.

- Generate node sequence. We can apply the multilayer graph to generate a sequence of nodes based on a biased random walk. At each step of the random walk, we assume that the walker stays in the current layer with probability $q$, or changes to another layer with probability $1-q$. If the walker stays in the current layer, the probability of stepping from node $u$ to node $v$ in layer $k$ is:

$$p_k(u,v) = \frac{e^{-f_k(u,v)}}{Z_k(u)} \tag{5}$$

where $Z_k(u) = \sum_{v \ne u} e^{-f_k(u,v)}$ is the normalization factor for node $u$ in the $k$-th layer. As a result, the walker is more inclined to step to nodes with high structural similarity. If the walker switches between layers, it moves to layer $k+1$ or layer $k-1$ with the following probabilities:

$$p_k(u_k, u_{k+1}) = \frac{w(u_k, u_{k+1})}{w(u_k, u_{k+1}) + w(u_k, u_{k-1})}, \qquad p_k(u_k, u_{k-1}) = 1 - p_k(u_k, u_{k+1}) \tag{6}$$
Therefore, we start the random walk from a randomly selected node in layer 0 and perform the random walk with the probabilities given in Equations (5) and (6) to generate a node sequence. In this study, the length of each random walk sequence is set to 80, and we generate 20 random walk sequences for each node. Once the node sequences are generated, the Skip-Gram model (Mikolov et al., 2013) is trained on them, and a corresponding vector is generated for each gene node (Fig. 1a). Each gene is embedded as a 128-dimensional vector in this study (all parameters used for each step are listed in Supplementary Table S1).
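The comparison of two ordered degree sequences of different lengths can be illustrated with a small dynamic-programming sketch. This is a minimal illustration, not the implementation used in this study: it assumes the element-wise cost max(a, b)/min(a, b) − 1 from the struc2vec paper (Ribeiro et al., 2017), and the function names `degree_cost` and `dtw_distance` as well as the toy degree sequences are hypothetical.

```python
from math import inf


def degree_cost(a: int, b: int) -> float:
    """Cost between two node degrees (assumed struc2vec cost); degrees must be >= 1."""
    return max(a, b) / min(a, b) - 1.0


def dtw_distance(seq_a, seq_b, cost=degree_cost) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping between two sequences."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            dp[i][j] = step + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]


# Ordered degree sequences of two hypothetical k-hop neighbourhoods
ring_u = sorted([1, 2, 2, 4])
ring_v = sorted([1, 2, 4])
print(dtw_distance(ring_u, ring_v))
```

Because DTW may align one element with several elements of the other sequence, the two toy sequences above are treated as structurally identical even though their lengths differ.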
2.5 Patient feature construction
In order to describe the patient more precisely, we assemble the mutated genes’ vector to construct the patient features. For the patients across different cancer types, we rank the genes with their degree in the PPI network and divide the genes into 10 blocks with equal size according to their degree. For each patient, we fuse the vector of the mutated genes located in the same block to generate a new 128-dimensional vector, and concatenate the vector of the 10 blocks to generate a 1280-dimensional vector, which can be used to represent patient’s features. Finally, using the combination, all patients are represented by a multi-dimensional vector, as shown in Figure 1b.
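The block-wise construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the text does not specify how vectors within a block are fused, so an element-wise sum is assumed here, and the helper names `degree_blocks` and `patient_vector`, the toy degrees and the 2-dimensional toy embeddings are all hypothetical.

```python
def degree_blocks(gene_degree, n_blocks=10):
    """Rank genes by PPI degree (highest first) and split the ranking into n_blocks near-equal blocks."""
    ranked = sorted(gene_degree, key=lambda g: (-gene_degree[g], g))
    return {gene: rank * n_blocks // len(ranked) for rank, gene in enumerate(ranked)}


def patient_vector(mutated_genes, gene_vec, block_of, n_blocks=10, dim=128):
    """Concatenate one fused vector per block; fusion is an element-wise sum (an assumption)."""
    blocks = [[0.0] * dim for _ in range(n_blocks)]
    for gene in mutated_genes:
        if gene not in gene_vec or gene not in block_of:
            continue  # mutated gene absent from the interactome
        target = blocks[block_of[gene]]
        for i, value in enumerate(gene_vec[gene]):
            target[i] += value
    return [x for block in blocks for x in block]


# Toy example: 4 genes, 2 blocks, 2-dimensional embeddings (all values are made up)
gene_degree = {"TP53": 50, "KRAS": 30, "GENE_C": 5, "GENE_D": 1}
gene_vec = {"TP53": [1.0, 0.0], "KRAS": [0.0, 1.0], "GENE_C": [0.5, 0.5], "GENE_D": [2.0, 2.0]}
block_of = degree_blocks(gene_degree, n_blocks=2)
print(patient_vector(["TP53", "KRAS", "GENE_D"], gene_vec, block_of, n_blocks=2, dim=2))
```

With the settings used in this study (10 blocks, 128-dimensional gene vectors), the same function yields a 1280-dimensional patient vector.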
We found that gene mutation frequencies differ significantly across cancer types. Some genes are frequently mutated in all cancer types, while others are mutated only in specific cancer types, as shown in Figure 1b. Therefore, for patient clustering within a specific cancer type, we defined a modified gene weight that enhances the influence of the genes that are significantly mutated in the corresponding cancer type, while weakening the influence of the genes that are frequently mutated in all cancer types. The weight is defined as:
| (7) |
where is the total number of tumor types, is the number of mutations to gene a in the tumor , is the number of genes with the most mutations in the tumor , and is the number of mutations to gene a across all tumor types. We sort the genes of patients with the same cancer type according to the weight , and then reconstruct the characteristics of the patients similarly. The patient feature is constructed based on the weight and the gene vector. We divide the genes into 10 blocks with equal size according to their weight in each cancer type and combine the characteristics of the mutated genes in the same block to generate the patient’s vector.
2.6 Supervised classification model
We use the lightGBM classification method, a supervised learning approach, to classify tumors based on the constructed patient features (Chen and Guestrin, 2016; Chen and Liu, 2018). LightGBM is a boosting method based on the Gradient Boosting Decision Tree (GBDT) that uses a histogram-based algorithm to discretize continuous feature values into k integers and construct a histogram of width k. For the classification of a specific cancer type, the tumors of that cancer type are considered the positive samples, while all other tumors are considered the negative samples (the key parameters of lightGBM are listed in Supplementary Table S1).
For the binary classification, we can take the AUC value, which is the area under the receiver operating characteristic (ROC) curve, as the evaluation metric to test the classification performance. We calculate the true positive rate (TPR) and the false positive rate (FPR) by varying the threshold to obtain the ROC curves according to:
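$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \tag{8}$$

where FN and FP are the numbers of samples erroneously identified as negative and positive, respectively, while TN and TP are the numbers of negative and positive samples correctly identified.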
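As a worked illustration of how the ROC curve and the AUC follow from the TPR and FPR, the sketch below sweeps the decision threshold over the predicted scores and integrates the curve with the trapezoidal rule. It is a minimal example on made-up labels and scores, not the evaluation code of this study; the function names `roc_points` and `auc` are hypothetical, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold step by step.

    labels are 0/1 integers; both classes are assumed to occur at least once.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # consume all samples tied at this score before emitting a point
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))  # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0]               # 1 = tumor of the target cancer type
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # made-up classifier outputs
print(auc(roc_points(labels, scores)))
```

In this toy case, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC equals 8/9.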
2.7 Unsupervised clustering model
We applied DBSCAN (Density-Based Spatial Clustering of Applications with Noise), an unsupervised density-based clustering method, to cluster patients with the same cancer type into subtypes (Ester et al., 1996). Given the neighborhood distance ε and the minimum number of neighborhood samples MinPts, for a point p in the sample set D, we can obtain the ε-neighborhood of p, $N_\varepsilon(p)$, as

$$N_\varepsilon(p) = \{\, q \in D \mid \mathrm{dist}(p,q) \le \varepsilon \,\} \tag{9}$$

where $\mathrm{dist}(p,q)$ is the distance between points p and q, and p is a core point if $|N_\varepsilon(p)| \ge \mathrm{MinPts}$.
For each temporary cluster, we check in sequence whether each of its points is a core point, and merge the temporary clusters of the core points found with the current temporary cluster to obtain a new temporary cluster. This step is repeated until no point in the current temporary cluster is a core point, or every density-reachable point is already in the temporary cluster; the temporary cluster is then promoted to a final cluster. For each cancer type, the parameters ε and MinPts are set so that the patients are divided into n clusters, where n is the number of subtypes reported in the medical literature on tumor subtype discovery for the corresponding cancer type.
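The cluster expansion described above can be sketched in a few lines. The following is a textbook-style DBSCAN written for illustration only, assuming Euclidean distance and a brute-force neighborhood search; it is not the implementation used in this study, and the names `region_query`, `dbscan` and `NOISE` as well as the toy 2-dimensional points are hypothetical.

```python
from math import dist

NOISE = -1


def region_query(points, p, eps):
    """Indices of all points within distance eps of points[p] (the point itself included)."""
    return [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks points not reachable from any core point."""
    labels = [None] * len(points)
    cluster = -1
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = list(neighbours)
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster  # border point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_neighbours)
    return labels


# Toy 2-dimensional "patient vectors": two dense groups and one outlier
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

In practice, ε and MinPts would be scanned until the number of distinct non-noise labels matches the target number of subtypes n for the cancer type.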
3 Results
3.1 Tumor classification across various cancer types
In this work, we collected the genome-scale somatic mutation profiles from 7344 tumor exomes across 15 cancer types from TCGA. Table 1 lists the cancer types and the corresponding numbers of tumor exomes. For each patient, we obtain the feature vector by integrating the mutated gene list and the gene vectors from the network embedding. Here we construct a 1280-dimensional vector to represent a patient. To visualize the patient vectors, we use the t-distributed stochastic neighbor embedding algorithm (t-SNE) (Van der Maaten and Hinton, 2008), which reduces the dimensionality so that objects that are similar in the high-dimensional space are placed close together in 2-dimensional coordinates (Fig. 2). The t-SNE analysis of the patient vectors across 15 cancer types reveals that patients with the same cancer type are grouped together. The embedding for each cancer type forms its own cluster, which is marked with a different color. As shown in Figure 1b, the somatic mutation patterns of different cancer types are very different; therefore, patients with different cancer types tend to lie far from each other and patients with the same cancer type tend to cluster together in the t-SNE plot.
Table 1.
Illustration of 15 cancer types and corresponding tumor number
| Abbreviation | Cancer type | Patient number |
|---|---|---|
| BLCA | Bladder urothelial carcinoma | 419 |
| BRCA | Breast invasive carcinoma | 1067 |
| CESC | Cervical squamous cell carcinoma | 307 |
| COAD | Colon adenocarcinoma | 464 |
| HNSC | Head and neck squamous cell carcinoma | 512 |
| KIRC | Kidney renal clear cell carcinoma | 370 |
| LIHC | Liver hepatocellular carcinoma | 377 |
| LUAD | Lung adenocarcinoma | 630 |
| LUSC | Lung squamous cell carcinoma | 559 |
| OV | Ovarian serous cystadenocarcinoma | 512 |
| READ | Rectum adenocarcinoma | 164 |
| SKCM | Skin cutaneous melanoma | 470 |
| STAD | Stomach adenocarcinoma | 440 |
| THCA | Thyroid carcinoma | 502 |
| UCEC | Uterine corpus endometrial carcinoma | 551 |
Fig. 2.

Visualization of the learned patients’ vectors using t-SNE. The abbreviations of the 15 cancer types are provided in Table 1.
In order to test the effectiveness of NES, we utilized a supervised classification algorithm (lightGBM) to predict subgroups of the patients, with the learned patients’ vectors as the inputs of the classification algorithm. Taking BRCA as an example, the known BRCA tumors are labeled as positive samples, and the other tumors (non-BRCA) are labeled as negative samples. A portion of the patients is randomly selected as the training set, which is used to train the classification model, and the rest form the testing set, which is used to validate the prediction. We obtain the AUC value by averaging over 100 repetitions of 5-fold cross-validation. As shown in Figure 3, the AUC values range from 0.75 to 0.99 across the 15 cancer types, with an average AUC value close to 0.89. Among the 15 types, COAD, KIRC, READ, SKCM, THCA and UCEC show the best performance, with AUC values above 0.9. We found that NES is comparable to two state-of-the-art approaches for the prediction of cancer subgroups (Supplementary Fig. S1). In addition, we showed that NES outperformed the traditional driver mutation-based patient stratification approach (Supplementary Fig. S2). The NES models built on the selected features (1280 dimensions) and the global PPI network show the best performance (Supplementary Figs S3 and S4). Altogether, these observations reveal that NES offers a potential computational tool for the stratification of tumor mutations.
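The repeated cross-validation protocol can be sketched as follows. This is a generic illustration rather than the evaluation script of this study: the splits are plain random (not stratified by label, since the text does not say which variant was used), and the names `repeated_kfold` and `cross_validated_score` and the `evaluate` callback are hypothetical. In a real run, `evaluate` would train the classifier on the training indices and return the AUC on the test indices.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, n_folds=5, n_repeats=100, seed=0):
    """Yield (train_idx, test_idx) for every fold of every freshly shuffled repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_folds):
            test = order[fold::n_folds]  # every n_folds-th sample of the shuffled order
            test_set = set(test)
            train = [i for i in order if i not in test_set]
            yield train, test


def cross_validated_score(n_samples, evaluate, **kwargs):
    """Average evaluate(train_idx, test_idx) over all folds and repetitions."""
    return mean(evaluate(train, test) for train, test in repeated_kfold(n_samples, **kwargs))


# Placeholder metric: the fraction of samples held out in each fold
print(cross_validated_score(10, lambda train, test: len(test) / 10, n_repeats=3))
```

Averaging over many shuffled repetitions reduces the dependence of the reported AUC on any single random partition of the patients.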
Fig. 3.

The receiver operating characteristic (ROC) curves of the patient classification across the 15 cancer types. The abbreviations of the 15 cancer types are provided in Table 1. AUC: the area under the receiver operating characteristic curve.
3.2 Tumor subgroup stratification for individual cancer types
The classification approach performs well across different cancer types; however, this work aims to subdivide patients with the same cancer type into subtypes. Here, we collected the tumor stage information for each patient from TCGA and used it to evaluate the stratification of the tumor mutations. Tumor staging represents the growth range and spread of malignant tumors. The wider the range of growth, the greater the spread, and the worse the patient's prognosis. Pathologically, according to the characteristics of tumor size and invasion range, tumors can generally be divided into stages I, II, III and IV, which represent the early, middle and late stages of the disease, through the tumor-node-metastasis (TNM) staging system. Taking stage I of BRCA as an example, the BRCA tumors in stage I are labeled as positive samples, and the other BRCA tumors (non-stage I) are labeled as negative samples. Figure 4 shows the t-SNE visualization of the learned patient vectors for the different tumor stages of BRCA, COAD, LUAD and UCEC. Patients in the same tumor stage tend to be grouped together in the 2-dimensional space, which indicates that the somatic mutation patterns of patients in different tumor stages of the same cancer type are very different. Figure 4 also provides the AUC values for the classification of the various tumor stages of BRCA, COAD, LUAD and UCEC. The average AUC over all stages is about 0.78, and the AUC for stage IV is much higher than for the other stages. The high AUC values for most tumor stages indicate that patient features based on the network embedding of the PPI network are effective for patient classification (Supplementary Figs S5 and S6 illustrate the results for all 15 cancer types).
Fig. 4.

Visualization of the learned patients’ vectors using t-SNE for 4 cancer types: BRCA, COAD, LUAD and UCEC. Patients in different stages are marked with different colors. The bar plot shows the corresponding area under the receiver operating characteristic curve (AUC) of the patients’ classification for each stage.
The above analyses indicate that network-embedding based vectors tend to cluster together for patients with similar clinical features, for example, patients belonging to the same cancer type or to the same tumor stage within a specific cancer type. This implies that we can use an unsupervised learning method to cluster the patients, and the resulting clusters can be considered a subdivision of patients into subtypes. Here, we apply DBSCAN, an unsupervised density-based clustering method, to perform the patient clustering. For each cancer type, the number of patient clusters is set to be consistent with the number of subtypes reported in the medical literature on tumor subtype identification. For example, it is reported in Advances in Breast Cancer Research in TCGA that there are three main clinical subtypes of breast cancer, including Luminal A, Luminal B, Triple-negative/basal-like and HER2-enriched, based on these two classic Nature Reviews (Harbeck et al., 2019; Sims et al., 2007). We also generated three clusters of breast cancer patients (BRCA) in our data. Figure 5 illustrates the clustering results, where a darker color in the matrix indicates that the corresponding pair of patients is more strongly clustered together. We also calculate the Silhouette Coefficient and the Calinski-Harabasz index to evaluate the clustering results (Supplementary Table S2). The results show that, for most cancer types, patients within the same cluster are tightly grouped while different clusters are well separated. The Kaplan–Meier survival plots in Figure 5 illustrate that the patient survival time and survival probability are significantly different across the identified subtypes (the log-rank P-value). For example, the tumor numbers of the two subtypes of UCEC are 273 and 258 respectively, and the patient survival time patterns are significantly different with log-rank .
Among the 15 cancer types, the subtypes identified by NES in 12 cancer types are significantly correlated with patient survival (), which demonstrates the effectiveness of the stratification of tumor mutations. Supplementary Figures S7 and S8 illustrate the results of the tumor mutation stratification for all 15 cancer types, and Supplementary Table S3 shows the tumor number of each subtype.
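The survival curves compared across subtypes are Kaplan–Meier estimates. As a minimal illustration of how such a curve is computed for one subtype, the sketch below implements the product-limit estimator on made-up follow-up data; it is not the analysis code of this study, the function name `kaplan_meier` is hypothetical, and the log-rank test used to compare curves is not implemented here.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 if the patient died at times[i] and 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # group all patients whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk  # product-limit update
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve


# Made-up follow-up times (months) and event indicators for one subtype
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Censored patients contribute to the risk set up to their last follow-up but never lower the curve themselves, which is why the estimate only drops at observed deaths.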
Fig. 5.

The results of the tumor mutation stratification for 4 cancer types: BRCA, COAD, LUAD and UCEC. The left panels represent the patient clustering, where the dark blue areas indicate that the corresponding patients should be clustered into the same subtype. The right panels represent the survival analysis for the different subtypes, and the P-value is calculated according to the log-rank test.
4 Discussion
Tumor stratification provides insight into disease mechanisms, which is critical for successful precision oncology. By integrating somatic mutation profiles with the human protein–protein interactome, we developed a network-embedding based tumor stratification (NES), in which the gene vectors are constructed with a structural feature learning representation model (struc2vec) and each patient vector combines the vectors of the patient's mutated genes. Using the patient vectors, we trained a supervised classification algorithm to classify 7344 patients across 15 cancer types from TCGA and across different tumor stages within a specific cancer type. The ROC results show the effectiveness of classification based on the constructed patient vectors. Finally, we applied an unsupervised clustering algorithm (DBSCAN) to divide the patients of a specific cancer type into clinically relevant subtypes, and the clinical data showed significantly different patient survival times for the various subtypes.
Although this study offers a powerful network-based methodology for tumor stratification, there are still several limitations. First, literature bias and incompleteness of the PPI networks used in the current study may influence the performance of NES models (Supplementary Figs S3 and S4). In addition, the somatic mutation rates are very different across tumor types; some tumor types have a high mutation rate (such as LUAD, COAD, and UCEC), while others have a low mutation frequency (such as LAML, BRCA and others). In the current NES framework, we only integrate the tumor-normal sample matched somatic mutation profiles. Integration of other types of omics data, including whole-genome sequencing, RNA-sequencing, methylation and proteomics from individual patients, may improve NES models further. Second, the framework of the proposed method involves clustering the patients based on the patient vectors extracted from the specific dataset. Several other computational algorithms can also be applied within this framework to solve the tumor stratification problem. For example, we can adopt deep neural networks to increase the prediction accuracy, or other machine learning approaches such as K-means clustering. In addition, using explainable deep learning frameworks, such as xAI (Jimenez-Luna et al., 2020) and visible neural networks (Ma et al., 2018), would shed some light on improving the biological interpretation in future work, which would be a potential extension of the proposed method.
Funding
This study was partially supported by Natural Science Foundation of China [61873080 and 61673151] and the Zhejiang Provincial Natural Science Foundation of China [LY18A050004 and LR18A050001]. This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract HHSN261200800001E to R.N. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government. This Research was supported [in part] by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research and the Intramural Research Program of the NIH Clinical Center to R.N.
Conflict of Interest: none declared.
Supplementary Material
Contributor Information
Chuang Liu, Alibaba Research Center for Complexity Sciences, Hangzhou Normal University, Hangzhou 311121, China.
Zhen Han, Alibaba Research Center for Complexity Sciences, Hangzhou Normal University, Hangzhou 311121, China.
Zi-Ke Zhang, Alibaba Research Center for Complexity Sciences, Hangzhou Normal University, Hangzhou 311121, China; College of Media and International Culture, Zhejiang University, Hangzhou 310028, China.
Ruth Nussinov, Cancer and Inflammation Program, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, National Cancer Institute at Frederick, Frederick, MD 21702, USA; Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel.
Feixiong Cheng, Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA; Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA; Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA.
References
- Azuaje F. (2019) Artificial intelligence for precision oncology: beyond patient stratification. NPJ Precision Oncol., 3, 6.
- Bailey M.H. et al. (2018) Comprehensive characterization of cancer driver genes and mutations. Cell, 173, 371–385.
- Bedard P.L. et al. (2013) Tumour heterogeneity in the clinic. Nature, 501, 355–364.
- Breuer K. et al. (2013) InnateDB: systems biology of innate immunity and beyond–recent updates and continuing curation. Nucleic Acids Res., 41, D1228–1233.
- Camacho D.M. et al. (2018) Next-generation machine learning for biological networks. Cell, 173, 1581–1592.
- Chatr-Aryamontri A. et al. (2015) The BioGRID interaction database: 2015 update. Nucleic Acids Res., 43, D470–478.
- Chen T., Guestrin C. (2016) XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- Chen X., Liu X. (2018) A weighted bagging LightGBM model for potential lncRNA–disease association identification. In Proceedings of 13th International Conference of Bio-inspired Computing: Theories and Applications, pp. 307–314.
- Cheng F. et al. (2014a) Studying tumorigenesis through network evolution and somatic mutational perturbations in the cancer interactome. Mol. Biol. Evol., 31, 2156–2169.
- Cheng F. et al. (2014b) Quantitative network mapping of the human kinome interactome reveals new clues for rational kinase inhibitor discovery and individualized cancer therapy. Oncotarget, 5, 3697–3710.
- Cheng F. et al. (2015) A gene gravity model for the evolution of cancer genomes: a study of 3,000 cancer genomes across 9 cancer types. PLoS Comput. Biol., 11, e1004497.
- Cheng F. et al. (2019) A genome-wide positioning systems network algorithm for in silico drug repurposing. Nat. Commun., 10, 3476.
- Cowley M.J. et al. (2012) PINA v2.0: mining interactome modules. Nucleic Acids Res., 40, D862–865.
- Dinkel H. et al. (2011) Phospho.ELM: a database of phosphorylation sites–update 2011. Nucleic Acids Res., 39, D261–267.
- Eraslan G. et al. (2019) Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet., 20, 389–403.
- Esteva A. et al. (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115–118.
- Ester M. et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–231.
- Fazekas D. et al. (2013) SignaLink 2–a signaling pathway resource with multilayered regulatory networks. BMC Syst. Biol., 7, 7.
- Gerlinger M. et al. (2012) Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. New Engl. J. Med., 366, 883–892.
- Goyal P., Ferrara E. (2018) Graph embedding techniques, applications and performance: a survey. Knowl. Based Syst., 151, 78–94.
- Hanahan D. et al. (2011) Hallmarks of cancer: the next generation. Cell, 144, 646–674.
- Harbeck N. et al. (2019) Breast cancer. Nat. Rev. Dis. Primers, 5, 66.
- Hofree M. et al. (2013) Network-based stratification of tumor mutations. Nat. Methods, 10, 1108–1115.
- Horn H. et al. (2018) NetSig: network-based discovery from cancer genomes. Nat. Methods, 15, 61–66.
- Hornbeck P.V. et al. (2015) PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res., 43, D512–520.
- Hu J. et al. (2014) PhosphoNetworks: a database for human phosphorylation networks. Bioinformatics, 30, 141–142.
- Hu J. et al. (2016) Network biology concepts in complex disease comorbidities. Nat. Rev. Genet., 17, 615–629.
- Huttlin E.L. et al. (2015) The BioPlex network: a systematic exploration of the human interactome. Cell, 162, 425–440.
- International Cancer Genome Consortium et al. (2010) International network of cancer genome projects. Nature, 464, 993–998.
- Jimenez-Luna J. et al. (2020) Drug discovery with explainable artificial intelligence. Nat. Mach. Intell., 2, 573–584.
- Kulmanov M. et al. (2018) DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics, 34, 660–668.
- Leiserson M.D.M. et al. (2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet., 47, 106–114.
- Licata L. et al. (2012) MINT, the molecular interaction database: 2012 update. Nucleic Acids Res., 40, D857–861.
- Liu C. et al. (2020a) Individualized genetic network analysis reveals new therapeutic vulnerabilities in 6,700 cancer genomes. PLoS Comput. Biol., 16, e1007701.
- Liu C. et al. (2020b) Computational network biology: data, models, and applications. Phys. Rep., 846, 1–66.
- Liu H. et al. (2017) Entropy-based consensus clustering for patient stratification. Bioinformatics, 33, 2691–2698.
- Lu C.T. et al. (2013) DbPTM 3.0: an informative resource for investigating substrate site specificity and functional association of protein post-translational modifications. Nucleic Acids Res., 41, D295–305.
- Luck K. et al. (2020) A reference map of the human binary protein interactome. Nature, 580, 402–408.
- Lundin M. et al. (1999) Artificial neural networks applied to survival prediction in breast cancer. Oncology, 57, 281–286.
- Ma J. et al. (2018) Using deep learning to model the hierarchical structure and function of a cell. Nat. Methods, 15, 290–298.
- Meacham C.E., Morrison S.J. (2013) Tumor heterogeneity and cancer cell plasticity. Nature, 501, 328–337.
- Meyer M.J. et al. (2013) INstruct: a database of high-quality 3D structurally resolved protein interactome networks. Bioinformatics, 29, 1577–1579.
- Mikolov T. et al. (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781
- Nelson W. et al. (2019) To embed or not: network embedding as a paradigm in computational biology. Front. Genet., 10, 381.
- Nussinov R. et al. (2019) Precision medicine review: rare driver mutations and their biophysical classification. Biophys. Rev., 11, 5–19.
- Orchard S. et al. (2014) The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res., 42, D358–363.
- Peng J. et al. (2019) Predicting Parkinson’s disease genes based on Node2vec and autoencoder. Front. Genet., 10, 226.
- Peri S. et al. (2004) Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res., 32, D497–D501.
- Rakthanmanon T. et al. (2013) Addressing big data time series: mining trillions of time series subsequences under dynamic time warping. ACM Trans. Knowl. Discov. Data, 7, 1–31.
- Reis-Filho J.S., Pusztai L. (2011) Gene expression profiling in breast cancer: classification, prognostication, and prediction. Lancet, 378, 1812–1823.
- Ribeiro L.F. et al. (2017) Struc2vec: learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394.
- Rolland T. et al. (2014) A proteome-scale map of the human interactome network. Cell, 159, 1212–1226.
- Sims A.H. et al. (2007) Origins of breast cancer subtypes and therapeutic implications. Nat. Clin. Pract. Oncol., 4, 516–525.
- Tokheim C.J. et al. (2016) Evaluating the evaluation of cancer driver genes. Proc. Natl. Acad. Sci. USA, 113, 14330–14335.
- Topol E. (2019) High-performance medicine: the convergence of human and artificial intelligence. Nat. Med., 25, 44–56.
- Van der Maaten L., Hinton G. (2008) Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res., 9, 2579–2605.
- Weinstein J.N. et al.; The Cancer Genome Atlas Research Network. (2013) The cancer genome atlas pan-cancer analysis project. Nat. Genet., 45, 1113–1120.
- Wong D., Yip S. (2018) Machine learning classifies cancer. Nature, 555, 446–447.
- Zhang W. et al. (2018) Classifying tumors by supervised network propagation. Bioinformatics, 34, i484–i493.
- Zeng X. et al. (2019) deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics, 35, 5191–5198.
- Zeng X. et al. (2020) Target identification among known drugs by deep learning from heterogeneous networks. Chem. Sci., 11, 1775–1797.
- Zhu Y. et al. (2014) TCGA-assembler: open-source software for retrieving and processing TCGA data. Nat. Methods, 11, 599–600.
- Zong N. et al. (2017) Deep mining heterogeneous networks of biomedical linked data to predict novel drug-target associations. Bioinformatics, 33, 2337–2344.