Abstract
Cancer development is driven by an accumulation of a small number of driver genetic mutations that confer the selective growth advantage to the cell, while most passenger mutations do not contribute to tumor progression. The identification of these driver genes responsible for tumorigenesis is a crucial step in designing effective cancer treatments. Although many computational methods have been developed with this purpose, the majority of existing methods solely provided a single driver gene list for the entire cohort of patients, ignoring the high heterogeneity of driver events across patients. It remains challenging to identify the personalized driver genes. Here, we propose a novel method (PDRWH), which aims to prioritize the mutated genes of a single patient based on their impact on the abnormal expression of downstream genes across a group of patients who share the co-mutation genes and similar gene expression profiles. The wide experimental results on 16 cancer datasets from TCGA showed that PDRWH excels in identifying known general driver genes and tumor-specific drivers. In the comparative testing across five cancer types, PDRWH outperformed existing individual-level methods as well as cohort-level methods. Our results also demonstrated that PDRWH could identify both common and rare drivers. The personalized driver profiles could improve tumor stratification, providing new insights into understanding tumor heterogeneity and taking a further step toward personalized treatment. We also validated one of our predicted novel personalized driver genes on tumor cell proliferation by vitro cell-based assays, the promoting effect of the high expression of Low-density lipoprotein receptor-related protein 1 (LRP1) on tumor cell proliferation.
Author summary
In this study, using the TCGA dataset studies as benchmark datasets, we explored the application of the commonality among patients of the same cancer type in personalized driver gene prediction. We proposed a hypergraph model and a generalized random walk method to rank the mutated genes of a patient based on their impact on the abnormal expression of downstream genes in a group of samples rather than an individual sample. Following the extensive experimental results on 16 cancer datasets and the comparative analysis across five cancer types, we have observed that the PDRWH method exhibits remarkable effectiveness in identifying known general driver genes and tumor-specific driver genes. In a few words, our method can provide a more accurate personalized catalog of driver mutations for each patient, and the predicted personalized driver genes can be applied to improve tumor stratification. It can also provide oncologists with a reliable candidate gene list to assist treatment decisions, thus potentially promoting the development of personalized medicine.
Introduction
Cancer is a collection of genetic diseases characterized by abnormal and uncontrolled cellular growth, which are primarily caused by the accumulation of genetic alterations [1–3]. Previous evidence has shown that a small fraction of genomic and transcriptomic altered genes, called cancer driver genes, could modify transcriptional programs and result in abnormal cell proliferation and eventually tumorigenesis [4–6]. The majority of detected altered genes are passengers that do not contribute to the oncogenic process. Distinguishing cancer driver genes from numerous functionally neutral passenger mutation genes is critical for providing clinically characterized insights into tumor biology. And it has led to the development of a paradigm of targeted anticancer therapies, and the search for biomarkers of prognosis and response to cancer treatments [7–9].
With recent advances in genomics technologies, comprehensive platforms such as the Cancer Genome Atlas (TCGA), have led to the characterization of the molecular signatures of human cancers spanning 33 cancer types, providing an unprecedented opportunity to develop computational methods for driver gene identification [10]. Many bioinformatics tools are dedicated to identifying driver mutations from passengers in a cohort of patients [11–15]. However, since cancer patients possess different genomes and their disease might be driven by different driver genes [16, 17], it is necessary to investigate personalized cancer drivers specific to an individual patient. For example, DawnRank applies the PageRank algorithm to evaluate the impact of genes on the overall differential expression of its downstream genes in a molecular interaction network [18]. Prodigy prioritizes candidate personalized driver genes by quantifying the impact of mutated genes on deregulated pathway based on the patient’s tumor mutation and expression profiles [19]. According to a single-sample network control strategy, Guo et al. developed SCS to detect the minimum set of driver nodes that could achieve the maximal coverage of individual differentially expressed genes during the transition from the normal state to the disease state [20]. Despite achieving promising results, these personalized driver prioritization methods take into account the data available from a single sample to produce a ranking of drivers for every specific patient, neglecting the availability of data from other samples. More importantly, they are overly dependent on the data quality of individual samples, with poor tolerance for noise and low reliability of the results. To address this issue, PersonaDrive aims to utilize the comprehensive whole cohort data for guiding the personalized driver prediction [21]. This is achieved by constructing a bipartite graph to model pair-wise relationships among the set of mutated genes and the differently expressed genes. However, it is widely acknowledged that the bipartite graph, as a type of simple graph, is limited to capturing pair-wise relationships between nodes and cannot represent more complex relationships. An important aspect of cancer that has been overlooked by existing methods is that patients with the same driver gene mutations are likely to share the same carcinogenic mechanism, unlike patients with only the same passenger mutations [22, 23]. Taking this into account, incorporating higher-order relationships between mutated genes and abnormally expressed genes in computational models holds significant promise for improving personalized cancer driver prediction.
In this study, we present a novel method named PDRWH (Prioritizing Personalized Cancer Driver Genes via Random Walks on a Hypergraph), inspired by the effectiveness of hypergraphs in modeling biology networks, data structures, and other systems [24–27]. Unlike methods that rely on data from a specific sample, PDRWH integrates data from a cohort to generate personalized driver gene predictions, enabling a more comprehensive analysis of the collective information across multiple samples. Under the assumption that the impact of a potential driver gene can be determined by its effect on the genes regulated by it, PDRWH ranks potential driver genes based on the influence of mutated genes on transcriptional networks across the cohort of samples. To achieve this, a hypergraph model is proposed to effectively represent high-order relationships among genes. It captures the implicit intrinsic regulatory associations among genes within each sample by connecting a large number of mutated genes and aberrantly expressed genes in the corresponding hyperedge. Additionally, this model accurately characterizes the coexistence of mutated genes and aberrantly expressed genes across diverse samples. PDRWH quantifies the impact of each mutated gene across the group of samples by performing a generalized random walks algorithm on the personalized hypergraph. Evaluated across datasets from 16 cancer types in TCGA and benchmarked against existing driver gene prediction methods using five cancer type datasets, PDRWH consistently demonstrates superior performance in identifying both known general driver genes and tumor-specific driver genes. Notably, PDRWH excels at simultaneously identifying both common and rare driver genes. The predicted personalized driver gene profiles can not only improve tumor stratification but also provide oncologists with a reliable candidate gene list to assist in treatment decisions. To validate the effectiveness of PDRWH, we experimentally verified a predicted personalized driver gene LRP1 through in vitro cell assays.
Results
An overview of PDRWH
The PDRWH method is a novel integrated genome/transcriptome analysis approach designed to identify candidate personalized driver genes by leveraging the influence of mutated genes on biological networks across cohort samples. We hypothesize that samples with the same cancer types of cancer display higher similarity in molecular characteristics and disease mechanisms compared to samples with different types of cancer. Leveraging samples from the same cancer type, specifically those closely resembling the target sample, has the potential to improve the personalized driver genes prediction. A schematic overview of PDRWH is illustrated in Fig 1. PDRWH requires knowledge of cohort samples from a specific cancer type in TCGA, including the somatic mutation and the gene expression profiles, as well as a gene interaction network (Fig 1A). To ensure uniform baseline values and ranges for all genes across the cohort samples, gene expression data is normalized. Subsequently, a screening process is conducted to identify abnormally expressed genes for each sample (Fig 1B). PDRWH consists of three main steps. The first step involves constructing a personalized hypergraph model, where samples are represented as hyperedges, and the mutated genes as well as the abnormally expressed genes in the target sample are described as vertices (Fig 1C). In this model, a hyperedge in the hypergraph is capable of connecting multiple vertices (not limited to two nodes as in a simple graph), enabling a more complex representation of relationships between genes and facilitating a comprehensive analysis of association across the cohort samples. As depicted in S1 Fig, each hypergraph typically consists of more than 50 hyperedges (samples), indicating that PDRWH generates a relatively dense personalized hypergraph for most individuals. Following this, weights are assigned to hyperedges based on the Pearson correlation of gene expression profiles between the corresponding samples and the target sample. This weighting scheme ensures that patients who closely resemble the target patient have a more substantial impact on the personalized driver gene prediction. The weights of vertices within each hyperedge are set as the degree of the subnetwork induced by genes corresponding to these vertices, characterizing the influence of mutated genes on the abnormally expressed genes in each sample. In the second step, the transition probability matrix is calculated by taking into account the structure of the hypergraph as well as the weights of the hyperedges and vertices. This matrix indicates the probabilities of transitioning from one vertex to another within the hypergraph (Fig 1D). The final step entails performing a random walk with restart on the weighted personalized hypergraph. The walker, starting at a present vertex u, firstly chooses a hyperedge with probability determined by the weights of hyperedges, and then travels towards any vertex (for example v) within the selected hyperedge based on the weights of the vertex in the hyperedge. After the random walks reach a stable state, a stationary distribution of the random walk is achieved. At this point, the PDRWH-scores are generated to quantitatively and rationally prioritize candidate genes (Fig 1E). For a target patient, a mutated gene in the target sample should be ranked higher if it is adjacent to many genes that are abnormally expressed in the target sample as well as in a group of other samples from the same cancer type.
Known drivers have a higher degree of connectivity to abnormally expressed genes in the gene interaction subnetwork
Our approach is built upon a crucial observation: the mutated genes that are adjacent to a higher number of abnormally expressed genes are more likely to have a significant impact. This assumption can be validated through empirical data analysis. To achieve this, we randomly selected some patients from the five cancer types, breast invasive carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), liver cancer (LIHC), glioblastoma (GBM), and stomach adenocarcinoma (STAD), and created personalized gene interaction subnetworks for each patient by mapping the mutated genes and abnormally expressed genes onto STRINGv10. We then categorized all mutated genes in each patient into two groups based on whether they are known driver genes and analyzed their node degrees distributions in the personalized gene interaction subnetwork using the Satterthwaite approximation t-test (S2 Fig). Our analysis revealed that the interactions involved by known driver genes were generally more extensive compared to other mutant genes (p-value < 0.05). Furthermore, when we aggregated samples from all five cancer types into a large cohort, the observed difference became even more statistically significant (p-value 2.2×10−16).
PDRWH outperforms existing driver gene prediction methods in identifying known general drivers
We applied PDRWH to the datasets of 16 cancer types from TCGA (S1 Table). To evaluate the method, we utilized a union of four well-studied cancer gene databases as a general driver gene reference with a total of 758 genes, including the Cancer Gene Census (CGC) [28], the HiConf cancer gene panels [29], the high-confidence drivers (HCD) [30], and Mut-driver genes defined by the ‘20/20 rules’. We evaluated the performance of PDRWH’s ability to identify known general driver genes based on the top-ranked genes. As shown in the S2 Table, in as many as 85% of the total samples, the genes ranked first by PDRWH are known driver genes. Our approach shows exceptional performance in identifying known driver genes, particularly when considering genes at the top of the predictions, except in SKCM. In LGG and UCEC, this proportion even reaches up to 95.8% and 98.2%. On the whole, approximately 50% of the samples have the gene ranked second as a known driver gene. In SKCM, DNAH8 ranked first in 29% of samples for the personalized prediction. Despite not being listed as a known driver gene, previous studies have indicated its significant association with cancer, thereby categorizing it as potential driver gene [31,32]. As the ranking increases, the proportion of samples where these genes are known driver genes decreases, just as we expected.
Subsequently, we conducted a comparative analysis of PDRWH with four other personalized cancer driver prediction methods (PersonaDrive, Prodigy, DawnRank, and SCS) using five cancer datasets (BRCA, KIRC, LIHC, GBM, and STAD). All the methods utilized the same cancer dataset, which included somatic mutation data and the gene expression data of the tumor samples in TCGA, along with the same gene molecular network STRINGv10 for all the network-based methods. As for the SCS method, we had to limit the analysis to 50 randomly selected patients for each cancer type due to its extended runtime. This allowed us to efficiently run the SCS algorithm while still obtaining meaningful results for comparison with the other methods. At the same time, to highlight the impact of utilizing different data information on algorithm results, two naive methods were introduced as baselines. One method ranks genes based on the mutation frequency among samples, while the other method ranks genes based on the degree in the gene subnetwork induced by mutated genes within the sample. As mentioned above, we utilized the general driver gene reference as the benchmark of known drivers. From Figs 2 and S3, we found that methods utilizing the cohort data (PDRWH, PersonaDrive, and Frequency) consistently outperformed those relying on individual data (Prodigy, DawnRank, and Degree) in terms of average precision, recall, and F1-score. This highlights the significant advantage of leveraging collective data over individual data. Among the three cohort data analysis methods (PDRWH, PersonaDrive, and Frequency), PDRWH exhibits superior performance, achieving an outstanding precision rate of up to 88.4% and 76.0% for genes ranked first in BRCA and KIRC. Although PDRWH’s precision for the top-ranked gene in KIRC is slightly lower than PersonaDrive, its superiority becomes more noticeable as the ranking advances, particularly in terms of recall and F1-score. Therefore, PDRWH excels in providing a more precise and rational prioritization of known general driver genes for individual patients.
For each cancer type, we aggregated the personalized candidate driver gene rankings to create a prioritization for the cohort. This allowed us to compare the results of personalized prediction methods with those of cohort-level methods. Using the general reference driver gene list as a benchmark, we generated receiver operating characteristic curves (ROC) and calculated areas under the curve (AUC). In Figs 3 and S4, it is clear that PDRWH outperforms other tools in terms of sensitivity and specificity in identifying known driver genes in BRCA, KIRC, GBM, and STAD. Additionally, the AUC values of PDRWH are consistently at the highest level for BRCA, LIHC, and GBM, while ranked second compared to other tools in KIRC and STAD.
PDRWH achieves reliable results in identifying known tumor-specific drivers
Considering the diversity among cancer types, tumor-specific drivers hold more concern than the general drivers across different tumor types. Therefore, it is highly valuable to assess the ability of methods to accurately identify cancer-specific driver genes. To achieve this, we downloaded a set of tumor-specific driver genes from the IntOGen database as benchmarks. To further support the efficiency of PDRWH by statistical significance, the enrichment p-values of predicted genes in the tumor-specific driver gene lists were evaluated using the hypergeometric test, representing the significance of tumor-specific driver genes rediscovered by PDRWH. The computational details are provided in the Methods and Materials section. From the result of S5 Fig, we can observe that PDRWH shows some variability in performance across 16 cancer types. With the exceptions of LUSC and SKCM, as many as three-quarters of the samples from each cancer type demonstrate enrichment in the tumor-specific driver gene list.
For comparison, we used a naive method that randomly selected mutated genes as predicted personalized drivers as a baseline. From Fig 4A, we can find that the five personalized driver prediction methods show higher percentages of significant samples for identifying the cancer-specific driver genes than randomly chosen. In comparison to other methods, PDRWH consistently outperforms. For instance, in the case of BRCA, our method achieves significant enrichment in BRCA-specific cancer drivers for 85.19% of the samples. Similarly, for KIRC, LIHC, GBM, and STAD, the percentages are 77.41%, 74.16%, 78.67%, and 84.86% respectively. These results indicate that a considerable number of samples support the reliable outcomes produced by PDRWH, highlighting its effectiveness in predicting tumor-specific driver genes for individuals.
Assuming that the data of a particular cancer type from TCGA is sufficiently representative of the majority of samples for that cancer, the known tumor-specific driver genes should be detected in the corresponding cancer samples. For each cancer type, we collected the predicted personalized driver genes of all tumor samples and compared PDRWH with four other methods based on their overlap with the known tumor-specific cancer drivers. The number of known tumor-specific driver genes identified by different methods is illustrated in Fig 4B for each cancer type. PDRWH successfully detected 36 out of 39 known BRCA-specific driver genes and 18 out of 22 KIRC-specific drivers, demonstrating its superiority over the other methods. In the remaining three cancer types, our method exhibited similar performance to PersonaDrive while still outperforming DawnRank, Prodigy, and SCS.
Furthermore, we analyzed the overlap and difference of the identified known tumor-specific driver genes between different methods (Figs 4C and S6A). Since SCS utilized insufficient samples, its results were not included in the comparison. Our findings revealed that PDRWH was able to detect a majority of the known drivers predicted by other methods. It is noteworthy that PDRWH also identified known tumor-specific drivers that were missed by other methods, such as ABL2 for BRCA, SETBP1 for KIRC, NIN and TOP2A for STAD (S3 Table). This suggests that PDRWH can serve as a complementary approach to other methods to promote tumor-specific driver gene identification.
PDRWH efficiently identifies both common and rare drivers
One of the advantages of our method is its ability to identify both common and rare driver genes, demonstrating a balanced performance that effectively trades off algorithm generalizability and specificity. To further demonstrate this ability, we divided the top-ranked predicted driver genes into two categories based on the frequency of occurrence in the respective cancer-type cohort: common and rare. Genes with mutation frequency ≥2% are classified as common drivers, while those with a mutation frequency <2% are classified as the rare. The results are summarized in Figs 5A–5C and S6B. The majority of genes ranked at first are known high-frequency driver genes. As the ranking increases, even if the proportion of known high-frequency drivers decreases, the known low-frequency drivers always have a relatively stable proportion, which indicates the PDRWH’s capability in detecting known rare drivers. Figs 5D–5F and S6C show the scatter plots of the gene mutation frequency versus the frequency that appears as the personalized driver genes in patients. The mutation frequencies of the known driver genes (red dots) exhibit a remarkably high concordance with those drive genes predicted by PDRWH (Pearson R-square > 0.99, p-value < 2.2e-16). Across the five cancer types, the potential driver gene TTN with high mutation frequency is the most promising, which has been observed to be involved in several cancer functions and to become an effective predictor for overall survival and chemotherapy response [31]. In LIHC, ALB, a novel potential driver gene predicted by PDRWH, has been proposed as an effective biomarker for cancer detection [32]. In addition, all the predicted personalized driver genes but not the known drivers are also enriched in multiple cancer-related pathways based on the Database for Annotation, Visualization and Integrated Discovery (DAVID) online database [33] and Kyoto Encyclopedia of Genes and Genomes (KEGG) database [34]. The results of the enrichment analysis are shown in S4 Table and S7 Fig.
Subtypes recovered by expressions of predicted rare drivers are significantly associated with patient survival
Considering that the state of driver genes is supposed to reflect their phenotypic impact on carcinogenesis, we further verified the ability of the personalized drivers identified by PDRWH in stratifying tumor samples. Employing unsupervised K-means clustering, we separated the tumor samples into different subtypes based on the gene expression of the predicted personalized drivers. The number of clusters for each cancer type was determined using a CDF (Cumulative Distribution Function) curve. We suggested that there is a distinction in the effectiveness of the rare (<2%) drivers and the common (≥2%) drivers used for stratifying patients. From Figs 6A, 6B and S8, the expression of known tumor-specific drivers was unable to identify subtypes correlated with patient survival regardless of which part of the gene set was used. When the expression profiles on predicted drivers with high frequency were used, we could obtain a significant survival analysis result among KIRC patients only (Fig 6C). Notably, predicted driver genes with low frequency have significant prognostic values for survival in BRCA, KIRC, and LIHC (Fig 6D).
Similar results were also obtained in GBM and STAD (S9 Fig). It should be noted that the numbers of subtypes (we recognized five subtypes for BRCA, four subtypes for KIRC, four subtypes for LIHC, and three subtypes for both GBM and STAD) are consistent with the approbatory numbers of subtypes in literature [35–39]. To sum up, the rare drivers predicted by PDRWH are more conducive to subtype clustering, and we can assume that our methods capture more precise causative events for carcinogenesis in individuals.
Clinical application of the predicted driver genes
We next evaluated whether our predicted personalized driver genes can provide useful information to the oncologist in deciding on therapy. For each patient, the number of predicted personalized drivers in Therapeutically Applicable Research to Generate Effective Treatments (TARGET, 135 actionable genes), the Drug-Gene Interaction database (DGIdb, 1387 druggable genes) [40], and the union of the two sets (1407 actionable or druggable genes) was counted. As shown in Figs 7 and S10, more than three-quarters of patients have at least one actionable gene, and the majority of patients contain more than one druggable driver. As a case, in the union set, there are only 3.58% of patients on BRCA, 2.48% on KIRC, and 1.12% on LIHC without any actionable genes or druggable genes. These results confirm that the results predicted by PDRWH may be reasonable and useful references in individually tailored therapy.
Experimental validation of predicted novel cancer drivers
Finally, we performed in vitro cell-based assays for a novel drive gene identified by PDRWH (S1 Text). By way of illustration, Low-density lipoprotein receptor-related protein 1 (LRP1) was predicted as a driver gene in Human gastric cancer cells (GC) by our method and PersonaDrive. This gene was not detected by widely used methods, such as DawnRank, Prodigy, and SCS, and was not presented in the known driver gene list. As shown in Fig 8A and 8B, LRP1 was up-regulated in GC cells and GC tissues, especially, higher expressed in the HGC-27, MGC-803, and AGS cells. LRP1 was also negatively associated with the overall survival rates of patients with GC (Fig 8C). The overall survival analysis of LRP1 was based on Gene Expression Profiling Interactive Analysis (GEPIA) [41]. To investigate the potential cancer-related roles of LRP1, loss-of-function assays were performed in HGC-27 cells. After the transfection of three siRNAs respectively, which could produce specific weak knock (knockdown) effects on the LRP1 gene, LRP1 expression in the three experimental groups was significantly on the decline at the mRNA and protein levels (Fig 8D). By Wound healing assay, we also observed that the knockdown of LRP1 suppresses the metastatic ability of HGC-27 cells (Fig 8E). Besides, the knockdown of LRP1 increased the apoptosis rate of gastric cancer cells (Fig 8F), and inhibited cell proliferation reducing the proliferation rate of HGC-27 cells from 4.4% in controls to 1.2%~1.8% in experimental groups (Fig 8G). Furthermore, depletion of LRP1 could induce G1 and G2 phase arrest (Fig 8H). These collective preliminary results indicate that LRP1 predicted by PDRWH as a personalized driver gene is potentially involved in the development of GC.
Discussion
Identifying personalized driver genes that lead to particular cancer initiation and progression of individual patients is a crucial part of precision medicine. In this study, we have presented an unsupervised learning method to identify patient-specific driver genes by leveraging genome and transcriptome datasets from a cohort. PDRWH, similar to many unsupervised algorithms, can directly uncover hidden patterns and structures in data without the need for explicit model training or a large number of labeled genes, aiming to provide comprehensive support for the analysis of driver genes. The novelty of this study lies in the introduction of the concept and methodology of hypergraph random walks to predict personalized driver genes. The hypergraph model offers substantial benefits in terms of data integration and interpretability. By grouping patient-specific mutated genes and abnormally expressed genes within the corresponding hyperedge, the hypergraph model allows for a comprehensive representation of association among genes across multiple samples simultaneously, rather than separately. The random walk algorithm on the hypergraph is tailored to generate a quantitative assessment of the influence on the gene interaction network within the target sample and its neighboring samples, facilitating the systematic prioritization of candidate personalized driver genes. Comparisons based on the TCGA have demonstrated the superior performance of PDRWH over other computational methods in identifying the known cancer drivers. We believe our method will complement existing driver identification methods and will help us discover potential personalized drivers, especially those rare drivers that often escape detection by other methods.
One limitation of the current model is that it relies on a broad-context molecular network rather than a patient-specific one. As a result, it overlooks regulation information that is specific to individual patients, which could potentially lead to false positives in the results. Moreover, PDRWH focuses on prioritizing single genes, while it is well-established that genes often collaborate to drive cancer initiation and progression. Therefore, there is a clear need for computation methods to elucidate how these genetic aberrations collaborate to induce transcriptional abnormalities, and ultimately lead to the onset of cancer. Additionally, our current model primarily focuses on point mutations, including single-nucleotide variants (SNVs) and short insertions or deletions (indels), due to their prevalence, ease of detection, and potential as genetic markers for specific phenotypes that promote tumor formation. However, the impact of other somatic alterations like amplifications, genomic rearrangements, and epigenetic silencing, which are also crucial in tumorigenesis, has not been considered. Integrating information on all these alterations would improve the identification of driver genes in cancer.
Despite these limitations, PDRWH has demonstrated reliable performance in inferring personalized driver genes, which is promising for discovering potential causal genetic variants that would be obscured by tumor heterogeneity. In the future, we expect that PDRWH will assist in the development of optimal personalized treatment.
Methods and materials
Data resources
In this work, we use two types of genomic data from a cohort: somatic mutation data, which includes non-synonymous point mutations and insertions/deletions (indels) in coding regions, and gene expression profiles. We downloaded 16 cancer datasets that contained a sufficient number of samples with both mutation and gene expression data (>150 samples) from the TCGA data portal [42] through the Xena platform [43]. The samples with less than three mutated genes in the cohort were filtered out. PDRWH also uses a gene interaction network: a global PPI network taken from STRINGv10 [44]. This network includes 17084 genes and 3513941 interactions. Information about the databases is given in the S1 Table. It should be noted that we primarily used five common types of cancer as examples to demonstrate the evaluation of the algorithm’s performance, including breast invasive carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), liver cancer (LIHC), glioblastoma (GBM), and stomach adenocarcinoma (STAD). Additionally, we also provided a summary of the performance of PDRWH on the other cancer types.
Pre-processing
In a given cohort gene expression profile X = {xi,j}, where the rows represent genes and columns represent patient samples, the gene expression values are processed using z-score normalization by the following formula:
(1) |
where μi and σi are the mean expression values and standard deviation of the gene i. To identify the abnormally expressed genes for each patient, a threshold θ is set for each gene, which is the smaller value between the absolute values of the 5% and 95% quantiles among {}, to pick the significant high or low expression values of the standard gene expression profile X′. A gene i is regarded as an abnormally expressed gene of sample j if its absolute value of is more than θ, which indicates that the gene is expressed quite differently in this sample compared to the other samples. The advantage of this definition of abnormally expressed genes is that it does not rely on the existence of paired normal and tumor data of the same patient or background gene expression profile from healthy samples. It provides a robust method to detect genes with distinct expression patterns within the cohort. For each sample, the number of genes contained in the set of abnormally expressed genes is approximately about 400 to 1600.
Construction of personalized hypergraph model
At first, a personalized weighted hypergraph model is constructed to accurately capture the implicit inherent similarity of samples and the association between the mutated genes and abnormally expressed genes. For a target sample s0, if a sample shares at least one co-mutant gene with this one, it is defined as a neighbor of that patient. Let S = {s1,…,sn} denote the set of neighbor samples of the target sample. Then, we defined the patient-specific hypergraph H(V,E), where V is a set of vertices representing all mutant genes and abnormally expressed genes of s0, and E = e0∪{e1,…,en} is a set of hyperedges. Hyperedge e0 represents the target sample s0, which is incident with node v if this gene is mutated or abnormally expressed in this sample. Likewise, hyperedges e1,…,en, corresponding samples s1,…,sn, are incident with their mutant genes and abnormally expressed within vertex set V. The incidence matrix H∈ℛ|V|×|E| is defined as follows:
(2) |
which indicates whether vertices are incident with the hyperedges. According to the assumption that patients with similar gene expression profiles may have the more similar pathogenic mechanism to each other than the rest of the patients, the most closely related patient in terms of its gene expression profile will contribute much more to the prediction of driver genes compared to other patients. Therefore, the weight of a hyperedge should be an increased function with their correlation to the target sample. A fairly standard choice for the weights is:
(3) |
in which ρ(e, eo) is the Pearson correlation of gene expression profiles between sample s and the target sample s0, and δ is the bandwidth parameter (default δ = 0.1) controlling how quickly the weight of a sample falls off with the distance of s from the query point s0. Here, if ρ(e, eo) is close to 1, we will also be close to 1, implying that this sample has a high impact in the evaluation of driver genes of target sample s. On the contrary, if ρ(e, eo) is small (e.g., close to 0), we will be relatively small too. In this case, the corresponding neighbor will have a weak contribution to the determination of the driver genes of the target sample. Then the weight matrix of hyperedge is defined as the diagonal matrix:
(4) |
To model the relationship between the mutated genes and the abnormally expressed genes, we project the mutant genes and abnormally expressed genes in each hyperedge onto a human gene interaction network. PDRWH views the gene network as an undirected graph. For gene vi and vj in hyperedge e, an edge exists if the two genes interact in the gene interaction network. This way, the vertices in each hyperedge e induce a corresponding gene interaction subnetwork Ne. Since the driver genes tend to adjacent more abnormally expressed genes in the subnetwork Ne (S1 Fig), the weight of node v in the hyperedge e, denoted as w(ve), can be set as its degree in Ne. The matrix Wv∈ℛ|V|×|E| is defined as follows:
(5) |
Since the interaction of isolated vertices with other genes in the network is unknown, the weights of the isolated vertices are set to a small value of 0.01 instead of 0. Then, the degrees of vertices and hyperedges in the weighted hypergraph are defined as:
(6) |
(7) |
Transition probability matrix of the random walks on hypergraph
A random walk on a hypergraph H(V, E) is similar to the classic random walk, where transitions occur between two incident vertices in the hyperedge [45]. Specifically, the movement between vertices is modeled as a discrete-time Markov chain based on predefined transition probabilities. A standard formulation for a hypergraph random walk can be broken down into two steps. Given the current state :
Starting at a vertex u, a hyperedge is selected with a probability determined by the weights of hyperedges we.
a vertex v is chosen from the selected hyperedge e. The walker can travel to any nodes within the selected hyperedge based on the weights of the vertex in the hyperedge w(v, e).
Thus, transition probabilities from vertex u to vertex v are calculated as follows:
(8) |
which can be written in an alternative matrix form:
(9) |
where H, We, and Wv are defined as previously mentioned, while Dv and Dve represent the diagonal matrix for the degrees of vertices and hyperedges, with d(v) and δ(e) being the respective diagonal elements.
Generalized random walks on the personalized hypergraph model
We implement a random walk with restart on the personalized hypergraph by adding a damping factor. Specifically, all the mutated genes in the target sample s0 are assumed to have an equal probability of being driver genes. Therefore, an equal probability of is assigned to each of the mutated genes in s0 initially, where n represents the number of mutated genes in s0. The initial values of the abnormally expressed genes are set to zero. Let the initial column vector be denoted as . The process can be mathematically represented by the following formula:
(10) |
where the i-th element of state represents the probability that the walker moves to node i at step t. The damping factor, α (0<α<1), is introduced to ensure the graph satisfies ergodic conditions [45]. In our study, we have empirically set the damping factor α to 0.85. The term in the formula means that the random surfer may transition to one of the adjacent vertices, while represents a vector introducing the probability of teleporting the random walk back to the initial state. After several iterations of the random walk, the distribution vector stabilizes when the difference between and measured by L1 norm falls below a small ε (default 10−6). PDRWH algorithm generally converges within ten times iterations. The stationary probability indicates the likelihood of genes being personalized cancer drivers of the target sample. Finally, the values of mutated genes in the stationary probability vector are preserved and then normalized to generate the PDRWH-score. PDRWH-scores can be ranked in descending order to prioritize the personalized candidate driver genes.
Comparison to other methods
We utilized two benchmarking measures for comparison of the methods’ ability to identify known personalized driver genes. One benchmark is the ability to recapitulate many of the well-studied general cancer-associated genes. We assembled a general driver list of 758 known cancer driver genes from various sources, including the Cancer Gene Census (CGC) [28], the HiConf cancer gene panels [29], the high-confidence drivers (HCD) identified by a rule-based method [30], and Mut-driver genes defined by the ‘20/20 rules’, which identifies driver genes based on the characteristic mutational patterns for oncogenes and tumor suppressor genes [46]. This list served as an approximate benchmark of known general drivers for validation. Then, we defined personalized drivers predicted by PDRWH as the top-n ranked genes, where n was assigned as twice the median of number of mutated genes in the general driver set across the population of patients [21]: 8 for breast cancer (BRCA), 10 for kidney clear cell carcinoma (KIRC), 12 for liver cancer (LIHC), 8 for glioblastoma (GBM), and 16 for stomach cancer (STAD). We used the modified REA strategy proposed by PersonaDrive [21] for a comparison of PDRWH with six personalized prediction methods (DawnRank, Prodigy, SCS, PersonaDrive, Degree and Frequency) across five cancer types from TCGA. For each sample, the identified cancer drivers in the general driver list were adopted to compute the Precision, Recall, and F1-score. Three measurements were generated for each individual:
(11) |
(12) |
(13) |
in which |∙| is the number of genes in a set. The averaged values were calculated for the sake of comparison. To predict driver genes for a cohort and compare them with other cohort-level methods, we utilized an adapted version of PageRank, considering the personalized driver gene ranking score as the voters’ preference for candidate driver genes. Using the aforementioned general driver genes as a benchmark, we generated receiver operating characteristic (ROC) curves and calculated areas under the curve (AUCs) to evaluate the true positive and false positive rates. All the details of this study are provided in the S1 Text.
The second benchmark involves the identification of tumor-specific driver genes. As there is a remarkable discrepancy among different cancer types, we downloaded a list of tumor-specific driver genes from the IntOGen database [47]. This list is considered to be the best trade-off between sensitivity and specificity among those currently available (S5 Table). Given the top-n predicted personalized drivers, an enrichment analysis of the personalized driver genes was performed using the hypergeometric test:
(14) |
where N represents the total number of genes in a patient, M is the number of genes in the known tumor-specific driver gene list, n is the number of predicted personalized driver genes of the patient, and m is the number of overlapping genes between the known tumor-specific driver genes and predicted personalized driver genes of the individual. If P(X≥m)<0.05, it indicates that the predicted driver genes for this patient is significantly enriched in known driver genes. Additionally, we investigated the consistency and differences in the identified tumor-specific driver genes among different methods.
Supporting information
Acknowledgments
We acknowledge the Cancer Genome Atlas Research Network for providing publicly accessible data. The findings presented here are partially or entirely based on data produced by the TCGA Research Network: https://www.cancer.gov/tcga.
Data Availability
The source code and data used to generate the results presented in this manuscript are available from the GitHub repository: https://github.com/ShandongUniversityMasterMa/PDRWH.
Funding Statement
This work has been supported by the National Natural Science Foundation of China [62072277 to NZ, 61972257 to XZ and 61877064 to YZ]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. Ca-a Cancer Journal for Clinicians. 2021;71(3):209–49. doi: 10.3322/caac.21660 WOS:000614520100001. [DOI] [PubMed] [Google Scholar]
- 2.Vandin F. Computational Methods for Characterizing Cancer Mutational Heterogeneity. Frontiers in genetics. 2017;8:83. doi: 10.3389/fgene.2017.00083 . [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Jang HS, Shah NM, Du AY, Dailey ZZ, Pehrsson EC, Godoy PM, et al. Transposable elements drive widespread expression of oncogenes in human cancers. 2019;51(4):611–7. doi: 10.1038/s41588-019-0373-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Pham VVH, Liu L, Bracken C, Goodall G, Li J, Le TDJT. Computational methods for cancer driver discovery: A survey. 2021;11(11):5553. doi: 10.7150/thno.52670 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Vogelstein B, Kinzler KWJNEJM. The path to cancer—three strikes and you’re out. 2015;373(20):1895–8. doi: 10.1056/NEJMp1508811 [DOI] [PubMed] [Google Scholar]
- 6.Garraway LA, Lander ESJC. Lessons from the cancer genome. 2013;153(1):17–37. doi: 10.1016/j.cell.2013.03.002 [DOI] [PubMed] [Google Scholar]
- 7.Papaemmanuil E, Gerstung M, Bullinger L, Gaidzik VI, Paschka P, Roberts ND, et al. Genomic classification and prognosis in acute myeloid leukemia. 2016;374(23):2209–21. doi: 10.1056/NEJMoa1516192 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Zhang JJ, Bajari R, Andric D, Gerthoffert F, Lepsa A, Nahal-Bose H, et al. The International Cancer Genome Consortium Data Portal. Nature Biotechnology. 2019;37(4):367–9. doi: 10.1038/s41587-019-0055-9 WOS:000463006000014. [DOI] [PubMed] [Google Scholar]
- 9.Raphael BJ, Dobson JR, Oesper L, Vandin FJGm. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine. 2014;6(1):1–17. doi: 10.1186/gm524 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Tomczak K, Czerwinska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol (Pozn). 2015;19(1A):A68–77. doi: 10.5114/wo.2014.47136 ; PubMed Central PMCID: PMC4322527. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–8. doi: 10.1038/nature12213 WOS:000321557600063. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Tian R, Basu MK, Capriotti EJB. ContrastRank: a new method for ranking putative cancer driver genes and classification of tumor samples. 2014;30(17):i572–i8. doi: 10.1093/bioinformatics/btu466 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Han Y, Yang JZ, Qian XY, Cheng WC, Liu SH, Hua X, et al. DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies. Nucleic Acids Research. 2019;47(8). doi: 10.1093/nar/gkz096 WOS:000473754300003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Leiserson MD, Vandin F, Wu H-T, Dobson JR, Eldridge JV, Thomas JL, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. 2015;47(2):106–14. doi: 10.1038/ng.3168 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Hofree M, Shen JP, Carter H, Gross A, Ideker TJNm. Network-based stratification of tumor mutations. 2013;10(11):1108–15. doi: 10.1038/nmeth.2651 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Pham VVH, Liu L, Bracken CP, Nguyen T, Goodall GJ, Li JY, Le TD. pDriver: a novel method for unravelling personalized coding and miRNA cancer drivers. Bioinformatics. 2021;37(19):3285–92. doi: 10.1093/bioinformatics/btab262 WOS:000733827400026. [DOI] [PubMed] [Google Scholar]
- 17.Aizarani N, Saviano A, Sagar Mailly L, Durand S, Herman JS, et al. A human liver cell atlas reveals heterogeneity and epithelial progenitors. Nature. 2019;572(7768):199–204. WOS:000479172800038. doi: 10.1038/s41586-019-1373-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Hou JP, Ma J. DawnRank: discovering personalized driver genes in cancer. Genome Med. 2014;6(7):56. Epub 20140731. doi: 10.1186/s13073-014-0056-8 ; PubMed Central PMCID: PMC4148527. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Dinstag G, Shamir R. PRODIGY: personalized prioritization of driver genes. Bioinformatics. 2020;36(6):1831–9. doi: 10.1093/bioinformatics/btz815 ; PubMed Central PMCID: PMC7703777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Guo WF, Zhang SW, Liu LL, Liu F, Shi QQ, Zhang L, et al. Discovering personalized driver mutation profiles of single samples in cancer by network control strategy. Bioinformatics. 2018;34(11):1893–903. doi: 10.1093/bioinformatics/bty006 . [DOI] [PubMed] [Google Scholar]
- 21.Erten C, Houdjedj A, Kazan H, Taleb Bahmed AA. PersonaDrive: A Method for the Identification and Prioritization of Personalized Cancer Drivers. Bioinformatics. 2022. Epub 20220517. doi: 10.1093/bioinformatics/btac329 . [DOI] [PubMed] [Google Scholar]
- 22.El Tekle G, Bernasocchi T, Unni AM, Bertoni F, Rossi D, Rubin MA, Theurillat JP. Co-occurrence and mutual exclusivity: what cross-cancer mutation patterns can tell us. Trends Cancer. 2021;7(9):823–36. WOS:000686370700004. doi: 10.1016/j.trecan.2021.04.009 [DOI] [PubMed] [Google Scholar]
- 23.Jiang L, Yu H, Ness S, Mao P, Guo F, Tang J, Guo YJC. Comprehensive Analysis of Co-Mutations Identifies Cooperating Mechanisms of Tumorigenesis. 2022;14(2):415. doi: 10.3390/cancers14020415 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Bretto A, SpringerLink. Hypergraph Theory: An Introduction. 1st 2013. ed. Cham: Springer International Publishing: Imprint: Springer; 2013. [Google Scholar]
- 25.Barbarossa S, Tsitsvero M. An Introduction to Hypergraph Signal Processing. Int Conf Acoust Spee. 2016:6425–9. WOS:000388373406116. [Google Scholar]
- 26.Bretto A. Introduction to hypergraph theory and its use in engineering and image processing. Adv Imag Elect Phys. 2004;131:1–64. doi: 10.1016/S1076-5670(04)31001-3 WOS:000222823700001. [DOI] [Google Scholar]
- 27.Zhang SY, Ding Z, Cui SG. Introducing Hypergraph Signal Processing: Theoretical Foundation and Practical Applications. Ieee Internet Things. 2020;7(1):639–60. doi: 10.1109/Jiot.2019.2950213 WOS:000508181000049. [DOI] [Google Scholar]
- 28.Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research. 2017;45(D1):D777–D83. doi: 10.1093/nar/gkw1121 WOS:000396575500108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Kumar RD, Searleman AC, Swamidass SJ, Griffith OL, Bose R. Statistically identifying tumor suppressors and oncogenes from pan-cancer genome-sequencing data. Bioinformatics. 2015;31(22):3561–8. doi: 10.1093/bioinformatics/btv430 WOS:000365134700001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Reimand J, Wagih O, Bader GD. The mutational landscape of phosphorylation signaling in cancer. Scientific Reports. 2013;3. doi: 10.1038/srep02651 WOS:000325126400002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Xue D, Lin H, Lin L, Wei Q, Yang S, Chen X. TTN/TP53 mutation might act as the predictor for chemotherapy response in lung adenocarcinoma and lung squamous carcinoma patients. Transl Cancer Res. 2021;10(3):1284–94. doi: 10.21037/tcr-20-2568 ; PubMed Central PMCID: PMC8798240. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Lu Y, Yu H, Qin S, Chen M, Chen H, Hu Z, et al. Single and Combined Use of Preoperative In fl ammatory Biomarkers and CA199 in Diagnosing Pancreatic Cancer. Clin Lab. 2020;66(11). doi: 10.7754/Clin.Lab.2020.200111 . [DOI] [PubMed] [Google Scholar]
- 33.Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022;50(W1):W216–W21. doi: 10.1093/nar/gkac194 ; PubMed Central PMCID: PMC9252805. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27 ; PubMed Central PMCID: PMC102409. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ensenyat-Mendez M, iniguez-Munoz S, Sese B, Marzese DM. iGlioSub: an integrative transcriptomic and epigenomic classifier for glioblastoma molecular subtypes. Biodata Mining. 2021;14(1). doi: 10.1186/s13040-021-00273-8 WOS:000687692700001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Luo QZ, Vogeli TA. Reclassification of Kidney Clear Cell Carcinoma Based on Immune Cell Gene-Related DNA CpG Pairs. Biomedicines. 2021;9(2). doi: 10.3390/biomedicines9020215 WOS:000622121400001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Zhu YM, Zhao Y, Cao ZS, Chen ZH, Pan WS. Identification of three immune subtypes characterized by distinct tumor immune microenvironment and therapeutic response in stomach adenocarcinoma. Gene. 2022;818. doi: 10.1016/j.gene.2021.146177 WOS:000750792500010. [DOI] [PubMed] [Google Scholar]
- 38.Cao D, Chen M-K, Zhang Q-F, Zhou Y-F, Zhang M-Y, Mai S-J, et al. Identification of immunological subtypes of hepatocellular carcinoma with expression profiling of immune-modulating genes. 2020;12(12):12187. doi: 10.18632/aging.103395 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Hu Z, Fan C, Oh DS, Marron J, He X, Qaqish BF, et al. The molecular portraits of breast tumors are conserved across microarray platforms. 2006;7(1):1–12. doi: 10.1186/1471-2164-7-96 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Freshour SL, Kiwala S, Cotto KC, Coffman AC, McMichael JF, Song JJ, et al. Integration of the Drug-Gene Interaction Database (DGIdb 4.0) with open crowdsource efforts. Nucleic Acids Res. 2021;49(D1):D1144–D51. doi: 10.1093/nar/gkaa1084 ; PubMed Central PMCID: PMC7778926. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Li C, Tang Z, Zhang W, Ye Z, Liu F. GEPIA2021: integrating multiple deconvolution-based analysis into GEPIA. Nucleic Acids Res. 2021;49(W1):W242–W6. doi: 10.1093/nar/gkab418 ; PubMed Central PMCID: PMC8262695. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Cancer Genome Atlas Research N, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–20. doi: 10.1038/ng.2764 ; PubMed Central PMCID: PMC3919969. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Goldman MJ, Craft B, Hastie M, Repecka K, McDade F, Kamath A, et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol. 2020;38(6):675–8. doi: 10.1038/s41587-020-0546-8 ; PubMed Central PMCID: PMC7386072. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research. 2015;43(D1):D447–D52. doi: 10.1093/nar/gku1003 WOS:000350210400068. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Bellaachia A, Al-Dhelaan M. Random Walks in Hypergraph. Int J Educ Inf Tech. 2021;15:13–20. doi: 10.46300/9109.2021.15.2 WOS:000644871500002. [DOI] [Google Scholar]
- 46.Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J, Tamborero D, Schroeder MP, Jene-Sanz A, et al. IntOGen-mutations identifies cancer drivers across tumor types. Nature Methods. 2013;10(11):1081–2. doi: 10.1038/nmeth.2642 WOS:000326507600019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Martinez-Jimenez F, Muinos F, Sentis I, Deu-Pons J, Reyes-Salazar I, Arnedo-Pac C, et al. A compendium of mutational cancer driver genes. Nat Rev Cancer. 2020;20(10):555–72. Epub 20200810. doi: 10.1038/s41568-020-0290-x . [DOI] [PubMed] [Google Scholar]