NPJ Precision Oncology. 2025 Mar 29;9:92. doi: 10.1038/s41698-025-00871-3

Element-specific estimation of background mutation rates in whole cancer genomes through transfer learning

Farideh Bahari 1,2,3, Reza Ahangari Cohan 1, Hesam Montazeri 2
PMCID: PMC11953285  PMID: 40155429

Abstract

Mutational burden tests are essential for detecting signals of positive selection in cancer driver discovery by comparing observed mutation rates with background mutation rates (BMRs). However, accurate BMR estimation is challenging due to the diversity of mutational processes across genomes, complicating driver discovery efforts. Existing methods rely on various genomic regions and features for BMR estimation but lack a model that integrates both intergenic intervals and functional genomic elements on a comprehensive set of genomic features. Here, we introduce eMET (element-specific Mutation Estimator with boosted Trees), which employs 1372 (epi)genomic features from intergenic data and fine-tunes it with element-specific data through transfer learning. Applied to PCAWG somatic mutations, eMET significantly improves BMR accuracy and has potential to enhance driver discovery. Additionally, we provide an extensive analysis of BMR estimation, examining different machine learning models, genomic interval strategies, feature categories, and dimensionality reduction techniques.

Subject terms: Cancer genomics, Computational biology and bioinformatics, Cancer genetics

Introduction

Genetic mutations are the primary cause of cancer development1,2. Somatic mutations accumulate due to somatic background mutability and natural selection3,4. Various mutational and repair processes lead to high variability in background mutation rates (BMRs) across different genomic regions or cancers4,5. This regional BMR variability is influenced by genomic and epigenomic features, including replication timing, histone modifications, transcriptional activity, chromatin accessibility, and nucleotide context4,6–8. A small subset of mutations occurs more frequently than expected under the BMR, likely due to positive selection in cancer9,10. The genomic regions or bases under positive selection, which confer a selective advantage to cancer cells and facilitate the acquisition of cancer hallmarks, are called cancer drivers11.

Identifying cancer drivers is crucial for understanding cancer progression and developing targeted therapies. However, accurately modeling BMR, especially in non-coding regions, and accounting for the high heterogeneity in cancer complicate this task9. The development of more precise BMR models is essential for improving the identification of cancer drivers, reducing false positives, and discovering novel drivers. Clinical trials such as Molecular Analysis for Therapy Choice (NCI-MATCH) and Molecular Profiling-based Assignment of Cancer Therapy (NCI-MPACT) highlight the importance of driver identification. NCI-MATCH assigns treatments based on genetic mutation profiles rather than tumor histology, while NCI-MPACT tests the efficacy of targeting oncogenic driver mutations in advanced cancer patients (Shin, Bode, and Dong 2017). Identifying cancer drivers enhances our understanding of tumor biology and may lead to new therapeutic options, highlighting the importance of developing new driver discovery methods and accurate BMR modeling.

Collaborative efforts such as the Pan-Cancer Analysis of Whole Genomes (PCAWG) project have significantly advanced cancer genomics by providing high-quality whole genome sequence (WGS) datasets from 2658 patients5. Various methods have been developed to identify cancer drivers by detecting signals of positive selection through an excess of mutations (or mutational recurrence), functional impact bias, trinucleotide-specific bias, and clustering of mutations2. Detecting these signals requires careful modeling of BMR. Previous driver discovery methods define the BMR model using different strategies. CNCDriver constructs the background model by sampling simulated combined scores (functional scores and mutation recurrence within specific genomic elements) from other elements of the same type as the element of interest and by considering covariates of mutation rates, such as replication timing, mutational sequence context, DNase I hypersensitive sites, and histone modification marks12. Other methods, such as ActiveDriverWGS, DriverPower, Dig, and MutSpot, use machine learning algorithms to model mutation rates as a function of genomic and/or epigenomic covariates. These methods differ in the regions of the genome used to construct the input feature matrix for the background model. DriverPower uses intergenic intervals and a comprehensive set of features to train the BMR model and then applies a functional impact test to improve the driver discovery task13. ActiveDriverWGS uses the nucleotide context in a window of at least ±50 kb around the element of interest as a background14. Other approaches such as MutSpot, a regression-based method, and Dig, a model based on deep neural networks, search the entire genome for evidence of mutations that are under positive selection15,16. However, none of the previous methods have integrated both element-specific and intergenic information along with a comprehensive set of features to build the background model.

In this study, we introduce eMET (element-specific Mutation Estimator with boosted Trees), a novel model designed for building an accurate BMR, facilitating the discovery of cancer drivers in both coding and non-coding regions (Fig. 1). eMET enhances the accuracy of BMR estimation by incorporating mutation rates and a comprehensive list of (epi)genomic features from both intergenic regions and elements of the same type. To achieve this, eMET first builds an initial boosted-tree model with intergenic data on a comprehensive set of genomic features and then fine-tunes it on bootstrap samples that incorporate element-specific data. The aggregation of predictions from bootstrap samples improves the accuracy of driver discovery. To develop eMET, we conducted a thorough analysis of the factors influencing mutation rates, including feature importance analyses, intergenic interval generation methods, and various machine learning (ML) models. Applying eMET to pan-cancer data from 2253 PCAWG donors and to 23 different cancer types demonstrated its effectiveness, surpassing other machine learning models that rely solely on intergenic or element-specific information for BMR estimation. By offering a comprehensive benchmark analysis and introducing eMET, this study provides novel insights into BMR estimation and highlights its potential in enhancing the identification of driver genomic elements in cancer.

Fig. 1. Summary of the eMET algorithm.


The eMET algorithm enhances the accuracy of the XGBoost model trained on intergenic intervals. The workflow begins with a pretrained intergenic XGBoost model (M0), which is then fine-tuned using element-specific XGBoost models (Mi) trained on bootstrap samples of elements from the element type of interest. For the element-specific training set, all known drivers are excluded. The background mutation rate (BMR) estimation for a given element is achieved by aggregating the out-of-bag (OOB) predictions from n bootstrap samples (here n = 100), i.e., using the bootstrapped models learned on the bootstrap samples without the given element. We considered the 5’UTR, 3’UTR, CDS, core promoter, enhancer, and splice sites.
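The out-of-bag aggregation described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fine-tuning step is replaced by a hypothetical stand-in (`fit_correction`, a single rescaling factor) so the example runs with the Python standard library alone, whereas eMET fine-tunes XGBoost models. The names `oob_bmr`, `base_pred`, and `observed` are illustrative.

```python
import random

def fit_correction(base_pred, observed):
    """Stand-in for fine-tuning (assumption): one multiplicative factor that
    rescales the pretrained model's predictions to the bootstrap sample."""
    total_pred = sum(base_pred)
    return sum(observed) / total_pred if total_pred > 0 else 1.0

def oob_bmr(base_pred, observed, n_boot=100, seed=0):
    """Average out-of-bag (OOB) predictions over bootstrap samples.

    base_pred: predictions of the pretrained intergenic model (M0) per element
    observed:  observed mutation rates for the same elements
               (known drivers are assumed to have been removed beforehand)
    Returns one aggregated prediction per element, or None for an element
    that happened to be in every bootstrap sample.
    """
    rng = random.Random(seed)
    n = len(base_pred)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_bag = set(idx)
        factor = fit_correction([base_pred[i] for i in idx],
                                [observed[i] for i in idx])
        for j in range(n):
            if j not in in_bag:  # model i never saw element j
                sums[j] += factor * base_pred[j]
                counts[j] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Because each element is scored only by models fitted without it, an element carrying an excess of mutations cannot inflate its own background estimate.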

Results

We conducted our analyses on a pan-cancer cohort of 2253 high-quality, non-hypermutated whole cancer genomes from 33 cancer types, as well as 23 different cancer-specific cohorts. Somatic SNV and indel calls were obtained from the PCAWG project5. For the pan-cancer analysis, we excluded the Skin-Melanoma, Lymph-CLL, Lymph-NOS, and Lymph-BNHL cohorts due to evidence of increased mutation rates in their transcription factor binding sites5,17–19. In this cohort, we used 21,183,078 somatic single nucleotide variants (SNVs), dinucleotide variants, multi-nucleotide variants, and small indels to comprehensively assess the performance of our proposed approach, eMET, for BMR estimation compared to other machine learning methods in various settings. In the cancer-specific BMR estimation, we compared eMET with three published tools: ActiveDriverWGS13,14, DriverPower13, and Dig15. Additionally, we evaluated the effectiveness of the eMET method in enhancing the identification of cancer drivers.

XGBoost outperforms other ML models in predicting BMR across various sample sizes

We compared different fixed-size (1M, 100k, 50k, and 10k) and variable-size intergenic intervals using random forest (RF), XGBoost, and two neural network models with different loss functions to determine the optimal interval generation method for predicting BMR. To this end, we trained models on 555,855 variable-size intergenic intervals and various fixed-size intervals, each harboring at least one mutation in the studied cohort. Supplementary Fig. 1f shows the heatmap for one-sided significance analysis across various combinations of machine learning algorithms and intergenic interval generation methods. Our analyses revealed that all ML methods performed better when trained with variable-size intervals than with fixed-size intervals, for both the validation set and functional elements. The exception was the two neural networks in some elements (such as CDSs), where 10k bp fixed-size intervals outperformed variable-size intervals (Fig. 2b, Supplementary Fig. 1a, f). Therefore, we used the variable-size intergenic intervals in our subsequent analyses.

Fig. 2. Comparison of various machine learning algorithms on different intergenic interval generation methods and sample sizes.


a variable-size intergenic intervals are generated by removing PCAWG elements from the callable regions of hg19, and fixed-size intergenic intervals are generated by splitting the hg19 genome into fixed-size intervals (1M, 100k, 50k, and 10k) followed by removing non-callable regions and PCAWG elements. b Correlation between observed and predicted background mutation rate (BMR) using different interval generation methods (fixed-size intervals of 1M, 100k, 50k, and 10k, as well as variable-size intervals) by applying RF, XGBoost, and two neural networks. For the intergenic elements, we used the mean correlation obtained from 10 repeats of train-validation splits. c Effect of sample size on the performance of different BMR prediction models. The average correlation between observed and predicted rates using the intergenic validation set with different sample sizes (error bars represent the standard deviation of the correlations in 10 repeated train-validation sets). d Correlation between observed and predicted XGBoost BMR on functional elements with different sample sizes. DS stands for downsampling. ‘FullSet’ represents the full set of all variable-size intergenic intervals with at least one mutation (n = 1,178,722). DS1M to DS50k indicates the number of randomly selected intergenic intervals in the downsampling procedure, ranging from 1M (1,000,000 intervals) to 50k (50,000 intervals).
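As a rough illustration of panel a, variable-size intergenic intervals can be obtained by subtracting annotated elements from callable regions. The sketch below assumes half-open `(start, end)` coordinates on a single chromosome; the function name `subtract_intervals` is hypothetical, and the actual pipeline may rely on dedicated genomic interval tools instead.

```python
def subtract_intervals(callable_regions, elements):
    """Return the parts of callable_regions not covered by any element.

    Both arguments are lists of half-open (start, end) pairs on one
    chromosome. The returned pieces naturally vary in size.
    """
    elements = sorted(elements)
    out = []
    for start, end in sorted(callable_regions):
        cursor = start
        for e_start, e_end in elements:
            if e_end <= cursor:
                continue  # element ends before the current position
            if e_start >= end:
                break     # remaining elements start past this region
            if e_start > cursor:
                out.append((cursor, e_start))  # gap before the element
            cursor = max(cursor, e_end)
            if cursor >= end:
                break
        if cursor < end:
            out.append((cursor, end))  # tail after the last element
    return out

# One callable region with two annotated elements removed.
intergenic = subtract_intervals([(0, 100)], [(10, 20), (50, 60)])
```

Fixed-size intervals would instead come from tiling the genome first and then applying the same subtraction to each tile.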

Comparing predicted values with observed values using variable-size intervals, tree-based methods (RF and XGBoost) consistently outperformed neural networks on both the intergenic validation set and functional elements (Fig. 2b, Supplementary Figs. 1a, 2). The XGBoost model demonstrated the best performance on the intergenic validation set, with a mean correlation of 0.741 (Fig. 2b). Among the neural networks with the same architecture, the one with the Poisson loss significantly outperformed the one with the mean squared error (MSE) loss on the intergenic validation set (mean correlation = 0.715 vs 0.603, q-value = 2.160e-8) (Fig. 2b, Supplementary Fig. 1f). These results underscore the superiority of tree-based methods, particularly XGBoost, in predicting BMR, and highlight the importance of using variable-size intervals for training models in this context.
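To make the contrast between the two neural-network objectives concrete, the following sketch writes out both losses for count data. It assumes the networks predict a non-negative expected mutation count per interval; the exact parameterization used in the study is not stated in this section, and the function names are illustrative.

```python
import math

def mse_loss(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((y - mu) ** 2 for y, mu in zip(observed, predicted)) / len(observed)

def poisson_nll(observed, predicted, eps=1e-12):
    """Mean Poisson negative log-likelihood.

    The log(y!) term is dropped because it does not depend on the
    prediction; eps guards against log(0).
    """
    return sum(mu - y * math.log(mu + eps)
               for y, mu in zip(observed, predicted)) / len(observed)
```

Both losses are minimized when the prediction equals the observed count, but the Poisson loss penalizes a given absolute error more heavily at low counts, which is where most genomic intervals lie.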

We restricted our analysis and model training to mutated intervals because training the XGBoost model on intervals with at least one mutation resulted in enhanced performance compared to training on all intergenic intervals (mean correlation = 0.741 vs. 0.523, q-value = 1.00e-16) (Supplementary Fig. 3). We also assessed whether filtering shorter intervals would enhance performance. Specifically, we examined the influence of model training on all intervals, as well as on intervals longer than 100 base pairs, with and without mutations. The results indicated that training the model on intervals of all lengths with at least one mutation yielded greater performance than training on intervals that also included those without mutations (Supplementary Fig. 3).

To determine the impact of sample sizes on BMR prediction, we performed downsampling on a full set of intergenic intervals. To generate this full set of intergenic intervals, we randomly sampled 1,400,000 intervals from the length distribution of the PCAWG genomic elements, multiplied by a factor of three from the hg19 genome13. After restricting intervals to callable regions, we removed intervals overlapping with functional elements as well as ‘lncrna.ncrna’ or ‘lncrna.promCore’. This resulted in 1,251,185 intergenic intervals, with 1,178,722 intervals containing at least one mutation across the cohort. We then randomly downsampled from the full set to create sample sizes of 1,000,000, 800,000, 600,000, 300,000, 100,000, and 50,000. Applying all four ML algorithms to these different sample sizes showed the performance rankings of ML algorithms remained consistent regardless of sample size, with XGBoost consistently performing the best. However, with larger sample sizes, the difference in performance for XGBoost, RF, and NN with Poisson loss decreased (Fig. 2c, d, Supplementary Fig. 1b–e, g).

eMET enhances the intergenic XGBoost model performance for BMR prediction

By using both intergenic and element-specific information, eMET outperformed both the intergenic XGBoost model and the bootstrap aggregation of element-specific XGBoost models across all element types, in pan-cancer analysis. The rationale behind eMET is that fine-tuning intergenic models with the information specific to each element type enhances the precision of BMR estimation, thereby improving driver discovery precision. According to our comparative tests, while the element-specific XGBoost model outperformed the intergenic XGBoost for core promoters and splice sites, the intergenic model was superior for all other element types. However, eMET consistently delivered the best performance, demonstrating its effectiveness in integrating genomic information across intergenic intervals and functional element types, in pan-cancer, and cancer-specific analysis (Fig. 3a).

Fig. 3. The eMET performance in background mutation rate (BMR) prediction and driver discovery for pan-cancer cohort.


a Correlation between observed and predicted BMR for comparing the performance of eMET, element-specific XGBoost, and intergenic XGBoost. b Enhanced driver discovery performance for eMET compared to intergenic XGBoost. This figure displays the number of CDS and non-coding driver candidates identified by intergenic XGBoost and eMET. Additionally, it shows the number of driver candidates identified by each method that are present in the OncoKB cancer genes database. c Lollipop plot of the TPTE promoter as a candidate driver, showing two hotspots at chromosome 21 positions 10,990,694 (del C) and 10,990,976 (G > A). d Comparison of RNA expression for TPTE gene in TPTE promoter mutated and wild type donors.

In this study, in addition to pooling SNVs and non-SNVs, we further compared eMET with intergenic XGBoost in predicting BMR, using separate models for SNVs and non-SNVs. Our results demonstrated that eMET outperformed intergenic XGBoost across various cancer types, both by pooling the variants and with separate modeling (Fig. 5 and Supplementary Fig. 4). While some previous tools, such as Dig, MutSpot, and ncDriver, model SNVs and non-SNVs separately, other tools, such as Dr.Nod, limit the analysis to SNVs20. However, in tools such as DriverPower and ActiveDriverWGS, SNVs, doublets, MNVs, and indels are pooled and modeled together13,14, and Exinator pools SNVs and indels of length 1 together15,21.

Fig. 5. Correlation between observed and predicted mutation rates using the baseline model and various tools across pan-cancer and 23 cancer-specific cohorts.


Each slice displays the performance of various methods in all cohorts for a specific genomic element type. Cohorts are sorted by the number of mutations per donor in each cohort, with values shown in log10 scale.

eMET demonstrates the best performance in BMR prediction in both original and lower-dimensional spaces

We evaluated BMR prediction on reduced feature spaces obtained through principal component analysis (PCA) and an autoencoder (AE). Applying PCA to the scaled training data, we found that the first 148 principal components (PCs) explained 80% of the variance. We reduced the dimensionality of the intergenic and functional element feature matrices to the first 148 PCs using the PCA model. For AE, we used a bottleneck with 148 nodes to match the PCA for comparison. We trained eMET, XGBoost, RF, and NNs with both Poisson loss and MSE loss on the PCA-reduced data and compared their performances with those of the same algorithms trained on the AE-reduced data. Our analysis showed that training on the original data yielded better results than applying dimensionality reduction before training (Fig. 4). We also observed that eMET had the best performance across all element types when applied to the full set of features and to the AE-reduced space. However, when applied to the PCA-reduced space, eMET outperformed all other algorithms except the neural network with Poisson loss for enhancers and 3’UTRs. Comparing the performance of all models on PCA- and AE-reduced data, we found that XGBoost performed better on the AE-reduced space than on the PCA-reduced space for all functional elements. Conversely, the neural network with Poisson loss performed better on the PCA-reduced space than on the AE-reduced space. Notably, this algorithm performed better on the PCA-reduced space than on the full feature set for CDSs, 3’UTRs, 5’UTRs, and core promoters (Fig. 4).
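The choice of 148 PCs follows from a cumulative explained-variance threshold. A minimal sketch of that selection rule is shown below, assuming the per-component variances are already available from a fitted PCA; the helper name `n_components_for_variance` is hypothetical.

```python
def n_components_for_variance(explained_variances, target=0.80):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches the target fraction."""
    total = sum(explained_variances)
    running = 0.0
    ranked = sorted(explained_variances, reverse=True)
    for k, variance in enumerate(ranked, start=1):
        running += variance
        if running / total >= target:
            return k
    return len(ranked)
```

Applied to the variances of the scaled training features, the same rule with `target=0.80` would return the number of PCs retained in the analysis above.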

Fig. 4. Correlation between observed and predicted mutation rates using different models trained on the variable-size intergenic intervals with all of the features, PCA-reduced space, and AE-reduced space.


For the intergenic elements, we used the mean correlation obtained from the correlations of 10 repeats of train-validation splits. PCA stands for principal component analysis, and AE stands for autoencoder.

eMET outperforms alternative approaches in cancer-specific BMR estimation

We further evaluated eMET’s performance in predicting BMR by comparing observed and expected BMRs for eMET with those for ActiveDriverWGS, DriverPower, Dig, and a custom local mutation rate estimate as the baseline (see ‘Definition of baseline’ subsection of the Methods for more details) across 23 cancer-specific cohorts and a pan-cancer cohort. In nearly all pairwise comparisons between eMET and other methods (538 out of 552 tests), eMET outperformed the other methods in BMR estimation. While eMET’s performance remains consistent across cancer types with varying mutation counts, Dig’s performance is influenced by the cohort’s mutation count, showing a bias toward better performance in cohorts with a greater number of mutations per donor (Fig. 5). For the cancer-specific analysis, we conducted two separate analyses. In the first, we used all 1372 (epi)genomic features; in the second, we restricted the features to functional genomics features based on the cell of origin to match the cancer type under study (the ‘origin’ column in Supplementary Data 1 specifies the cell type of origin for each feature), along with constant genomic features such as nucleotide content and conservation categories. This analysis showed that the foundational intergenic model performed better in predicting the BMR when using the full set of features compared to using only tissue-matched functional genomics features (Supplementary Fig. 5). For the cohorts of brain/neuronal origin (CNS-Medullo, CNS-GBM, and Head-SCC), which had the highest number of tissue-specific features (n = 113), we observed that the intergenic model’s performance using the full feature set closely matched its performance when limited to tissue-specific features. Based on these results, we decided to use the full set of features to run eMET in our cancer-specific analysis.

eMET enhances cancer driver discovery in coding and non-coding elements

Using a burden test based on the binomial distribution followed by multiple-testing correction, we identified a list of candidate drivers for each model. Benchmarking the models against the OncoKB gene list as our gold standard showed that enhancing BMR estimation improved driver discovery (Supplementary Data 2 and Fig. 3b). Supplementary Data 3 summarizes the significant hits identified by eMET in coding and non-coding elements. When comparing eMET with intergenic XGBoost for CDSs, eMET reported fewer significant hits but more true positives, resulting in enhanced precision, recall, and F1 score. In non-coding elements, eMET still outperformed intergenic XGBoost in precision, recall, F1, AUPR, and AUC, although the improvement over the intergenic model was smaller than in CDSs (Supplementary Data 2).
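A binomial burden test with multiple-testing correction can be sketched as below. Several details are assumptions made for illustration: the number of trials `n` is taken to be element length times the number of donors, the success probability `p` is the per-base BMR predicted by the model, and the correction is Benjamini-Hochberg (the section reports q-values but does not name the procedure). The function names are hypothetical.

```python
import math

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

An element would be reported as a candidate driver when its q-value falls below the chosen threshold; a more accurate BMR changes `p` and therefore which elements pass.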

For both intergenic XGBoost and eMET, the top significant hits included well-known cancer drivers such as the CDSs of KRAS, TP53, PIK3CA, CDKN2A, CTNNB1, and VHL, as well as the core promoter of TERT and the splice sites of TP53, RB1, and PTEN (Supplementary Data 3). The promoter of TPTE, which has two hotspots (Fig. 3c), and the 3’UTR of IFI16 were identified exclusively by eMET; intergenic XGBoost did not recognize them as candidate drivers. Expression analysis also supports an association between mutations in the TPTE promoter and TPTE expression levels in the pan-cancer cohort (Fig. 3d).

In addition, we integrated eMET’s BMR estimations into the Dig framework to assess whether combining eMET with other driver discovery methods could improve driver discovery performance. Specifically, we replaced Dig’s convolutional neural network- and Gaussian process-based mutation predictions with eMET predictions that were centered to match Dig’s mean values (Dig(eMET)) and then ran the Dig pipeline as before. Across all cancer types, Dig(eMET) demonstrated significant improvements over Dig in identifying coding drivers. However, for non-coding elements, both Dig and Dig(eMET) exhibited extremely low F1 scores, likely due to the absence of a reliable ground truth for non-coding drivers in each cancer type (Supplementary Fig. 6).

eMET facilitates analyses of feature importance by element type

To determine the importance of the seven feature categories, i.e., conservation, replication timing, nucleotide content, epigenetic marks, RNA expression, chromatin compartments, and DNA accessibility, in the intergenic model, we employed two approaches: 1) one-group feature prediction, and 2) leave-one-group-out prediction.
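The two evaluation schemes differ only in which feature columns are passed to the model. The sketch below builds both families of feature subsets from a category map; the map contents and the name `feature_subsets` are hypothetical, and each subset would then be used to retrain and score the intergenic model.

```python
def feature_subsets(groups):
    """Build feature lists for both importance schemes.

    groups: dict mapping a category name to its list of feature names.
    Returns (one_group, leave_one_out), each keyed by category name.
    """
    # 1) one-group prediction: train on a single category only
    one_group = {name: list(feats) for name, feats in groups.items()}
    # 2) leave-one-group-out: train on everything except that category
    leave_one_out = {
        name: [f for other, feats in groups.items() if other != name for f in feats]
        for name in groups
    }
    return one_group, leave_one_out
```

A category is influential if the one-group model alone recovers much of the full-model correlation, or if removing it causes a clear drop.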

Our analyses showed that epigenetic marks (histone modifications) are the most important group of features for predicting BMR using the intergenic XGBoost model (Fig. 6a, and Supplementary Fig. 7). By using eMET, we additionally analyzed the importance of the feature groups for predicting the BMR of each element type. The analysis showed that ‘epigenetic marks’ were the most important category of features for BMR prediction across all element types, with DNA accessibility and nucleotide content being the next most important feature groups for splice sites and 5’UTR elements (Fig. 6b).

Fig. 6. Analysis of feature importance in predicting background mutation rate (BMR).


a Average correlation between observed and predicted BMR values using one group of features with the intergenic XGBoost model across 10 training-validation splits (the dashed line represents the average correlation between observed and predicted BMR using the full set of features on 10 repeats of train-validation sets). b Per element correlation ratio for groups of features. c Top 50 important variables for estimating BMR in the intergenic XGBoost model. d Per element improvement of the intergenic model using eMET for each feature category.

We further investigated which feature category best enhances BMR prediction for each element type when the intergenic model is fine-tuned with eMET. We observed that the conservation category is one of the most important feature categories driving the improvement from eMET, especially in CDSs and splice sites (Fig. 6d).

When we examined feature importance at a more granular level, we observed that ENCFF0251LMM (the genome compartments derived from HiC data of the SJCRH30 cell line) was the most important variable for predicting BMR in the pan-cancer cohort. Overall, variables related to HiC, replication timing, and epigenetic marks are among the most important features for BMR prediction (Fig. 6c).

We also analyzed feature importance across different cancer types to determine whether the tissue-specific features were enriched in the top 50 features. Our analyses showed a significant enrichment of tissue-specific features for Skin-Melanoma, Uterus-AdenoCA, and ColoRect-AdenoCA (Supplementary Table 1).

Discussion

Discovering both coding and non-coding cancer drivers presents a significant challenge in the field of cancer genomics. Various statistical methods have been devised to identify drivers through an analysis of excess mutation burdens. An essential step involves constructing a BMR model, which serves as a baseline for comparing observed mutations. DriverPower, a computational tool presented in the PCAWG collection, uses intergenic information along with a comprehensive set of features to train a BMR model13. In contrast, ActiveDriverWGS employs a background window of at least ±50 kilobases around the element of interest14. Dig and MutSpot scan the entire genome for making background models.

In this study, we introduced eMET, a novel approach for BMR prediction that employs transfer learning to initially incorporate intergenic information into the model and subsequently fine-tunes it to adapt to specific functional element types. We applied eMET to WGS data from a pan-cancer cohort of 2253 non-hypermutated donors, spanning 33 cancer types, and 23 cancer specific cohorts, from the PCAWG project. We implemented eMET with the XGBoost model as the base model and applied it to a set of 1372 features. Our results demonstrated that incorporating element-specific information alongside intergenic data significantly enhanced the estimation of BMR, outperforming models that rely solely on intergenic information. Our foundational intergenic XGBoost model is somewhat equivalent to DriverPower. Therefore, the improvement in BMR estimation compared to DriverPower, achieved by incorporating eMET, highlights the significant performance gains enabled by element-specific BMR modeling. We conducted a comprehensive analysis of the factors influencing BMR prediction under various settings. Our analysis of different genomic interval generation methods demonstrated the superiority of variable-size intergenic intervals for BMR prediction compared to fixed-size intervals (Fig. 2b, Supplementary Fig. 1a, f). Surprisingly, using variable-size intervals with at least one mutation can profoundly increase the performance of predicting the mutation rate, especially for intervals with shorter lengths (Supplementary Fig. 3). This emphasizes the importance of considering genomic interval generation methods and mutation status as critical factors affecting BMR estimation. Our analyses of the importance of features and feature categories revealed that epigenetic marks are the most important category for predicting mutation rates in cancer genomes. 
These findings align with previous results showing that the mutation rates of cancer genomes are correlated with epigenomic and chromatin organization6,22. A deeper examination of each feature’s importance shows that features related to chromatin compartments, replication timing, and epigenetic marks are the most influential. These results are consistent with previous studies showing that epigenetic organization and replication timing can explain most of the variance in mutation rates6. Akdemir et al., in a more recent study, showed that changes in mutational load in cancer genomes can be better tracked by considering three-dimensional chromatin rather than replication timing measurements6,23. Our variable importance results are also in concordance with Akdemir et al.’s study, as we found that the most important feature for predicting BMR is the genome compartments derived from HiC data of the SJCRH30 cell line and that all 12 genome compartment HiC features are among the most important features23. Although the nucleotide content group is not among the most important feature categories in intergenic models, eMET has revealed that nucleotide content ranks as the third most important category for BMR estimation in splice sites and 5’ UTRs. Additionally, despite the conservation category having the least impact on mutation rate estimation in both intergenic models and eMET, the PhyloP feature remains among the top 50 important features (Fig. 6c). Moreover, the conservation category is one of the most important for enhancing the intergenic model with element-specific information, especially for CDSs, splice sites, promoters, and 5’UTRs (Fig. 6d). These findings align with the high level of conservation observed in coding sequences across species, followed by the enrichment of conserved bases in 5’UTRs and promoter-like elements24–26. 
Nucleotide content, particularly for CDS elements, is another key feature category that enhances the intergenic model when evaluated in the context of element types. This observation is consistent with the fact that protein-coding sequences have a higher GC content compared to the average GC content of the human genome27. Taken together, our analysis reveals that among models using a single feature category to estimate BMR, epigenetic marks are the most influential, as shown by both the intergenic model and eMET (Fig. 6a, b, Supplementary Fig. 7); however, the conservation and nucleotide content categories can most effectively improve the intergenic model through element-specific information (Fig. 6d).

In our recurrence-based driver discovery task, both intergenic XGBoost and eMET identified well-known cancer drivers as their top hits. Notably, eMET reported 79 significant hits for the pan-cancer cohort in coding sequences (CDSs), 58 of which are listed in OncoKB. Nine CDSs were uniquely identified by eMET but not by the intergenic XGBoost. These included TGFBR2, HRAS, ATM, RNF43, RBM10, DDX3X, KMT2C, PDE4DIP, and RPS6KA3. Of these, only RPS6KA3 is not classified as a true positive according to our gold standard, although it was reported as a cancer driver in the ‘meta_Adenocarcinoma’ cohort of the PCAWG project. Additionally, eight out of 21 significant non-true positive CDS hits (Supplementary Data 3) were listed as cancer drivers in the Cancer Gene Census (CGC), CGC_literature, or PCAWG-raw drivers (significant hits of the PCAWG list, q-value pre-filtering < 0.1, without post-filtering process9). Thirteen significant hits not listed in any cancer gene databases (POTEE, TRIM49C, EYS, AC026310.1, OR2T34, LRP12, OR5W2, CTD, TOR4A, OR10GB, PROS1, OR5F1 and POTEJ) could be considered putative cancer drivers. POTEE, a member of the POTE gene family, originally identified in prostate cancer, has been found to have altered expression in various cancer types28. Shen et al. demonstrated that POTEE mRNA and protein are overexpressed in colorectal tumor samples and cell lines. Through gain-of-function and loss-of-function experiments, they confirmed its oncogenic role by influencing cell proliferation, cell cycle progression, migration, invasion, and apoptosis through both in vivo and in vitro evidence29. TOR4A, which belongs to the ATPase superfamily, was shown in vitro to be involved in cell proliferation, apoptosis inhibition, and cell cycle progression in glioma cell lines and samples30. Abboud-Jarrous et al. showed that by modulating AXL expression, PROS1 could drive oral squamous cell carcinoma and enhance the proliferation and migration of oral squamous cell carcinoma cell lines31. The CDS of ZNF608 was among eMET driver candidates in the Lymph-BNHL cohort, although it is not listed in CGC, CGC_literature, or OncoKB. However, the Network of Cancer Genes (NCG) database designates ZNF608 as a putative tumor suppressor gene, noting it has prevalent loss-of-function alterations32.

For the non-coding elements, both the eMET and intergenic models identified previously known non-coding driver elements as their top hits. eMET successfully prioritized 15 functional elements of OncoKB genes in the pan-cancer cohort. Additionally, eMET identified 32 other significant hits in this cohort, including 20 functional elements that could be potential drivers based on CGC COSMIC cancer genes or the PCAWG driver list, along with 12 newly identified elements. These elements are the 3’UTRs of ADH1B (with significant RNA expression (p-value = 0.00007)), BHMT, BRINP2, GLYR1, IFI16, PDPR, and TCP10 genes, and the core promoters of COL4A2, OR5T3, SRPRB, TPTE (with significant RNA expression (p-value = 0.018)), and ZSCAN5B genes. Notable potential driver candidates include the 3’UTRs of the Betaine homocysteine methyltransferase (BHMT) (q-value = 0.0004) and the IFNγ-inducible protein 16 (IFI16) (q-value = 0.028), as well as the core promoter of TPTE (q-value = 0.0458). BHMT is an enzyme that converts homocysteine to methionine, which is crucial for cancer cell proliferation due to methionine addiction, known as the Hoffman effect. This addiction makes methionine deprivation a promising therapy against cancer cells33. IFI16 is involved in cell proliferation, cell cycle inhibition, p53-mediated apoptosis, cellular senescence, and DNA damage sensing34,35. Duan et al. showed that knocking down IFI16 expression could inhibit the ATM/AMPK/p53 signaling pathway and autophagy upon glucose restriction, thus providing a growth advantage for cancer cells in a low-glucose cancer microenvironment36. Zhang et al. showed that IFI16 expression is lower in osteosarcoma compared to normal bone tissues. By generating stable IFI16-expressing osteosarcoma Saos-2 and chondrosarcoma RCS cell lines, they demonstrated that the tumorigenicity of these cells decreased, which corresponded with decreased levels of cyclin E, cyclin D1, c-Myc, and Ras35.
Transmembrane Phosphatase with Tensin homology (TPTE) is a cancer-testis antigen specifically expressed in adult testis. Our pan-cancer analysis showed two hotspots in the TPTE core promoter (Fig. 3c). Previous studies have shown ectopic expression of TPTE in tumor samples: TPTE is expressed in 39% of HCC and 36% of non-small cell lung cancer samples37. Additionally, 20% of lung tumor samples expressed TPTE, whereas expression was undetectable in normal tissue except in the testis, epididymis, and placenta38. A recent study also reported that TPTE overexpression in prostate cancer is clinically significant and is associated with disease prognosis39. In our pan-cancer cohort analysis, we identified several olfactory receptor-related elements as significant hits, specifically the CDSs of OR10G8, OR2T34, OR5F1, OR5W2, and OR7A10, as well as the core promoter of OR5T3. Although recent studies suggest that some olfactory receptors may function as oncogenes or have functional roles in cancer cells, our candidate list requires thorough analysis to validate their potential role in cancer40–43.

In addition, eMET identified additional candidate elements through cancer-specific analyses. One such candidate is the 3’UTR of ADH1B, recurrently mutated in the Liver-HCC cohort, and associated with significant differences in RNA expression between mutated and non-mutated Liver-HCC donors (p-value = 0.039). ADH1B has also been suggested as a potential tumor suppressor gene, frequently exhibiting loss-of-function alterations in the NCG database32. Another novel candidate is the 3’UTR of ABI2, recurrently mutated in Liver-HCC. Knockdown of ABI2 in hepatocellular cells was shown to inhibit cell growth, sorafenib resistance, migration, and invasion44. Furthermore, ABI2 overexpression inhibited cell migration, invasion, and proliferation in triple-negative breast cancer45. The identification of the 3’UTR of FGA as a candidate in Liver-HCC aligns with the findings of Wang et al., which demonstrated that knocking out FGA in lung adenocarcinoma cells enhanced cell growth and metastasis via the integrin-AKT signaling pathway46. Additionally, there is evidence of frequent loss-of-function alterations based on the NCG database. The 3’UTR of SENP2 in Panc-AdenoCA, a tumor of epithelial origin, is another potential driver candidate. The SENP2 gene is located in the 3q26-29 region, which is frequently amplified in various epithelial-origin cancers, including lung, esophageal, head and neck, cervical, and ovarian cancers47. Moreover, SENP2 knockdown in esophageal squamous cell carcinoma suppressed cell proliferation, while SENP2 overexpression enhanced proliferation by deSUMOylating SETDB148.

While we employed eMET for a recurrence-based driver discovery task using a binomial distribution, our primary aim in developing eMET was to estimate the BMR and to investigate various factors influencing BMR prediction. We demonstrated eMET’s utility in driver discovery, though we acknowledge that recent driver discovery tools such as DriverPower, Dr.Nod, and Dig use integrated approaches to prioritize cancer drivers within each genomic interval. Driver discovery tools detect signals of positive selection through factors such as trinucleotide context bias, an accumulation of mutations with high functional impact within target elements, mutational excess, and the clustering of mutations within protein functional domains or the 3D structure of the protein. By integrating eMET’s BMR estimates with Dig, we demonstrated eMET’s potential to enhance driver discovery as a building block of other driver discovery methods. However, further investigations are needed to develop a more comprehensive framework. Furthermore, although we modeled SNVs and non-SNVs separately for the BMR estimation, prioritizing driver candidates would benefit from a distinct modeling approach for SNVs and non-SNVs. Nonetheless, after manual post-filtering of the candidates, we provided a list of candidate drivers for further experimental investigation.

Our study has some limitations. eMET did not account for mutational signatures from specific mutational processes or individual mutation rate heterogeneity. eMET requires the removal of previously known drivers from the list of elements in the element-specific bootstraps. However, remaining unknown cancer drivers, especially in non-coding elements, can influence the model and generate bias in BMR estimation. We attempted to minimize this effect by removing all COSMIC CGC and PCAWG gene lists, along with our gold standard OncoKB genes. Finally, the eMET training process, involving extensive bootstrap sampling and training on numerous features, is time-consuming and requires powerful computational systems. However, by using the GPU version of XGBoost, we efficiently trained our model on a large number of genomic intervals with at least 1372 features.

In summary, our study highlights the importance of leveraging both intergenic and element-specific information for BMR prediction. Additionally, our benchmark analyses on different intergenic interval generation methods, dimension reduction, groups of feature analyses, and sample size effect in BMR prediction offer a robust foundation for future research. Our findings demonstrate the potential of integrative and ensemble approaches in cancer genomics, paving the way for advancements in precision oncology.

Methods

Whole cancer genomes

We analyzed Consensus SNV and Indel somatic mutations of 2583 white-list, non-hypermutated whole cancer genomes of the PCAWG project for 23 cancer types with at least 20 donors, as well as a pan-cancer cohort. The pan-cancer analysis included 2253 donors of 33 cancer types in the PCAWG project, except for samples from the Skin-Melanoma, Lymph-CLL, Lymph-NOS, and Lymph-BNHL cohorts as well as 69 hypermutated donors5. White-list tumors in the PCAWG project are defined as samples exhibiting good quality and free of quality control issues. Hypermutated donors were defined as those with more than 30 mutations per Mb13. Consensus SNV and Indel somatic mutations were obtained from the ICGC Data Portal (https://dcc.icgc.org/releases/PCAWG/). Both controlled and open access data are now available in the ICGC 25K Release Data (see Supplementary Table 5 for details).

Characterization of intergenic intervals and functional genomic elements

We analyzed both the coding and non-coding functional genomic elements of the genome, specifically the coding sequences (denoted as “gc19_pc.cds” in the PCAWG), enhancers (“enhancers”), 3’ untranslated regions (“gc19_pc.3utr”), 5’ untranslated regions (“gc19_pc.5utr”), core promoter regions (“gc19_pc.promCore”), and splice sites (“gc19_pc.ss”) (n = 126,517). The genomic coordinates of functional genomic elements can be obtained from the ICGC 25K Release Data, located at s3://icgc25k-open/PCAWG/drivers/metadata/genomic_intervals_lists/. We used a single processed bed file (https://figshare.com/ndownloader/files/13005854) that concatenated all the genomic coordinates of these functional genomic elements, except for the miRNA and small RNA coordinates. Additionally, we used the callable regions (96.41% of the genome) of the PCAWG study. Callable regions were created by excluding the following genomic intervals from the analyses: (1) black-listed intervals of the hg19 genome, as defined in the UCSC Genome Browser49, (2) PCAWG low mappability intervals5, and (3) all gaps and N bases of the hg19 genome50.

We also incorporated intergenic intervals for training the BMR models. Intergenic intervals were defined in two distinct ways: (1) variable-size intervals, generated directly from the callable regions of the genome, excluding the abovementioned PCAWG functional regions, lncRNAs (“lncrna.ncrna”), and lncRNA core promoters (“lncrna.promCore”), and (2) fixed-size intervals, created by dividing the hg19 genome into intervals of fixed sizes (1 Mb, 100 Kb, 50 Kb, and 10 Kb) using bedtools (version 2.30.0). We ensured that these intergenic intervals neither included non-callable regions nor overlapped with functional genomic elements (Fig. 2a). In addition, only intergenic intervals from autosomal chromosomes were used in this study. Supplementary Fig. 8 shows the distribution of the lengths of functional genomic elements and variable-size intergenic intervals. Supplementary Tables 2 and 3 provide descriptions of the functional genomic elements and intergenic intervals, detailing their lengths, number of mutated elements, and number of mutations in each element type.

Features and response table

We analyzed 1372 genomic and epigenomic features influencing the BMR, as detailed in the DriverPower paper13. These features were categorized into seven groups: conservation, replication timing, nucleotide content, epigenetic marks, RNA expression, chromatin compartments, and DNA accessibility; these data were obtained predominantly from the ENCODE project, Roadmap Epigenomics project, and UCSC genome browser (Supplementary Data 1). For features in bigwig format, we used bigWigAverageOverBed (v2) to extract the average signal within each genomic interval. For bed formatted features, we used bedtools (v2.26.0) to calculate the proportion of bases within each interval intersecting with a bed file. For the nucleotide content category, the fraction of 2- and 3-mers was directly calculated from the hg19 genome. Any missing values were replaced with 0. We independently applied Z-score normalization to each feature using the intergenic data.
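The normalization step can be sketched as follows. This is a minimal illustration under one reading of the text, namely that the per-feature mean and standard deviation are computed on the intergenic intervals and then reused for any other set of intervals. The function names and the guard for constant features are assumptions, not part of the published pipeline.

```python
import numpy as np

def fit_intergenic_scaler(X_intergenic):
    """Compute per-feature mean and SD on intergenic data (missing values set to 0)."""
    X = np.asarray(X_intergenic, dtype=float)
    X = np.where(np.isnan(X), 0.0, X)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant features unscaled
    return mean, std

def apply_scaler(X, mean, std):
    """Z-score any feature matrix with the intergenic statistics."""
    X = np.asarray(X, dtype=float)
    X = np.where(np.isnan(X), 0.0, X)
    return (X - mean) / std

# Toy data: three intergenic intervals, two features, one missing value.
X_intergenic = [[1.0, 10.0], [3.0, np.nan], [5.0, 20.0]]
mean, std = fit_intergenic_scaler(X_intergenic)
Z_intergenic = apply_scaler(X_intergenic, mean, std)
Z_element = apply_scaler([[3.0, 10.0]], mean, std)
```

Reusing the intergenic statistics keeps the features of functional elements on the same scale as the data the intergenic model was trained on.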

For each cancer-specific and pan-cancer cohort, we constructed a response table for all the genomic intervals, comprising unique interval IDs for each genomic interval, the number of mutations in each element across donors (y), the total number of donors (N), and the length of each element (l). The mutation rate (r) for the genomic element with interval ID i was defined as the number of mutations per length per number of donors in the cohort, i.e., $r_i = y_i / (l_i \cdot N)$.
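As a small illustration of the response table, the sketch below builds one row per interval and computes the mutation rate as mutations per base per donor. The interval IDs, the dictionary layout, and the function names are hypothetical and chosen only for this example.

```python
def mutation_rate(y, length, n_donors):
    """Mutations per base per donor: r = y / (l * N)."""
    return y / (length * n_donors)

def build_response_table(interval_lengths, mutation_counts, n_donors):
    """Return one row per interval with y, N, l, and the observed rate r."""
    table = []
    for interval_id, length in interval_lengths.items():
        y = mutation_counts.get(interval_id, 0)  # unmutated intervals get y = 0
        table.append({
            "id": interval_id,
            "y": y,
            "N": n_donors,
            "l": length,
            "r": mutation_rate(y, length, n_donors),
        })
    return table

# Hypothetical cohort of 100 donors with two intervals.
lengths = {"cds::GENE_A": 1000, "enhancer::E1": 500}
counts = {"cds::GENE_A": 5}
response = build_response_table(lengths, counts, n_donors=100)
```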

The main analysis of this study combines all SNVs, MNVs, and indels. However, since SNVs and non-SNVs can arise from different mutational processes, we also analyzed them separately. To this end, we created separate response tables for SNVs and non-SNVs.

Different machine learning algorithms for modeling BMR

For the BMR modeling, we used RF and XGBoost, alongside two neural network models with identical architectures but distinct loss functions: mean squared error and Poisson loss functions. Benchmarking of these models was performed on the pan-cancer cohort. The hyperparameters of the models are given in Supplementary Table 4. All model training was performed in Python using the Keras/TensorFlow, cuML, and XGBoost packages51–53.

Definition of baseline

To compare the models, we used the local mutation rate as the baseline and evaluated the correlation of various machine learning algorithms and BMR prediction methods against it. In particular, the baseline method was defined by extending each block of a PCAWG element to a 100 kb window (±50 kb), excluding other blocks of the same element, and calculating the weighted mean of the observed mutation rates of the element’s blocks based on their lengths.

The eMET algorithm

In this section, we introduce our approach, the eMET algorithm, for predicting BMR across different genomic element types as depicted in Fig. 1. The main hypothesis of eMET is that using mutation rates and (epi)genomic features from elements within the same element type can enhance the accuracy of BMR estimations. The eMET method addresses several challenges in achieving this goal. A significant challenge is the relatively small number of elements per genomic element type compared to the number of intergenic intervals, which could lower the prediction accuracy. To mitigate this, we employed transfer learning, a machine learning technique in which a pre-trained model on one dataset is fine-tuned on another to improve the model’s performance. This approach involves learning patterns from an initial large dataset, then refining the model on a related, smaller dataset. In our case, we initially trained a model on a large number of intergenic intervals to serve as a basis for further fine-tuning across different element types by using an ensemble approach. In addition, for driver discovery, we need to ensure that the machine learning models predicting BMR for a specific element do not include that element in their training data. To accomplish this, we trained multiple element-specific models using bootstrap samples, excluding the element of interest in each model. We then used out-of-bag (OOB) predictions to estimate the BMR for each element. Furthermore, since mutations in known driver elements are more influenced by natural selection than by background mutational processes, we excluded known cancer driver elements from our training dataset. The eMET algorithm is explained in detail as follows:

  1. Intergenic Model Training: We trained an XGBoost model on all mutated intergenic intervals, using the parameters outlined in Supplementary Table 4.

  2. Exclusion of Previously Identified Drivers: All known PCAWG functional drivers were omitted from the training dataset for element-specific models. Specifically, we excluded all protein-coding and non-coding elements of the genes listed in CGC, CGC_literature (according to the PCAWG study, 757 cancer genes, the union of 369 genes from exome studies10,54 and 603 CGC-v80 genes, are listed in supplementary table 7 of the Rheinbay study9), OncoKB, and PCAWG driver lists (we excluded all of the elements listed in supplementary tables 4 and 5 of the Rheinbay study)9,55–57.

  3. Bootstrap Sampling and Element-Specific Model Training: We generated 100 bootstrap samples for each element type, including enhancers, CDSs, 3’UTRs, 5’UTRs, splice sites, and core promoters. For each bootstrap sample k, the baseline intergenic model was fine-tuned and the BMRs for OOB elements, elements not in the current bootstrap sample, were subsequently predicted.

  4. Ensemble Aggregation of Bootstrap Sample Predictions: We calculated the average of all OOB predictions to provide a robust estimation of the BMR for each element.

In summary, eMET leverages transfer learning by training on large intergenic intervals to capture broad mutation patterns, then fine-tunes on smaller, element-specific datasets for precise BMR estimation using an ensemble approach. By excluding known drivers from training, eMET accurately models background mutation processes that may result in the enhanced identification of novel cancer drivers.
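The bootstrap and out-of-bag (OOB) logic of steps 3 and 4 can be sketched as below. This is an illustration under stated assumptions rather than the published implementation: the fine-tuning step is represented by a small linear model refined by gradient descent from "pre-trained" weights, standing in for the intergenic XGBoost model, and all function names are hypothetical. With XGBoost itself, fine-tuning could plausibly be realized by continuing training from the intergenic booster (for example through the `xgb_model` argument of `xgboost.train`), but that mapping is an assumption here. Known drivers are assumed to have been removed from `X` and `y` beforehand.

```python
import numpy as np

def make_fine_tuner(base_weights, steps=50, lr=0.1):
    """Stand-in for fine-tuning: start from pre-trained weights and refine them."""
    def fine_tune(X_train, y_train):
        w = base_weights.copy()
        for _ in range(steps):
            grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
            w -= lr * grad
        return lambda X_new: X_new @ w
    return fine_tune

def oob_ensemble_predictions(X, y, fine_tune, n_bootstraps=100, seed=0):
    """Average, for each element, the predictions of models that never saw it."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pred_sum = np.zeros(n)
    pred_count = np.zeros(n)
    for _ in range(n_bootstraps):
        in_bag = rng.integers(0, n, size=n)           # sample with replacement
        oob = np.setdiff1d(np.arange(n), in_bag)      # elements left out of this sample
        if oob.size == 0:
            continue
        predict = fine_tune(X[in_bag], y[in_bag])
        pred_sum[oob] += predict(X[oob])
        pred_count[oob] += 1
    # Elements that were never out-of-bag get NaN instead of a prediction.
    return np.where(pred_count > 0, pred_sum / np.maximum(pred_count, 1), np.nan)

# Synthetic element-type data (purely illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1])
fine_tune = make_fine_tuner(base_weights=np.zeros(3))
oob_bmr = oob_ensemble_predictions(X, y, fine_tune)
```

Because each element is predicted only by models whose bootstrap sample excluded it, the averaged estimate is not influenced by that element's own mutation count.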

Dimensionality reduction

We also analyzed the performance of different ML methods on low-dimensional representations of the data. For this purpose, we applied PCA and an AE for dimensionality reduction using the Python sklearn and tensorflow packages. We selected the first 148 PCs, which explained 80% of the variability. The AE also used 148 neurons for the bottleneck layer. It also had three hidden layers for encoding and decoding, with 250, 200, and 150 neurons. The Rectified Linear Unit (ReLU) was used as the activation function for all layers, except for the bottleneck layer, where a linear activation function was used. MSE was employed as the loss function. We used the Adam optimizer for training with a total of 2000 epochs and a batch size of 512.
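Selecting the number of principal components by explained variance can be sketched with a plain SVD. The snippet is a NumPy-only illustration on synthetic data, not the sklearn-based code used in the study; the function name and the toy matrix are assumptions.

```python
import numpy as np

def pca_by_explained_variance(X, threshold=0.80):
    """Project X onto the fewest leading PCs whose cumulative variance reaches threshold."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)               # variance fraction per component
    k = int(np.searchsorted(np.cumsum(explained), threshold)) + 1
    k = min(k, len(S))                            # guard against rounding at threshold = 1
    return Xc @ Vt[:k].T, k

# Synthetic matrix: one high-variance feature and four low-variance ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
X_reduced, n_components = pca_by_explained_variance(X, threshold=0.80)
```

In the study, the same 80% criterion applied to the 1372 features yielded 148 components.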

Variable importance and feature category importance

We first investigated the importance of the seven feature categories in intergenic models using two approaches. In the first approach, we trained an XGBoost model using one group of features on 80% of the intergenic intervals as the training set and evaluated the model on the remaining 20% of intergenic intervals. We repeated this procedure 10 times and reported the performance of the groups by averaging the correlations between the observed and predicted rates in each repeat. In the second approach, we trained an XGBoost model on 80% of the intergenic intervals and validated it on the remaining 20% of intergenic intervals, excluding one group of features each time. This process was repeated 10 times. We reported the decrease in average correlation across repeats by comparing the results to those obtained when using all features. Additionally, we analyzed the importance of different feature categories in predicting BMR for each element type. To achieve this goal, we performed eMET on individual feature categories and compared the performance to that of the full eMET model, which uses all features. Specifically, we reported the ratio of the correlation obtained using the simpler eMET model to the correlation obtained with the full model. We also assessed the improvement of the eMET model over the intergenic model for feature category c and element type k as $(\rho^{e}_{kc} - \rho^{i}_{kc}) / \max(|\rho^{e}_{kc}|, |\rho^{i}_{kc}|)$, where $\rho^{e}_{kc}$ and $\rho^{i}_{kc}$ represent the correlations obtained using the eMET model and the intergenic XGBoost model, respectively. The normalization by the maximum absolute correlation scales the result to account for different ranges of correlation values across element types and feature categories.
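The normalized improvement score can be written as a short function. The sketch reads the expression as the difference between the eMET and intergenic correlations divided by the larger of their absolute values; the function name and the zero-denominator guard are assumptions made for this example.

```python
def normalized_improvement(rho_emet, rho_intergenic):
    """Improvement of eMET over the intergenic model, scaled by the max absolute correlation."""
    denom = max(abs(rho_emet), abs(rho_intergenic))
    if denom == 0:
        return 0.0  # assumption: no improvement is defined when both correlations are zero
    return (rho_emet - rho_intergenic) / denom

# Hypothetical correlations for one feature category and element type.
score = normalized_improvement(0.60, 0.45)
```

The score is bounded between -2 and 2, and is positive whenever eMET correlates better with the observed rates than the intergenic model.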

We also assessed the variable importance to determine the contribution of each feature to the model’s performance in predicting BMR. Importance was defined by the number of times a feature was included in a tree across all trees in the XGBoost model (importance_type = ‘weight’ in Python XGBoost). We then assessed the enrichment of tissue-specific features within the top 50 most important features identified by the intergenic XGBoost model, which consisted of a total of 1372 features. This evaluation was conducted using the hypergeometric test across various cancer types.
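The enrichment test can be illustrated with an exact upper-tail hypergeometric probability. The sketch uses only the standard library; the function name and the example counts (60 tissue-specific features, 8 of them in the top 50) are hypothetical and not taken from the study.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_total, n_tissue, n_top, n_overlap):
    """P(X >= n_overlap) when n_top features are drawn from n_total, n_tissue of which are tissue-specific."""
    upper = min(n_top, n_tissue)
    numerator = sum(
        comb(n_tissue, i) * comb(n_total - n_tissue, n_top - i)
        for i in range(n_overlap, upper + 1)
    )
    return numerator / comb(n_total, n_top)

# Hypothetical example: 1372 features, 60 tissue-specific, 8 of them among the top 50.
p_enrichment = hypergeom_enrichment_pvalue(1372, 60, 50, 8)
```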

Validation method and evaluation metrics

Spearman correlation, MSE, and mean absolute error (MAE) were used as the primary evaluation metrics to compare observed versus predicted mutation rates on the test data. We used 10 repeated train-test splits to assess the performance of different algorithms in various settings. For training, 80% of the intergenic intervals were randomly selected. For models trained with variable-size intergenic intervals, the remaining 20% were used for the validation set. When training with fixed-size intergenic intervals, variable-size intervals overlapping with these 20% fixed intervals were chosen for validation. We conducted one-sided t-tests to determine the significance of differences between each pair of models across various settings. For each pair of models, we calculated the p-value for the correlation between observed and predicted BMRs in 10 repeats of the validation set, testing whether the correlation in the first model was greater than that in the second model. We then adjusted the resulting p-values using the Benjamini-Hochberg procedure. To evaluate the effectiveness of the models in predicting mutation rates across different functional genomic intervals, we trained using the full set of intergenic intervals and tested the models on PCAWG elements.
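The three evaluation metrics can be sketched in pure Python as follows. This is an illustrative implementation with average ranks for ties; the function names and toy values are assumptions, and a statistics library would normally be used instead.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mse(obs, pred):
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)

def mae(obs, pred):
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

# Toy observed and predicted rates (arbitrary units).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.0, 3.5, 5.0]
metrics = (spearman(observed, predicted), mse(observed, predicted), mae(observed, predicted))
```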

Statistical testing for driver discovery

We used the binomial distribution to calculate the p-value for the significance of elements as driver candidates13. Specifically, we assumed that the number of mutations follows $Y_i \sim \mathrm{Binom}(n_i, p_i)$, where $n_i$ is the total number of possible mutations that can occur in element $i$ in the cohort, defined as $n_i = N \cdot l_i$. As previously defined in the “Features and response table” subsection, $N$ is the total number of donors in the cohort and $l_i$ is the length of element $i$. The parameter $p_i$ is the probability of observing a mutation in element $i$ in the cohort under study, defined as $p_i = \hat{Y}_i / (N \cdot l_i)$, where $\hat{Y}_i$ is the estimate of the expected number of mutations using eMET or other models. The p-value was defined as the probability of observing at least $y_i$ mutations in element $i$ in the cohort, $P(Y_i \ge y_i)$. P-values were corrected for multiple testing via the Benjamini–Hochberg procedure and elements with q-values < 0.05 were reported as significant driver candidates14,58.
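The burden test and the multiple-testing correction can be sketched with the standard library alone. The upper-tail sum in log space is one simple way to evaluate the binomial tail for large n and is an assumption of this example, as are the function names; a statistics package would typically provide the binomial survival function directly.

```python
import math

def binom_log_pmf(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_upper_tail(y, n, p):
    """P(Y >= y) for Y ~ Binom(n, p)."""
    if y <= 0 or p >= 1:
        return 1.0
    if p <= 0:
        return 0.0
    if y <= n * p:
        # Below the mean the complement is the shorter sum.
        lower = sum(math.exp(binom_log_pmf(k, n, p)) for k in range(y))
        return max(0.0, 1.0 - lower)
    total = 0.0
    for k in range(y, n + 1):
        term = math.exp(binom_log_pmf(k, n, p))
        total += term
        if term <= total * 1e-17:  # terms decrease above the mean, so stop when negligible
            break
    return min(1.0, total)

def driver_pvalue(y_obs, y_expected, length, n_donors):
    """Burden p-value with n = N * l and p = expected count / (N * l)."""
    n = n_donors * length
    return binom_upper_tail(y_obs, n, y_expected / n)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical element: 1000 bases, 100 donors, 2 expected and 10 observed mutations.
p_element = driver_pvalue(y_obs=10, y_expected=2.0, length=1000, n_donors=100)
q_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Elements whose q-value falls below 0.05 would be reported as candidates under the threshold stated above.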

Definition of the gold standard and benchmarking the models for cancer drivers

To evaluate the performance of the eMET and other BMR models in identifying cancer drivers, we used the OncoKB cancer gene list as the gold standard. For CDSs and non-coding elements, significant hits (q-value < 0.05) in the OncoKB cancer genes were considered true positive hits. For significant enhancers, if any target genes of an enhancer were in the list of OncoKB cancer genes, the identified enhancer was considered a true positive. Models were compared using several metrics: precision (TP/[TP + FP]), recall (TP/[TP + FN]), F1 (2*precision*recall/(precision + recall)), area under the ROC curve (AUC), and area under the precision-recall curve (AUPR), where TP represents true positives, FN represents false negatives (due to the lack of a comprehensive list of true negative cancer genes, we treated genes not listed in OncoKB as FN), and FP represents false positives. The precision metric indicates the fraction of significant hits that are true positives, with higher precision indicating fewer false positives. Recall measures the fraction of true positive hits among all actual positives, with higher recall reflecting better detection of true positives, thus introducing more driver candidates. The F1 score, the harmonic mean of precision and recall, balances type I error (false positives) and the method’s statistical power. AUC assesses the model’s ability to distinguish drivers from non-drivers, while AUPR, focusing on precision and recall, is more informative for imbalanced datasets where true negatives may dominate.
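The three threshold-based metrics follow directly from the counts. The sketch below uses hypothetical counts and a hypothetical function name; it only restates the formulas given above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```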

Benchmarking of eMET with existing methods

We compared eMET to three published tools, namely ActiveDriverWGS14, Dig15, and DriverPower13, along with the previously defined baseline for the BMR prediction task. All tools were based on mutational excess and were designed to work with both coding and non-coding regions.

ActiveDriverWGS was run with default settings without the optional argument ‘active sites’. This tool estimates the number of mutated samples rather than mutation counts for each genomic element. We used these estimates to calculate BMR. When using DriverPower, we did not include non-mutated intergenic intervals for model training. For Dig, we used the ‘Interval driver model’ with its pretrained mutational maps.

Association of candidate driver mutations with mRNA abundance

To investigate whether mutations in candidate driver elements are associated with RNA expression, we used RNA-seq data, represented as log1p-transformed FPKM-UQ values, alongside consensus gene-level somatic copy number alterations (CNAs) from PCAWG donors. We compared RNA expression between mutated and non-mutated samples for all significant candidate drivers that had at least three samples with both RNA-seq and CNA data. For each candidate driver in a given cohort, we employed a likelihood ratio test using ANOVA on quasi-Poisson family generalized linear models to predict FPKM-UQ based on mutational status (Mut) and CNAs, with or without including the mutational status as a covariate. For the pan-cancer cohort, we additionally included a tissue covariate to account for cancer-type effects on gene expression (Eq. 1):

$$\text{FPKM-UQ} \sim \text{Mut} + \text{CN} + [\text{Tissue}] \tag{1}$$

Here, Mut denotes the binary variable indicating mutational status, CN represents the somatic copy number, and Tissue is a categorical variable indicating the tumor tissue type for the given candidate element in the pan-cancer cohort.

Supplementary information

Supplementary Data 1 (104KB, xlsx)
Supplementary Data 2 (9.5KB, xlsx)
Supplementary Data 3 (50.6KB, xlsx)

Acknowledgements

This work is based upon research funded by Iran National Science Foundation (INSF) under project No. 4022073. Our study was partly supported by the Pasteur Institute of Iran (No. ZS-9604 to F.B.). The student research committee of the Pasteur Institute of Iran partly supported the project financially (to F.B.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Dr. Charlotte K.Y. Ng for her valuable discussions during the initial phase of the study and Dr. Ehsan Zangene for the graphical design of Fig. 1. Development of eMET was performed at the Department of Bioinformatics, University of Tehran.

Author contributions

F.B., R.A.C., and H.M. conceived the presented idea and planned the project. F.B. and H.M. developed the methodology. F.B. implemented the method in R and Python and conducted all the experiments and visualizations. H.M. supervised the study and verified the findings of this work. F.B. and H.M. wrote the manuscript. F.B., R.A.C., and H.M. discussed the results and contributed to the final version of the manuscript. All authors agreed to the final version of the manuscript.

Data availability

The datasets used in this study are publicly available and are listed in Supplementary Data 1 (URLs of the features) and Supplementary Table 5. Because of the data access policies of ICGC and TCGA, we applied to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) (Project ID: 32607, accession: phs000178) to access the TCGA portion of the somatic SNV/indel data.

Code availability

All code required to run eMET is available on GitHub (https://github.com/FaridehBahari/eMET).

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Reza Ahangari Cohan, Email: cohan.reza@gmail.com.

Hesam Montazeri, Email: hesam.montazeri@ut.ac.ir.

Supplementary information

The online version contains supplementary material available at 10.1038/s41698-025-00871-3.

References

  • 1.Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature458, 719–724 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat. Rev. Cancer20, 555–572 (2020). [DOI] [PubMed] [Google Scholar]
  • 3.Ostroverkhova, D., Przytycka, T. M. & Panchenko, A. R. Cancer driver mutations: predictions and reality. Trends Mol. Med.29, 554–566 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Elliott, K. & Larsson, E. Non-coding driver mutations in human cancer. Nat. Rev. Cancer21, 500–509 (2021). [DOI] [PubMed] [Google Scholar]
  • 5.ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature578, 82–93 (2020).32025007 [Google Scholar]
  • 6.Polak, P. et al. Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature518, 360–364 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Supek, F. & Lehner, B. Scales and mechanisms of somatic mutation rate variation across the human genome. DNA Repair81, 102647 (2019). [DOI] [PubMed] [Google Scholar]
  • 8.Koren, A. et al. Differential relationship of DNA replication timing to different forms of human mutation and variation. Am. J. Hum. Genet.91, 1033–1040 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Rheinbay, E. et al. Analyses of non-coding somatic drivers in 2658 cancer whole genomes. Nature578, 102–111 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell173, 1823 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Greenman, C. et al. Patterns of somatic mutation in human cancer genomes. Nature446, 153–158 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Liu, E. M. et al. Identification of Cancer Drivers at CTCF Insulators in 1,962 Whole Genomes. Cell Syst.8, 446–455.e8 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Shuai, S., PCAWG Drivers and Functional Interpretation Working Group, Gallinger, S., Stein, L. D. & PCAWG Consortium. Combined burden and functional impact tests for cancer driver discovery using DriverPower. Nat. Commun.11, 734 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Zhu, H. et al. Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks. Mol. Cell77, 1307–1321.e10 (2020). [DOI] [PubMed] [Google Scholar]
  • 15.Sherman, M. A. et al. Genome-wide mapping of somatic mutation rates uncovers drivers of cancer. Nat. Biotechnol.40, 1634–1643 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Guo, Y. A., Chang, M. M. & Skanderup, A. J. MutSpot: detection of non-coding mutation hotspots in cancer genomes. npj Genom. Med.5, 1–5 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Lohr, J. G. et al. Discovery and prioritization of somatic mutations in diffuse large B-cell lymphoma (DLBCL) by whole-exome sequencing. Proc. Natl Acad. Sci. Usa.109, 3879–3884 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Hnisz, D. et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science351, 1454–1458 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature532, 264–267 (2016). [DOI] [PubMed] [Google Scholar]
  • 20. Tomkova, M. et al. Dr.Nod: computational framework for discovery of regulatory non-coding drivers in tissue-matched distal regulatory elements. Nucleic Acids Res. 51, e23 (2023).
  • 21. Lanzós, A. et al. Discovery of Cancer Driver Long Noncoding RNAs across 1112 Tumour Genomes: New Candidates and Distinguishing Features. Sci. Rep. 7, 1–16 (2017).
  • 22. Schuster-Böckler, B. & Lehner, B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature 488, 504–507 (2012).
  • 23. Akdemir, K. C. et al. Somatic mutation distributions in cancer genomes vary with three-dimensional chromatin structure. Nat. Genet. 52, 1178–1188 (2020).
  • 24. Ponting, C. P. Human genetics seen through an evolutionary lens. Cell Genom. 3, 100323 (2023).
  • 25. Christmas, M. J. et al. Evolutionary constraint and innovation across hundreds of placental mammals. Science 380, eabn3943 (2023).
  • 26. Minkin, I. & Salzberg, S. L. Conservation assessment of human splice site annotation based on a 470-genome alignment. bioRxiv 2023.12.01.569581, 10.1101/2023.12.01.569581 (2024).
  • 27. Clamp, M. et al. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl Acad. Sci. 104, 19428–19433 (2007).
  • 28. Bera, T. K. et al. POTE paralogs are induced and differentially expressed in many cancers. Cancer Res. 66, 52–56 (2006).
  • 29. Shen, Z. et al. POTEE drives colorectal cancer development via regulating SPHK1/p65 signaling. Cell Death Dis. 10, 863 (2019).
  • 30. Wang, Q. et al. Delta and Notch-like epidermal growth factor-related receptor suppresses human glioma growth by inhibiting oncogene TOR4A. J. Cancer Res. Ther. 18, 1372–1379 (2022).
  • 31. Abboud-Jarrous, G. et al. Protein S drives oral squamous cell carcinoma tumorigenicity through regulation of AXL. Oncotarget 8, 13986–14002 (2017).
  • 32. Dressler, L. et al. Comparative assessment of genes driving cancer and somatic evolution in non-cancer tissues: an update of the Network of Cancer Genes (NCG) resource. Genome Biol. 23, 1–22 (2022).
  • 33. Pokrovsky, V. S., Qoura, L. A., Demidova, E. A., Han, Q. & Hoffman, R. M. Targeting Methionine Addiction of Cancer Cells with Methioninase. Biochemistry 88, 944–952 (2023).
  • 34. Zou, Y., Zhang, J., Zhang, L. & Yan, X. Interferon-induced protein 16 expression in colorectal cancer and its correlation with proliferation and immune signature markers. Oncol. Lett. 22, 687 (2021).
  • 35. Zhang, Y. et al. IFI16 inhibits tumorigenicity and cell proliferation of bone and cartilage tumor cells. Front. Biosci. 12, 4855–4863 (2007).
  • 36. Duan, X., Ponomareva, L., Veeranki, S. & Choubey, D. IFI16 induction by glucose restriction in human fibroblasts contributes to autophagy through activation of the ATM/AMPK/p53 pathway. PLoS One 6, e19532 (2011).
  • 37. Dong, X.-Y. et al. Identification of two novel CT antigens and their capacity to elicit antibody response in hepatocellular carcinoma patients. Br. J. Cancer 89, 291–297 (2003).
  • 38. Simon, P. et al. Functional TCR retrieval from single antigen-specific human T cells reveals multiple novel epitopes. Cancer Immunol. Res. 2, 1230–1244 (2014).
  • 39. Zainodini, N., Abolhasani, M., Mohsenzadegan, M., Farajollahi, M. M. & Rismani, E. Overexpression of Transmembrane Phosphatase with Tensin homology (TPTE) in prostate cancer is clinically significant, suggesting its potential as a valuable biomarker. J. Cancer Res. Clin. Oncol. 150, 165 (2024).
  • 40. Li, M. et al. Olfactory receptor 5B21 drives breast cancer metastasis. iScience 24, 103519 (2021).
  • 41. Kalbe, B. et al. Helional-induced activation of human olfactory receptor 2J3 promotes apoptosis and inhibits proliferation in a non-small-cell lung cancer cell line. Eur. J. Cell Biol. 96, 34–46 (2017).
  • 42. Maßberg, D. et al. The activation of OR51E1 causes growth suppression of human prostate cancer cells. Oncotarget 7, 48231–48249 (2016).
  • 43. Ranzani, M. et al. Revisiting olfactory receptors as putative drivers of cancer. Wellcome Open Res. 2, 9 (2017).
  • 44. Chen, J. et al. ABI2-mediated MEOX2/KLF4-NANOG axis promotes liver cancer stem cell and drives tumour recurrence. Liver Int. 42, 2562–2576 (2022).
  • 45. Lv, L. et al. Inhibition of ABI2 ubiquitination-dependent degradation suppresses TNBC cell growth via down-regulating PI3K/Akt signaling pathway. Cancer Cell Int. 24, 1–17 (2024).
  • 46. Wang, M. et al. Fibrinogen Alpha Chain Knockout Promotes Tumor Growth and Metastasis through Integrin-AKT Signaling Pathway in Lung Cancer. Mol. Cancer Res. 18, 943–954 (2020).
  • 47. Garvin, A. J. et al. The deSUMOylase SENP2 coordinates homologous recombination and nonhomologous end joining by independent mechanisms. Genes Dev. 33, 333–347 (2019).
  • 48. Sun, L. et al. SENP2 promotes ESCC proliferation through SETDB1 deSUMOylation and enhanced fatty acid metabolism. Heliyon 10, e34010 (2024).
  • 49. Rosenbloom, K. R. et al. ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 41, D56–D63 (2013).
  • 50. Rosenbloom, K. R. et al. The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 43, D670–D681 (2015).
  • 51. Chen, T. & Guestrin, C. XGBoost. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, NY, USA, 2016). 10.1145/2939672.2939785.
  • 52. Abadi, M. et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. 10.48550/ARXIV.1603.04467 (2016).
  • 53. Raschka, S., Patterson, J. & Nolet, C. Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. Information 11, 193 (2020).
  • 54. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014).
  • 55. Sondka, Z. et al. COSMIC: a curated database of somatic variants and clinical data for cancer. Nucleic Acids Res. 52, D1210–D1217 (2023).
  • 56. Suehnholz, S. P. et al. Quantifying the Expanding Landscape of Clinical Actionability for Patients with Cancer. Cancer Discov. 14, 49–65 (2024).
  • 57. Chakravarty, D. et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precis. Oncol. 1, 1–16 (2017).
  • 58. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300 (1995).

Associated Data


Supplementary Materials

Supplementary Data 1 (104KB, xlsx)
Supplementary Data 2 (9.5KB, xlsx)
Supplementary Data 3 (50.6KB, xlsx)

Data Availability Statement

The datasets used in this study are publicly available and are listed in Supplementary Data 1 (URLs of the features) and Supplementary Table 5. Because of the data access policies of ICGC and TCGA, we applied to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) (Project ID: 32607, accession: phs000178) for access to the TCGA portion of the somatic SNV/indel data.

All code needed to run eMET is available on GitHub (https://github.com/FaridehBahari/eMET).


Articles from NPJ Precision Oncology are provided here courtesy of Nature Publishing Group
